imlanal


SYNOPSIS

imlanal --mutation inserted_mutation --wtseqdoc path --wtseqfmt seqio_format --readnameconvention attributelist=expression --tagattribute attribute=valueform --outseqpath filepath --cloneattribute attributelist --revcompif attribute=pattern --threshold 97 --alignoutformat fmt < seqread > gff_results



DESCRIPTION

Determine the locations of mutation events in a (screened) insertion mutation library (iml) by analyzing selected sequence reads.

Given the wild-type sequence for a gene, a description of an insertion mutation, and the DNA sequence reads of a clone library being screened for insertion mutations of the gene, recover the location of the mutation on each clone and, if one is present, map it to the corresponding location on the wild-type (by way of an alignment). Two files are produced as output: 1) a GenBank file holding (a copy of) the wild-type sequence along with features corresponding to the recovered and mapped mutation locations; 2) a GFF file describing only the features.

A diagnostic message is included for any clone that fails to yield a mutation location.

Read the pseudo-code in the METHOD section for more details.



OPTIONS

Option names are case-insensitive and may be abbreviated to uniqueness (e.g. --v instead of --verbose).

Options may be specified on the command line, and may also be read from files by giving on the command line the path to the file preceded by an '@'. These option files provide simple access to typical calling scenarios (such as an analysis that is repeatedly invoked from the command line with the same parameters). Additionally, if the current directory contains a file named imlanal.config, it is automatically used as an option file.

--mutation inserted_mutation
The sequence inserted to form the mutation.

--flanksize base_count
The number of bases from the wild-type sequence that are duplicated in the clone after the insertion of the mutation.

--wtseqdoc
A file containing a 'reference sequence' of the wild-type (not mutated) form of the sequence against which all seqreads will be aligned.

--wtseqfmt fasta|genbank|scf|raw|ace|bsml|game|gcg|pir|swiss
The format of the sequence file named by --wtseqdoc. All Bio::SeqIO formats are accepted. Default is fasta.

--informat fasta|genbank|scf|raw|ace|bsml|game|gcg|pir|swiss|phd
The format of the sequence reads appearing on standard input. Default is fasta.

--outseqpath pathname
Location of the resulting annotated reference sequence, which carries all original features plus the new features gained by running this tool. The file is produced in GenBank format.

Defaults to the --wtseqdoc filename extended with this script's name followed by '.gb' (i.e. by adding .imlanal.gb).

--readnameconvention attributelist=expression
A method of decoding attributes of the kind typically encoded in the names of files holding seqreads.

Implemented as an association between a comma-delimited list of attribute names and a perl expression which, when evaluated, yields the corresponding value for each attribute. The expression is evaluated in the context of $_ being the name of the file holding a seqread.

For example:

--readnameconvention='gene,library,plate,primer,platewellcoord=split /[-_.]/'

when reading the file: VPL3-15-I-BSB460_A01.seq

will assign to the current read attributes as follows:

        gene => VPL3
        library => 15
        plate => I
        primer => BSB460
        platewellcoord => A01
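
The decoding above can be mimicked outside the tool. The following is a minimal Python sketch, not imlanal's own Perl code; the function name decode_read_name and its parameters are hypothetical. It assumes that surplus fields (here the trailing 'seq' extension) are simply dropped, as in a Perl list assignment.

```python
import re

def decode_read_name(filename, attribute_names, pattern=r"[-_.]"):
    """Split a read file name on the delimiter pattern and pair the
    pieces with the attribute names, in order. Surplus pieces are ignored."""
    values = re.split(pattern, filename)
    return dict(zip(attribute_names, values))

names = ["gene", "library", "plate", "primer", "platewellcoord"]
attrs = decode_read_name("VPL3-15-I-BSB460_A01.seq", names)
print(attrs)
```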

Named aliases exist as shorthand methods for decoding commonly used naming conventions. The following are predefined:

(TODO: implement aliases. This feature may not be implemented yet.)

simr1
Short for: '(gene,library,plate,primer,platewellcoord)=split /[-_.]/'

--cloneattribute string
A comma-delimited list of seqread attributes which together uniquely identify the clone from which the read was made.

This is used to allow tracking whether a mutation has already been identified for the clone of the current read, allowing imlanal to skip processing such reads (in the name of efficiency).

For the simr1 naming convention, this should name all the attributes except for the primer.
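
As an illustration of the bookkeeping described above, here is a minimal Python sketch (imlanal itself is written in Perl). The names clone_key, select_reads and the has_mutation callback are hypothetical stand-ins for whatever the tool does internally.

```python
def clone_key(attrs, clone_attribute_names):
    """Build a hashable key from the attributes that identify a clone."""
    return tuple(attrs[name] for name in clone_attribute_names)

def select_reads(reads, clone_attribute_names, has_mutation):
    """Return the reads that would actually be processed.

    A read is skipped once a mutation has been found for its clone.
    has_mutation is a caller-supplied function standing in for the analysis."""
    solved = set()
    processed = []
    for attrs in reads:
        key = clone_key(attrs, clone_attribute_names)
        if key in solved:
            continue  # a mutation was already recovered for this clone
        processed.append(attrs)
        if has_mutation(attrs):
            solved.add(key)
    return processed

reads = [
    {"gene": "VPL3", "library": "15", "plate": "I", "primer": "BSB460", "platewellcoord": "A01"},
    {"gene": "VPL3", "library": "15", "plate": "I", "primer": "BSB458", "platewellcoord": "A01"},
    {"gene": "VPL3", "library": "15", "plate": "I", "primer": "BSB460", "platewellcoord": "A02"},
]
clone_names = ["gene", "library", "plate", "platewellcoord"]
kept = select_reads(reads, clone_names, has_mutation=lambda attrs: True)
```

In this example the second read belongs to the same clone as the first (only the primer differs), so it is skipped.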

--alignoutformat bl2seq|clustalw|emboss|fasta|mase|mega|meme|msf|nexus|pfam|phylip|prodom|psi|selex|stockholm
Optional format in which a trace of alignments should be dumped.

If supplied, the alignment between each seqread and the wild-type reference sequence is printed, as a trace, in the named format to the file whose name is gained by appending '.aln.<alignoutformat>' to the --outseqpath.

Note: The list of formats is that provided by the Bio::AlignIO module.

--threshold percentage identity
An optional threshold which, if supplied, must be exceeded by the average percentage identity of the alignment (as computed by Bio::Alignment) for the alignment to be used.
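
To make the test concrete, here is a minimal Python sketch of such a threshold check. It is an assumption-laden illustration: it computes a simple percent identity over columns where neither sequence has a gap, which is not necessarily how BioPerl computes average percentage identity, and the function names are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of gap-free alignment columns whose residues are identical."""
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    same = sum(1 for x, y in pairs if x.upper() == y.upper())
    return 100.0 * same / len(pairs)

def passes_threshold(aligned_a, aligned_b, threshold=None):
    """With no threshold every alignment is used; otherwise the
    identity must strictly exceed the threshold."""
    if threshold is None:
        return True
    return percent_identity(aligned_a, aligned_b) > threshold

identity = percent_identity("ACGT", "ACGA")
```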

--revcompif attribute=expression
Method for declaring which sequence reads should be reverse complemented prior to alignment. Implemented as one or more optional expressions to be evaluated in the (perl) context of $_ being set to the correspondingly named attribute (as set by the readnameconvention).

For example:

--revcompif primer=m/BSB460|BSB458/
will cause those reads whose 'primer' attribute matches either BSB460 or BSB458 to be reverse complemented.

--revcompif primer=m/_R$/
will cause those reads whose primer attribute ends in '_R' to be reverse complemented.
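
The same rules can be sketched in Python for illustration (the tool evaluates Perl expressions; this is not its implementation). The rules dictionary, mapping an attribute name to a regular expression, and the function names are hypothetical.

```python
import re

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def maybe_reverse_complement(seq, attrs, rules):
    """Reverse complement seq if any rule's pattern matches the
    correspondingly named read attribute."""
    for name, pattern in rules.items():
        if re.search(pattern, attrs.get(name, "")):
            return reverse_complement(seq)
    return seq

by_name = {"primer": r"BSB460|BSB458"}
by_suffix = {"primer": r"_R$"}
result = maybe_reverse_complement("AACG", {"primer": "BSB460"}, by_name)
```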

--tagattribute attribute=expression
Method of defining additional attributes to assign to the insertion mutation location features in the output.

The value of the attribute is the result of evaluating the expression in the context of $_ being set to a reference to a hash of other current read attributes (such as established using readnameconvention).
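
A rough Python analogue of this mechanism follows. It is only a sketch: in imlanal the expression is Perl and receives a hash reference in $_, whereas here each rule is a Python function receiving a dictionary. The tag name 'clone' and the function apply_tag_attributes are hypothetical.

```python
def apply_tag_attributes(read_attrs, tag_rules):
    """Evaluate each rule against the current read attributes and
    return the extra attributes to attach to the output feature."""
    return {name: rule(read_attrs) for name, rule in tag_rules.items()}

read_attrs = {"gene": "VPL3", "library": "15", "plate": "I",
              "primer": "BSB460", "platewellcoord": "A01"}
tag_rules = {
    # hypothetical tag combining three read attributes into one label
    "clone": lambda a: "-".join([a["library"], a["plate"], a["platewellcoord"]]),
}
tags = apply_tag_attributes(read_attrs, tag_rules)
```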

--help
Display command line usage with options.

--man
Display complete manual page and exit.

--verbose
Provides a trace of processing on STDERR.



EXAMPLE

Assuming your account is configured so that in your .tcshrc you source imlanal.tcshrc,

./imlanal.tcshrc

Further assuming that you wish to use the aliases established there (which are especially useful to Vlad), and so want to use the mgs1 configuration options,

./config/mgs1.imlanal

then, change to a test directory (recursively) holding sequence reads:

$ cd /home/mec/src/imlanal/t/data/VAP/VPL3/data-VPL3_15_I/ (http://helix/~mec/src/imlanal/t/data/VAP/VPL3/data-VPL3_15_I/)

and analyze all the sequence read files below it:

$ imlmgslib

OR



INTEGRATION / ENVIRONMENT

The GFF files can be opened using Vector NTI version 8 under Windows. Its implementation of reading GFF files ignores all but the label attribute (I have filed a bug report / feature request with them on this topic).

The GenBank file may also be viewed in Vector NTI (any version), and will show ALL the feature labels generated.



VERSION

Not yet under version control.



CONTACT

Malcolm Cook mec@stowers-institute.org



DEPENDENCIES

perl, BioPerl, clustalw, argvFile



CGI OPERATION

If this script is invoked as a CGI program, it produces HTML to document itself.

It detects that it is running as a CGI by looking for '.cgi' as an extension on the name of the running script. Thus the script should be installed without an extension, and a symbolic link to it should be created with the same name plus a .cgi extension.
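
The detection rule amounts to a check on the script's file extension. A minimal Python sketch of that rule follows (the tool itself is Perl; the function name invoked_as_cgi is hypothetical):

```python
import os

def invoked_as_cgi(script_path):
    """True when the running script's name ends in the .cgi extension."""
    return os.path.splitext(script_path)[1] == ".cgi"

print(invoked_as_cgi("/usr/local/bin/imlanal.cgi"))
print(invoked_as_cgi("/usr/local/bin/imlanal"))
```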



AVAILABILITY

Email the author for sources.




METHOD

For each sequence read in the library:

 > Skip it if the read belongs to a clone for which a mutation location has already been recovered
 > Reverse complement it if needed (i.e. it is a reverse read)
 > Identify the number and location of subsequences matching the insertion mutation
 > Filter out any such subsequences whose flanking sequences do NOT match
 > Recover the wild-type clone sequence by splicing out the inserted sequence and one of the duplicated flanking sequences
 > Align the recovered clone to the actual wild-type
 > Skip it if the alignment is poor
 > Use the alignment to map the insertion location on the clone to the wild-type
 > Create the necessary output data structures (GFF and GenBank features)
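
The core of the steps above (finding the insertion, checking the flanks, splicing, and mapping through an alignment) can be sketched in Python. This is an illustration under stated assumptions, not imlanal's Perl implementation: it assumes the clone reads as FLANK + INSERT + FLANK around the insertion point, uses exact string matching, and takes an already computed gapped alignment as input. All function names are hypothetical.

```python
def find_insertions(read, insert, flank_size):
    """Return start offsets of insert copies whose flank_size bases
    before and after the insert are identical (the duplicated flank)."""
    hits = []
    start = read.find(insert)
    while start != -1:
        end = start + len(insert)
        left = read[max(0, start - flank_size):start]
        right = read[end:end + flank_size]
        if len(left) == flank_size and left == right:
            hits.append(start)
        start = read.find(insert, start + 1)
    return hits

def splice_out(read, insert, flank_size, start):
    """Remove the insert and the flank copy that follows it."""
    return read[:start] + read[start + len(insert) + flank_size:]

def map_offset(clone_aln, wt_aln, clone_offset):
    """Map an offset in the ungapped clone to an offset in the ungapped
    wild-type, given the two gapped ('-') rows of an alignment."""
    clone_pos = wt_pos = 0
    for c, w in zip(clone_aln, wt_aln):
        if c != "-" and clone_pos == clone_offset:
            return wt_pos
        if c != "-":
            clone_pos += 1
        if w != "-":
            wt_pos += 1
    return None  # offset lies beyond the aligned clone

wild_type = "AAACCCGGGTTTACGT"
insert = "TGCGGCCGCA"  # made-up insert sequence for the example
flank_size = 5
# Build a clone with the insert after base 8 and a 5-base duplicated flank.
clone = wild_type[:8] + insert + wild_type[3:]

hits = find_insertions(clone, insert, flank_size)
recovered = splice_out(clone, insert, flank_size, hits[0])
mapped = map_offset("--ACGT", "TTACGT", 2)
```

Here the recovered sequence equals the wild-type, and the insertion offset on the clone is the single entry in hits.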



REFERENCES

BioPerl - http://www.bioperl.org/
GFF: an Exchange Format for Feature Description - http://www.sanger.ac.uk/Software/formats/GFF/
Finnzymes Mutation Generation System MGS - http://www.finnzymes.fi/products/new_products/mgs_mutation_generation_system.htm
The DDBJ/EMBL/GenBank Feature Table Definition - http://www.ncbi.nlm.nih.gov/projects/collab/FT/index.html



TODO

sw engineering
put it under source code control

create a distribution & installation package

release it
to the world (i.e. BioPerl community)

allow other alignment tools than clustalw
(i.e. TCOFFEE, etc)

submit patch to clustalw.pm
to redirect stdout output to stderr unless quiet (it goes to STDOUT!!!).

method of finding insertion mutation
grok biology better - should we really use BioPerl's restriction enzyme support, or rather a seq pattern (or, even more simply, a regular expression)?

statistics
Be able to answer questions like
 > Is the location of insertion uniformly distributed across the sequence?
 > Provided a partitioning of the sequence, is the partitioning predictive of the distribution of insertions?



ROADS NOT TAKEN

trim sequence reads for quality (phred?) and/or vector contamination (crossmatch?)

assemble (using phrap?) reads from the same clone together prior to alignment

align all reads against wildtype in one fell swoop - subsequently ensure that reads from the same clone

