=pod
=head1 SYNOPSIS
C<< imlanal --mutation inserted_mutation --wtseqdoc path --wtseqfmt seqio_format --readnameconvention attributelist=expression --tagattribute attribute=valueform --outseqpath filepath --cloneattribute attributelist --revcompif attribute=pattern --threshold 97 --alignoutformat fmt < seqread > gff_results >>
=head1 DESCRIPTION
Determine the locations of mutation events in an insertion mutation
library (iml) by analyzing selected sequence reads.
Given the wild-type sequence for a gene, a description of an insertion
mutation, and the DNA sequence reads of clone library being screened
insertion mutations of the gene, recover the location on each clone of
the mutation and, if present, map it to the corresponding location on
the wild-type (by way of an assembly). Produce as output two files:
1) a genbank file holding (a copy of) the wild-type sequence along
with features corresponding to the recovered and mapped mutation
locations; 2) a GFF file describing only the features
Include diagnostic message for any clones failing to produce a mutation.
Read the psuedo-code in the L<"method"> for more details.
=head1 OPTIONS
Their case is ignored, and they may be abbreviated to uniqueness
(i.e. --v instead of --verbose).
Options may be specified on the command line, and may optionally also
be read from files by providing on the command line the path to the
file preceded by a '@'. Such option files provide simple access to
typical calling scenarios (such as an analysis that is repeatedly
invoked from the command line with the same parameters).
=over
=item B<--mutation> inserted_mutation
The sequence inserted in forming of the mutation.
=item B<--flanksize> base count
The number of bases from the wild-type sequence that are reduplicated
in the clone after the insertion of the mutation.
=item B<--wtseqdoc>
A file containing a 'reference sequence' of the wild-type (not
mutated) form of the sequence against which all seqreads will be
aligned.
=item B<--wtseqfmt> fasta|genbank|scf|raw|ace|fasta|bsml|game|gcg|genbank|pir|swiss
The format of sequence file named in wtseqdoc. All Bio::SeqIO
formats are admitted. Default is fasta.
=item B<--informat fasta|genbank|scf|raw|ace|fasta|bsml|game|gcg|genbank|pir|swiss|phd>
The sequence format of the sequences_reads appearing on standard input. Default is fasta.
=item B<--outseqpath> pathname
Require location of resulting annotated reference sequence with all original
features and new features gained by running this tool. File will be
produced in genbank format.
Defaults to a filename gained from adding to wtseqdoc new extensions
of this scripts name followed by ".gb" (i.e. by adding .imlanal.gb).
=item B<--readnameconvention> attributelist=expression
A method of decoding attributes such as are typically encoded in the
name of files holding seqreads.
Implemented as an association between a comma delimited list of
attribute names and a perl expression which, when evaluated yeilds
corresponding values for each attribute. The expression is evaluated
in the context of $_ being the filename holding a seqread.
For example:
C<--readnameconvention='gene,library,plate,primer,platewellcoord=split /[-_.]/'>
when reading the file: VPL3-15-I-BSB460_A01.seq
will assign to the current read attributes as follows:
gene => VPL3
library => 15
plate => I
primer => BSB460
platewellcoord => A01
Named aliases exist as shorthand methods for decoding commonly used
naming conventions. The following are predefined:
(TODO - IMPLEMENT ALIASES - THIS FEATURE IS NOT IMPLEMENTED!?!?!)
=over
=item B<simr1>
Short for: '(gene,library,plate,primer,platewellcoord)=split /[-_.]/'
=back
=item B<--cloneattribute> string
A comma delimited list of seqread attributes which distinctly identify
the corresponding clone from which the read was made.
This is used to allow tracking whether a mutation has already been
identified for the clone of the current read, allowing imlanal to skip
processing such reads (in the name of efficiency).
For the simr1 naming convention, this should name all the attributes
except for the primer.
=item B<--alignoutformat> bl2seq|clustalw|emboss|fasta|mase|mega|meme|msf|nexus|pfam|phylip|prodom|psi|selex|stockholm
Optional format in which a trace of alignments should be dumped.
If supplied, the alignment between each readseq and the wild-type
reference sequence are printed, as a trace, in the named format to the
file whose name is gained by appending ".aln.<alignoutformat>" to the
--outseqpath.
Note: The list of formats is that provided by Bio::AlignIO module.
=item B<--threshold> percentage identity
An optional threshold which, if supplied, must be exceeded by the
average percentage identity of the alignment for the alignment to be
used (as computed by Bio::Alignment.
=item B<--revcompif> attribute=expression
Method for declaring which sequence reads should be reverse
complemented prior to alignment. Implemented as one or more optional
expressions to be evaluated in the (perl) context of $_ being set to
the correspondingly named attribute (as set by the readnameconvention).
For example:
=over
=item C<--revcompif primer=m/BSB460|BSB458/>
will cause those reads whose 'primer' attribute matches either
BSB460 or BSB458 to be reverse complemented.
=item C<--revcompif primer=m/_R^/>
will cause those reads whose primer attribute ends in '_R' to be
reverse complemented.
=back
=item B<--tagattribute> attribute=expression
Method of defining additional attributes to assign to the output
resulting insertion mutation location features.
The value of the attribute is the result of evaluating the
expression in the context of $_ being set to a reference to a hash of
other current read attributes (such as established using readnameconvention).
=head1 EXAMPLE
Assuming your account is configured so that in your F<.tcshrc> you source imlanal.tcshrc,
=for html <A HREF="./imlanal.tcshrc">./imlanal.tcshrc</A>
Futher assuming that you wish to use the aliases established there
(that are useful especially to Vlad), and so want to use the mgs1 configuration options,
=for html <A HREF="./config/mgs1.imlanal">./config/mgs1.imlanal</A>
then, change to a test directory (recursively) holding sequence reads:
$ cd /home/mec/src/imlanal/t/data/VAP/VPL3/data-VPL3_15_I/ (L<http://helix/~mec/src/imlanal/t/data/VAP/VPL3/data-VPL3_15_I/>)
and analyze all the sequence read files below it:
$ imlmgslib
OR
=head1 INTEGRATION / ENVIRONMENT
The GFF files can be opened using vector NTI version 8 under windows.
The implementation of reading GFF files ignores all but the label
attribute (I have a bug report / feature request into them on this
topic).
The genbank may also be viewed in vector NTI (any version), and will
show ALL the feature labels generated.
=head1 VERSION
Not yet under version control.
=head1 CONTACT
Malcolm Cook (L<mec@stowers-institute.org>)
=head1 DEPENDENCIES
perl, BioPerl, clustalw
=head1 CGI OPERATION
If this script is invoked as a CGI program, it produces HTML to
document itself.
It detects that it is running as a CGI by looking for '.cgi' as an
extension on the name of the running script. Thus the script should
be installed without an extension, and a symbolic link to it should be
created with same name, only having .cgi as an extension.
=head1 AVAILABILITY
Run it on helix (out of ~mec/src).
Email the author for sources.
=for html ...or <A HREF="./imlanal">get the source now!</A>
=for html ...or <A HREF="./imlanal.html">see the htmlized source!</A>
=head1 METHOD
For each sequence read in the library:
> Skip it if the reads belongs to a clone for which a mutation
> location has already been recovered
> Reverse complement it if needed (i.e. it is a reverse read)
> Identify the number and location of subsequences matching the
> insertion mutation
> Filtering out any such subsequences whose flanking sequence do NOT
> match
> Recover the wild-type clone sequence by splicing out the inserted
> sequence and one of the reduplicated falnking sequences
> Align the recovered clone to the actual wild-type.
> Skip it if the alignment is poor
> Use the alignment to map the insertion location on the clone to the
> wild-type
> Create the necessary output data structures (GFF and Genbank
> features)
=head1 TODO
=over
=item sw engineering
put it under source code control,
create a distribution & installation package
=item release it
to the world (i.e. BioPerl community)
=item allow other alignment tools than clustalw
(i.e. TCOFFEE, etc)
=item submit patch to clustalw.pm
to redirect stdout output to stderr unless quiet (it goes to
STDOUT!!!).
=item method of finding insertion mutation
grok biology better - should we really use BioPerl's restriction
enzyme, or rather seq pattern (or, even more simply, regular
expression).
=item statistics
Be able to answer questions like
>Is the location of insertion uniformly distributed across the sequence?
> Provided a partitioning of the sequence, is the partitioning
> predictive of the distribution of insertions?
=head1 ROADS NOT TAKEN:
trim sequence reads for quality (phred?) and or vector contamination (crossmatch?)
assemble (using phrap?) reads from the same clone together prior to alignment
align all reads against wildtype in one fell swoop - subsequently ensure that reads from the same clone
=back
=cut
use strict;
use Pod::Usage;
use Getopt::ArgvFile qw(argvFile); #Allows options to be specified in
#files in addition to command line.
use Getopt::Long;
use File::Spec;
use FindBin;
use UNIVERSAL::require;
use Bio::Tools::RestrictionEnzyme;
use Bio::SeqIO;
use Bio::SeqFeature::Generic;
use Bio::Tools::Run::Alignment::Clustalw;
use Data::Dumper;
my $man = 0;
my $help = 0;
my $dupsallowed;
my $fofn;
my $verbose;
my $wtseqdoc;
my $wtseqfmt;
my $flanksize;
my $mutation;
my $outseqpath;
my $informat;
my @cloneattribute;
my $alignoutformat;
my $threshold;
my %readnameconvention;
my %revcompif = ();
my %tagattribute;
argvFile(
startupFilename => "$FindBin::Script.config",
current => 1,
default => 0,
home => 0,
);
GetOptions(
'help|?' => \$help,
'man' => \$man,
'verbose' => \$verbose,
'wtseqdoc:s' => \$wtseqdoc,
'wtseqfmt:s' => \$wtseqfmt,
'informat:s' => \$informat,
'flanksize:i' => \$flanksize,
'mutation:s' => \$mutation,
'outseqpath:s' => \$outseqpath,
'alignoutformat:s' => \$alignoutformat,
'threshold:i' => \$threshold,
'readnameconvention=s' => \%readnameconvention,
'cloneattribute:s' => \@cloneattribute,
'revcompif:s' => \%revcompif,
'tagattribute:s' => \%tagattribute,
) or pod2usage(2);
pod2usage(1) if $help;
pod2usage(-exitstatus => 0, -verbose => 2) if $man;
if ($FindBin::Script =~ m/.*\.cgi/) {
# If the script has been invoked as a CGI program, spew the manpage
# as HTML. CGI dectection is heuristic - it tests whether executing
# script has a .cgi extension (the script is presumably a symlink to
# the extensionless command line executable).
use CGI qw(:standard);
print header,`pod2html --backlink "top" --header $FindBin::RealScript`;
exit 0;
}
################################################################################
### Provide defaults and validate options
################################################################################
my @readnameconvention
# used only for validation of revcompifkey and cloneattribute;
= map {split ','} keys %readnameconvention;
my @revcompifkey = keys %revcompif;
my @invalidrevcompifkey = grep {my $att = $_; 0 == grep {$_ eq $att} @readnameconvention} @revcompifkey;
die "Illegal revcompif keys: @invalidrevcompifkey" if @invalidrevcompifkey;
@cloneattribute
#allow multiple attributes with single command options
= split ',' , join "," , @cloneattribute;
my @invalidcloneattribute
#validate them against members of readnameconvention:
= grep {my $att = $_; 0 == grep {$_ eq $att} @readnameconvention} @cloneattribute;
die "Illegal clone attribute: @invalidcloneattribute" if @invalidcloneattribute;
$wtseqdoc ||= '../wtseq.fasta';
$wtseqfmt ||= 'fasta';
$informat ||= 'fasta';
$outseqpath ||= "$wtseqdoc.$FindBin::Script.gb";
$flanksize ||= 5;
$mutation ||= 'NOTI--TGCGG^CCGCA';
$threshold ||= 0;
my $alignout;
$alignout = Bio::AlignIO->new(-file => ">$outseqpath.aln.$alignoutformat" , '-format' => $alignoutformat) if $alignoutformat;
my $factory = Bio::Tools::Run::Alignment::Clustalw->new(
'ktuple' => 2,
'matrix' => 'BLOSUM'
);
$factory->quiet(1) unless $verbose;
# Don't show clustal output (send it to >dev/null - in which case
my $renzyme = new Bio::Tools::RestrictionEnzyme( '-NAME' => $mutation,
'-MAKE' =>'custom'
);
my $seqin = Bio::SeqIO->newFh(-format => $informat);
my $wtseq = Bio::SeqIO->new(-file => $wtseqdoc,
-format => $wtseqfmt)->next_seq();
my ($readseq , $readseqid,
$cutlocations, $readposition, $left, $right, $recovered_readseq,
$aligned_wtseq, $aligned_recovered_readseq, $wtseq_insertlocation, $wtseq_newfeature,
$aln,$cloneid,
);
my $outseq = Bio::SeqIO->new(-format => 'GENBANK',
-file =>">$outseqpath",
);
my %attread;
my %seencloneid = ();
sub Bio::Tools::GFF::_gff2_string{
my ($gff, $feat) = @_;
my ($str,$score,$frame,$name,$strand);
if( $feat->can('score') ) {
$score = $feat->score();
}
$score = '.' unless defined $score;
if( $feat->can('frame') ) {
$frame = $feat->frame();
}
$frame = '.' unless defined $frame;
$strand = $feat->strand();
if(! $strand) {
$strand = ".";
} elsif( $strand == 1 ) {
$strand = '+';
} elsif ( $feat->strand == -1 ) {
$strand = '-';
}
if( $feat->can('seqname') ) {
$name = $feat->seqname();
$name ||= 'SEQ';
} else {
$name = 'SEQ';
}
$str = join("\t",
$name,
$feat->source_tag(),
$feat->primary_tag(),
$feat->start(),
$feat->end(),
$score,
$strand,
$frame);
my $valuestr;
if ($feat->all_tags){ $str .= "\t"; foreach my $tag ( $feat->all_tags ) {
my $valuestr; foreach my $value ( $feat->each_tag_value($tag) ) {
if ($value =~ /[^A-Za-z0-9_]/){
$value =~ s/\t/\\t/g; $value =~ s/\n/\\n/g; $value = '"' . $value . '" '} $value = "\"\"" unless defined $value; $valuestr .= $value . " "; }
$str .= "$tag $valuestr ; "; }
chop $str; chop $str }
return $str;
}
SEQUENCE: while (<$seqin>) {
$readseq = $_;
$readseqid = $readseq->display_id;
%attread = ();
while (my ($attributes, $valuesform) = each(%readnameconvention)) {
$_=$readseqid;
my @value = eval($valuesform);
if ($@) { die "Error determining read attributes: [$attributes] while evaluating: [$valuesform]...\n\t\t$@\n" };
my @att = split /,/ ,$attributes;
if (not $#att == $#value) {
warn "$readseqid SKIPPING! Number of values [$#value] does not comport with attributes [@att] ";
next SEQUENCE;
}
for (@att) {$attread{$_} = shift(@value)};
}
while (my ($attribute, $valueform) = each(%tagattribute)) {
$_=\%attread; $attread{$attribute} = eval($valueform);
if ($@) { die "Assigning attributes: [$attribute] while evaluating: [$valueform]...\n\t\t$@\n" };
}
if (@cloneattribute) {
$cloneid = join '_', map {$attread{$_}} @cloneattribute;
$seencloneid$seencloneid if (! exists($seencloneid{$cloneid})) {
undef $seencloneid{$cloneid};
}
if ($seencloneid{$cloneid}) {
print STDERR "$readseqid - SKIPPING! Already found clone [$cloneid] in read [", $seencloneid{$cloneid}, "]\n";
next SEQUENCE;
}
}
if (grep {my $fkey = $_;
$_ = $attread{$fkey};
my $test = $revcompif{$fkey};
eval $test;
}
map {$_} @revcompifkey
@revcompifkey ) {
$readseq = $readseq->revcom;
}
$cutlocations = $renzyme->cut_locations($readseq); print STDERR "$readseqid cutlocations: [@$cutlocations]\n" if $verbose;
$cutlocations = [grep {my $readposition = $_;
my ($left, $right) =
($readseq->subseq($readposition - $flanksize + 1, $readposition),
$readseq->subseq($readposition + $renzyme->seq->length + 1, $readposition + $renzyme->seq->length + $flanksize));
my $flankmatch = $left eq $right;
print STDERR $readseq->id . " - FILTERING recognized insertion point [$readposition] since flanking sequences don't match; left: [$left], right: [$right] \n" unless $flankmatch ;
$flankmatch
}
@$cutlocations];
if (1 != @$cutlocations) {
print STDERR "$readseqid - SKIPPING! Required exactly one cutlocation but found: [@$cutlocations]\n";
next SEQUENCE;
}
if (@cloneattribute) {
$seencloneid{$cloneid} = $readseqid;
}
$readposition = $cutlocations->[0];
($left, $right) =
($readseq->subseq($readposition - $flanksize + 1, $readposition),
$readseq->subseq($readposition + $renzyme->seq->length + 1, $readposition + $renzyme->seq->length + $flanksize));
$recovered_readseq
= Bio::Seq->new( -seq => ($readseq->subseq($readseq->start, $readposition) .
$readseq->subseq($readposition + $renzyme->seq->length + $flanksize, $readseq->end)),
-id => $readseq->id ."_unmutated",
);
$aln = $factory->align([$wtseq, $recovered_readseq]); $alignout->write_aln($aln) if $alignout;
if ($threshold >= $aln->average_percentage_identity()) {
print STDERR "$readseqid - SKIPPING! Threshold [",$threshold,"] > average_percentage_identity [", $aln->average_percentage_identity(),"]\n";
next SEQUENCE;
}
$aligned_wtseq = $aln->get_seq_by_pos(1); $aligned_recovered_readseq = $aln->get_seq_by_pos(2);
$wtseq_insertlocation = $aligned_wtseq->location_from_column($aligned_recovered_readseq->column_from_residue_number($readposition));
die "wtseq_insertlocation start <> end!!" unless $wtseq_insertlocation->start == $wtseq_insertlocation->end;
$wtseq_newfeature = new Bio::SeqFeature::Generic(
-start => $wtseq_insertlocation->start() - $flanksize + 1,
-end => $wtseq_insertlocation->end(),
-strand => 1,
-primary => 'misc_feature',
-tag => {
%attread,
},
-source => $FindBin::Script,
);
$wtseq_newfeature->seqname($wtseq->id()); $wtseq->add_SeqFeature($wtseq_newfeature); print $wtseq_newfeature->gff_string, "\n";
}
$outseq->write_seq($wtseq);
if (my @clones_without_mutations = grep {! defined $seencloneid{$_}} keys %seencloneid) {
print STDERR "\n\nWARNING: there were clones_without_mutations: @clones_without_mutations\n";
}
__DATA__