imlanal title

#!/usr/local/bin/perl  -w
=pod

=head1 SYNOPSIS

C<< imlanal --mutation inserted_mutation --wtseqdoc path --wtseqfmt seqio_format  --readnameconvention attributelist=expression --tagattribute attribute=valueform --outseqpath filepath --cloneattribute attributelist --revcompif attribute=pattern --threshold 97 --alignoutformat fmt  < seqread > gff_results >>

=head1 DESCRIPTION

Determine the locations of mutation events in an insertion mutation
library (iml) by analyzing selected sequence reads.

Given the wild-type sequence for a gene, a description of an insertion
mutation, and the DNA sequence reads of clone library being screened
insertion mutations of the gene, recover the location on each clone of
the mutation and, if present, map it to the corresponding location on
the wild-type (by way of an assembly).  Produce as output two files:
1) a genbank file holding (a copy of) the wild-type sequence along
with features corresponding to the recovered and mapped mutation
locations; 2) a GFF file describing only the features

Include diagnostic message for any clones failing to produce a mutation.

Read the psuedo-code in the L<"method"> for more details.

=head1 OPTIONS

Their case is ignored, and they may be abbreviated to uniqueness
(i.e. --v instead of --verbose).

Options may be specified on the command line, and may optionally also
be read from files by providing on the command line the path to the
file preceded by a '@'.  Such option files provide simple access to
typical calling scenarios (such as an analysis that is repeatedly
invoked from the command line with the same parameters).

=over

=item B<--mutation> inserted_mutation

The sequence inserted in forming of the mutation.

=item B<--flanksize> base count

The number of bases from the wild-type sequence that are reduplicated
in the clone after the insertion of the mutation.

=item B<--wtseqdoc>

A file containing a 'reference sequence' of the wild-type (not
mutated) form of the sequence against which all seqreads will be
aligned.

=item B<--wtseqfmt> fasta|genbank|scf|raw|ace|fasta|bsml|game|gcg|genbank|pir|swiss

The format of sequence file named in wtseqdoc.  All Bio::SeqIO
formats are admitted.  Default is fasta.

=item B<--informat  fasta|genbank|scf|raw|ace|fasta|bsml|game|gcg|genbank|pir|swiss|phd>

The sequence format of the sequences_reads appearing on standard input.  Default is fasta.

=item B<--outseqpath> pathname

Require location of resulting annotated reference sequence with all original
features and new features gained by running this tool.  File will be
produced in genbank format.  

Defaults to a filename gained from adding to wtseqdoc new extensions
of this scripts name followed by ".gb" (i.e. by adding .imlanal.gb).

=item B<--readnameconvention> attributelist=expression

A method of decoding attributes such as are typically encoded in the
name of files holding seqreads.

Implemented as an association between a comma delimited list of
attribute names and a perl expression which, when evaluated yeilds
corresponding values for each attribute.  The expression is evaluated
in the context of $_ being the filename holding a seqread.

For example:

   C<--readnameconvention='gene,library,plate,primer,platewellcoord=split /[-_.]/'>

when reading the file: VPL3-15-I-BSB460_A01.seq

will assign to the current read attributes as follows:

	gene => VPL3
	library => 15
	plate => I
	primer => BSB460
	platewellcoord => A01

Named aliases exist as shorthand methods for decoding commonly used
naming conventions.  The following are predefined: 

(TODO - IMPLEMENT ALIASES -  THIS FEATURE IS NOT IMPLEMENTED!?!?!)

=over

=item B<simr1>

Short for: '(gene,library,plate,primer,platewellcoord)=split /[-_.]/'

=back

=item B<--cloneattribute> string

A comma delimited list of seqread attributes which distinctly identify
the corresponding clone from which the read was made.

This is used to allow tracking whether a mutation has already been
identified for the clone of the current read, allowing imlanal to skip
processing such reads (in the name of efficiency).

For the simr1 naming convention, this should name all the attributes
except for the primer.

=item B<--alignoutformat>  bl2seq|clustalw|emboss|fasta|mase|mega|meme|msf|nexus|pfam|phylip|prodom|psi|selex|stockholm

Optional format in which a trace of alignments should be dumped.

If supplied, the alignment between each readseq and the wild-type
reference sequence are printed, as a trace, in the named format to the
file whose name is gained by appending ".aln.<alignoutformat>" to the
--outseqpath.

Note: The list of formats is that provided by Bio::AlignIO module.

=item B<--threshold> percentage identity

An optional threshold which, if supplied, must be exceeded by the
average percentage identity of the alignment for the alignment to be
used (as computed by Bio::Alignment.

=item B<--revcompif> attribute=expression

Method for declaring which sequence reads should be reverse
complemented prior to alignment.  Implemented as one or more optional
expressions to be evaluated in the (perl) context of $_ being set to
the correspondingly named attribute (as set by the readnameconvention).

For example:

=over

=item C<--revcompif primer=m/BSB460|BSB458/>

will cause those reads whose 'primer' attribute matches either
BSB460 or BSB458 to be reverse complemented.

=item C<--revcompif primer=m/_R^/>

will cause those reads whose primer attribute ends in '_R' to be
reverse complemented.

=back

=item B<--tagattribute> attribute=expression

Method of defining additional attributes to assign to the output
resulting insertion mutation location features.

The value of the attribute is the result of evaluating the
expression in the context of $_ being set to a reference to a hash of
other current read attributes (such as established using readnameconvention).

=head1 EXAMPLE

Assuming your account is configured so that in your F<.tcshrc> you source imlanal.tcshrc,

=for html <A HREF="./imlanal.tcshrc">./imlanal.tcshrc</A>

Futher assuming that you wish to use the aliases established there
(that are useful especially to Vlad), and so want to use the mgs1 configuration options,

=for html <A HREF="./config/mgs1.imlanal">./config/mgs1.imlanal</A>

then, change to a test directory (recursively) holding sequence reads:

$ cd /home/mec/src/imlanal/t/data/VAP/VPL3/data-VPL3_15_I/    (L<http://helix/~mec/src/imlanal/t/data/VAP/VPL3/data-VPL3_15_I/>)

and analyze all the sequence read files below it:

$ imlmgslib

OR

=head1 INTEGRATION / ENVIRONMENT

The GFF files can be opened using vector NTI version 8 under windows.
The implementation of reading GFF files ignores all but the label
attribute (I have a bug report / feature request into them on this
topic).

The genbank may also be viewed in vector NTI (any version), and will
show ALL the feature labels generated.

=head1 VERSION

Not yet under version control.

=head1 CONTACT

Malcolm Cook (L<mec@stowers-institute.org>)

=head1 DEPENDENCIES

perl, BioPerl, clustalw

=head1 CGI OPERATION

If this script is invoked as a CGI program, it produces HTML to
document itself.

It detects that it is running as a CGI by looking for '.cgi' as an
extension on the name of the running script.  Thus the script should
be installed without an extension, and a symbolic link to it should be
created with same name, only having .cgi as an extension.

=head1 AVAILABILITY

Run it on helix (out of ~mec/src).

Email the author for sources.

=for html  ...or <A HREF="./imlanal">get the source now!</A>

=for html  ...or <A HREF="./imlanal.html">see the htmlized source!</A>


=head1 METHOD

For each sequence read in the library:

 > Skip it if the reads belongs to a clone for which a mutation
 > location has already been recovered

 > Reverse complement it if needed (i.e. it is a reverse read)

 > Identify the number and location of subsequences matching the
 > insertion mutation

 > Filtering out any such subsequences whose flanking sequence do NOT
 > match

 > Recover the wild-type clone sequence by splicing out the inserted
 > sequence and one of the reduplicated falnking sequences

 > Align the recovered clone to the actual wild-type.

 > Skip it if the alignment is poor

 > Use the alignment to map the insertion location on the clone to the
 > wild-type

 > Create the necessary output data structures (GFF and Genbank
 > features)

=head1 TODO

=over

=item sw engineering

put it under source code control,

create a distribution & installation package

=item release it 

to the world (i.e. BioPerl community)

=item allow other alignment tools than clustalw 

(i.e. TCOFFEE, etc)

=item submit patch to clustalw.pm 

to redirect stdout output to stderr unless quiet (it goes to
STDOUT!!!).

=item method of finding insertion mutation 

grok biology better - should we really use BioPerl's restriction
enzyme, or rather seq pattern (or, even more simply, regular
expression).

=item statistics

Be able to answer questions like

 >Is the location of insertion uniformly distributed across the sequence?

 > Provided a partitioning of the sequence, is the partitioning
 > predictive of the distribution of insertions?

=head1 ROADS NOT TAKEN:

trim sequence reads for quality (phred?) and or vector contamination (crossmatch?)

assemble (using phrap?) reads from the same clone together prior to alignment

align all reads against wildtype in one fell swoop - subsequently ensure that reads from the same clone

=back

=cut

use strict;
use Pod::Usage;
use Getopt::ArgvFile qw(argvFile); #Allows options to be specified in
                                   #files in addition to command line.
use Getopt::Long;
use File::Spec;
use FindBin;
use UNIVERSAL::require;

use Bio::Tools::RestrictionEnzyme;
use Bio::SeqIO;
use Bio::SeqFeature::Generic;
use Bio::Tools::Run::Alignment::Clustalw;
use Data::Dumper;

my $man = 0;
my $help = 0;
my $dupsallowed;
my $fofn;
my $verbose;

my $wtseqdoc;
my $wtseqfmt;
my $flanksize;
my $mutation;
my $outseqpath;
my $informat;
my @cloneattribute;
my $alignoutformat;
my $threshold;
my %readnameconvention;
my %revcompif = ();
my %tagattribute;

argvFile(
	 startupFilename => "$FindBin::Script.config",
	 current => 1,
	 default => 0,
	 home    => 0,
	);


GetOptions(
	   'help|?'                => \$help,
	   'man'                   => \$man,
	   'verbose'               => \$verbose,

	   'wtseqdoc:s'            => \$wtseqdoc,
	   'wtseqfmt:s'            => \$wtseqfmt,
	   'informat:s'            => \$informat,
	   'flanksize:i'           => \$flanksize,
	   'mutation:s'         => \$mutation,
	   'outseqpath:s'          => \$outseqpath,
	   'alignoutformat:s'      => \$alignoutformat,
	   'threshold:i'           => \$threshold,
           'readnameconvention=s'   => \%readnameconvention,
	   'cloneattribute:s'      => \@cloneattribute,
	   'revcompif:s'           => \%revcompif,
	   'tagattribute:s'        => \%tagattribute,

	  ) or pod2usage(2);

pod2usage(1) if $help;
pod2usage(-exitstatus => 0, -verbose => 2) if $man;

if ($FindBin::Script =~  m/.*\.cgi/) { 
  # If the script has been invoked as a CGI program, spew the manpage
  # as HTML.  CGI dectection is heuristic - it tests whether executing
  # script has a .cgi extension (the script is presumably a symlink to
  # the extensionless command line executable).
  use CGI qw(:standard);
  print header,`pod2html --backlink "top" --header $FindBin::RealScript`;
  exit 0;
}


################################################################################
### Provide defaults and validate options
################################################################################

my @readnameconvention 
  # used only for validation of revcompifkey and cloneattribute;
  = map {split ','} keys %readnameconvention;

my @revcompifkey = keys %revcompif;

my @invalidrevcompifkey = grep {my $att = $_; 0 == grep {$_ eq $att} @readnameconvention} @revcompifkey;
die "Illegal revcompif keys: @invalidrevcompifkey" if @invalidrevcompifkey;

@cloneattribute
  #allow multiple attributes with single command options
  = split ',' , join "," , @cloneattribute;

my @invalidcloneattribute 
  #validate them against members of readnameconvention:
  = grep {my $att = $_; 0 == grep {$_ eq $att} @readnameconvention} @cloneattribute;
die "Illegal clone attribute: @invalidcloneattribute" if @invalidcloneattribute;

$wtseqdoc    ||= '../wtseq.fasta';
$wtseqfmt    ||= 'fasta';
$informat    ||= 'fasta';
$outseqpath  ||= "$wtseqdoc.$FindBin::Script.gb";
$flanksize   ||= 5;
$mutation    ||= 'NOTI--TGCGG^CCGCA';
$threshold   ||= 0;

my $alignout;
$alignout = Bio::AlignIO->new(-file => ">$outseqpath.aln.$alignoutformat" , '-format' => $alignoutformat) if $alignoutformat;

my $factory = Bio::Tools::Run::Alignment::Clustalw->new(
							'ktuple' => 2, 
							'matrix' => 'BLOSUM'
						       );

$factory->quiet(1) unless $verbose;	
# Don't show clustal output (send it to >dev/null - in which case
# clustalw output gets redirected to STDERR by the MY COPY OF
# Clustalw.pm).

my $renzyme = new Bio::Tools::RestrictionEnzyme( '-NAME' => $mutation,
						 '-MAKE' =>'custom'
					       );

my $seqin  = Bio::SeqIO->newFh(-format => $informat);

my $wtseq = Bio::SeqIO->new(-file => $wtseqdoc,
			     -format => $wtseqfmt)->next_seq();


my ($readseq ,  $readseqid,
    $cutlocations, $readposition, $left, $right, $recovered_readseq,
    $aligned_wtseq, $aligned_recovered_readseq, $wtseq_insertlocation, $wtseq_newfeature,
    $aln,$cloneid,
   );

my $outseq = Bio::SeqIO->new(-format => 'GENBANK',
			     -file =>">$outseqpath",
	       );

my %attread;
my %seencloneid = ();

# TODO: Remove following redefine when next rev of bioperl installed
# (which has similar patch) all patch does is allow zero (0) as a tag
# attribute (without xlating to empty string)

sub Bio::Tools::GFF::_gff2_string{
   my ($gff, $feat) = @_;
   my ($str,$score,$frame,$name,$strand);

   if( $feat->can('score') ) {
       $score = $feat->score();
   }
   $score = '.' unless defined $score;

   if( $feat->can('frame') ) {
       $frame = $feat->frame();
   }
   $frame = '.' unless defined $frame;

   $strand = $feat->strand();
   if(! $strand) {
       $strand = ".";
   } elsif( $strand == 1 ) {
       $strand = '+';
   } elsif ( $feat->strand == -1 ) {
       $strand = '-';
   }

   if( $feat->can('seqname') ) {
       $name = $feat->seqname();
       $name ||= 'SEQ';
   } else {
       $name = 'SEQ';
   }


   $str = join("\t",
                 $name,
		 $feat->source_tag(),
		 $feat->primary_tag(),
		 $feat->start(),
		 $feat->end(),
		 $score,
		 $strand,
		 $frame);

   # the routine below is the only modification I made to the original
   # ->gff_string routine (above) as on November 17th, 2000, the
   # Sanger webpage describing GFF2 format reads: "From version 2
   # onwards, the attribute field must have a tag value structure
   # following the syntax used within objects in a .ace file,
   # flattened onto one line by semicolon separators. Tags must be
   # standard identifiers ([A-Za-z][A-Za-z0-9_]*).  Free text values
   # must be quoted with double quotes".

   # MW

   my $valuestr;
   if ($feat->all_tags){  # only play this game if it is worth playing...
        $str .= "\t";     # my interpretation of the GFF2 specification suggests the need for this additional TAB character...??
        foreach my $tag ( $feat->all_tags ) {
            my $valuestr; # a string which will hold one or more values for this tag, with quoted free text and space-separated individual values.
            foreach my $value ( $feat->each_tag_value($tag) ) {
         		if ($value =~ /[^A-Za-z0-9_]/){
         			$value =~ s/\t/\\t/g;          # substitute tab and newline characters
         			$value =~ s/\n/\\n/g;          # to their UNIX equivalents
         			$value = '"' . $value . '" '}  # if the value contains anything other than valid tag/value characters, then quote it
				$value = "\"\"" unless defined $value;  # if it is completely empty, then just make empty double quotes
         		$valuestr .=  $value . " ";								# with a trailing space in case there are multiple values
         															# for this tag (allowed in GFF2 and .ace format)		
            }
            $str .= "$tag $valuestr ; ";                              # semicolon delimited with no '=' sign
        }
   		chop $str; chop $str  # remove the trailing semicolon and space
    }
   return $str;
}


SEQUENCE: while (<$seqin>) {
  $readseq = $_;
  $readseqid = $readseq->display_id;
  %attread = ();

  while (my ($attributes, $valuesform) = each(%readnameconvention)) {
    $_=$readseqid;
    my @value = eval($valuesform);
    if ($@) { die "Error determining read attributes: [$attributes] while evaluating: [$valuesform]...\n\t\t$@\n"  };
    my @att = split /,/ ,$attributes;
    if (not $#att == $#value) {
      warn "$readseqid SKIPPING! Number of values [$#value] does not comport with attributes [@att] ";
      next SEQUENCE;
    }
    for (@att) {$attread{$_} = shift(@value)};
  }

  
  while (my ($attribute, $valueform) = each(%tagattribute)) {
    $_=\%attread;		# establish eval context.
    $attread{$attribute} = eval($valueform);
    if ($@) { die "Assigning attributes: [$attribute] while evaluating: [$valueform]...\n\t\t$@\n"  };
  }


  #NB - the access above CREATES a hash entry (with undef value)

  if (@cloneattribute) {
    $cloneid = join '_', map {$attread{$_}} @cloneattribute;
    #undef $seencloneid{$cloneid} unless  exists($seencloneid{$cloneid});
    if (! exists($seencloneid{$cloneid})) {
      undef $seencloneid{$cloneid};
    }

    #print STDERR "$readseqid has cloneid: $cloneid \n";
    if ($seencloneid{$cloneid}) {
      print STDERR "$readseqid - SKIPPING! Already found clone [$cloneid] in read [", $seencloneid{$cloneid}, "]\n";
      next SEQUENCE;
    }
  }
  
  if (grep {my $fkey = $_;
	    $_ = $attread{$fkey};
	    my $test =  $revcompif{$fkey};
	    eval $test;
	  }
      map {$_} @revcompifkey 
      # TODO - @revcompifkey is being copied here to avoid destructive modification via $_ alias inside grep block. 
      # Also, grep returns all matches, and we want to fail if any sinlge revcompif fails to match.
      # Figure out a better way of doing this!
     ) {
    $readseq = $readseq->revcom;
  }

  $cutlocations = $renzyme->cut_locations($readseq); #Returns : Arrayref of starting locations where enzyme would cut
  print STDERR "$readseqid cutlocations: [@$cutlocations]\n" if $verbose;

  $cutlocations = [grep {my $readposition = $_;
			 my ($left, $right) = 
			   ($readseq->subseq($readposition - $flanksize + 1, $readposition),
			    $readseq->subseq($readposition + $renzyme->seq->length + 1, $readposition + $renzyme->seq->length + $flanksize));
			 my $flankmatch = $left eq $right;
			 print STDERR  $readseq->id . " - FILTERING recognized insertion point [$readposition] since flanking sequences don't match; left: [$left], right: [$right] \n" unless $flankmatch ;
			 $flankmatch
		       }
		   @$cutlocations];
  
  if (1 != @$cutlocations) {
    print STDERR "$readseqid - SKIPPING! Required exactly one cutlocation but found: [@$cutlocations]\n";
    next SEQUENCE;
  }

  if (@cloneattribute) {
    #since it won't be skipped, mark it as 'seen'.
    $seencloneid{$cloneid} = $readseqid;
  }

  $readposition = $cutlocations->[0];

  #splice out the insert
  ($left, $right) = 
    ($readseq->subseq($readposition - $flanksize + 1, $readposition),
     $readseq->subseq($readposition + $renzyme->seq->length + 1, $readposition + $renzyme->seq->length + $flanksize));

  $recovered_readseq
    = Bio::Seq->new( -seq => ($readseq->subseq($readseq->start, $readposition) .   
			      $readseq->subseq($readposition + $renzyme->seq->length + $flanksize, $readseq->end)),
		     -id => $readseq->id ."_unmutated",
		   );

  #produce alignment with reference sequence
  $aln = $factory->align([$wtseq, $recovered_readseq]); # a Bio::SimpleAlign
  $alignout->write_aln($aln) if $alignout;    

  if ($threshold >= $aln->average_percentage_identity()) {
    print STDERR "$readseqid - SKIPPING! Threshold [",$threshold,"] > average_percentage_identity [", $aln->average_percentage_identity(),"]\n";
    next SEQUENCE;
  }

  $aligned_wtseq = $aln->get_seq_by_pos(1); # a Bio::LocatableSeq
  $aligned_recovered_readseq =  $aln->get_seq_by_pos(2); # a Bio::LocatableSeq.

  $wtseq_insertlocation	# a  Bio::Location::Simple or Bio::Location::Fuzzy object or undef
    = $aligned_wtseq->location_from_column($aligned_recovered_readseq->column_from_residue_number($readposition));

  die "wtseq_insertlocation start <> end!!" unless $wtseq_insertlocation->start == $wtseq_insertlocation->end;

  $wtseq_newfeature =  new Bio::SeqFeature::Generic(
						     #-location =>  $wtseq_insertlocation,
						     # Vladaslav wants to 'mark' the duplicated sequence, rather than the insert location.  
						     -start => $wtseq_insertlocation->start() - $flanksize + 1,
						     -end => $wtseq_insertlocation->end(),
						     -strand => 1, 
						     -primary => 'misc_feature',
						     -tag  => {
							       %attread,
							      },
						     #-score   => 1,
						     -source  =>  $FindBin::Script,
						    );
  $wtseq_newfeature->seqname($wtseq->id()); # thus effecting the 1st column of gff_string output.
  $wtseq->add_SeqFeature($wtseq_newfeature); # Add the SeqFeature to the parent
  print $wtseq_newfeature->gff_string, "\n";
}

$outseq->write_seq($wtseq);

if (my @clones_without_mutations = grep {! defined $seencloneid{$_}} keys %seencloneid) {
  print STDERR "\n\nWARNING: there were  clones_without_mutations: @clones_without_mutations\n";
}


__DATA__