eFASTA utils

eFASTA utils are a set of scripts and file formats that I use to manage sequences. The format is an extension of the traditional FASTA format that has been used for many years. A FASTA file consists of a header line (which starts with a '>'), followed by any number of sequence lines. This pattern can be repeated to store multiple sequences in the same file. Sequences are usually separated by a blank line, but this is not a requirement: the next '>' header line is what starts a new sequence.

Example:
>sequence 1
ATAGCATGCATCGATGCATCGATCGATCAGTC
ACTACTAGCATGCATCGATGCATCG

>seq 2
ATATCGATCTGACTAGCGTAGCTAGCATGCAT
ACTGATCGATTTTACTATCTACAAACTAGCTA
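
To make the record structure concrete, here is a minimal Perl sketch that reads the example above into header/sequence pairs. The read_fasta helper is a name made up for this illustration and is not part of eFASTA utils; it assumes that blank lines are only separators and that any line starting with '>' begins a new record.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of eFASTA utils): read FASTA text from a
# filehandle and return a list of [header, sequence] pairs.
sub read_fasta {
    my ($fh) = @_;
    my @records;
    while (my $line = <$fh>) {
        $line =~ s/\s+$//;             # drop the newline and trailing whitespace
        next if $line eq "";           # blank separator lines carry no data
        if ($line =~ /^>(.*)/) {
            push @records, [$1, ""];   # a '>' line starts a new record
        } elsif (@records) {
            $records[-1][1] .= $line;  # sequence lines are joined together
        }
    }
    return @records;
}

# The example file from above, held in a string for demonstration.
my $text = <<'END';
>sequence 1
ATAGCATGCATCGATGCATCGATCGATCAGTC
ACTACTAGCATGCATCGATGCATCG

>seq 2
ATATCGATCTGACTAGCGTAGCTAGCATGCAT
ACTGATCGATTTTACTATCTACAAACTAGCTA
END

# Open the string as an in-memory file and report each record.
open(my $fh, '<', \$text) or die "Couldn't open string: $!";
for my $rec (read_fasta($fh)) {
    printf "%s: %d bases\n", $rec->[0], length $rec->[1];
}
```

Sequence lines that appear before any header are silently skipped in this sketch; a stricter reader might treat them as an error instead.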

This is all well and good, but it doesn't do much other than store sequences. I like to manually alter my sequence files using a text editor, so I immediately added a few new features to the FASTA format, and made the appropriate scripts to convert these eFASTA files into the traditional FASTA format for use in other programs.

eFASTA really adds just one feature that FASTA is missing: comments (hence the name, enhanced FASTA).  Anything from a '#' to the end of the line is considered a comment.  Comments make an eFASTA file much more descriptive than a normal sequence file, because the researcher can annotate the sequence in place.  In my case, I like to define plasmid constructs in eFASTA format, which lets me mark primer binding sites, restriction enzyme cut sites, and where my inserted sequence starts and stops.

This could be taken a step further by standardizing on a set ontology for comments, such as @primer or @organism = "Human".  However, no effort has been put into pulling structured annotation data from the sequence file.

eFASTA could be used as a kind of inverse GFF file.  GFF files list annotations of a sequence in a tab-delimited format, optionally followed by the sequence itself.  In eFASTA, the annotations instead appear as a stream within the sequence itself.  For certain cases (such as those requiring manual editing), this could prove more useful.
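
As a rough sketch of the inverse-GFF idea, the Perl below walks through eFASTA lines and reports each comment together with the position of the first base that follows it. Nothing like this exists in eFASTA utils; the list_annotations name, the tab-separated output, and the shortened input lines are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of eFASTA utils): return one
# [header, 1-based offset, comment text] entry per comment, where the
# offset is the position of the first base that follows the comment.
sub list_annotations {
    my (@lines) = @_;
    my @notes;
    my $header = "";
    my $pos = 0;                    # bases seen so far in this record
    for my $line (@lines) {
        my $comment;
        # cut the line at the first '#', keeping the trimmed comment text
        if ($line =~ s/#\s*(.*?)\s*$//) {
            $comment = $1;
        }
        if ($line =~ /^>\s*(.*?)\s*$/) {
            $header = $1;           # new record: restart the base count
            $pos = 0;
        } else {
            $line =~ s/\s//g;
            $pos += length $line;   # bases on this line precede its comment
        }
        push @notes, [$header, $pos + 1, $comment] if defined $comment;
    }
    return @notes;
}

# Made-up, shortened input in the style of an eFASTA plasmid file.
my @lines = (
    ">demo construct",
    "#vector",
    "ctaaattgta",
    "#insert sequence...",
    "CTTCTGATTT  # first bases of the insert",
    "GCTGA",
);
for my $note (list_annotations(@lines)) {
    print join("\t", @$note), "\n";
}
```

Each output row names the record, a coordinate, and the free-text comment, which is roughly the information a GFF line carries, although this sketch records only a start position and no end.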

Example eFASTA file:

>Gata4-pBluescriptSK+
#vector
ctaaattgtaagcgttaatattttgttaaaattcgcgttaaatttttgttaaatcagctca
ttttttaaccaataggccgaaatcggcaaaatcccttataaatcaaaagaatagacc
gagatagggttgagtgttgttccagtttggaacaagagtccactattaaagaacgtg
gactccaacgtcaaagggcgaaaaaccgtctatcagggcgatggcccactacgtga
accatcaccctaatcaagttttttggggtcgaggtgccgtaaagcactaaatcggaac
cctaaagggagcccccgatttagagcttgacggggaaagccggcgaacgtggcga
gaaaggaagggaagaaagcgaaaggagcgggcgctagggcgctggcaagtg
tagcggtcacgctgcgcgtaaccaccacacccgccgcgcttaatgcgccgctacaggg
cgcgtcccattcgccattcaggctgcgcaactgttgggaagggcgatcggtgcgggcc
tcttcgctattacgccagctggcgaaagggggatgtgctgcaaggcgattaagttgg
gtaacgccagggttttcccagtcacgacgttgtaaaacgacggccagtgagcgcgcg
taatacgactcactatagggcgaat
#sequenced vector chunk...
tgggtaccgggccccccctcgaggtcgacggtatcgataagcttgatatcgaattcct
#insert sequence...
#GATA4-pBS 5'UTR
CTTCTGATTTGCTGATTGTCTCTAGACTTATCTCAGGCCTGGCTT
CCATGAGATTGGGGATACAACGTGAGATATCTGCCTGGTGGTGTT
ATTTCTCTTGAGGAACATTCTATACACACACACACACAGACTGTG
GATTTTGTGGATTGAGTTGCCAGGATCAGGGAACAACCTGGTCTT
AGCTTAACCGGAATTCCGG
#GATA4 coding region (w/o stop)
ATGTATCAGAGTATAGCTATGGCCACTAACCATGGTCCCTCTGGC
TATGAGGGGACTGGGAGCTTCATGCACAGTGCTACTGCTGCTACC
TCACCTGTCTATGTGCCCACCACCAGGGTCTCCTCCATGATCCAC
AGCCTACCTTACCTCCAAACCAGCGGCTCATCTCAACAAGGAAGC
CCAGTTTCTGGCCACAACATGTGGGCACAAGCCGGAGTGGAATCT
TCTGCCTACAACCCAGGAACTTCTCATCCCCCAGTGTCTCCCAGA
TTCACTTTCTCCTCCAGCCCCCCTATCACAGCACCCTCCAGCAGA
GAGGTCTCCTACAGTAGCCCCCTAGGCATCTCAGCTAATGGGAGA
GAGCAGTACAGCAGGGGGCTGGGTGCCACCTATGCAAGCCCTTAC
CCAGCCTATATGAGTCCAGACATGGGTGCTGCCTGGACTGCTTCT
CCCTTCGACAGCTCCATGCTCCACAACCTCCAGAACAGAGCAGTG
ACGTCAAGGCACCCAAACATAGAGTTTTTTGACGATTTTTCCGAGG
GCCGAGAATGTGTCAACTGTGGAGCAATGTCAACCCCACTTTGGA
GGCGGGATGGAACAGGCCACTATCTATGCAATGCTTGCGGATTGT
ACCATAAGATGAATGGGATCAACCGTCCTCTGATCAAGCCCCAAA
GGCGACTGTCTGCGTCTCGCCGGGTGGGTCTATCCTGTGCCAAC
TGTCATACAACCACTACTACACTCTGGCGTCGTAATGCTGAGGGG
GAACCTGTATGCAATGCGTGTGGCCTTTACATGAAGCTACACGGG
GTCCCTCGGCCGCTAGCAATGAAAAAGGAAGGGATCCAGACACGA
AAGCGCAAACCCAAGAACCTCAGCAAGTCTAAAACACTAACAGGC
CAAAGTGGCAGTGACAGCCTCACTCCTTCCACCAGCTCCACAAA
CTCCATGGGAGAGGAAATGCGTCCAATAAAGATTGAGCCAGGAC
TGTCCCCTCCATATGACCACTCAAATTCAATATCTCAGGCATCTG
CATTATCTACAATCACAAGCCATGGATCATCATATTACCCAATGC
CAAGCTTAAAACTCTCGCCACAGAATCACCACTCTACATTCAACC
CATCTCCACAAGCCAACTCCAAACATGACTCCTGGAACAACCTGG
TCTTAGCT
# GATA4-pBS 3' UTR? (w/stop)
TAAAGCACATGCCAACCACCGACCTCCGGACATTCTCTGTACTGA
TTACTTATGGTCGGTTACAATGATACAACTCATTGTTGAACATGTG
TAATTAAAGAAACCAAAGACTCGAACGTTTAAAAAAAAAAAGTAATA
AAAAAGACTTTTTTTAAAAAAAAAAAGAAGTAATTTAAGATTTTGCT
GTAATAGAATACATAAGACTATATCCATTGTGGAAGAGATGGAAAG
ACGTGTAGCTGGGATAGAAATTTGGCAACGCAATGAAGCTTCGTC
CTTCACACAACTTTGGAAACCCAACTTGTCAGTGGATAAACCCTC
TACAAAAGTCCTGATGTTGCGGCGTTACGCTGATTAACTCCCATT
CTTTCAC
# pBS-Downstream
GGAATTCCTGCAGCCCGGGGGATCCACTAG
# Unknown Downstream (sequenced, not in vector map)
TTCTAGAGCGGCCGCCACCGCGGGAGCCCAG
#remaining vector
gcagcccgggggatccactagttctagagcggccgccaccgcggtggagctccagct
tttgttccctttagtgagggttaattgcgcgcttggcgtaatcatggtcatagctgtttcct
gtgtgaaattgttatccgctcacaattccacacaacatacgagccggaagcataaag
tgtaaagcctggggtgcctaatgagtgagctaactcacattaattgcgttgcgctcact
...
 

efastaconvert Perl script

#!/usr/bin/perl
#
# efastaconvert.pl
# Converts an efasta formatted file into a FASTA formatted file.
#
#
# usage: 
# ./efastaconvert.pl in_file out_file
# ./efastaconvert.pl < in_file > out_file
#
use strict;
use warnings;
sub convert_to_fasta {
    my ($fh,$fho) = @_;
    # line wrap length
    my $wrap=60;
    my $str="";
    my $printed=0;
    while (my $line=<$fh>) {
        chomp($line);
    
        # remove everything from the first hash mark to the end of the line
        # (a greedy "^(.*)#" would only cut at the last hash mark)
        $line=~s/#.*$//;
    
        if ($line=~/^>/) {
            if ($printed) {
                while (length($str)>$wrap) {
                    print $fho substr($str,0,$wrap)."\n";
                    $str = substr($str,$wrap);
                }
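                # skip the write when the previous record had no sequence,
                # so no stray blank line ends up in the output
                print $fho $str."\n" if length($str);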
            }
            $str="";
            print $fho $line."\n";
            $printed=1;
        } else {
            $line=~s/\s//gs;
            $str.=$line;
            while (length($str)>$wrap) {
                print $fho substr($str,0,$wrap)."\n";
                $str = substr($str,$wrap);
            }
        }
    }
    while (length($str)>$wrap) {
        print $fho substr($str,0,$wrap)."\n";
        $str = substr($str,$wrap);
    }
    # write the last partial line, if any (avoids a trailing blank line)
    print $fho $str."\n" if length($str);
}
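my $fh;
my $fho;
# read from the named file if one is given, otherwise from standard input
if (defined $ARGV[0] && length $ARGV[0])
{
    my $filename=$ARGV[0];
    # three-argument open: the file name is never parsed as an open mode
    open($fh,'<',$filename) or die "Couldn't open file: $filename: $!";
} else {
    $fh=\*STDIN;
}
# write to the named file if one is given, otherwise to standard output
if (defined $ARGV[1] && length $ARGV[1])
{
    my $filename=$ARGV[1];
    open($fho,'>',$filename) or die "Couldn't open file: $filename: $!";
} else {
    $fho=\*STDOUT;
}
convert_to_fasta($fh,$fho);
# close only the handles this script opened itself
if (defined $ARGV[0] && length $ARGV[0]) { close $fh; }
if (defined $ARGV[1] && length $ARGV[1]) { close $fho; }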
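
The line-wrapping step in the script can be looked at on its own. The sketch below is an alternative way to cut a sequence into fixed-width lines with a single regular expression; wrap_sequence is a hypothetical name, and unlike the script above it assumes the whole sequence is already in memory as one string.

```perl
use strict;
use warnings;

# Hypothetical helper: split a sequence string into chunks of at most
# $width characters, returned as a list (call it in list context).
sub wrap_sequence {
    my ($seq, $width) = @_;
    # each match takes up to $width characters; /g collects every match
    return $seq =~ /(.{1,$width})/g;
}

# 160 made-up bases, wrapped at the script's line length of 60.
my @wrapped = wrap_sequence("ACGT" x 40, 60);
print "$_\n" for @wrapped;
```

The script's own approach, which prints full lines as soon as the buffer grows past the wrap length, never needs to hold a complete sequence in memory, so it is the better fit for very long records.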