bioseq.h File Reference

#include "codon.h"
#include <string>
#include <iostream>
#include <map>
#include <algorithm>
#include <list>
#include "Range.h"
#include <vector>
#include "Interval.h"


Classes

class  greaterRangeLength
class  bioseqexception
class  bioseq
class  Protein
class  DNA
class  mRNA

Enumerations

enum  sequenceType {
  PROTEINSEQ = 1, DNASEQ = 2, RNASEQ = 4, NUCLEIC_ACID = 6,
  GENERIC = 7, UNKNOWN = 0
}

Functions

void printFasta (ostream &ous, const string &seq, int width=70)
string reverseComplement (const string &seq)
void reverseComplementInPlace (string &seq)
bool loadFastaIntoMap (const string &file, map< string, string > &store)
void translate (string &pep, const string &seq, int begin, int end=-1)
int countInternalStops (const string &seq)
void longestORFPlus (const string &rna, string &pep, int &b, int &e, int &f)
void longestORFPlus (const string &rna, string &pep, int &b, int &e)
void longestORFPlusSuffix (const string &rna, string &pep, int &b, int &e)
void longestORFPlusPrefix (const string &rna, string &pep, int &b, int &e)
bool longestNoStartORFPlus (const string &rna, pair< int, int > &nterm, string &npep, pair< int, int > &full, string &pep)
bool longestNoStopORFPlus (const string &rna, pair< int, int > &cterm, string &cpep, pair< int, int > &full, string &pep)
Range P2Rindex (const Range &ofb, const int frame)
Interval P2Rindex (const Interval &ofb, const int frame)
int P2Rindex (const int pos, const int frame)
bool overlayRVR (const Range &r, const vector< Range > &vr)
vector< Range > findAllORFIndex (const string &rna, int base=1, int pep_lencut=22, int HHcut=160, int HTcut=120, int TTcut=10)
void findAllPepORFIndex (list< Range > &orfrange, const string &ss, int cutoff)
int aachar2num (char a)
char aanum2char (int c)


Enumeration Type Documentation

This file also defines the sequence type as an enum (sequenceType) to save storage space.

Enumerator:
PROTEINSEQ 
DNASEQ 
RNASEQ 
NUCLEIC_ACID 
GENERIC 
UNKNOWN 
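
The enumerator values look like bit flags: NUCLEIC_ACID (6) equals DNASEQ | RNASEQ, and GENERIC (7) additionally includes PROTEINSEQ. Assuming that reading is intended (the documentation does not say so), group membership could be tested with a bitwise AND. The helper isKindOf below is hypothetical and is not part of bioseq.h.

```cpp
#include <cassert>

// Values copied from the enum listed above.
enum sequenceType {
    PROTEINSEQ = 1, DNASEQ = 2, RNASEQ = 4, NUCLEIC_ACID = 6,
    GENERIC = 7, UNKNOWN = 0
};

// Hypothetical helper (not in bioseq.h): true when every bit of `t`
// is contained in `group`. UNKNOWN (0) belongs to no group.
inline bool isKindOf(sequenceType t, sequenceType group) {
    return t != UNKNOWN && (t & group) == t;
}

// Small usage check for the assumed bit-flag reading.
inline void checkSequenceTypeFlags() {
    assert((DNASEQ | RNASEQ) == NUCLEIC_ACID);
    assert(isKindOf(RNASEQ, NUCLEIC_ACID));
    assert(!isKindOf(PROTEINSEQ, NUCLEIC_ACID));
}
```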


Function Documentation

int aachar2num ( char  a  ) 

used by encode function

Referenced by Matrix::lookup(), and Matrix::read().

char aanum2char ( int  c  ) 

convert amino acid number to character

int countInternalStops ( const string &  seq  ) 

Global (free) function, provided so that the operation can be used more flexibly.

Referenced by JGIModel::valid().

vector<Range> findAllORFIndex ( const string &  rna,
int  base,
int  peplen_cutoff,
int  HHcut,
int  HTcut,
int  TTcut 
)

Returns:
non-overlapping ORFs that are expected to be more biologically meaningful. Some of the ORFs might not be the longest, but the overall coding is optimal. The algorithm is a simple packing algorithm: select the largest ORF first, then select the next ORF that does not overlap those already selected. You need to do your own filtering. The returned ranges are sorted from small to large regardless of direction, using the less operator of Range. The result is sorted in the 5' to 3' direction if the input is RNA. This sorted order is essential for the ESTAssembly::breakup method.
Parameters:
base [0,1], default 1
The Range returned is a 0-based index. This method should be applied to sequences at least 100 nt long.
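
The packing step described above can be sketched as follows. This is only an illustration of "largest first, then the next non-overlapping one", under stated assumptions: Span is a hypothetical stand-in for Range (a closed interval of ints), strand direction and the HHcut/HTcut/TTcut parameters are ignored, and packLargestFirst is not a function of this library.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical stand-in for Range: a closed interval [first, second].
using Span = std::pair<int, int>;

inline int spanLength(const Span& s) { return s.second - s.first + 1; }

inline bool overlaps(const Span& a, const Span& b) {
    return a.first <= b.second && b.first <= a.second;
}

// Greedy packing: visit candidates from longest to shortest, keep each one
// that does not overlap anything already kept, then sort the kept spans
// by position (small to large).
std::vector<Span> packLargestFirst(std::vector<Span> candidates) {
    std::stable_sort(candidates.begin(), candidates.end(),
                     [](const Span& a, const Span& b) {
                         return spanLength(a) > spanLength(b);
                     });
    std::vector<Span> kept;
    for (const Span& c : candidates) {
        bool clash = std::any_of(kept.begin(), kept.end(),
                                 [&c](const Span& k) { return overlaps(c, k); });
        if (!clash) kept.push_back(c);
    }
    std::sort(kept.begin(), kept.end());
    return kept;
}
```

For the candidates [0,299], [250,400], [310,399] and [500,520], the sketch keeps [0,299], rejects [250,400] because it overlaps, and then keeps [310,399] and [500,520].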

References farRVR(), findAllPepORFIndex(), P2Rindex(), reverseComplement(), and translate().

Referenced by ESTAssembly::breakup(), and testFindORF().

void findAllPepORFIndex ( list< Range > &  orfrange,
const string &  ss,
int  cutoff 
)

Find all the ORF index bounds, as 0-based indices in protein space. The cutoff is the minimum amino acid length for an ORF to be registered; 25 aa is the minimum currently registered.

Parameters:
orfrange is the result of this operation. It is emptied at the beginning of the run if it was not already empty. The output is in ss index. The end of the range is the '*' symbol for a complete ORF or prefix ORF. The begin of the range is the index of 'M'.
ss is the input peptide sequence.
cutoff is an integer number. Peptides shorter than this are ignored in the searching phase. This parameter should not affect the real performance of this algorithm; it provides fine control. Currently 25 aa is used. Partial ORFs on either end of the DNA are given a limit of 1/2*cutoff or 35 aa, whichever is larger. This is based on the average random ORF length of 21 aa.
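
A minimal sketch of the scan for complete ORFs, to make the index convention concrete (begin is the index of 'M', end is the index of '*'). It is written under assumptions: PepSpan is a hypothetical stand-in for Range, only complete M...* ORFs are collected, and the special limits for partial ORFs are not modeled. scanCompleteOrfs is not part of this library.

```cpp
#include <cassert>
#include <list>
#include <string>
#include <utility>

// Hypothetical stand-in for Range: 0-based [begin, end] in peptide coordinates.
using PepSpan = std::pair<int, int>;

// Collect complete ORFs only. begin is the first 'M' after the previous stop,
// end is the index of the '*' that closes it. ORFs with fewer than `cutoff`
// residues (stop excluded) are skipped. Partial ORFs are not handled here.
void scanCompleteOrfs(std::list<PepSpan>& out, const std::string& pep, int cutoff) {
    out.clear();                       // mirror the documented "emptied first" behavior
    int start = -1;                    // index of the current 'M', or -1 if none yet
    for (int i = 0; i < static_cast<int>(pep.size()); ++i) {
        if (pep[i] == 'M' && start < 0) {
            start = i;
        } else if (pep[i] == '*') {
            if (start >= 0 && i - start >= cutoff) out.push_back({start, i});
            start = -1;                // a stop resets the search
        }
    }
}
```

With the peptide "AAMKKK*LLMAAAAAA*MA" and a cutoff of 5, only the range [9,16] is reported: the first ORF has four residues, and the trailing "MA" has no stop.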

References max.

Referenced by findAllORFIndex().

bool loadFastaIntoMap ( const string &  file,
map< string, string > &  store 
)

load all the sequences in the file into a map for later usage. Only the id is used, title information is not cached.

Returns:
true for success, false for failure.

References ifstream(), and bioseq::seq.

Referenced by main(), testFromFile(), and testLoad().

bool longestNoStartORFPlus ( const string &  rna,
pair< int, int > &  nterm,
string &  npep,
pair< int, int > &  full,
string &  pep 
)

Find the longest ORF of the forms 1----* and M---*, and return both in one operation, letting the user decide what to do with the two values. The * stands for the stop codon. Indices are 0-based and inclusive, [b,e]: e is the third base of the stop codon, and b is the first base of the start codon for a complete ORF. For a no-start ORF, b is the frame, implying a start from 0. b and e are packed into the pair data structure.

nterm and full contain the [b,e] in RNA coordinates.

If the sequence contains neither a NoStart nor a Full ORF, the function returns false.

Returns:
true if the sequence contains at least one of a NoStart or Full ORF; false if both are missing. In that situation, pep will be set to the no-start-and-no-stop ORF if it exists, and full will be set from the no-start-no-stop frame to the end of the rna seq. This usually happens when the sequence is short.

References bioseq::length(), and DNA::translate().

Referenced by ESTAssembly::breakPrefixModel(), ESTAssembly::breakSuffixModel(), and testLongestNMissingORF().

bool longestNoStopORFPlus ( const string &  rna,
pair< int, int > &  cterm,
string &  cpep,
pair< int, int > &  full,
string &  pep 
)

void longestORFPlus ( const string &  rna,
string &  pep,
int &  b,
int &  e 
)

void longestORFPlus ( const string &  rna,
string &  pep,
int &  b,
int &  e,
int &  f 
)

Find all ORFs (M...* in the middle of the peptide sequence, or ...*, or M...) in all three reading frames of rna, and pick the longest one. Sets the pep seq, and sets b and e as a 0-based inclusive index [b,e] in RNA index.

Parameters:
f is the frame [0,1,2]. The frame can actually be derived from b: the frame is always b % 3. The actual begin is 0 if b < 3 and the start is not M.

References Interval::begin(), Interval::end(), find(), Interval::length(), bioseq::length(), Range::length(), P2Rindex(), bioseq::substr(), and DNA::translate().

void longestORFPlusPrefix ( const string &  rna,
string &  pep,
int &  b,
int &  e 
)

A prefix ORF is an ORF with a stop but no start.

References bioseq::length(), and DNA::translate().

void longestORFPlusSuffix ( const string &  rna,
string &  pep,
int &  b,
int &  e 
)

A suffix ORF is an ORF with a start but no stop.

References bioseq::length(), and DNA::translate().

bool overlayRVR ( const Range &  r,
const vector< Range > &  vr 
)

helper function for findAllORFIndex()

References Range::overlay().

int P2Rindex ( const int  pos,
const int  frame 
)

Interval P2Rindex ( const Interval &  ofb,
const int  frame 
)

Range P2Rindex ( const Range &  ofb,
const int  frame 
)

Translates a full ORF or a prefix ORF from protein space to RNA space; a suffix ORF needs special treatment. For a prefix ORF, the start is used to encode frame information: if start < 3, then it contains the frame info. A full ORF starts with M.

Parameters:
frame [0,1,2]
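
The index arithmetic can be sketched with plain ints, using the relation given in the translate() notes on this page: in the 0-based system, Ri = frame + 3*Pi is the first base of the codon, and 2 is added to reach the third base of the last (stop) codon. The function names below are hypothetical, and the frame encoding for prefix ORFs described above is not modeled.

```cpp
#include <cassert>
#include <utility>

// First base (0-based, RNA coordinates) of the codon for peptide
// position `pos` in reading frame `frame` (0, 1, or 2).
inline int pepToRnaFirstBase(int pos, int frame) { return frame + 3 * pos; }

// Closed peptide range [pb, pe] -> closed RNA range.
// The RNA end is the third base of the codon at peptide position pe.
inline std::pair<int, int> pepToRnaSpan(int pb, int pe, int frame) {
    return {pepToRnaFirstBase(pb, frame), pepToRnaFirstBase(pe, frame) + 2};
}

// Worked example: frame 1, peptide range [2, 5] maps to RNA range [7, 18].
inline void checkPepToRna() {
    assert(pepToRnaFirstBase(2, 1) == 7);
    assert(pepToRnaSpan(2, 5, 1) == std::make_pair(7, 18));
}
```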

References Range::begin(), and Range::end().

Referenced by findAllORFIndex(), and longestORFPlus().

void printFasta ( ostream &  ous,
const string &  seq,
int  width = 70 
)

Class-wide function that can be used without constructing a bioseq object. This improves performance if you are only interested in the operation and not in using the bioseq object and its derived classes.
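
A rough sketch of what such a free function might do: write a sequence to a stream in fixed-width lines. This is an assumption based on the signature alone; whether the real printFasta() also writes a header line, or how it treats a non-positive width, is not stated here. writeWrapped is a hypothetical name.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Write `seq` to `out`, breaking it into lines of at most `width` characters.
// Assumption: no FASTA header line is written by this sketch.
void writeWrapped(std::ostream& out, const std::string& seq, int width = 70) {
    if (width <= 0) width = 70;        // guard against a non-positive width
    const std::string::size_type w = static_cast<std::string::size_type>(width);
    for (std::string::size_type i = 0; i < seq.size(); i += w)
        out << seq.substr(i, w) << '\n';
}

// Usage check: ten bases wrapped at width 4 give three lines.
inline void checkWriteWrapped() {
    std::ostringstream os;
    writeWrapped(os, "ACGTACGTAC", 4);
    assert(os.str() == "ACGT\nACGT\nAC\n");
}
```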

Referenced by extractInter(), main(), mRNAModel::show(), testLoad(), and ESTAssembly::write().

string reverseComplement ( const string &  seq  ) 

void reverseComplementInPlace ( string &  seq  ) 

void translate ( string &  pep,
const string &  seq,
int  begin,
int  end = -1 
)

Algorithms about ORF:
prefix ORF: ORF starting from the 5' end without a start codon.
suffix ORF: ORF ending at the 3' end without a stop codon.
full ORF: ORF with start and stop codons.

Protein index to RNA index transformation, in the 0-based index system: Ri = frame + 3*Pi gives the first base of the codon. For the end of the stop codon, you need to add 2.

Helper function to be used by other methods. Uses more basic types; for convenience, uses a 1-based inclusive index.

Parameters:
begin start position of translation, first base.
end end position of translation, the third base of a codon if not at the end of the sequence. This can be any of the codon bases; in that case it just generates a partial peptide with the last amino acid unspecified.
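
The parameter conventions can be illustrated with a small stand-alone translator. This is a sketch, not the library's implementation: it hard-codes the standard genetic code instead of using the codon table from codon.h, it reads begin and end as 1-based inclusive positions with end = -1 meaning "to the end of the sequence", and it writes 'X' for an incomplete or ambiguous codon. Each of those choices is an assumption, and translateSketch is a hypothetical name.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Standard genetic code, codons ordered T, C, A, G at each position.
static const char kCodonTable[] =
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

// Map a base to 0..3 in T, C, A, G order; U is treated as T. -1 otherwise.
inline int baseCode(char c) {
    switch (std::toupper(static_cast<unsigned char>(c))) {
        case 'T': case 'U': return 0;
        case 'C': return 1;
        case 'A': return 2;
        case 'G': return 3;
        default:  return -1;
    }
}

// Translate seq[begin..end], 1-based inclusive; end == -1 means "to the end".
// Assumption: an incomplete or ambiguous codon becomes 'X'.
void translateSketch(std::string& pep, const std::string& seq, int begin, int end = -1) {
    pep.clear();
    const int n = static_cast<int>(seq.size());
    if (end < 0 || end > n) end = n;
    if (begin < 1) begin = 1;
    for (int i = begin - 1; i < end; i += 3) {    // i: 0-based first base of a codon
        if (i + 2 >= end) { pep += 'X'; break; }  // fewer than three bases remain
        int a = baseCode(seq[i]), b = baseCode(seq[i + 1]), c = baseCode(seq[i + 2]);
        pep += (a < 0 || b < 0 || c < 0) ? 'X' : kCodonTable[16 * a + 4 * b + c];
    }
}
```

Under these assumptions, translating "ATGGCCTAA" from position 1 yields "MA*", and dropping the final base yields "MAX".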

References ct, and DNA::getCodonTable().

Referenced by findAllORFIndex(), mRNAModel::mRNAModel(), mRNAModel::resetProtein(), and JGIModel::valid().


Generated on Wed Aug 10 11:57:00 2011 for Softwares from Orpara by  doxygen 1.5.6