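Here is a 1151-base-pair segment of the Influenza B virus genome, taken from the Entrez database. This segment contains two genes: one spanning base pairs 4 to 750 and the other spanning base pairs 750 to 1079.
The start and stop codons for the two genes are marked with capital letters below. Note that the two genes overlap by one base: the final A of the first gene's stop codon (TAA, ending at base 750) is also the first base of the second gene's start codon (ATG).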
1 aaaATGtcgc tgtttggaga cacaattgcc tacctgcttt cattgacaga agatggagaa
61 ggcaaagcag aactagcaga aaaattacac tgttggttcg gtgggaaaga atttgaccta
121 gactctgcct tggaatggat aaaaaacaaa agatgcttaa ctgatataca aaaagcacta
181 attggtgcct ctatctgctt tttaaaaccc aaagaccagg aaaggaaaag aagattcatc
241 acagagcctc tatcaggaat gggaacaaca gcaacaaaaa agaaaggcct gattctagct
301 gagagaaaaa tgagaagatg tgtgagcttt catgaagcat ttgaaatagc agaaggccat
361 gaaagctcag cgctactata ttgtctcatg gtcatgtacc tgaatcctgg aaattattca
421 atgcaagtaa aactaggaac gctctgtgct ttgtgcgaga aacaagcatc acattcacac
481 agggctcata gcagagcagc gagatcttca gtgcccggag tgagacgaga aatgcagatg
541 gtctcagcta tgaacacagc aaaaacaatg aatggaatgg gaaaaggaga agacgtccaa
601 aagctggcag aagagctgca aagcaacatt ggagtattga gatctcttgg agcaagtcaa
661 aagaatgggg aaggaattgc aaaggatgta atggaagtgc taaagcagag ctctatggga
721 aattcagctc ttgtgaagaa atatctaTAA TGctcgaacc atttcagatt ctttcaattt
781 gttcttttat cttatcagct ctccatttca tggcttggac aatagggcat ttgaatcaaa
841 taaaaagagg agtaaacatg aaaatacgaa taaaaggtcc aaacaaagag acaataaaca
901 gagaggtatc aattttgaga cacagttacc aaaaagaaat ccaggccaaa gaaacaatga
961 aggaagtact ctctgacaac atggaggtat tgagtgacca catagtgatt gaggggcttt
1021 ctgccgaaga gataataaaa atgggtgaaa cagttttgga gatagaagaa ttgcatTAAa
1081 ttcaattttt tactgtattt cttattatgc atttaagcaa attgtaatca atgtcagcaa
1141 ataaactgga a
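The coordinates quoted for the two genes can be sanity-checked with a little arithmetic: a protein-coding region should begin with ATG, end with a stop codon, and have a length divisible by three. The following is a minimal sketch of such a check. The function name `check_orf` and the toy sequence are illustrative assumptions, not part of the original material, and the toy string is not the Influenza B segment.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def check_orf(seq, start, end):
    """Check a candidate gene given 1-based, inclusive coordinates.

    Hypothetical helper for illustration only.
    """
    region = seq[start - 1:end].upper()
    # Split into complete codons, ignoring any trailing partial codon.
    usable = len(region) - len(region) % 3
    codons = [region[i:i + 3] for i in range(0, usable, 3)]
    return {
        "length": len(region),
        "in_frame": len(region) % 3 == 0,
        "starts_with_atg": region.startswith("ATG"),
        "ends_with_stop": region[-3:] in STOP_CODONS,
        # Stop codons before the final codon would truncate the protein.
        "internal_stops": sum(c in STOP_CODONS for c in codons[:-1]),
    }

# Toy sequence (made up): 3 leading bases, ATG, two codons, TAA, 2 trailing bases.
toy = "aaaATGtcgctgTAAtt"
print(check_orf(toy, 4, 15))

# The same length arithmetic applied to the coordinates quoted in the text.
gene_lengths = {}
for start, end in [(4, 750), (750, 1079)]:
    length = end - start + 1
    gene_lengths[(start, end)] = (length, length // 3, length % 3)
    print(start, end, "length:", length, "codons:", length // 3,
          "remainder:", length % 3)
```

Both quoted ranges have lengths divisible by three (747 and 330 bases, that is, 249 and 110 codons counting the stop codon), which is consistent with their being complete reading frames.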
To get a feel for the process of finding ORFs, go to Artemis at the Sanger Institute. If you have Java Web Start, you can launch Artemis by clicking here. Download the clbot.fsa FASTA file to your local machine. Follow the excellent tutorial here to find ORFs in a simple prokaryotic genome.
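To make the idea of ORF finding concrete, here is a minimal sketch of a forward-strand scan: in each of the three reading frames, it records the first ATG and extends to the next in-frame stop codon. This is a simplified illustration, not the procedure Artemis uses; the function name `find_orfs`, the `min_codons` parameter, and the toy sequence are assumptions made for the example. A practical tool would also scan the reverse complement and apply a much larger minimum length.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Return (start, end) pairs, 1-based inclusive, for forward-strand ORFs.

    Hypothetical helper: the end coordinate includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # 0-based index of the first ATG since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start + 1, i + 3))
                start = None
    return sorted(orfs)

# Toy sequence (made up) with two ORFs that share one base, imitating the
# TAATG overlap in the Influenza B segment above.
toy = "aaaATGtcgctgTAATGctcgaaTAAa"
print(find_orfs(toy))  # [(4, 15), (15, 26)]
```

Because the two toy ORFs lie in different reading frames, a scan restricted to a single frame would miss one of them; this is why all three frames are examined.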
We now turn to ab initio approaches to gene finding in eukaryotic genomes. These approaches rely on signals, which are specific DNA sequences that indicate the presence of a nearby gene, or on content, which refers to statistical properties of the gene itself. Examples of signals include promoter and regulatory sequences that precede a gene, polyadenylation signals near the end of a gene (which mark where the polyA tail is added), and CpG islands (stretches of DNA rich in G, C, and CpG dinucleotides) that often occur near the start of a gene.
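As an illustration of how a signal such as a CpG island can be quantified, the sketch below computes two statistics for a stretch of DNA: its GC content and the ratio of observed to expected CpG dinucleotides, where the expected count assumes C and G occur independently. The function names and toy sequences are assumptions for this example; real CpG-island detectors apply such measures over sliding windows of a few hundred bases with thresholds chosen by the tool.

```python
def gc_content(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def cpg_obs_exp(seq):
    """Observed CpG count divided by the count expected from C and G frequencies."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    # Expected number of CG dinucleotides is roughly (c * g) / length.
    return seq.count("CG") * len(seq) / (c * g)

# Two made-up sequences with identical GC content but different CpG content.
for toy in ("CGCGAATT", "GGCCAATT"):
    print(toy, gc_content(toy), cpg_obs_exp(toy))
```

The two toy sequences have the same GC content, yet only the first contains CpG dinucleotides, which shows why the two statistics are usually considered together rather than GC content alone.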
Eukaryotic genes are considerably more complicated than their prokaryotic counterparts. First, a gene is no longer a contiguous stretch of bases between a start codon and a stop codon. It is split into coding regions called exons, separated by intervening non-coding sections called introns. Splicing mechanisms in the eukaryotic cell remove the introns and stitch the exons together before translation. Alternative splicing allows the exons to be put together in a variety of ways, so a single gene can code for a variety of proteins. The one-gene, one-protein mapping that is characteristic of prokaryotes is lost. Second, most DNA in many eukaryotes is non-coding; in the human genome, for example, only about 1 to 3% codes for proteins. Finding the exons in a sea of introns and intergenic material is very difficult. In addition, many of the regulatory signals may be quite far from the start codon. These factors make eukaryotic gene finders much less successful than prokaryotic gene finders such as GLIMMER, which achieve prediction accuracies of about 98%.
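The effect of splicing and alternative splicing can be sketched in a few lines: given a list of exon coordinates, the mature coding sequence is the concatenation of the exons, and leaving out an exon yields a different product from the same gene. The sequence, the exon coordinates, and the helper name `splice` below are all made up for illustration.

```python
def splice(seq, exons):
    """Concatenate exons given as 1-based, inclusive (start, end) pairs."""
    return "".join(seq[start - 1:end] for start, end in exons)

# Toy gene (made up): exons in upper case, introns in lower case.
# Each toy intron starts with "gt" and ends with "ag".
gene = "ATGGCC" + "gtaagtttag" + "GATTCT" + "gtaaccccag" + "AAATAA"
exons = [(1, 6), (17, 22), (33, 38)]

full = splice(gene, exons)                    # all three exons joined
skipped = splice(gene, [exons[0], exons[2]])  # middle exon skipped

print(full)     # ATGGCCGATTCTAAATAA
print(skipped)  # ATGGCCAAATAA
```

Both products begin with ATG and end with TAA, but they encode different peptides. A gene finder has to recover the exon coordinates themselves, which is the hard part; this sketch assumes they are already known.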