
Here is a 1151-base-pair segment of the Influenza B virus taken from the Entrez database. This segment has two genes: one from base pair 4 to 750, and one from base pair 750 to 1079. The two genes overlap by a single base: the last A of the first gene's stop codon (TAA, bases 748 to 750) is also the first base of the second gene's start codon (ATG, bases 750 to 752). The start and stop codons for the two genes are marked with capital letters below.

   1 aaaATGtcgc tgtttggaga cacaattgcc tacctgcttt cattgacaga agatggagaa
  61 ggcaaagcag aactagcaga aaaattacac tgttggttcg gtgggaaaga atttgaccta
 121 gactctgcct tggaatggat aaaaaacaaa agatgcttaa ctgatataca aaaagcacta
 181 attggtgcct ctatctgctt tttaaaaccc aaagaccagg aaaggaaaag aagattcatc
 241 acagagcctc tatcaggaat gggaacaaca gcaacaaaaa agaaaggcct gattctagct
 301 gagagaaaaa tgagaagatg tgtgagcttt catgaagcat ttgaaatagc agaaggccat
 361 gaaagctcag cgctactata ttgtctcatg gtcatgtacc tgaatcctgg aaattattca
 421 atgcaagtaa aactaggaac gctctgtgct ttgtgcgaga aacaagcatc acattcacac
 481 agggctcata gcagagcagc gagatcttca gtgcccggag tgagacgaga aatgcagatg
 541 gtctcagcta tgaacacagc aaaaacaatg aatggaatgg gaaaaggaga agacgtccaa
 601 aagctggcag aagagctgca aagcaacatt ggagtattga gatctcttgg agcaagtcaa
 661 aagaatgggg aaggaattgc aaaggatgta atggaagtgc taaagcagag ctctatggga
 721 aattcagctc ttgtgaagaa atatctaTAA TGctcgaacc atttcagatt ctttcaattt
 781 gttcttttat cttatcagct ctccatttca tggcttggac aatagggcat ttgaatcaaa
 841 taaaaagagg agtaaacatg aaaatacgaa taaaaggtcc aaacaaagag acaataaaca
 901 gagaggtatc aattttgaga cacagttacc aaaaagaaat ccaggccaaa gaaacaatga
 961 aggaagtact ctctgacaac atggaggtat tgagtgacca catagtgatt gaggggcttt
1021 ctgccgaaga gataataaaa atgggtgaaa cagttttgga gatagaagaa ttgcatTAAa
1081 ttcaattttt tactgtattt cttattatgc atttaagcaa attgtaatca atgtcagcaa
1141 ataaactgga a

To get a feel for the process of finding ORFs, try Artemis from the Sanger Institute. If you have Java Web Start installed, you can launch Artemis directly from the Sanger site. Download the clbot.fsa FASTA file to your local machine, then follow the Artemis tutorial to find ORFs in a simple prokaryotic genome.
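For readers who want to see the idea in code before opening Artemis, here is a minimal sketch of a forward-strand ORF scan. It is not how Artemis works internally; the function name `find_orfs`, the `min_codons` parameter, and the toy sequence are assumptions made for illustration. The sketch reads each of the three forward reading frames codon by codon, opens an ORF at the first ATG, and closes it at the next in-frame stop codon.

```python
# Illustrative only: a naive forward-strand ORF scan (hypothetical helper,
# not part of Artemis or any library).
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Return sorted (start, end) pairs, 1-based and inclusive, for every
    ATG...stop stretch found in the three forward reading frames."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # 0-based index of the currently open ATG, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i + 3 - start) // 3  # includes start and stop
                if n_codons >= min_codons:
                    orfs.append((start + 1, i + 3))
                start = None
    return sorted(orfs)

# Toy sequence (made up) that mimics the TAATG overlap in the influenza
# segment: the last A of the first stop codon starts the second ATG.
toy = "aaaATGtcgTAATGctcTAAa"
print(find_orfs(toy))
```

On the toy sequence the two ORFs share one base, just as the two influenza genes share base 750. A fuller scan would also search the reverse complement, giving six reading frames in total.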

Gene finding in eukaryotic genomes

We now turn to ab initio approaches to gene finding in eukaryotic genomes. They rely on signals, which are specific DNA sequences that indicate the presence of a nearby gene, or on content, which refers to statistical properties of the gene itself. Examples of signals include promoter and regulatory sequences that precede a gene, polyadenylation signals at the end of a gene, and CpG islands (stretches of DNA rich in CG dinucleotides, with high GC content) that often occur near the start of a gene.
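As a rough illustration of a signal-style measurement, the sketch below computes two quantities often used when screening for CpG islands: the GC content of a window and the ratio of observed to expected CpG dinucleotides. The function names are hypothetical. Commonly cited screening thresholds are a GC content above 0.5 and an observed/expected ratio above 0.6 over a window of at least 200 bases; these are conventions from the literature, not something this text prescribes.

```python
# Illustrative only: two simple statistics for a DNA window
# (hypothetical helper names).

def gc_content(seq):
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def cpg_obs_exp(seq):
    """Observed CpG count divided by the count expected if C and G
    occurred independently: (#CG * length) / (#C * #G)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)

# Made-up windows: one CpG-rich, one AT-rich.
for window in ("CGCGCGCG", "ATATATAT"):
    print(window, gc_content(window), cpg_obs_exp(window))
```

A real scan would slide a fixed-length window along the genome and report the regions where both statistics stay above the chosen thresholds.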

The structure of a eukaryotic gene

The structure of a eukaryotic gene (source: unknown).

Eukaryotic genes are considerably more complicated than their prokaryotic counterparts. First, a gene is no longer a contiguous stretch of bases between a start codon and a stop codon. It is split into coding regions called exons, separated by intervening non-coding sections called introns. Splicing mechanisms in the eukaryotic cell remove the introns and stitch the exons together before translation. Alternative splicing allows the exons to be combined in a variety of ways, so a single gene can code for several proteins. The one-gene, one-protein mapping that is characteristic of prokaryotes is lost. Second, most DNA in eukaryotes is non-coding; only about 3% codes for proteins. Finding the exons in a sea of introns and intergenic material is very difficult. In addition, many of the regulatory signals may be quite far from the start codon. These factors make eukaryotic gene finders much less successful than prokaryotic gene finders such as GLIMMER, which achieve prediction accuracies of 98%.
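The effect of splicing and alternative splicing can be sketched in a few lines. The toy gene, the exon coordinates, and the `splice` helper below are all made up for illustration; real splicing is carried out by cellular machinery and is not a simple string operation.

```python
# Illustrative only: a made-up three-exon gene. Exons are uppercase,
# introns are lowercase.
gene = "ATGGCC" + "gtaagtag" + "AAATTT" + "gtccccag" + "GGGTAA"

# Hypothetical exon coordinates, 1-based and inclusive.
exons = [(1, 6), (15, 20), (29, 34)]

def splice(sequence, exon_coords):
    """Join the listed exons in order, dropping everything in between."""
    return "".join(sequence[start - 1:end] for start, end in exon_coords)

# Using all three exons gives one transcript ...
full = splice(gene, exons)
# ... while skipping the middle exon gives a shorter one from the same gene.
skipped = splice(gene, [exons[0], exons[2]])

print(full)     # ATGGCCAAATTTGGGTAA
print(skipped)  # ATGGCCGGGTAA
```

Both transcripts begin with ATG and end with TAA, yet they would be translated into different proteins, which is why a eukaryotic gene finder must predict exon boundaries and not just start and stop codons.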

The exon-intron structure of the human PSA gene

The exon-intron structure of the human PSA gene (source: unknown). The magenta sections are exons embedded in a sea of black introns.

Source:  OpenStax, Statistical machine learning for computational biology. OpenStax CNX. Oct 14, 2007 Download for free at http://cnx.org/content/col10455/1.2