What is computational genefinding? Simply put, it is the development of computational procedures to locate protein-coding regions in unprocessed genomic DNA sequence data. In reality, however, pinpointing the mere location of a gene is part of a much larger challenge. The eukaryotic gene is a complicated and highly studied beast composed of a multitude of small coding regions and regulatory elements hidden amidst tens of thousands of base pairs of intronic and non-signal DNA. In order to accurately predict gene locations we must first understand how the different functional components interact to create the dynamic and complex phenomenon we have come to understand as 'a gene'.
Thus genefinding is a little bit of a misnomer: in order to find genes we must first understand the content and structure of the signal the genes present to the cell's genetic machinery, and in doing this we must answer much broader questions than the seemingly facile question, "Where are the genes?" The goal of genefinding, then, is not simple gene prediction but accurate modeling of the signal genes present to the cell. Furthermore, because such information does not exist in a vacuum separate from its interpretation, the ability to model the genetic signal also furthers our understanding of how that signal is deciphered and of the inner workings of the cell itself.
There are two basic approaches to gene finding: ab-initio and comparative. Ab-initio methods use statistical properties of the given genome, while comparative methods use annotations from previously analyzed genomes as an additional input. We will begin our discussion of gene finding with ab-initio methods as applied to simpler prokaryotic genomes. Examples of such genomes include that of the bacterium Haemophilus influenzae (H. influenzae), which, despite its name, is not the influenza virus. Over 70% of the H. influenzae genome codes for proteins. Genes in prokaryota are contiguous stretches of base pairs with no intronic breaks. Untranslated regions (UTRs) flank both ends of a gene: the 5' (5-prime) end and the 3' (3-prime) end. Genes are directional: they are read from the 5' end to the 3' end. There are genes on both strands of the DNA double helix. Each gene starts with the three-letter codon ATG, which specifies the amino acid methionine. ATG is called the start codon. The end of a gene is signalled by one of three stop codons (TAA, TAG, TGA). The start codon signals the ribosomal machinery to start translating the bases, three at a time, into amino acids until a stop codon is reached. Gene finding in prokaryota reduces to the problem of finding stretches of the genome that begin with a start codon and end with a stop codon, with no intervening stop codons. Such a stretch is called an open reading frame or ORF.
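Because genes lie on both strands and codons are read three bases at a time, any stretch of DNA has six possible reading frames: three offsets on the given strand and three on its reverse complement. The following Python sketch illustrates this. The function names (`reverse_complement`, `codons`, `six_frames`) are hypothetical helpers introduced here for illustration, and the code assumes the sequence contains only the letters A, C, G, and T.

```python
# Illustrative helpers; names are hypothetical, not taken from any library.
START_CODON = "ATG"
STOP_CODONS = {"TAA", "TAG", "TGA"}

# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, written in its own 5' to 3' direction."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def codons(seq, frame=0):
    """Split seq into complete triplets, starting at offset frame (0, 1, or 2)."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def six_frames(seq):
    """Yield (strand, frame, codon_list) for all six reading frames."""
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            yield strand, frame, codons(s, frame)
```

For instance, `codons("ATGAAATAA")` splits the sequence into a start codon, one internal codon, and a stop codon, while `codons("ATGAAATAA", 1)` reads the same bases in a shifted frame and yields different triplets.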
Given a sequence from {A,C,G,T}*, an open reading frame (ORF) is any subsequence that starts with the codon ATG and ends with a stop codon (TAA, TAG, TGA), with no stop codons in between when read in consecutive triplets. ORF-finding algorithms are based on the following simple idea. Since coding regions are terminated by stop codons, one needs to look for long stretches of bases without a stop codon. Once a stop codon is found, we work backward to find the start codon corresponding to the gene. Why do we look for long stretches without stop codons? If nucleotide bases were drawn uniformly at random, three of the 64 possible codons would be stop codons, so a stop codon would be expected once every 64/3 (about 21) codons, or about 64 base pairs. By selecting an appropriate length threshold t (typically greater than 210 bases, or 70 amino acids), we reduce the likelihood of mistaking a random stop-free stretch for an actual coding region. Modifications to this basic algorithm to handle very short genes and overlapping genes have been developed. The most successful method for finding coding regions in prokaryotic genomes is one based on interpolated Markov models, embodied in the program GLIMMER. The program is publicly available, and a 2007 Bioinformatics paper details how to use it.
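The basic ORF scan described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method used by GLIMMER: the names `find_orfs` and `prob_no_stop` are hypothetical, only one strand is scanned, and instead of literally walking backward from each stop codon the code remembers the earliest ATG seen since the previous stop, which identifies the same start position. The default `min_len` of 210 bases is taken from the threshold mentioned in the text.

```python
START_CODON = "ATG"
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_len=210):
    """Return (start, end, frame) tuples for ORFs on the given strand.

    Positions are 0-based and end-exclusive; each span includes its stop
    codon. min_len is the minimum ORF length in bases (an assumed default).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        first_atg = None  # earliest ATG since the last stop codon in this frame
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in STOP_CODONS:
                # A stop closes the current stretch; keep it if long enough.
                if first_atg is not None and (i + 3 - first_atg) >= min_len:
                    orfs.append((first_atg, i + 3, frame))
                first_atg = None
            elif codon == START_CODON and first_atg is None:
                first_atg = i
    return orfs


def prob_no_stop(n_codons):
    """Chance that n_codons uniformly random codons contain no stop codon."""
    return (61 / 64) ** n_codons
```

Under the uniform-random model, `prob_no_stop(70)` works out to roughly 3.5 percent, which is why a threshold of about 70 codons filters out most chance stop-free stretches. To cover both strands, the same scan would be run again on the reverse complement of the sequence.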