Our task is to find coding regions in eukaryotic genomes. We have already studied the complexity of the structure of eukaryotic genes. If you need a refresher, check out this self-paced tutorial from the University of Glasgow.
One fundamental approach to finding genes is to detect functional sites in genomic DNA. Fixed-length sites like splice sites, start and stop codons, polyA sites, ribosomal binding sites, and transcription factor binding sites are called signals, and algorithms that detect them are called signal sensors. Variable-length regions like exons and introns in eukaryotic DNA are recognized by another family of methods called content sensors.
Shown above is a sequence motif of size 6. The letters in each position are drawn in proportion to the probability of having that letter in that position. This probability information is summarized in the weight matrix shown below. Weight matrices are the simplest form of signal sensors.
Position | A | C | G | T |
---|---|---|---|---|
1 | 0.028 | 0.034 | 0.026 | 0.912 |
2 | 0.805 | 0.031 | 0.123 | 0.041 |
3 | 0.046 | 0.158 | 0.022 | 0.774 |
4 | 0.669 | 0.019 | 0.253 | 0.059 |
5 | 0.024 | 0.044 | 0.028 | 0.904 |
6 | 0.962 | 0.012 | 0.014 | 0.012 |
P(TATATA) = 0.912 * 0.805 * 0.774 * 0.669 * 0.904 * 0.962 = 0.33
P(ATATAT) = 0.028 * 0.041 * 0.046 * 0.059 * 0.024 * 0.012 = 9.0 * 10^(-10)
We can see that, with respect to the sequence model described by the weight matrix, the string TATATA is overwhelmingly more likely than the string ATATAT. The weight matrix model assumes that the bases at each position are independent of each other. We will study more sophisticated models, called Markov models, which take dependencies between bases at different positions into account. The advantage of weight matrix models is their simplicity, which allows them to be estimated with very little data. The disadvantage is that the models are quite rigid and do not accommodate the kind of variability seen in real biological sequences.
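The two products above can be reproduced with a few lines of Python. This is only a minimal sketch of a weight matrix used as a signal sensor: the matrix values come from the table, while the names `WEIGHT_MATRIX`, `site_probability`, and `best_window` are hypothetical and chosen purely for illustration. Because the model treats positions as independent, the probability of a site is simply the product of one table entry per position.

```python
from math import prod

# Weight matrix from the table: one dict of base probabilities per position.
WEIGHT_MATRIX = [
    {"A": 0.028, "C": 0.034, "G": 0.026, "T": 0.912},
    {"A": 0.805, "C": 0.031, "G": 0.123, "T": 0.041},
    {"A": 0.046, "C": 0.158, "G": 0.022, "T": 0.774},
    {"A": 0.669, "C": 0.019, "G": 0.253, "T": 0.059},
    {"A": 0.024, "C": 0.044, "G": 0.028, "T": 0.904},
    {"A": 0.962, "C": 0.012, "G": 0.014, "T": 0.012},
]


def site_probability(site, matrix=WEIGHT_MATRIX):
    """Probability of `site` under the weight matrix (positions independent)."""
    if len(site) != len(matrix):
        raise ValueError("site length must equal the number of matrix positions")
    # Multiply the probability of the observed base at each position.
    return prod(column[base] for column, base in zip(matrix, site.upper()))


def best_window(sequence, matrix=WEIGHT_MATRIX):
    """Slide the matrix along `sequence`; return (start, probability) of the best hit."""
    width = len(matrix)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the motif")
    starts = range(len(sequence) - width + 1)
    scored = [(i, site_probability(sequence[i:i + width], matrix)) for i in starts]
    return max(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    print(site_probability("TATATA"))
    print(site_probability("ATATAT"))
    print(best_window("GGTATATAGG"))
```

Sliding the matrix along a longer sequence, as `best_window` does, is one simple way to turn the model into a detector. In practice the product is usually replaced by a sum of logarithms to avoid numerical underflow on longer motifs.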
Content sensors include detectors of CpG islands, an example we will consider in detail later in this module. Exon and intron detectors are some of the most widely studied in the literature. The GRAIL system detects exons, polyA sites, and CpG islands. You can submit a DNA sequence on the form linked above to try out a content sensor program.
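To give a rough feel for what a content sensor computes, the sketch below flags windows of a sequence that look CpG-rich. It is not the method used by GRAIL or the model developed later in this module. It assumes a simple heuristic: a window is a candidate if its G+C fraction and its observed-to-expected CpG ratio both exceed a threshold. The function names and the default values (`width=200`, `min_gc=0.5`, `min_ratio=0.6`) are assumptions, chosen to match commonly quoted rule-of-thumb values.

```python
def gc_content(window):
    """Fraction of bases in `window` that are G or C."""
    window = window.upper()
    return (window.count("G") + window.count("C")) / len(window)


def cpg_observed_expected(window):
    """Observed CpG count divided by the count expected from base composition."""
    window = window.upper()
    c = window.count("C")
    g = window.count("G")
    if c == 0 or g == 0:
        return 0.0
    # Expected number of CG dinucleotides if C and G were placed independently.
    expected = c * g / len(window)
    return window.count("CG") / expected


def candidate_cpg_windows(sequence, width=200, step=1, min_gc=0.5, min_ratio=0.6):
    """Start positions of windows passing both (assumed) thresholds."""
    hits = []
    for start in range(0, len(sequence) - width + 1, step):
        window = sequence[start:start + width]
        if gc_content(window) > min_gc and cpg_observed_expected(window) > min_ratio:
            hits.append(start)
    return hits


if __name__ == "__main__":
    # Toy sequence: an AT-rich flank, a CG-rich core, and another AT-rich flank.
    toy = "AT" * 10 + "CG" * 10 + "AT" * 10
    print(candidate_cpg_windows(toy, width=20, step=20))
```

Unlike the weight matrix, this sensor does not look for a fixed-length pattern. It summarizes the composition of a region, which is why it can be applied to regions of variable length.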
Signal and content sensors alone cannot solve the gene-finding problem. The statistical signals they are trying to recognize are too weak, and there are dependencies between signals and content that they cannot capture. Since the late nineties, attempts have been made to develop probabilistic systems that combine signal and content sensors to identify complete gene structures. One of the best known of these systems is Genscan, developed by Chris Burge and his advisor Samuel Karlin at Stanford University in 1997. Genscan is based on hidden Markov models.