
Ab initio methods use only the information embedded in the genomic sequence to predict gene structure in eukaryotic genomes. Gene finding is usually posed as a probabilistic inference problem: find the structure G, representing gene boundaries and internal gene structure, that maximizes the probability of G given the genomic sequence. Hidden Markov models are the predominant generative method for this problem. Ab initio methods allow the prediction of novel genes, that is, genes unlike any that are known. However, ab initio techniques are generally not effective at detecting alternatively spliced forms or interleaved or overlapping genes. They also have difficulty identifying exon/intron boundaries accurately. Almost all ab initio gene finders generate large numbers of false positive predictions, which arise from learning overfitted models on small training sets. With these caveats in mind, we embark on the study of hidden Markov models for finding genes in complex eukaryotic genomes.
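In symbols, the inference problem can be sketched as follows. The symbol S for the genomic sequence and G* for the best-scoring structure are notation introduced here for illustration; they are not used elsewhere in the text.

```latex
G^{*} = \arg\max_{G} P(G \mid S)
      = \arg\max_{G} \frac{P(S, G)}{P(S)}
      = \arg\max_{G} P(S, G)
```

The last step holds because P(S) does not depend on G, which is why a generative model of the joint probability, such as a hidden Markov model, is sufficient for this inference.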

A simple example: finding CpG islands

Normally a C (cytosine) followed immediately by a G (guanine), called a CpG, is rare in eukaryotic DNA because the Cs in such an arrangement tend to be methylated [Wikipedia]. This methylation helps distinguish the newly synthesized DNA strand from the parent strand, which aids in the final stages of DNA proofreading after duplication. However, over evolutionary time methylated Cs tend to turn into Ts because of spontaneous deamination. The result is that CpGs are relatively rare unless there is selective pressure to preserve them. In bulk human DNA, CpG dinucleotides occur about five times less frequently than expected (Bird 1980, Jones et al 1992).

CpG islands are thus unmethylated regions of the genome that are associated with the 5’ ends of most housekeeping genes and many regulated genes. The absence of methylation slows CpG decay, and so CpG islands can be detected in DNA sequence as regions in which CpG pairs occur at close to the expected frequency. The fact that CpG islands can be detected in this way indicates that the corresponding germline DNA has been substantially hypomethylated for an extended period of time, and in fact about 80% of CpG islands are common to human and mouse (Antequera and Bird 1993).

About 56% of human genes and 47% of mouse genes are associated with CpG islands (Antequera and Bird 1993). Often CpG islands overlap the promoter and extend about 1000 base pairs downstream into the transcription unit. Identifying potential CpG islands during sequence analysis helps to define the extreme 5’ ends of genes, something that is notoriously difficult with cDNA-based approaches.
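To make "CpG pairs occur at close to the expected frequency" concrete, here is a minimal sketch that computes a commonly used observed-to-expected CpG ratio. It assumes the expected CpG count in a window of length N is (number of C × number of G) / N. The function names, the window size, and the step size are illustrative choices, not values taken from the text.

```python
def cpg_obs_exp(seq):
    """Observed/expected CpG ratio of a DNA string.

    Assumption: expected CpG count = (#C * #G) / length, i.e. C and G
    are treated as if they were placed independently along the sequence.
    """
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    if n < 2 or c == 0 or g == 0:
        return 0.0
    # "CG" cannot overlap with itself, so str.count finds every occurrence.
    observed = seq.count("CG")
    expected = c * g / n
    return observed / expected


def sliding_cpg_ratios(seq, window=200, step=50):
    """Return (start, ratio) pairs for windows slid along seq.

    window and step are hypothetical defaults chosen for illustration.
    A sequence shorter than the window is scored as a single window.
    """
    last_start = max(len(seq) - window, 0)
    return [
        (start, cpg_obs_exp(seq[start:start + window]))
        for start in range(0, last_start + 1, step)
    ]
```

Windows with a ratio near 1 are candidates for CpG islands, while bulk DNA is expected to score well below 1. This is only a counting heuristic; it is not the probabilistic recognizer developed later in the text.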

Figure: CpG islands in the human genome (Chromosome 22, Entrez browser)

We follow the presentation in the excellent text by Durbin, Eddy, Krogh, and Mitchison, and pose two problems in this context.

  • Given a short DNA sequence, does it come from a CpG island or not?
  • Given a long DNA sequence, identify the CpG islands in that sequence.

How do we model the problem of recognizing CpG islands? If you look at examples of CpG islands in the human genome (say, on Chromosome 22, available from GenBank), you will see that we are unlikely to come up with a deterministic set of rules for classifying sequences as parts of CpG islands or not. We are going to build a probabilistic recognizer: one that takes a sequence and returns a probability that it is part of a CpG island. We will delve into the theory of Markov chains to set up such a model. But before that, here is a brief interlude.

Steve Skiena of SUNY Stony Brook has a very interesting viewpoint on the cultural differences between computer scientists and biologists. It is a bit of a caricature, but there is a lot of truth in it. Here is his list of contrasts.

  • Almost nothing is ever completely true or false in biology. Everything is either true or false in computer science.
  • Biologists strive to understand the complicated messy natural world. Computer scientists seek to build their own clean and organized virtual worlds.
  • Biologists are data driven. Computer scientists are more algorithm driven. One consequence is that CS web pages have fancier graphics, while biology web pages have more content.
  • Biologists are obsessed with being the first to discover something. Computer scientists are obsessed with being the first to invent or prove something.
  • Biologists are comfortable with the idea that all data has errors. Computer scientists are not.
  • Computer scientists get high-paid jobs after graduation. Biologists have to complete one or more post-docs before getting a permanent job.
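Returning to the first problem posed above (does a short sequence come from a CpG island or not?), one way a Markov chain recognizer could look is sketched below. It assumes two first-order Markov chains, one estimated from island sequences and one from background sequences, and scores a query by the log-odds ratio of the two. The training strings are made-up toy data, the pseudocount is an arbitrary choice, and all function names are hypothetical. The sketch ignores the probability of the first base and assumes uppercase sequences over A, C, G, T only.

```python
import math

ALPHABET = "ACGT"


def train_transitions(seqs, pseudocount=1.0):
    """Estimate first-order transition probabilities P(b | a) from sequences.

    A pseudocount is added to every transition so that no probability is
    zero (an assumption made to keep the log-odds finite).
    """
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for s in seqs:
        for a, b in zip(s, s[1:]):
            counts[a][b] += 1
    probs = {}
    for a in ALPHABET:
        total = sum(counts[a].values())
        probs[a] = {b: counts[a][b] / total for b in ALPHABET}
    return probs


def log_odds(seq, plus, minus):
    """Log-odds score (in bits) of seq under the island vs. background chain.

    Positive scores favor the island model, negative scores the background.
    """
    return sum(
        math.log2(plus[a][b] / minus[a][b]) for a, b in zip(seq, seq[1:])
    )


# Toy training data, invented for illustration only.
island_seqs = ["CGCGGCGCCGCG", "GCGCGCCGGCGC"]
background_seqs = ["ATTATAAATTCA", "TTGACATTAATG"]

plus_model = train_transitions(island_seqs)
minus_model = train_transitions(background_seqs)

island_score = log_odds("CGCGCG", plus_model, minus_model)
background_score = log_odds("ATTATA", plus_model, minus_model)
```

With these toy models, a CG-rich query scores positive and an AT-rich query scores negative. A real recognizer would estimate the two transition tables from annotated genomic sequence rather than from a handful of short strings.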

Source:  OpenStax, Statistical machine learning for computational biology. OpenStax CNX. Oct 14, 2007 Download for free at http://cnx.org/content/col10455/1.2