Overview of some of the whole-genome sequences available: Arabidopsis, C. elegans, Drosophila, H. sapiens, and M. musculus.

Introduction to the available data sets

The amount of whole-genome data available is mountainous and growing at a wholly un-geologic rate. Currently, over 1000 whole-genome data sets are either completed or in progress (a whole genome is considered 'finished' when it contains less than one error per 10,000 base pairs). This amazing (and daunting) source of information includes genomes from bacteria, archaea, and eukaryotes, as well as viruses and organelles. In the past four years alone, entire genomic sequences from C. elegans, D. melanogaster, H. sapiens, F. rubripes, A. gambiae, M. musculus, C. briggsae, R. norvegicus, A. thaliana, and C. intestinalis have been, or are nearly, completed (Birney et al. 2003). This data set represents a sizeable portion of the commonly studied metazoan eukaryotes. Another current example of the high-throughput power of modern sequencing facilities is that a preliminary genomic sequence of SARS was available within weeks of its isolation. As automated high-throughput genome sequencing techniques continue to progress, more data sets of higher quality will become available. Eventually we may even progress beyond a species-based view of the genome to whole-genome sequencing of individual organisms. The information is quickly outstripping our ability to analyze it; we need to develop sophisticated and sensitive analysis tools to apply to this new wealth of information.

Arabidopsis thaliana

Arabidopsis is the model for plant genetics research. It is a flowering plant and a member of the mustard family; its advantages as a research model include a short generation time, small size, a large number of offspring, and a relatively small nuclear genome. The genome was sequenced in 2000 by The Arabidopsis Genome Initiative (Nature, 14 Dec. 2000). The genome has five chromosomes and a total size of 125 Mb. In its original analysis, The Arabidopsis Genome Initiative predicted a total of 25,498 genes; this is considerably more than in either C. elegans (about 19,000) or Drosophila (13,601) and is in the range of the estimated number of genes for H. sapiens. The average gene length is around 2000 bp, with the average exon being 250 bp in length (~5 exons per gene). The average intron is 180 bp in length.
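The averages quoted above can be checked against one another with a little arithmetic. The following Python sketch is only a back-of-the-envelope illustration: the function name `summarize` and the dictionary keys are made up for this example, and it assumes that a gene with n exons has n - 1 introns and that the rounded averages apply uniformly to every gene.

```python
# Figures quoted in the text (Arabidopsis Genome Initiative, 2000).
GENOME_SIZE_BP = 125_000_000
PREDICTED_GENES = 25_498
AVG_GENE_BP = 2_000
AVG_EXON_BP = 250
EXONS_PER_GENE = 5
AVG_INTRON_BP = 180


def summarize():
    """Derive rough per-gene and whole-genome figures from the quoted averages."""
    exon_bp_per_gene = AVG_EXON_BP * EXONS_PER_GENE
    # Assumption: n exons are separated by n - 1 introns.
    intron_bp_per_gene = AVG_INTRON_BP * (EXONS_PER_GENE - 1)
    return {
        "exon_bp_per_gene": exon_bp_per_gene,
        "modeled_gene_bp": exon_bp_per_gene + intron_bp_per_gene,
        # Average spacing of genes along the genome.
        "bp_per_gene": GENOME_SIZE_BP / PREDICTED_GENES,
        # Share of the genome covered by exons and by whole genes.
        "exonic_fraction": PREDICTED_GENES * exon_bp_per_gene / GENOME_SIZE_BP,
        "genic_fraction": PREDICTED_GENES * AVG_GENE_BP / GENOME_SIZE_BP,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:,.3f}")
```

Under these rounded averages the modeled gene comes to 1,970 bp, close to the quoted figure of around 2000 bp, and exons account for roughly a quarter of the genome. Because the inputs are averages, these numbers are estimates rather than measurements.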

Caenorhabditis elegans

C. elegans

C. elegans, a nematode worm, was the first multicellular organism to be completely sequenced and the second eukaryote, after yeast. The genome was sequenced by The C. elegans Sequencing Consortium in 1998 (Science, 11 Dec. 1998). Before C. elegans, the only genomes to have been sequenced were those of some viruses, bacteria, and a yeast. The 97 Mb sequence contains 19,099 predicted protein-coding genes (GENEFINDER was used to predict genes). The genome has five autosomes plus an X chromosome. Each gene has an average of 5 introns. Predicted exons account for 27% of the genome (much higher than the ~5% in human), and predicted introns account for 26%. GC content is remarkably constant across all of the chromosomes (36%). Relative to higher-order metazoan eukaryotes, especially vertebrates, C. elegans presents a clean genome with a low level of repeat sequences and other low-complexity regions (although they definitely do exist, at ~6%).
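A statement such as "GC content is constant across the chromosomes" can be checked by computing the GC fraction in consecutive windows along a sequence. The Python sketch below is one simple way to do that; the function names `gc_content` and `windowed_gc` are hypothetical, the toy sequence is made up rather than taken from C. elegans, and ignoring characters other than A, C, G, and T (such as N) is an assumption of this example.

```python
def gc_content(seq):
    """Return the fraction of G and C among the A/C/G/T bases in seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    # Assumption: ambiguous bases such as N are left out of the denominator.
    return gc / acgt if acgt else 0.0


def windowed_gc(seq, window):
    """GC fraction for consecutive, non-overlapping windows of `window` bases.

    The final window may be shorter than `window` if the length of seq
    is not a multiple of it.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    return [gc_content(seq[i:i + window]) for i in range(0, len(seq), window)]


if __name__ == "__main__":
    toy = "ATGCATTTAAGCGTATATTA" * 3  # made-up sequence, not real genome data
    print(f"overall GC: {gc_content(toy):.2f}")
    print("per-window GC:", windowed_gc(toy, 20))
```

For a real chromosome the sequence would be read from a file and the window would be much larger; a flat profile of per-window values would correspond to the constancy described above.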

Source:  OpenStax, Genefinding. OpenStax CNX. Jun 17, 2003 Download for free at http://cnx.org/content/col10205/1.1