Advances in sequencing technologies make it possible to produce vast genomic data sets rapidly and at low cost. In particular, the recently developed, ultra-fast 454 pyrosequencing has dramatically reduced the cost and time required per sequenced base pair. Furthermore, novel methods have been developed that enable sequencing of the 99 percent of microbes that are difficult to access with conventional, culture-dependent approaches. These culture-independent methods have launched the exciting field of metagenomics: the study of the collective genomes (metagenomes) of free-living microbial communities.
In light of the immense data sets produced, sequence analysis remains an ongoing challenge in computational biology. Additionally, new demands arise from the short length of the sequence reads produced by ultra-fast sequencing techniques. In whole-genome research, accurate methods are required for identifying and functionally characterizing the gene content of organisms, reducing the manual effort required while still producing high-quality annotations. Metagenomes, in turn, are now routinely sequenced, and an increasing number of metagenomic projects is expected in the near future (Pennisi, 2007), yet their computational analysis is still in its infancy.
In this thesis, state-of-the-art machine learning techniques as well as algorithmic and statistical methods are employed for the high-throughput analysis and characterization of large genomic data sets. First, the gene finding software GISMO was developed, which combines the search for protein domains using profile hidden Markov models (pHMMs) with a sequence composition-based classification using a Support Vector Machine. This combined strategy is able to identify almost the complete gene content of prokaryotic genomes in a fully automated manner. GISMO has already been employed extensively in the international effort to "Annotate a Thousand Genomes" as well as in various genome annotation and re-annotation projects.
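To make the idea of a sequence composition-based classification more concrete, the following minimal sketch computes one possible composition feature for an open reading frame (ORF): its relative codon frequencies. The choice of codon usage as the feature, and the names `CODONS` and `codon_frequencies`, are assumptions made for illustration only; the text above does not specify which composition features are used. A vector of this kind could serve as input to a Support Vector Machine.

```python
from itertools import product

# All 64 codons in a fixed order, so every feature vector has the same layout.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def codon_frequencies(orf):
    """Return the relative in-frame codon frequencies of an ORF.

    Illustrative assumption: codon usage stands in for "sequence
    composition"; the actual feature set is not given in the text.
    """
    orf = orf.upper()
    counts = dict.fromkeys(CODONS, 0)
    total = 0
    # Step through the sequence codon by codon, starting at position 0.
    for i in range(0, len(orf) - 2, 3):
        codon = orf[i:i + 3]
        if codon in counts:  # skip codons with ambiguous bases such as N
            counts[codon] += 1
            total += 1
    if total == 0:
        return [0.0] * len(CODONS)
    return [counts[c] / total for c in CODONS]


# Hypothetical toy ORF with the codons ATG, AAA, AAA, TAA.
features = codon_frequencies("ATGAAAAAATAA")
```

Normalizing by the number of codons keeps the feature vectors of ORFs of different lengths comparable.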
Furthermore, a novel gene finding algorithm for metagenomic data sets was developed. It is robust to most problems encountered when predicting genes in metagenomes, including short sequence length and low sequence quality. The algorithm thereby makes it possible to search for novel, unknown genes carried by organisms that cannot be sequenced using conventional, culture-dependent techniques.
Finally, methods were devised for characterizing short-read metagenomes obtained by pyrosequencing. Following the pHMM-based identification of gene fragments, the fragments are categorized into functional groups. Additionally, the source organisms (taxonomic origins) of the gene fragments are predicted. The resulting genetic and taxonomic profiles can in turn be used to reveal important trends in the gene content, metabolism, and species composition of the underlying microbial communities.
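As a rough illustration of how per-fragment predictions might be summarized into genetic and taxonomic profiles, the sketch below converts a list of labels into relative abundances. The function name `build_profile` and the example labels are hypothetical, and the simple counting scheme is an assumption rather than the procedure used in the thesis.

```python
from collections import Counter


def build_profile(labels):
    """Turn per-fragment labels into relative abundances (fractions)."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


# Hypothetical predictions: (functional group, taxonomic origin) per fragment.
predictions = [
    ("transport", "Proteobacteria"),
    ("transport", "Cyanobacteria"),
    ("photosynthesis", "Cyanobacteria"),
    ("transport", "Proteobacteria"),
]

functional_profile = build_profile(func for func, _ in predictions)
taxonomic_profile = build_profile(taxon for _, taxon in predictions)
```

Profiles expressed as fractions can be compared between metagenomes of different sizes, which is one simple way trends in gene content or species composition might be examined.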