TY - THES
AB - The main goal of this dissertation is to develop a classifier for assigning environmental genomic fragments to the closest known source organism. This has been achieved by developing a novel method for the TAxonomic COmposition Analysis (TACOA) of environmental genomic fragments using a kernelized nearest neighbor approach. A combination of machine learning techniques has been employed to realize a classifier that exploits the wealth of knowledge deposited in public databases. The classifier uses oligonucleotide frequencies as features, which carry the so-called genomic signature. A key advantage of genomic signatures is that they enable sequence comparison without alignment. A central assumption of the genomic signature is that oligonucleotide compositions of DNA sequences from the same or closely related organisms tend to be more similar than those from distantly related ones. This work is one of the first attempts to tackle the problem of taxonomic classification of metagenomic data. Moreover, it is the first of its kind to use a kernelized nearest neighbor approach. The use of the k-nearest neighbor (k-NN) algorithm in the TACOA strategy ensures that the resulting classifier is inherently multi-class. In addition, this approach makes no assumptions about the distribution of the input data, and its classification results can be interpreted easily. However, the traditional k-NN algorithm runs into problems when dealing with high-dimensional input data (the so-called curse of dimensionality). In the kernelized extension presented herein, this problem is overcome by incorporating a Gaussian kernel into its architecture. Furthermore, the developed software can easily be installed and run on a desktop computer, offering more independence in the analysis of metagenomic data sets.
The reference set used by the proposed classifier can easily be updated with newly sequenced genomes, a very desirable feature given the continuing expansion of genomic databases. The novel strategy was extensively evaluated using genomic fragments of variable length (800 bp to 50 kbp) from 373 completely sequenced genomes. Classification accuracy was evaluated at five taxonomic ranks: superkingdom, phylum, class, order and genus. TACOA is able to classify genomic fragments of length 800 bp and 1 kbp with high accuracy down to the rank of class. For fragments longer than 3 kbp, accurate predictions are made even at deeper taxonomic ranks (order and genus). TACOA compares well to the latest intrinsic classifier, PhyloPythia. For fragments of length 800 bp and 1 kbp, the overall accuracy of TACOA is higher than that obtained by PhyloPythia across all taxonomic ranks. For all fragment lengths, both methods achieved comparably high specificity down to the rank of class and low false negative rates.
DA - 2010
LA - eng
PY - 2010
TI - Taxonomic classification of genomic sequences : from whole genomes to environmental genomic fragments
UR - https://nbn-resolving.org/urn:nbn:de:hbz:361-16855
Y2 - 2024-11-21T21:39:26
ER -