Collections of Expressed Sequence Tags (ESTs) provide the most extensive available survey of the transcriptome of an organism and, with it, evidence for the existence of genes. They are indispensable for gene discovery, gene structure prediction, and genomic mapping. The price of this low-cost, high-throughput data is that ESTs have high error rates and are poorly annotated. The low-quality sequence data can be improved by several processing steps and by grouping the ESTs into gene-oriented clusters, which in turn can be assembled into contig sequences for further analyses.
The first part of this thesis describes an EST clustering pipeline that makes use of enhanced suffix arrays as implemented in the software tool Vmatch. Enhanced suffix arrays are a data structure that has been shown to be as powerful as suffix trees, with the advantage of a reduced space requirement and reduced processing time. Furthermore, enhanced suffix arrays have been shown to be superior to other matching tools for a variety of applications. We validate the clustering results against a "gold-standard" EST data set of Arabidopsis thaliana and compare them with the results of other widely used clustering tools. The implemented clustering pipeline takes advantage of the underlying database and offers unique batch functionality for mapping results from other organisms to the species of interest.
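To make the data structure concrete, the following is a minimal Python sketch of a suffix array extended with a longest-common-prefix (LCP) table, which is one of the additional tables that turn a plain suffix array into an enhanced one. It is an illustration only and not the Vmatch implementation: the function names are hypothetical, the construction is a naive sort rather than a linear-time algorithm, and the example sequence is made up.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in lexicographic order.

    Naive construction by sorting; adequate for short illustrative inputs only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_lcp_table(text, suffix_array):
    """Return lcp, where lcp[k] is the length of the longest common prefix of
    the suffixes at suffix_array[k - 1] and suffix_array[k] (lcp[0] is 0).

    Uses the algorithm of Kasai et al., which runs in linear time.
    """
    n = len(text)
    rank = [0] * n
    for position, start in enumerate(suffix_array):
        rank[start] = position

    lcp = [0] * n
    h = 0  # number of characters already known to match
    for i in range(n):
        if rank[i] == 0:
            h = 0
            continue
        j = suffix_array[rank[i] - 1]  # predecessor in sorted order
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h > 0:
            h -= 1
    return lcp


def find_occurrences(text, suffix_array, pattern):
    """Return all start positions of pattern in text, in increasing order.

    Two binary searches locate the block of suffixes that begin with pattern.
    """
    m = len(pattern)

    # First suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # First suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])


if __name__ == "__main__":
    sequence = "ACGTACGA"  # made-up example sequence
    sa = build_suffix_array(sequence)
    print(sa)
    print(build_lcp_table(sequence, sa))
    print(find_occurrences(sequence, sa, "ACG"))
```

In the LCP table, large values mark neighbouring suffixes that share long substrings, which is the kind of information a matching tool can exploit to find repeated or overlapping regions between sequences without traversing a suffix tree.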
For some species, EST projects provide the only information about their gene content. One of these species is the African clawed frog Xenopus laevis. Research using this model system has provided critical insights into the mechanisms of early vertebrate development and cell biology. Despite the interest in this model organism, no genome project is currently ongoing, and EST and cDNA sequences are the only resource available. To further improve Xenopus as a non-mammalian model system, one of the highest-priority goals is the generation of EST and full-length cDNA collections, as they facilitate functional assays, one of the particular strengths of Xenopus.
We have applied the EST clustering pipeline described in the first part of this thesis to Xenopus laevis ESTs, to identify both full-length protein-encoding sequences and full-length cDNA clones. The unique database system XenDB supports comparative approaches between Xenopus laevis and other model systems, and enables the retrieval of their potential full-length clones.