Completing the genome sequence of an organism is only the first step; the next task is to characterize the features encoded in that sequence. For eukaryotic organisms, this means above all documenting the coding exons of each gene, together with the non-coding and regulatory sequences. Despite extensive research in the area, the production of genomic data remains far ahead of our ability to predict such features reliably by computational means.
Eukaryotic genes consist predominantly of exons and introns. Introns are non-coding regions that are removed from the primary transcript during RNA splicing, whereas exons are the regions, carrying the coding sequence, that are spliced together during mRNA maturation. The borders between exons and introns carry conserved dinucleotides and are called splice sites: the intron-exon boundary is referred to as the acceptor splice site and the exon-intron boundary as the donor splice site.
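As a minimal illustration of what "conserved dinucleotides" implies for a predictor, the sketch below enumerates candidate splice sites in a DNA string. It assumes the canonical GT (donor) and AG (acceptor) intronic dinucleotides, which the text above does not name explicitly; the function name `candidate_splice_sites` is hypothetical.

```python
def candidate_splice_sites(seq):
    """Return (donors, acceptors) as lists of 0-based positions.

    Assumption: canonical GT-AG introns. A donor position is the index
    of the G in a 'GT'; an acceptor position is the index of the A in
    an 'AG'. Every occurrence is only a candidate, not a true site.
    """
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return donors, acceptors
```

Because these dinucleotides occur by chance throughout any genome, nearly all candidates found this way are false positives, which is why the surrounding sequence context has to be scored.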
Accurate eukaryotic gene prediction requires accurate splice site prediction. Various pattern recognition techniques are available to assess the quality of candidate splice sites; most of them compute a score derived from models of the nucleotide distribution in the neighbourhood of a splice site consensus sequence. Using Support Vector Machines (SVM) and a kernel designed specifically for the study of biological sequences, we investigate which pattern occurrences at which positions are relevant for splice site prediction. In addition, we propose a splice site prediction method that improves on traditional position-specific weight matrices. Finally, we contribute an intron model that relies on secondary structure information and demonstrates the ability to distinguish intron sequences from random data.
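For orientation, a traditional position-specific weight matrix can be sketched as follows. This is the baseline technique mentioned above, not the improved method or the SVM kernel proposed here. The pseudocount, the uniform background frequency of 0.25, and the function names `build_pwm` and `score` are assumptions made for illustration.

```python
import math

ALPHABET = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds weight matrix from aligned, equal-length sites.

    Returns one dict per position mapping each base to
    log2(P(base at position) / background).
    """
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all training sites must have the same length")
    pwm = []
    for pos in range(length):
        # Start from the pseudocount so unseen bases keep a finite score.
        counts = {base: pseudocount for base in ALPHABET}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({base: math.log2(counts[base] / total / background)
                    for base in ALPHABET})
    return pwm

def score(pwm, window):
    """Sum the per-position log-odds of a window as long as the matrix."""
    if len(window) != len(pwm):
        raise ValueError("window length must match the matrix length")
    return sum(column[base] for column, base in zip(pwm, window))
```

A usage example with made-up training windows: after `pwm = build_pwm(["GTAAGT", "GTGAGT", "GTAAGT"])`, the call `score(pwm, "GTAAGT")` yields a higher value than `score(pwm, "CCCCCC")`. Such a matrix treats positions as independent, which is one limitation that more expressive models aim to address.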