To improve the effectiveness of remote homology analysis based on probabilistic protein family models, namely Hidden Markov Models (HMMs), new approaches for stochastic protein family modeling were developed in this thesis. To this end, protein sequence processing was consistently treated as a general pattern recognition task. From this more abstract point of view, various new techniques were presented that were motivated by other pattern classification applications such as automatic speech recognition. The result is a fundamentally new and very effective kind of protein sequence analysis, with which substantial improvements in remote homology analysis could be achieved.
The fundamental innovation of the thesis is a paradigm shift towards processing protein data in a richer representation that explicitly captures biochemical properties. Using a sliding window technique, frames of 16 consecutive residues are created. Each frame is a multi-channel, signal-like numerical representation of selected biochemical properties (obtained from amino acid indices) of the residues covered by the local context. To concentrate on the essentials that determine protein family affiliation, features are extracted from this signal-like representation. For this purpose, pattern recognition techniques are applied, namely a Discrete Wavelet Transformation and a Principal Components Analysis, which yield feature vectors that sufficiently describe the general shape of the protein signal. With this procedure, the residues of a protein sequence are converted into a 99-dimensional feature vector representation, which is used for remote homology analysis.
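The processing chain described above can be illustrated with a minimal sketch. It is not the implementation used in the thesis: the property table is filled with random placeholder values standing in for real amino acid indices, a Haar wavelet is assumed for the Discrete Wavelet Transformation, the Principal Components Analysis is fitted on the frames of a single sequence for brevity, and the output dimension is 8 rather than 99. All function names are hypothetical.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
WINDOW = 16  # frame length in residues, as stated in the text


def encode(sequence, index_table):
    """Map each residue to its row of property values: shape (length, channels)."""
    positions = [AMINO_ACIDS.index(residue) for residue in sequence]
    return index_table[positions]


def frames(signal, window=WINDOW):
    """Slide a window with step 1 over the signal: shape (n_frames, window, channels)."""
    n_frames = signal.shape[0] - window + 1
    return np.stack([signal[i:i + window] for i in range(n_frames)])


def haar_dwt(x):
    """Full orthonormal Haar decomposition along the last axis (length must be a power of two)."""
    details = []
    approx = x
    while approx.shape[-1] > 1:
        even, odd = approx[..., 0::2], approx[..., 1::2]
        details.append((even - odd) / np.sqrt(2.0))
        approx = (even + odd) / np.sqrt(2.0)
    # Coarsest coefficients first, finest detail coefficients last.
    return np.concatenate([approx] + details[::-1], axis=-1)


def pca_fit(data, n_components):
    """Return the data mean and the leading principal axes (rows)."""
    mean = data.mean(axis=0)
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return mean, vt[:n_components]


def extract_features(sequence, index_table, n_components):
    signal = encode(sequence, index_table)            # (length, channels)
    windows = frames(signal)                          # (n_frames, 16, channels)
    coeffs = haar_dwt(np.swapaxes(windows, 1, 2))     # (n_frames, channels, 16)
    flat = coeffs.reshape(len(windows), -1)           # one vector per frame
    mean, components = pca_fit(flat, n_components)
    return (flat - mean) @ components.T               # (n_frames, n_components)


rng = np.random.default_rng(0)
index_table = rng.normal(size=(20, 4))   # placeholder: 4 property channels per amino acid
sequence = "".join(rng.choice(list(AMINO_ACIDS), size=60))
features = extract_features(sequence, index_table, n_components=8)
print(features.shape)
```

In a realistic setting the PCA projection would presumably be estimated once on a large set of frames and then reused for every sequence, so that all feature vectors live in the same space.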
Based on the feature representation of protein sequences, semi-continuous protein family HMMs were developed. Their basic advantage is that, in principle, the estimation of the feature space representation (using a Gaussian mixture density) can be separated from the model training itself. The general feature space representation can be obtained by applying mixture density estimation techniques to large amounts of general protein data, so that only small to moderate amounts of protein family specific data are required for the actual model training. To specialize the mixture density based feature space representation for a particular target family, adaptation techniques are applied. Using the family specific sample data for either MAP or MLLR adaptation, the general feature space can be focused effectively on a particular protein family, which results in robust model estimation even for small training sets.
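To make the adaptation step more concrete, the following sketch shows one common form of MAP adaptation for the means of a Gaussian mixture, namely the relevance-factor update known from speaker recognition. Whether the thesis uses exactly this update is an assumption; diagonal covariances, the function names, and the default `relevance` value are likewise illustrative choices.

```python
import numpy as np


def log_gaussians(x, means, variances):
    """Log density of every sample under every diagonal Gaussian: shape (N, K)."""
    diff = x[:, None, :] - means[None, :, :]
    return -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                   + np.sum(np.log(2.0 * np.pi * variances), axis=1))


def map_adapt_means(x, weights, means, variances, relevance=16.0):
    """Shift the general mixture means towards the family specific samples x."""
    # Posterior probability of each component for each sample.
    log_joint = np.log(weights) + log_gaussians(x, means, variances)
    log_joint -= log_joint.max(axis=1, keepdims=True)
    post = np.exp(log_joint)
    post /= post.sum(axis=1, keepdims=True)

    counts = post.sum(axis=0)                                   # soft counts, (K,)
    sample_means = (post.T @ x) / np.maximum(counts, 1e-10)[:, None]

    # Components that saw much data follow the data, the others keep the prior mean.
    alpha = (counts / (counts + relevance))[:, None]
    return alpha * sample_means + (1.0 - alpha) * means


rng = np.random.default_rng(0)
weights = np.array([0.5, 0.5])
means = np.array([[-5.0, -5.0], [5.0, 5.0]])       # stands in for the general feature space
variances = np.ones((2, 2))
family_data = rng.normal(loc=6.0, scale=0.5, size=(40, 2))
print(map_adapt_means(family_data, weights, means, variances))
```

The interpolation weight `alpha` is what makes the estimate robust for small training sets: with few samples the adapted means stay close to the general model, and with many samples they approach the family specific sample means.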
Profile HMMs require a rather complex three-state topology when processing discrete amino acid data. With the new protein sequence representation, protein family HMMs of reduced complexity become possible. Their basic advantage is the smaller number of parameters that need to be trained using representative sample data. In addition to feature based Profile HMMs, two variants of protein family models with reduced complexity were developed in this thesis. The first variant models the complete protein family globally, as is current practice, but uses a Bounded Left-Right (BLR) architecture. The second variant represents a general paradigm shift in protein family modeling: target family models are estimated using automatically derived building blocks, so-called Sub-Protein Units (SPUs). As with the BLR models, the number of parameters that need to be trained is substantially smaller than for state-of-the-art Profile HMMs.
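As a rough illustration of why such a topology needs few transition parameters, the sketch below builds a transition matrix under the assumption that "bounded left-right" means each state may only stay where it is or move forward by a limited number of states. The exact BLR definition used in the thesis may differ, and the function name and the `max_jump` parameter are hypothetical.

```python
import numpy as np


def blr_transitions(n_states, max_jump=2):
    """Uniformly initialised transitions where state i reaches only states i .. i + max_jump."""
    trans = np.zeros((n_states, n_states))
    for i in range(n_states):
        last = min(i + max_jump, n_states - 1)
        trans[i, i:last + 1] = 1.0 / (last - i + 1)
    return trans


trans = blr_transitions(5, max_jump=2)
print(trans)
print("non-zero transitions:", np.count_nonzero(trans))
```

Because every row has at most `max_jump + 1` non-zero entries, the number of transition probabilities grows linearly with the model length.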
To further reduce the number of false positive predictions, which is especially relevant for e.g. pharmaceutical applications, semi-continuous feature based protein family HMMs are evaluated in competition with a so-called Universal Background Model (UBM), in addition to the abovementioned log-odds scoring technique. Such a UBM explicitly covers general protein data, and its application can thus be interpreted as a kind of pre-filtering stage.
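The competitive evaluation can be pictured as a simple decision rule. The sketch below assumes that log-likelihoods of a query sequence under each family model and under the UBM are already available; the function name, the `margin` parameter, and the family names are hypothetical and only serve the illustration.

```python
def classify(family_log_likelihoods, ubm_log_likelihood, margin=0.0):
    """Return (family, log_odds); family is None if the UBM explains the query better."""
    best_family = max(family_log_likelihoods, key=family_log_likelihoods.get)
    log_odds = family_log_likelihoods[best_family] - ubm_log_likelihood
    if log_odds > margin:
        return best_family, log_odds
    return None, log_odds  # rejected: treated as general protein data


scores = {"family_a": -120.0, "family_b": -135.0}   # made-up log-likelihoods
print(classify(scores, ubm_log_likelihood=-130.0))
print(classify(scores, ubm_log_likelihood=-110.0))
```

A query is assigned to a family only if that family model beats the background model, which is how the UBM acts as a filter against false positives.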
Furthermore, the evaluation of protein family models was accelerated algorithmically. Among other measures, the part of the HMM state space that actually needs to be explored can be reduced substantially by applying pruning techniques such as the Beam-Search algorithm. In addition, the evaluation of mixture densities can be pruned heavily.
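The idea of beam pruning can be sketched with a Viterbi pass in the log domain: at every time step, states whose score falls more than a fixed beam width below the current best score are deactivated and not expanded further. This is a minimal illustration under assumed names and a randomly generated model, not the decoder used in the thesis.

```python
import numpy as np


def viterbi_beam(log_init, log_trans, log_emit, beam=np.inf):
    """Best path log score and the number of states expanded.

    log_init: (S,), log_trans: (S, S), log_emit: (T, S) per-frame emission log scores.
    """
    scores = log_init + log_emit[0]
    expanded = 0
    for t in range(1, len(log_emit)):
        # Keep only states within `beam` of the best hypothesis.
        active = scores >= scores.max() - beam
        expanded += int(active.sum())
        pruned = np.where(active, scores, -np.inf)
        scores = np.max(pruned[:, None] + log_trans, axis=0) + log_emit[t]
    return scores.max(), expanded


rng = np.random.default_rng(0)
n_states, n_frames = 8, 30
log_init = np.log(rng.dirichlet(np.ones(n_states)))
log_trans = np.log(rng.dirichlet(np.ones(n_states), size=n_states))
log_emit = 3.0 * rng.normal(size=(n_frames, n_states))  # stand-in for mixture density scores

print(viterbi_beam(log_init, log_trans, log_emit))            # exhaustive search
print(viterbi_beam(log_init, log_trans, log_emit, beam=3.0))  # pruned search
```

A narrow beam expands fewer states per frame at the risk of discarding the globally best path, so the beam width trades speed against accuracy.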
The capabilities of the new modeling techniques were evaluated in numerous experiments. Using public data on representative tasks, it was shown that the new techniques developed in this thesis significantly outperform state-of-the-art discrete Profile HMMs, which are currently the most promising approach for remote homology analysis.