We investigate several mathematical and algorithmical aspects of peptide mass fingerprinting (PMF). In a PMF experiment, a purified protein sample is digested by a protease using an enzymatic cleavage reaction, the masses of the resulting peptides are measured by mass spectrometry, yielding the mass fingerprint, and compared to predicted mass fingerprints of reference protein sequences.
In the first part, we examine several statistics of PMFs. We introduce random weighted strings over probabilistically weighted alphabets and cleavage schemes as mathematical models for random protein sequences, their molecular mass, and their mass fingerprints. We examine weighted hidden Markov models and Markov additive chains as a general computational framework for the stochastics of protein fragments and their masses. In parallel, we present recurrence equations for the description of these fragments.
Using these models, the distribution of fragment lengths, the distribution of fragment masses and the distribution of the number of fragments in a random protein are examined under a random string model of independent, identically distributed characters. We derive the occurrence probability of a certain fragment mass in a random protein sequence of either fixed length or fixed mass.
We present efficient dynamic programming algorithms and their time and space complexity for most of the statistics and compare all statistics with their empirical counterparts estimated from an in-silico tryptic digest of the Swiss-Prot protein sequence database.
In the second part, we develop a general algorithmic framework for identification of PMFs with a protein sequence database search. We formalize the identification of a mass spectrum as an alignment problem and modify well-known methods developed for sequence analysis to an efficient algorithm for computing an optimal alignment of a measured spectrum and a predicted spectrum. The alignment is based on scoring schemes that allow flexible and consistent incorporation of a multitude of experimental parameters such as accuracy of the measured masses, mass error distribution, sample contamination, and ionization efficiency into the identification procedure.
Using the fragment statistics of the first part, we estimate the significance of a protein identification under a well-defined statistical null-model.
Finally, we present a family of scoring schemes and show how additional information such as intensity values of measured peaks can be incorporated into an alignment scoring. We demonstrate the applicability of the alignment framework on a real proteomics dataset and compare our results with the standard software MASCOT.