Peptide mass fingerprinting is an important technique for identifying a protein from the masses of its fragments, which are measured by mass spectrometry after enzymatic fragmentation: an experimental mass fingerprint is compared with, or aligned to, several reference fingerprints obtained from protein databases by in-silico digestion. Recently, much attention has been given to the questions of how to score such an alignment of mass spectra and how to evaluate its significance; results have been developed mostly from a combinatorial perspective. In particular, existing methods generally do not capture (or capture only at the price of a combinatorial explosion) the fact that the same amino acid can have different masses because of, e.g., isotopic distributions or variable chemical modifications.
We offer several new contributions to the field. We introduce the notion of a probabilistically weighted alphabet, in which each character can have different masses according to a specified probability distribution, and the notion of a random weighted string as a fundamental model for a random protein. We then develop a general computational framework, which we call weighted HMMs, for various length and mass statistics of cleavage fragments of random proteins. We obtain general formulas for the length distribution of a fragment, the number of fragments, the joint length-mass distribution, and fragment mass occurrence probabilities, as well as special results for so-called standard cleavage schemes (e.g., for trypsin). We also discuss how to implement the probability computations efficiently. Computational results are provided, together with a comparison to proteins from the SwissProt database.
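To make the notions of a probabilistically weighted alphabet and a random weighted string concrete, the following minimal Python sketch samples such a string and cleaves it with a trypsin-like rule. It is a Monte Carlo illustration only, not the weighted HMM framework described above, which yields the distributions by direct computation. Everything in it is an assumption made for illustration: the names `WEIGHTED_ALPHABET`, `random_weighted_string` and `cleave`, the five-letter alphabet, the rounded residue masses, the 10% modification probability for M, the uniform i.i.d. character model, and the cleavage rule (cut after K or R unless the next character is P). Fragment masses are plain sums of character masses, with no terminal water added.

```python
import random

# Hypothetical probabilistically weighted alphabet: each character maps to a
# list of (mass, probability) pairs. Masses are rounded, illustrative values.
WEIGHTED_ALPHABET = {
    "A": [(71.0, 1.0)],
    "K": [(128.0, 1.0)],
    "R": [(156.0, 1.0)],
    "P": [(97.0, 1.0)],
    "M": [(131.0, 0.9), (147.0, 0.1)],  # assumed 10% chance of a modified form
}


def random_weighted_string(length, rng):
    """Sample a random weighted string: characters plus one mass per character.

    Assumption: characters are drawn i.i.d. and uniformly from the alphabet;
    each character's mass is then drawn from its own mass distribution.
    """
    chars = list(WEIGHTED_ALPHABET)
    seq, masses = [], []
    for _ in range(length):
        c = rng.choice(chars)
        mass_values, probs = zip(*WEIGHTED_ALPHABET[c])
        seq.append(c)
        masses.append(rng.choices(mass_values, weights=probs, k=1)[0])
    return "".join(seq), masses


def cleave(seq, masses):
    """Split into (fragment, fragment_mass) pairs with a trypsin-like rule.

    Assumed rule: cut after K or R unless the next character is P.
    The final fragment always ends at the end of the string.
    """
    fragments = []
    start = 0
    for i, c in enumerate(seq):
        is_last = i == len(seq) - 1
        cut_here = c in "KR" and not is_last and seq[i + 1] != "P"
        if cut_here or is_last:
            fragments.append((seq[start:i + 1], sum(masses[start:i + 1])))
            start = i + 1
    return fragments


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so that runs are repeatable
    seq, masses = random_weighted_string(200, rng)
    for fragment, mass in cleave(seq, masses)[:5]:
        print(len(fragment), mass, fragment)
```

Repeating the sampling many times and tabulating fragment lengths and masses gives empirical estimates of the length distribution, the number of fragments, and the joint length-mass distribution, which could serve as a sanity check against exactly computed values.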