Over the past decades, the active development of high-throughput methods has led to the discovery of numerous associations between biomolecules and human diseases. This is a major advance for science and medicine, since it helps to unravel disease mechanisms and provides clues for new treatment approaches. At the same time, the tremendous volume of raw experimental data and biomedical knowledge has become a major challenge for researchers. This has stimulated the development of automated solutions for the discovery, storage, retrieval, integration, and analysis of biological data.
Associations between genes and diseases have attracted particular interest from researchers. To date, thousands of rare inheritable diseases caused by disruptions of individual genes have been described. At the same time, the most widespread disorders are multifactorial, i.e. they develop as a result of complex interactions between multiple genetic and environmental factors. Such multifactorial diseases are therefore called complex, and their mechanisms remain far from understood. To achieve a complete mechanistic picture of a complex disease, it is first necessary to establish a comprehensive list of all pathological changes. This task is complicated by the fact that a complex disease may be heterogeneous, i.e. symptomatically similar cases may be caused by distinct molecular lesions.
The investigation of complex diseases not only brings us closer to understanding their mechanisms but also yields a number of useful intermediate results, e.g. the discovery of clinically relevant biomarkers and disease subtypes. This thesis begins with a discussion of approaches for the discovery and prioritization of gene-disease associations and of their relevance for complex and heterogeneous diseases. It then focuses on biclustering methods, which appear particularly promising in the context of disease heterogeneity because they are capable of identifying genes with a similar expression pattern in a previously unknown subset of samples.
After an overview of existing biclustering methods, this thesis presents a novel biclustering method called DESMOND. Two factors distinguish DESMOND from most related work. First, it searches for differentially expressed biclusters rather than co-expressed biclusters. Second, it performs a network-constrained search, whereas the majority of biclustering methods are unconstrained.
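The two distinguishing properties can be illustrated with a minimal sketch. It is not the DESMOND algorithm: the function names `mean_shift` and `is_connected`, the toy expression matrix, and the toy gene network are all assumptions introduced here for illustration. The sketch scores a candidate bicluster by how strongly its genes are shifted in the selected samples relative to the remaining samples, and checks whether the genes form a connected subgraph of the network.

```python
import numpy as np


def mean_shift(expr, gene_idx, sample_idx):
    """Mean expression of the chosen genes inside the sample subset minus outside it.

    expr is a genes x samples matrix; a large absolute value suggests
    differential expression of the gene set in the sample subset.
    """
    inside = np.zeros(expr.shape[1], dtype=bool)
    inside[sample_idx] = True
    sub = expr[gene_idx, :]
    return sub[:, inside].mean() - sub[:, ~inside].mean()


def is_connected(genes, edges):
    """Return True if the genes induce a connected subgraph of the network.

    edges is an iterable of (gene_a, gene_b) pairs (a hypothetical toy network).
    """
    genes = set(genes)
    if not genes:
        return False
    # Keep only edges whose two endpoints both belong to the gene set.
    adj = {g: set() for g in genes}
    for a, b in edges:
        if a in genes and b in genes:
            adj[a].add(b)
            adj[b].add(a)
    # Depth-first traversal from an arbitrary gene.
    start = next(iter(genes))
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbour in adj[node] - seen:
            seen.add(neighbour)
            stack.append(neighbour)
    return seen == genes


# Toy data: 6 genes x 10 samples of noise with one implanted bicluster
# (genes 0-2 are up-regulated in samples 0-3 only).
rng = np.random.default_rng(0)
expr = rng.normal(0.0, 1.0, size=(6, 10))
expr[np.ix_([0, 1, 2], [0, 1, 2, 3])] += 5.0

# Toy interaction network: genes 0-1-2 form a path, genes 3-4 a separate edge.
edges = [(0, 1), (1, 2), (3, 4)]

shift = mean_shift(expr, [0, 1, 2], [0, 1, 2, 3])
print("mean shift of implanted bicluster:", round(float(shift), 2))
print("genes 0,1,2 connected:", is_connected([0, 1, 2], edges))
print("genes 0,1,3 connected:", is_connected([0, 1, 3], edges))
```

In this toy setting, a network-constrained search would only consider gene sets for which `is_connected` holds, while an unconstrained search would consider any gene subset.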
DESMOND and nine previously published biclustering methods were applied to simulated data and to real breast cancer expression profiles. The evaluated methods produced very diverse but biologically meaningful biclusters. On the breast cancer datasets, DESMOND tended to produce more biologically significant gene clusters than its competitors. DESMOND and QUBIC identified more similar biclusters associated with overall survival (OS) in two independent breast cancer studies than the other evaluated methods, possibly owing to their ability to consider gene interactions. Interestingly, the replicated biclusters found by DESMOND and those found by QUBIC were composed of different genes and samples. Such OS-associated biclusters replicated in independent datasets may represent clinically distinct disease subtypes and are promising biomarker candidates.