This thesis addresses the problem of relating spoken utterances to the simultaneously perceived visual scene context. The development of systems that integrate verbal and visual information is an expanding field of research. It is driven by various applications such as the indexing and querying of video databases, service robotics, augmented reality, document analysis, documentation systems with multi-modal interfaces, and other multi-media systems. Each of these applications must relate two or more different input modalities, a task also known as the correspondence problem.
The task of relating realistic inputs like speech or images is complicated by the fact that the interpretations of the surface modalities are often erroneous or incomplete, so that an integration component must cope with noisy and partial interpretations. As a consequence, this thesis treats the correspondence problem as a probabilistic decoding process. This perspective distinguishes the approach from others that propose rule-based translation schemes or integrated knowledge bases and assume that a visual representation can be logically transformed into a verbal representation and vice versa.
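One way to make this decoding view concrete, as a sketch rather than the thesis' exact formulation, is to assume that the verbal interpretation U and the visual interpretation V are two noisy observations of a common underlying scene description S, and to phrase integration as maximum a posteriori decoding:

\[
\hat{S} \;=\; \operatorname*{arg\,max}_{S} P(S \mid U, V) \;=\; \operatorname*{arg\,max}_{S} P(U \mid S)\,P(V \mid S)\,P(S),
\]

where the factorization rests on the additional assumption that the verbal and visual evidence are conditionally independent given S. The symbols S, U, and V are introduced here for illustration only.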
This thesis successfully applies Bayesian networks to the task of integrating speech and images. The correspondence problem is solved in the language of Bayesian networks in a consistent and efficient way by a novel combination of conditioning and elimination techniques. The experimental study identifies Bayesian networks as an adequate formalism for speech and image integration tasks. The mental models of the speaker are partially reconstructed by estimating conditional probabilities from the data of psycholinguistic experiments. Context-dependent shifts of word meanings are modeled by the structure of the network.
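The following minimal sketch illustrates the flavor of such an inference step: conditioning fixes the observed verbal and visual evidence, and elimination sums out the unobserved true attributes of each scene object to obtain a posterior over the referent. The objects, color vocabulary, and probability tables are hypothetical illustrations, not the networks actually used in the thesis.

```python
# Sketch: Bayesian-network-style reference resolution with hypothetical tables.
COLORS = ["red", "orange", "yellow"]
scene_objects = ["obj1", "obj2", "obj3"]

# Prior over which scene object the speaker refers to (uniform).
p_referent = {o: 1.0 / len(scene_objects) for o in scene_objects}

# Prior over an object's true color (uniform).
p_color_prior = {c: 1.0 / len(COLORS) for c in COLORS}

# Sensor model: P(visual classifier output | true color).
p_visual = {
    "red":    {"red": 0.80, "orange": 0.15, "yellow": 0.05},
    "orange": {"red": 0.20, "orange": 0.70, "yellow": 0.10},
    "yellow": {"red": 0.05, "orange": 0.15, "yellow": 0.80},
}

# Word model: P(spoken color word | true color of the referent),
# allowing for vague, context-dependent use of color terms.
p_word = {
    "red":    {"red": 0.85, "orange": 0.15, "yellow": 0.00},
    "orange": {"red": 0.30, "orange": 0.60, "yellow": 0.10},
    "yellow": {"red": 0.00, "orange": 0.20, "yellow": 0.80},
}

def color_posterior(visual_output):
    """Condition an object's true-color variable on its visual classifier output."""
    joint = {c: p_color_prior[c] * p_visual[c][visual_output] for c in COLORS}
    z = sum(joint.values())
    return {c: p / z for c, p in joint.items()}

def posterior_referent(spoken_word, visual_outputs):
    """P(referent | spoken word, visual outputs): condition on the evidence,
    then eliminate (sum out) the unobserved true color of each object."""
    scores = {}
    for obj in scene_objects:
        p_color = color_posterior(visual_outputs[obj])
        scores[obj] = p_referent[obj] * sum(
            p_color[c] * p_word[c][spoken_word] for c in COLORS
        )
    z = sum(scores.values())
    return {obj: s / z for obj, s in scores.items()}

if __name__ == "__main__":
    # The speaker says "red"; the classifier labels obj1 orange, obj2 yellow,
    # obj3 red. The posterior favours obj3 but keeps obj1 as an alternative.
    visual = {"obj1": "orange", "obj2": "yellow", "obj3": "red"}
    print(posterior_referent("red", visual))
```

In the same spirit, further nodes for object type, size, or spatial relations could be attached to the network, so that vague or partially erroneous evidence from either modality merely reweights the posterior instead of breaking the interpretation.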
The proposed Bayesian network scheme for integrating multi-modal input has been applied to a construction scenario: a robot is instructed by a speaker to grasp objects from a table, join them together, and put them down again. In this thesis an integration component is realized that identifies the objects in the visual scene that the speaker verbally refers to. This task is performed successfully despite vague descriptions, erroneous recognition results, and the use of names with unknown semantics. Several interaction tasks have been implemented that perform multi-modal object recognition, link unknown object names to scene objects, disambiguate alternative interpretations of utterances, predict undetected mounting relations, and determine the reference frame selected by the speaker.