When humans communicate, we use deictic expressions to refer to objects in our surroundings and to put them in the context of our actions. In face-to-face interaction, we can complement verbal expressions with gestures and, hence, do not need to be overly precise in our verbal utterances. Our interlocutors hear our speech, see our gestures, and even read our eyes. They interpret our deictic expressions, try to identify the referents and -- normally -- they will understand. If only machines could do the same.
The driving vision behind the research in this thesis is that of multimodal conversational interfaces in which humans engage in natural dialogues with computer systems. The embodied conversational agent Max, developed in the A.I. group at Bielefeld University, is an example of such an interface. Max is already able to produce multimodal deictic expressions using speech, gaze and gestures, but his capabilities to understand humans are not on par. If he were able to resolve multimodal deictic expressions, his understanding of humans would increase and interacting with him would become more natural.
Following this vision, we as scientists are confronted with several challenges. First, accurate models of human pointing have to be found. Second, precise data on multimodal interactions has to be collected, integrated and analyzed in order to create these models. This data is heterogeneous (transcripts, voice and video recordings, annotations) and in part not directly accessible for analysis (voice and video recordings). Third, technologies have to be developed to support the integration and analysis of this multimodal data. Fourth, the created models have to be implemented, evaluated and optimized until they allow for natural interaction with the conversational interface.
To these ends, this work aims to deepen our knowledge of human non-verbal deixis, specifically of manual and gaze pointing, and to apply this knowledge in conversational interfaces. At the core of the theoretical and empirical investigations of this thesis are models for the interpretation of pointing gestures to objects. These models address the following questions: When are we pointing? Where are we pointing? Which objects are we pointing at? With respect to these questions, this thesis makes the following three contributions:
First, gaze-based interaction technology for 3D environments: Gaze plays an important role in human communication, not only in deictic reference. Yet technology for gaze interaction is still less developed than technology for manual interaction. In this thesis, we have developed components for real-time tracking of eye movements and of the point of regard in 3D space and integrated them into the DRIVE framework. DRIVE provides reliable real-time information about human communicative behavior. This data can be used to investigate and to design processes on higher cognitive levels, such as turn-taking, check-backs, shared attention and the resolution of deictic references.
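To make the notion of a 3D point of regard concrete, the following minimal sketch estimates it geometrically as the midpoint of the shortest segment between the two gaze rays. It is an illustrative baseline only, not the DRIVE implementation; the function name, coordinate conventions and example values are assumptions.

```python
import numpy as np

def point_of_regard(p_left, d_left, p_right, d_right):
    """Estimate a 3D point of regard as the midpoint of the shortest
    segment between the two (possibly skew) gaze rays.

    p_*: eye positions, d_*: gaze directions (3-vectors).
    Illustrative sketch; not the DRIVE implementation.
    """
    d_left = d_left / np.linalg.norm(d_left)
    d_right = d_right / np.linalg.norm(d_right)
    w0 = p_left - p_right
    a = d_left @ d_left            # = 1 after normalization
    b = d_left @ d_right
    c = d_right @ d_right          # = 1 after normalization
    d = d_left @ w0
    e = d_right @ w0
    denom = a * c - b * b
    if np.isclose(denom, 0.0):     # gaze rays are (nearly) parallel
        s, t = 0.0, e / c
    else:
        s = (b * e - c * d) / denom
        t = (a * e - b * d) / denom
    closest_left = p_left + s * d_left
    closest_right = p_right + t * d_right
    return (closest_left + closest_right) / 2.0

# Example: eyes 6.5 cm apart, both converging on a point about 1 m ahead
left_eye  = np.array([-0.0325, 0.0, 0.0])
right_eye = np.array([ 0.0325, 0.0, 0.0])
target    = np.array([ 0.10, 0.05, 1.0])
print(point_of_regard(left_eye, target - left_eye,
                      right_eye, target - right_eye))  # ~ [0.10, 0.05, 1.0]
```

In practice, measurement noise in the gaze directions makes the two rays skew, which is why the midpoint of the shortest connecting segment (rather than an exact intersection) is used in such sketches.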
Second, data-driven modeling: We answer the theoretical questions about the timing, direction, accuracy and dereferential power of pointing by data-driven modeling. As an empirical basis for the simulations, we created a substantial corpus of high-precision data from an extensive study on multimodal pointing. Two further studies complemented this effort with substantial data on gaze pointing in 3D. Based on this data, we have developed several models of pointing and successfully created a model for the interpretation of manual pointing that achieves a human-like level of performance.
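To illustrate what interpreting a manual pointing gesture involves, the sketch below ranks candidate referents by their angular distance from the pointing ray, a simple cone-based baseline. The function, object names and the 15-degree cone width are hypothetical; the models developed in this thesis are instead derived from the corpus data.

```python
import numpy as np

def rank_pointing_candidates(origin, direction, objects, max_angle_deg=15.0):
    """Rank candidate referents of a pointing gesture by their angular
    distance from the pointing ray (a simple cone-based baseline,
    not the model developed in the thesis).

    origin, direction: the pointing ray in scene coordinates.
    objects: dict mapping object names to 3D center positions.
    """
    direction = direction / np.linalg.norm(direction)
    scored = []
    for name, center in objects.items():
        to_obj = np.asarray(center, dtype=float) - origin
        dist = np.linalg.norm(to_obj)
        if dist == 0.0:
            continue
        cos_angle = np.clip(direction @ (to_obj / dist), -1.0, 1.0)
        angle = np.degrees(np.arccos(cos_angle))
        if angle <= max_angle_deg:          # inside the pointing cone
            scored.append((angle, name))
    return [name for angle, name in sorted(scored)]

# Hypothetical scene: which objects fall inside the pointing cone?
objects = {"mug": [0.4, 0.0, 1.2], "book": [0.1, 0.1, 1.0], "lamp": [-0.8, 0.5, 2.0]}
print(rank_pointing_candidates(np.zeros(3), np.array([0.1, 0.05, 1.0]), objects))
# -> ['book', 'mug']
```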
Third, new methodologies for research on multimodal deixis in the fields of linguistics and computer science: The experimental-simulative approach to modeling -- which we follow in this thesis -- requires large collections of heterogeneous data to be recorded, integrated, analyzed and resimulated. To support researchers in these tasks, we developed the Interactive Augmented Data Explorer (IADE), an innovative tool for research on multimodal interaction based on virtual reality technology. It allows researchers to literally immerse themselves in multimodal data and to explore it interactively, in real time and in virtual space. With IADE we have also extended established approaches to the scientific visualization of linguistic data to 3D, which previously existed only for 2D methods of analysis (e.g. video recordings or computer-screen experiments). By this means, we extended McNeill's 2D depiction of the gesture space to gesture space volumes expanding in time and space. Similarly, we created attention volumes, a new way to visualize the distribution of attention in 3D environments.
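The idea behind attention volumes can be illustrated by accumulating gaze rays into a voxel grid, as in the rough sketch below. The grid resolution, scene bounds and uniform deposition along the ray are assumptions made for illustration and do not reproduce the IADE visualization.

```python
import numpy as np

def attention_volume(gaze_samples, grid_shape=(32, 32, 32),
                     bounds=((-1, 1), (-1, 1), (0, 2)),
                     step=0.02, max_dist=2.0):
    """Accumulate gaze rays into a voxel grid as a rough 'attention volume'
    (an illustrative sketch of the idea, not the IADE implementation).

    gaze_samples: iterable of (origin, direction) pairs in scene coordinates.
    """
    grid = np.zeros(grid_shape)
    lows = np.array([b[0] for b in bounds], dtype=float)
    highs = np.array([b[1] for b in bounds], dtype=float)
    scale = (np.array(grid_shape) - 1) / (highs - lows)
    for origin, direction in gaze_samples:
        origin = np.asarray(origin, dtype=float)
        direction = np.asarray(direction, dtype=float)
        direction /= np.linalg.norm(direction)
        # march along the gaze ray and deposit attention in each visited voxel
        for t in np.arange(0.0, max_dist, step):
            p = origin + t * direction
            if np.any(p < lows) or np.any(p > highs):
                continue
            i, j, k = ((p - lows) * scale).astype(int)
            grid[i, j, k] += 1.0
    return grid / max(grid.max(), 1.0)      # normalize for visualization

# Ten identical gaze samples produce a bright "ray" through the grid
samples = [((0.0, 0.0, 0.0), (0.1, 0.05, 1.0)) for _ in range(10)]
volume = attention_volume(samples)
print(volume.shape, volume.max())           # (32, 32, 32) 1.0
```

A weighted deposition scheme (e.g. a density falling off with angular distance from the gaze direction) would yield smoother volumes; the uniform scheme above is kept only to keep the sketch short.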