TY - THES AB - The primary concern of this thesis is to model the resolution of spoken referring expressions made in order to identify objects; in particular, everyday objects that can be perceived visually and distinctly from other objects. The practical goal of such a model is for it to be implemented as a component for use in a live, interactive, autonomous spoken dialogue system. The requirement of interaction imposes an added complication; one that has been ignored in previous models and approaches to automatic reference resolution: the model must attempt to resolve the reference incrementally as it unfolds–not wait until the end of the referring expression to begin the resolution process. Beyond components in dialogue systems, reference has been a major player in the philosophy of meaning for longer than a century. For example, Gottlob Frege (1892) has distinguished between Sinn (sense) and Bedeutung (reference), discussed how they are related and how they relate to the meaning of words and expressions. It has furthermore been argued (e.g., Dahlgren (1976)) that reference to entities in the actual world is not just a fundamental notion of semantic theory, but the fundamental notion; for an individual acquiring a language, understanding the meaning of many words and concepts is done via the task of reference, beginning in early childhood. In this thesis, we pursue an account of word meaning that is based on perception of objects; for example, the meaning of the word red is based on visual features that are selected as distinguishing red objects from non-red ones. This thesis proposes two statistical models of incremental reference resolution. Given ex- amples of referring expressions and visual aspects of the objects to which those expressions referred, both model components learn a functional mapping between the words of the refer- ring expressions and the visual aspects. A generative model, the simple incremental update model, presented in Chapter 5, uses a mediating variable to learn the mapping, whereas a dis- criminative model, the words-as-classifiers model, presented in Chapter 6, learns the mapping directly and improves over the generative model. Both models have been evaluated in various reference resolution tasks to objects in virtual scenes as well as real, tangible objects. This thesis shows that both models work robustly and are able to resolve referring expressions made in reference to visually present objects despite realistic, noisy conditions of speech and object recognition. A theoretical and practical comparison is also provided. Special emphasis is given to the discriminative model in this thesis because of its simplicity and ability to represent word meanings. It is in the learning and application of this model that gives credence to the above claim that reference is the fundamental notion for semantic theory and that meanings of (visual) words is done through experiencing referring expressions made to objects that are visually perceivable. DA - 2016 LA - eng PY - 2016 TI - Incrementally resolving references in order to identify visually present objects in a situated dialogue setting UR - https://nbn-resolving.org/urn:nbn:de:hbz:361-29021947 Y2 - 2024-11-24T10:09:24 ER -