Up to now, gesture research has focused largely on the production of speech-accompanying gestures and on how speech-gesture utterances contribute to communication. An issue that has mostly been neglected is to what extent listeners even perceive the gestural part of a multimodal utterance. For
instance, much prior work has focused on the lexico-semiotic connection between spontaneously co-produced gestures and speech (e.g., de Ruiter, 2007; Kita & Özyürek, 2003; Krauss, Chen & Gottesman, 2000). Due to the rather precise timing of the prosodic peak in speech with the
most prominent stroke of the gesture phrase in production, Schegloff (1984) and Krauss, Morrel-Samuels
and Colasante (1991; also Rauscher, Krauss & Chen, 1996), among others, coined the term lexical affiliation for this phenomenon. Following Krauss et al. (1991), the first empirical study of this dissertation investigates
the nature of the semiotic relation between speech and gestures, focusing on its applicability to temporal
perception and comprehension. When speech and lip movements diverge too far from their original production synchrony, this can be highly irritating to the viewer, even when audio and video stem from the same original recording (e.g., Vatakis, Navarra, Soto-Faraco & Spence, 2008; Feyereisen, 2007). There is only a small temporal window of audiovisual integration (AVI) within which viewer-listeners can internally reconcile discrepancies between lip movements and the speech these movements supposedly produce (e.g., McGurk & MacDonald, 1976). Several studies in psychophysics (e.g., Nishida, 2006;
Fujisaki & Nishida, 2005) found that there is also a time window for the perceptual alignment of nonspeech
visual and auditory signals. These and further studies on the AVI of speech-lip asynchronies have
inspired research on the perception of speech-gesture utterances. McNeill, Cassell, and McCullough
(1994; Cassell, McNeill & McCullough, 1999), for instance, discovered that listeners take up information
even from artificially combined speech and gestures. More recent studies of the AVI of speech and gestures have employed methods such as eye tracking and event-related potentials (ERPs) to investigate the perception of multimodal utterances (e.g., Gullberg & Holmqvist, 1999; 2006; Özyürek,
Willems, Kita & Hagoort, 2007; Habets, Kita, Shao, Özyürek & Hagoort, 2011). While the aforementioned
studies from the fields of psychophysics and speech-only and speech-gesture research have contributed
greatly to theories of how listeners perceive multimodal signals, explorations of natural data and of dyadic situations have been scarce. This dissertation investigates the perception of naturally produced speech-gesture utterances by having participants rate the naturalness of synchronous and asynchronous versions of such utterances, using different qualitative and quantitative methods such as an online rating study and a preference task. Drawing on speech-gesture production
models based on Levelt's (1989) model of speech production (e.g., de Ruiter, 1998; 2007; Krauss et al.,
2000; Kita & Özyürek, 2003), and building on the results and analyses of the studies conducted for this dissertation, I finally propose a draft model of a possible transmission cycle between Growth Point (e.g.,
McNeill, 1985; 1992) and Shrink Point, the perceptual counterpart to the Growth Point. This model
includes the temporal and semantic alignment of speech and different gesture types as well as their
audiovisual and conceptual integration during perception. The perceptual studies conducted within the
scope of this dissertation have revealed varying temporal ranges within which listeners can integrate an asynchrony in speech-gesture utterances, especially those containing iconic gestures.