When humans describe the shape of objects, they often use iconic gestures to depict what they want to convey to a listener. Gesturing allows them to express spatial concepts directly in the spatial medium and thus provides an important communicative resource for spatial language. To exploit this resource in language comprehension systems, the composite signal conveyed in the two media has to be re-integrated into a common, unified meaning. In a corpus study, we examined the morphological variety of shape-related iconic gestures and the kind of shape information they express. We distinguish four sub-types of iconic gestures and show that the most frequent type, called dimensional gestures, and the lexical affiliates they co-occur with convey information about an object’s spatial extent, the course of its boundary, and the spatial relations between object parts. An analysis of the verbal utterances shows that adjectives and nouns predominate among the lexical affiliates in our scenario. Based on these empirical results, we propose a computational model for the representation and processing of multimodal shape descriptions.
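To make the re-integration step concrete, the sketch below shows one possible representation of a multimodal shape description, in which a gesture's spatial features (extent, boundary course, part relations) are unified with the lexical affiliate it accompanies. This is a minimal illustrative assumption for exposition; all class and field names are hypothetical and do not reflect the formalism actually proposed in the paper.

```python
from dataclasses import dataclass, field

# Hypothetical sketch only: these types are illustrative assumptions,
# not the representation proposed in the paper.

@dataclass
class GestureShapeFeatures:
    """Shape information recoverable from a dimensional gesture."""
    extent: tuple[float, float, float] | None = None   # rough extent along x/y/z
    boundary_course: str | None = None                  # e.g. "round", "angular"
    part_relations: list[str] = field(default_factory=list)  # e.g. "handle-on-side"

@dataclass
class MultimodalShapeDescription:
    """Composite signal: a lexical affiliate plus co-occurring gesture features."""
    lexical_affiliate: str      # e.g. the adjective "round" or the noun "bar"
    pos: str                    # part of speech; adjectives and nouns predominate
    gesture: GestureShapeFeatures

    def unified_meaning(self) -> dict:
        # Re-integrate the information from both media into one structure.
        return {
            "lexeme": self.lexical_affiliate,
            "pos": self.pos,
            "extent": self.gesture.extent,
            "boundary": self.gesture.boundary_course,
            "part_relations": self.gesture.part_relations,
        }

# Example: "round" spoken while tracing a circular outline in the air.
desc = MultimodalShapeDescription(
    lexical_affiliate="round",
    pos="ADJ",
    gesture=GestureShapeFeatures(extent=(0.4, 0.4, 0.0), boundary_course="round"),
)
print(desc.unified_meaning())
```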