Action and language learning in robotics requires flexible methods, since it is not possible to predetermine all the tasks a robot will be involved in. Future systems need to be able to acquire this knowledge through communication with humans. Children are able to learn new actions even though they have limited experience with the events they observe. More specifically, they seem to be able to identify which parts of an action are relevant and to adapt this newly acquired knowledge to new situations. Typically this does not happen in isolation but in interaction with an adult. In these interactions, multiple modalities are used concurrently and redundantly. Research on child development has shown that the temporal relations of events in the acoustic and visual modalities have a significant impact on how this information is processed. Specifically, synchrony between action and language has been assumed to be beneficial for finding relevant parts and extracting first knowledge from action demonstrations. This idea was proposed by Hirsh-Pasek and Golinkoff (1996) as acoustic packaging. They suggest that acoustic information, typically in the form of narration, overlaps with action sequences and provides infants with a bottom-up guide to attend to relevant parts and to find structure within them. The central contribution of this thesis is the conception, further development, and implementation of a model inspired by the general idea of acoustic packaging. The resulting model is able to segment action demonstrations into multimodal units called acoustic packages. These units facilitate measuring the level of structuring in action demonstrations. In addition to action segmentation, the acoustic packaging system is able to flexibly integrate additional sensory cues to acquire first knowledge about the content of action demonstrations.
Furthermore, the system was designed to process input online, which enables it to provide feedback to users interacting with a robot. The model of acoustic packaging was evaluated on a corpus of adult-adult and adult-child interactions within a cup stacking scenario. The analyses focus on differences between the structure of child-directed and adult-directed interactions as well as on developmental trends that are reflected in the statistical properties of acoustic packages. In addition to adult-child interaction, results on a corpus from a similar scenario with a simulated robot are presented. The results indicate that adult-robot interaction exhibits a structure similar to that of adult-child interaction. Furthermore, tests on the iCub robot showed that semantic information on color terms can be extracted from acoustic packages. These results were supported by further analysis of adult-child interactions, which confirmed that a substantial amount of semantic information can be gathered by exploiting this connection. With a view to continuous interaction between a tutor and a learning robot, acoustic packages provide an initial representation of action structure in interaction.
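
The general idea of bundling narration with temporally overlapping action segments can be illustrated with a minimal sketch. It is not the implementation described in this thesis: the interval representation, the strict-overlap criterion, and the names `AcousticPackage`, `overlaps`, and `build_packages` are assumptions made purely for illustration, and the sketch works offline on complete lists rather than online.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical representation: an interval is (start, end) in seconds.
Interval = Tuple[float, float]


@dataclass
class AcousticPackage:
    """Illustrative unit: one speech interval plus the motion segments it overlaps."""
    speech: Interval
    motions: List[Interval] = field(default_factory=list)


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True if the two intervals share a stretch of non-zero length."""
    return a[0] < b[1] and b[0] < a[1]


def build_packages(speech_intervals: List[Interval],
                   motion_segments: List[Interval]) -> List[AcousticPackage]:
    """Group each speech interval with all motion segments that overlap it in time.

    Assumption of this sketch: speech without any overlapping motion
    does not form a package.
    """
    packages = []
    for speech in sorted(speech_intervals):
        bundled = [m for m in sorted(motion_segments) if overlaps(speech, m)]
        if bundled:
            packages.append(AcousticPackage(speech, bundled))
    return packages


# Toy data: one utterance spans two motion segments, a second utterance spans none.
speech = [(0.0, 2.0), (5.0, 6.0)]
motion = [(0.5, 1.0), (1.5, 2.5), (3.0, 4.0)]
packages = build_packages(speech, motion)
```

In this toy example the first utterance is packaged with the two motion segments it overlaps, while the second utterance and the third motion segment remain unpackaged. Counting packages or the motion segments per package is one simple way such units could be turned into descriptive statistics.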