Speech understanding requires the ability to parse spoken utterances into words. This ability is not innate, however, and must be developed by infants within the first years of life. So far, almost all computational speech processing systems have neglected this bootstrapping process. Here we propose a model of early infant speech structure acquisition, implemented as a layered architecture comprising phones, syllables, and words. Our model takes raw acoustic speech as input and aims to capture its structure, unsupervised, at different levels of granularity.
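To make the layered organisation concrete, the following is a minimal sketch, assuming hypothetical names (Layer, run_pipeline) and toy symbol sequences in place of raw acoustic features; it only illustrates the bottom-up grouping of units into phone-, syllable-, and word-level structure, not the actual model.

```python
# Minimal illustrative sketch of a layered, unsupervised speech-structure pipeline.
# Toy symbol sequences stand in for raw acoustic input; names are hypothetical.
from collections import Counter
from typing import List, Sequence, Tuple


class Layer:
    """One level of granularity: groups recurring input patterns into units."""

    def __init__(self, name: str, chunk_size: int):
        self.name = name
        self.chunk_size = chunk_size          # lower-level units per candidate unit
        self.inventory: Counter = Counter()   # unit inventory learned without supervision

    def process(self, sequence: Sequence[str]) -> List[str]:
        # Chunk the incoming unit sequence and count recurring chunks; frequent
        # chunks become units of this layer (a crude stand-in for the coupled
        # bootstrapping processes described in the text).
        chunks = [
            "".join(sequence[i:i + self.chunk_size])
            for i in range(0, len(sequence) - self.chunk_size + 1, self.chunk_size)
        ]
        self.inventory.update(chunks)
        return chunks


def run_pipeline(frames: Sequence[str]) -> List[Tuple[str, List[str]]]:
    """Pass a toy frame sequence bottom-up through phone, syllable, and word layers."""
    layers = [Layer("phones", 2), Layer("syllables", 2), Layer("words", 2)]
    outputs = []
    units: Sequence[str] = frames
    for layer in layers:
        units = layer.process(units)
        outputs.append((layer.name, units))
    return outputs


if __name__ == "__main__":
    toy_frames = list("babamamababa")   # toy "acoustic frames"
    for name, units in run_pipeline(toy_frames):
        print(name, units)
```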
Most previous models have assumed various kinds of innate, language-specific predispositions. We drop such implausible assumptions and propose a developmentally plausible model. We condense findings from developmental psychology into a few basic principles that our model aims to reflect at a functional level. In this way, the proposed model learns the structure of speech through a multitude of coupled, self-regulated bootstrapping processes.
We evaluate our model on speech corpora that share some properties of infant-directed speech. To further validate our approach, we outline how the proposed model integrates into an embodied, multi-modal learning and interaction framework running on Honda's ASIMO robot. Finally, we propose an integrated model of speech structure and imitation learning through interaction that enables the robot to learn to speak with its own voice.