Most theories and models of language acquisition so far have adopted a ‘mapping’ paradigm, according to which novel words or constructions are ‘mapped’ onto existing concepts that are either innate or previously acquired. Departing from this mapping approach, the thesis develops a computational model of the co-emergence of linguistic and conceptual structures, with a particular focus on action verbs. The model is inspired by emergentist theories of language acquisition and transfers their underlying ideas to the domain of action learning. The emergentist cross-modal learning process spells out how a learner can distill the essence of the meaning of a verbal construction through incremental generalization of the meaning of action verbs, starting from a meaning that is specific to a particular situation in which the verb has been encountered. The meaning of an action verb is understood as evoking a grounded simulation rather than a static concept. We show that cross-modal learning can provide an advantage over uni-modal models, especially when observations are ambiguous and hard to differentiate. The connection between the theoretical foundation and the technical implementation is bidirectional. On the one hand, the technical properties of the model, such as its fully incremental and data-driven approach to learning concepts in one modality that are grounded in semantically similar concepts of another modality, are relevant to many technical applications in human-computer interaction and robotics. On the other hand, technical implementations that are closely based on key concepts of theoretical frameworks can also serve as computer-implemented theories that allow other researchers to interactively test hypotheses arising from the theoretical foundation. The thesis details connections to ongoing interdisciplinary research in linguistics and the cognitive sciences.