Even if only the acoustic channel is considered, human communication is
highly multi-modal. Non-lexical cues convey a variety of information, such as
emotion or agreement. The ability to process such cues is highly relevant for
spoken dialog systems, especially in assistance systems. In this paper we focus
on the recognition of non-lexical confirmations such as "mhm", as they enhance
the system's ability to accurately interpret human intent in natural
communication. The architecture uses a Support Vector Machine to detect
confirmations based on acoustic features. In a systematic comparison, several
feature sets were evaluated for their performance on a corpus of human-agent
interaction in a setting with naive users, including elderly and cognitively
impaired people. Our results show that using stacked formants as features
yields an accuracy of 84%, outperforming regular formants as well as MFCC- and
pitch-based features for online classification.
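The stacked-formant idea, concatenating each frame's formant values with those of its neighboring frames to give the classifier temporal context, can be sketched as follows. This is a minimal illustration on synthetic data: the stack width, the SVM kernel, and the formant distributions are assumptions for demonstration, not the paper's actual configuration.

```python
# Sketch: SVM classification of "confirmation" vs. "other" frames using
# stacked formant features. Synthetic data; parameters are illustrative.
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def stack_frames(frames, width=5):
    """Concatenate each frame with its neighbors (temporal stacking)."""
    pad = width // 2
    padded = np.pad(frames, ((pad, pad), (0, 0)), mode="edge")
    return np.stack([padded[i:i + width].ravel()
                     for i in range(len(frames))])

# Synthetic per-frame formants (F1..F3 in Hz) for two assumed classes.
n = 200
conf = rng.normal([350, 1400, 2500], 60, size=(n, 3))   # "mhm"-like frames
other = rng.normal([500, 1700, 2700], 60, size=(n, 3))  # other speech
X = stack_frames(np.vstack([conf, other]))               # shape (400, 15)
y = np.array([1] * n + [0] * n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
print(f"held-out accuracy: {acc:.2f}")
```

Stacking widens each 3-dimensional formant frame into a 15-dimensional vector (five frames), which lets a frame-level classifier exploit short-term temporal dynamics without a sequence model.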