Naming objects is a common task in everyday communication between humans. To localize and identify the objects, the communication partners maintain mental models, which are constantly updated and improved during the communication. Visual and speech information are evaluated jointly for this task. If one of the communication partners references unknown objects, new models must be built. This frequently happens interactively, by showing or demonstrating the objects or by pointing at them.
These capabilities are worthwhile goals for a service robot as well, if it is to act successfully in a weakly restricted scenario such as the household. The user should be able to simply buy a new robot and then show and describe to it any objects encountered during normal use, in a natural and convenient way. In this work a complete system is presented that takes a first step towards interactive, multimodal, vision-based learning of unknown objects.
A scene is observed by a color camera. The user can reference objects with spoken language in a dialog-based fashion, and the system tries to locate and identify the referenced objects. Deictic gestures and completed grasp actions of the user are integrated into the analysis as additional sources of information. In the gesture module, motion and color cues are used to track the hands. Deictic gestures and grasping actions are recognized jointly from speech information and hand trajectories. Additionally, the skin-color detector can be re-initialized at any time with the help of the dialog component.
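The re-initializable skin-color detector can be sketched as a simple chromaticity model fitted to user-confirmed skin pixels. The threshold rule, the normalized-RGB color space, and all function names below are illustrative assumptions, not the system's actual implementation:

```python
# Minimal sketch of a trainable skin-color classifier for hand tracking.
# The dialog component could re-initialize it at any time by refitting
# the model on freshly confirmed skin pixels (all names are hypothetical).
import numpy as np

def fit_skin_model(samples):
    """Fit mean/std of normalized (r, g) chromaticity from skin pixels."""
    samples = np.asarray(samples, dtype=float)
    s = samples.sum(axis=1, keepdims=True)
    rg = samples[:, :2] / np.maximum(s, 1e-9)   # intensity-normalized (r, g)
    return rg.mean(axis=0), rg.std(axis=0) + 1e-6

def is_skin(pixel, model, k=2.5):
    """Classify a pixel as skin if its chromaticity is within k std devs."""
    mean, std = model
    p = np.asarray(pixel, dtype=float)
    rg = p[:2] / max(p.sum(), 1e-9)
    return bool(np.all(np.abs(rg - mean) <= k * std))

# Re-initialization: refit on pixels the user has confirmed as skin.
model = fit_skin_model([[180, 120, 90], [200, 140, 110], [170, 110, 85]])
print(is_skin([190, 130, 100], model))   # pixel close to the training samples
print(is_skin([40, 60, 200], model))     # bluish pixel, rejected
```

Normalizing out intensity makes the classifier somewhat robust to illumination changes, which is one common motivation for chromaticity-based skin models.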
Unknown objects are learned in interaction with the user. The user removes the object from the scene, which allows the system to segment the previously unknown object from the background. In the process, an appearance-based representation of the object is constructed by feature extraction; the features are color and texture histograms as well as graphs built from color regions and their neighborhoods. To remain independent of the background, the search for objects is carried out without a prior segmentation of the scene. Incorrect behavior of the system can be corrected interactively by the user, and the object models involved in the query are improved at the same time. Every successful object recognition automatically leads to a refinement of the learned object models.
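The histogram part of such an appearance-based representation can be sketched as follows: a normalized color histogram over the segmented object pixels, a histogram-intersection similarity for recognition, and a running update after each confirmed recognition. The quantization, the similarity measure, and the update rule are plausible assumptions for illustration, not the paper's exact choices:

```python
# Sketch of an appearance-based object model: a quantized color histogram
# compared by histogram intersection and refined after each confirmed
# recognition (parameters and update rule are illustrative assumptions).
import numpy as np

def color_histogram(pixels, bins=8):
    """Normalized joint histogram over quantized RGB pixel values."""
    q = np.clip(np.asarray(pixels) * bins // 256, 0, bins - 1)
    idx = q[:, 0] * bins * bins + q[:, 1] * bins + q[:, 2]
    h = np.bincount(idx.astype(int), minlength=bins ** 3).astype(float)
    return h / max(h.sum(), 1.0)

def intersection(h1, h2):
    """Histogram intersection similarity in [0, 1]."""
    return float(np.minimum(h1, h2).sum())

def update_model(model_hist, observed_hist, rate=0.1):
    """Blend a confirmed observation into the stored model histogram."""
    h = (1 - rate) * model_hist + rate * observed_hist
    return h / h.sum()

red = color_histogram([[250, 10, 10]] * 50 + [[200, 30, 30]] * 50)
blue = color_histogram([[10, 10, 250]] * 100)
assert intersection(red, red) > intersection(red, blue)
```

Because the histogram is computed only over object pixels obtained from the removal-based segmentation, the stored model stays largely independent of the background, which is what allows recognition later without segmenting the scene first.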