In this paper we propose an architecture of an image understanding system for a situated
artificial communicator realizing human-machine interaction. Starting with sensor input the
processing is initially carried out in separate pathways using different schemes of image segmentation.
Subsequently, a hybrid technique for 2D-object recognition is employed. The final
model-based 3D-reconstruction yields a 3D-scene representation. Intermediate results are linked
over time in memory modules to enhance the efficiency of processing image sequences. Results
for the individual modules are presented and discussed.
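
The data flow outlined above can be illustrated with a minimal sketch. It is not the system described in this paper: the two segmentation schemes (a grey-level threshold and a horizontal intensity step), the labels, the `MODEL_HEIGHT` table, and all function and class names are hypothetical placeholders chosen only to show how separate segmentation pathways, a hybrid 2D recognition step, a model-based 3D step, and a memory module linked over time could fit together.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple

Image = List[List[int]]                  # grey-level image as rows of pixel values
Pixels = Set[Tuple[int, int]]            # set of (row, column) coordinates
Pose = Tuple[float, float, float]        # (row, column, height) of a scene object
Scene = Tuple[str, Optional[Pose]]       # (2D label, 3D pose or None)


def segment_regions(image: Image, threshold: int = 128) -> Pixels:
    """Pathway 1 (assumed scheme): pixels at or above a grey-level threshold."""
    return {(r, c) for r, row in enumerate(image)
            for c, v in enumerate(row) if v >= threshold}


def segment_edges(image: Image, min_step: int = 64) -> Pixels:
    """Pathway 2 (assumed scheme): pixels whose right neighbour differs strongly."""
    return {(r, c) for r, row in enumerate(image)
            for c in range(len(row) - 1)
            if abs(row[c + 1] - row[c]) >= min_step}


def recognize_2d(regions: Pixels, edges: Pixels) -> str:
    """Hybrid step: combine both segmentation results into one 2D label."""
    if not regions:
        return "background"
    return "bounded_object" if edges else "uniform_surface"


# Hypothetical object models: 2D label -> assumed object height.
MODEL_HEIGHT = {"bounded_object": 2.0, "uniform_surface": 0.0}


def reconstruct_3d(label: str, regions: Pixels) -> Optional[Pose]:
    """Model-based step: region centroid plus the height stored in the model."""
    if label not in MODEL_HEIGHT:
        return None
    n = len(regions)
    row = sum(r for r, _ in regions) / n
    col = sum(c for _, c in regions) / n
    return (row, col, MODEL_HEIGHT[label])


@dataclass
class SceneMemory:
    """Memory module: keeps the last frame and its result across time steps."""
    last_frame: Optional[Image] = None
    last_scene: Optional[Scene] = None
    reused: int = 0


def process_sequence(frames: List[Image], memory: SceneMemory) -> List[Scene]:
    """Run the pipeline on each frame, reusing results for unchanged frames."""
    scenes: List[Scene] = []
    for frame in frames:
        if memory.last_frame is not None and frame == memory.last_frame:
            memory.reused += 1           # unchanged input: skip recomputation
        else:
            regions = segment_regions(frame)
            edges = segment_edges(frame)
            label = recognize_2d(regions, edges)
            memory.last_scene = (label, reconstruct_3d(label, regions))
            memory.last_frame = frame
        scenes.append(memory.last_scene)
    return scenes


frame_a = [[0, 0, 0], [0, 200, 0], [0, 0, 0]]
frame_b = [[0, 0], [0, 0]]
memory = SceneMemory()
scenes = process_sequence([frame_a, frame_a, frame_b], memory)
```

In this sketch the memory module only detects an identical consecutive frame and reuses the stored scene, which is the simplest way intermediate results linked over time could save computation on an image sequence.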