Sensory information about the state of the world is generally ambiguous. Understanding how the nervous system resolves such ambiguities to infer the actual state of the world is a central quest for sensory neuroscience. However, the computational principles of perceptual disambiguation are still poorly understood: What drives perceptual decision-making between multiple equally valid solutions? Here we investigate how humans gather and combine sensory information, within and across modalities, to disambiguate motion perception in an ambiguous audiovisual display, where two moving stimuli could appear as either streaming through or bouncing off each other. By combining psychophysical classification tasks with reverse correlation analyses, we identified the particular spatiotemporal stimulus patterns that elicit a stream or a bounce percept, respectively. From that, we developed and tested a computational model for uni- and multi-sensory perceptual disambiguation that closely replicates human performance. Specifically, disambiguation relies on knowledge of prototypical bouncing events that contain characteristic patterns of motion energy in the dynamic visual display. This visual information is then linearly integrated with auditory cues and with prior knowledge about the history of recent perceptual interpretations. Moreover, we demonstrate that perceptual decision-making with ambiguous displays is systematically driven by noise, whose random patterns not only promote alternation, but also provide signal-like information that biases perception in a highly predictable fashion.
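
The logic of the approach can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the model reported here: the logistic link, the weights `w_vis`, `w_aud` and `w_hist`, the one-dimensional Gaussian "bounce template", and the function names `simulate_observer` and `classification_image` are all hypothetical choices made for illustration. A simulated observer compares each noisy stimulus with the template, linearly adds an auditory cue and the previous response, and reports "bounce" or "stream"; reverse correlation (mean noise on bounce trials minus mean noise on stream trials) then recovers an estimate of the template from the responses alone.

```python
import numpy as np


def simulate_observer(noise, sound, template,
                      w_vis=1.0, w_aud=0.5, w_hist=0.3, rng=None):
    """Simulate binary responses (1 = bounce, 0 = stream), one per trial.

    All weights and the logistic link are illustrative assumptions.
    noise:    (n_trials, n_samples) array of stimulus noise
    sound:    (n_trials,) array, 1 if a sound was presented, else 0
    template: (n_samples,) hypothetical prototype of a bounce event
    """
    rng = np.random.default_rng() if rng is None else rng
    n_trials = noise.shape[0]
    responses = np.zeros(n_trials, dtype=int)
    prev = 0.0  # previous response coded -1 (stream) / +1 (bounce); 0 at start
    for t in range(n_trials):
        visual_evidence = noise[t] @ template          # template match
        decision_variable = (w_vis * visual_evidence   # linear integration
                             + w_aud * sound[t]
                             + w_hist * prev)
        p_bounce = 1.0 / (1.0 + np.exp(-decision_variable))
        responses[t] = int(rng.random() < p_bounce)
        prev = 1.0 if responses[t] == 1 else -1.0
    return responses


def classification_image(noise, responses):
    """Reverse correlation: mean noise on bounce trials minus stream trials."""
    return (noise[responses == 1].mean(axis=0)
            - noise[responses == 0].mean(axis=0))


# Demo with synthetic data (all values are arbitrary illustration choices).
rng = np.random.default_rng(0)
n_trials, n_samples = 5000, 50
x = np.arange(n_samples)
template = np.exp(-0.5 * ((x - 25) / 5.0) ** 2)   # hypothetical template
noise = rng.standard_normal((n_trials, n_samples))
sound = rng.integers(0, 2, size=n_trials)

responses = simulate_observer(noise, sound, template, rng=rng)
kernel = classification_image(noise, responses)
similarity = np.corrcoef(kernel, template)[0, 1]
print(f"correlation between recovered kernel and template: {similarity:.2f}")
```

In this toy setting the recovered kernel resembles the template because the noise itself carries template-like structure on some trials, which is the sense in which random patterns can act as signal-like information.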