Spatial terms such as “above”, “in front of”, and “on the left of” are essential for describing the location of one object relative to another in everyday communication. Apprehending such spatial relations involves relating linguistic representations to object representations by means of attention. This requires at least one attentional shift, and models such as the Attentional Vector Sum (AVS) predict the direction of that shift: from the sausage to the box for a spatial utterance such as “The box is above the sausage”. To the extent that this prediction generalizes to overt gaze shifts, a listener’s visual attention should shift from the sausage to the box. However, listeners tend to rapidly look at referents in their order of mention and even anticipate them based on linguistic cues, a behavior that predicts the converse attentional shift, from the box to the sausage. Four eye-tracking experiments assessed the role of overt attention in spatial language comprehension by examining to what extent visual attention is guided by the words in the utterance and to what extent it also shifts “against the grain” of the unfolding sentence. The results suggest that comprehenders’ visual attention is predominantly guided by their interpretation of the spatial description. Gaze shifts against the grain occurred only when comprehenders had some extra time, and their absence did not affect comprehension accuracy. However, the timing of this reverse gaze shift on a trial correlated with that trial’s verification time. Thus, while the timing of these gaze shifts is subtly related to verification time, their presence is not necessary for successful verification of spatial relations.