We propose a commonsense theory of space and motion for the high-level semantic interpretation of dynamic scenes. The theory provides primitives for commonsense representation and reasoning with qualitative spatial relations, depth profiles, and spatio-temporal change; these may be combined with probabilistic methods for modelling and hypothesising event and object relations. The proposed framework has been implemented as a general activity abstraction and reasoning engine, which we demonstrate by generating declaratively grounded visuo-spatial narratives of perceptual input from vision and depth sensors for a benchmark scenario. Our long-term goal is to provide general tools (integrating different aspects of space, action, and change) necessary for tasks such as real-time human activity interpretation and dynamic sensor control within the purview of cognitive vision, interaction, and control.
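To give a concrete flavour of the kinds of primitives involved, the sketch below illustrates qualitative spatial relations of the sort the theory reasons with: an RCC-5-style topological relation over axis-aligned bounding boxes (as produced by a vision sensor) and a qualitative depth-order relation over mean depth values. The box-based approximation, the `eps` threshold, and all identifiers are illustrative assumptions for exposition, not the paper's actual formalisation.

```python
# Illustrative sketch only: RCC-5-style topology over bounding boxes and a
# qualitative depth ordering. Names and thresholds are assumptions, not the
# paper's formalisation.
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned 2D region, e.g. an object's bounding box from a detector."""
    x1: float
    y1: float
    x2: float
    y2: float


def topology(a: Box, b: Box) -> str:
    """Coarse RCC-5-like topological relation between two boxes."""
    if a.x2 <= b.x1 or b.x2 <= a.x1 or a.y2 <= b.y1 or b.y2 <= a.y1:
        return "DR"   # discrete: disjoint or merely touching
    if a == b:
        return "EQ"   # equal regions
    if a.x1 >= b.x1 and a.y1 >= b.y1 and a.x2 <= b.x2 and a.y2 <= b.y2:
        return "PP"   # a is a proper part of b
    if b.x1 >= a.x1 and b.y1 >= a.y1 and b.x2 <= a.x2 and b.y2 <= a.y2:
        return "PPi"  # b is a proper part of a
    return "PO"       # partial overlap


def depth_order(da: float, db: float, eps: float = 0.05) -> str:
    """Qualitative depth relation from mean sensor depth (in metres)."""
    if da + eps < db:
        return "in_front_of"
    if db + eps < da:
        return "behind"
    return "same_depth"


# Example: a person's box partially overlapping, and in front of, a table's box.
person, table = Box(2, 1, 4, 5), Box(3, 2, 7, 4)
print(topology(person, table), depth_order(1.8, 2.6))  # PO in_front_of
```

Tracking how such relations change between frames (e.g. DR to PO, or a reversal of depth order) is the kind of spatio-temporal change from which higher-level event hypotheses can be abstracted.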