Understanding human behavior is a key skill for intelligent systems that share physical and emotional spaces with humans. One of the main challenges to this end is enabling such systems to make accurate predictions of human motion. This is a difficult task because human motion is influenced by a large variety of internal and external stimuli, such as a person's own actions, the presence and actions of surrounding agents, the social relations, rules and norms between them, or the environment with its topology, geometry, semantics and affordances.
This thesis systematically addresses human motion prediction for autonomous systems by surveying the field, the different requirements of the prediction task, problem formulations and solution classes, and its application domains. Reviewing three decades of prior research from different communities, this thesis proposes a unifying taxonomy for motion prediction methods based on the modeling approach and level of contextual information used, and provides a review of the existing datasets and performance metrics. Furthermore, it discusses limitations of the state of the art and outlines directions for further research.
Predicting human motion in complex, dynamic and cluttered environments is particularly challenging due to the high level of contextual awareness required. Acquiring, representing and incorporating a large variety of contextual cues is still an open challenge, which is why this thesis also makes several methodological contributions. We present a planning-based approach that accounts for maps of obstacles and local interactions with social grouping constraints. This method accommodates many desired properties, such as predicting for an arbitrary number of observed people, estimating multi-modal probability distributions, reasoning over intentions, and supporting semantic map input. Apart from reaching state-of-the-art performance, this single method bridges the gap between short-term motion prediction, where social interaction is the most informative cue, and long-term prediction, where goal-orientation and obstacle geometry typically determine people’s motion trajectories.
Along the same lines, and in addition to contextual cues of the dynamic environment and the topometric map, semantic information about the environment is a highly informative cue for motion prediction. We address the less explored problem of predicting collision risks by inferring occupancy priors of human motion using only semantic maps as input. The proposed method, based on Convolutional Neural Networks, shows superior performance over the state of the art and demonstrates a novel way to use and apply semantics for the prediction task.
Datasets that contain relevant qualities and quantities of difficulty are critical for benchmarking autonomous systems in general and for motion prediction in particular. Surprisingly, the commonly used datasets are rather limited in that they typically consider simple to almost trivial scenarios, contain few contextual cues and partly suffer from annotation issues. To address these issues, this thesis proposes a weakly-scripted data collection protocol for recording diverse and accurate trajectories of people and robots in interactive scenarios. The protocol includes social roles with simple instructions for the participants, dynamically-allocated goals, group motion and varied obstacle positioning. The data, recorded according to the introduced collection protocol, is used in a motion prediction benchmark designed for thorough performance evaluation in a variety of experiments: accuracy conditioned on several key factors (e.g. prediction horizon, observation length), knowledge transfer to a new environment, and robustness against perception noise.
The results presented in this thesis are relevant for a broad range of prediction problems with applications in robotics, autonomous driving or video surveillance. With the first systematic taxonomy of prediction approaches, new experiments for benchmarking and novel methods that account for particularly rich contextual cues, we contribute to the field by fostering cross-domain exchange and comparison, and by laying the foundations for various directions of future research.
Örebro: Örebro University, 2021. p. 197
2021-09-03, Örebro universitet, Långhuset, Hörsal L2, Fakultetsgatan 1, Örebro, 09:00 (English)
The printed thesis names a different opponent; because that person was unable to attend, the opponent was replaced by the one named above.