Vision-based Perception For Autonomous Robotic Manipulation
2021 (English). Doctoral thesis, monograph (Other academic)
Abstract [en]
In order to operate safely and effectively in real-world unstructured environments where a priori knowledge of the surroundings is not available, robots must have adequate perceptual capabilities. This thesis is concerned with several important aspects of vision-based perception for autonomous robotic manipulation. With a focus on scene reconstruction, object pose estimation and grasp configuration generation, we aim to help robots better understand their surroundings, avoid undesirable contacts with the environment and accurately grasp selected objects.
With the wide availability of affordable RGB-D cameras, research on visual SLAM (Simultaneous Localization and Mapping) and scene reconstruction has made rapid progress. Registration is a key element of any RGB-D reconstruction system, and a large number of registration algorithms have been proposed in the context of RGB-D Tracking and Mapping (TAM). The state-of-the-art methods rely on color and depth information to track camera poses. Besides depth and color images, semantic information is now often available thanks to advances in deep-learning-based image segmentation. We are interested in exploring to what extent semantic cues can increase the robustness of camera pose tracking. This leads to the first contribution of this dissertation: a method for reliable camera tracking that uses an objective function combining geometric, appearance, and semantic cues with adaptive weights.
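As a rough illustration of such a combined objective (not the exact formulation used in the thesis), the sketch below sums point-to-plane geometric, photometric and semantic-consistency residuals under per-term weights; the function names, residual choices and weighting scheme are assumptions made for illustration only.

```python
# Minimal sketch, assuming correspondences between the current frame and the
# model have already been established. Not the thesis implementation.
import numpy as np

def combined_residual(T, src_pts, src_colors, src_labels,
                      dst_pts, dst_normals, dst_colors, dst_label_probs,
                      w_geo, w_rgb, w_sem):
    """Weighted residual for a candidate camera pose T (4x4 homogeneous matrix)."""
    # Transform source points into the model frame.
    p = (T[:3, :3] @ src_pts.T).T + T[:3, 3]

    # Geometric cue: point-to-plane distance to the associated model points.
    r_geo = np.sum((p - dst_pts) * dst_normals, axis=1) ** 2

    # Appearance cue: color difference between frame and model.
    r_rgb = np.sum((src_colors - dst_colors) ** 2, axis=1)

    # Semantic cue: penalize correspondences whose model label distribution
    # assigns low probability to the label predicted for the source pixel.
    r_sem = -np.log(dst_label_probs[np.arange(len(src_labels)), src_labels] + 1e-6)

    # In an adaptive scheme the weights would be re-estimated from the residual
    # statistics at each iteration; here they are simply passed in.
    return np.sum(w_geo * r_geo + w_rgb * r_rgb + w_sem * r_sem)
```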
Beyond the purely geometric model of the environment produced by classical reconstruction systems, including rich semantic information and the 6D poses of object instances in a dense map helps robots operate effectively and interact with objects. Therefore, the second contribution of this thesis is an approach for recognizing the objects present in a scene and estimating their full pose by means of an accurate 3D semantic reconstruction. Our framework simultaneously runs a 3D mapping algorithm that reconstructs a semantic model of the environment and an incremental 6D object pose recovery algorithm that carries out predictions using the reconstructed model. We demonstrate that multiple viewpoints of the same object can be exploited to achieve robust and stable 6D pose estimation in the presence of heavy clutter and occlusion.
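The sketch below shows one simple way such incremental, multi-view pose recovery could be organized: per-frame pose estimates of a single object instance, expressed in the map frame, are fused with confidence weights as new viewpoints arrive. This is a hypothetical sketch under assumed names, not the algorithm developed in the thesis.

```python
# Illustrative only: fusing per-frame (quaternion, translation) estimates of one
# object instance with per-frame confidences.
import numpy as np

class IncrementalPoseEstimate:
    def __init__(self):
        self.quats, self.trans, self.weights = [], [], []

    def update(self, quat, t, confidence):
        quat = np.asarray(quat, float)
        # Keep quaternions in the same hemisphere before averaging.
        if self.quats and np.dot(quat, self.quats[0]) < 0:
            quat = -quat
        self.quats.append(quat)
        self.trans.append(np.asarray(t, float))
        self.weights.append(float(confidence))

    def current_pose(self):
        # Confidence-weighted average of all observations so far.
        w = np.asarray(self.weights)[:, None]
        q = np.sum(w * np.asarray(self.quats), axis=0)
        q /= np.linalg.norm(q)                      # re-normalize the averaged rotation
        t = np.sum(w * np.asarray(self.trans), axis=0) / np.sum(w)
        return q, t
```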
Methods taking RGB-D images as input have achieved state-of-the-art performance on the object pose estimation task. However, in a number of cases color information may not be available, for example when the input is point cloud data from laser range finders or industrial high-resolution 3D sensors. Therefore, besides methods using RGB-D images, methods that recover the 6D pose of rigid objects from 3D point clouds containing only geometric information are needed. The third contribution of this dissertation is a novel deep learning architecture for estimating the 6D poses of multiple rigid objects in a cluttered scene, using only a 3D point cloud of the scene as input. The proposed architecture pools geometric features together using a self-attention mechanism and adopts a deep Hough voting scheme for pose proposal generation. We show that exploiting the correlation between the poses of object instances and object parts improves the performance of object pose estimation.
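For concreteness, the following PyTorch sketch combines the two ingredients named above, self-attention over per-point geometric features and per-point Hough votes for object centers and poses, in a stripped-down form. Layer sizes, the quaternion rotation output and the downstream clustering of votes are illustrative assumptions, not the proposed architecture.

```python
# Hypothetical sketch of a self-attention + deep Hough voting head.
import torch
import torch.nn as nn

class AttentionVoteHead(nn.Module):
    def __init__(self, feat_dim=128, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        # Each point predicts a 3D offset to an object center, a rotation
        # (a quaternion here, for brevity) and an objectness score.
        self.vote_mlp = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, 3 + 4 + 1))

    def forward(self, xyz, feats):
        # xyz: (B, N, 3) point coordinates; feats: (B, N, C) per-point features.
        ctx, _ = self.attn(feats, feats, feats)     # pool context across the scene
        out = self.vote_mlp(ctx)
        centers = xyz + out[..., :3]                # Hough votes: predicted object centers
        quats = nn.functional.normalize(out[..., 3:7], dim=-1)
        scores = out[..., 7]
        return centers, quats, scores               # votes are then clustered into pose proposals
```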
By applying a 6D object pose estimation algorithm, robots can grasp known objects for which a 3D model is available and a grasp database is pre-defined. What if we want to grasp novel objects? The fourth contribution of this thesis is a method for robust manipulation of novel objects in cluttered environments. We develop an end-to-end deep learning approach for generating grasp configurations for a two-finger parallel jaw gripper, based on 3D point cloud observations of the scene. The proposed model generates candidates by casting votes to accumulate evidence for feasible grasp configurations. We exploit contextual information by encoding the dependencies between objects in the scene into features to boost the performance of grasp generation.
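The sketch below illustrates only the output side of such a voting scheme under assumed names: per-point grasp votes (center, approach direction, gripper width, score) are greedily clustered into a ranked list of grasp candidates. The actual network and grasp parameterization in the thesis may differ.

```python
# Assumed, simplified post-processing of grasp votes; names and the clustering
# radius are illustrative.
import numpy as np

def cluster_grasp_votes(centers, approach_dirs, widths, scores, radius=0.02):
    """Group votes whose predicted grasp centers lie within `radius` meters and
    return one score-weighted grasp per cluster, ranked by accumulated score."""
    order = np.argsort(-scores)
    used = np.zeros(len(centers), dtype=bool)
    grasps = []
    for i in order:
        if used[i]:
            continue
        member = np.linalg.norm(centers - centers[i], axis=1) < radius
        member &= ~used
        used |= member
        w = scores[member][:, None]
        a = np.sum(w * approach_dirs[member], axis=0)
        grasps.append({
            "center": np.average(centers[member], axis=0, weights=scores[member]),
            "approach": a / (np.linalg.norm(a) + 1e-9),
            "width": float(np.average(widths[member], weights=scores[member])),
            "score": float(scores[member].sum()),
        })
    return grasps
```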
Place, publisher, year, edition, pages
Örebro: Örebro University, 2021, p. 140
Series
Örebro Studies in Technology, ISSN 1650-8580; 93
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:oru:diva-95168
ISBN: 9789175294148 (print)
OAI: oai:DiVA.org:oru-95168
DiVA, id: diva2:1605796
Public defence
2021-12-17, Örebro universitet, Långhuset, Hörsal L3, Fakultetsgatan 1, Örebro, 09:00 (English)
Available from: 2021-10-25. Created: 2021-10-25. Last updated: 2021-11-25. Bibliographically approved.