Abstract
We present a new paradigm for real-time object-oriented SLAM
with a monocular camera. Unlike previous approaches that rely
on object-level models, we construct category-level models
from CAD collections, which are now
widely available. To alleviate the need for huge amounts of
labeled data, we develop a rendering pipeline that enables
synthesis of large datasets from a limited amount of manually
labeled data. Using data thus synthesized, we learn category-level
models for object deformations in 3D, as well as discriminative
object features in 2D. These category models are
instance-independent and aid in the design of object landmark
observations that can be incorporated into a generic monocular
SLAM framework. Whereas typical object-SLAM approaches
solve only for object and camera poses, we also estimate
object shape on-the-fly, allowing for a wide range of objects
from the category to be present in the scene. Moreover,
since our 2D object features are learned discriminatively, the
proposed object-SLAM system succeeds in several scenarios
where sparse feature-based monocular SLAM fails due to
insufficient features or parallax. The proposed category models
also help in object instance retrieval, which is useful for augmented
reality (AR) applications. We evaluate the proposed framework
on multiple challenging real-world scenes and show, to the
best of our knowledge, the first results of an instance-independent
monocular object-SLAM system and the benefits it enjoys over
feature-based SLAM methods.