ECCV 2024

1 Istituto Italiano di Tecnologia, Genoa, Italy
2 University of Genoa, Genoa, Italy

We equip an agent with an off-the-shelf Mask R-CNN detector. The agent explores new environments and collects a set of noisy detections using a learned exploration policy. These detections are then used to fine-tune the detector.

Abstract

Object detectors often experience a drop in performance when new environmental conditions are insufficiently represented in the training data. This paper studies how to automatically fine-tune a pre-existing object detector while exploring and acquiring images in a new environment, without human intervention, i.e., in a fully self-supervised fashion. In our setting, an agent initially learns to explore the environment using a pre-trained off-the-shelf detector to locate objects and assign pseudo-labels. Assuming that pseudo-labels for the same object must be consistent across different views, we learn an exploration policy that mines hard samples, and we devise a novel mechanism for producing refined predictions from the consensus among observations. Our approach outperforms the current state of the art and closes the performance gap with a fully supervised setting without relying on ground-truth annotations. We also compare various exploration policies for the agent to gather more informative observations.

Approach

Action loop

Our policy predicts long-term goals for the agent. The agent builds a semantically consistent map of the environment by projecting detected objects into a 3D voxel-map. The voxel-map is projected down onto a top-down view, and a disagreement map is computed by assigning each cell a disagreement score based on one of two measures. The disagreement map is the input to the policy network, which is trained to predict the goal that maximizes the total disagreement.
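
The down-projection and scoring step can be illustrated with a minimal sketch. It assumes each voxel stores a list of per-observation class-probability vectors, and it uses the Shannon entropy of the pooled mean distribution as the disagreement score. That entropy measure is a stand-in chosen for illustration, since the two measures are not specified here. The names `disagreement_map` and `greedy_goal` are hypothetical, and `greedy_goal` is a simple argmax baseline, not the learned policy.

```python
import math
from collections import defaultdict


def disagreement_map(voxel_predictions, grid_size):
    """Build a top-down disagreement grid.

    voxel_predictions: dict mapping (x, y, z) -> list of class-probability
    lists, one per observation of that voxel (assumed data layout).
    grid_size: (width, height) of the top-down grid.
    """
    # Down-project: pool the predictions of all voxels in the same (x, y) column.
    columns = defaultdict(list)
    for (x, y, _z), preds in voxel_predictions.items():
        columns[(x, y)].extend(preds)

    width, height = grid_size
    grid = [[0.0] * width for _ in range(height)]
    for (x, y), preds in columns.items():
        num_classes = len(preds[0])
        mean = [sum(p[c] for p in preds) / len(preds) for c in range(num_classes)]
        # Assumed measure: Shannon entropy of the mean class distribution.
        grid[y][x] = -sum(q * math.log(q) for q in mean if q > 0.0)
    return grid


def greedy_goal(grid):
    """Return the (x, y) cell with the highest disagreement (argmax baseline)."""
    cells = ((x, y) for y, row in enumerate(grid) for x in range(len(row)))
    return max(cells, key=lambda cell: grid[cell[1]][cell[0]])
```

A cell whose observations all agree scores zero, while a column holding conflicting predictions scores higher and becomes a more attractive goal.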

Reprojecting detections onto 2D frames

During exploration, detections are aggregated into the semantic voxel-map. Inconsistencies in the voxel-map are resolved by assigning to each voxel the class with the maximum score among the predictions belonging to that voxel. The semantic voxel-map is then reprojected onto each observation, yielding a set of consistent pseudo-labels. Each pseudo-label is associated with an object instance via a unique identifier and contains a consistent logits vector.
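
The max-score consensus rule can be sketched as follows. This is a simplified illustration: it assumes each voxel holds a list of `(class_id, score)` predictions and that the set of voxels visible from each frame has already been computed from camera pose and depth. The function names and data layout are hypothetical, and instance identifiers and logits vectors are omitted for brevity.

```python
def resolve_voxel(predictions):
    """Pick the class of the highest-scoring prediction for one voxel.

    predictions: list of (class_id, score) tuples (assumed layout).
    Ties are broken in favor of the earliest prediction.
    """
    class_id, _score = max(predictions, key=lambda pred: pred[1])
    return class_id


def consensus_map(voxel_predictions):
    """Map every voxel to its consensus class."""
    return {voxel: resolve_voxel(preds) for voxel, preds in voxel_predictions.items()}


def reproject(consensus, visible_voxels_per_frame):
    """Give every frame the consensus label of each voxel it sees.

    visible_voxels_per_frame: dict frame_id -> iterable of voxel coordinates
    (assumed to be precomputed). Voxels without predictions are skipped.
    """
    return {
        frame: {voxel: consensus[voxel] for voxel in voxels if voxel in consensus}
        for frame, voxels in visible_voxels_per_frame.items()
    }
```

Because every frame reads its labels from the same consensus map, two views of the same voxel can no longer disagree on its class.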

Instance-matching loss

The instance-matching loss exploits disagreements between predictions for the same object: it encourages feature vectors belonging to the same object to be close in feature space, and feature vectors of different objects to be farther apart.
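
One way to realize this pull-together, push-apart behavior is a classic pairwise contrastive loss with a margin, sketched below. The exact formulation of the instance-matching loss is not given here, so the functional form, the `margin` parameter, and the function name are assumptions made for illustration.

```python
import math
from itertools import combinations


def instance_matching_loss(features, instance_ids, margin=1.0):
    """Pairwise contrastive loss over feature vectors (assumed form).

    features: list of equal-length float lists, one per detection.
    instance_ids: list of object identifiers aligned with features.
    margin: distance beyond which different objects incur no penalty.
    """
    pairs = list(combinations(range(len(features)), 2))
    if not pairs:
        return 0.0

    total = 0.0
    for i, j in pairs:
        dist = math.dist(features[i], features[j])
        if instance_ids[i] == instance_ids[j]:
            # Same object: penalize any distance between the features.
            total += dist ** 2
        else:
            # Different objects: penalize only when closer than the margin.
            total += max(0.0, margin - dist) ** 2
    return total / len(pairs)
```

The loss is zero when all views of an object share one feature vector and different objects are at least `margin` apart, and it grows when views of the same object drift apart or different objects collapse together.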

BibTeX

@inproceedings{lookaround2024,
  title={Look around and learn: self-improving object detection by exploration},
  author={Gianluca Scarpellini and Stefano Rosa and Pietro Morerio and Lorenzo Natale and Alessio Del Bue},
  year={2024},
  eprint={2302.03566},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  booktitle={European Conference on Computer Vision},
}