FreeOcc: Training-Free Embodied Open-Vocabulary Occupancy Prediction

RSS 2026
1The Hong Kong University of Science and Technology (Guangzhou)   2MBZUAI
*Equal contribution. Corresponding author.

Abstract

Existing learning-based occupancy prediction methods rely on large-scale 3D annotations and generalize poorly across environments. We present FreeOcc, a training-free framework for open-vocabulary occupancy prediction from monocular or RGB-D sequences. Unlike prior approaches that require voxel-level supervision and ground-truth camera poses, FreeOcc operates without 3D annotations, pose ground truth, or any learning stage. FreeOcc incrementally builds a globally consistent occupancy map via a four-layer pipeline: a SLAM backbone estimates poses and sparse geometry; a geometrically consistent Gaussian update constructs dense 3D Gaussian maps; open-vocabulary semantics from off-the-shelf vision-language models are associated with Gaussian primitives; and a probabilistic Gaussian-to-occupancy projection produces dense voxel occupancy. Despite being entirely training-free and pose-agnostic, FreeOcc achieves over 2x improvements in IoU and mIoU on EmbodiedOcc-ScanNet compared to prior self-supervised methods. We further introduce ReplicaOcc, a benchmark for indoor open-vocabulary occupancy prediction, and show that FreeOcc transfers zero-shot to novel environments, substantially outperforming both supervised and self-supervised baselines.

Highlights

  1. Training-free occupancy prediction. FreeOcc requires no occupancy annotations, semantic annotations, ground-truth poses, or task-specific training.
  2. Open-vocabulary 3D mapping. Language-aligned features are associated with 3D Gaussian primitives and propagated to occupancy voxels.
  3. Geometrically consistent Gaussian construction. SLAM-guided initialization and anchored Gaussian updates improve spatial consistency for occupancy mapping.
  4. ReplicaOcc benchmark. A test-only benchmark evaluates zero-shot open-vocabulary occupancy prediction in indoor embodied environments.
FreeOcc teaser figure
FreeOcc is a training-free paradigm for open-vocabulary occupancy prediction. It eliminates the need for occupancy, pose, and semantic annotations, and incrementally constructs four-layer maps using only monocular or RGB-D image sequences. The right panel illustrates the benefit of open-vocabulary reasoning on EmbodiedOcc-ScanNet: the green boxes corresponding to “window” and “chair” are correctly identified and localized by FreeOcc, whereas the ground-truth occupancy labels (red boxes) coarsely classify them as “wall” and “floor,” respectively, despite clear visual evidence.

Methodology

FreeOcc builds a four-layer representation for online open-vocabulary occupancy prediction.

Framework overview of FreeOcc
Framework Overview of FreeOcc. FreeOcc incrementally constructs a multi-layer map for online open-vocabulary occupancy prediction. Layer 1: A SLAM backbone processes monocular or RGB-D image sequences to estimate camera poses and sparse/semi-dense point cloud maps. Layer 2: Dense 3D Gaussian Splatting (3DGS) maps are constructed via SLAM-guided point initialization and a geometrically consistent Gaussian update strategy. Layer 3: Open-vocabulary semantic features are associated with Gaussian primitives using a vision-language model, forming a language-embedded 3D Gaussian semantic map. Layer 4: The semantic Gaussian map is projected into a dense voxel occupancy representation through probabilistic Gaussian-to-occupancy splatting, enabling online open-vocabulary querying and semantic localization in 3D scenes.
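Layer 4 above projects the language-embedded Gaussian map into voxels. A minimal illustrative sketch of such a probabilistic Gaussian-to-occupancy projection is given below; `gaussians_to_occupancy` and its thresholding rule are our own simplified placeholders, not the released FreeOcc implementation.

```python
import numpy as np

def gaussians_to_occupancy(means, covs, weights, grid_min, grid_max,
                           voxel_size=0.05, tau=0.5):
    """Illustrative probabilistic Gaussian-to-occupancy projection.

    Each voxel center accumulates the weighted density of every 3D
    Gaussian; voxels whose accumulated value exceeds `tau` are marked
    occupied. (Simplified placeholder, not the paper's exact splatting.)
    """
    # Build the grid of voxel centers.
    axes = [np.arange(lo + voxel_size / 2, hi, voxel_size)
            for lo, hi in zip(grid_min, grid_max)]
    centers = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)  # (X,Y,Z,3)
    occ = np.zeros(centers.shape[:3])
    for mu, cov, w in zip(means, covs, weights):
        d = centers - mu                                  # offsets to center
        inv = np.linalg.inv(cov)
        m = np.einsum("...i,ij,...j->...", d, inv, d)     # squared Mahalanobis
        occ += w * np.exp(-0.5 * m)                       # unnormalized density
    return occ > tau
```

In a full pipeline the per-Gaussian semantic features would be carried along with the density so that each occupied voxel also inherits a language-aligned feature.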

Quantitative Results

Table II performance comparison on EmbodiedOcc-ScanNet
Performance comparison on EmbodiedOcc-ScanNet. Label requirements are reported by task type: Geo. indicates required geometric supervision, and Sem. indicates semantic supervision. We report IoU and per-class mIoU.
Table III zero-shot generalization results on ReplicaOcc
Zero-shot generalization results on the ReplicaOcc benchmark. The evaluation protocol and categorization follow those in Table II.
Table IV geometric IoU comparison of 3DGS-based SLAM backbones
Geometric IoU comparison of 3DGS-based SLAM backbones for occupancy prediction on ReplicaOcc and EmbodiedOcc-ScanNet-mini.
Table V ablation results
Ablation results. We report average IoU/mIoU and FPS on ReplicaOcc and EmbodiedOcc-ScanNet. GAGU: Geometrically Anchored Gaussian Updates; G-ini: geometry-aware initialization.

Qualitative Results

Qualitative occupancy prediction results. (A) Comparisons with learning-based occupancy predictors on “scene0470” and “room2”. (B) Results of the two 3DGS-SLAM methods with the highest geometric accuracy on “scene0006” and “office0”.
Open-vocabulary query results on ReplicaOcc, demonstrating semantic occupancy retrieval for different input vocabulary words.
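An open-vocabulary query of this kind can be sketched as nearest-text-embedding assignment over per-voxel language features (CLIP-style); the function below is a hypothetical illustration, not FreeOcc's actual query interface.

```python
import numpy as np

def query_occupancy(voxel_feats, occupied, text_embs, query_idx):
    """Retrieve occupied voxels matching one vocabulary word (sketch).

    voxel_feats: (N, D) language feature per occupied voxel (CLIP-style).
    occupied:    (N, 3) voxel indices of the occupied cells.
    text_embs:   (Q, D) embeddings of all candidate vocabulary words.
    query_idx:   index of the word to retrieve.
    Each voxel is assigned the word with the highest cosine similarity;
    voxels assigned to `query_idx` are returned.
    """
    f = voxel_feats / np.linalg.norm(voxel_feats, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sim = f @ t.T                   # (N, Q) cosine similarities
    labels = sim.argmax(axis=1)     # per-voxel best-matching word
    return occupied[labels == query_idx]
```

Assigning each voxel to its argmax word (rather than thresholding a single query's similarity) keeps the retrieval mutually exclusive across the vocabulary.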

Real-World Deployment with RGB-D Sensor

Real-world open-vocabulary occupancy prediction
Visualization of FreeOcc's open-vocabulary occupancy predictions in real-world indoor and outdoor scenes. For real-time results, please refer to the accompanying video.

Exploratory Experiments

Table VI influence of individual network components
Influence of individual network components on EmbodiedOcc-ScanNet. We evaluate different SLAM backbones and open-vocabulary semantic segmentation models under the monocular setting.
Table VII performance gap analysis
Performance gap analysis under the RGB-D setting on EmbodiedOcc-ScanNet. We evaluate the influence of camera pose accuracy and the type of semantic prediction.
Table VIII open-vocabulary validation results
Quantitative open-vocabulary validation results on ReplicaOcc. Categories are sorted by frequency from high to low, and mIoU is reported over the top-K categories.
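The top-K metric above can be computed by averaging per-class IoU over the K most frequent categories; the helper below is a minimal sketch of that protocol (our own formulation, assuming integer label volumes).

```python
import numpy as np

def topk_miou(pred, gt, class_order, k):
    """mIoU over the top-k most frequent categories (sketch).

    pred, gt:    integer label arrays of the same shape.
    class_order: class ids sorted by frequency, high to low.
    Classes absent from both pred and gt are skipped.
    """
    ious = []
    for c in class_order[:k]:
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))
```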
Real-world red-and-yellow cup experiment
Real-world red-and-yellow cup experiment. FreeOcc correctly localizes and distinguishes visually similar objects according to open-vocabulary text queries, demonstrating its applicability to fine-grained real-world scene understanding.

Benchmark Details

Representative EmbodiedOcc-ScanNet and ReplicaOcc scenes
Visualization of representative scenes from EmbodiedOcc-ScanNet and ReplicaOcc with similar geometric layouts. While EmbodiedOcc-ScanNet contains 11 semantic categories, ReplicaOcc includes 44 categories per scene to maintain semantic diversity during evaluation.
Visualization results for all ReplicaOcc scenes
Visualization results for all ReplicaOcc scenes. Because Replica scenes are enclosed nearly up to the ceiling, we render all ReplicaOcc visualizations with transparency reduced to 0.3 so that scene interiors remain visible.

Incremental Visualization Results

Incremental results of multi-layer map construction for scene0000
Incremental multi-layer map construction for “scene0000” in ScanNet.
Incremental results for outdoor real-world deployment
Incremental multi-layer map construction for an outdoor scene in real-world deployment with RGB-D input.

BibTeX

@inproceedings{jiang2026freeocc,
  title     = {FreeOcc: Training-Free Embodied Open-Vocabulary Occupancy Prediction},
  author    = {Jiang, Zeyu and Zhou, Changqing and Zuo, Xingxing and Chen, Changhao},
  booktitle = {Robotics: Science and Systems (RSS)},
  year      = {2026}
}