FreeOcc: Training-Free Embodied Open-Vocabulary Occupancy Prediction

RSS 2026
1The Hong Kong University of Science and Technology (Guangzhou)   2MBZUAI
*Equal contribution. Corresponding author.

Abstract

Existing learning-based occupancy prediction methods rely on large-scale 3D annotations and generalize poorly across environments. We present FreeOcc, a training-free framework for open-vocabulary occupancy prediction from monocular or RGB-D sequences. Unlike prior approaches that require voxel-level supervision and ground-truth camera poses, FreeOcc operates without 3D annotations, pose ground truth, or any learning stage. FreeOcc incrementally builds a globally consistent occupancy map via a four-layer pipeline: a SLAM backbone estimates poses and sparse geometry; a geometrically consistent Gaussian update constructs dense 3D Gaussian maps; open-vocabulary semantics from off-the-shelf vision-language models are associated with Gaussian primitives; and a probabilistic Gaussian-to-occupancy projection produces dense voxel occupancy. Despite being entirely training-free and pose-agnostic, FreeOcc achieves over 2x improvements in IoU and mIoU on EmbodiedOcc-ScanNet compared to prior self-supervised methods. We further introduce ReplicaOcc, a benchmark for indoor open-vocabulary occupancy prediction, and show that FreeOcc transfers zero-shot to novel environments, substantially outperforming both supervised and self-supervised baselines.

Highlights

  1. Training-free occupancy prediction. FreeOcc requires no occupancy annotations, semantic annotations, ground-truth poses, or task-specific training.
  2. Open-vocabulary 3D mapping. Language-aligned features are associated with 3D Gaussian primitives and propagated to occupancy voxels.
  3. Geometrically consistent Gaussian construction. SLAM-guided initialization and anchored Gaussian updates improve spatial consistency for occupancy mapping.
  4. ReplicaOcc benchmark. A test-only benchmark evaluates zero-shot open-vocabulary occupancy prediction in indoor embodied environments.
FreeOcc teaser figure
FreeOcc is a training-free paradigm for open-vocabulary occupancy prediction. It eliminates the need for occupancy, pose, and semantic annotations, and incrementally constructs four-layer maps using only monocular or RGB-D image sequences. The right panel illustrates the benefit of open-vocabulary reasoning on EmbodiedOcc-ScanNet: the green boxes corresponding to “window” and “chair” are correctly identified and localized by FreeOcc, whereas the ground-truth occupancy labels (red boxes) coarsely classify them as “wall” and “floor,” respectively, despite clear visual evidence.

Methodology

FreeOcc builds a four-layer representation for online open-vocabulary occupancy prediction.

Framework overview of FreeOcc
Framework Overview of FreeOcc. FreeOcc incrementally constructs a multi-layer map for online open-vocabulary occupancy prediction. Layer 1: A SLAM backbone processes monocular or RGB-D image sequences to estimate camera poses and sparse/semi-dense point cloud maps. Layer 2: Dense 3D Gaussian Splatting (3DGS) maps are constructed via SLAM-guided point initialization and a geometrically consistent Gaussian update strategy. Layer 3: Open-vocabulary semantic features are associated with Gaussian primitives using a vision-language model, forming a language-embedded 3D Gaussian semantic map. Layer 4: The semantic Gaussian map is projected into a dense voxel occupancy representation through probabilistic Gaussian-to-occupancy splatting, enabling online open-vocabulary querying and semantic localization in 3D scenes.
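Layer 4 above projects the language-embedded Gaussian map into voxels. A minimal illustrative sketch of such a probabilistic Gaussian-to-occupancy projection is given below; `gaussians_to_occupancy` and its thresholding rule are our own simplified placeholders, not the released FreeOcc implementation.

```python
import numpy as np

def gaussians_to_occupancy(means, covs, weights, grid_min, grid_max,
                           voxel_size=0.05, tau=0.5):
    """Illustrative probabilistic Gaussian-to-occupancy projection.

    Each voxel center accumulates the weighted density of every 3D
    Gaussian; voxels whose accumulated value exceeds `tau` are marked
    occupied. (Simplified placeholder, not the paper's exact splatting.)
    """
    # Build the grid of voxel centers.
    axes = [np.arange(lo + voxel_size / 2, hi, voxel_size)
            for lo, hi in zip(grid_min, grid_max)]
    centers = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)  # (X,Y,Z,3)
    occ = np.zeros(centers.shape[:3])
    for mu, cov, w in zip(means, covs, weights):
        d = centers - mu                                  # offsets to center
        inv = np.linalg.inv(cov)
        m = np.einsum("...i,ij,...j->...", d, inv, d)     # squared Mahalanobis
        occ += w * np.exp(-0.5 * m)                       # unnormalized density
    return occ > tau
```

In a full pipeline the per-Gaussian semantic features would be carried along with the density so that each occupied voxel also inherits a language-aligned feature.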

Quantitative Results

Table II performance comparison on EmbodiedOcc-ScanNet
Performance comparison on EmbodiedOcc-ScanNet. Label requirements are reported by task type: Geo. indicates required geometric supervision, and Sem. indicates semantic supervision. We report IoU and per-class mIoU.
Table III zero-shot generalization results on ReplicaOcc
Zero-shot generalization results on the ReplicaOcc benchmark. The evaluation protocol and categorization follow those in Table II.
Table IV geometric IoU comparison of 3DGS-based SLAM backbones
Geometric IoU comparison of 3DGS-based SLAM backbones for occupancy prediction on ReplicaOcc and EmbodiedOcc-ScanNet-mini.
Table V ablation results
Ablation results. We report average IoU/mIoU and FPS on ReplicaOcc and EmbodiedOcc-ScanNet. GAGU: Geometrically Anchored Gaussian Updates; G-ini: geometry-aware initialization.

Qualitative Results

Qualitative occupancy prediction results. (A) Comparisons with learning-based occupancy predictors on “scene0470” and “room2”. (B) Results of the two 3DGS-SLAM methods with the highest geometric accuracy on “scene0006” and “office0”.
Open-vocabulary query results on ReplicaOcc, demonstrating semantic occupancy retrieval for different input vocabulary words.
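An open-vocabulary query of this kind can be sketched as nearest-text-embedding assignment over per-voxel language features (CLIP-style); the function below is a hypothetical illustration, not FreeOcc's actual query interface.

```python
import numpy as np

def query_occupancy(voxel_feats, occupied, text_embs, query_idx):
    """Retrieve occupied voxels matching one vocabulary word (sketch).

    voxel_feats: (N, D) language feature per occupied voxel (CLIP-style).
    occupied:    (N, 3) voxel indices of the occupied cells.
    text_embs:   (Q, D) embeddings of all candidate vocabulary words.
    query_idx:   index of the word to retrieve.
    Each voxel is assigned the word with the highest cosine similarity;
    voxels assigned to `query_idx` are returned.
    """
    f = voxel_feats / np.linalg.norm(voxel_feats, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sim = f @ t.T                   # (N, Q) cosine similarities
    labels = sim.argmax(axis=1)     # per-voxel best-matching word
    return occupied[labels == query_idx]
```

Assigning each voxel to its argmax word (rather than thresholding a single query's similarity) keeps the retrieval mutually exclusive across the vocabulary.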

Real-World Deployment with RGB-D Sensor

Real-world open-vocabulary occupancy prediction
Visualization of FreeOcc's open-vocabulary occupancy predictions in real-world indoor and outdoor scenes. For real-time results, please refer to the accompanying video.

Exploratory Experiments

Table VI influence of individual network components
Influence of individual network components on EmbodiedOcc-ScanNet. We evaluate different SLAM backbones and open-vocabulary semantic segmentation models under the monocular setting.
Table VII performance gap analysis
Performance gap analysis under the RGB-D setting on EmbodiedOcc-ScanNet. We evaluate the influence of camera pose accuracy and the type of semantic prediction.
Table VIII open-vocabulary validation results
Quantitative open-vocabulary validation results on ReplicaOcc. Categories are sorted by frequency from high to low, and mIoU is reported over the top-K categories.
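The top-K metric above can be computed by averaging per-class IoU over the K most frequent categories; the helper below is a minimal sketch of that protocol (our own formulation, assuming integer label volumes).

```python
import numpy as np

def topk_miou(pred, gt, class_order, k):
    """mIoU over the top-k most frequent categories (sketch).

    pred, gt:    integer label arrays of the same shape.
    class_order: class ids sorted by frequency, high to low.
    Classes absent from both pred and gt are skipped.
    """
    ious = []
    for c in class_order[:k]:
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))
```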
Real-world red-and-yellow cup experiment
Real-world red-and-yellow cup experiment. FreeOcc correctly localizes and distinguishes visually similar objects according to open-vocabulary text queries, demonstrating its applicability to fine-grained real-world scene understanding.

Benchmark Details

Representative EmbodiedOcc-ScanNet and ReplicaOcc scenes
Visualization of representative scenes from EmbodiedOcc-ScanNet and ReplicaOcc with similar geometric layouts. While EmbodiedOcc-ScanNet contains 11 semantic categories, ReplicaOcc includes 44 categories per scene to maintain semantic diversity during evaluation.
Visualization results for all ReplicaOcc scenes
Visualization results for all ReplicaOcc scenes. Because Replica scenes are enclosed nearly up to the ceiling, we render all ReplicaOcc visualizations with transparency reduced to 0.3 so that scene interiors remain visible.

Incremental Visualization Results

Incremental results of multi-layer map construction for scene0000
Incremental multi-layer map construction for “scene0000” in ScanNet.
Incremental results for outdoor real-world deployment
Incremental multi-layer map construction for an outdoor scene in real-world deployment with RGB-D input.

BibTeX

@inproceedings{jiang2026freeocc,
  title     = {FreeOcc: Training-Free Embodied Open-Vocabulary Occupancy Prediction},
  author    = {Jiang, Zeyu and Zhou, Changqing and Zuo, Xingxing and Chen, Changhao},
  booktitle = {Robotics: Science and Systems (RSS)},
  year      = {2026}
}