Point Cloud as a Foreign Language for Multi-modal Large Language Model

CVPR 2026 (main conference, poster)

Concordia University, Canada
Figure 1. Our proposed encoder-free 3D Multimodal Large Language Model efficiently captures 3D information from point clouds without relying on any pretrained 3D encoder.

Abstract

Multi-modal large language models (MLLMs) have shown remarkable progress in integrating visual and linguistic understanding. Recent efforts have extended these capabilities to 3D understanding through encoder-based architectures that rely on pre-trained 3D encoders to extract geometric features. However, such approaches suffer from semantic misalignment between geometric and linguistic spaces, resolution sensitivity, and substantial computational overhead.

In this work, we present SAGE, the first end-to-end 3D MLLM that directly processes raw point clouds without relying on a pre-trained 3D encoder. Our approach introduces a lightweight 3D tokenizer that combines geometric sampling and neighbourhood aggregation with vector quantization to convert point clouds into discrete tokens — treating 3D data as a foreign language that naturally extends the LLM's vocabulary.

Furthermore, to enhance the model's reasoning capability on complex 3D tasks, we propose a preference optimization training strategy with a semantic alignment–based reward, specifically designed for open-ended 3D question answering where responses are descriptive. Extensive experiments across diverse 3D understanding benchmarks demonstrate that our end-to-end approach outperforms existing encoder-based methods while offering significant advantages in computational efficiency, generalization across LLM backbones, and robustness to input resolution variations.

Key Contributions

Method

SAGE treats a point cloud as a foreign language. A lightweight, trainable tokenizer projects raw point clouds into the LLM's input space via three steps:

  1. Geometric sampling and grouping. Farthest Point Sampling selects representative centres; a k-nearest-neighbour search groups the points around each centre into local sub-clouds; a local geometry aggregation module produces spatially-contextualized features.
  2. Projection to LLM space. A learnable projection matrix maps geometric features into the LLM's embedding space.
  3. Vector quantization. A learnable codebook discretizes the projected features into a finite vocabulary of 3D tokens — extending the LLM's tokenizer to the 3D domain.
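As a rough illustration of step 1, the following NumPy sketch runs greedy farthest point sampling, groups neighbours around each centre, and pools each group into one feature vector. The function names, the cloud size, and the max-pooling aggregation are assumptions made for this example; the page does not specify SAGE's actual aggregation module.

```python
import numpy as np

def farthest_point_sampling(points, num_centres, start=0):
    """Greedy FPS: repeatedly pick the point farthest from those already chosen."""
    chosen = np.empty(num_centres, dtype=np.int64)
    chosen[0] = start
    # Squared distance from every point to its nearest chosen centre so far.
    min_dist = np.sum((points - points[start]) ** 2, axis=1)
    for i in range(1, num_centres):
        chosen[i] = int(np.argmax(min_dist))
        d = np.sum((points - points[chosen[i]]) ** 2, axis=1)
        min_dist = np.minimum(min_dist, d)
    return chosen

def knn_group(points, centre_idx, k):
    """Indices, shape (num_centres, k), of the k nearest points to each centre."""
    centres = points[centre_idx]
    d2 = np.sum((centres[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    return np.argsort(d2, axis=1)[:, :k]

def aggregate(points, centre_idx, group_idx):
    """Stand-in aggregation (assumption): centre-relative offsets, max-pooled."""
    local = points[group_idx] - points[centre_idx][:, None, :]   # (M, k, 3)
    return np.concatenate([points[centre_idx], local.max(axis=1)], axis=1)

rng = np.random.default_rng(0)
cloud = rng.normal(size=(1024, 3))            # toy point cloud, xyz only
centre_idx = farthest_point_sampling(cloud, num_centres=64)
group_idx = knn_group(cloud, centre_idx, k=16)
features = aggregate(cloud, centre_idx, group_idx)   # one 6-D vector per sub-cloud
```

Each row of `features` describes one local sub-cloud; a learned aggregation module would take the place of the max-pool used here.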
Figure 2. Architecture of the proposed encoder-free 3D Multimodal Large Language Model.
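Steps 2 and 3 of the tokenizer can be sketched as a linear projection followed by a nearest-neighbour codebook lookup. This is a minimal sketch under stated assumptions: the dimensions, the random weights, the `text_vocab_size` value, and the idea of offsetting 3D token ids past the text vocabulary are all hypothetical choices for illustration, not details taken from the paper.

```python
import numpy as np

def quantize(features, projection, codebook):
    """Project features, then snap each one to its nearest codebook entry."""
    projected = features @ projection                        # (M, d_model)
    # Squared Euclidean distance to every code, via the expanded form.
    d2 = (np.sum(projected ** 2, axis=1, keepdims=True)
          - 2.0 * projected @ codebook.T
          + np.sum(codebook ** 2, axis=1)[None, :])          # (M, num_codes)
    token_ids = np.argmin(d2, axis=1)
    return token_ids, codebook[token_ids]

rng = np.random.default_rng(0)
geo_features = rng.normal(size=(64, 6))      # stand-in geometric features
projection = rng.normal(size=(6, 32))        # learnable in practice; random here
codebook = rng.normal(size=(128, 32))        # 128 hypothetical 3D "words"

token_ids, token_embeddings = quantize(geo_features, projection, codebook)

# Assumption: 3D tokens are appended after the text vocabulary.
text_vocab_size = 32000
llm_token_ids = token_ids + text_vocab_size
```

In training, the projection and codebook would be learned rather than sampled, and the non-differentiable `argmin` is typically handled with a straight-through estimator; neither is shown here.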

Three-Stage Training

  1. Stage 1 — 3D tokenizer warm-up. The tokenizer learns to produce meaningful discrete 3D tokens.
  2. Stage 2 — Instruction tuning. The full model is trained on 3D instruction-following data.
  3. Stage 3 — GRPO-based tuning. Preference optimization with a semantic-alignment reward enhances complex 3D reasoning.
Figure 3. Three-stage training pipeline.
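To make Stage 3 more concrete, the sketch below scores a group of sampled responses against a reference answer and converts the scores into group-relative advantages, the normalization used by GRPO. The cosine-similarity reward and the random vectors standing in for sentence embeddings are assumptions for illustration; the page does not define the exact form of the semantic-alignment reward.

```python
import numpy as np

def cosine_reward(response_emb, reference_emb):
    """Assumed reward: cosine similarity between response and reference embeddings."""
    num = response_emb @ reference_emb
    den = np.linalg.norm(response_emb, axis=1) * np.linalg.norm(reference_emb)
    return num / np.maximum(den, 1e-8)

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize rewards within one sampled group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

rng = np.random.default_rng(0)
reference = rng.normal(size=16)          # embedding of the reference answer
responses = rng.normal(size=(8, 16))     # embeddings of 8 sampled responses

rewards = cosine_reward(responses, reference)
advantages = group_relative_advantages(rewards)
```

Responses that align better with the reference than the group average receive positive advantages and are reinforced; the rest are pushed down. A graded reward of this kind suits descriptive answers, where exact-match scoring would give almost no signal.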

Results

SAGE is evaluated across diverse 3D understanding benchmarks including captioning and question answering. We report two variants:

Both variants outperform existing encoder-based 3D MLLMs while offering substantial efficiency gains. See the paper for full quantitative comparisons.

Quantitative results

BibTeX

@article{paul2025sage,
  title   = {Point Cloud as a Foreign Language for Multi-modal Large Language Model},
  author  = {Paul, Sneha and Patterson, Zachary and Bouguila, Nizar},
  journal = {arXiv preprint},
  year    = {2025}
}