SAGE: Point Cloud as a Foreign Language for Multi-modal Large Language Model

**Figure 1.** Our proposed encoder-free 3D Multimodal Large Language Model efficiently captures 3D information from point clouds without relying on any pre-trained 3D encoder.

Abstract

Multi-modal large language models (MLLMs) have shown remarkable progress in integrating visual and linguistic understanding. Recent efforts have extended these capabilities to 3D understanding through encoder-based architectures that rely on pre-trained 3D encoders to extract geometric features. However, such approaches suffer from semantic misalignment between geometric and linguistic spaces, resolution sensitivity, and substantial computational overhead.

In this work, we present SAGE, the first end-to-end 3D MLLM that directly processes raw point clouds without relying on a pre-trained 3D encoder. Our approach introduces a lightweight 3D tokenizer that combines geometric sampling and neighbourhood aggregation with vector quantization to convert point clouds into discrete tokens — treating 3D data as a foreign language that naturally extends the LLM's vocabulary.

Furthermore, to enhance the model's reasoning capability on complex 3D tasks, we propose a preference optimization training strategy with a semantic alignment–based reward, specifically designed for open-ended 3D question answering where responses are descriptive. Extensive experiments across diverse 3D understanding benchmarks demonstrate that our end-to-end approach outperforms existing encoder-based methods while offering significant advantages in computational efficiency, generalization across LLM backbones, and robustness to input resolution variations.

Key Contributions

Point cloud as a foreign language. A novel 3D tokenizer combines Farthest Point Sampling, K-nearest-neighbour grouping, and vector quantization with a learnable codebook, so that the resulting discrete 3D tokens act as an extension of the LLM's vocabulary.
Preference optimization with semantic-alignment reward. An RL-based training strategy replaces the correctness-based rewards that GRPO uses for verifiable tasks with a semantic-alignment reward suited to open-ended 3D question answering, where responses are descriptive.
Strong empirical results. SAGE outperforms encoder-based 3D MLLMs across diverse 3D understanding benchmarks, while offering greater computational efficiency, robustness to varying input resolutions, and generalization across different LLM backbones.

Method

SAGE treats a point cloud as a foreign language. A lightweight, trainable tokenizer projects raw point clouds into the LLM's input space via three steps:

Geometric sampling and grouping. Farthest Point Sampling selects representative centres; K-nearest neighbours form local sub-clouds; a local geometry aggregation module produces spatially-contextualized features.
Projection to LLM space. A learnable projection matrix maps geometric features into the LLM's embedding space.
Vector quantization. A learnable codebook discretizes the projected features into a finite vocabulary of 3D tokens — extending the LLM's tokenizer to the 3D domain.
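
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the SAGE implementation: the function names, the max-pooled offset features used as a stand-in for the local geometry aggregation module, the random projection and codebook, and the choice to offset 3D token ids by a hypothetical `text_vocab_size` are all illustrative. In the actual model the projection and codebook are learned.

```python
import numpy as np

def farthest_point_sampling(points, num_centres, start=0):
    """Greedy FPS: repeatedly pick the point farthest from those already chosen."""
    chosen = np.empty(num_centres, dtype=np.int64)
    chosen[0] = start
    # Distance from every point to its nearest chosen centre so far.
    min_dist = np.linalg.norm(points - points[start], axis=1)
    for i in range(1, num_centres):
        chosen[i] = int(np.argmax(min_dist))
        new_dist = np.linalg.norm(points - points[chosen[i]], axis=1)
        min_dist = np.minimum(min_dist, new_dist)
    return chosen

def knn_group(points, centre_idx, k):
    """Indices, shape (num_centres, k), of the k points nearest to each centre."""
    diff = points[centre_idx][:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    return np.argsort(dist, axis=1)[:, :k]

def aggregate(points, centre_idx, group_idx):
    """Assumption: a placeholder for the local geometry aggregation module.
    Concatenates each centre with the max-pooled absolute neighbour offsets."""
    centres = points[centre_idx]                        # (M, 3)
    offsets = points[group_idx] - centres[:, None, :]   # (M, k, 3)
    return np.concatenate([centres, np.abs(offsets).max(axis=1)], axis=1)  # (M, 6)

def quantize(features, codebook):
    """Nearest-codeword lookup: returns code indices and the quantized vectors."""
    dist = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=-1)
    codes = np.argmin(dist, axis=1)
    return codes, codebook[codes]

def tokenize(points, num_centres, k, projection, codebook, text_vocab_size):
    centre_idx = farthest_point_sampling(points, num_centres)
    group_idx = knn_group(points, centre_idx, k)
    feats = aggregate(points, centre_idx, group_idx)
    projected = feats @ projection                      # map into the embedding space
    codes, quantized = quantize(projected, codebook)
    # Assumption: 3D tokens occupy ids placed after the text vocabulary.
    return text_vocab_size + codes, quantized

rng = np.random.default_rng(0)
cloud = rng.normal(size=(1024, 3))        # toy point cloud
projection = rng.normal(size=(6, 16))     # learned in practice, random here
codebook = rng.normal(size=(64, 16))      # learned in practice, random here
token_ids, embeddings = tokenize(cloud, 32, 8, projection, codebook,
                                 text_vocab_size=32000)
```

Here a 1024-point cloud becomes 32 discrete token ids, each backed by a codebook vector that can be fed to the LLM like any other token embedding. The sketch omits training entirely, including how gradients reach the tokenizer through the non-differentiable nearest-codeword lookup.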

**Figure 2.** Architecture of the proposed encoder-free 3D Multimodal Large Language Model.

Three-Stage Training

Stage 1 — 3D tokenizer warm-up. The tokenizer learns to produce meaningful discrete 3D tokens.
Stage 2 — Instruction tuning. The full model is trained on 3D instruction-following data.
Stage 3 — GRPO-based tuning. Preference optimization with a semantic-alignment reward enhances complex 3D reasoning.
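
To make the Stage 3 idea concrete, the sketch below scores a group of sampled answers by semantic similarity to a reference answer and converts the scores into group-relative advantages, as GRPO does by normalizing rewards within each group. It is a rough illustration under stated assumptions, not the reward used in the paper: `toy_embed` is a hypothetical hashed bag-of-words stand-in for a real text encoder, and cosine similarity is one plausible choice of semantic-alignment score.

```python
import zlib
import numpy as np

def toy_embed(text, dim=256):
    """Assumption: stand-in sentence embedding (hashed bag of words).
    A real setup would use a learned text encoder instead."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    return vec

def semantic_reward(response, reference, embed=toy_embed):
    """Cosine similarity between response and reference embeddings."""
    a, b = embed(response), embed(reference)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize rewards within one sampled group."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

reference = "a wooden chair with four legs and a curved backrest"
responses = [
    "a wooden chair with four legs and a curved backrest",
    "a chair made of wood with four legs",
    "a red sports car",
]
rewards = [semantic_reward(r, reference) for r in responses]
advantages = group_relative_advantages(rewards)
```

Because the reward is a graded similarity rather than a right-or-wrong check, descriptive answers that paraphrase the reference can still receive partial credit, which is the property the semantic-alignment reward is meant to provide for open-ended questions.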

Results

SAGE is evaluated across diverse 3D understanding benchmarks including captioning and question answering. We report two variants:

SAGE: the full model, including preference optimization (Stage 3).
SAGE*: the variant without preference optimization, trained with Stages 1 and 2 only (the standard two-stage protocol).

Both variants outperform existing encoder-based 3D MLLMs while offering substantial efficiency gains. See the paper for full quantitative comparisons.

BibTeX

@article{paul2025sage,
  title   = {Point Cloud as a Foreign Language for Multi-modal Large Language Model},
  author  = {Paul, Sneha and Patterson, Zachary and Bouguila, Nizar},
  journal = {arXiv preprint},
  year    = {2025}
}