Jiahao Yang
Zihan Wang
Xiangyang Li
Xing Zhu
Yujun Shen
Yinghao Xu†
Shuqiang Jiang†
🌟 Official implementation of GA-VLN, a geometry-aware BEV representation framework designed for Vision-Language Navigation (VLN).
- Efficient BEV Representation: Compresses dense multi-view visual observations into a unified BEV space for efficient representation.
- Robust Spatial Reasoning: Integrates a 3D foundation model to enhance geometry-aware perception.
- Real-World Resilience: Remains robust to sensor noise modeled on real-world sensor errors.
1. Create environment
conda create -n gavln python=3.9
conda activate gavln
pip install torch==2.1.2 torchvision==0.16.2 --index-url https://download.pytorch.org/whl/cu121 -r requirements.txt
2. Install Habitat
conda install habitat-sim==0.2.4 withbullet headless -c conda-forge -c aihabitat
git clone --branch v0.2.4 https://github.com/facebookresearch/habitat-lab.git
cd habitat-lab
pip install -e habitat-lab
pip install -e habitat-baselines
Download the VLN-CE data and SigLIP & VGGT backbones. Your directory tree should look like this:
GA-VLN/
├── vln_data/
│   ├── scene_datasets/
│   └── datasets/
├── model/
│   ├── siglip-so400m-patch14-384/
│   └── VGGT-1B/
├── checkpoints/
│   └── gavln_official/
├── ...
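Before running evaluation, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the official repository: the helper name `missing_dirs` and the constant `EXPECTED_DIRS` are hypothetical, and the script assumes that `scene_datasets/` and `datasets/` are siblings under `vln_data/` and that it is run from the `GA-VLN/` root.

```python
from pathlib import Path

# Directories taken from the tree above, relative to the GA-VLN root.
# Assumption: scene_datasets/ and datasets/ both sit directly under vln_data/.
EXPECTED_DIRS = [
    "vln_data/scene_datasets",
    "vln_data/datasets",
    "model/siglip-so400m-patch14-384",
    "model/VGGT-1B",
    "checkpoints/gavln_official",
]


def missing_dirs(root="."):
    """Return the expected directories that do not exist under root."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs(".")
    if missing:
        print("Missing directories:")
        for d in missing:
            print("  " + d)
    else:
        print("All expected directories are present.")
```

The check only verifies that the directories exist; it does not validate the contents of the datasets or model weights.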
bash scripts/gavln_eval.sh
@article{yang2026ga,
title={GA-VLN: Geometry-Aware BEV Representation for Efficient Vision-Language Navigation},
author={Yang, Jiahao and Wang, Zihan and Li, Xiangyang and Zhu, Xing and Shen, Yujun and Xu, Yinghao and Jiang, Shuqiang},
journal={arXiv preprint arXiv:2605.22036},
year={2026}
}
Our code is based on StreamVLN and VGGT. Thanks for their great work!