This repository provides a lightweight depth estimation head designed to run on top of Meta’s DINOv3 backbone. The model adopts a lightweight DPT-inspired decoder that progressively fuses multi-scale features reassembled from DINOv3, and it has been trained on the NYU Depth Dataset V2.
This head is part of the dinov3_ros project, where it enables real-time depth estimation in ROS 2 by reusing backbone features across multiple perception tasks.
We recommend using a fresh conda environment to keep dependencies isolated. DINOv3 requires Python 3.11, so we set that explicitly.
conda create -n depth_dinov3 python=3.11
conda activate depth_dinov3
git clone --recurse-submodules https://github.com/Raessan/depth_dinov3
cd depth_dinov3
pip install -e .
The only package that must be installed separately is PyTorch, because its build depends on the installed CUDA version. For example:
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu129
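After installing, a quick check confirms that the chosen wheel (cu129 above assumes a CUDA 12.9-compatible driver) can actually see the GPU:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"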
Finally, we provide the weights for our lightweight heads in the weights folder, but the DINOv3 backbone weights must be requested and downloaded from the official DINOv3 repository. Their default location is the dinov3_weights folder. The presented head has been trained using the vits16plus model from DINOv3 as the backbone.
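As an illustration, once the requested weights are in place the backbone can typically be loaded through torch.hub from the local DINOv3 submodule. The entrypoint name and weight filename below are assumptions; check the DINOv3 README and its hubconf.py for the exact values:

```python
import torch

# Entrypoint name and weight filename are assumptions -- verify against the
# DINOv3 repository before using them.
backbone = torch.hub.load(
    "dinov3",                                        # path to the local DINOv3 submodule
    "dinov3_vits16plus",                             # assumed hub entrypoint for ViT-S+/16
    source="local",
    weights="dinov3_weights/dinov3_vits16plus.pth",  # hypothetical filename
)
backbone.eval()
```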
Training also requires the NYU Depth Dataset V2 to be available on disk.
This repository implements a lightweight DPT-inspired depth estimation head that attaches to the DINOv3 backbone (or any ViT producing a single spatial feature map). The design has three main components:
- Reassemble and projection:
  - Projects the DINOv3 feature map into four branches: upsampled ×4, upsampled ×2, identity, and downsampled ×2.
  - Each branch is mapped into a common channel dimension using depthwise-separable convolutions (cheap but expressive).
- Fusion:
  - Coarse-to-fine fusion through a sequence of lightweight fusion blocks.
  - Skip connections progressively refine spatial details while preserving global context.
- Output head:
  - A small convolutional decoder upsamples the fused features to the target resolution.
  - Outputs a single-channel depth map, passed through a Softplus activation to ensure positive depth values.
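To make the data flow concrete, here is a minimal, self-contained PyTorch sketch of a head of this kind. The class names, channel widths, and exact block composition are illustrative assumptions and do not mirror the actual implementation in src:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DWSeparableConv(nn.Module):
    """Depthwise convolution followed by a pointwise projection."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return F.relu(self.pointwise(self.depthwise(x)))


class FusionBlock(nn.Module):
    """Upsample the coarser path to the skip resolution, add, and refine."""
    def __init__(self, ch):
        super().__init__()
        self.refine = DWSeparableConv(ch, ch)

    def forward(self, coarse, skip):
        coarse = F.interpolate(coarse, size=skip.shape[-2:],
                               mode="bilinear", align_corners=False)
        return self.refine(coarse + skip)


class ToyDPTDepthHead(nn.Module):
    """Illustrative DPT-style head: reassemble -> project -> fuse -> decode."""
    def __init__(self, backbone_ch=384, ch=128):
        super().__init__()
        # One projection per reassembled branch (x4, x2, identity, x0.5).
        self.projs = nn.ModuleList([DWSeparableConv(backbone_ch, ch) for _ in range(4)])
        self.fusions = nn.ModuleList([FusionBlock(ch) for _ in range(3)])
        self.pre = nn.Conv2d(ch, ch // 2, 3, padding=1)
        self.head = nn.Sequential(nn.Conv2d(ch // 2, 1, 1), nn.Softplus())

    def forward(self, feat):                       # feat: (B, C, H/16, W/16)
        scales = (4.0, 2.0, 1.0, 0.5)
        branches = [
            proj(feat if s == 1.0 else
                 F.interpolate(feat, scale_factor=s, mode="bilinear",
                               align_corners=False))
            for proj, s in zip(self.projs, scales)
        ]
        x = branches[-1]                           # start from the coarsest branch
        for fusion, skip in zip(self.fusions, reversed(branches[:-1])):
            x = fusion(x, skip)                    # coarse-to-fine with skips
        x = F.relu(self.pre(x))
        x = F.interpolate(x, scale_factor=4, mode="bilinear", align_corners=False)
        return self.head(x)                        # Softplus keeps depth positive


if __name__ == "__main__":
    feat = torch.randn(1, 384, 30, 40)             # e.g. ViT/16 features of a 480x640 image
    print(ToyDPTDepthHead()(feat).shape)           # torch.Size([1, 1, 480, 640])
```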
Training uses a combination of three complementary losses:
- A pixel-wise absolute (L1-style) term that penalizes differences between predicted and ground-truth depth values, while ignoring invalid pixels.
- A scale-invariant term that encourages the model to predict depth distributions consistent with the ground truth, focusing on relative depth rather than absolute scale.
- A multi-scale gradient term that enforces sharp depth boundaries by aligning the gradients of the predicted and ground-truth maps at multiple scales.
The total loss is a weighted sum:
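L_total = w_abs · L_abs + w_si · L_si + w_grad · L_grad

Here w_abs, w_si and w_grad (our notation; the actual parameter names and values live in config/config.py) weight the absolute, scale-invariant and gradient terms. As a purely illustrative sketch of how such terms can be computed in PyTorch, with placeholder weights (the exact formulations in this repository may differ):

```python
import torch
import torch.nn.functional as F


def masked_l1(pred, gt, valid):
    """Absolute depth error averaged over valid pixels only."""
    return (pred - gt).abs()[valid].mean()


def scale_invariant(pred, gt, valid, lam=0.5, eps=1e-6):
    """Log-space error that discounts a global scale shift (relative depth)."""
    d = (pred.clamp_min(eps).log() - gt.clamp_min(eps).log())[valid]
    return (d ** 2).mean() - lam * d.mean() ** 2


def multiscale_gradient(pred, gt, valid, num_scales=4):
    """Align depth gradients at several resolutions to sharpen boundaries."""
    diff = (pred - gt) * valid                  # zero out invalid pixels
    loss = 0.0
    for _ in range(num_scales):
        loss = loss + (diff[..., :, 1:] - diff[..., :, :-1]).abs().mean()  # d/dx
        loss = loss + (diff[..., 1:, :] - diff[..., :-1, :]).abs().mean()  # d/dy
        diff = F.avg_pool2d(diff, kernel_size=2)
    return loss


# Example with random tensors; `valid` marks pixels with usable ground truth.
pred = torch.rand(2, 1, 64, 64) + 0.1
gt = torch.rand(2, 1, 64, 64) + 0.1
valid = gt > 0.2
total = (1.0 * masked_l1(pred, gt, valid)
         + 1.0 * scale_invariant(pred, gt, valid)
         + 0.5 * multiscale_gradient(pred, gt, valid))   # placeholder weights
```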
There are three main files, organized in separate folders, that the user should interact with:
config/config.py: This file allows the user to configure the model, the losses, and the training and inference steps. The parameters are described in the file.
train/train_depther.ipynb: Jupyter notebook for training the depth estimator. It can load and/or save checkpoints depending on the configuration in config.py.
inference/inference.py: Script for running inference with a trained model on new images.
Additionally, the repository includes a src folder that contains the backend components: dataset utilities, backbone/head model definitions, and helper scripts. In particular:
common.py: general-purpose functions that can be reused across different task-specific heads.
utils.py: utilities tailored specifically to depth estimation (e.g., creating color-mapped images from depth).
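For reference, turning a metric depth map into a color image for visualization usually looks like the following. This is a generic sketch (assuming matplotlib ≥ 3.5 is available), not the exact implementation in utils.py:

```python
import numpy as np
from matplotlib import colormaps


def colorize_depth(depth: np.ndarray, max_depth: float = 10.0) -> np.ndarray:
    """Map an HxW depth array (meters) to an HxWx3 uint8 color image."""
    valid = depth > 0                                # 0 marks invalid pixels
    norm = np.clip(depth / max_depth, 0.0, 1.0)      # normalize to [0, 1]
    rgb = (colormaps["turbo"](norm)[..., :3] * 255).astype(np.uint8)
    rgb[~valid] = 0                                  # black out invalid pixels
    return rgb
```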
The depth estimator was trained for a total of 150 epochs: first for 100 epochs with a learning rate of 1e-4 using data augmentation, followed by 50 epochs with a reduced learning rate of 1e-5 without augmentation. No dropout was used. The final weights have been placed in the weights folder.
Our main objective was not to surpass state-of-the-art models, but to train a head with solid results that enables collaboration and contributes to building a more refined dinov3_ros. This effort is particularly important because Meta has not released lightweight task-specific heads. For this reason, we welcome contributions, whether that means improving this depth head, adding new features, or experimenting with alternative model architectures. Feel free to open an issue or submit a pull request! See the Integration with dinov3_ros section below to keep your contribution compatible with the dinov3_ros project.
Integration with dinov3_ros
This repository is designed to be easily integrated into dinov3_ros. To enable plug-and-play usage, the following files must be exported from the src folder to the dinov3_toolkit/head_depth folder in dinov3_ros:
model_head.py: defines the depth estimation head architecture.
utils.py: task-specific utilities for depth estimation.
Additionally, we provide our chosen weights in weights/model.pth.
Any modification or extension of this repository should maintain these files and remain self-contained, so that the head can be directly plugged into dinov3_ros without additional dependencies.
Update: to run with dinov3_ros_tensorrt, we created an ONNX exporter in inference/export_model.py. When running with TensorRT, the resulting .onnx models are required instead of model_head.py and weights/model.pth.
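The export typically boils down to a call to torch.onnx.export. The snippet below is only a sketch: the stand-in network, input resolution, and file names are placeholders, and the actual exporter lives in inference/export_model.py:

```python
import torch

# Stand-in network; in practice this would be the DINOv3 backbone plus the depth head.
model = torch.nn.Sequential(torch.nn.Conv2d(3, 1, 3, padding=1), torch.nn.Softplus())
model.eval()

dummy = torch.randn(1, 3, 480, 640)                  # placeholder input resolution
torch.onnx.export(
    model, dummy, "model.onnx",
    input_names=["image"], output_names=["depth"],
    opset_version=17,
    dynamic_axes={"image": {0: "batch"}, "depth": {0: "batch"}},
)
```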
- Code in this repo: Apache-2.0.
- DINOv3 submodule: licensed separately by Meta (see its LICENSE).
- We don't distribute DINOv3 weights. Follow the upstream instructions to obtain them.
