---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- visual-document-retrieval
- cross-modal-distillation
- knowledge-distillation
- nanovdr
base_model: answerdotai/ModernBERT-base
language:
- en
license: apache-2.0
datasets:
- openbmb/VisRAG-Ret-Train-Synthetic-data
- openbmb/VisRAG-Ret-Train-In-domain-data
- vidore/colpali_train_set
- llamaindex/vdr-multilingual-train
model-index:
- name: NanoVDR-L
  results:
  - task:
      type: retrieval
    dataset:
      name: ViDoRe v1
      type: vidore/vidore-benchmark-667173f98e70a1c0fa4d
    metrics:
    - name: NDCG@5
      type: ndcg_at_5
      value: 82.4
  - task:
      type: retrieval
    dataset:
      name: ViDoRe v2
      type: vidore/vidore-benchmark-v2
    metrics:
    - name: NDCG@5
      type: ndcg_at_5
      value: 61.5
---

<p align="center">
  <img width="560" src="banner.png" alt="NanoVDR"/>
</p>

> **Paper**: [NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval](https://arxiv.org/abs/2603.12824) | [Blog](https://huggingface.co/blog/Ryenhails/nanovdr)

# NanoVDR-L

**ModernBERT-base ablation variant.** For production use, we recommend **[NanoVDR-S-Multi](https://huggingface.co/nanovdr/NanoVDR-S-Multi)**.

NanoVDR-L is a 151M-parameter text-only query encoder for visual document retrieval, trained via asymmetric cross-modal distillation from [Qwen3-VL-Embedding-2B](https://huggingface.co/Qwen/Qwen3-VL-Embedding-2B). It combines ModernBERT-base with a 2-layer MLP projector and achieves the highest ViDoRe v1 score (82.4 NDCG@5) among all NanoVDR variants.

### Highlights

- **Single-vector retrieval**: queries and documents share the 2048-dim embedding space of [Qwen3-VL-Embedding-2B](https://huggingface.co/Qwen/Qwen3-VL-Embedding-2B); retrieval is a plain dot product, FAISS-compatible, at **4 KB per page** (2048 × float16)
- **Lightweight on storage**: 612 MB model; the document index takes 64× less storage than ColPali's multi-vector patches
- **Asymmetric setup**: a small 151M text encoder runs at query time; the large VLM indexes documents once, offline
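
The 4 KB-per-page figure follows directly from the embedding shape: 2048 values at 2 bytes each. A minimal sketch of that arithmetic is below; the helper name `index_size_bytes` and the 1M-page corpus are illustrative assumptions, not part of the NanoVDR release.

```python
def index_size_bytes(num_pages: int, dim: int = 2048, bytes_per_value: int = 2) -> int:
    """Raw storage for a single-vector index: one `dim`-sized vector per page.

    bytes_per_value=2 corresponds to float16. Index overhead (IDs, metadata)
    is ignored in this sketch.
    """
    return num_pages * dim * bytes_per_value


per_page = index_size_bytes(1)                   # 2048 values * 2 bytes
one_million_pages = index_size_bytes(1_000_000)  # hypothetical corpus size

print(f"{per_page / 1024:.0f} KB per page")
print(f"{one_million_pages / 1024**3:.2f} GiB for 1M pages")
```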

## Results

| Model | Params | ViDoRe v1 | ViDoRe v2 | ViDoRe v3 | Avg Retention |
|-------|--------|-----------|-----------|-----------|---------------|
| Qwen3-VL-Emb (Teacher) | 2.0B | 84.3 | 65.3 | 50.0 | — |
| **NanoVDR-L** | **151M** | **82.4** | **61.5** | **44.2** | **93.4%** |
| NanoVDR-S-Multi | 69M | 82.2 | 61.9 | 46.5 | 95.1% |

<sub>NDCG@5 (×100). Retention = mean of the Student / Teacher score ratios across v1, v2, and v3.</sub>
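
As a worked check, the retention column can be recomputed from the table. The sketch below assumes retention is the mean of the per-benchmark student/teacher ratios; that reading reproduces the reported values, whereas a ratio of summed scores would not. The function and variable names are illustrative.

```python
# NDCG@5 (x100) values copied from the results table.
teacher = {"v1": 84.3, "v2": 65.3, "v3": 50.0}
students = {
    "NanoVDR-L": {"v1": 82.4, "v2": 61.5, "v3": 44.2},
    "NanoVDR-S-Multi": {"v1": 82.2, "v2": 61.9, "v3": 46.5},
}


def retention(student: dict, teacher: dict) -> float:
    """Mean of per-benchmark student/teacher ratios, as a percentage."""
    ratios = [student[name] / teacher[name] for name in teacher]
    return 100.0 * sum(ratios) / len(ratios)


for name, scores in students.items():
    print(f"{name}: {retention(scores, teacher):.1f}%")
# NanoVDR-L: 93.4%
# NanoVDR-S-Multi: 95.1%
```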

## Usage

> **Prerequisite:** Documents must be indexed offline using [Qwen3-VL-Embedding-2B](https://huggingface.co/Qwen/Qwen3-VL-Embedding-2B) (the teacher model). See the [NanoVDR-S-Multi model page](https://huggingface.co/nanovdr/NanoVDR-S-Multi#prerequisites-document-indexing-with-teacher-model) for a complete indexing guide.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# doc_embeddings: NumPy array of shape (N, 2048) from teacher indexing (see prerequisite above)

model = SentenceTransformer("nanovdr/NanoVDR-L")
query_embeddings = model.encode(["What was the revenue growth in Q3?"])  # (1, 2048)

# Dot-product score against every indexed page, then the 5 best page indices, highest first
scores = query_embeddings @ doc_embeddings.T  # (1, N)
top_k_indices = np.argsort(scores[0])[-5:][::-1]
```
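
For larger indexes or batched queries, a full sort per query is unnecessary. The sketch below is one possible NumPy-only variant that uses `np.argpartition` to select the top-k before ordering them. The `top_k` helper is hypothetical, and the random arrays are placeholders standing in for real teacher and student embeddings so that the snippet runs without downloading any model.

```python
import numpy as np


def top_k(query_embeddings: np.ndarray, doc_embeddings: np.ndarray, k: int = 5):
    """Return (indices, scores) of the k highest dot-product matches per query."""
    scores = query_embeddings @ doc_embeddings.T             # (Q, N)
    k = min(k, scores.shape[1])
    # Select the k largest scores per row without sorting all N entries.
    part = np.argpartition(-scores, k - 1, axis=1)[:, :k]    # (Q, k), unordered
    part_scores = np.take_along_axis(scores, part, axis=1)
    # Order only those k candidates, best first.
    order = np.argsort(-part_scores, axis=1)
    indices = np.take_along_axis(part, order, axis=1)
    return indices, np.take_along_axis(scores, indices, axis=1)


# Placeholder data: random vectors instead of real embeddings.
rng = np.random.default_rng(0)
doc_embeddings = rng.standard_normal((1000, 2048)).astype(np.float32)
# A query that is a slightly perturbed copy of document 42.
noise = 0.01 * rng.standard_normal((1, 2048)).astype(np.float32)
query_embeddings = doc_embeddings[[42]] + noise

indices, scores = top_k(query_embeddings, doc_embeddings, k=5)
print(indices[0], scores[0])
```

With real data, `doc_embeddings` would come from the teacher index and `query_embeddings` from `model.encode(...)` as in the snippet above.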

## Training Details

| | Value |
|--|-------|
| Architecture | ModernBERT-base (149M) + MLP projector (768 → 768 → 2048, 2.4M) = 151M |
| Objective | Pointwise cosine alignment with teacher query embeddings |
| Data | 711K query-document pairs |
| Epochs / lr | 20 / 2e-4 |
| Training cost | ~11.7 GPU-hours (1× H200) |
| CPU query latency | 109 ms |
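
To illustrate the objective row, here is a minimal NumPy sketch of a pointwise cosine alignment loss between student and teacher query embeddings. The exact formulation is an assumption: the table only names the objective, so the `1 - cos` form, the batch mean, and the function name are illustrative. An actual training loop would use an autodiff framework rather than NumPy.

```python
import numpy as np


def cosine_alignment_loss(student: np.ndarray, teacher: np.ndarray, eps: float = 1e-8) -> float:
    """Mean of (1 - cosine similarity) over paired rows.

    student, teacher: arrays of shape (batch, dim); row i of each array
    is the embedding of the same query. Assumed form, not the authors' code.
    """
    s = student / (np.linalg.norm(student, axis=1, keepdims=True) + eps)
    t = teacher / (np.linalg.norm(teacher, axis=1, keepdims=True) + eps)
    cos = np.sum(s * t, axis=1)  # per-pair cosine similarity
    return float(np.mean(1.0 - cos))


rng = np.random.default_rng(0)
teacher_emb = rng.standard_normal((4, 2048))  # placeholder teacher query embeddings

aligned = cosine_alignment_loss(3.0 * teacher_emb, teacher_emb)  # same direction
opposed = cosine_alignment_loss(-teacher_emb, teacher_emb)       # opposite direction
print(aligned, opposed)
```

Because the loss depends only on direction, a student output that is a scaled copy of the teacher embedding gives a loss near 0, while an opposite direction gives a loss near 2.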

## All NanoVDR Models

| Model | Backbone | Params | v1 | v2 | v3 | Retention |
|-------|----------|--------|----|----|----|-----------|
| **[NanoVDR-S-Multi](https://huggingface.co/nanovdr/NanoVDR-S-Multi)** | **DistilBERT** | **69M** | **82.2** | **61.9** | **46.5** | **95.1%** |
| [NanoVDR-S](https://huggingface.co/nanovdr/NanoVDR-S) | DistilBERT | 69M | 82.2 | 60.5 | 43.5 | 92.4% |
| [NanoVDR-M](https://huggingface.co/nanovdr/NanoVDR-M) | BERT-base | 112M | 82.1 | 62.2 | 44.7 | 94.0% |
| [NanoVDR-L](https://huggingface.co/nanovdr/NanoVDR-L) | ModernBERT | 151M | 82.4 | 61.5 | 44.2 | 93.4% |

## Citation

```bibtex
@article{nanovdr2026,
  title={NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval},
  author={Liu, Zhuchenyang and Zhang, Yao and Xiao, Yu},
  journal={arXiv preprint arXiv:2603.12824},
  year={2026}
}
```

## License

Apache 2.0