SigLIP2 Base 256 — Food/Not-Food Classifier v2

Binary image classifier: food_or_drink vs not_food_or_drink.

Part of the Nutrify pipeline, where it serves as the highest-accuracy model.

v2 Improvement

v2 adds 560,836 human-labeled FoodVision images to the 2,952,644 DataComp training set. FoodVision samples use hard cross-entropy loss; DataComp samples use KL distillation from SigLIP2-so400m soft labels.

| Version | FoodVision Acc | FoodVision F1 | Training Data |
| --- | --- | --- | --- |
| v2 | 98.21% | 0.9883 | DataComp 2,952,644 + FoodVision 560,836 |
| v1 | 0.00% | 0.0000 | DataComp only |
| Δ | +98.21% | +0.9883 | |

Cross-Model Comparison (v2, FoodVision Test — 153K images)

| Model | Params | FV Accuracy | FV F1 | Role |
| --- | --- | --- | --- | --- |
| **SigLIP2 Base 256** | 92.9M | 98.21% | 0.9883 | Highest accuracy |
| CSATv2 11M | 10.7M | 97.99% | 0.9869 | Fastest throughput |
| NextViT Small 384 | 30.7M | 97.84% | 0.9859 | CoreML deployable |

Quick Start

import timm
import torch
from huggingface_hub import hf_hub_download
from PIL import Image
from safetensors.torch import load_file

# Build the architecture only; pretrained=False skips the WebLI checkpoint
# because the fine-tuned weights are loaded in the next step
model = timm.create_model("vit_base_patch16_siglip_256.v2_webli", pretrained=False, num_classes=2)

# Download and load the fine-tuned weights
weights_path = hf_hub_download("mrdbourke/food-not-food-classifier-siglip2-v2", "model.safetensors")
model.load_state_dict(load_file(weights_path))
model.eval()

# Get transforms
data_cfg = timm.data.resolve_data_config(model.pretrained_cfg)
transform = timm.data.create_transform(**data_cfg, is_training=False)

# Predict
img = Image.open("your_image.jpg").convert("RGB")
x = transform(img).unsqueeze(0)
with torch.no_grad():
    logits = model(x)
    pred = logits.argmax(dim=1).item()

labels = {0: "food_or_drink", 1: "not_food_or_drink"}
print(f"Prediction: {labels[pred]}")
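
If a confidence score is more useful than a hard label, the two logits can be turned into probabilities with a softmax. The following is a minimal sketch rather than part of the released code: `predict_with_confidence` and its `min_confidence` threshold are hypothetical names introduced here, and the helper works on a plain list of floats so that it runs without a model (with the snippet above it would receive `logits[0].tolist()`).

```python
import math

# Label order copied from the Quick Start snippet.
LABELS = {0: "food_or_drink", 1: "not_food_or_drink"}


def softmax(logits):
    """Numerically stable softmax over a flat list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_confidence(logits, min_confidence=0.5):
    """Return (label, probability); label is None below min_confidence.

    min_confidence is a hypothetical knob, not something the model card defines.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < min_confidence:
        return None, probs[best]
    return LABELS[best], probs[best]


# Example with made-up logits; real values come from model(x).
label, prob = predict_with_confidence([2.0, -1.0])
print(label, round(prob, 4))
```

With only two classes the top probability is never below 0.5, so the default threshold never abstains. A stricter value such as 0.9 would be one way to route borderline images to a manual check.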

Training Details

Architecture: vit_base_patch16_siglip_256.v2_webli (92.9M parameters)
Input size: 256px
Training data: DataComp 2,952,644 (soft KL labels) + FoodVision 560,836 (hard CE labels)
Epochs: 5 (best blended at epoch 3)
Peak inference throughput: 2096.3 img/s
Optimizer: AdamW (head LR=1e-4, backbone LR=1e-5 after 0.5 epoch warmup)
Loss: DataComp: 0.7×KL(T=3) + 0.3×CE | FoodVision: CE
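
The loss line above can be read as a per-sample rule: FoodVision samples get plain cross-entropy, while DataComp samples mix a temperature-softened KL term with cross-entropy. The sketch below spells out that arithmetic for a single sample in plain Python. It is not the training code, and several details are assumptions: that the teacher provides logits, that the KL direction is KL(teacher || student), that both distributions are softened with T=3, that no T² scaling is applied, and that the hard label for a DataComp sample is supplied by the caller. All function names are hypothetical.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Log-probabilities of logits / temperature, computed stably."""
    scaled = [v / temperature for v in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in scaled))
    return [v - log_norm for v in scaled]


def cross_entropy(student_logits, target_index):
    """Hard-label cross-entropy for one sample."""
    return -log_softmax(student_logits)[target_index]


def kl_distillation(student_logits, teacher_logits, temperature=3.0):
    """KL(teacher || student) on temperature-softened distributions.

    Assumption: no T**2 factor, since the model card's formula does not show one.
    """
    log_t = log_softmax(teacher_logits, temperature)
    log_s = log_softmax(student_logits, temperature)
    return sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_t, log_s))


def sample_loss(student_logits, source, hard_label, teacher_logits=None,
                kl_weight=0.7, ce_weight=0.3, temperature=3.0):
    """Per-sample loss: CE for FoodVision, 0.7*KL(T=3) + 0.3*CE for DataComp."""
    ce = cross_entropy(student_logits, hard_label)
    if source == "foodvision":
        return ce
    if source == "datacomp":
        kl = kl_distillation(student_logits, teacher_logits, temperature)
        return kl_weight * kl + ce_weight * ce
    raise ValueError(f"unknown source: {source!r}")


# Made-up logits for illustration only.
print(sample_loss([1.5, -0.5], "foodvision", hard_label=0))
print(sample_loss([1.5, -0.5], "datacomp", hard_label=0, teacher_logits=[2.0, -1.0]))
```

In a real training loop the same structure would normally be written with batched tensor operations and a per-sample mask that selects which formula applies.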

Weight Variants

Three weight files are included, each selected by a different evaluation metric:

| File | Selected by | FV Acc | DC Acc | Blended | Epoch | Use case |
| --- | --- | --- | --- | --- | --- | --- |
| `model.safetensors` (default) | Best blended (50/50) | 98.21% | 92.42% | 95.31% | 3 | Balanced: good at everything |
| `model_best_fv.safetensors` | Best FoodVision test | 98.34% | 92.28% | 95.31% | 5 | On-device Nutrify deployment |
| `model_best_dc.safetensors` | Best DataComp val | 98.21% | 92.42% | 95.31% | 3 | Scale-up filtering (menus, panels, recipes) |
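
The three selection rules can be illustrated with the two epochs reported in the table above. This is a sketch of how such a selection might work, not the project's actual checkpointing code: the function and dictionary names are hypothetical, and the 50/50 blend is assumed to be a simple average of the two accuracies. Because the table values are rounded to two decimals, the blended scores of epochs 3 and 5 differ by only about 0.005 points here, so the real selection presumably relied on unrounded metrics.

```python
# Accuracies (%) copied from the table above; only epochs 3 and 5 are reported.
EPOCH_METRICS = {
    3: {"fv_acc": 98.21, "dc_acc": 92.42},
    5: {"fv_acc": 98.34, "dc_acc": 92.28},
}


def blended(metrics, fv_weight=0.5):
    """Weighted average of FoodVision and DataComp accuracy (50/50 by default)."""
    return fv_weight * metrics["fv_acc"] + (1.0 - fv_weight) * metrics["dc_acc"]


# One scoring function per selection criterion.
SELECTORS = {
    "blended": blended,
    "fv": lambda m: m["fv_acc"],
    "dc": lambda m: m["dc_acc"],
}


def best_epoch(epoch_metrics, criterion):
    """Return the epoch whose metrics score highest under the given criterion."""
    score = SELECTORS[criterion]
    return max(epoch_metrics, key=lambda epoch: score(epoch_metrics[epoch]))


for criterion in SELECTORS:
    print(criterion, best_epoch(EPOCH_METRICS, criterion))
```

On these numbers the blended and DataComp criteria both pick epoch 3 and the FoodVision criterion picks epoch 5, which matches the Epoch column of the table.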

To load a specific variant:

# Default (blended)
weights_path = hf_hub_download("mrdbourke/food-not-food-classifier-siglip2-v2", "model.safetensors")

# Best for Nutrify on-device
weights_path = hf_hub_download("mrdbourke/food-not-food-classifier-siglip2-v2", "model_best_fv.safetensors")

# Best for scale-up filtering
weights_path = hf_hub_download("mrdbourke/food-not-food-classifier-siglip2-v2", "model_best_dc.safetensors")

Related Models

| Version | Repo |
| --- | --- |
| v2 (this) | mrdbourke/food-not-food-classifier-siglip2-v2 |
| v1 | mrdbourke/food-not-food-classifier-siglip2-v1 |

Dataset

Training images come from DataComp-1B-food-and-drink-3M and the Nutrify FoodVision dataset (714K human-labeled images).

License

Apache 2.0
