Models
Datasets
Spaces
Buckets new
Docs
Enterprise
Pricing
- Website
- Community
- Solutions
Log In
Sign Up

Collections

Discover the best community collections!

Collections including paper arxiv:2501.04001

WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens

Paper • 2401.09985 • Published Jan 18, 2024 • 18
CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects

Paper • 2401.09962 • Published Jan 18, 2024 • 9
Inflation with Diffusion: Efficient Temporal Adaptation for Text-to-Video Super-Resolution

Paper • 2401.10404 • Published Jan 18, 2024 • 10
ActAnywhere: Subject-Aware Video Background Generation

Paper • 2401.10822 • Published Jan 19, 2024 • 13

Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems

Paper • 2504.01990 • Published Mar 31, 2025 • 305
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Paper • 2504.10479 • Published Apr 14, 2025 • 311
What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models

Paper • 2503.24235 • Published Mar 31, 2025 • 55
Seedream 3.0 Technical Report

Paper • 2504.11346 • Published Apr 15, 2025 • 70

Object segmentation 🧩

Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

Paper • 2501.04001 • Published Jan 7, 2025 • 49
SAM 2: Segment Anything in Images and Videos

Paper • 2408.00714 • Published Aug 1, 2024 • 123

Image-Video General Tasks

Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

Paper • 2501.04001 • Published Jan 7, 2025 • 49
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token

Paper • 2501.03895 • Published Jan 7, 2025 • 52
An Empirical Study of Autoregressive Pre-training from Videos

Paper • 2501.05453 • Published Jan 9, 2025 • 41
MatchAnything: Universal Cross-Modality Image Matching with Large-Scale Pre-Training

Paper • 2501.07556 • Published Jan 13, 2025 • 7

Sa2VA Model Zoo

Huggingace Model Zoo For Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos By Bytedance Seed CV Research

ByteDance/Sa2VA-4B

Image-Text-to-Text • 4B • Updated Sep 8, 2025 • 1.4k • 99
ByteDance/Sa2VA-8B

Image-Text-to-Text • 8B • Updated Sep 8, 2025 • 2.91k • 66
ByteDance/Sa2VA-1B

Image-Text-to-Text • 1B • Updated Sep 8, 2025 • 892 • 30
ByteDance/Sa2VA-26B

Image-Text-to-Text • 26B • Updated Sep 8, 2025 • 137 • 32

ByteDance/Sa2VA-4B

Image-Text-to-Text • 4B • Updated Sep 8, 2025 • 1.4k • 99
Finnish-NLP/Ahma-2-4B-Instruct

Text Generation • 4B • Updated Nov 25, 2025 • 615 • 4
black-forest-labs/FLUX.2-dev

Image-to-Image • Updated Feb 17 • 284k • • 1.7k
mistralai/Mistral-Large-Instruct-2407

123B • Updated Jul 28, 2025 • 7.39k • 860

Multimodal-Grounding

Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models

Paper • 2501.05767 • Published Jan 10, 2025 • 29
An Empirical Study of Autoregressive Pre-training from Videos

Paper • 2501.05453 • Published Jan 9, 2025 • 41
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

Paper • 2501.04001 • Published Jan 7, 2025 • 49
MemFlow: Flowing Adaptive Memory for Consistent and Efficient Long Video Narratives

Paper • 2512.14699 • Published Dec 16, 2025 • 28

image-segmentation

Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

Paper • 2501.04001 • Published Jan 7, 2025 • 49
facebook/sapiens-seg-1b-torchscript

Image Segmentation • Updated Oct 6, 2024 • 1.46k • 5
sayeed99/segformer_b3_clothes

Image Segmentation • 47.2M • Updated Feb 27, 2024 • 3.24k • • 36
allenai/WildDet3D

Object Detection • Updated Apr 13 • 43 • 41

LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token

Paper • 2501.03895 • Published Jan 7, 2025 • 52
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

Paper • 2501.04001 • Published Jan 7, 2025 • 49
Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives

Paper • 2501.04003 • Published Jan 7, 2025 • 27
VideoRAG: Retrieval-Augmented Generation over Video Corpus

Paper • 2501.05874 • Published Jan 10, 2025 • 75

LinFusion: 1 GPU, 1 Minute, 16K Image

Paper • 2409.02097 • Published Sep 3, 2024 • 34
Phidias: A Generative Model for Creating 3D Content from Text, Image, and 3D Conditions with Reference-Augmented Diffusion

Paper • 2409.11406 • Published Sep 17, 2024 • 27
Diffusion Models Are Real-Time Game Engines

Paper • 2408.14837 • Published Aug 27, 2024 • 126
Segment Anything with Multiple Modalities

Paper • 2408.09085 • Published Aug 17, 2024 • 22

WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens

Paper • 2401.09985 • Published Jan 18, 2024 • 18
CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects

Paper • 2401.09962 • Published Jan 18, 2024 • 9
Inflation with Diffusion: Efficient Temporal Adaptation for Text-to-Video Super-Resolution

Paper • 2401.10404 • Published Jan 18, 2024 • 10
ActAnywhere: Subject-Aware Video Background Generation

Paper • 2401.10822 • Published Jan 19, 2024 • 13

ByteDance/Sa2VA-4B

Image-Text-to-Text • 4B • Updated Sep 8, 2025 • 1.4k • 99
Finnish-NLP/Ahma-2-4B-Instruct

Text Generation • 4B • Updated Nov 25, 2025 • 615 • 4
black-forest-labs/FLUX.2-dev

Image-to-Image • Updated Feb 17 • 284k • • 1.7k
mistralai/Mistral-Large-Instruct-2407

123B • Updated Jul 28, 2025 • 7.39k • 860

Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems

Paper • 2504.01990 • Published Mar 31, 2025 • 305
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Paper • 2504.10479 • Published Apr 14, 2025 • 311
What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models

Paper • 2503.24235 • Published Mar 31, 2025 • 55
Seedream 3.0 Technical Report

Paper • 2504.11346 • Published Apr 15, 2025 • 70

Multimodal-Grounding

Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models

Paper • 2501.05767 • Published Jan 10, 2025 • 29
An Empirical Study of Autoregressive Pre-training from Videos

Paper • 2501.05453 • Published Jan 9, 2025 • 41
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

Paper • 2501.04001 • Published Jan 7, 2025 • 49
MemFlow: Flowing Adaptive Memory for Consistent and Efficient Long Video Narratives

Paper • 2512.14699 • Published Dec 16, 2025 • 28

Object segmentation 🧩

Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

Paper • 2501.04001 • Published Jan 7, 2025 • 49
SAM 2: Segment Anything in Images and Videos

Paper • 2408.00714 • Published Aug 1, 2024 • 123

image-segmentation

Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

Paper • 2501.04001 • Published Jan 7, 2025 • 49
facebook/sapiens-seg-1b-torchscript

Image Segmentation • Updated Oct 6, 2024 • 1.46k • 5
sayeed99/segformer_b3_clothes

Image Segmentation • 47.2M • Updated Feb 27, 2024 • 3.24k • • 36
allenai/WildDet3D

Object Detection • Updated Apr 13 • 43 • 41

Image-Video General Tasks

Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

Paper • 2501.04001 • Published Jan 7, 2025 • 49
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token

Paper • 2501.03895 • Published Jan 7, 2025 • 52
An Empirical Study of Autoregressive Pre-training from Videos

Paper • 2501.05453 • Published Jan 9, 2025 • 41
MatchAnything: Universal Cross-Modality Image Matching with Large-Scale Pre-Training

Paper • 2501.07556 • Published Jan 13, 2025 • 7

LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token

Paper • 2501.03895 • Published Jan 7, 2025 • 52
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

Paper • 2501.04001 • Published Jan 7, 2025 • 49
Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives

Paper • 2501.04003 • Published Jan 7, 2025 • 27
VideoRAG: Retrieval-Augmented Generation over Video Corpus

Paper • 2501.05874 • Published Jan 10, 2025 • 75

Sa2VA Model Zoo

Huggingace Model Zoo For Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos By Bytedance Seed CV Research

ByteDance/Sa2VA-4B

Image-Text-to-Text • 4B • Updated Sep 8, 2025 • 1.4k • 99
ByteDance/Sa2VA-8B

Image-Text-to-Text • 8B • Updated Sep 8, 2025 • 2.91k • 66
ByteDance/Sa2VA-1B

Image-Text-to-Text • 1B • Updated Sep 8, 2025 • 892 • 30
ByteDance/Sa2VA-26B

Image-Text-to-Text • 26B • Updated Sep 8, 2025 • 137 • 32

LinFusion: 1 GPU, 1 Minute, 16K Image

Paper • 2409.02097 • Published Sep 3, 2024 • 34
Phidias: A Generative Model for Creating 3D Content from Text, Image, and 3D Conditions with Reference-Augmented Diffusion

Paper • 2409.11406 • Published Sep 17, 2024 • 27
Diffusion Models Are Real-Time Game Engines

Paper • 2408.14837 • Published Aug 27, 2024 • 126
Segment Anything with Multiple Modalities

Paper • 2408.09085 • Published Aug 17, 2024 • 22

Previous
1
2
Next

Company

TOS Privacy About Careers

Website

Models Datasets Spaces Pricing Docs