Instructions to use keras/videoprism_public_v1_base with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- KerasHub
How to use keras/videoprism_public_v1_base with KerasHub:

    import keras_hub

    # Create a Backbone model unspecialized for any task
    backbone = keras_hub.models.Backbone.from_preset("hf://keras/videoprism_public_v1_base")

- Keras
How to use keras/videoprism_public_v1_base with Keras:

    # Available backend options are: "jax", "torch", "tensorflow".
    import os
    os.environ["KERAS_BACKEND"] = "jax"

    import keras

    model = keras.saving.load_model("hf://keras/videoprism_public_v1_base")

- Notebooks
- Google Colab
- Kaggle
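
Loading the model is only half of the job: a raw video usually has more frames than the encoder consumes. The presets on this card are described as 16-frame, 288x288 models, so a clip has to be subsampled before it is passed in. The sketch below shows one way to pick 16 evenly spaced frames with NumPy. The helper names `sample_frame_indices` and `make_clip` are hypothetical, and the channels-last layout and the scaling to the range 0..1 are assumptions, not confirmed details of the KerasHub preprocessing; resizing to 288x288 is not shown.

```python
import numpy as np


def sample_frame_indices(num_video_frames, num_samples=16):
    # Evenly spaced indices over the whole video, first and last frame included.
    return np.linspace(0, num_video_frames - 1, num=num_samples).round().astype(int)


def make_clip(frames_uint8, num_samples=16):
    # frames_uint8: (num_frames, height, width, 3) array with values in 0..255.
    idx = sample_frame_indices(len(frames_uint8), num_samples)
    # Assumption: the encoder expects float pixels scaled to 0..1.
    clip = frames_uint8[idx].astype("float32") / 255.0
    # Add a leading batch axis: (1, num_samples, height, width, 3).
    return clip[np.newaxis]


# Stand-in for a decoded 40-frame video that is already 288x288.
video = np.random.default_rng(0).integers(0, 256, size=(40, 288, 288, 3), dtype=np.uint8)
clip = make_clip(video)
```

If the source video has fewer than 16 frames, this sampling repeats some indices rather than failing, which may or may not be what a given pipeline wants.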
Model Overview
Model Summary
VideoPrism is a family of foundation video encoders from Google Research, designed as a general-purpose "prism" for understanding the many facets of video content. The models are pre-trained on 36 million high-quality video-caption pairs and 582 million video clips, and they target a wide range of video understanding tasks, including classification, localization, retrieval, and captioning. VideoPrism models use a Vision Transformer (ViT) architecture and are pre-trained with a combination of video-text contrastive learning and masked video modeling. Together, these two objectives let the model capture both global semantic meaning and fine-grained spatio-temporal detail, which makes it a strong backbone for downstream video applications.
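
To make the two pre-training objectives more concrete, here is a minimal NumPy sketch of a generic symmetric video-text contrastive loss and of random token masking. It is not VideoPrism's training code: the function names, the temperature of 0.07, and the 0.75 mask ratio are illustrative assumptions, and the real masked-modeling stage involves more than choosing which tokens to hide.

```python
import numpy as np


def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def symmetric_contrastive_loss(video_emb, text_emb, temperature=0.07):
    # video_emb, text_emb: (batch, dim); row i of each is a matched pair.
    # The temperature value is an assumption chosen for illustration.
    v = l2_normalize(video_emb)
    t = l2_normalize(text_emb)
    logits = v @ t.T / temperature  # (batch, batch) scaled cosine similarities
    diag = np.arange(len(logits))
    video_to_text = -log_softmax(logits, axis=1)[diag, diag].mean()
    text_to_video = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (video_to_text + text_to_video)


def random_token_mask(num_tokens, mask_ratio, rng):
    # Boolean mask with round(num_tokens * mask_ratio) entries set to True.
    num_masked = int(round(num_tokens * mask_ratio))
    mask = np.zeros(num_tokens, dtype=bool)
    mask[rng.choice(num_tokens, size=num_masked, replace=False)] = True
    return mask


rng = np.random.default_rng(0)
video = rng.normal(size=(4, 8))
# Nearly identical pairs should score a much lower loss than mismatched pairs.
aligned_loss = symmetric_contrastive_loss(video, video + 0.01 * rng.normal(size=(4, 8)))
shuffled_loss = symmetric_contrastive_loss(video, video[::-1])
mask = random_token_mask(num_tokens=16, mask_ratio=0.75, rng=rng)
```

The contrastive term only compares one pooled embedding per clip and caption, while the mask operates on individual tokens, which is one way to see why the two objectives are described as complementary.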
Links
- VideoPrism Technical Paper
- VideoPrism API Documentation
- VideoPrism on Hugging Face
- KerasHub Beginner Guide
- KerasHub Model Publishing Guide
Installation
Keras and KerasHub can be installed with:
pip install -U -q keras-hub
pip install -U -q "keras>=3"
JAX, TensorFlow, and Torch come pre-installed in Kaggle Notebooks. For instructions on installing them in another environment, see the Keras Getting Started page.
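
Outside Kaggle it can be useful to check which backends are actually installed before choosing one. The following is a small sketch using only the standard library; `available_backends` and `pick_backend` are hypothetical helper names, and the only Keras-specific detail relied on is the `KERAS_BACKEND` environment variable used earlier in this card.

```python
import importlib.util
import os


def available_backends(candidates=("jax", "torch", "tensorflow")):
    # Map each backend name to whether its package can be located,
    # without importing it.
    return {name: importlib.util.find_spec(name) is not None for name in candidates}


def pick_backend(preferred="jax"):
    # Return the preferred backend if installed, otherwise the first
    # installed candidate, otherwise None.
    found = available_backends()
    if found.get(preferred):
        return preferred
    return next((name for name, ok in found.items() if ok), None)


backend = pick_backend()
if backend is not None:
    # KERAS_BACKEND has to be set before `import keras` runs.
    os.environ["KERAS_BACKEND"] = backend
```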
Presets
The following model checkpoints are provided by the Keras team. For the Video-Text (LvT) variants, both the video encoder and the text encoder are provided to enable multimodal tasks like zero-shot retrieval.
| Preset name | Parameters | Description |
|---|---|---|
| videoprism_public_v1_base | 114.00M | 114 million parameter, 12-layer ViT-B, 16-frame, 288x288 resolution, video-only encoder for spatio-temporal representation. |
| videoprism_public_v1_large | 354.00M | 354 million parameter, 24-layer ViT-L, 16-frame, 288x288 resolution, video-only encoder for spatio-temporal representation. |
| videoprism_lvt_public_v1_base | 248.00M | 248 million parameter, 12-layer ViT-B video encoder + text encoder, 16-frame, 288x288 resolution, for multimodal video-language tasks. |
| videoprism_lvt_public_v1_large | 580.00M | 580 million parameter, 24-layer ViT-L video encoder + text encoder, 16-frame, 288x288 resolution, for multimodal video-language tasks. |
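
For the LvT presets, zero-shot retrieval typically comes down to comparing a text embedding against a set of video embeddings. The sketch below assumes both encoders have already produced fixed-size vectors and ranks videos by cosine similarity. The `rank_videos` helper and the tiny 3-dimensional stand-in embeddings are hypothetical; real embeddings would come from the video and text encoders of an LvT preset.

```python
import numpy as np


def rank_videos(text_embedding, video_embeddings):
    # Return video indices sorted from most to least similar, plus raw scores.
    t = text_embedding / np.linalg.norm(text_embedding)
    v = video_embeddings / np.linalg.norm(video_embeddings, axis=1, keepdims=True)
    scores = v @ t  # cosine similarity of each video to the query
    order = np.argsort(-scores)
    return order, scores


# Stand-in embeddings: three "videos" and one "text query".
video_embeddings = np.array(
    [
        [1.0, 0.0, 0.0],
        [0.0, 1.0, 0.0],
        [0.7, 0.7, 0.0],
    ]
)
text_embedding = np.array([0.9, 0.1, 0.0])

order, scores = rank_videos(text_embedding, video_embeddings)
print(order.tolist())  # → [0, 2, 1]
```

Normalizing both sides first means the ranking depends only on direction, not on embedding magnitude.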