Model Overview

Model Summary

VideoPrism is a family of foundational video encoder models from Google Research, designed to act as a universal "prism" for the many facets of video content. The models are pre-trained on 36 million high-quality video-caption pairs and 582 million video clips, and they target a wide range of video understanding tasks, including classification, localization, retrieval, and captioning. VideoPrism uses a Vision Transformer (ViT) architecture and combines two pre-training objectives: video-text contrastive learning and masked video modeling. The contrastive objective captures global semantic meaning, while masked modeling captures fine-grained spatio-temporal detail, which makes the encoder a strong backbone for downstream video applications.
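To make the contrastive half of that recipe concrete, here is a minimal sketch of a symmetric video-text contrastive loss in plain Python. It is an illustration of the general technique, not the VideoPrism training code: the function names, the toy embeddings, and the temperature of 0.07 are all assumptions chosen for the example.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def log_softmax_at(scores, index):
    """Log-probability of scores[index] under a softmax over scores."""
    peak = max(scores)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return scores[index] - log_sum


def symmetric_contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Average of video-to-text and text-to-video cross-entropy.

    Pair i of video_embs and text_embs is treated as the positive match;
    every other pairing in the batch is a negative. The temperature is an
    assumed value, not one taken from the model card.
    """
    videos = [normalize(v) for v in video_embs]
    texts = [normalize(t) for t in text_embs]
    # logits[i][j]: temperature-scaled cosine similarity of video i, caption j.
    logits = [
        [sum(a * b for a, b in zip(v, t)) / temperature for t in texts]
        for v in videos
    ]
    n = len(logits)
    video_to_text = -sum(log_softmax_at(logits[i], i) for i in range(n)) / n
    columns = [list(col) for col in zip(*logits)]
    text_to_video = -sum(log_softmax_at(columns[j], j) for j in range(n)) / n
    return 0.5 * (video_to_text + text_to_video)


# Toy embeddings: three mutually orthogonal "videos".
video_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned_loss = symmetric_contrastive_loss(video_embs, video_embs)
misaligned_loss = symmetric_contrastive_loss(video_embs, video_embs[::-1])
print(aligned_loss < misaligned_loss)
```

When each caption embedding matches its own video, the loss is close to zero; reversing the caption order pairs most videos with the wrong caption and the loss rises sharply. Masked video modeling is a separate objective and is not covered by this sketch.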

Installation

Keras and KerasHub can be installed with:

pip install -U -q keras-hub
pip install -U -q "keras>=3"

JAX, TensorFlow, and Torch come pre-installed in Kaggle Notebooks. For instructions on installing them in another environment, see the Keras Getting Started page.
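Keras 3 reads the `KERAS_BACKEND` environment variable when it is first imported, so the backend has to be chosen before `import keras` runs. The sketch below shows one way to do that; `select_keras_backend` is a hypothetical helper written for this example, and the list of allowed names is limited to the three backends mentioned above.

```python
import os

# Backends named in this model card; this tuple is an assumption of the
# example, not an exhaustive list of what Keras supports.
SUPPORTED_BACKENDS = ("jax", "tensorflow", "torch")


def select_keras_backend(name):
    """Set KERAS_BACKEND. Call this before the first `import keras`."""
    if name not in SUPPORTED_BACKENDS:
        raise ValueError(
            f"Unknown backend {name!r}; expected one of {SUPPORTED_BACKENDS}"
        )
    os.environ["KERAS_BACKEND"] = name
    return name


selected = select_keras_backend("jax")
print(os.environ["KERAS_BACKEND"])
```

Once Keras has been imported in a process, changing the variable has no effect; restart the kernel or interpreter to switch backends.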

Presets

The following model checkpoints are provided by the Keras team. For the Video-Text (LvT) variants, both the video encoder and the text encoder are provided to enable multimodal tasks like zero-shot retrieval.

Preset name	Parameters	Description
`videoprism_public_v1_base`	114.00M	114 million parameter, 12-layer ViT-B, 16-frame, 288x288 resolution, video-only encoder for spatio-temporal representation.
`videoprism_public_v1_large`	354.00M	354 million parameter, 24-layer ViT-L, 16-frame, 288x288 resolution, video-only encoder for spatio-temporal representation.
`videoprism_lvt_public_v1_base`	248.00M	248 million parameter, 12-layer ViT-B video encoder + text encoder, 16-frame, 288x288 resolution, for multimodal video-language tasks.
`videoprism_lvt_public_v1_large`	580.00M	580 million parameter, 24-layer ViT-L video encoder + text encoder, 16-frame, 288x288 resolution, for multimodal video-language tasks.
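Two pieces of glue code typically surround these presets: sampling a fixed number of frames from a clip, and, for the LvT variants, ranking captions against a video embedding for zero-shot retrieval. The sketch below illustrates both in plain Python. It does not call the model: the helper names are hypothetical, the embeddings are placeholder values standing in for encoder outputs, and the channels-last input layout is an assumption rather than something stated in this card.

```python
import math

NUM_FRAMES = 16   # frames per clip, from the preset table
FRAME_SIZE = 288  # height and width in pixels, from the preset table


def uniform_frame_indices(total_frames, num_samples=NUM_FRAMES):
    """Pick num_samples frame indices spread evenly over a clip.

    Each index is the midpoint of one of num_samples equal segments.
    Clips shorter than num_samples repeat frames.
    """
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    return [
        min(total_frames - 1, int((i + 0.5) * total_frames / num_samples))
        for i in range(num_samples)
    ]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_captions(video_embedding, caption_embeddings):
    """Return caption indices ordered from most to least similar."""
    scores = [cosine_similarity(video_embedding, c) for c in caption_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Frame sampling for a 160-frame clip.
indices = uniform_frame_indices(total_frames=160)

# Assumed channels-last layout: (batch, frames, height, width, channels).
assumed_input_shape = (1, NUM_FRAMES, FRAME_SIZE, FRAME_SIZE, 3)

# Placeholder embeddings; real ones would come from the video and text encoders.
video_embedding = [0.9, 0.1, 0.0]
caption_embeddings = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
ranking = rank_captions(video_embedding, caption_embeddings)
print(indices[:3], ranking)
```

Midpoint sampling avoids always picking the very first frame, which is often a fade-in or title card; other sampling strategies are equally valid.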
