Video Understanding

BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding

Large video-language models (VLMs) have demonstrated promising progress in various video understanding tasks. However, their effectiveness in long-form video analysis is …

Shuming Liu

SMILE: Infusing Spatial and Motion Semantics in Masked Video Learning

Masked video modeling, such as VideoMAE, is an effective paradigm for video self-supervised learning (SSL). However, they are primarily based on reconstructing pixel-level details …

Fida Mohammad Thoker

SEVERE++: Evaluating Benchmark Sensitivity in Generalization of Video Representation Learning

Continued advances in self-supervised learning have led to significant progress in video representation learning, offering a scalable alternative to supervised approaches by …

Fida Mohammad Thoker

OpenTAD: A Unified Framework and Comprehensive Study of Temporal Action Detection

Temporal action detection (TAD) is a fundamental video understanding task that aims to identify human actions and localize their temporal boundaries in videos. Although this field …

Shuming Liu

Effectiveness of Max-Pooling for Fine-Tuning CLIP on Videos

CLIP is a powerful spatial feature extractor trained on a large dataset of image-text pairs. It exhibits strong generalization when extended to other domains and modalities. …

Fatimah Zohra

End-to-End Temporal Action Detection with 1B Parameters Across 1000 Frames

Recently, temporal action detection (TAD) has seen significant performance improvement with end-to-end training. However, due to the memory bottleneck, only models with limited …

Shuming Liu