BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding

Large video-language models (VLMs) have demonstrated promising progress in various video understanding tasks. However, their effectiveness in long-form video analysis is …

Shuming Liu

OSMamba: Omnidirectional Spectral Mamba with Dual-Domain Prior Generator for Exposure Correction

Exposure correction is a fundamental problem in computer vision and image processing. Recently, frequency domain-based methods have achieved impressive improvements, yet they still …

Gehui Li

SMILE: Infusing Spatial and Motion Semantics in Masked Video Learning

Masked video modeling, such as VideoMAE, is an effective paradigm for video self-supervised learning (SSL). However, these methods are primarily based on reconstructing pixel-level details …

Fida Mohammad Thoker

SEVERE++: Evaluating Benchmark Sensitivity in Generalization of Video Representation Learning

Continued advances in self-supervised learning have led to significant progress in video representation learning, offering a scalable alternative to supervised approaches by …

Fida Mohammad Thoker

OpenTAD: A Unified Framework and Comprehensive Study of Temporal Action Detection

Temporal action detection (TAD) is a fundamental video understanding task that aims to identify human actions and localize their temporal boundaries in videos. Although this field …

Shuming Liu

Effectiveness of Max-Pooling for Fine-Tuning CLIP on Videos

CLIP is a powerful spatial feature extractor trained on a large dataset of image-text pairs. It exhibits strong generalization when extended to other domains and modalities. …

Fatimah Zohra

Ego4D: Around the World in 3,000 Hours of Egocentric Video

We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, …

Chen Zhao

Towards Automated Movie Trailer Generation

Movie trailers are an essential tool for promoting films and attracting audiences. However, the process of creating trailers can be time-consuming and expensive. To streamline this …

Dawit Mureja Argaw

Dr²Net: Dynamic Reversible Dual-Residual Networks for Memory-Efficient Finetuning

Large pretrained models are increasingly crucial in modern computer vision tasks. These models are typically used in downstream tasks by end-to-end finetuning, which is highly …

Chen Zhao

Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously captured egocentric and exocentric …

Chen Zhao