Bernard ghanem

BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding featured image

BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding

Large video-language models (VLMs) have demonstrated promising progress in various video understanding tasks. However, their effectiveness in long-form video analysis is …

Shuming liu
SMILE: Infusing Spatial and Motion Semantics in Masked Video Learning featured image

SMILE: Infusing Spatial and Motion Semantics in Masked Video Learning

Masked video modeling, such as VideoMAE, is an effective paradigm for video self-supervised learning (SSL). However, they are primarily based on reconstructing pixellevel details …

Fida mohammad thoker
SEVERE++: Evaluating Benchmark Sensitivity in Generalization of Video Representation Learning featured image

SEVERE++: Evaluating Benchmark Sensitivity in Generalization of Video Representation Learning

Continued advances in self-supervised learning have led to significant progress in video representation learning, offering a scalable alternative to supervised approaches by …

Fida mohammad thoker
OpenTAD: A Unified Framework and Comprehensive Study of Temporal Action Detection featured image

OpenTAD: A Unified Framework and Comprehensive Study of Temporal Action Detection

Temporal action detection (TAD) is a fundamental video understanding task that aims to identify human actions and localize their temporal boundaries in videos. Although this field …

Shuming liu
Effectiveness of Max-Pooling for Fine-Tuning CLIP on Videos featured image

Effectiveness of Max-Pooling for Fine-Tuning CLIP on Videos

CLIP is a powerful spatial feature extractor trained on a large dataset of image-text pairs. It exhibits strong generalization when extended to other domains and modalities. …

Fatimah zohra
Towards Automated Movie Trailer Generation featured image

Towards Automated Movie Trailer Generation

Movie trailers are an essential tool for promoting films and attracting audiences. However the process of creating trailers can be time-consuming and expensive. To streamline this …

Dawit mureja argaw
Dr2Net: Dynamic Reversible Dual-Residual Networks for Memory-Efficient Finetuning featured image

Dr<sup>2</sup>Net: Dynamic Reversible Dual-Residual Networks for Memory-Efficient Finetuning

Large pretrained models are increasingly crucial in modern computer vision tasks. These models are typically used in downstream tasks by end-to-end finetuning, which is highly …

avatar
Chen Zhao
End-to-End Temporal Action Detection with 1B Parameters Across 1000 Frames featured image

End-to-End Temporal Action Detection with 1B Parameters Across 1000 Frames

Recently, temporal action detection (TAD) has seen significant performance improvement with end-to-end training. However, due to the memory bottleneck, only models with limited …

Shuming liu
Re2TAL: Rewiring Pretrained Video Backbones for Reversible Temporal Action Localization featured image

Re<sup>2</sup>TAL: Rewiring Pretrained Video Backbones for Reversible Temporal Action Localization

Temporal action localization (TAL) requires long-form reasoning to predict actions of various durations and complex content. Given limited GPU memory, training TAL end to end …

avatar
Chen Zhao
FreeDoM: Training-Free Energy-Guided Conditional Diffusion Model featured image

FreeDoM: Training-Free Energy-Guided Conditional Diffusion Model

Recently, conditional diffusion models have gained popularity in numerous applications due to their exceptional generation ability. However, many existing methods are …

Jiwen yu