BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding

Large video-language models (VLMs) have demonstrated promising progress in various video understanding tasks. However, their effectiveness in long-form video analysis is …

Shuming Liu

OSMamba: Omnidirectional Spectral Mamba with Dual-Domain Prior Generator for Exposure Correction

Exposure correction is a fundamental problem in computer vision and image processing. Recently, frequency domain-based methods have achieved impressive improvements, yet they still …

Gehui Li

SMILE: Infusing Spatial and Motion Semantics in Masked Video Learning

Masked video modeling, such as VideoMAE, is an effective paradigm for video self-supervised learning (SSL). However, these methods are primarily based on reconstructing pixel-level details …

Fida Mohammad Thoker

SEVERE++: Evaluating Benchmark Sensitivity in Generalization of Video Representation Learning

Continued advances in self-supervised learning have led to significant progress in video representation learning, offering a scalable alternative to supervised approaches by …

Fida Mohammad Thoker

OpenTAD: A Unified Framework and Comprehensive Study of Temporal Action Detection

Temporal action detection (TAD) is a fundamental video understanding task that aims to identify human actions and localize their temporal boundaries in videos. Although this field …

Shuming Liu

Effectiveness of Max-Pooling for Fine-Tuning CLIP on Videos

CLIP is a powerful spatial feature extractor trained on a large dataset of image-text pairs. It exhibits strong generalization when extended to other domains and modalities. …

Fatimah Zohra

Ego4D: Around the World in 3,000 Hours of Egocentric Video

We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, …

Chen Zhao

Towards Automated Movie Trailer Generation

Movie trailers are an essential tool for promoting films and attracting audiences. However, the process of creating trailers can be time-consuming and expensive. To streamline this …

Dawit Mureja Argaw

Dr²Net: Dynamic Reversible Dual-Residual Networks for Memory-Efficient Finetuning

Large pretrained models are increasingly crucial in modern computer vision tasks. These models are typically used in downstream tasks by end-to-end finetuning, which is highly …

Chen Zhao

Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously captured egocentric and exocentric …

Chen Zhao