Fatimah Zohra

OpenTAD: A Unified Framework and Comprehensive Study of Temporal Action Detection

Temporal action detection (TAD) is a fundamental video understanding task that aims to identify human actions and localize their temporal boundaries in videos. Although this field …

Shuming Liu

Effectiveness of Max-Pooling for Fine-Tuning CLIP on Videos

CLIP is a powerful spatial feature extractor trained on a large dataset of image-text pairs. It exhibits strong generalization when extended to other domains and modalities. …

Fatimah Zohra

Dr<sup>2</sup>Net: Dynamic Reversible Dual-Residual Networks for Memory-Efficient Finetuning

Large pretrained models are increasingly crucial in modern computer vision tasks. These models are typically used in downstream tasks by end-to-end finetuning, which is highly …

Chen Zhao