Silvio giancola

OpenTAD: A Unified Framework and Comprehensive Study of Temporal Action Detection featured image

OpenTAD: A Unified Framework and Comprehensive Study of Temporal Action Detection

Temporal action detection (TAD) is a fundamental video understanding task that aims to identify human actions and localize their temporal boundaries in videos. Although this field …

Shuming liu
EgoLoc: Revisiting 3D Object Localization from Egocentric Videos with Visual Queries featured image

EgoLoc: Revisiting 3D Object Localization from Egocentric Videos with Visual Queries

With the recent advances in video and 3D understanding, novel 4D spatio-temporal methods fusing both concepts have emerged. Towards this direction, the Ego4D Episodic Memory …

Jinjie mai
MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions featured image

MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions

The recent and increasing interest in video-language research has driven the development of large-scale datasets that enable data-intensive machine learning techniques. In …

Mattia soldan