
End-to-End Active Speaker Detection

Recent advances in the Active Speaker Detection (ASD) problem build upon a two-stage process -- feature extraction and spatio-temporal context aggregation. In this paper, we …

Juan Leon Alcazar

When NAS Meets Trees: An Efficient Algorithm for Neural Architecture Search

The key challenge in neural architecture search (NAS) is designing how to explore wisely in the huge search space. We propose a new NAS method called TNAS (NAS with trees), which …

Guocheng Qian

MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions

The recent and increasing interest in video-language research has driven the development of large-scale datasets that enable data-intensive machine learning techniques. In …

Mattia Soldan

Ego4D: Around the World in 3,000 Hours of Egocentric Video

We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, …

Chen Zhao

SegTAD: Precise Temporal Action Detection via Semantic Segmentation

Temporal action detection (TAD) is an important yet challenging task in video analysis. Most existing works draw inspiration from image object detection and tend to reformulate it …

Chen Zhao

Video Self‑Stitching Graph Network for Temporal Action Localization

Short actions are critical and challenging in the task of action localization. We target this problem and propose a video self-stitching graph network (VSGN), which enhances …

Chen Zhao

ThumbNet: One Thumbnail Image Contains All You Need for Recognition

We tackle the problem of network compression and acceleration from a novel perspective: enabling inference on thumbnail images without compromising accuracy. We propose supervised image …

Chen Zhao

G‑TAD: Sub‑Graph Localization for Temporal Action Detection

Temporal action detection is a fundamental yet challenging task in video understanding. Video context is a critical cue to effectively detect actions, but current works mainly …

Mengmeng Xu