What is ACAV100M?

We present an automated curation pipeline for audio-visual representation learning. We formulate an optimization problem where the goal is to find a subset that maximizes the mutual information between audio and visual channels of videos. This helps us find a subset with high audio-visual correspondence, which could be useful for self-supervised audio-visual representation learning.
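As a rough sketch, the subset selection objective described above can be written as follows. The symbols are illustrative assumptions, not notation taken from this page: D is the full pool of candidate clips, M is the target subset size, A_X and V_X are the audio and visual variables over the clips in X, and the second expression is the standard discrete form of mutual information.

```latex
X^{*} = \operatorname*{arg\,max}_{X \subseteq D,\; |X| = M} \; I(A_X; V_X),
\qquad
I(A; V) = \sum_{a, v} p(a, v) \, \log \frac{p(a, v)}{p(a)\, p(v)}
```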

Using our approach, we created datasets at varying scales from a large collection of unlabeled videos at an unprecedented scale: we process 140 million full-length videos (total duration 1,030 years) and produce a dataset of 100 million 10-second clips (31 years) with high audio-visual correspondence. This is two orders of magnitude larger than the current largest video dataset used in the audio-visual learning literature, i.e., AudioSet (8 months), and twice as large as the largest video dataset in the literature, i.e., HowTo100M (15 years).
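The quoted total duration can be sanity-checked with a few lines of arithmetic. This is only an illustrative check; the function name and the 365.25-day year are assumptions made here.

```python
# Convert a number of fixed-length clips into years of footage.
# Assumes a 365.25-day year; the helper name is hypothetical.
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def clips_to_years(num_clips, clip_seconds=10):
    """Return the total duration of num_clips clips, in years."""
    return num_clips * clip_seconds / SECONDS_PER_YEAR


if __name__ == "__main__":
    years = clips_to_years(100_000_000)
    print(f"{years:.1f} years")  # about 31.7, consistent with the ~31 years quoted
```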

Statistics

ACAV100M is an automatically curated dataset of 10-second clips with high audio-visual correspondence. We provide the categorization results of the clips using pretrained audio (AudioSet), video (Kinetics400) and image (ImageNet) classifiers.

High Level Audio Labels (AudioSet)

Audio Labels (AudioSet)

Video Labels (Kinetics 400)

Image Labels (ImageNet)

Downloads

We provide five curated datasets of different scales; all the videos are 10 seconds long. The datasets are automatically constructed using our clustering-based approach with 500 cluster centroids.
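To give a feel for how a clustering-based score could work, here is a minimal sketch that estimates mutual information from discrete cluster assignments. It assumes each clip has already been assigned one audio cluster ID and one visual cluster ID; the function name and the plug-in estimator are illustrative choices made here and are not confirmed to match the released pipeline.

```python
import math
from collections import Counter


def mutual_information(audio_ids, visual_ids):
    """Plug-in mutual information estimate (in nats) between two
    equal-length sequences of cluster IDs, one pair per clip."""
    if len(audio_ids) != len(visual_ids):
        raise ValueError("sequences must have the same length")
    n = len(audio_ids)
    if n == 0:
        return 0.0

    joint = Counter(zip(audio_ids, visual_ids))  # co-occurrence counts
    audio_counts = Counter(audio_ids)            # marginal counts (audio)
    visual_counts = Counter(visual_ids)          # marginal counts (visual)

    mi = 0.0
    for (a, v), c in joint.items():
        # p(a, v) / (p(a) p(v)) simplifies to c * n / (count_a * count_v)
        mi += (c / n) * math.log(c * n / (audio_counts[a] * visual_counts[v]))
    return mi


if __name__ == "__main__":
    # Audio and visual clusters that always co-occur: high correspondence.
    print(mutual_information([0, 0, 1, 1], [5, 5, 7, 7]))
    # Audio and visual clusters that are independent: no correspondence.
    print(mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))
```

A subset whose audio and visual cluster IDs co-occur consistently scores higher than one where they are unrelated, which is the intuition behind selecting clips with high audio-visual correspondence.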

Each dataset lists the YouTube ID and start second of every clip in CSV format:

# YouTube ID, start second
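A file in this format could be read along the following lines. This is a minimal sketch: the function name is hypothetical, the sample IDs are placeholders rather than real entries, and it assumes that lines starting with `#` are comments and that every clip lasts 10 seconds from its start second.

```python
import csv
import io

CLIP_SECONDS = 10  # all clips are 10 seconds long


def read_clips(lines):
    """Parse 'YouTube ID, start second' rows into
    (video_id, start, end) tuples, skipping blank and '#' lines."""
    clips = []
    for row in csv.reader(lines):
        if not row or row[0].lstrip().startswith("#"):
            continue
        video_id = row[0].strip()
        start = float(row[1])
        clips.append((video_id, start, start + CLIP_SECONDS))
    return clips


if __name__ == "__main__":
    # Placeholder IDs for illustration only.
    sample = io.StringIO(
        "# YouTube ID, start second\n"
        "aaaaaaaaaaa, 30\n"
        "bbbbbbbbbbb,0\n"
    )
    for video_id, start, end in read_clips(sample):
        print(video_id, start, end)
```

The same function accepts an open file handle in place of the in-memory sample.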

Number of video clips (total duration)

Data curation pipeline and evaluation scripts

Publication

ACAV100M: Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning

Sangho Lee*, Jiwan Chung*, Youngjae Yu, Gunhee Kim, Thomas Breuel, Gal Chechik, Yale Song

ICCV 2021

@inproceedings{lee2021acav100m,
   title={ACAV100M: Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning},
   author={Sangho Lee and Jiwan Chung and Youngjae Yu and Gunhee Kim and Thomas Breuel and Gal Chechik and Yale Song},
   booktitle={ICCV},
   year=2021
}