Patent · US Active

Self-supervised audio-visual learning for correlating music and video

US12340563B2 · kind B2 · utility

Cited by	0

References	1

Claims	18

Family size	0

Assignee

Adobe Inc. · US

Inventors

Justin Salamon · San Francisco, US
Bryan Russell · San Francisco, US
Didac Suris Coll-Vinent · New York, US

Key dates

Filing date	May 11, 2022
Grant date	Jun 24, 2025
Priority date	—
Expiry date	Jul 8, 2043

Classification

Technology area (CPC G)	Physics
CPC primary	G10L25/57
WIPO field	Computer technology
WIPO sector	Electrical engineering

Abstract

Embodiments are disclosed for correlating video sequences and audio sequences by a media recommendation system using a trained encoder network. In particular, in one or more embodiments, the disclosed systems and methods comprise: receiving a training input including a media sequence, including a video sequence paired with an audio sequence; segmenting the media sequence into a set of video sequence segments and a set of audio sequence segments; extracting visual features for each video sequence segment and audio features for each audio sequence segment; generating, by transformer networks, contextualized visual features from the extracted visual features and contextualized audio features from the extracted audio features, the transformer networks including a visual transformer and an audio transformer; generating predicted video and audio sequence segment pairings based on the contextualized visual and audio features; and training the visual transformer and the audio transformer to generate the contextualized visual and audio features.
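
The data flow in the abstract (segment, extract features, contextualize, predict pairings) can be illustrated with a small, dependency-free Python sketch. Everything in it is an assumption made for illustration and is not taken from the patent: the function names, the toy mean/standard-deviation features, the parameter-free attention step standing in for the visual and audio transformers, cosine similarity as the pairing score, and the contrastive cross-entropy objective. The sketch only computes a loss value; it does not perform gradient-based training.

```python
import math
from typing import List, Sequence

Vector = List[float]


def segment(sequence: Sequence[float], segment_len: int) -> List[List[float]]:
    """Split a sequence into consecutive, non-overlapping segments (short tail dropped)."""
    count = len(sequence) // segment_len
    return [list(sequence[i * segment_len:(i + 1) * segment_len]) for i in range(count)]


def extract_features(seg: Sequence[float]) -> Vector:
    """Hypothetical feature extractor: mean and population standard deviation."""
    mean = sum(seg) / len(seg)
    variance = sum((x - mean) ** 2 for x in seg) / len(seg)
    return [mean, math.sqrt(variance)]


def contextualize(features: List[Vector]) -> List[Vector]:
    """Toy stand-in for a transformer: one parameter-free self-attention step
    with a residual connection, so each segment's vector depends on the others."""
    contextualized = []
    for query in features:
        scores = [sum(a * b for a, b in zip(query, key)) for key in features]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        attended = [
            sum(w * value[d] for w, value in zip(weights, features)) / total
            for d in range(len(query))
        ]
        contextualized.append([x + a for x, a in zip(query, attended)])
    return contextualized


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_pairings(visual: List[Vector], audio: List[Vector]) -> List[int]:
    """For each video segment, return the index of the most similar audio segment."""
    return [max(range(len(audio)), key=lambda j: cosine(v, audio[j])) for v in visual]


def pairing_loss(visual: List[Vector], audio: List[Vector], temperature: float = 0.1) -> float:
    """Assumed objective: cross-entropy that rewards video segment i for scoring
    highest against audio segment i (its original partner in the media sequence)."""
    loss = 0.0
    for i, v in enumerate(visual):
        logits = [cosine(v, a) / temperature for a in audio]
        top = max(logits)
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        loss += log_z - logits[i]
    return loss / len(visual)


# Toy "media sequence": a video track and its paired audio track as plain numbers.
video_track = [0.1 * i for i in range(12)]
audio_track = [0.2 * i + 1.0 for i in range(12)]

visual_ctx = contextualize([extract_features(s) for s in segment(video_track, 4)])
audio_ctx = contextualize([extract_features(s) for s in segment(audio_track, 4)])

pairs = predict_pairings(visual_ctx, audio_ctx)
loss = pairing_loss(visual_ctx, audio_ctx)
```

In a trained system the two contextualizing functions would have learnable parameters, and a loss of this general kind would be minimized so that correctly paired segments score higher than mismatched ones. Because the pairing labels come from the media sequence itself, no manual annotation is needed, which is the sense of "self-supervised" in the title.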

Source: USPTO / EPO open patent data. Objective bibliographic data and citation counts.