Patent · US Active

Representation learning from video with spatial audio

US11308329B2 · kind B2 · utility

0 Cited by
0 References
18 Claims
0 Family size

Assignee

Inventors

Key dates

Filing date: May 7, 2020
Grant date: Apr 19, 2022
Priority date
Expiry date: Aug 4, 2040

Classification

  • Technology area (CPC H): Electricity
  • CPC primary: H04S2420/11
  • WIPO field: Computer technology
  • WIPO sector: Electrical engineering

Abstract

A computer system is trained to understand audio-visual spatial correspondence using audio-visual clips having multi-channel audio. The computer system includes an audio subnetwork, video subnetwork, and pretext subnetwork. The audio subnetwork receives the two channels of audio from the audio-visual clips, and the video subnetwork receives the video frames from the audio-visual clips. In a subset of the audio-visual clips the audio-visual spatial relationship is misaligned, causing the audio-visual spatial cues for the audio and video to be incorrect. The audio subnetwork outputs an audio feature vector for each audio-visual clip, and the video subnetwork outputs a video feature vector for each audio-visual clip. The audio and video feature vectors for each audio-visual clip are merged and provided to the pretext subnetwork, which is configured to classify the merged vector as either having a misaligned audio-visual spatial relationship or not. The subnetworks are trained based on the loss calculated from the classification.
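
The training setup described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The abstract does not say how the misaligned clips are produced, so the sketch assumes misalignment is simulated by swapping the two audio channels. It also assumes the feature vectors are merged by concatenation, and it stands in a single logistic unit for the pretext subnetwork. All function names and parameters (make_pretext_batch, merge, classify, bce_loss, misalign_fraction) are hypothetical.

```python
import math
import random


def make_pretext_batch(clips, misalign_fraction=0.5, seed=0):
    """Build (frames, (ch_a, ch_b), label) triples; label 1 means misaligned.

    ASSUMPTION: misalignment is simulated by swapping the two audio
    channels. The abstract only states that a subset is misaligned.
    """
    rng = random.Random(seed)
    n_swap = int(len(clips) * misalign_fraction)
    swap_ids = set(rng.sample(range(len(clips)), n_swap))
    batch = []
    for i, (frames, left, right) in enumerate(clips):
        if i in swap_ids:
            batch.append((frames, (right, left), 1))  # channels swapped
        else:
            batch.append((frames, (left, right), 0))  # left untouched
    return batch


def merge(audio_vec, video_vec):
    """ASSUMPTION: merge the two feature vectors by concatenation."""
    return list(audio_vec) + list(video_vec)


def classify(merged, weights, bias):
    """Stand-in pretext classifier: one logistic unit over the merged vector."""
    z = sum(w * x for w, x in zip(weights, merged)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def bce_loss(prob, label, eps=1e-12):
    """Binary cross-entropy between predicted probability and 0/1 label."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))
```

In a full system the audio and video feature vectors would come from trainable subnetworks, and the loss would be backpropagated through all three subnetworks. Here the vectors are simply passed in, so the sketch shows only the data labeling, the merge step, and the classification loss.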

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.