Representation learning from video with spatial audio
US11308329B2 · kind B2 · utility
Assignee
Inventors
Key dates
| Filing date | May 7, 2020 |
| Grant date | Apr 19, 2022 |
| Priority date | — |
| Expiry date | Aug 4, 2040 |
Classification
- Technology area (CPC H): Electricity
- CPC primary: H04S2420/11
- WIPO field: Computer technology
- WIPO sector: Electrical engineering
Abstract
A computer system is trained to understand audio-visual spatial correspondence using audio-visual clips having multi-channel audio. The computer system includes an audio subnetwork, a video subnetwork, and a pretext subnetwork. The audio subnetwork receives the two channels of audio from the audio-visual clips, and the video subnetwork receives the video frames from the audio-visual clips. In a subset of the audio-visual clips, the audio-visual spatial relationship is misaligned, causing the audio-visual spatial cues for the audio and video to be incorrect. The audio subnetwork outputs an audio feature vector for each audio-visual clip, and the video subnetwork outputs a video feature vector for each audio-visual clip. The audio and video feature vectors for each audio-visual clip are merged and provided to the pretext subnetwork, which is configured to classify the merged vector as either having a misaligned audio-visual spatial relationship or not. The subnetworks are trained based on the loss calculated from the classification.
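The data flow described in the abstract can be illustrated with a minimal, standard-library Python sketch. It is not the patented implementation. Several details are assumptions made only for illustration: misalignment is produced by swapping the left and right audio channels (the abstract does not say how misaligned clips are created), the two subnetworks are replaced by hand-written feature functions, the pretext subnetwork is a single logistic unit, and the loss is binary cross-entropy. All names (`Clip`, `make_pretext_batch`, `audio_features`, `video_features`, `pretext_probability`, `bce_loss`) are hypothetical.

```python
import math
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Clip:
    left: List[float]          # left audio channel samples
    right: List[float]         # right audio channel samples
    frames: List[List[float]]  # toy video frames: one row of pixel intensities each


def make_pretext_batch(clips: List[Clip], misaligned_fraction: float = 0.5,
                       seed: int = 0) -> Tuple[List[Clip], List[int]]:
    """Misalign a random subset of clips; label 1 = misaligned, 0 = aligned.

    Assumption: misalignment is simulated by swapping the two audio channels.
    """
    rng = random.Random(seed)
    n_flip = round(len(clips) * misaligned_fraction)
    flip_ids = set(rng.sample(range(len(clips)), n_flip))
    batch, labels = [], []
    for i, clip in enumerate(clips):
        if i in flip_ids:
            batch.append(Clip(left=clip.right, right=clip.left, frames=clip.frames))
            labels.append(1)
        else:
            batch.append(clip)
            labels.append(0)
    return batch, labels


def audio_features(clip: Clip) -> List[float]:
    """Stand-in for the audio subnetwork: mean energy of each channel."""
    def energy(channel: List[float]) -> float:
        return sum(x * x for x in channel) / len(channel)
    return [energy(clip.left), energy(clip.right)]


def video_features(clip: Clip) -> List[float]:
    """Stand-in for the video subnetwork: mean brightness of each frame half."""
    half = len(clip.frames[0]) // 2
    n = len(clip.frames) * half
    left = sum(sum(frame[:half]) for frame in clip.frames) / n
    right = sum(sum(frame[half:]) for frame in clip.frames) / n
    return [left, right]


def pretext_probability(merged: List[float], weights: List[float],
                        bias: float) -> float:
    """Stand-in for the pretext subnetwork: probability of misalignment."""
    z = sum(w * x for w, x in zip(weights, merged)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def bce_loss(probs: List[float], labels: List[int]) -> float:
    """Binary cross-entropy averaged over the batch (an assumed loss choice)."""
    eps = 1e-12  # guards against log(0)
    total = sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                for p, y in zip(probs, labels))
    return -total / len(labels)


if __name__ == "__main__":
    # Toy clips: sound and brightness are both concentrated on the left.
    clips = [Clip([1.0, 1.0], [0.0, 0.0], [[1.0, 1.0, 0.0, 0.0]]) for _ in range(4)]
    batch, labels = make_pretext_batch(clips)
    # Merge the audio and video feature vectors by concatenation.
    merged = [audio_features(c) + video_features(c) for c in batch]
    probs = [pretext_probability(m, [0.0] * 4, 0.0) for m in merged]
    print(labels, bce_loss(probs, labels))
```

In a real system the three stand-ins would be trainable networks, and the gradient of the classification loss would update all of them jointly, as the last sentence of the abstract states. The sketch only shows how labels arise for free from the misalignment step and how the merged vector feeds the classifier.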
Source: USPTO / EPO open patent data. Objective bibliographic data and citation counts.