Localization of narrations in image data
US12118787B2 · kind B2 · utility
Assignee
Inventors
Key dates
| Event | Date |
| --- | --- |
| Filing date | Oct 12, 2021 |
| Grant date | Oct 15, 2024 |
| Priority date | — |
| Expiry date | Oct 17, 2042 |
Classification
- Technology area (CPC G): Physics
- CPC primary: G10L25/54
- WIPO field: Computer technology
- WIPO sector: Electrical engineering
Abstract
Methods, systems, and computer storage media are provided for multi-modal localization. Input data comprising two modalities, such as image data and corresponding text or audio data, may be received. A phrase may be extracted from the text or audio data, and a neural network system may be utilized to spatially and temporally localize the phrase within the image data. The neural network system may include a plurality of cross-modal attention layers that each compare features across the first and second modalities without comparing features of the same modality. Using the cross-modal attention layers, a region or subset of pixels within one or more frames of the image data may be identified as corresponding to the phrase, and a localization indicator may be presented for display with the image data. Embodiments may also include unsupervised training of the neural network system.
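The cross-modal attention idea in the abstract can be illustrated with a minimal sketch. It is not the patented network: it assumes a single scaled dot-product attention step, one frame, and hand-made toy feature vectors. The function names (`cross_modal_attention`, `localize`) and all numbers are hypothetical. Queries are drawn only from the phrase and keys only from the image regions, so no score compares two features of the same modality.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(phrase_feats, region_feats):
    """Attend from each phrase token to every image region.

    Queries come only from the phrase and keys only from the image,
    so every score is a text-to-image comparison (assumed design).
    Returns one attention distribution over regions per phrase token.
    """
    dim = len(region_feats[0])
    scale = 1.0 / math.sqrt(dim)
    weights = []
    for query in phrase_feats:
        scores = [
            scale * sum(q * k for q, k in zip(query, key))
            for key in region_feats
        ]
        weights.append(softmax(scores))
    return weights


def localize(phrase_feats, region_feats):
    """Return the region index with the highest mean attention, plus the means."""
    weights = cross_modal_attention(phrase_feats, region_feats)
    n_regions = len(region_feats)
    mean = [
        sum(row[r] for row in weights) / len(weights)
        for r in range(n_regions)
    ]
    best = max(range(n_regions), key=mean.__getitem__)
    return best, mean


# Hypothetical toy features: 2 phrase tokens, 3 image regions, dimension 2.
phrase = [[1.0, 0.0], [0.9, 0.1]]
regions = [[0.0, 1.0], [2.0, 0.0], [0.5, 0.5]]
best_region, mean_attention = localize(phrase, regions)
```

In this toy setup `best_region` would index the region a localization indicator is drawn on. Extending the sketch to temporal localization would mean pooling regions from several frames into `region_feats` and tracking which frame each region came from.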
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.