Patent · US Active

Localization of narrations in image data

US12118787B2 · kind B2 · utility

Cited by: 1
References: 2
Claims: 19
Family size: 0

Assignee

Inventors

Key dates

Filing date: Oct 12, 2021
Grant date: Oct 15, 2024
Priority date
Expiry date: Oct 17, 2042

Classification

  • Technology area (CPC G): Physics
  • CPC primary: G10L25/54
  • WIPO field: Computer technology
  • WIPO sector: Electrical engineering

Abstract

Methods, systems, and computer storage media are provided for multi-modal localization. Input data comprising two modalities, such as image data and corresponding text or audio data, may be received. A phrase may be extracted from the text or audio data, and a neural network system may be utilized to spatially and temporally localize the phrase within the image data. The neural network system may include a plurality of cross-modal attention layers that each compare features across the first and second modalities without comparing features of the same modality. Using the cross-modal attention layers, a region or subset of pixels within one or more frames of the image data may be identified as corresponding to the phrase, and a localization indicator may be presented for display with the image data. Embodiments may also include unsupervised training of the neural network system.
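
The cross-modal comparison described in the abstract can be illustrated with a minimal sketch. This is not the patented implementation: it assumes the phrase and the image have already been encoded as small feature vectors, uses plain scaled dot-product attention as a stand-in for the cross-modal attention layers, and the function names (cross_modal_attention, localize_phrase) are hypothetical. Each word attends only over image regions, never over other words, and the region with the highest average attention is taken as the localization.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_modal_attention(word_feats, region_feats):
    """Attention of each word over image regions only.

    Words are compared with regions (across modalities); no word-to-word
    or region-to-region scores are computed. Assumption: scaled
    dot-product scoring, which the source does not specify.
    """
    scale = math.sqrt(len(region_feats[0]))
    return [
        softmax([dot(w, r) / scale for r in region_feats])
        for w in word_feats
    ]


def localize_phrase(word_feats, region_feats):
    """Return (index of best-matching region, per-region heat map)."""
    attn = cross_modal_attention(word_feats, region_feats)
    n_regions = len(region_feats)
    # Average the per-word attention rows into one heat map over regions.
    heat = [sum(row[j] for row in attn) / len(attn) for j in range(n_regions)]
    best = max(range(n_regions), key=heat.__getitem__)
    return best, heat


# Toy, made-up features: three image regions and a two-word phrase.
regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
phrase = [[0.0, 2.0], [0.0, 3.0]]
best_region, heat_map = localize_phrase(phrase, regions)
```

In this toy setup both words align most strongly with the second region, so best_region is 1; a display layer could then draw the localization indicator over that region. Extending the sketch to video would mean treating regions from several frames as the attended set, so that the argmax also selects a frame.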

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.