Spatial-temporal reasoning through pretrained language models for video-grounded dialogues
US11487999B2 · kind B2 · utility
Assignee
Inventors
Key dates
| Filing date | Apr 28, 2020 |
| Grant date | Nov 1, 2022 |
| Priority date | — |
| Expiry date | Apr 29, 2041 |
Classification
- Technology area (CPC G): Physics
- CPC primary: G06F40/30
- WIPO field: Audio-visual technology
- WIPO sector: Electrical engineering
Abstract
A system and method for generating a response in a video-grounded dialogue are provided. A video-grounded dialogue neural network language model receives video input and text input. The text input includes a dialogue history between the model and a human user and a current utterance by the user. Encoded video input is generated using video encoding layers. Encoded text input is generated using text encoding layers. The encoded video input and the encoded text input are concatenated into a single input sequence. A generative pre-trained transformer model generates the response to the current utterance from the single input sequence.
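The encode-then-concatenate step in the abstract can be illustrated with a minimal sketch. This is not the patented implementation: the function names, the dimensions, the random weights, and the ordering (video first, then dialogue history, then current utterance) are all assumptions made for illustration, and the generative pre-trained transformer that would consume the sequence is omitted.

```python
import random

EMBED_DIM = 4  # hypothetical shared embedding size


def linear(vec, weights):
    """Project a feature vector with a weight matrix (one row per output dim)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def encode_video(frame_features, weights):
    """Map each per-frame feature vector into the shared embedding space."""
    return [linear(f, weights) for f in frame_features]


def encode_text(token_ids, embedding_table):
    """Look up an embedding vector for each token id."""
    return [embedding_table[t] for t in token_ids]


def build_input_sequence(video_emb, history_emb, utterance_emb):
    """Concatenate encoded video and encoded text into one sequence.

    The ordering used here is an assumption, not taken from the source.
    """
    return video_emb + history_emb + utterance_emb


rng = random.Random(0)
video_dim, vocab_size = 6, 10  # hypothetical sizes

# Randomly initialised stand-ins for learned encoder parameters.
weights = [[rng.uniform(-1, 1) for _ in range(video_dim)] for _ in range(EMBED_DIM)]
table = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(vocab_size)]

frames = [[rng.random() for _ in range(video_dim)] for _ in range(3)]  # 3 frames
history_ids = [1, 4, 2]   # tokens of the dialogue history
utterance_ids = [7, 3]    # tokens of the current user utterance

sequence = build_input_sequence(
    encode_video(frames, weights),
    encode_text(history_ids, table),
    encode_text(utterance_ids, table),
)
print(len(sequence), len(sequence[0]))  # 8 4
```

Because both encoders emit vectors of the same width, the video and text positions can share a single sequence, which is what lets one transformer attend across both modalities.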
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.