Video synthesis via multimodal conditioning
US12375766B2 · kind B2 · utility
Assignee
Inventors
Key dates
| Filing date | Sep 30, 2022 |
| Grant date | Jul 29, 2025 |
| Priority date | — |
| Expiry date | Apr 18, 2043 |
Classification
- Technology area (CPC H): Electricity
- CPC primary: H04N21/23418
- WIPO field: Audio-visual technology
- WIPO sector: Electrical engineering
Abstract
A multimodal video generation framework (MMVID) benefits from text and images provided jointly or separately as input. Quantized representations of videos are used with a bidirectional transformer that takes multiple modalities as inputs to predict a discrete video representation. A new video token trained with self-learning and an improved mask-prediction algorithm for sampling video tokens are used to improve video quality and consistency. Text augmentation is used to improve the robustness of the textual representation and the diversity of generated videos. The framework incorporates various visual modalities, such as segmentation masks, drawings, and partially occluded images. In addition, the MMVID extracts visual information as suggested by a textual prompt.
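The abstract mentions sampling discrete video tokens with a mask-prediction algorithm but does not spell out its steps. As a rough illustration only, the sketch below shows the generic mask-predict idea: start fully masked, fill in the masked positions, then re-mask the least confident positions on a linear schedule. It is not the patent's improved variant. The names `MASK`, `toy_predictor`, and `mask_predict` are hypothetical, and the random predictor merely stands in for a conditioned bidirectional transformer.

```python
import random

MASK = -1  # hypothetical sentinel marking a masked token position


def toy_predictor(tokens, vocab_size, rng):
    """Stand-in for a bidirectional transformer (assumption).

    Returns one (token_id, confidence) pair per position. A real model
    would condition on the unmasked tokens plus text/image inputs.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def mask_predict(num_tokens, vocab_size, steps, predictor=toy_predictor, seed=0):
    """Generic iterative mask-predict decoding over discrete tokens."""
    rng = random.Random(seed)
    tokens = [MASK] * num_tokens
    confidence = [0.0] * num_tokens

    for t in range(1, steps + 1):
        predictions = predictor(tokens, vocab_size, rng)

        # Fill only the positions that are currently masked.
        for i, (token_id, prob) in enumerate(predictions):
            if tokens[i] == MASK:
                tokens[i] = token_id
                confidence[i] = prob

        # Linear schedule: fewer positions are re-masked each step,
        # reaching zero on the final step.
        n_mask = int(num_tokens * (steps - t) / steps)
        if n_mask == 0:
            break

        # Re-mask the n_mask least confident positions for the next pass.
        lowest = sorted(range(num_tokens), key=lambda i: confidence[i])[:n_mask]
        for i in lowest:
            tokens[i] = MASK

    return tokens


if __name__ == "__main__":
    # 16 token positions, a codebook of 1024 entries, 4 refinement steps.
    print(mask_predict(num_tokens=16, vocab_size=1024, steps=4))
```

In a real system the returned token ids would be decoded back into frames by the quantizer's decoder; that stage is omitted here.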
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.