Patent · US Active

Video synthesis via multimodal conditioning

US12375766B2 · kind B2 · utility

Cited by: 0
References: 0
Claims: 20
Family size: 0

Assignee

Inventors

Key dates

Filing date: Sep 30, 2022
Grant date: Jul 29, 2025
Priority date
Expiry date: Apr 18, 2043

Classification

  • Technology area (CPC H): Electricity
  • CPC primary: H04N21/23418
  • WIPO field: Audio-visual technology
  • WIPO sector: Electrical engineering

Abstract

A multimodal video generation framework (MMVID) benefits from text and images provided jointly or separately as input. A bidirectional transformer that takes multiple modalities as input is used with quantized representations of videos to predict a discrete video representation. A new video token trained with self-learning and an improved mask-prediction algorithm for sampling video tokens are used to improve video quality and consistency. Text augmentation is utilized to improve the robustness of the textual representation and the diversity of generated videos. The framework incorporates various visual modalities, such as segmentation masks, drawings, and partially occluded images. In addition, the MMVID extracts visual information as suggested by a textual prompt.
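
Illustrative sketch (not part of the patent record): the abstract mentions sampling video tokens by mask prediction but does not spell out the procedure. The Python below shows a generic, textbook-style mask-predict loop with a linear re-masking schedule, assuming nothing about the patent's improved variant. Every name in it (`MASK`, `toy_predict`, `mask_predict`, the linear schedule) is hypothetical, and `toy_predict` is a random stand-in for the bidirectional transformer rather than a real model.

```python
import random

MASK = -1  # hypothetical sentinel marking a masked token position


def toy_predict(tokens, vocab_size, rng):
    """Hypothetical stand-in for a bidirectional transformer.

    Returns one (token_id, confidence) guess per position. A real model
    would condition on the unmasked tokens and on text/image inputs.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def mask_predict(num_tokens, vocab_size, steps, predict=toy_predict, seed=0):
    """Generic mask-predict sampling (assumed baseline, not the patented variant)."""
    rng = random.Random(seed)
    tokens = [MASK] * num_tokens       # start fully masked
    confidence = [0.0] * num_tokens
    for step in range(1, steps + 1):
        guesses = predict(tokens, vocab_size, rng)
        # Fill only the positions that are currently masked.
        for i, (tok, conf) in enumerate(guesses):
            if tokens[i] == MASK:
                tokens[i], confidence[i] = tok, conf
        # Linear schedule: re-mask fewer positions each step, none at the end.
        num_to_mask = num_tokens * (steps - step) // steps
        lowest = sorted(range(num_tokens), key=lambda i: confidence[i])[:num_to_mask]
        for i in lowest:
            tokens[i] = MASK
            confidence[i] = 0.0
    return tokens


if __name__ == "__main__":
    # 16 token positions, a codebook of 8 entries, 4 refinement steps.
    print(mask_predict(num_tokens=16, vocab_size=8, steps=4))
```

In this sketch the least confident predictions are re-masked and predicted again on the next pass, so later passes refine a partially filled sequence; the final pass re-masks nothing, which guarantees a complete token sequence.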

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.