Patent · US Active

Video synthesis via multimodal conditioning

US12375766B2 · kind B2 · utility

Cited by	0

References	0

Claims	20

Family size	0

Assignee

SNAP INC. · US

Inventors

Francesco Barbieri · Santa Monica, US
Ligong Han · Edison, US
Hsin-Ying Lee · San Jose, US
Shervin Minaee · Bellevue, US
Kyle Olszewski · Los Angeles, US
Jian Ren · Los Serranos, US
Sergey Tulyakov · Santa Monica, US

Key dates

Filing date	Sep 30, 2022
Grant date	Jul 29, 2025
Priority date	—
Expiry date	Apr 18, 2043

Classification

Technology area (CPC H)	Electricity
CPC primary	H04N21/23418
WIPO field	Audio-visual technology
WIPO sector	Electrical engineering

Abstract

A multimodal video generation framework (MMVID) that benefits from text and images provided jointly or separately as input. Quantized representations of videos are utilized with a bidirectional transformer with multiple modalities as inputs to predict a discrete video representation. A new video token trained with self-learning and an improved mask-prediction algorithm for sampling video tokens are used to improve video quality and consistency. Text augmentation is utilized to improve the robustness of the textual representation and diversity of generated videos. The framework incorporates various visual modalities, such as segmentation masks, drawings, and partially occluded images. In addition, the MMVID extracts visual information as suggested by a textual prompt.
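
The abstract mentions sampling discrete video tokens with a mask-prediction algorithm but does not spell out the procedure. As a rough illustration only, the sketch below shows a generic mask-predict loop of the kind commonly used with bidirectional transformers: start fully masked, predict every masked position, then re-mask the least confident positions on a linearly shrinking schedule. It is not the patent's improved algorithm. The names `MASK`, `toy_predictor`, and `mask_predict`, the linear schedule, and the random stand-in for the transformer are all assumptions made for this example.

```python
import random

MASK = -1  # hypothetical sentinel marking a masked token position


def toy_predictor(tokens, vocab_size, rng):
    """Stand-in for a bidirectional transformer (assumption, not the real model).

    Returns one (token_id, confidence) pair per position, drawn at random.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def mask_predict(seq_len, vocab_size, steps, predictor=toy_predictor, seed=0):
    """Generic iterative mask-predict sampling over a discrete token sequence."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    rng = random.Random(seed)
    tokens = [MASK] * seq_len        # begin with every position masked
    confidences = [0.0] * seq_len

    for step in range(steps):
        preds = predictor(tokens, vocab_size, rng)
        # Fill only the positions that are currently masked.
        for i, (tok, conf) in enumerate(preds):
            if tokens[i] == MASK:
                tokens[i] = tok
                confidences[i] = conf

        # Linearly decaying number of positions to re-mask (assumed schedule).
        n_mask = seq_len * (steps - step - 1) // steps
        if n_mask == 0:
            break

        # Re-mask the lowest-confidence positions so they are re-predicted.
        lowest = sorted(range(seq_len), key=lambda i: confidences[i])[:n_mask]
        for i in lowest:
            tokens[i] = MASK
            confidences[i] = 0.0

    return tokens


if __name__ == "__main__":
    print(mask_predict(seq_len=16, vocab_size=1024, steps=4))
```

In a real system the predictor would be conditioned on the text and image inputs and the token ids would index a learned codebook for the quantized video representation; here both are replaced by placeholders so the sampling loop can run on its own.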

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.