Patent · US Active

Personalized speech-to-video with three-dimensional (3D) skeleton regularization and expressive body poses

US11514634B2 · kind B2 · utility

Cited by	2
References	3
Claims	24
Family size	0

Assignees

BAIDU USA LLC · US
BAIDU.COM TIMES TECHNOLOGY (BEIJING) CO., LTD. · CN

Inventors

Miao Liao · Camas, US
Sibo Zhang · San Jose, US
Peng Wang · Meekrap-oord, CN
Ruigang Yang · Beijing, CN

Key dates

Filing date	Jun 12, 2020
Grant date	Nov 29, 2022
Priority date	—
Expiry date	Jun 12, 2040

Classification

Technology area (CPC G)	Physics
CPC primary	G10L25/30
WIPO field	Computer technology
WIPO sector	Electrical engineering

Abstract

Presented herein are novel embodiments for converting a given speech audio or text into a photo-realistic speaking video of a person with synchronized, realistic, and expressive body dynamics. In one or more embodiments, 3D skeleton movements are generated from the audio sequence using a recurrent neural network, and an output video is synthesized via a conditional generative adversarial network (GAN). To make movements realistic and expressive, the knowledge of an articulated 3D human skeleton and a learned dictionary of personal speech iconic gestures may be embedded into the generation process in both learning and testing pipelines. The former prevents the generation of unreasonable body distortion, while the latter helps the model quickly learn meaningful body movement from a few videos. To produce photo-realistic and high-resolution video with motion details, a part-attention mechanism is inserted in the conditional GAN, where each detailed part is automatically zoomed in on and given its own discriminator.
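
The abstract states that knowledge of an articulated 3D skeleton prevents unreasonable body distortion but does not say how the constraint is enforced. One common way to obtain that property, shown here purely as an illustrative assumption and not as the patented method, is to have the motion model predict bone orientations rather than raw joint coordinates, so that bone lengths stay fixed by construction. The chain, the lengths in `BONE_LENGTHS`, and the function names below are hypothetical.

```python
import math

# Hypothetical 3-bone arm chain (shoulder -> elbow -> wrist -> hand tip).
# Lengths in metres are illustrative values, not taken from the patent.
BONE_LENGTHS = [0.30, 0.25, 0.08]


def bone_direction(azimuth, elevation):
    """Unit vector for a bone from two world-frame angles (radians)."""
    return (
        math.cos(elevation) * math.cos(azimuth),
        math.cos(elevation) * math.sin(azimuth),
        math.sin(elevation),
    )


def joints_from_angles(angles, root=(0.0, 0.0, 0.0)):
    """Convert per-bone (azimuth, elevation) pairs into 3D joint positions.

    Each joint is the previous joint plus a fixed-length step along the
    bone direction, so no choice of angles can stretch or shrink a bone.
    """
    joints = [root]
    for length, (azimuth, elevation) in zip(BONE_LENGTHS, angles):
        dx, dy, dz = bone_direction(azimuth, elevation)
        x, y, z = joints[-1]
        joints.append((x + length * dx, y + length * dy, z + length * dz))
    return joints


def bone_lengths(joints):
    """Euclidean distance between each pair of consecutive joints."""
    return [math.dist(a, b) for a, b in zip(joints, joints[1:])]


if __name__ == "__main__":
    # Arbitrary angles, standing in for the output of a motion model.
    pose = joints_from_angles([(0.3, -0.2), (1.1, 0.4), (-2.0, 0.9)])
    print(bone_lengths(pose))
```

In this sketch a sequence model would output the angle pairs for each frame; whatever values it produces, the recovered skeleton keeps the same bone lengths, which is one simple way to rule out stretched or collapsed limbs.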

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.