Patent · US Active

Systems and methods for real-time neural text-to-speech

US10872598B2 · kind B2 · utility

Cited by	7
References	6
Claims	20
Family size	0

Assignee

BAIDU USA LLC · US

Inventors

Sercan Omer Arik · San Francisco, US
Mike Chrzanowski · Sunnyvale, US
Adam Coates · Mountain View, US
Gregory Diamos · San Jose, US
Andrew Gibiansky · Mountain View, US
John Miller · Redmond, US
Andrew Yan-Tak Ng · Palo Alto, US
Jonathan Raiman · Palo Alto, US
Shubhabrata Sengupta · Menlo Park, US
Mohammad Shoeybi · San Mateo, US

Key dates

Filing date	Jan 29, 2018
Grant date	Dec 22, 2020
Priority date	—
Expiry date	Jun 19, 2038

Classification

Technology area (CPC G)	Physics
CPC primary	G06N3/047
WIPO field	Computer technology
WIPO sector	Electrical engineering

Abstract

Embodiments of a production-quality text-to-speech (TTS) system constructed from deep neural networks are described. System embodiments comprise five major building blocks: a segmentation model for locating phoneme boundaries, a grapheme-to-phoneme conversion model, a phoneme duration prediction model, a fundamental frequency prediction model, and an audio synthesis model. For embodiments of the segmentation model, phoneme boundary detection was performed with deep neural networks using Connectionist Temporal Classification (CTC) loss. For embodiments of the audio synthesis model, a variant of WaveNet was created that requires fewer parameters and trains faster than the original. By using a neural network for each component, system embodiments are simpler and more flexible than traditional TTS systems, where each component requires laborious feature engineering and extensive domain expertise. Inference with system embodiments may be performed faster than real time.
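
As a rough illustration of how the building blocks named in the abstract might chain together at synthesis time, the sketch below wires toy stand-ins for the grapheme-to-phoneme, duration, fundamental frequency, and audio synthesis stages into one function. Every function name, constant, and the sine-wave "vocoder" here is hypothetical and is not taken from the patent; each stand-in marks the place where a trained neural network would sit. The segmentation model is left out on the assumption that phoneme boundaries are needed to label training data rather than at inference. The last helper shows one common way to express "faster than real time": a real-time factor below 1.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class PhonemeFeatures:
    phoneme: str
    duration_s: float  # predicted duration in seconds
    f0_hz: float       # predicted fundamental frequency in Hz


def grapheme_to_phoneme(text: str) -> List[str]:
    # Toy stand-in (assumption): treat each letter as one "phoneme".
    return [ch.upper() for ch in text if ch.isalpha()]


def predict_durations(phonemes: List[str]) -> List[float]:
    # Toy stand-in (assumption): constant 80 ms per phoneme.
    return [0.08 for _ in phonemes]


def predict_f0(phonemes: List[str]) -> List[float]:
    # Toy stand-in (assumption): constant 120 Hz pitch.
    return [120.0 for _ in phonemes]


def synthesize_audio(features: List[PhonemeFeatures], sample_rate: int) -> List[float]:
    # Toy stand-in for the audio synthesis model: a sine wave at each
    # phoneme's F0, lasting for that phoneme's predicted duration.
    samples: List[float] = []
    for feat in features:
        n = int(round(feat.duration_s * sample_rate))
        samples.extend(
            math.sin(2.0 * math.pi * feat.f0_hz * i / sample_rate) for i in range(n)
        )
    return samples


def text_to_speech(text: str, sample_rate: int = 16000) -> List[float]:
    # Chain the stages: text -> phonemes -> (duration, F0) -> waveform.
    phonemes = grapheme_to_phoneme(text)
    durations = predict_durations(phonemes)
    f0s = predict_f0(phonemes)
    features = [
        PhonemeFeatures(p, d, f) for p, d, f in zip(phonemes, durations, f0s)
    ]
    return synthesize_audio(features, sample_rate)


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    # Values below 1.0 mean audio is produced faster than it plays back.
    return synthesis_seconds / audio_seconds
```

Calling `text_to_speech("hi")` with these stand-ins returns a list of float samples covering two 80 ms segments at 16 kHz. Keeping each stage behind its own function mirrors the abstract's point that each component can be a separately trained network.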

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.