Patent · US Active

Systems and methods for neural text-to-speech using convolutional sequence learning

US10796686B2 · kind B2 · utility

11 Cited by

6 References

20 Claims

0 Family size

Assignee

BAIDU USA LLC · US

Inventors

Sercan Omer Arik · San Francisco, US
Wei Ping · Sunnyvale, US
Kainan Peng · Sunnyvale, US
Sharan Narang · San Francisco, US
Ajay Kannan · Dublin, US
Andrew Gibiansky · Mountain View, US
Jonathan Raiman · Palo Alto, US
John Miller · Redmond, US

Key dates

Filing date	Aug 8, 2018
Grant date	Oct 6, 2020
Priority date	—
Expiry date	Oct 2, 2038

Classification

Technology area (CPC G)	Physics
CPC primary	G10L13/047
WIPO field	Computer technology
WIPO sector	Electrical engineering

Abstract

Described herein are embodiments of a fully-convolutional attention-based neural text-to-speech (TTS) system, various embodiments of which may generally be referred to as Deep Voice 3. Embodiments of Deep Voice 3 match state-of-the-art neural speech synthesis systems in naturalness while training ten times faster. Deep Voice 3 embodiments were scaled to data set sizes unprecedented for TTS, training on more than eight hundred hours of audio from over two thousand speakers. In addition, common error modes of attention-based speech synthesis networks were identified and mitigated, and several different waveform synthesis methods were compared. Also presented are embodiments that describe how to scale inference to ten million queries per day on one single-GPU server.
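
To put the ten-million-queries-per-day figure in perspective, the short Python sketch below converts it into an average sustained rate and a per-query time budget. It assumes queries arrive uniformly over 24 hours and are served one at a time, which the abstract does not state; real traffic is bursty and servers typically batch requests, so these are rough averages only. The function name `queries_per_second` is illustrative and does not come from the patent.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def queries_per_second(queries_per_day: float) -> float:
    """Average rate needed to serve a daily volume (assumes uniform load)."""
    return queries_per_day / SECONDS_PER_DAY


# Figure taken from the abstract: ten million queries per day.
rate = queries_per_second(10_000_000)

# Average time available per query if handled strictly one at a time
# (an assumption for illustration; batching would change this).
budget_ms = 1000.0 / rate

print(f"{rate:.1f} queries/s on average")     # 115.7 queries/s on average
print(f"{budget_ms:.2f} ms per query budget")  # 8.64 ms per query budget
```

Under these assumptions the single-GPU server would need to sustain roughly 116 queries per second on average.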

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.