Patent · US Active

Systems and methods for multi-speaker neural text-to-speech

US10896669B2 · kind B2 · utility

Cited by	7
References	6
Claims	18
Family size	0

Assignee

BAIDU USA LLC · US

Inventors

Sercan Omer Arik · San Francisco, US
Gregory Diamos · San Jose, US
Andrew Gibiansky · Mountain View, US
John Miller · Redmond, US
Kainan Peng · Sunnyvale, US
Wei Ping · Sunnyvale, US
Jonathan Raiman · Palo Alto, US
Yanqi Zhou · San Jose, US

Key dates

Filing date	May 8, 2018
Grant date	Jan 19, 2021
Priority date	—
Expiry date	Sep 21, 2038

Classification

Technology area (CPC G)	Physics
CPC primary	G10L25/30
WIPO field	Computer technology
WIPO sector	Electrical engineering

Abstract

Described herein are systems and methods for augmenting neural speech synthesis networks with low-dimensional trainable speaker embeddings in order to generate speech from different voices from a single model. As a starting point for multi-speaker experiments, improved single-speaker model embodiments, which may be referred to generally as Deep Voice 2 embodiments, were developed, as well as a post-processing neural vocoder for Tacotron (a neural character-to-spectrogram model). New techniques for multi-speaker speech synthesis were performed for both Deep Voice 2 and Tacotron embodiments on two multi-speaker TTS datasets—showing that neural text-to-speech systems can learn hundreds of unique voices from twenty-five minutes of audio per speaker.
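The core idea in the abstract, a single model that switches voices through a low-dimensional trainable speaker embedding, can be illustrated with a minimal sketch. This is not the patented implementation: the sizes, the names (`speaker_table`, `W_proj`, `condition_hidden`), and the choice to add a projected embedding to hidden activations are all assumptions made for illustration, and the parameters are randomly initialized rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are hypothetical; the abstract only says "low-dimensional".
NUM_SPEAKERS = 4
EMBED_DIM = 16
HIDDEN_DIM = 32

# One vector per speaker. In a real system these would be trained jointly
# with the rest of the network; here they are just random initial values.
speaker_table = rng.uniform(-0.1, 0.1, size=(NUM_SPEAKERS, EMBED_DIM))

# Assumed projection that maps the shared embedding to the width of one
# layer, so the same embedding could condition several sites in a model.
W_proj = rng.normal(0.0, 0.1, size=(EMBED_DIM, HIDDEN_DIM))
b_proj = np.zeros(HIDDEN_DIM)


def speaker_conditioning(speaker_id: int) -> np.ndarray:
    """Look up a speaker's embedding and project it to the hidden width."""
    embedding = speaker_table[speaker_id]
    return np.tanh(embedding @ W_proj + b_proj)


def condition_hidden(hidden: np.ndarray, speaker_id: int) -> np.ndarray:
    """Add the speaker vector to every time step of a (time, hidden) array.

    Additive conditioning is an assumption of this sketch, not a detail
    taken from the abstract.
    """
    return hidden + speaker_conditioning(speaker_id)


# Same input activations, two different speakers.
hidden = np.zeros((5, HIDDEN_DIM))
out_speaker_0 = condition_hidden(hidden, 0)
out_speaker_1 = condition_hidden(hidden, 1)
```

In this sketch the network weights are shared across speakers and only the row selected from `speaker_table` changes, which is what would let one model produce many voices.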

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.