Systems and methods for neural text-to-speech using convolutional sequence learning
US10796686B2 · kind B2 · utility
Assignee
Inventors
Key dates
| Filing date | Aug 8, 2018 |
| Grant date | Oct 6, 2020 |
| Priority date | — |
| Expiry date | Oct 2, 2038 |
Classification
- Technology area (CPC G): Physics
- CPC primary: G10L13/047
- WIPO field: Computer technology
- WIPO sector: Electrical engineering
Abstract
Described herein are embodiments of a fully-convolutional attention-based neural text-to-speech (TTS) system, various embodiments of which may generally be referred to as Deep Voice 3. Embodiments of Deep Voice 3 match state-of-the-art neural speech synthesis systems in naturalness while training ten times faster. Deep Voice 3 embodiments were scaled to data set sizes unprecedented for TTS, training on more than eight hundred hours of audio from over two thousand speakers. In addition, common error modes of attention-based speech synthesis networks were identified and mitigated, and several different waveform synthesis methods were compared. Also presented are embodiments that describe how to scale inference to ten million queries per day on a single-GPU server.
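To put the inference figure in the abstract in more familiar units, the short sketch below converts ten million queries per day into an average per-second rate. It assumes load is spread uniformly over 24 hours, which the abstract does not state; the function name `required_qps` is hypothetical and chosen only for illustration.

```python
# Convert a daily query volume into an average queries-per-second rate.
# Assumption (not from the patent): queries arrive uniformly across the day.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def required_qps(queries_per_day: float) -> float:
    """Average queries per second needed to serve the given daily volume."""
    return queries_per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    qps = required_qps(10_000_000)
    print(f"{qps:.1f} queries per second")  # prints "115.7 queries per second"
```

Under that uniform-load assumption, the single-GPU server would need to sustain roughly 116 queries per second on average; real traffic with peaks would require more headroom.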
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.