Expressive text-to-speech utilizing contextual word-level style tokens
US11322133B2 · kind B2 · utility
Assignee
Inventors
Key dates
| Filing date | Jul 21, 2020 |
| Grant date | May 3, 2022 |
| Priority date | — |
| Expiry date | Jul 24, 2040 |
Classification
- Technology area (CPC G)Physics
- CPC primaryG10L25/30
- WIPO fieldComputer technology
- WIPO sectorElectrical engineering
Abstract
The present disclosure relates to systems, methods, and non-transitory computer-readable media that generate expressive audio for input texts based on a word-level analysis of the input text. For example, the disclosed systems can utilize a multi-channel neural network to generate a character-level feature vector and a word-level feature vector based on a plurality of characters of an input text and a plurality of words of the input text, respectively. In some embodiments, the disclosed systems utilize the neural network to generate the word-level feature vector based on contextual word-level style tokens that correspond to style features associated with the input text. Based on the character-level and word-level feature vectors, the disclosed systems can generate a context-based speech map. The disclosed systems can utilize the context-based speech map to generate expressive audio for the input text.
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.