Generating diverse and natural text-to-speech samples
US11475874B2 · kind B2 · utility
Assignee
Inventors
Key dates
| Filing date | Jan 29, 2021 |
| Grant date | Oct 18, 2022 |
| Priority date | — |
| Expiry date | Apr 18, 2041 |
Classification
- Technology area (CPC G): Physics
- CPC primary: G10L2015/0631
- WIPO field: Computer technology
- WIPO sector: Electrical engineering
Abstract
A method of generating diverse and natural text-to-speech (TTS) samples includes receiving a text and generating a speech sample based on the text using a TTS model. A training process trains the TTS model to generate the speech sample by receiving training samples. Each training sample includes a spectrogram and a training text corresponding to the spectrogram. For each training sample, the training process identifies speech units associated with the training text. For each speech unit, the training process generates a speech embedding, aligns the speech embedding with a portion of the spectrogram, extracts a latent feature from the aligned portion of the spectrogram, and assigns a quantized embedding to the latent feature. The training process generates the speech sample by decoding a concatenation of the speech embeddings and the quantized embeddings for the speech units associated with the training text corresponding to the spectrogram.
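The per-speech-unit steps in the abstract (align, extract a latent feature, assign a quantized embedding, concatenate) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the patent: the helper names (`mean_pool`, `nearest_codebook_entry`, `build_decoder_inputs`), the use of mean pooling as the latent-feature extractor, the nearest-neighbour Euclidean lookup in a fixed codebook as the quantizer, and the toy data. The alignment is supplied as precomputed frame ranges, and the decoder itself is omitted.

```python
import math


def mean_pool(frames):
    """Hypothetical latent-feature extractor: average the aligned spectrogram frames."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def nearest_codebook_entry(latent, codebook):
    """Return (index, entry) of the codebook vector closest to `latent` (Euclidean distance)."""
    best = min(range(len(codebook)), key=lambda i: math.dist(latent, codebook[i]))
    return best, codebook[best]


def build_decoder_inputs(speech_embeddings, spectrogram, alignment, codebook):
    """For each speech unit, concatenate its speech embedding with the quantized
    embedding assigned to the latent feature of its aligned spectrogram portion."""
    inputs = []
    for embedding, (start, end) in zip(speech_embeddings, alignment):
        latent = mean_pool(spectrogram[start:end])      # latent feature (assumed: mean pooling)
        _, quantized = nearest_codebook_entry(latent, codebook)  # assumed: nearest-neighbour VQ
        inputs.append(list(embedding) + list(quantized))
    return inputs


# Toy data (all values invented for illustration).
speech_embeddings = [[1.0, 0.0], [0.0, 1.0]]            # one embedding per speech unit
spectrogram = [[0.1, 0.2], [0.3, 0.2],                  # frames aligned to unit 0
               [0.9, 1.1], [1.1, 0.9]]                  # frames aligned to unit 1
alignment = [(0, 2), (2, 4)]                            # frame range per speech unit
codebook = [[0.0, 0.0], [1.0, 1.0]]                     # quantized embedding table

decoder_inputs = build_decoder_inputs(speech_embeddings, spectrogram, alignment, codebook)
for row in decoder_inputs:
    print(row)
```

Each row of `decoder_inputs` is the concatenation for one speech unit; in a trained system a decoder would consume these rows to produce the speech sample, and the embeddings, extractor, and codebook would be learned rather than fixed.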
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.