Parallel Tacotron non-autoregressive and controllable TTS
US11908448B2 · kind B2 · utility
Assignee
Inventors
Key dates
| Event | Date |
| --- | --- |
| Filing date | May 21, 2021 |
| Grant date | Feb 20, 2024 |
| Priority date | — |
| Expiry date | Jan 14, 2042 |
Classification
- Technology area (CPC G): Physics
- CPC primary: G06N3/048
- WIPO field: Computer technology
- WIPO sector: Electrical engineering
Abstract
A method for training a non-autoregressive TTS model includes receiving training data that includes a reference audio signal and a corresponding input text sequence. The method also includes encoding the reference audio signal into a variational embedding that disentangles the style/prosody information from the reference audio signal and encoding the input text sequence into an encoded text sequence. The method also includes predicting a phoneme duration for each phoneme in the input text sequence and determining a phoneme duration loss based on the predicted phoneme durations and a reference phoneme duration. The method also includes generating one or more predicted mel-frequency spectrogram sequences for the input text sequence and determining a final spectrogram loss based on the predicted mel-frequency spectrogram sequences and a reference mel-frequency spectrogram sequence. The method also includes training the TTS model based on the final spectrogram loss and the corresponding phoneme duration loss.
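The training objective described in the abstract combines a final spectrogram loss with a phoneme duration loss. The following is a minimal illustrative sketch of that combination in plain Python, not the patented implementation. The abstract does not state which distance measures or weights are used, so the choices here are assumptions: mean absolute error for spectrograms, mean squared error for durations, a simple sum over the predicted spectrogram sequences, and a hypothetical `duration_weight` parameter. All function names are likewise hypothetical.

```python
from typing import List, Sequence

Spectrogram = List[List[float]]  # frames x mel channels


def duration_loss(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Mean squared error over per-phoneme durations (assumed loss form)."""
    if len(predicted) != len(reference) or len(predicted) == 0:
        raise ValueError("duration sequences must be non-empty and equal length")
    return sum((p - r) ** 2 for p, r in zip(predicted, reference)) / len(predicted)


def spectrogram_l1(predicted: Spectrogram, reference: Spectrogram) -> float:
    """Mean absolute error between two mel spectrograms (assumed loss form)."""
    if len(predicted) != len(reference) or len(predicted) == 0:
        raise ValueError("spectrograms must be non-empty with equal frame counts")
    total, count = 0.0, 0
    for pred_frame, ref_frame in zip(predicted, reference):
        if len(pred_frame) != len(ref_frame):
            raise ValueError("frames must have equal channel counts")
        for p, r in zip(pred_frame, ref_frame):
            total += abs(p - r)
            count += 1
    return total / count


def final_spectrogram_loss(
    predictions: Sequence[Spectrogram], reference: Spectrogram
) -> float:
    """Sum the per-sequence loss over one or more predicted spectrograms."""
    return sum(spectrogram_l1(pred, reference) for pred in predictions)


def total_loss(
    predictions: Sequence[Spectrogram],
    reference: Spectrogram,
    predicted_durations: Sequence[float],
    reference_durations: Sequence[float],
    duration_weight: float = 1.0,  # hypothetical weighting, not from the source
) -> float:
    """Combine the final spectrogram loss and the phoneme duration loss."""
    spec = final_spectrogram_loss(predictions, reference)
    dur = duration_loss(predicted_durations, reference_durations)
    return spec + duration_weight * dur


# Toy example: two predicted spectrogram sequences, two phonemes.
reference_mel = [[0.0, 1.0], [2.0, 3.0]]
predicted_mels = [
    [[0.5, 1.0], [2.0, 3.0]],  # differs from the reference in one bin
    [[0.0, 1.0], [2.0, 3.0]],  # matches the reference exactly
]
loss = total_loss(predicted_mels, reference_mel, [2.0, 4.0], [3.0, 4.0])
```

In a real system these quantities would be computed on tensors by an automatic-differentiation framework so that the combined loss can drive gradient updates; the pure-Python version above only illustrates how the two loss terms relate.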
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.