Patent · US Active

Unsupervised parallel tacotron non-autoregressive and controllable text-to-speech

US12249315B2 · kind B2 · utility

Cited by	0
References	1
Claims	20
Family size	0

Assignee

Google LLC · US

Inventors

Isaac Elias · Mountain View, US
Byungha Chun · Warrington, GB
Jonathan Shen · Mountain View, US
Ye Jia · Santa Clara, US
Yu Zhang · Mountain View, US
Yonghui Wu · Fremont, US

Key dates

Filing date	Oct 31, 2023
Grant date	Mar 11, 2025
Priority date	—
Expiry date	Oct 31, 2043

Classification

Technology area (CPC G)	Physics
CPC primary	G10L25/30
WIPO field	Computer technology
WIPO sector	Electrical engineering

Abstract

A method for training a non-autoregressive TTS model includes obtaining a sequence representation of an encoded text sequence concatenated with a variational embedding. The method also includes using a duration model network to predict a phoneme duration for each phoneme represented by the encoded text sequence. Based on the predicted phoneme durations, the method also includes learning an interval representation and an auxiliary attention context representation. The method also includes upsampling, using the interval representation and the auxiliary attention context representation, the sequence representation into an upsampled output specifying a number of frames. The method also includes generating, based on the upsampled output, one or more predicted mel-frequency spectrogram sequences for the encoded text sequence. The method also includes determining a final spectrogram loss based on the predicted mel-frequency spectrogram sequences and a reference mel-frequency spectrogram sequence and training the TTS model based on the final spectrogram loss.
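The duration-driven upsampling and loss steps in the abstract can be illustrated with a minimal NumPy sketch. It is not the patented implementation. The function names, the `sharpness` parameter, the choice of frame centers, and the L1 loss are all assumptions made for illustration. A fixed scoring rule stands in for the learned interval representation, and the variational embedding, the auxiliary attention context representation, and the spectrogram decoder are omitted.

```python
import numpy as np


def interval_representation(durations):
    """Signed distances from each output frame to each token's boundaries.

    durations: per-token durations in frames, shape (K,).
    Returns two (T, K) arrays: frame minus token start, token end minus frame.
    """
    durations = np.asarray(durations, dtype=float)
    ends = np.cumsum(durations)            # token end boundaries
    starts = ends - durations              # token start boundaries
    num_frames = int(round(ends[-1]))      # total number of output frames T
    # Assumption: each frame is represented by its center (0.5, 1.5, ...).
    frames = (np.arange(num_frames) + 0.5)[:, None]
    start_dist = frames - starts[None, :]
    end_dist = ends[None, :] - frames
    return start_dist, end_dist


def upsample(sequence, durations, sharpness=10.0):
    """Upsample a (K, D) token sequence to (T, D) frames.

    Hypothetical stand-in for the learned upsampling: a frame is penalized
    in proportion to how far it lies outside a token's [start, end] interval,
    and a softmax over tokens turns the penalties into mixing weights.
    """
    sequence = np.asarray(sequence, dtype=float)
    start_dist, end_dist = interval_representation(durations)
    outside = np.maximum(-start_dist, 0.0) + np.maximum(-end_dist, 0.0)
    logits = -sharpness * outside
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ sequence, weights


def spectrogram_l1_loss(predicted, reference):
    """Mean absolute error between two spectrograms of equal shape.

    Assumption: the abstract does not say which distance the loss uses.
    """
    predicted = np.asarray(predicted, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return float(np.mean(np.abs(predicted - reference)))


if __name__ == "__main__":
    tokens = np.array([[1.0, 0.0], [0.0, 1.0]])   # two tokens, D = 2
    frames, weights = upsample(tokens, durations=[2, 3])
    print(frames.shape)                            # five frames, two channels
    print(spectrogram_l1_loss(frames, frames))
```

In a trained model the durations would come from the duration model network and the mixing weights from learned layers, so gradients of the spectrogram loss could flow back through the upsampling step. The fixed rule here only shows the shape of the computation.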

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.