Patent · US Active

Unsupervised parallel tacotron non-autoregressive and controllable text-to-speech

US12249315B2 · kind B2 · utility

Cited by: 0
References: 1
Claims: 20
Family size: 0

Assignee

Inventors

Key dates

Filing date: Oct 31, 2023
Grant date: Mar 11, 2025
Priority date
Expiry date: Oct 31, 2043

Classification

  • Technology area (CPC G): Physics
  • CPC primary: G10L25/30
  • WIPO field: Computer technology
  • WIPO sector: Electrical engineering

Abstract

A method for training a non-autoregressive TTS model includes obtaining a sequence representation of an encoded text sequence concatenated with a variational embedding. The method also includes using a duration model network to predict a phoneme duration for each phoneme represented by the encoded text sequence. Based on the predicted phoneme durations, the method also includes learning an interval representation and an auxiliary attention context representation. The method also includes upsampling, using the interval representation and the auxiliary attention context representation, the sequence representation into an upsampled output specifying a number of frames. The method also includes generating, based on the upsampled output, one or more predicted mel-frequency spectrogram sequences for the encoded text sequence. The method also includes determining a final spectrogram loss based on the predicted mel-frequency spectrogram sequences and a reference mel-frequency spectrogram sequence and training the TTS model based on the final spectrogram loss.
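
The data flow in the abstract (predicted durations, upsampling to a number of frames, spectrogram loss) can be illustrated with a minimal sketch. This is not the patented method. The abstract describes a learned upsampling that uses an interval representation and an auxiliary attention context representation, whereas the sketch below swaps in plain repetition of each token vector for its duration. It also assumes integer durations and an L1 (mean absolute error) loss, neither of which is specified in the abstract. All function names (`token_boundaries`, `upsample`, `l1_spectrogram_loss`) are hypothetical.

```python
from itertools import accumulate


def token_boundaries(durations):
    """Return the (start, end) frame index of each token.

    Assumption: durations are non-negative integers counted in frames.
    """
    ends = list(accumulate(durations))
    starts = [end - dur for end, dur in zip(ends, durations)]
    return list(zip(starts, ends))


def upsample(sequence, durations):
    """Repeat each token vector for its duration (hard length regulation).

    This stands in for the learned upsampling named in the abstract.
    """
    if len(sequence) != len(durations):
        raise ValueError("need exactly one duration per token")
    frames = []
    for vector, dur in zip(sequence, durations):
        frames.extend(list(vector) for _ in range(dur))
    return frames


def l1_spectrogram_loss(predicted, reference):
    """Mean absolute error between two frame sequences of equal shape.

    Assumption: the abstract does not say which distance the loss uses.
    """
    if len(predicted) != len(reference):
        raise ValueError("frame counts differ")
    total, count = 0.0, 0
    for pred_frame, ref_frame in zip(predicted, reference):
        if len(pred_frame) != len(ref_frame):
            raise ValueError("frame widths differ")
        for p, r in zip(pred_frame, ref_frame):
            total += abs(p - r)
            count += 1
    if count == 0:
        raise ValueError("empty spectrogram")
    return total / count


if __name__ == "__main__":
    # Three token vectors with toy durations of 2, 1 and 3 frames.
    tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    durations = [2, 1, 3]
    print(token_boundaries(durations))
    print(len(upsample(tokens, durations)))
```

In a trained model the per-token boundaries would feed a learned network that produces soft weights over tokens for every frame; hard repetition is only the simplest way to get from token-level to frame-level resolution.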

Source: USPTO / EPO open patent data (objective bibliographic data and citation counts).