Patent · US Active

Text-to-speech using duration prediction

US12100382B2 · kind B2 · utility

0Cited by
3References
20Claims
0Family size

Assignee

Inventors

Key dates

Filing dateOct 1, 2021
Grant dateSep 24, 2024
Priority date
Expiry dateDec 5, 2041

Classification

  • Technology area (CPC G)Physics
  • CPC primaryG10L2013/105
  • WIPO fieldComputer technology
  • WIPO sectorElectrical engineering

Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, synthesizing audio data from text data using duration prediction. One of the methods includes processing an input text sequence that includes a respective text element at each of multiple input time steps using a first neural network to generate a modified input sequence comprising, for each input time step, a representation of the corresponding text element in the input text sequence; processing the modified input sequence using a second neural network to generate, for each input time step, a predicted duration of the corresponding text element in the output audio sequence; upsampling the modified input sequence according to the predicted durations to generate an intermediate sequence comprising a respective intermediate element at each of a plurality of intermediate time steps; and generating an output audio sequence using the intermediate sequence.

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.