Patent · US Active

Controlling expressivity in end-to-end speech synthesis systems

US11676573B2 · kind B2 · utility

Cited by: 0
References: 1
Claims: 22
Family size: 0

Assignee

Inventors

Key dates

Filing date: Jul 16, 2020
Grant date: Jun 13, 2023
Priority date:
Expiry date: Apr 22, 2041

Classification

  • Technology area (CPC G): Physics
  • CPC primary: G10L13/08
  • WIPO field: Computer technology
  • WIPO sector: Electrical engineering

Abstract

A system for generating an output audio signal includes a context encoder, a text-prediction network, and a text-to-speech (TTS) model. The context encoder is configured to receive one or more context features associated with current input text and process the one or more context features to generate a context embedding associated with the current input text. The text-prediction network is configured to process the current input text and the context embedding to predict, as output, a style embedding for the current input text. The style embedding specifies a specific prosody and/or style for synthesizing the current input text into expressive speech. The TTS model is configured to process the current input text and the style embedding to generate an output audio signal of expressive speech of the current input text. The output audio signal has the specific prosody and/or style specified by the style embedding.
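
The data flow described in the abstract (context features to context embedding, then text plus context embedding to style embedding, then text plus style embedding to audio) can be illustrated with a minimal sketch. Everything below is a toy stand-in, not the patented models: the class names, the hash-based `_hash_vector` helper, `EMBED_DIM`, the assumption that context features are plain strings, and the sine-wave "synthesis" are all hypothetical and chosen only to make the three-stage wiring runnable.

```python
import math
from typing import List, Sequence

EMBED_DIM = 4  # hypothetical embedding size, chosen only for illustration


def _hash_vector(text: str, dim: int = EMBED_DIM) -> List[float]:
    """Map a string to a fixed-size vector in [0, 1); stands in for a learned encoder."""
    vec = [0.0] * dim
    for i, ch in enumerate(text):
        vec[i % dim] += ord(ch)
    return [(v % 97) / 97.0 for v in vec]


class ContextEncoder:
    """Turns context features (assumed here to be strings) into one context embedding."""

    def encode(self, context_features: Sequence[str]) -> List[float]:
        if not context_features:
            return [0.0] * EMBED_DIM
        vectors = [_hash_vector(f) for f in context_features]
        # Average the per-feature vectors into a single embedding.
        return [sum(col) / len(vectors) for col in zip(*vectors)]


class TextPredictionNetwork:
    """Predicts a style embedding from the current text and the context embedding."""

    def predict(self, text: str, context_embedding: Sequence[float]) -> List[float]:
        text_vec = _hash_vector(text)
        # tanh keeps every style component bounded.
        return [math.tanh(t + c) for t, c in zip(text_vec, context_embedding)]


class TTSModel:
    """Generates a waveform conditioned on the text and the style embedding."""

    def synthesize(self, text: str, style_embedding: Sequence[float],
                   sample_rate: int = 8000) -> List[float]:
        # Toy conditioning: component 0 sets pitch, component 1 sets loudness.
        pitch_hz = 120.0 + 80.0 * style_embedding[0]
        gain = 0.5 + 0.4 * style_embedding[1]
        n_samples = len(text) * (sample_rate // 100)  # 10 ms of audio per character
        return [gain * math.sin(2 * math.pi * pitch_hz * n / sample_rate)
                for n in range(n_samples)]


def generate_audio(text: str, context_features: Sequence[str]) -> List[float]:
    """Chain the three components in the order given in the abstract."""
    context_embedding = ContextEncoder().encode(context_features)
    style_embedding = TextPredictionNetwork().predict(text, context_embedding)
    return TTSModel().synthesize(text, style_embedding)


audio = generate_audio("Hello there.", ["She waved from across the street."])
```

The point of the sketch is the interface, not the arithmetic: the style embedding is predicted from the text and its context rather than supplied by hand, and the TTS stage consumes only the text and that embedding.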

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.