Patent · US Active

Training and using a transcript generation model on a multi-speaker audio stream

US11984127B2 · kind B2 · utility

Cited by	0
References	0
Claims	20
Family size	0

Assignee

MICROSOFT TECHNOLOGY LICENSING, LLC · US

Inventors

Naoyuki Kanda · Bellevue, US
Takuya Yoshioka · Bellevue, US
Zhuo Chen · Markham, CA
Jinyu Li · Beijing, CN
Yashesh Gaur · Redmond, US
Zhong Meng · Seattle, US
Xiaofei Wang · Cedar Grove, US
Xiong Xiao · Bothell, US

Key dates

Filing date	Dec 31, 2021
Grant date	May 14, 2024
Priority date	—
Expiry date	Nov 13, 2042

Classification

Technology area (CPC G)	Physics
CPC primary	G10L17/00
WIPO field	Computer technology
WIPO sector	Electrical engineering

Abstract

The disclosure herein describes using a transcript generation model for generating a transcript from a multi-speaker audio stream. Audio data including overlapping speech of a plurality of speakers is obtained and a set of frame embeddings are generated from audio data frames of the obtained audio data using an audio data encoder. A set of words and channel change (CC) symbols are generated from the set of frame embeddings using a transcript generation model. The CC symbols are included between pairs of adjacent words that are spoken by different people at the same time. The set of words and CC symbols are transformed into a plurality of transcript lines, wherein words of the set of words are sorted into transcript lines based on the CC symbols, and a multi-speaker transcript is generated based on the plurality of transcript lines. The inclusion of CC symbols by the model enables efficient, accurate multi-speaker transcription.
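
The step in which words are sorted into transcript lines based on the CC symbols can be illustrated with a minimal sketch. It is not the patented implementation. It assumes two output channels, that each CC symbol switches the current channel to the other one, and that the symbol is written as the string `<cc>`. The token spelling, the function name `split_by_channel_change`, and the `num_channels` parameter are all hypothetical and do not come from the abstract.

```python
from typing import List

# Hypothetical spelling of the channel change (CC) symbol; the abstract
# does not say how the symbol is represented in the model's output.
CC = "<cc>"


def split_by_channel_change(tokens: List[str], num_channels: int = 2) -> List[List[str]]:
    """Sort a serialized stream of words and CC symbols into per-channel lines.

    Assumption: each CC symbol moves the current channel to the next one,
    wrapping around, so with two channels it simply toggles between them.
    """
    lines: List[List[str]] = [[] for _ in range(num_channels)]
    current = 0
    for token in tokens:
        if token == CC:
            # A CC symbol carries no text; it only changes the target line.
            current = (current + 1) % num_channels
        else:
            lines[current].append(token)
    return lines


# Two speakers overlap: "hello how are you" and "hi good".
stream = ["hello", "how", CC, "hi", CC, "are", "you", CC, "good"]
for channel, words in enumerate(split_by_channel_change(stream)):
    print(channel, " ".join(words))
```

In this sketch each channel collects the words of one overlapping utterance in spoken order. Building the final multi-speaker transcript from those lines, for example by attaching speaker labels, is a separate step that the sketch does not cover.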

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.