Training and using a transcript generation model on a multi-speaker audio stream
US11984127B2 · kind B2 · utility
Assignee
Inventors
Key dates
| Filing date | Dec 31, 2021 |
| Grant date | May 14, 2024 |
| Priority date | — |
| Expiry date | Nov 13, 2042 |
Classification
- Technology area (CPC G): Physics
- CPC primary: G10L17/00
- WIPO field: Computer technology
- WIPO sector: Electrical engineering
Abstract
The disclosure herein describes using a transcript generation model to generate a transcript from a multi-speaker audio stream. Audio data including overlapping speech of a plurality of speakers is obtained, and a set of frame embeddings is generated from audio data frames of the obtained audio data using an audio data encoder. A set of words and channel change (CC) symbols is generated from the set of frame embeddings using a transcript generation model. The CC symbols are included between pairs of adjacent words that are spoken by different people at the same time. The set of words and CC symbols is transformed into a plurality of transcript lines, wherein words of the set of words are sorted into transcript lines based on the CC symbols, and a multi-speaker transcript is generated based on the plurality of transcript lines. Because the model itself emits the CC symbols, overlapping speech can be separated into per-speaker lines directly from its output, which enables efficient, accurate multi-speaker transcription.
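The step that turns the word and CC-symbol sequence into transcript lines can be illustrated with a minimal sketch. It is not the patented implementation: it assumes exactly two output channels, that each CC symbol toggles to the other channel, and that the symbol is written as the hypothetical token `<cc>`. The function names and the sample token list are likewise invented for illustration.

```python
from typing import List

# Hypothetical spelling of the channel change symbol; the abstract does not
# say how the symbol is represented in the model's output vocabulary.
CC = "<cc>"


def split_into_channels(tokens: List[str], num_channels: int = 2) -> List[List[str]]:
    """Sort words into channels, switching channel at every CC symbol.

    Assumption: channels are visited round-robin, so with two channels
    each CC symbol simply toggles between them.
    """
    channels: List[List[str]] = [[] for _ in range(num_channels)]
    current = 0
    for token in tokens:
        if token == CC:
            current = (current + 1) % num_channels
        else:
            channels[current].append(token)
    return channels


def to_transcript_lines(tokens: List[str]) -> List[str]:
    """Join each non-empty channel's words into one transcript line."""
    return [" ".join(words) for words in split_into_channels(tokens) if words]


# Two speakers overlapping: "hello how are you" and "I am fine".
tokens = ["hello", "how", CC, "I", CC, "are", "you", CC, "am", "fine"]
lines = to_transcript_lines(tokens)
for line in lines:
    print(line)
```

In this sketch the interleaved output stream is a single sequence, and the CC symbols carry all the information needed to recover separate lines. Attributing each line to a named speaker would be a further step that the sketch does not attempt.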
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.