Patent · US Active

Method for automatically identifying sentence boundaries in noisy conversational data

US8364485B2 · kind B2 · utility

Cited by	2
References	6
Claims	4
Family size	0

Assignee

International Business Machines Corporation · US

Inventors

Tetsuya Nasukawa · Yokohama, JP
Diwakar Punjani · San Francisco, US
Shourya Roy · Kanchinakote, IN
L. Venkata Subramaniam · New Delhi, IN
Hironori Takeuchi · Yokohama, JP

Key dates

Filing date	Aug 27, 2007
Grant date	Jan 29, 2013
Priority date	—
Expiry date	Oct 14, 2030

Classification

Technology area (CPC G)	Physics
CPC primary	G10L15/26
WIPO field	Computer technology
WIPO sector	Electrical engineering

Abstract

Sentence boundaries in noisy conversational transcription data are automatically identified. Noise and transcription symbols are removed, and a training set is formed with sentence boundaries marked based on long silences or on manual markings in the transcribed data. Frequencies of head and tail n-grams that occur at the beginning and ending of sentences are determined from the training set. N-grams that occur a significant number of times in the middle of sentences in relation to their occurrences at the beginning or ending of sentences are filtered out. A boundary is marked before every head n-gram and after every tail n-gram occurring in the conversational data and remaining after filtering. Turns are identified. A boundary is marked after each turn, unless the turn ends with an impermissible tail word or is an incomplete turn. The marked boundaries in the conversational data identify sentence boundaries.
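The steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the `min_count` and `max_mid_ratio` thresholds, and the whitespace tokenization are all assumptions made for illustration, and it presumes that noise removal and turn detection have already been done. Detection of incomplete turns is not covered; only the impermissible-tail-word check is shown.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def learn_boundary_ngrams(sentences, n=1, min_count=2, max_mid_ratio=0.5):
    """Collect head and tail n-grams from sentences with known boundaries.

    min_count and max_mid_ratio are hypothetical thresholds: an n-gram is
    kept only if it occurs at least min_count times at the boundary and
    its mid-sentence count is at most max_mid_ratio times that figure.
    """
    head, tail, mid = Counter(), Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        if len(tokens) < n:
            continue
        grams = ngrams(tokens, n)
        head[grams[0]] += 1
        tail[grams[-1]] += 1
        for gram in grams[1:-1]:
            mid[gram] += 1

    def keep(counter):
        # Filter out n-grams that are frequent in the middle of sentences.
        return {g for g, c in counter.items()
                if c >= min_count and mid[g] <= max_mid_ratio * c}

    return keep(head), keep(tail)


def mark_boundaries(tokens, heads, tails, n=1):
    """Return sorted indices i such that a boundary falls before tokens[i]."""
    cuts = set()
    for i in range(len(tokens) - n + 1):
        gram = tuple(t.lower() for t in tokens[i:i + n])
        if gram in heads and i > 0:
            cuts.add(i)          # boundary before a head n-gram
        if gram in tails and i + n < len(tokens):
            cuts.add(i + n)      # boundary after a tail n-gram
    return sorted(cuts)


def split_turn(turn, heads, tails, impermissible_tails=frozenset(), n=1):
    """Split one turn into segments.

    Returns (segments, closed). closed is False when the turn ends with
    an impermissible tail word, so no boundary is marked after the turn.
    """
    tokens = turn.split()
    if not tokens:
        return [], False
    cuts = mark_boundaries(tokens, heads, tails, n)
    edges = [0] + cuts + [len(tokens)]
    segments = [" ".join(tokens[a:b]) for a, b in zip(edges, edges[1:])]
    closed = tokens[-1].lower() not in impermissible_tails
    return segments, closed
```

A toy training set such as `["okay thank you", "okay let me check", "how can i help you", "okay i see", "thank you"]` yields `okay` as the only surviving head unigram and `you` as the only surviving tail unigram under the default thresholds. Calling `split_turn` on the turn `"thank you okay let me check that and"` with `impermissible_tails={"and"}` then splits it after `you` and leaves the end of the turn unmarked.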

Source: USPTO / EPO open patent data. Objective bibliographic data and citation counts.