Patent · US Active

Word boundary probability estimating, probabilistic language model building, kana-kanji converting, and unknown word model building

US7917350B2 · kind B2 · utility

9 Cited by

9 References

17 Claims

0 Family size

Assignee

International Business Machines Corporation · US

Inventors

Shinsuke Mori · Anjo, JP
Daisuke Takuma · Tokyo, JP

Key dates

Filing date	May 26, 2008
Grant date	Mar 29, 2011
Priority date	—
Expiry date	Jan 6, 2029

Classification

Technology area (CPC G)	Physics
CPC primary	G06F40/216
WIPO field	Computer technology
WIPO sector	Electrical engineering

Abstract

Calculates word n-gram probabilities with high accuracy when two corpora are given as the training corpus, a store of a large number of sample sentences: a first corpus, which is relatively small and contains manually segmented word information, and a second corpus, which is relatively large. The vocabulary, including contextual information, is expanded from the words occurring in the relatively small first corpus to the words occurring in the relatively large second corpus by using word n-gram probabilities estimated from an unknown word model and the raw corpus. The first corpus (word-segmented) is used for calculating n-grams and the probability that the boundary between two adjacent characters is a boundary between two words (the segmentation probability). The second corpus (word-unsegmented), to which probabilistic word boundaries are assigned based on information in the first corpus (word-segmented), is used for calculating word n-grams.
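
The two-corpus idea in the abstract can be illustrated with a minimal Python sketch. It is not the patented method: the abstract does not say how the segmentation probability is conditioned, so the sketch assumes it depends only on the pair of characters on either side of a candidate boundary, with a fallback to the overall boundary rate for unseen pairs. The function names `boundary_probs` and `expected_count` are hypothetical. The expected count of a word in the unsegmented corpus is taken as the sum, over every occurrence of its character string, of the probability that boundaries fall at both ends and nowhere inside.

```python
from collections import Counter


def boundary_probs(segmented_sentences):
    """Estimate P(word boundary | left char, right char) from a segmented corpus.

    Assumption: the probability is conditioned on the adjacent character pair only.
    Each sentence is a list of words, e.g. ["ab", "c"].
    """
    boundary = Counter()  # times a character pair straddled a word boundary
    total = Counter()     # times a character pair occurred at all
    for words in segmented_sentences:
        text = "".join(words)
        # Character offsets at which a word ends inside the sentence.
        cuts = set()
        pos = 0
        for w in words[:-1]:
            pos += len(w)
            cuts.add(pos)
        for i in range(1, len(text)):
            pair = (text[i - 1], text[i])
            total[pair] += 1
            if i in cuts:
                boundary[pair] += 1
    # Overall boundary rate, used as a fallback for unseen character pairs.
    global_rate = sum(boundary.values()) / max(1, sum(total.values()))
    probs = {pair: boundary[pair] / n for pair, n in total.items()}
    return probs, global_rate


def expected_count(word, raw_text, probs, global_rate):
    """Expected frequency of `word` in unsegmented text under probabilistic boundaries."""

    def p(i):
        # The start and end of the text are always boundaries.
        if i == 0 or i == len(raw_text):
            return 1.0
        return probs.get((raw_text[i - 1], raw_text[i]), global_rate)

    expected = 0.0
    start = raw_text.find(word)
    while start != -1:
        end = start + len(word)
        weight = p(start) * p(end)       # boundaries at both ends of the word
        for i in range(start + 1, end):  # no boundary inside the word
            weight *= 1.0 - p(i)
        expected += weight
        start = raw_text.find(word, start + 1)
    return expected
```

Fractional counts of this kind could stand in for integer word counts when estimating word n-gram probabilities over the larger corpus, which is how a vocabulary might grow beyond the words seen in the small segmented corpus.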

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.