Patent · US Active

Efficient language identification

US8027832B2 · kind B2 · utility

22 Cited by

7 References

20 Claims

0 Family size

Assignee

Microsoft Corporation · US

Inventors

William D. Ramsey · Redmond, US
Patricia M. Schmid · Redmond, US
Kevin Roland Powell · Redmond, US

Key dates

Filing date	Feb 11, 2005
Grant date	Sep 27, 2011
Priority date	—
Expiry date	Feb 26, 2028

Classification

Technology area (CPC Y)	Emerging Cross-Sectional Technologies
CPC primary	Y10T70/7057
WIPO field	Computer technology
WIPO sector	Electrical engineering

Abstract

A system and methods of language identification of natural language text are presented. The system includes stored expected character counts and variances for a list of characters found in a natural language. Expected character counts and variances are stored for multiple languages to be considered during language identification. At run-time, one or more languages are identified for a text sample based on comparing actual and expected character counts. The present methods can be combined with upstream analyzing of Unicode ranges for characters in the text sample to limit the number of languages considered. Further, n-gram methods can be used in downstream processing to select the most probable language from among the languages identified by the present system and methods.
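The comparison of actual and expected character counts described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented method: the profile values in `PROFILES`, the per-1000-character normalization, the variance-weighted squared-deviation score, and the `threshold` value are all assumptions made for the example, and the abstract does not specify any of them.

```python
from collections import Counter

# Hypothetical toy profiles: for each language, a few characters mapped to
# (expected count per 1000 letters, variance of that count).
# These numbers are invented for illustration and are not from the patent.
PROFILES = {
    "en": {"e": (120.0, 400.0), "t": (90.0, 300.0), "h": (60.0, 250.0), "z": (1.0, 4.0)},
    "de": {"e": (160.0, 400.0), "t": (60.0, 300.0), "h": (45.0, 250.0), "z": (11.0, 30.0)},
}


def score(text, profile, per=1000):
    """Mean variance-weighted squared deviation between actual and expected counts.

    Lower is a better match. Returns infinity when the text has no letters.
    """
    letters = [c for c in text.lower() if c.isalpha()]
    n = len(letters)
    if n == 0:
        return float("inf")
    counts = Counter(letters)
    scale = n / per  # rescale the stored per-1000 statistics to this sample size
    total = 0.0
    for ch, (mean, var) in profile.items():
        expected = mean * scale
        scaled_var = var * scale  # assumption: variance grows linearly with length
        total += (counts[ch] - expected) ** 2 / scaled_var
    return total / len(profile)


def identify(text, profiles=PROFILES, threshold=2.0):
    """Return the languages whose score is within the threshold, best match first."""
    scores = {lang: score(text, prof) for lang, prof in profiles.items()}
    return sorted((lang for lang, s in scores.items() if s <= threshold),
                  key=scores.get)
```

In this sketch `identify` may return zero, one, or several candidate languages, which mirrors the abstract's "one or more languages are identified". An upstream Unicode-range check would shrink `profiles` before scoring, and a downstream n-gram model could choose among the returned candidates; neither step is shown here.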

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.