Patent · US Active

Efficient language identification

US8027832B2 · kind B2 · utility

Cited by: 22
References: 7
Claims: 20
Family size: 0

Assignee

Inventors

Key dates

Filing date: Feb 11, 2005
Grant date: Sep 27, 2011
Priority date
Expiry date: Feb 26, 2028

Classification

  • Technology area (CPC Y): Emerging Cross-Sectional Technologies
  • CPC primary: Y10T70/7057
  • WIPO field: Computer technology
  • WIPO sector: Electrical engineering

Abstract

A system and methods of language identification of natural language text are presented. The system includes stored expected character counts and variances for a list of characters found in a natural language. Expected character counts and variances are stored for multiple languages to be considered during language identification. At run-time, one or more languages are identified for a text sample based on comparing actual and expected character counts. The present methods can be combined with upstream analyzing of Unicode ranges for characters in the text sample to limit the number of languages considered. Further, n-gram methods can be used in downstream processing to select the most probable language from among the languages identified by the present system and methods.
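The comparison of actual and expected character counts described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The per-1000-character normalization, the variance floor `min_var`, the squared-deviation score, and all function names (`char_rates`, `build_profile`, `distance`, `rank_languages`) are assumptions made for illustration. The abstract does not specify how counts are normalized or how deviations are scored.

```python
from collections import Counter
from statistics import mean, pvariance

WINDOW = 1000  # assumption: counts are scaled to occurrences per 1000 letters


def char_rates(text):
    """Return each letter's count scaled to occurrences per WINDOW letters."""
    letters = [c for c in text.lower() if c.isalpha()]
    if not letters:
        return {}
    scale = WINDOW / len(letters)
    return {c: n * scale for c, n in Counter(letters).items()}


def build_profile(samples):
    """Estimate (expected count, variance) per character from training samples."""
    rates = [char_rates(s) for s in samples]
    chars = set().union(*rates)
    profile = {}
    for c in chars:
        values = [r.get(c, 0.0) for r in rates]
        profile[c] = (mean(values), pvariance(values))
    return profile


def distance(text, profile, min_var=1.0):
    """Mean squared deviation of actual from expected counts, scaled by variance.

    min_var is a hypothetical floor that avoids dividing by a zero variance.
    """
    actual = char_rates(text)
    chars = set(profile) | set(actual)
    if not chars:
        return float("inf")
    total = 0.0
    for c in chars:
        expected, var = profile.get(c, (0.0, 0.0))
        total += (actual.get(c, 0.0) - expected) ** 2 / max(var, min_var)
    return total / len(chars)


def rank_languages(text, profiles):
    """Order language names from best to worst match for the text sample."""
    return sorted(profiles, key=lambda lang: distance(text, profiles[lang]))
```

In this sketch, `profiles` would map a language name to the output of `build_profile` for that language's training text. A Unicode-range prefilter could shrink `profiles` before ranking, and an n-gram model could break ties among the top-ranked candidates, as the abstract suggests. Neither step is shown here.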

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.