Patent · US Expired

Identifying language and character set of data representing text

US6157905A · kind A · utility

Cited by	101

References	18

Claims	55

Family size	0

Assignee

Microsoft Corporation · US

Inventor

Robert D. Powell · Vashon, US

Key dates

Filing date	Dec 11, 1997
Grant date	Dec 5, 2000
Priority date	—
Expiry date	Dec 11, 2017

Classification

Technology area (CPC G)	Physics
CPC primary	G06F40/216
WIPO field	Computer technology
WIPO sector	Electrical engineering

Abstract

The present invention provides a facility for identifying the unknown language of text represented by a series of data values in accordance with a character set that associates character glyphs with particular data values. The facility first generates a characterization that characterizes the series of data values in terms of the occurrence of particular data values in the series of data values. For each of a plurality of languages, the facility then retrieves a model that models the language in terms of the statistical occurrence of particular data values in representative samples of text in that language. The facility then compares the retrieved models to the generated characterization of the series of data values, and identifies as the distinguished language the language whose model compares most favorably to the generated characterization of the series of data values.
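
The three steps in the abstract (characterize the input, retrieve one model per language, pick the model that compares most favorably) can be illustrated with a minimal Python sketch. It is not the patented method: the abstract does not say how occurrences are counted or how models are compared, so relative byte frequencies and cosine similarity are assumptions chosen for simplicity. The function names (`characterize`, `build_model`, `identify`), the toy sample sentences, and the use of (language, character set) pairs as model keys are likewise hypothetical.

```python
import math
from collections import Counter

def characterize(data: bytes) -> dict:
    """Relative frequency of each data value (byte) in the series."""
    counts = Counter(data)
    total = len(data)
    return {value: n / total for value, n in counts.items()} if total else {}

def build_model(sample_text: str, charset: str) -> dict:
    """Model a language as byte statistics of sample text in one charset."""
    return characterize(sample_text.encode(charset))

def cosine(a: dict, b: dict) -> float:
    """Assumed comparison measure: cosine similarity of frequency vectors."""
    dot = sum(w * b.get(value, 0.0) for value, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def identify(data: bytes, models: dict):
    """Return the key of the model that compares most favorably."""
    profile = characterize(data)
    return max(models, key=lambda key: cosine(profile, models[key]))

# Toy "representative samples"; a real model would need far more text.
ENGLISH = ("The quick brown fox jumps over the lazy dog "
           "and then the dog sleeps in the sun.")
RUSSIAN = ("Быстрая коричневая лиса прыгает через ленивую собаку, "
           "а потом собака спит на солнце.")

# One model per (language, character set) pair, using Python codec names.
MODELS = {
    ("English", "ascii"): build_model(ENGLISH, "ascii"),
    ("Russian", "cp1251"): build_model(RUSSIAN, "cp1251"),
    ("Russian", "koi8_r"): build_model(RUSSIAN, "koi8_r"),
}

if __name__ == "__main__":
    unknown = "где спит собака".encode("koi8_r")
    print(identify(unknown, MODELS))
```

Keying the models by (language, character set) means one comparison pass yields both answers, which matches the patent title. Single-byte frequencies separate these toy cases only because the encodings use largely disjoint byte ranges; telling apart languages that share a character set would likely need richer statistics, such as sequences of adjacent byte values.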

Source: USPTO / EPO open patent data. Objective bibliographic data and citation counts.