Patent · US Active

Language identification for documents containing multiple languages

US8224641B2 · kind B2 · utility

Cited by: 5
References: 10
Claims: 30
Family size: 0

Assignee

Inventor

Key dates

Filing date: Nov 19, 2008
Grant date: Jul 17, 2012
Priority date
Expiry date: May 13, 2031

Classification

  • Technology area (CPC G): Physics
  • CPC primary: G06F40/263
  • WIPO field: Computer technology
  • WIPO sector: Electrical engineering

Abstract

Multiple non-overlapping languages within a single document can be identified. In one embodiment, for each of a set of candidate languages, a set of non-overlapping languages is defined. The document is analyzed under two competing hypotheses: that the whole document is in one language, and that part of the document is in one language while the rest is in a different, non-overlapping language. The language(s) of the document are identified by comparing these competing hypotheses across a number of language pairs. In another embodiment, transitions between non-overlapping character sets are used to segment a document, and each segment is scored separately for a subset of candidate languages. The language(s) of the document are identified based on the segment scores.
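
The second embodiment (segmenting at character-set transitions, then scoring each segment against a subset of candidate languages) can be illustrated with a minimal Python sketch. It is not the patented method. Several things are assumptions made for illustration: the first word of a character's Unicode name stands in for its character set, the `CANDIDATES` and `STOPWORDS` tables are hypothetical toy data, and the stopword-overlap `score` function is a stand-in for whatever scoring model the patent actually claims.

```python
import unicodedata

# Hypothetical toy data: candidate languages per script and a few common words each.
CANDIDATES = {"LATIN": ["en", "fr"], "CYRILLIC": ["ru"]}
STOPWORDS = {
    "en": {"the", "and", "is", "of"},
    "fr": {"le", "et", "est", "la"},
    "ru": {"и", "в", "не", "это"},
}


def script_of(ch):
    """Rough script label (assumption: first word of the Unicode name); None for non-letters."""
    if not ch.isalpha():
        return None
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:  # character has no Unicode name
        return None


def segment_by_script(text):
    """Split text wherever the script of consecutive letters changes.

    Spaces, digits, and punctuation are script-neutral and stay with the
    segment that is currently open.
    """
    segments, current, buf = [], None, []
    for ch in text:
        s = script_of(ch)
        if s is not None and current is not None and s != current:
            segments.append((current, "".join(buf).strip()))
            buf = []
        if s is not None:
            current = s
        buf.append(ch)
    if buf and current is not None:
        segments.append((current, "".join(buf).strip()))
    return segments


def score(segment, lang):
    """Toy score: fraction of words in the segment that are stopwords of `lang`."""
    words = [w.strip(".,;:!?").lower() for w in segment.split()]
    if not words:
        return 0.0
    return sum(w in STOPWORDS[lang] for w in words) / len(words)


def identify_languages(text):
    """Score each segment only against the candidate languages for its script."""
    results = []
    for script, seg in segment_by_script(text):
        langs = CANDIDATES.get(script, [])
        if langs:
            results.append((max(langs, key=lambda l: score(seg, l)), seg))
    return results


print(identify_languages("The cat is on the mat. Это не кот."))
```

Restricting each segment to the candidates for its own script keeps the number of scoring passes small, which is the point of segmenting first: a Cyrillic segment is never scored against English or French.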

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.