Patent · US Active

Language identification for documents containing multiple languages

US8938384B2 · kind B2 · utility

Cited by: 3
References: 10
Claims: 20
Family size: 0

Assignee

Inventor

Key dates

Filing date: Jul 16, 2012
Grant date: Jan 20, 2015
Priority date
Expiry date: Aug 6, 2033

Classification

  • Technology area (CPC G): Physics
  • CPC primary: G06F40/263
  • WIPO field: Computer technology
  • WIPO sector: Electrical engineering

Abstract

Multiple non-overlapping languages within a single document can be identified. In one embodiment, for each of a set of candidate languages, a set of non-overlapping languages is defined. The document is analyzed under two competing hypotheses: that the whole document is in one language, and that part of the document is in one language while the rest is in a different, non-overlapping language. The language(s) of the document are identified by comparing these competing hypotheses across a number of language pairs. In another embodiment, transitions between non-overlapping character sets are used to segment a document, and each segment is scored separately for a subset of candidate languages. The language(s) of the document are identified based on the segment scores.
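
The segmentation embodiment in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes that the first word of a character's Unicode name is an adequate stand-in for its character set, and it uses a toy word-overlap score in place of whatever scoring model the patent actually specifies. The names `script_of`, `segment_by_script`, `CANDIDATES`, `COMMON_WORDS`, `score`, and `identify_languages` are all hypothetical.

```python
import unicodedata

# Hypothetical candidate languages per character set (assumption, not from the patent).
CANDIDATES = {"LATIN": ["en", "fr"], "CYRILLIC": ["ru"]}

# Toy word lists standing in for a real language model (assumption).
COMMON_WORDS = {
    "en": {"the", "and", "is", "hello", "world"},
    "fr": {"le", "et", "est", "bonjour", "monde"},
    "ru": {"и", "привет", "мир"},
}


def script_of(ch):
    """Return a coarse script label for a letter, or None for neutral characters."""
    if not ch.isalpha():
        return None  # spaces, digits, punctuation belong to no script
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None


def segment_by_script(text):
    """Split text wherever the script changes; returns a list of (script, segment)."""
    segments = []
    current_script, buf = None, []
    for ch in text:
        s = script_of(ch)
        # A transition occurs when a letter's script differs from the running one.
        if s is not None and current_script is not None and s != current_script:
            segments.append((current_script, "".join(buf)))
            buf = []
        if s is not None:
            current_script = s
        buf.append(ch)
    if buf and current_script is not None:
        segments.append((current_script, "".join(buf)))
    return segments


def score(segment_text, lang):
    """Fraction of words in the segment that appear in the toy word list for lang."""
    words = segment_text.lower().split()
    if not words:
        return 0.0
    return sum(w in COMMON_WORDS[lang] for w in words) / len(words)


def identify_languages(text):
    """Score each segment only against the candidates for its script."""
    results = []
    for script, seg in segment_by_script(text):
        langs = CANDIDATES.get(script, [])
        if not langs:
            continue  # no candidate languages known for this script
        best = max(langs, key=lambda lang: score(seg, lang))
        results.append((best, seg.strip()))
    return results


print(identify_languages("Hello world привет мир"))
```

Restricting each segment to the candidate languages of its own character set is what keeps the per-segment scoring cheap: a Cyrillic segment is never scored against English or French.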

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.