Patent · US · Expired

Software and method for recognizing similarity of documents written in different languages based on a quantitative measure of similarity

US6519557B1 · kind B1 · utility

Cited by: 96
References: 17
Claims: 16
Family size: 0

Assignee

Inventors

Key dates

Filing date: Jun 6, 2000
Grant date: Feb 11, 2003
Priority date
Expiry date: Aug 26, 2020

Classification

  • Technology area (CPC G): Physics
  • CPC primary: G06F40/194
  • WIPO field: Computer technology
  • WIPO sector: Electrical engineering

Abstract

A system for identifying different-language versions of the same structured-format document (e.g., an HTML web page) detects the language of each of two candidate documents, translates one or both into a preferred language if necessary, parses both documents, and builds a hierarchical data structure for each. The data structures are used to compare the hierarchical structure of the two documents and also to access text portions in congruent positions in the two documents. A fuzzy measure of similarity is then computed over the set of text portions occupying congruent positions, yielding a measure of the similarity of the two documents, which is compared to a fuzzy threshold.
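
The pipeline in the abstract can be illustrated with a minimal Python sketch. It is not the patented method, only an approximation under stated assumptions: both documents are taken to be already in the preferred language (language detection and translation are skipped), the HTML is well formed with no void elements such as `<br>`, structural comparison is a strict tag-by-tag match, `difflib.SequenceMatcher` stands in for the fuzzy text measure, and a crisp cutoff of 0.8 stands in for the fuzzy threshold. All class and function names (`Node`, `TreeBuilder`, `congruent_text_pairs`, `document_similarity`, `same_document`) are hypothetical.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class Node:
    """One element of the parsed document tree."""

    def __init__(self, tag):
        self.tag = tag
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a simple element tree (assumes properly nested tags)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        # Attach text to the innermost open element.
        node = self.stack[-1]
        node.text = (node.text + " " + data.strip()).strip()


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def congruent_text_pairs(a, b, pairs=None):
    """Walk both trees in lockstep, collecting text found at the same
    position in each. Returns None if the structures differ."""
    if pairs is None:
        pairs = []
    if a.tag != b.tag or len(a.children) != len(b.children):
        return None
    if a.text or b.text:
        pairs.append((a.text, b.text))
    for child_a, child_b in zip(a.children, b.children):
        if congruent_text_pairs(child_a, child_b, pairs) is None:
            return None
    return pairs


def document_similarity(html_a, html_b):
    """Mean text similarity over congruent positions, in [0, 1]."""
    pairs = congruent_text_pairs(parse(html_a), parse(html_b))
    if pairs is None:
        return 0.0  # structures do not match
    if not pairs:
        return 1.0  # same structure, no text to compare
    scores = [SequenceMatcher(None, x.lower(), y.lower()).ratio()
              for x, y in pairs]
    return sum(scores) / len(scores)


def same_document(html_a, html_b, threshold=0.8):
    # Crisp cutoff used here as a stand-in for a fuzzy threshold.
    return document_similarity(html_a, html_b) >= threshold
```

Averaging the per-position scores is only one possible way to aggregate them; a weighted or fuzzy aggregation would slot into `document_similarity` without changing the tree traversal.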

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.