Software and method for recognizing similarity of documents written in different languages based on a quantitative measure of similarity
US6519557B1 · kind B1 · utility
Assignee
Inventors
Key dates
| Filing date | Jun 6, 2000 |
| Grant date | Feb 11, 2003 |
| Priority date | — |
| Expiry date | Aug 26, 2020 |
Classification
- Technology area (CPC G): Physics
- CPC primary: G06F40/194
- WIPO field: Computer technology
- WIPO sector: Electrical engineering
Abstract
A system for identifying different language versions of the same structured format document (e.g., an HTML web page). The system detects the language of the two candidate documents, translates one or both into a preferred language if necessary, parses them, and builds two hierarchical data structures, one per document. The data structures are used to compare the hierarchical structure of the two documents and also to access text portions in congruent positions in the two documents. A fuzzy measure of similarity of a set of text portions occupying congruent positions in the two documents is then obtained, to induce a measure of the similarity of the two documents, which is compared to a fuzzy threshold.
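The comparison step described in the abstract can be illustrated with a minimal sketch. It is not the patented method: it assumes both documents are already in the same language (language detection and translation are skipped), it uses tag paths as a crude stand-in for the hierarchical data structures, it substitutes `difflib.SequenceMatcher` for the fuzzy similarity measure, and it uses a plain numeric cutoff instead of a fuzzy threshold. The names `OutlineParser`, `text_portions`, `document_similarity`, and `THRESHOLD` are hypothetical.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Tags that never have a closing tag; skipped so the path stack stays balanced.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}

THRESHOLD = 0.8  # assumed crisp cutoff; the abstract describes a fuzzy threshold


class OutlineParser(HTMLParser):
    """Collect text portions keyed by tag path (a simplified hierarchy)."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tags, outermost first
        self.portions = {}  # tag path -> text portions found at that path

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            path = "/".join(self.stack)
            self.portions.setdefault(path, []).append(text)


def text_portions(html):
    """Parse one document and return its path -> text portions mapping."""
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    return parser.portions


def document_similarity(html_a, html_b):
    """Return a score in [0, 1] combining structural and textual agreement."""
    a, b = text_portions(html_a), text_portions(html_b)
    all_paths = set(a) | set(b)
    if not all_paths:
        return 0.0
    shared = set(a) & set(b)
    # Structural agreement: fraction of tag paths present in both documents.
    structure = len(shared) / len(all_paths)
    # Textual agreement: mean similarity of text at congruent positions
    # (same tag path, same index within that path).
    scores = [
        SequenceMatcher(None, x.lower(), y.lower()).ratio()
        for path in shared
        for x, y in zip(a[path], b[path])
    ]
    text = sum(scores) / len(scores) if scores else 0.0
    return structure * text


doc_a = "<html><body><h1>Price list</h1><p>All prices in euro.</p></body></html>"
doc_b = "<html><body><h1>Price list</h1><p>All prices in euros.</p></body></html>"
doc_c = "<html><body><ul><li>Contact</li></ul></body></html>"

print(document_similarity(doc_a, doc_b) >= THRESHOLD)  # same layout, near-identical text
print(document_similarity(doc_a, doc_c) >= THRESHOLD)  # no shared tag paths
```

Multiplying the structural and textual scores is one simple way to let a mismatch in either dimension pull the overall score down; a real system would likely weight the two and handle unequal numbers of text portions per position.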
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.