Optimized mapping of documents to candidate duplicate documents in a document corpus
US9607029B1 · kind B1 · utility
Assignee
Inventors
Key dates
| Filing date | Dec 17, 2014 |
| Grant date | Mar 28, 2017 |
| Priority date | — |
| Expiry date | Oct 9, 2035 |
Classification
- Technology area (CPC G): Physics
- CPC primary: G06F16/2237
- WIPO field: Computer technology
- WIPO sector: Electrical engineering
Abstract
Technologies are disclosed for mapping documents to candidate duplicate documents in a document corpus. A bitset optimized inverted index is created for a document corpus. A document is received for which candidate duplicate documents in the document corpus are to be identified. The document is tokenized using adaptive tokenization. A determination is made as to whether tokens in the document are represented in the bitset optimized inverted index. A list of candidate duplicate documents is created for tokens represented in the optimized inverted index utilizing in-memory bitsets that map tokens to documents that contain the tokens in the document corpus.
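The flow described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation. The abstract does not define "adaptive tokenization", so the hypothetical `tokenize` function below is a plain lowercase whitespace split. The function names, the use of a Python integer as the in-memory bitset, and the `min_shared` threshold for selecting candidates are all assumptions made for illustration.

```python
from collections import Counter


def tokenize(text):
    # Stand-in for "adaptive tokenization" (not defined in the abstract):
    # a plain lowercase whitespace split.
    return text.lower().split()


def build_index(corpus):
    """Map each token to an int used as a bitset.

    Bit i of index[token] is set when document i contains the token.
    """
    index = {}
    for doc_id, text in enumerate(corpus):
        for token in set(tokenize(text)):
            index[token] = index.get(token, 0) | (1 << doc_id)
    return index


def candidate_duplicates(index, text, min_shared=2):
    """Return ids of corpus documents sharing at least min_shared tokens.

    The min_shared threshold is an assumption of this sketch.
    """
    counts = Counter()
    for token in set(tokenize(text)):
        bits = index.get(token)
        if bits is None:
            continue  # token is not represented in the index
        doc_id = 0
        while bits:
            if bits & 1:
                counts[doc_id] += 1
            bits >>= 1
            doc_id += 1
    return sorted(d for d, n in counts.items() if n >= min_shared)


corpus = [
    "the quick brown fox",
    "the quick brown dog",
    "an unrelated sentence",
]
index = build_index(corpus)
candidates = candidate_duplicates(index, "quick brown fox jumps")
```

In this sketch a token absent from the index (here `jumps`) is skipped without touching any bitset, and each remaining token contributes one count to every document whose bit is set. Raising `min_shared` narrows the candidate list.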
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.