Patent · US Active

Optimized mapping of documents to candidate duplicate documents in a document corpus

US9607029B1 · kind B1 · utility

Cited by: 8
References: 4
Claims: 20
Family size: 0

Assignee

Inventors

Key dates

Filing date: Dec 17, 2014
Grant date: Mar 28, 2017
Priority date
Expiry date: Oct 9, 2035

Classification

  • Technology area (CPC G): Physics
  • CPC primary: G06F16/2237
  • WIPO field: Computer technology
  • WIPO sector: Electrical engineering

Abstract

Technologies are disclosed for mapping documents to candidate duplicate documents in a document corpus. A bitset optimized inverted index is created for a document corpus. A document is received for which candidate duplicate documents in the document corpus are to be identified. The document is tokenized using adaptive tokenization. A determination is made as to whether tokens in the document are represented in the bitset optimized inverted index. A list of candidate duplicate documents is created for tokens represented in the optimized inverted index utilizing in-memory bitsets that map tokens to documents that contain the tokens in the document corpus.
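
The flow described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The abstract does not specify how adaptive tokenization works, so a plain lowercase whitespace split stands in for it here. The function names (tokenize, build_index, candidate_duplicates), the use of Python integers as in-memory bitsets, and the min_shared threshold are all assumptions made for illustration.

```python
from collections import Counter


def tokenize(text):
    # Assumption: simple lowercase whitespace split, used as a stand-in
    # for the adaptive tokenization named in the abstract.
    return set(text.lower().split())


def build_index(corpus):
    # Map each token to a bitset (a Python int) in which bit i is set
    # when document i of the corpus contains the token.
    index = {}
    for doc_id, text in enumerate(corpus):
        for token in tokenize(text):
            index[token] = index.get(token, 0) | (1 << doc_id)
    return index


def candidate_duplicates(index, text, min_shared=2):
    # Count, per corpus document, how many of the query's tokens it shares.
    # min_shared is a hypothetical cutoff, not taken from the patent.
    counts = Counter()
    for token in tokenize(text):
        bits = index.get(token)
        if bits is None:
            continue  # token is not represented in the index
        doc_id = 0
        while bits:
            if bits & 1:
                counts[doc_id] += 1
            bits >>= 1
            doc_id += 1
    return sorted(d for d, n in counts.items() if n >= min_shared)


corpus = [
    "the quick brown fox",
    "quick brown foxes jump",
    "lorem ipsum dolor",
]
index = build_index(corpus)
candidates = candidate_duplicates(index, "a quick brown fox")
```

In this sketch, tokens absent from the index are skipped, and each remaining token contributes its bitset of matching documents. A production system would likely use a compressed bitset structure rather than unbounded integers, and would apply a finer similarity check to the candidates it returns.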

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.