Patent · US Active

Filtering invalid tokens from a document using high IDF token filtering

US7908279B1 · kind B1 · utility

Cited by: 8
References: 23
Claims: 25
Family size: 0

Assignee

Inventors

Key dates

Filing date: Sep 17, 2007
Grant date: Mar 15, 2011
Priority date
Expiry date: Aug 28, 2029

Classification

  • Technology area (CPC G): Physics
  • CPC primary: G06F40/194
  • WIPO field: Computer technology
  • WIPO sector: Electrical engineering

Abstract

Systems and methods for filtering tokens from a document for determining whether the document describes substantially similar subject matter compared to another document are described. In one embodiment, a first document is obtained. This document is organized into a plurality of fields, and at least some of the fields include tokens representing the subject matter described by the document. A field of this document is selected and a token from within the selected field having the highest inverse document frequency (IDF) is selected. Those tokens that have a higher IDF than the selected token are removed. Using the remaining tokens, a determination is made as to whether the first document describes substantially similar subject matter to the subject matter described by a second document. An indication is provided as to whether the first document describes substantially similar subject matter to that described by a second document according to the determination.
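
The filtering step in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The abstract does not specify several things, so the following are assumptions made for illustration: the field names ("title", "description"), the IDF formula log(N / df), the use of Jaccard overlap as the similarity measure, the 0.7 decision threshold, and all function names (compute_idf, filter_high_idf, jaccard).

```python
import math
from collections import Counter


def compute_idf(corpus):
    """Return {token: log(N / df)}; df = number of documents containing the token."""
    doc_freq = Counter()
    for doc in corpus:
        # Count each token once per document, across all of its fields.
        doc_freq.update({tok for tokens in doc.values() for tok in tokens})
    n_docs = len(corpus)
    return {tok: math.log(n_docs / df) for tok, df in doc_freq.items()}


def filter_high_idf(doc, idf, reference_field):
    """Drop every token whose IDF exceeds the highest IDF in reference_field."""
    reference_tokens = doc.get(reference_field, [])
    if not reference_tokens:
        # Nothing to derive a cutoff from, so return an unfiltered copy.
        return {name: list(tokens) for name, tokens in doc.items()}
    # Assumption: tokens missing from the IDF table count as maximally rare.
    cutoff = max(idf.get(tok, math.inf) for tok in reference_tokens)
    return {
        name: [tok for tok in tokens if idf.get(tok, math.inf) <= cutoff]
        for name, tokens in doc.items()
    }


def jaccard(doc_a, doc_b):
    """Token-set overlap between two documents (assumed similarity measure)."""
    set_a = {tok for tokens in doc_a.values() for tok in tokens}
    set_b = {tok for tokens in doc_b.values() for tok in tokens}
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


# Hypothetical four-document corpus; "xq7z" stands in for an invalid token.
corpus = [
    {"title": ["acme", "usb", "cable"], "description": ["usb", "cable", "xq7z"]},
    {"title": ["acme", "usb", "cable"], "description": ["usb", "cable", "black"]},
    {"title": ["acme", "hdmi", "cable"], "description": ["hdmi", "cable", "black"]},
    {"title": ["usb", "hub"], "description": ["usb", "hub", "black"]},
]

idf = compute_idf(corpus)
first, second = corpus[0], corpus[1]
filtered_first = filter_high_idf(first, idf, reference_field="title")

SIMILARITY_THRESHOLD = 0.7  # assumed cutoff for "substantially similar"
score = jaccard(filtered_first, second)
is_similar = score >= SIMILARITY_THRESHOLD
print(filtered_first, score, is_similar)
```

In this toy corpus the rarest title token of the first document appears in three of the four documents, so any token rarer than that, such as "xq7z", is treated as noise and dropped before the comparison. Using the selected field's highest IDF as the cutoff means the threshold adapts to each document and does not rely on a fixed global value.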

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.