Filtering invalid tokens from a document using high IDF token filtering
US7908279B1 · kind B1 · utility
Assignee
Inventors
Key dates
| Filing date | Sep 17, 2007 |
| Grant date | Mar 15, 2011 |
| Priority date | — |
| Expiry date | Aug 28, 2029 |
Classification
- Technology area (CPC G): Physics
- CPC primary: G06F40/194
- WIPO field: Computer technology
- WIPO sector: Electrical engineering
Abstract
Systems and methods for filtering tokens from a document for determining whether the document describes substantially similar subject matter compared to another document are described. In one embodiment, a first document is obtained. This document is organized into a plurality of fields, and at least some of the fields include tokens representing the subject matter described by the document. A field of this document is selected and a token from within the selected field having the highest inverse document frequency (IDF) is selected. Those tokens that have a higher IDF than the selected token are removed. Using the remaining tokens, a determination is made as to whether the first document describes substantially similar subject matter to the subject matter described by a second document. An indication is provided as to whether the first document describes substantially similar subject matter to that described by a second document according to the determination.
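The filtering step described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. Everything below is an assumption made for illustration: the function names, the IDF formula log(N / df), the Jaccard overlap used as the similarity measure, and the 0.5 threshold. The abstract does not specify how IDF is computed or how similarity is determined.

```python
import math
from collections import Counter


def all_tokens(document):
    """Flatten a document (dict: field name -> list of tokens) into one list."""
    return [tok for tokens in document.values() for tok in tokens]


def compute_idf(corpus):
    """Map each token to log(N / df) over a list of documents.

    Assumption: this common IDF form is used only for illustration.
    """
    n_docs = len(corpus)
    doc_freq = Counter()
    for document in corpus:
        doc_freq.update(set(all_tokens(document)))
    return {tok: math.log(n_docs / df) for tok, df in doc_freq.items()}


def filter_high_idf(document, field, idf):
    """Drop tokens whose IDF exceeds the highest IDF found in `field`.

    Assumes `field` is non-empty and every token has an entry in `idf`.
    """
    threshold = max(idf[tok] for tok in document[field])
    return {
        name: [tok for tok in tokens if idf[tok] <= threshold]
        for name, tokens in document.items()
    }


def is_similar(doc_a, doc_b, min_overlap=0.5):
    """Hypothetical similarity test: Jaccard overlap of the token sets."""
    a, b = set(all_tokens(doc_a)), set(all_tokens(doc_b))
    if not a and not b:
        return False
    return len(a & b) / len(a | b) >= min_overlap


# Toy corpus: "xq9z" stands in for a rare, probably invalid token.
doc1 = {"title": ["red", "widget"], "description": ["red", "widget", "xq9z"]}
doc2 = {"title": ["red", "widget"], "description": ["red", "widget"]}
doc3 = {"title": ["blue", "widget"], "description": ["blue"]}
doc4 = {"title": ["red", "gadget"], "description": ["gadget"]}

idf = compute_idf([doc1, doc2, doc3, doc4])
filtered = filter_high_idf(doc1, "title", idf)
print(filtered)
print(is_similar(filtered, doc2))
```

In this toy corpus the title field of `doc1` contains only common tokens, so the rare token `xq9z` in the description has a higher IDF than anything in the title and is removed before the comparison with `doc2`.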
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.