Method for document comparison and classification using document image layout
US6542635B1 · kind B1 · utility
Assignee
Inventors
Key dates
| Filing date | Sep 8, 1999 |
| Grant date | Apr 1, 2003 |
| Priority date | — |
| Expiry date | Sep 8, 2019 |
Classification
- Technology area (CPC G): Physics
- CPC primary: G06V30/414
- WIPO field: Computer technology
- WIPO sector: Electrical engineering
Abstract
Document type comparison and classification using layout classification is accomplished by first segmenting a document page into blocks of text and white space. A grid of rows and columns, forming bins, is created on the page to intersect the blocks. Layout information is identified using a unique fixed-length interval vector to represent each row on the segmented document. By computing the Manhattan distance between interval vectors of all rows of two document pages and performing a warping function to determine the row-to-row correspondence, two documents may be compared by their layout. Furthermore, interval vectors may be grouped into N clusters with a cluster center, defined as the median of the interval vectors of the cluster, replacing each interval vector in its cluster. Using Hidden Markov Models, documents can be compared to document type models comprising rows represented by cluster centers and identified as belonging to one or more document types. In addition, documents stored in a database may be retrieved, deleted, or otherwise managed by type, using their corresponding vector sets without requiring expensive OCR of the document. Furthermore, based on the classificat…
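The row comparison step in the abstract can be illustrated with a minimal sketch. It assumes each page has already been reduced to a list of fixed-length integer vectors, one per grid row; how the patent encodes those interval vectors is not shown here. The "warping function" is taken to be a standard dynamic-time-warping recurrence, which is an assumption rather than the patent's stated procedure, and the names `manhattan` and `layout_distance` are hypothetical.

```python
from typing import Sequence

Row = Sequence[int]  # one fixed-length vector per grid row (encoding assumed)


def manhattan(a: Row, b: Row) -> int:
    """Sum of absolute differences between two equal-length row vectors."""
    if len(a) != len(b):
        raise ValueError("row vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


def layout_distance(page_a: Sequence[Row], page_b: Sequence[Row]) -> float:
    """Warped distance between two pages.

    Uses a dynamic-time-warping table (an assumed stand-in for the
    abstract's warping function): each row may align with one or more
    consecutive rows of the other page, and the total Manhattan cost of
    the best alignment is returned.
    """
    n, m = len(page_a), len(page_b)
    inf = float("inf")
    # cost[i][j] = best cost of aligning the first i rows of page_a
    # with the first j rows of page_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = manhattan(page_a[i - 1], page_b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # page_b row j also absorbs page_a row i
                cost[i][j - 1],      # page_a row i also absorbs page_b row j
                cost[i - 1][j - 1],  # one-to-one row match
            )
    return cost[n][m]


# Illustrative data only: page_b repeats its first row, so warping
# lets both copies align with the single matching row of page_a.
page_a = [[0, 1, 2], [3, 0, 0]]
page_b = [[0, 1, 2], [0, 1, 2], [3, 0, 0]]
```

With this recurrence, two pages whose rows differ only by repetition (for example, a taller text block spanning more grid rows) end up with a small distance, which is the kind of tolerance a row-to-row warping step would be expected to provide.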
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.