Patent · US Expired

Method for document comparison and classification using document image layout

US6542635B1 · kind B1 · utility

Cited by: 71
References: 6
Claims: 14
Family size: 0

Assignee

Inventors

Key dates

Filing date: Sep 8, 1999
Grant date: Apr 1, 2003
Priority date:
Expiry date: Sep 8, 2019

Classification

  • Technology area (CPC G): Physics
  • CPC primary: G06V30/414
  • WIPO field: Computer technology
  • WIPO sector: Electrical engineering

Abstract

Document type comparison and classification using layout classification is accomplished by first segmenting a document page into blocks of text and white space. A grid of rows and columns, forming bins, is created on the page to intersect the blocks. Layout information is identified using a unique fixed length interval vector, to represent each row on the segmented document. By computing the Manhattan distance between interval vectors of all rows of two document pages and performing a warping function to determine the row to row correspondence, two documents may be compared by their layout. Furthermore, interval vectors may be grouped into N clusters with a cluster center, defined as the median of the interval vectors of the cluster, replacing each interval vector in its cluster. Using Hidden Markov Models, documents can be compared to document type models comprising rows represented by cluster centers and identified as belonging to one or more document types. In addition, documents stored in a database may be retrieved, deleted, or otherwise managed by type, using their corresponding vector sets without requiring expensive OCR of the document. Furthermore, based on the classificat…
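The row-matching step in the abstract (Manhattan distance between interval vectors, followed by a warping function that yields row-to-row correspondence) can be illustrated with a minimal sketch. It assumes each row has already been encoded as a fixed-length list of numbers, and it uses a textbook dynamic time warping recurrence as a stand-in for the warping function. The abstract does not specify the interval encoding or the exact warping constraints, so the names `manhattan` and `warp_rows` and the sample vectors are hypothetical, not taken from the patent.

```python
from typing import List, Sequence, Tuple

Row = Sequence[float]  # one fixed-length interval vector per row (encoding assumed)


def manhattan(a: Row, b: Row) -> float:
    """L1 (Manhattan) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("interval vectors must have the same fixed length")
    return sum(abs(x - y) for x, y in zip(a, b))


def warp_rows(page_a: Sequence[Row], page_b: Sequence[Row]) -> Tuple[float, List[Tuple[int, int]]]:
    """Align the rows of two pages with standard dynamic time warping.

    Returns the total alignment cost and the list of (row_a, row_b)
    index pairs on the cheapest warping path. This is a generic DTW,
    assumed here as an illustration of the warping function.
    """
    n, m = len(page_a), len(page_b)
    if n == 0 or m == 0:
        raise ValueError("both pages need at least one row")

    inf = float("inf")
    # cost[i][j] = cheapest cost of aligning the first i rows of A with the first j rows of B
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = manhattan(page_a[i - 1], page_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])

    # Walk back from the end to recover the row-to-row correspondence.
    path: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


# Hypothetical interval vectors: page_b repeats the first row of page_a once.
page_a = [[0, 2, 4], [1, 1, 1], [3, 0, 0]]
page_b = [[0, 2, 4], [0, 2, 4], [1, 1, 1], [3, 0, 0]]
total, path = warp_rows(page_a, page_b)
```

In this toy input the second page simply duplicates one row, so the warping path maps that row twice and the total cost stays at zero. A small total cost would suggest similar layouts even when the pages differ in row count.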

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.