Patent · US Active

Method of identifying redundant text in an electronic document

US7643682B2 · kind B2 · utility

Cited by: 3
References: 3
Claims: 7
Family size: 0

Assignee

Inventor

Key dates

Filing date: Apr 18, 2006
Grant date: Jan 5, 2010
Priority date
Expiry date: Apr 26, 2028

Classification

  • Technology area (CPC G): Physics
  • CPC primary: G06F40/151
  • WIPO field: Computer technology
  • WIPO sector: Electrical engineering

Abstract

A method of identifying redundant text fragments, which create only artificial artifacts, in an electronic page description language document includes: a) providing a page having a plurality of text fragments, each text fragment comprising at least one glyph, the document including Unicode values for all glyphs, geometric information of all text fragments on the page, and page description language parameters of all glyphs; b) identifying two text fragments as redundant candidates if the text fragments have identical Unicode sequences; c) defining a bounding box of quadrangular shape for each of the two redundant candidates according to their font characteristics; d) calculating the overlapping area of the two bounding boxes; and e) determining whether the two candidates form redundant text fragments by comparing the ratio of the overlapping area to the area of the smaller of the two bounding boxes with a predetermined threshold.
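
Steps b) through e) of the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. Several simplifications are assumptions made here for illustration: bounding boxes are axis-aligned rectangles rather than general quadrangles, the boxes are given directly instead of being derived from font characteristics, the names `Fragment`, `overlap_area`, and `is_redundant` are hypothetical, the threshold value 0.8 is arbitrary, and a ratio at or above the threshold is taken to mean "redundant" (the abstract does not state the direction of the comparison).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """A text fragment with its Unicode sequence and an axis-aligned box.

    Assumption: the box (x0, y0)-(x1, y1) is already known; the abstract
    derives it from font characteristics, which is not modeled here.
    """
    text: str
    x0: float
    y0: float
    x1: float
    y1: float

    def area(self) -> float:
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


def overlap_area(a: Fragment, b: Fragment) -> float:
    """Step d): area of the intersection of the two bounding boxes."""
    width = min(a.x1, b.x1) - max(a.x0, b.x0)
    height = min(a.y1, b.y1) - max(a.y0, b.y0)
    return width * height if width > 0 and height > 0 else 0.0


def is_redundant(a: Fragment, b: Fragment, threshold: float = 0.8) -> bool:
    """Steps b) and e): identical Unicode sequences, then the overlap ratio.

    The threshold of 0.8 is a hypothetical value, not taken from the patent.
    """
    # Step b): only fragments with identical Unicode sequences are candidates.
    if a.text != b.text:
        return False
    smaller = min(a.area(), b.area())
    if smaller <= 0:
        return False  # degenerate box: no meaningful ratio
    # Step e): overlap relative to the smaller box, compared with the threshold.
    return overlap_area(a, b) / smaller >= threshold


# A heading drawn twice with a one-unit offset (for example, a shadow effect).
title = Fragment("Title", 10, 10, 60, 22)
shadow = Fragment("Title", 11, 11, 61, 23)
elsewhere = Fragment("Title", 200, 10, 250, 22)

print(is_redundant(title, shadow))      # same text, boxes nearly coincide
print(is_redundant(title, elsewhere))   # same text, boxes do not overlap
```

Dividing by the smaller box rather than by the union means that a short fragment lying entirely inside a larger box with the same text still yields a ratio of 1.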

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.