Method of identifying redundant text in an electronic document
US7643682B2 · kind B2 · utility
Assignee
Inventor
Key dates
| Filing date | Apr 18, 2006 |
| Grant date | Jan 5, 2010 |
| Priority date | — |
| Expiry date | Apr 26, 2028 |
Classification
- Technology area (CPC G): Physics
- CPC primary: G06F40/151
- WIPO field: Computer technology
- WIPO sector: Electrical engineering
Abstract
A method of identifying redundant text fragments, which create artificial artifacts only, in an electronic page description language document includes: a) providing a page having a plurality of text fragments, each text fragment comprising at least one glyph, the document including Unicode values for all glyphs, geometric information for all text fragments on the page, and page description language parameters for all glyphs; b) identifying two text fragments as redundant candidates if they have identical Unicode sequences; c) defining a bounding box of quadrangular shape for each of the two redundant candidates according to its font characteristics; d) calculating the overlapping area of the two bounding boxes; and e) determining whether the two candidates form redundant text fragments by comparing the ratio of the overlapping area to the area of the smaller of the two bounding boxes with a predetermined threshold.
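Steps b) through e) can be illustrated with a minimal Python sketch. It is not the patented implementation and rests on several assumptions: bounding boxes are simplified to axis-aligned rectangles that are supplied directly (the abstract derives quadrangular boxes from font characteristics), the names `Fragment`, `overlap_area`, and `is_redundant` are hypothetical, the threshold value of 0.8 is arbitrary, and "at or above the threshold" is assumed to mean redundant, which the abstract does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """Hypothetical text fragment: Unicode text plus an axis-aligned box."""
    text: str
    x0: float  # left edge
    y0: float  # bottom edge
    x1: float  # right edge
    y1: float  # top edge

    @property
    def area(self) -> float:
        # Degenerate or inverted boxes are treated as having zero area.
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


def overlap_area(a: Fragment, b: Fragment) -> float:
    """Step d): area of the intersection of the two bounding boxes."""
    width = min(a.x1, b.x1) - max(a.x0, b.x0)
    height = min(a.y1, b.y1) - max(a.y0, b.y0)
    return width * height if width > 0 and height > 0 else 0.0


def is_redundant(a: Fragment, b: Fragment, threshold: float = 0.8) -> bool:
    """Steps b) and e): same Unicode sequence, then overlap-ratio test.

    The threshold default and the >= comparison are assumptions.
    """
    # Step b): only fragments with identical Unicode sequences are candidates.
    if a.text != b.text:
        return False
    smaller = min(a.area, b.area)
    if smaller <= 0.0:
        return False
    # Step e): overlap relative to the smaller of the two boxes.
    return overlap_area(a, b) / smaller >= threshold


# Example: the same word drawn twice with a small offset (a typical way
# of faking bold text) versus the same word repeated elsewhere on the page.
base = Fragment("Hello", 0.0, 0.0, 50.0, 10.0)
shadow = Fragment("Hello", 0.5, 0.0, 50.5, 10.0)
elsewhere = Fragment("Hello", 100.0, 0.0, 150.0, 10.0)

print(is_redundant(base, shadow))     # near-total overlap
print(is_redundant(base, elsewhere))  # no overlap
```

Dividing by the smaller box, as the abstract specifies, means a short fragment fully covered by a larger one with the same text still yields a ratio of 1.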
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.