Method of identifying semantic units in an electronic document
US7705848B2 · kind B2 · utility
Assignee
Inventor
Key dates
| Filing date | Apr 18, 2006 |
| Grant date | Apr 27, 2010 |
| Priority date | — |
| Expiry date | Feb 25, 2029 |
Classification
- Technology area (CPC G)Physics
- CPC primaryG06V30/414
- WIPO fieldComputer technology
- WIPO sectorElectrical engineering
Abstract
A method of identifying semantic units in an electronic document includes the steps of: providing an electronic document being described in a page description language, the document having at least one page having a plurality of text fragments, each text fragment including a plurality of glyphs that have not been identified as semantic units, the document further including geometric information and page description language parameters; determining strips of at least one glyph by comparing the geometric position of subsequent glyphs; determining zones of at least one strip wherein a zone is defined by the combined area of strips, the geometrical areas of which overlap with each other; determining a boundary between two semantic units in a zone based on the geometric properties of the glyphs; sorting the identified semantic units in the zone in a sorted list; and, combining subsequent semantic units in the sorted list according to geometric considerations.
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.