Webpage entity extraction through joint understanding of page structures and sentences
US9092424B2 · kind B2 · utility
Assignee
Inventors
Key dates
| Filing date | Sep 30, 2009 |
| Grant date | Jul 28, 2015 |
| Priority date | — |
| Expiry date | Jan 16, 2033 |
Classification
- Technology area (CPC G)Physics
- CPC primaryG06F40/295
- WIPO fieldComputer technology
- WIPO sectorElectrical engineering
Abstract
Described is a technology for understanding entities of a webpage, e.g., to label the entities on the webpage. An iterative and bidirectional framework processes a webpage, including a text understanding component (e.g., extended Semi-CRF model) that provides text segmentation features to a structure understanding component (e.g., extended HCRF model). The structure understanding component uses the text segmentation features and visual layout features of the webpage to identify a structure (e.g., labeled block). The text understanding component in turn uses the labeled block to further understand the text. The process continues iteratively until a similarity criterion is met, at which time the entities may be labeled. Also described is the use of multiple mentions of a set of text in the webpage to help in labeling an entity.
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.