Patent · US Active

Webpage entity extraction through joint understanding of page structures and sentences

US9092424B2 · kind B2 · utility

Cited by: 1
References: 7
Claims: 20
Family size: 0

Assignee

Inventors

Key dates

Filing date: Sep 30, 2009
Grant date: Jul 28, 2015
Priority date
Expiry date: Jan 16, 2033

Classification

  • Technology area (CPC G): Physics
  • CPC primary: G06F40/295
  • WIPO field: Computer technology
  • WIPO sector: Electrical engineering

Abstract

Described is a technology for understanding entities of a webpage, e.g., to label the entities on the webpage. An iterative and bidirectional framework processes a webpage, including a text understanding component (e.g., extended Semi-CRF model) that provides text segmentation features to a structure understanding component (e.g., extended HCRF model). The structure understanding component uses the text segmentation features and visual layout features of the webpage to identify a structure (e.g., labeled block). The text understanding component in turn uses the labeled block to further understand the text. The process continues iteratively until a similarity criterion is met, at which time the entities may be labeled. Also described is the use of multiple mentions of a set of text in the webpage to help in labeling an entity.
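
The iterative, bidirectional loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented method: the two rule-based functions are toy stand-ins for the extended Semi-CRF and extended HCRF models, visual layout features are omitted, and the similarity criterion is assumed to be the fraction of token labels that stay the same between two passes. All names (joint_extract, toy_text_model, toy_structure_model, agreement) are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Page = Dict[str, str]             # block id -> raw text of that block
BlockLabels = Dict[str, str]      # block id -> structure label
Segments = Dict[str, List[str]]   # block id -> one entity label per token


def agreement(a: Segments, b: Segments) -> float:
    """Fraction of token labels that are identical in two segmentations."""
    total = same = 0
    for block_id, labels in a.items():
        other = b.get(block_id, [])
        total += max(len(labels), len(other))
        same += sum(x == y for x, y in zip(labels, other))
    return same / total if total else 1.0


def toy_text_model(page: Page, block_labels: BlockLabels) -> Segments:
    """Stand-in for text understanding: labels tokens, using the block label as context."""
    out: Segments = {}
    for block_id, text in page.items():
        in_contact = block_labels[block_id] == "contact"
        out[block_id] = [
            "PHONE" if tok.replace("-", "").isdigit()
            else "NAME" if in_contact and tok.istitle()
            else "O"
            for tok in text.split()
        ]
    return out


def toy_structure_model(page: Page, segments: Segments) -> BlockLabels:
    """Stand-in for structure understanding: labels blocks from text segmentation features."""
    return {
        block_id: "contact" if "PHONE" in segments[block_id] else "other"
        for block_id in page
    }


def joint_extract(
    page: Page,
    text_model: Callable[[Page, BlockLabels], Segments],
    structure_model: Callable[[Page, Segments], BlockLabels],
    threshold: float = 1.0,
    max_iters: int = 10,
) -> Tuple[BlockLabels, Segments]:
    """Alternate the two models until successive segmentations agree."""
    block_labels: BlockLabels = {block_id: "unknown" for block_id in page}
    segments = text_model(page, block_labels)
    for _ in range(max_iters):
        # Text segmentation features inform the structure labels ...
        block_labels = structure_model(page, segments)
        # ... and the labeled blocks in turn refine the text segmentation.
        new_segments = text_model(page, block_labels)
        converged = agreement(segments, new_segments) >= threshold
        segments = new_segments
        if converged:
            break
    return block_labels, segments


page = {"b1": "Jane Doe 555-1234", "b2": "Welcome to our site"}
block_labels, segments = joint_extract(page, toy_text_model, toy_structure_model)
```

In this toy run, the first text pass only recognizes the phone number. Once the structure pass labels block b1 as a contact block, the second text pass can also label the capitalized tokens in that block as a name, which is the kind of mutual reinforcement the abstract describes.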

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.