Patent · US Active

Webpage entity extraction through joint understanding of page structures and sentences

US9092424B2 · kind B2 · utility

Cited by	1
References	7
Claims	20
Family size	0

Assignee

MICROSOFT TECHNOLOGY LICENSING, LLC · US

Inventors

Zaiqing Nie · Beijing, CN
Yong Cao · Nanhu, CN
Ji-Rong Wen · Beijing, CN
Chunyu Yang · Beijing, CN

Key dates

Filing date	Sep 30, 2009
Grant date	Jul 28, 2015
Priority date	—
Expiry date	Jan 16, 2033

Classification

Technology area (CPC G)	Physics
CPC primary	G06F40/295
WIPO field	Computer technology
WIPO sector	Electrical engineering

Abstract

Described is a technology for understanding entities of a webpage, e.g., to label the entities on the webpage. An iterative and bidirectional framework processes a webpage, including a text understanding component (e.g., extended Semi-CRF model) that provides text segmentation features to a structure understanding component (e.g., extended HCRF model). The structure understanding component uses the text segmentation features and visual layout features of the webpage to identify a structure (e.g., labeled block). The text understanding component in turn uses the labeled block to further understand the text. The process continues iteratively until a similarity criterion is met, at which time the entities may be labeled. Also described is the use of multiple mentions of a set of text in the webpage to help in labeling an entity.
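
The iterative, bidirectional loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented method: the two components are simple rule-based stand-ins rather than extended Semi-CRF or HCRF models, and every name here (`Block`, `understand_text`, `understand_structure`, `similarity`, `label_page`, the `font_size` feature, the label strings, and the thresholds) is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Block:
    text: str
    font_size: int          # assumed stand-in for a visual layout feature
    label: str = "unknown"  # structure label, refined on each iteration


def understand_text(block):
    """Stand-in for the text understanding component (not a Semi-CRF).

    Splits the block text into segments and tags each one, using the
    block's current structure label as a hint.
    """
    segments = []
    for piece in block.text.split(","):
        piece = piece.strip()
        if not piece:
            continue
        if block.label == "name_block" or piece.istitle():
            tag = "NAME"
        else:
            tag = "OTHER"
        segments.append((piece, tag))
    return segments


def understand_structure(block, segments):
    """Stand-in for the structure understanding component (not an HCRF).

    Combines text segmentation features with a visual layout feature.
    """
    name_ratio = sum(tag == "NAME" for _, tag in segments) / max(len(segments), 1)
    if block.font_size >= 14 and name_ratio >= 0.5:
        return "name_block"
    return "other_block"


def similarity(previous, current):
    """Fraction of blocks whose segmentation did not change."""
    if not current:
        return 1.0
    same = sum(p == c for p, c in zip(previous, current))
    return same / len(current)


def label_page(blocks, threshold=1.0, max_iters=10):
    """Alternate the two components until the segmentations stabilize."""
    previous = None
    current = []
    for _ in range(max_iters):
        # Text understanding uses the block labels from the last pass.
        current = [understand_text(b) for b in blocks]
        # Structure understanding uses the fresh segmentation features.
        for block, segments in zip(blocks, current):
            block.label = understand_structure(block, segments)
        if previous is not None and similarity(previous, current) >= threshold:
            break
        previous = current
    return current


blocks = [
    Block("Jane Doe, john smith", font_size=18),
    Block("contact us, about", font_size=10),
]
result = label_page(blocks)
```

In this toy run, the first pass tags only "Jane Doe" as a name. That is enough for the structure step to label the block as a name block, and the second pass then uses that label to tag "john smith" as well. The loop stops once two consecutive segmentations match, which stands in for the similarity criterion mentioned in the abstract.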

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.