Patent · US Active

System and method for web content extraction

US8819028B2 · kind B2 · utility

Cited by	4

References	1

Claims	14

Family size	0

Assignee

HEWLETT-PACKARD DEVELOPMENT COMPANY, L.P. · US

Inventors

Ping Luo · Sha Tin, CN
Jian Fan · Cupertino, US
Samson J. Liu · Menlo Park, US
Yuhong Xiong · Fremont, US
Jerry Liu · Palo Alto, US

Key dates

Filing date	Dec 14, 2009
Grant date	Aug 26, 2014
Priority date	—
Expiry date	Dec 14, 2029

Classification

Technology area (CPC G)	Physics
CPC primary	G06F3/1246
WIPO field	Computer technology
WIPO sector	Electrical engineering

Abstract

A method and system for extracting Web content are disclosed. In one embodiment, Web content in a Webpage is extracted by identifying paragraphs in the Web content based on line-break node determination. A range of text-body associated with the identified paragraphs is then identified using a maximum scoring subsequence. Further, the identified text-body is refined using a heuristic rule of substantially horizontal alignment. Furthermore, one or more titles and one or more images associated with the Web content are extracted. Moreover, the Web content, including the identified paragraphs, the one or more titles, and the one or more images, is output.
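
The text-body step in the abstract can be illustrated with a minimal sketch. The abstract does not say how paragraphs are scored, so the scoring used here (word count minus a threshold) is an assumption made only for illustration, and the function names `paragraph_scores` and `max_scoring_subsequence` are hypothetical. Given one score per paragraph in document order, the sketch finds the contiguous run with the highest total score, which is one common reading of "maximum scoring subsequence".

```python
from typing import List, Tuple


def paragraph_scores(paragraphs: List[str], min_words: int = 10) -> List[int]:
    """Assumed scoring (not from the patent): long paragraphs score
    positive, short ones (menus, captions, footers) score negative."""
    return [len(p.split()) - min_words for p in paragraphs]


def max_scoring_subsequence(scores: List[float]) -> Tuple[int, int, float]:
    """Return (start, end, total) for the contiguous run of scores with
    the highest sum. Both indices are inclusive."""
    if not scores:
        raise ValueError("scores must not be empty")

    best_start = best_end = 0
    best_sum = scores[0]
    cur_start = 0
    cur_sum = 0

    for i, score in enumerate(scores):
        if cur_sum <= 0:
            # A non-positive running total cannot help; restart here.
            cur_start = i
            cur_sum = score
        else:
            cur_sum += score
        if cur_sum > best_sum:
            best_sum = cur_sum
            best_start, best_end = cur_start, i

    return best_start, best_end, best_sum


# Hypothetical page: two navigation lines, body text, then footer lines.
scores = [-8, -5, 20, 15, -3, 30, -9, -7]
print(max_scoring_subsequence(scores))  # (2, 5, 62)
```

In this sketch the short paragraph scoring -3 stays inside the selected range because the body text on either side outweighs it, while the navigation and footer lines at the edges are excluded. The later refinement by horizontal alignment and the title and image extraction are not modeled.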

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.