Patent · US Active

System and method for web content extraction

US8819028B2 · kind B2 · utility

Cited by: 4
References: 1
Claims: 14
Family size: 0

Assignee

Inventors

Key dates

Filing date: Dec 14, 2009
Grant date: Aug 26, 2014
Priority date
Expiry date: Dec 14, 2029

Classification

  • Technology area (CPC G): Physics
  • CPC primary: G06F3/1246
  • WIPO field: Computer technology
  • WIPO sector: Electrical engineering

Abstract

A method and system for extracting Web content is disclosed. In one embodiment, Web content in a Webpage is extracted by identifying paragraphs in the Web content based on line-break node determination. A range of text-body associated with the identified paragraphs is then identified using a maximum scoring subsequence. Further, the identified text-body is refined using a heuristic rule of substantially horizontal alignment. Furthermore, one or more titles and one or more images associated with the Web content are extracted. Moreover, the Web content including the identified paragraphs, the one or more titles and the one or more images are outputted.
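The text-body step in the abstract can be illustrated with a minimal sketch of a maximum scoring subsequence search: given one score per identified paragraph, find the contiguous run of paragraphs whose scores sum highest. The abstract does not say how paragraphs are scored, so the function name `max_scoring_subsequence` and the example scores (positive for text-heavy paragraphs, negative for link-heavy or boilerplate ones) are assumptions for illustration, not the patented method.

```python
from typing import List, Tuple


def max_scoring_subsequence(scores: List[float]) -> Tuple[int, int, float]:
    """Return (start, end, total) for the contiguous run with the highest sum.

    `end` is exclusive. If no run has a positive total, returns (0, 0, 0.0),
    meaning "no text body found".
    """
    best_start, best_end, best_total = 0, 0, 0.0
    run_start, run_total = 0, 0.0
    for i, score in enumerate(scores):
        if run_total <= 0:
            # A non-positive running total cannot help; restart the run here.
            run_start, run_total = i, score
        else:
            run_total += score
        if run_total > best_total:
            best_start, best_end, best_total = run_start, i + 1, run_total
    return best_start, best_end, best_total


# Hypothetical per-paragraph scores (assumed, not taken from the patent):
# navigation and footer paragraphs score negative, article text positive.
paragraph_scores = [-2.0, -1.0, 3.0, 4.0, -1.0, 5.0, -6.0, 1.0]
start, end, total = max_scoring_subsequence(paragraph_scores)
print(start, end, total)
```

In this sketch the selected range covers paragraphs 2 through 5, so a weak paragraph inside the body (score -1.0) is kept because the paragraphs around it outweigh it. A single linear pass is enough, which suits pages with many paragraphs.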

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.