System and method for web content extraction
US8819028B2 · kind B2 · utility
Assignee
Inventors
Key dates
| Filing date | Dec 14, 2009 |
| Grant date | Aug 26, 2014 |
| Priority date | — |
| Expiry date | Dec 14, 2029 |
Classification
- Technology area (CPC G): Physics
- CPC primary: G06F3/1246
- WIPO field: Computer technology
- WIPO sector: Electrical engineering
Abstract
A method and system for extracting Web content is disclosed. In one embodiment, Web content in a Webpage is extracted by identifying paragraphs in the Web content based on line-break node determination. A range of text-body associated with the identified paragraphs is then identified using a maximum scoring subsequence. Further, the identified text-body is refined using a heuristic rule of substantially horizontal alignment. Furthermore, one or more titles and one or more images associated with the Web content are extracted. Moreover, the Web content, including the identified paragraphs, the one or more titles, and the one or more images, is output.
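The maximum scoring subsequence step named in the abstract can be illustrated with a short sketch. It is not the patented implementation: the abstract does not say how paragraphs are scored, so `paragraph_score`, its link penalty, and its `threshold` are hypothetical stand-ins, and the range search is the standard linear-time maximum-sum contiguous run (Kadane's algorithm), assumed here to be what "maximum scoring subsequence" refers to.

```python
from typing import List, Tuple


def paragraph_score(words: int, link_words: int, threshold: int = 10) -> float:
    """Hypothetical scorer (an assumption, not from the patent):
    reward plain text, penalise link text, subtract a fixed threshold
    so that short or link-heavy paragraphs score negative."""
    return float(words - 2 * link_words - threshold)


def max_scoring_subsequence(scores: List[float]) -> Tuple[int, int, float]:
    """Return (start, end, total) for the contiguous run of scores with
    the highest sum; `end` is exclusive. Returns (0, 0, 0.0) when no run
    has a positive sum."""
    best_start, best_end, best_sum = 0, 0, 0.0
    cur_start, cur_sum = 0, 0.0
    for i, score in enumerate(scores):
        if cur_sum <= 0:
            # A non-positive prefix cannot help; restart the run here.
            cur_start, cur_sum = i, score
        else:
            cur_sum += score
        if cur_sum > best_sum:
            best_start, best_end, best_sum = cur_start, i + 1, cur_sum
    return best_start, best_end, best_sum


# Illustrative page: (word count, words inside links) per paragraph.
paragraphs = [(8, 8), (5, 5), (60, 2), (45, 0), (6, 0), (80, 3), (12, 12)]
scores = [paragraph_score(w, lw) for w, lw in paragraphs]
start, end, total = max_scoring_subsequence(scores)
print(start, end, total)  # 2 6 141.0
```

In this made-up page, the two leading navigation paragraphs and the trailing footer score negative and fall outside the selected range, while the short low-scoring paragraph at index 4 stays inside because the paragraphs around it outweigh it.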
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.