Extracting data content items using template matching
US7765236B2 · kind B2 · utility
Assignee
Inventors
Key dates
| Filing date | Aug 31, 2007 |
| Grant date | Jul 27, 2010 |
| Priority date | — |
| Expiry date | Jun 9, 2028 |
Classification
- Technology area (CPC G): Physics
- CPC primary: G06F16/986
- WIPO field: Computer technology
- WIPO sector: Electrical engineering
Abstract
Systems and methods for extracting data content items from a web page are provided. A template is created by labeling data content items of interest associated with a web page and generating a template Document Object Model (DOM) tree based on the labeled web page. DOM trees are also generated for additional web pages that contain data content items for which extraction may be desired. These DOM trees are compared to the template DOM tree to determine alignment therebetween. The aligned data content items may then be extracted from the additional web pages and indexed, as desired. Labeling the data content items of interest prior to generating a template DOM tree allows for the desired data content items to be specified and more accurately extracted from related and/or similarly structured web pages.
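The flow described in the abstract (label a page, build a template DOM tree, align other pages against it, extract the aligned items) can be illustrated with a minimal sketch. This is not the patented method: the abstract does not state how alignment is computed, so the sketch assumes the simplest possible rule, namely that a node in a new page aligns with a labeled template node when it sits at the same tag path. The `data-label` attribute, the function names, and the sample pages are all hypothetical, and the input is assumed to be well-formed XHTML so that the standard-library XML parser can read it.

```python
import xml.etree.ElementTree as ET

# Hypothetical attribute used to mark data content items of interest
# in the template page; the patent abstract does not name one.
LABEL_ATTR = "data-label"


def labeled_paths(template_root):
    """Map each label to its path from the root.

    A path is a list of (tag, index) steps, where index counts
    siblings that share the same tag.
    """
    paths = {}

    def walk(node, path):
        label = node.get(LABEL_ATTR)
        if label is not None:
            paths[label] = list(path)
        counts = {}
        for child in node:
            idx = counts.get(child.tag, 0)
            counts[child.tag] = idx + 1
            walk(child, path + [(child.tag, idx)])

    walk(template_root, [])
    return paths


def follow(root, path):
    """Follow a path in another DOM tree; return None if it does not exist."""
    node = root
    for tag, idx in path:
        same_tag = [child for child in node if child.tag == tag]
        if idx >= len(same_tag):
            return None
        node = same_tag[idx]
    return node


def extract(template_markup, page_markup):
    """Extract the text of nodes aligned with labeled template nodes.

    Assumption: alignment means an identical tag path. A real system
    would likely need a more tolerant tree comparison.
    """
    template_root = ET.fromstring(template_markup)
    page_root = ET.fromstring(page_markup)
    if template_root.tag != page_root.tag:
        return {}
    items = {}
    for label, path in labeled_paths(template_root).items():
        node = follow(page_root, path)
        if node is not None:
            items[label] = "".join(node.itertext()).strip()
    return items


# Hypothetical sample pages with the same structure.
TEMPLATE = (
    '<html><body><div><h1 data-label="title">Sample Widget</h1>'
    '<p>Intro</p><span data-label="price">$1.00</span></div></body></html>'
)
PAGE = (
    "<html><body><div><h1>Blue Kettle</h1>"
    "<p>Other intro</p><span>$24.99</span></div></body></html>"
)

items = extract(TEMPLATE, PAGE)
print(items)
```

Exact path matching fails as soon as a page inserts or drops an element above a labeled node, which is presumably why the abstract speaks of comparing trees to determine alignment rather than of fixed paths; the sketch only shows how labels in the template decide which items are pulled from structurally similar pages.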
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.