System and method for crawling web-content
US11321400B2 · kind B2 · utility
Assignee
Inventors
Key dates
| Filing date | Jul 16, 2019 |
| Grant date | May 3, 2022 |
| Priority date | — |
| Expiry date | Jul 16, 2040 |
Classification
- Technology area (CPC G): Physics
- CPC primary: G06N3/08
- WIPO field: Computer technology
- WIPO sector: Electrical engineering
Abstract
Disclosed is a system comprising: a data repository storing web-content; and a data processing arrangement communicatively coupled to the data repository, wherein the data processing arrangement is configured to: acquire a web-page signature file associated with the web-content from a web-server hosting a website for displaying the web-content, wherein the web-page signature file includes a plurality of data related to the web-content; analyse the plurality of data included in the web-page signature file to identify a modification in the website; compare the web-content stored in the data repository with the web-content displayed on the website to determine additional web-content included in the web-content displayed on the website; use a machine learning algorithm to determine an importance value for the additional web-content using a set of predefined parameters; crawl the web-content stored in the data repository based on the additional web-content upon determining the importance value to be greater than a predefined threshold value; and predict a time for crawling the web-content using a forecast module.
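The steps recited in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. Every name in it (`SignatureEntry`, `find_additional`, `importance`, `select_for_crawl`, `forecast_next_crawl`, `WEIGHTS`) is hypothetical, as are the signature-file fields, the two scoring parameters, the 0.5 threshold, and the one-day default. A fixed weighted sum stands in for the machine learning algorithm, and a mean-interval estimate stands in for the forecast module, since the abstract specifies neither.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class SignatureEntry:
    """One record of a hypothetical web-page signature file."""
    url: str
    last_modified: datetime
    content_hash: str


def find_additional(stored: dict[str, str],
                    signature: list[SignatureEntry]) -> list[SignatureEntry]:
    """Return entries whose hash is absent from, or differs from, the repository."""
    return [e for e in signature if stored.get(e.url) != e.content_hash]


# Assumed weights; a trained model would replace this fixed linear score.
WEIGHTS = {"recency": 0.6, "shallowness": 0.4}


def importance(entry: SignatureEntry, now: datetime) -> float:
    """Score in (0, 1]: recently modified, shallow pages score higher."""
    age_days = max((now - entry.last_modified).total_seconds() / 86400.0, 0.0)
    recency = 1.0 / (1.0 + age_days)
    # Count path separators after the scheme as a rough depth measure.
    depth = entry.url.split("://", 1)[-1].strip("/").count("/")
    shallowness = 1.0 / (1.0 + depth)
    return WEIGHTS["recency"] * recency + WEIGHTS["shallowness"] * shallowness


def select_for_crawl(stored: dict[str, str],
                     signature: list[SignatureEntry],
                     now: datetime,
                     threshold: float = 0.5) -> list[str]:
    """URLs of additional content whose importance exceeds the threshold."""
    return [e.url for e in find_additional(stored, signature)
            if importance(e, now) > threshold]


def forecast_next_crawl(modification_times: list[datetime]) -> datetime:
    """Last modification plus the mean gap between past modifications."""
    if not modification_times:
        raise ValueError("at least one modification time is required")
    times = sorted(modification_times)
    if len(times) < 2:
        return times[-1] + timedelta(days=1)  # assumed default interval
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return times[-1] + sum(gaps, timedelta()) / len(gaps)
```

In this sketch, a page is treated as additional web-content when its hash in the signature file does not match the hash held in the repository, and only pages scoring above the threshold are queued for crawling. Calling `forecast_next_crawl` with the site's past modification times gives a candidate time for the next crawl.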
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.