Patent · US Active

System and method for crawling web-content

US11321400B2 · kind B2 · utility

Cited by: 0
References: 0
Claims: 15
Family size: 0

Assignee

Inventors

Key dates

Filing date: Jul 16, 2019
Grant date: May 3, 2022
Priority date
Expiry date: Jul 16, 2040

Classification

  • Technology area (CPC G): Physics
  • CPC primary: G06N3/08
  • WIPO field: Computer technology
  • WIPO sector: Electrical engineering

Abstract

Disclosed is a system comprising: a data repository storing web-content; and a data processing arrangement communicatively coupled to the data repository, wherein the data processing arrangement is configured to: acquire a web-page signature file associated with the web-content from a web-server hosting a website for displaying the web-content, wherein the web-page signature file includes a plurality of data related to the web-content; analyse the plurality of data included in the web-page signature file to identify a modification in the website; compare the web-content stored in the data repository with the web-content displayed on the website to determine additional web-content included in the web-content displayed on the website; use a machine learning algorithm to determine an importance value for the additional web-content using a set of predefined parameters; crawl the web-content stored in the data repository based on the additional web-content upon determining the importance value to be greater than a predefined threshold value; and predict a time for crawling the web-content using a forecast module.
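
The steps listed in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the abstract does not specify the signature file format, the machine learning algorithm, the predefined parameters, or the forecast module. Every name below (`Signature`, `modified_pages`, `additional_content`, `importance`, `should_crawl`, `predict_next_crawl`) is hypothetical. The sketch assumes the signature file maps page URLs to content hashes, substitutes a keyword-fraction score for the machine learning model, and substitutes a mean-interval estimate for the forecast module.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Signature:
    """Hypothetical signature file: maps a page URL to a content hash."""
    page_hashes: Dict[str, str]


def modified_pages(old: Signature, new: Signature) -> Set[str]:
    """Return URLs that are new, or whose hash differs from the older signature."""
    return {
        url for url, digest in new.page_hashes.items()
        if old.page_hashes.get(url) != digest
    }


def additional_content(stored: List[str], live: List[str]) -> List[str]:
    """Return items shown on the live site that are absent from the repository."""
    seen = set(stored)
    return [item for item in live if item not in seen]


def importance(item: str, keywords: Set[str]) -> float:
    """Stand-in for the ML model (assumption): fraction of words that are keywords."""
    words = item.lower().split()
    if not words:
        return 0.0
    return sum(word in keywords for word in words) / len(words)


def should_crawl(items: List[str], keywords: Set[str], threshold: float) -> bool:
    """Crawl only if some additional item scores above the predefined threshold."""
    return any(importance(item, keywords) > threshold for item in items)


def predict_next_crawl(change_times: List[float]) -> float:
    """Stand-in forecast (assumption): last change plus the mean past interval."""
    if len(change_times) < 2:
        raise ValueError("need at least two observed change times")
    intervals = [b - a for a, b in zip(change_times, change_times[1:])]
    return change_times[-1] + sum(intervals) / len(intervals)
```

In this sketch a crawl would be triggered by calling `modified_pages` on two successive signatures, diffing the affected content with `additional_content`, and passing the result to `should_crawl`; `predict_next_crawl` would then schedule the next visit from the observed change history.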

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.