Patent · US Active

System and method for crawling web-content

US11321400B2 · kind B2 · utility

Cited by	0

References	0

Claims	15

Family size	0

Assignee

Innoplexus AG · DE

Inventors

Shubhojit Mallick · New Delhi, IN
Sandeep Singh · Mumbai, IN
Vatsal Agarwal · Rampur, IN

Key dates

Filing date	Jul 16, 2019
Grant date	May 3, 2022
Priority date	—
Expiry date	Jul 16, 2040

Classification

Technology area (CPC G)	Physics
CPC primary	G06N3/08
WIPO field	Computer technology
WIPO sector	Electrical engineering

Abstract

Disclosed is a system comprising: a data repository storing web-content; and a data processing arrangement communicatively coupled to the data repository, wherein the data processing arrangement is configured to: acquire a web-page signature file associated with the web-content from a web-server hosting a website for displaying the web-content, wherein the web-page signature file includes a plurality of data related to the web-content; analyse the plurality of data included in the web-page signature file to identify a modification in the website; compare the web-content stored in the data repository with the web-content displayed on the website to determine additional web-content included in the web-content displayed on the website; use a machine learning algorithm to determine an importance value for the additional web-content using a set of predefined parameters; crawl the web-content stored in the data repository based on the additional web-content upon determining the importance value to be greater than a predefined threshold value; and predict a time for crawling the web-content using a forecast module.
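
Illustrative sketch (not part of the patent record): the steps recited in the abstract can be pictured roughly as the Python below. Everything in it is an assumption made for illustration. The signature file is modelled as a list of URL and last-modified entries, a keyword-hit ratio stands in for the machine learning algorithm, and a mean-interval estimate stands in for the forecast module. All names (SignatureEntry, modified_urls, additional_content, importance, should_crawl, predict_next_crawl) are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Set


@dataclass
class SignatureEntry:
    """One record of a hypothetical web-page signature file."""
    url: str
    last_modified: datetime


def modified_urls(signature: List[SignatureEntry], last_crawl: datetime) -> List[str]:
    """Identify pages modified since the last crawl."""
    return [e.url for e in signature if e.last_modified > last_crawl]


def additional_content(stored: Set[str], displayed: Set[str]) -> Set[str]:
    """Content shown on the website but absent from the repository."""
    return displayed - stored


def importance(items: Set[str], keywords: List[str]) -> float:
    """Stand-in for the ML model: share of new items mentioning a keyword."""
    if not items:
        return 0.0
    hits = sum(1 for text in items if any(k in text.lower() for k in keywords))
    return hits / len(items)


def should_crawl(score: float, threshold: float) -> bool:
    """Crawl only when the importance value exceeds the threshold."""
    return score > threshold


def predict_next_crawl(modification_times: List[datetime]) -> datetime:
    """Stand-in forecast: last modification plus the mean gap between modifications."""
    times = sorted(modification_times)
    if len(times) < 2:
        raise ValueError("need at least two modification times")
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    mean_gap = sum(gaps, timedelta()) / len(gaps)
    return times[-1] + mean_gap
```

A possible use: call modified_urls on the acquired signature entries, diff the repository against the live page with additional_content, score the difference with importance, and schedule the next visit with predict_next_crawl only if should_crawl returns True. The threshold and keyword list are tuning choices that the record above does not specify.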

Source: USPTO / EPO open patent data. Objective bibliographic data and citation counts.