Patent · US Expired

Method and system for incremental web crawling

US6631369B1 · kind B1 · utility

Cited by: 130
References: 5
Claims: 23
Family size: 0

Assignee

Inventors

Key dates

Filing date: Jun 30, 1999
Grant date: Oct 7, 2003
Priority date
Expiry date: Jun 30, 2019

Classification

  • Technology area (CPC Y): Emerging Cross-Sectional Technologies
  • CPC primary: Y10S707/99934
  • WIPO field: Computer technology
  • WIPO sector: Electrical engineering

Abstract

A Web crawler creates an index of documents in a document store on a computer network. In an initial crawl, the crawler creates a first full index for the document store. The first full crawl is based on a set of predefined “seed” URLs and crawl restrictions, and involves recursively retrieving each folder/document directly or indirectly linked to the seed URLs. In the process of creating the first full index, the crawler creates a History Table containing a list of URLs for each folder and document found in the first full crawl. The History Table also includes a local commit time (LCT) for each document and a deleted documents count (DDC) and LCT or maximum LCT (MLCT) for each folder (this assumes that the store supports a folder hierarchy and the MLCT, LCT and DDC properties). Thereafter, in an incremental crawl, the crawler determines, for each folder, (1) whether the DDC for that folder has changed and (2) whether the MLCT is more recent than the corresponding value in the History Table. If the DDC has changed, the crawler obtains a full list of items (URLs) in that folder, and compares the list with the URLs in the History Table to identify the deleted documents. T…
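
The per-folder check described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The names `FolderState` and `incremental_crawl` are hypothetical, the document store is stood in for by an in-memory dictionary, and commit times are plain integers. Because the abstract is truncated, the handling of a newer MLCT (re-indexing documents whose LCT is newer than the recorded one) and of previously unseen folders are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class FolderState:
    """Hypothetical History Table entry for one folder."""
    ddc: int                      # deleted documents count (DDC)
    mlct: int                     # maximum local commit time (MLCT)
    docs: Dict[str, int] = field(default_factory=dict)  # document URL -> LCT


def incremental_crawl(
    history: Dict[str, FolderState],
    store: Dict[str, FolderState],
) -> Tuple[List[str], List[str]]:
    """Compare the store against the History Table, folder by folder.

    Returns (deleted_urls, changed_urls) and updates `history` in place.
    """
    deleted: List[str] = []
    changed: List[str] = []
    for folder_url, current in store.items():
        known = history.get(folder_url)
        if known is None:
            # Assumption: a folder absent from the History Table is all new.
            changed.extend(sorted(current.docs))
            history[folder_url] = FolderState(
                current.ddc, current.mlct, dict(current.docs)
            )
            continue
        if current.ddc != known.ddc:
            # DDC changed: list the folder and diff against the History Table
            # to identify which documents were deleted.
            gone = sorted(set(known.docs) - set(current.docs))
            deleted.extend(gone)
            for url in gone:
                del known.docs[url]
            known.ddc = current.ddc
        if current.mlct > known.mlct:
            # Assumption: a newer MLCT means some document in the folder was
            # added or modified, so pick up those with a newer (or no) LCT.
            for url, lct in sorted(current.docs.items()):
                if url not in known.docs or lct > known.docs[url]:
                    changed.append(url)
                    known.docs[url] = lct
            known.mlct = current.mlct
    return deleted, changed
```

In this sketch, a folder whose DDC and MLCT both match the History Table is skipped without listing its contents, which is the saving an incremental crawl offers over a full one.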

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.