Patent · US · Expired

Method and system for incremental web crawling

US6631369B1 · kind B1 · utility

Cited by	130
References	5
Claims	23
Family size	0

Assignee

Microsoft Corporation · US

Inventors

Dmitriy Meyerzon · Bellevue, US
Srikanth Shoroff · Sammamish, US
F. Soner Terek · Bellevue, US
Sankrant Sanu · Redmond, US

Key dates

Filing date	Jun 30, 1999
Grant date	Oct 7, 2003
Priority date	—
Expiry date	Jun 30, 2019

Classification

Technology area (CPC Y)	Emerging Cross-Sectional Technologies
CPC primary	Y10S707/99934
WIPO field	Computer technology
WIPO sector	Electrical engineering

Abstract

A Web crawler creates an index of documents in a document store on a computer network. In an initial crawl, the crawler creates a first full index for the document store. The first full crawl is based on a set of predefined “seed” URLs and crawl restrictions, and involves recursively retrieving each folder/document directly or indirectly linked to the seed URLs. In the process of creating the first full index, the crawler creates a History Table containing a list of URLs for each folder and document found in the first full crawl. The History Table also includes a local commit time (LCT) for each document and a deleted documents count (DDC) and LCT or maximum LCT (MLCT) for each folder (this assumes that the store supports a folder hierarchy and the MLCT, LCT and DDC properties). Thereafter, in an incremental crawl, the crawler determines, for each folder, (1) whether the DDC for that folder has changed and (2) whether the MLCT is more recent than the corresponding value in the History Table. If the DDC has changed, the crawler obtains a full list of items (URLs) in that folder, and compares the list with the URLs in the History Table to identify the deleted documents. T…
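
The per-folder check described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `FolderRecord`, `FolderState`, `check_folder` and `list_items` are hypothetical, timestamps are assumed to be plain comparable numbers, and the History Table is modeled as one in-memory record per folder. Because the abstract is truncated before it says what follows a newer MLCT, the sketch only reports that condition as a flag.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Set, Tuple


@dataclass
class FolderRecord:
    """Hypothetical History Table entry for one folder (from the last crawl)."""
    ddc: int                      # deleted documents count recorded last time
    mlct: float                   # maximum local commit time recorded last time
    doc_urls: Set[str] = field(default_factory=set)  # URLs seen in this folder


@dataclass
class FolderState:
    """Hypothetical snapshot of the folder's current properties in the store."""
    ddc: int
    mlct: float


def check_folder(
    record: FolderRecord,
    current: FolderState,
    list_items: Callable[[], Iterable[str]],
) -> Tuple[Set[str], bool]:
    """Return (deleted_urls, has_newer_commits) for one folder.

    list_items is called only when the DDC has changed, so an unchanged
    folder costs no full listing.
    """
    deleted: Set[str] = set()
    if current.ddc != record.ddc:
        # DDC changed: fetch the full item list and diff it against history.
        live = set(list_items())
        deleted = record.doc_urls - live
    # MLCT newer than the recorded value means something was committed since.
    has_newer_commits = current.mlct > record.mlct
    return deleted, has_newer_commits
```

A caller would iterate over the folders in its History Table, remove the returned deleted URLs from the index, and treat the flag as a signal that the folder holds documents committed since the previous crawl.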

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.