Method and system for incremental web crawling
US6631369B1 · kind B1 · utility
Assignee
Inventors
Key dates
| Filing date | Jun 30, 1999 |
| Grant date | Oct 7, 2003 |
| Priority date | — |
| Expiry date | Jun 30, 2019 |
Classification
- Technology area (CPC Y)Emerging Cross-Sectional Technologies
- CPC primaryY10S707/99934
- WIPO fieldComputer technology
- WIPO sectorElectrical engineering
Abstract
A Web crawler creates an index of documents in a document store on a computer network. In an initial crawl, the crawler creates a first full index for the document store. The first full crawl is based on a set of predefined “seed” URLs and crawl restrictions, and involves recursively retrieving each folder/document directly or indirectly linked to the seed URLs. In the process of creating the first full index, the crawler creates a History Table containing a list of URLs for each folder and document found in the first full crawl. The History Table also includes a local commit time (LCT) for each document and a deleted documents count (DDC) and LCT or maximum LCT (MLCT) for each folder (this assumes that the store supports a folder hierarchy and the MLCT, LCT and DDC properties). Thereafter, in an incremental crawl, the crawler determines, for each folder, (1) whether the DDC for that folder has changed and (2) whether the MLCT is more recent than the corresponding value in the History Table. If the DDC has changed, the crawler obtains a full list of items (URLs) in that folder, and compares the list with the URLs in the History Table to identify the deleted documents. T…
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.