Patent · US Expired

Method and system for detecting duplicate documents in web crawls

US6547829B1 · kind B1 · utility

177 Cited by

3 References

22 Claims

0 Family size

Assignee

Microsoft Corporation · US

Inventors

Dmitriy Meyerzon · Bellevue, US
Srikanth Shoroff · Sammamish, US
F. Soner Terek · Bellevue, US
Scott Norin · Newcastle, US

Key dates

Filing date	Jun 30, 1999
Grant date	Apr 15, 2003
Priority date	—
Expiry date	Jun 30, 2019

Classification

Technology area (CPC Y)	Emerging Cross-Sectional Technologies
CPC primary	Y10S707/99945
WIPO field	Computer technology
WIPO sector	Electrical engineering

Abstract

A Web crawler application takes advantage of a document store's ability to provide a content identifier (CID) having a value that is a unique function of the physical storage location of a data object or document, such as a Web page. In operation, the crawler first tries to fetch the CID for a document. If the CID attribute is not supported by the document store, the crawler fetches the document, filters it to obtain a hash function, and commits the document to an index if the hash function is not present in a history table. If the CID is available from the document store, the CID is fetched from the document store. The crawler then determines whether the CID is present in the history table, which indicates whether an identical copy of the document in question has already been indexed under a different URL. If the CID is present, indicating that the document has already been indexed, the new URL is placed in the history file but the document itself is not retrieved from the document store, nor is it filtered again to obtain a CID. If the CID is not present in the history table, the full document is retrieved and indexed. The CID data structure is an extension of a known globally un…
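
The decision flow in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the class name `DedupCrawler`, the callables `fetch_cid` and `fetch_document`, the in-memory dictionaries standing in for the history table and the index, the `cid:` / `hash:` key prefixes, and the choice of SHA-256 as the content hash are all assumptions made for illustration. Recording the duplicate URL in the hash fallback path is also an assumption; the abstract states it only for the CID path.

```python
import hashlib
from typing import Callable, Dict, List, Optional


class DedupCrawler:
    """Toy model of the CID-first duplicate check described in the abstract."""

    def __init__(self,
                 fetch_cid: Callable[[str], Optional[str]],
                 fetch_document: Callable[[str], bytes]) -> None:
        # fetch_cid returns None when the store does not support CIDs (assumed API).
        self.fetch_cid = fetch_cid
        self.fetch_document = fetch_document
        self.history: Dict[str, List[str]] = {}  # key -> URLs seen with that key
        self.index: Dict[str, bytes] = {}        # key -> indexed content
        self.full_fetches = 0                    # counts full document retrievals

    def _fetch(self, url: str) -> bytes:
        self.full_fetches += 1
        return self.fetch_document(url)

    def crawl(self, url: str) -> bool:
        """Return True if the document was indexed, False if it was a duplicate."""
        cid = self.fetch_cid(url)
        if cid is not None:
            key = "cid:" + cid
            if key in self.history:
                # Already indexed under another URL: record the URL, skip the fetch.
                self.history[key].append(url)
                return False
            content = self._fetch(url)
        else:
            # No CID support: fetch the document and fall back to a content hash.
            content = self._fetch(url)
            key = "hash:" + hashlib.sha256(content).hexdigest()
            if key in self.history:
                self.history[key].append(url)  # assumption, see lead-in
                return False
        self.history[key] = [url]
        self.index[key] = content
        return True


# Hypothetical data: two URLs share one storage location, a third is distinct.
pages = {"http://a/x": b"hello", "http://b/y": b"hello", "http://c/z": b"world"}
cids = {"http://a/x": "loc-1", "http://b/y": "loc-1", "http://c/z": "loc-2"}

crawler = DedupCrawler(cids.get, pages.__getitem__)
results = [crawler.crawl(url) for url in pages]
```

In this sketch the second URL is recognized as a duplicate from its CID alone, so only two full fetches occur. If the store offers no CID, every document must still be fetched and hashed before the duplicate is detected, which is the cost the CID path is meant to avoid.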

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.