Patent · US Active

Handling dynamic URLs in crawl for better coverage of unique content

US7827166B2 · kind B2 · utility

Cited by: 10
References: 1
Claims: 24
Family size: 0

Assignee

Inventors

Key dates

Filing date: Oct 13, 2006
Grant date: Nov 2, 2010
Priority date
Expiry date: Jun 8, 2027

Classification

  • Technology area (CPC G): Physics
  • CPC primary: G06F16/951
  • WIPO field: Computer technology
  • WIPO sector: Electrical engineering

Abstract

Techniques for identifying duplicate webpages are provided. In one technique, one or more parameters of a first unique URL are identified, where each of the one or more parameters does not substantially affect the content of the corresponding webpage. The first URL and subsequent URLs may be rewritten to drop each of the one or more parameters. Each of the subsequent URLs is compared to the first URL. If a subsequent URL is the same as the first URL, then the corresponding webpage of the subsequent URL is not accessed or crawled. In another technique, the parameters of multiple URLs are sorted, for example, alphabetically. If any URLs are the same, then the webpages of the duplicate URLs are not accessed or crawled.
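
The two techniques in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names `canonicalize` and `select_for_crawl`, the example URLs, and the set `IGNORED_PARAMS` are all hypothetical, and the sketch simply assumes that the content-irrelevant parameters are already known (the abstract does not say how they are identified).

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: parameters already judged not to affect page content.
# How that judgment is made is outside the scope of this sketch.
IGNORED_PARAMS = frozenset({"sessionid", "ref"})


def canonicalize(url, ignored=IGNORED_PARAMS):
    """Rewrite a URL by dropping ignored parameters and sorting the rest."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    # Drop content-irrelevant parameters, then sort alphabetically so that
    # parameter order no longer distinguishes two otherwise equal URLs.
    kept = sorted((k, v) for k, v in pairs if k not in ignored)
    return urlunsplit(parts._replace(query=urlencode(kept)))


def select_for_crawl(urls, ignored=IGNORED_PARAMS):
    """Return the URLs whose rewritten form has not been seen before."""
    seen = set()
    to_crawl = []
    for url in urls:
        key = canonicalize(url, ignored)
        if key in seen:
            continue  # duplicate of an earlier URL: do not access or crawl
        seen.add(key)
        to_crawl.append(url)
    return to_crawl


if __name__ == "__main__":
    candidates = [
        "http://example.com/p?a=1&b=2",
        "http://example.com/p?b=2&a=1&sessionid=xyz",
        "http://example.com/p?a=3",
    ]
    for url in select_for_crawl(candidates):
        print(url)
```

In this sketch the second candidate is skipped because, once `sessionid` is dropped and the remaining parameters are sorted, it rewrites to the same string as the first. Comparing rewritten strings in a set keeps the duplicate check constant-time per URL, which is a design choice of the example rather than something stated in the abstract.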

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.