Patent · US Active

Method and device for deduplicating web page

US10346257B2 · kind B2 · utility

Cited by: 1
References: 0
Claims: 16
Family size: 0

Assignee

Inventors

Key dates

Filing date: Dec 23, 2014
Grant date: Jul 9, 2019
Priority date
Expiry date: Dec 17, 2037

Classification

  • Technology area (CPC G): Physics
  • CPC primary: G06F16/958
  • WIPO field: Computer technology
  • WIPO sector: Electrical engineering

Abstract

A method and a device are described for de-duplicating a web page. The method includes: extracting at least one core sentence from a target web page; mapping each core sentence to a unique numeric value to form a first numeric value set; determining the intersection set of the first numeric value set with each second numeric value set and the number of numeric values included in each intersection set, and determining the maximum of those numbers; and, when the ratio of the maximum number to the total number of numeric values in the first numeric value set is greater than a set threshold, processing the target web page as a duplicate web page. In embodiments of the present invention, web page de-duplication processing can achieve improved accuracy, an enhanced anti-noise capability, and a reduced computational scale.
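The steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the abstract does not say how core sentences are chosen, how sentences are mapped to numbers, or where the second numeric value sets come from. Here it is assumed that a core sentence is any sentence with at least `min_words` words, that the mapping is a truncated MD5 hash (unique only up to hash collisions), and that each second set belongs to a previously processed page. The names `extract_core_sentences`, `sentence_to_value`, `fingerprint`, and `is_duplicate`, and the 0.8 threshold, are all hypothetical.

```python
import hashlib
import re


def extract_core_sentences(text, min_words=5):
    """Assumed heuristic: a core sentence has at least min_words words."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s.strip().lower() for s in sentences if len(s.split()) >= min_words]


def sentence_to_value(sentence):
    """Map a sentence to a 64-bit integer (assumed mapping: truncated MD5)."""
    digest = hashlib.md5(sentence.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def fingerprint(text):
    """Build the numeric value set for one page's text."""
    return {sentence_to_value(s) for s in extract_core_sentences(text)}


def is_duplicate(target_text, known_sets, threshold=0.8):
    """Apply the ratio test from the abstract to a target page.

    known_sets plays the role of the second numeric value sets.
    """
    first = fingerprint(target_text)
    if not first or not known_sets:
        return False
    # Largest intersection size across all second sets.
    max_overlap = max(len(first & second) for second in known_sets)
    # Duplicate when the ratio exceeds the set threshold.
    return max_overlap / len(first) > threshold


if __name__ == "__main__":
    article = (
        "The quick brown fox jumps over the lazy dog. "
        "Web crawlers often fetch the same article many times."
    )
    known = [fingerprint(article)]
    # Short noise such as "Click here!" is dropped before hashing.
    print(is_duplicate("Click here! " + article, known))
```

Because short fragments never enter the first numeric value set in this sketch, navigation text and similar noise do not lower the ratio, which is one plausible reading of the anti-noise claim. Comparing sets of integers rather than full page text is likewise one way the computation could stay small.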

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.