Method and device for deduplicating web page
US10346257B2 · kind B2 · utility
Assignee
Inventors
Key dates
| Filing date | Dec 23, 2014 |
| Grant date | Jul 9, 2019 |
| Priority date | — |
| Expiry date | Dec 17, 2037 |
Classification
- Technology area (CPC G): Physics
- CPC primary: G06F16/958
- WIPO field: Computer technology
- WIPO sector: Electrical engineering
Abstract
A method and a device are described for de-duplicating a web page. The method includes: extracting at least one core sentence from a target web page; mapping each core sentence to a unique numeric value to form a first numeric value set; determining the intersection set of the first numeric value set with each second numeric value set, counting the numeric values in each intersection set, and determining the maximum of those counts; and, when the ratio of the maximum count to the total number of numeric values in the first numeric value set is greater than a set threshold, processing the target web page as a duplicate web page. In embodiments of the present invention, web page de-duplication processing can achieve improved accuracy, an enhanced anti-noise capability, and a reduced calculation scale.
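The steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. Several details are assumptions made for illustration only: the function names, the sentence-splitting heuristic used to pick "core sentences", the use of a truncated SHA-256 digest as the numeric value (a hash is only practically unique, not strictly unique), the default threshold of 0.8, and the reading of the "second numeric value sets" as the sets of previously processed pages.

```python
import hashlib
import re


def extract_core_sentences(text, min_len=20):
    """Hypothetical heuristic: split on sentence punctuation and keep
    only sentences of at least min_len characters."""
    parts = re.split(r"[.!?]+", text)
    return [s.strip() for s in parts if len(s.strip()) >= min_len]


def sentence_to_value(sentence):
    """Map a sentence to a 64-bit integer (assumed mapping: the first
    8 bytes of its SHA-256 digest)."""
    digest = hashlib.sha256(sentence.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def fingerprint(text):
    """Build the numeric value set for a page."""
    return {sentence_to_value(s) for s in extract_core_sentences(text)}


def is_duplicate(target_text, known_sets, threshold=0.8):
    """Return True when the largest intersection with any known set,
    divided by the size of the target's set, exceeds the threshold."""
    first = fingerprint(target_text)
    if not first:
        return False  # no core sentences, so nothing to compare
    max_overlap = max((len(first & second) for second in known_sets), default=0)
    return max_overlap / len(first) > threshold
```

Because the ratio is taken against the size of the target page's own set, extra sentences present only in a stored page do not lower the score, which is one plausible way to read the abstract's anti-noise claim.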
Source: USPTO / EPO open patent data. Objective bibliographic data and citation counts.