Patent · US Active

Removing non-substantive content from a web page by removing its text-sparse nodes and removing high-frequency sentences of its text-dense nodes using sentence hash value frequency across a web page collection

US9449114B2 · kind B2 · utility

Cited by	1

References	6

Claims	16

Family size	0

Assignee

PAYPAL, INC. · US

Inventors

John Roper · Sammamish, US
Dane Glasgow · Los Gatos, US

Key dates

Filing date	Apr 15, 2010
Grant date	Sep 20, 2016
Priority date	—
Expiry date	Aug 11, 2032

Classification

Technology area (CPC G)	Physics
CPC primary	G06F40/143
WIPO field	Computer technology
WIPO sector	Electrical engineering

Abstract

A method and system for removing chrome from a web page is provided. An example system includes a parsing module, a text density analyzer, a content node selector, and a text extractor. The parsing module may be configured to parse a web page into a tree structure. The text density analyzer may be configured to determine a text density score value for each node from the tree structure. The content node selector may be configured to identify one or more nodes from the tree structure as content nodes based on their respective text density score values. The text extractor may be configured to extract text from the content nodes only.
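The four modules named in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The abstract does not define the text density score, so the formula used here (words in a node's subtree divided by the number of tags in that subtree) and the threshold of 5.0 are assumptions, as are all names (`TreeBuilder`, `text_density`, `select_content_nodes`, `extract_content_text`). The sentence hash frequency step mentioned in the title is not covered.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}  # elements with no end tag
SKIP_TAGS = {"script", "style"}                           # never treated as page text


class Node:
    """One element of the parsed tree; children holds Node objects and text strings in order."""
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Parsing module: turns HTML into a simple tree of Node objects."""
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and self.current.tag not in SKIP_TAGS:
            self.current.children.append(text)


def subtree_stats(node):
    """Return (word_count, tag_count) for the subtree rooted at node."""
    words, tags = 0, 1
    for child in node.children:
        if isinstance(child, str):
            words += len(child.split())
        else:
            w, t = subtree_stats(child)
            words, tags = words + w, tags + t
    return words, tags


def text_density(node):
    """Text density analyzer (assumed formula): words per tag in the subtree."""
    words, tags = subtree_stats(node)
    return words / tags


def select_content_nodes(node, threshold):
    """Content node selector: topmost nodes whose density meets the threshold."""
    if node.tag != "#root" and text_density(node) >= threshold:
        return [node]
    selected = []
    for child in node.children:
        if isinstance(child, Node):
            selected.extend(select_content_nodes(child, threshold))
    return selected


def subtree_text(node):
    """Collect the text of a subtree in document order."""
    parts = [c if isinstance(c, str) else subtree_text(c) for c in node.children]
    return " ".join(p for p in parts if p)


def extract_content_text(html, threshold=5.0):
    """Text extractor: return text from the selected content nodes only."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return [subtree_text(n) for n in select_content_nodes(builder.root, threshold)]


SAMPLE = """
<html><body>
<ul><li><a href="/">Home</a></li><li><a href="/about">About</a></li><li><a href="/help">Help</a></li></ul>
<p>The quick brown fox jumps over the lazy dog while the editor reviews the draft.</p>
<div><a href="/terms">Terms</a> <a href="/privacy">Privacy</a></div>
</body></html>
"""

if __name__ == "__main__":
    for block in extract_content_text(SAMPLE):
        print(block)
```

In the sample page, the navigation list and footer links average under one word per tag, while the paragraph has 15 words in a single tag, so only the paragraph meets the assumed threshold. A real page would need a tuned threshold and more careful handling of malformed markup.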

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.