Patent · US Active

Language-oriented focused crawling using transliteration based meta-features

US9189557B2 · kind B2 · utility

Cited by: 7
References: 0
Claims: 20
Family size: 0

Assignee

Inventors

Key dates

Filing date: Mar 11, 2013
Grant date: Nov 17, 2015
Priority date
Expiry date: Jan 23, 2034

Classification

  • Technology area (CPC G): Physics
  • CPC primary: G06F16/9566
  • WIPO field: Computer technology
  • WIPO sector: Electrical engineering

Abstract

A web page identified by a URL stored in a downloads queue is downloaded, and hyperlinks in the downloaded web page are identified. Each hyperlink is screened by parsing the hyperlink (optionally only the URL of the hyperlink) to identify features comprising character strings, computing for each feature values for one or more meta-features indicative of the hyperlinked web page being in a target language, aggregating the meta-feature values to generate a score for the hyperlink, and adding the URL of the hyperlink to the downloads queue conditional upon the score satisfying a screening criterion. The downloading, identifying, and screening are iteratively repeated to perform web crawling, and an index of web pages in the target language is constructed based on analysis of content of the downloaded web pages. The meta-features may include a transliterated target word meta-feature, a language code meta-feature, a country code meta-feature, or so forth.
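
The screening loop described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The word lists, weights, threshold, the aggregation rule (a weighted sum of per-meta-feature maxima), and all function names are assumptions chosen for illustration, with Hindi as a hypothetical target language. Page fetching is passed in as a function so the sketch runs without network access, and the index construction step is omitted.

```python
import re
from collections import deque
from urllib.parse import urljoin, urlparse

# Hypothetical resources for a Hindi-targeted crawl (illustrative only).
TRANSLITERATED_WORDS = {"samachar", "khabar", "duniya"}
LANGUAGE_CODES = {"hi", "hin"}
COUNTRY_CODES = {"in"}
WEIGHTS = {"translit": 1.0, "lang": 0.8, "country": 0.5}  # assumed weights
THRESHOLD = 0.5  # assumed screening criterion: score >= THRESHOLD


def url_features(url):
    """Parse a URL into features: lowercase alphabetic character strings."""
    parsed = urlparse(url)
    text = (parsed.netloc + parsed.path).lower()
    return [tok for tok in re.split(r"[^a-z]+", text) if tok]


def meta_feature_values(feature):
    """Compute the meta-feature values for a single feature."""
    # Simplification: the country code is matched against any feature,
    # not only the top-level domain.
    return {
        "translit": 1.0 if feature in TRANSLITERATED_WORDS else 0.0,
        "lang": 1.0 if feature in LANGUAGE_CODES else 0.0,
        "country": 1.0 if feature in COUNTRY_CODES else 0.0,
    }


def score_url(url):
    """Aggregate meta-feature values into one score for the hyperlink."""
    best = {name: 0.0 for name in WEIGHTS}
    for feature in url_features(url):
        for name, value in meta_feature_values(feature).items():
            best[name] = max(best[name], value)
    return sum(WEIGHTS[name] * best[name] for name in WEIGHTS)


def extract_links(base_url, html):
    """Find href targets in the page and resolve them against base_url."""
    hrefs = re.findall(r'href=["\'](.*?)["\']', html)
    return [urljoin(base_url, href) for href in hrefs]


def crawl(seeds, fetch, max_pages=100):
    """Download, identify hyperlinks, screen, and repeat."""
    queue = deque(seeds)  # the downloads queue
    seen = set(seeds)
    downloaded = []
    while queue and len(downloaded) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        downloaded.append(url)
        for link in extract_links(url, html):
            # Only URLs whose score satisfies the criterion are queued.
            if link not in seen and score_url(link) >= THRESHOLD:
                seen.add(link)
                queue.append(link)
    return downloaded


# Usage with an in-memory stand-in for the web (hypothetical pages).
PAGES = {
    "http://seed.example.com/": (
        '<a href="http://news.example.in/hi/samachar">a</a>'
        '<a href="http://example.com/about">b</a>'
    ),
    "http://news.example.in/hi/samachar": '<a href="/khabar/1">c</a>',
    "http://news.example.in/khabar/1": "",
}
visited = crawl(["http://seed.example.com/"], lambda u: PAGES.get(u, ""))
```

In this toy run the link to `http://example.com/about` is never queued because none of its features triggers a meta-feature. Only the URL string is inspected before download, which is the point of screening: pages unlikely to be in the target language are never fetched.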

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.