Patent · US Active

Pre-processing for identifying nonsense passages in documents being ingested into a corpus of a natural language processing system

US9842096B2 · kind B2 · utility

Cited by	2
References	4
Claims	20
Family size	0

Assignee

International Business Machines Corporation · US

Inventors

Charles E. Beller · Baltimore, US
Michael Drzewucki · Salem, US
Christopher Phipps · Arlington, US
Kristen M. Summers · Takoma Park, US
Julie Yu · Centreville, US

Key dates

Filing date	May 12, 2016
Grant date	Dec 12, 2017
Priority date	—
Expiry date	May 12, 2036

Classification

Technology area (CPC G)	Physics
CPC primary	G06F40/40
WIPO field	Computer technology
WIPO sector	Electrical engineering

Abstract

A mechanism is provided in a data processing system for identifying nonsense passages in documents being ingested into a corpus. A natural language processing pipeline configured to execute in the data processing system receives an input document to be ingested into a corpus. The natural language processing pipeline divides the input document into a plurality of input passages. A filter component of the natural language processing pipeline identifies whether each input passage is a nonsense passage based on a value of a metric determined according to a set of feature counts. The natural language processing pipeline filters each input passage in the plurality of input passages based on whether the input passage is identified as a nonsense passage or not identified as a nonsense passage to form a filtered plurality of input passages. The natural language processing pipeline adds the filtered plurality of input passages into the corpus.
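
The flow described in the abstract (divide the document into passages, score each passage with a metric built from feature counts, filter, then add the survivors to the corpus) can be illustrated with a minimal Python sketch. The abstract does not name the features, the metric, or a threshold, so everything specific here is an assumption made for illustration: passages are split on blank lines, the only two feature counts are total tokens and "word-like" tokens, the metric is the fraction of tokens that are not word-like, and the cutoff is 0.5. The function names (`split_passages`, `feature_counts`, `nonsense_metric`, `is_nonsense`, `ingest`) are hypothetical and do not come from the patent.

```python
import re
from typing import Dict, List

# Assumption: a token is "word-like" if it is alphabetic (allowing ' and -)
# with at most one trailing punctuation mark. The patent does not define this.
WORD_LIKE = re.compile(r"[A-Za-z][A-Za-z'-]*[.,;:!?]?")


def split_passages(document: str) -> List[str]:
    """Divide a document into passages on blank lines (assumed convention)."""
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]


def feature_counts(passage: str) -> Dict[str, int]:
    """Count illustrative features: all tokens and word-like tokens."""
    tokens = passage.split()
    word_like = sum(1 for t in tokens if WORD_LIKE.fullmatch(t))
    return {"tokens": len(tokens), "word_like": word_like}


def nonsense_metric(counts: Dict[str, int]) -> float:
    """Fraction of tokens that are not word-like (hypothetical metric)."""
    if counts["tokens"] == 0:
        return 1.0  # treat an empty passage as nonsense
    return 1.0 - counts["word_like"] / counts["tokens"]


def is_nonsense(passage: str, threshold: float = 0.5) -> bool:
    """Flag a passage when the metric exceeds an assumed threshold."""
    return nonsense_metric(feature_counts(passage)) > threshold


def ingest(document: str, corpus: List[str], threshold: float = 0.5) -> List[str]:
    """Filter the document's passages and add the kept ones to the corpus."""
    kept = [p for p in split_passages(document) if not is_nonsense(p, threshold)]
    corpus.extend(kept)
    return kept
```

Calling `ingest(text, corpus)` on a document whose middle passage is a run of symbols and hex fragments would, under these assumptions, add only the readable passages to `corpus`. A real filter would likely combine more feature counts and a tuned or learned threshold; the single ratio here only shows the shape of the pipeline.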

Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.