Method for building parallel corpora
US7949514B2 · kind B2 · utility
Assignee
Inventor
Key dates
| Filing date | Apr 20, 2007 |
| Grant date | May 24, 2011 |
| Priority date | — |
| Expiry date | Mar 23, 2030 |
Classification
- Technology area (CPC G): Physics
- CPC primary: G06F40/45
- WIPO field: Computer technology
- WIPO sector: Electrical engineering
Abstract
A method for identifying documents for enriching a statistical translation tool includes retrieving a source document that is responsive to a source language query, which may be specific to a selected domain. A set of text segments is extracted from the retrieved source document and translated into corresponding target language segments using the statistical translation tool to be enriched. Target language queries are formulated from the target language segments, and sets of target documents responsive to those queries are retrieved. The retrieved sets are filtered, which includes identifying any candidate documents that meet a selection criterion based on co-occurrence of a document in a plurality of the sets. The candidate documents, where found, are compared with the retrieved source document to determine whether any of them match it. Matching documents can then be stored and used in turn in a training phase for enriching the translation tool.
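The co-occurrence filtering step in the abstract can be illustrated with a minimal sketch. It assumes each target language query returns a list of document identifiers and treats a document as a candidate when it appears in at least `min_sets` of those result sets. The function name `cooccurrence_candidates`, the `min_sets` threshold, and the ranking by count are hypothetical choices made for illustration; the abstract only states that the criterion is based on co-occurrence in a plurality of the sets.

```python
from collections import Counter
from typing import Iterable, List


def cooccurrence_candidates(
    result_sets: Iterable[Iterable[str]], min_sets: int = 2
) -> List[str]:
    """Return document IDs found in at least `min_sets` of the result sets.

    `result_sets` holds one collection of document IDs per target language
    query. The threshold `min_sets` is an assumed parameter, not a value
    taken from the patent.
    """
    counts: Counter = Counter()
    for results in result_sets:
        # Deduplicate within a result set so a document counts once per query.
        counts.update(set(results))
    # Rank by how many sets contain the document, then by ID for determinism.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, n in ranked if n >= min_sets]


if __name__ == "__main__":
    # Hypothetical result sets for three target language queries.
    sets_per_query = [
        ["doc_a", "doc_b", "doc_c"],
        ["doc_b", "doc_c", "doc_d"],
        ["doc_c", "doc_e"],
    ]
    print(cooccurrence_candidates(sets_per_query))
```

In this sketch the surviving candidates would still need to be compared with the retrieved source document, as the abstract describes, before being stored as matches.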
Source: USPTO / EPO open patent data. Objective bibliographic and citation counts.