Prokopis Prokopidis, 2016-02-05 02:35 PM


Domain-specific resources acquired with ILSP-FC

ILSP-FC [1] has been used to acquire several domain-specific datasets for training and evaluating domain-specific SMT systems. These datasets include:

Additionally, experiments in crawling public administration websites for ELRC have produced bilingual collections in several language pairs. Examples are available at the following links: "EN-DE":http://nlp.ilsp.gr/elrc/output_bundesregierung.tmx.html ; "EN-LV":http://nlp.ilsp.gr/elrc/output_eu2015_en-lv.tmx.html ; "EN-GA":http://nlp.ilsp.gr/elrc/output_citizensinformation_en-ga.tmx.html.

References

  1. V. Papavassiliou, P. Prokopidis, G. Thurmair. A modular open-source focused crawler for mining monolingual and bilingual corpora from the web. In the 6th Workshop on Building and Using Comparable Corpora. 2013.

  2. P. Pecina, A. Toral, V. Papavassiliou, P. Prokopidis, A. Tamchyna, A. Way, J. van Genabith. Domain adaptation of statistical machine translation with domain-focused web crawling. Language Resources and Evaluation. Vol. 49:1. 2015.

  3. M. Esplà-Gomis, F. Klubička, N. Ljubešić, S. Ortiz-Rojas, V. Papavassiliou, P. Prokopidis. Comparing two acquisition systems for automatically building an English-Croatian parallel corpus from multilingual websites. In LREC 2014.

  4. R. Rubino, T. Pirinen, M. Esplà-Gomis, N. Ljubešić, S. Ortiz-Rojas, V. Papavassiliou, P. Prokopidis, A. Toral. Abu-MaTran at WMT 2015 Translation Task: Morphological Segmentation and Web Crawling. In WMT 2015.