Domain-specific resources acquired with ILSP-FC
ILSP-FC [1] has been used to acquire several domain-specific datasets for training and evaluating domain-specific SMT systems. These datasets include:
- bilingual corpora in EN-EL and EN-FR (for the environment and labor legislation domains) that were then used by the PANACEA consortium for domain adaptation SMT experiments [2] and the generation of domain-specific bilingual glossaries
- monolingual corpora in EL, EN, ES, FR, and IT in the same domains, used for the creation of domain-specific n-gram lists
- datasets for all combinations of DE, EL, EN, and PT in the automotive and medical domains, acquired in QTLaunchPad
- EN-HR bilingual corpora for the tourism domain [3]
- EN-FI bilingual corpora used for the Abu-MaTran project submissions in WMT 2015 [4]
Additionally, experiments in crawling public administration websites for the purposes of ELRC have generated bilingual collections in several language pairs; examples are available for EN-DE, EN-LV, and EN-GA.
[1] V. Papavassiliou, P. Prokopidis, G. Thurmair. A modular open-source focused crawler for mining monolingual and bilingual corpora from the web. In the 6th Workshop on Building and Using Comparable Corpora. 2013.
[2] P. Pecina, A. Toral, V. Papavassiliou, P. Prokopidis, A. Tamchyna, A. Way, J. van Genabith. Domain adaptation of statistical machine translation with domain-focused web crawling. Language Resources and Evaluation. Vol. 49:1. 2015.
[3] M. Esplà-Gomis, F. Klubička, N. Ljubešić, S. Ortiz-Rojas, V. Papavassiliou, P. Prokopidis. Comparing two acquisition systems for automatically building an English-Croatian parallel corpus from multilingual websites. In LREC 2014.
[4] R. Rubino, T. Pirinen, M. Esplà-Gomis, N. Ljubešić, S. Ortiz-Rojas, V. Papavassiliou, P. Prokopidis, A. Toral. Abu-MaTran at WMT 2015 Translation Task: Morphological Segmentation and Web Crawling. In WMT 2015.