Prokopis Prokopidis, 2015-08-20 02:09 PM

h1. Domain-specific resources acquired with ILSP-FC

ILSP-FC [1] has been used to acquire several domain-specific datasets for training and evaluating domain-specific SMT systems. These datasets include:

* "bilingual corpora":http://panacea-lr.eu/en/info-for-researchers/data-sets/bilingual-aligned-parallel-corpora/ in EN-EL, EN-FR, EN-IT and EN-ES for the environment and labor legislation domains, used by the PANACEA consortium for domain adaptation SMT experiments [2] and for the generation of domain-specific "bilingual glossaries":http://panacea-lr.eu/en/info-for-researchers/data-sets/bilingual-glossaries/
* "monolingual corpora":http://panacea-lr.eu/en/info-for-researchers/data-sets/monolingual-corpora/ in the same languages and domains, used for the creation of domain-specific "n-gram lists":http://panacea-lr.eu/en/info-for-researchers/data-sets/monolingual-corpora-n-grams/
* datasets in all combinations of DE, EL, EN and PT for the "automotive and medical domains":http://qt21.metashare.ilsp.gr/repository/search/?q=qtlp in the QTLaunchPad project
* EN-HR bilingual corpora for the tourism domain [3]
* EN-FI bilingual corpora used for the Abu-MaTran project submissions to WMT 2015 [4]

Additionally, experiments in crawling public administration websites for the purposes of ELRC have generated bilingual collections in several language pairs. Examples are available at the following links: "EN-DE":http://nlp.ilsp.gr/elrc/output_bundesregierung.tmx.html; "EN-LV":http://nlp.ilsp.gr/elrc/output_eu2015_en-lv.tmx.html; "EN-GA":http://nlp.ilsp.gr/elrc/output_citizensinformation_en-ga.tmx.html.

h2. References

[1] V. Papavassiliou, P. Prokopidis, G. Thurmair. A modular open-source focused crawler for mining monolingual and bilingual corpora from the web. In the 6th Workshop on Building and Using Comparable Corpora. 2013.

[2] P. Pecina, A. Toral, V. Papavassiliou, P. Prokopidis, A. Tamchyna, A. Way, J. van Genabith. Domain adaptation of statistical machine translation with domain-focused web crawling. Language Resources and Evaluation. Vol. 49:1. 2015.

[3] M. Esplà-Gomis, F. Klubička, N. Ljubešić, S. Ortiz-Rojas, V. Papavassiliou, P. Prokopidis. Comparing two acquisition systems for automatically building an English-Croatian parallel corpus from multilingual websites. In LREC 2014.

[4] R. Rubino, T. Pirinen, M. Esplà-Gomis, N. Ljubešić, S. Ortiz-Rojas, V. Papavassiliou, P. Prokopidis, A. Toral. Abu-MaTran at WMT 2015 Translation Task: Morphological Segmentation and Web Crawling. In WMT 2015 (to appear).