Version 6 (Prokopis Prokopidis, 2015-09-18 01:28 PM) → Version 7/15 (Prokopis Prokopidis, 2015-09-18 01:29 PM)
h1. Domain-specific resources acquired with ILSP-FC
ILSP-FC [1] has been used to acquire several domain-specific datasets for training and evaluating domain-specific SMT systems. These datasets include:
* "bilingual corpora":http://panacea-lr.eu/en/info-for-researchers/data-sets/bilingual-aligned-parallel-corpora/ in EN-EL and EN-FR for the environment and labor legislation domains, which the PANACEA consortium used for SMT domain adaptation experiments [2] and for generating domain-specific "bilingual glossaries":http://panacea-lr.eu/en/info-for-researchers/data-sets/bilingual-glossaries/
* "monolingual corpora":http://panacea-lr.eu/en/info-for-researchers/data-sets/monolingual-corpora/ in EN, EL, ES, FR, IT and DE in the same domains, used for creating domain-specific "n-gram lists":http://panacea-lr.eu/en/info-for-researchers/data-sets/monolingual-corpora-n-grams/.
* datasets in all combinations of DE, EL, EN and PT for the "automotive and medical domains":http://qt21.metashare.ilsp.gr/repository/search/?q=qtlp, acquired in QTLaunchPad
* EN-HR bilingual corpora for the tourism domain [3]
* EN-FI bilingual corpora used for the Abu-MaTran project submissions in WMT 2015 [4]
Additionally, experiments in crawling public administration websites for ELRC have generated bilingual collections in several language pairs. Examples are available at the following links: "EN-DE":http://nlp.ilsp.gr/elrc/output_bundesregierung.tmx.html; "EN-LV":http://nlp.ilsp.gr/elrc/output_eu2015_en-lv.tmx.html; "EN-GA":http://nlp.ilsp.gr/elrc/output_citizensinformation_en-ga.tmx.html.
h2. References
[1] V. Papavassiliou, P. Prokopidis, G. Thurmair. A modular open-source focused crawler for mining monolingual and bilingual corpora from the web. In the 6th Workshop on Building and Using Comparable Corpora. 2013.
[2] P. Pecina, A. Toral, V. Papavassiliou, P. Prokopidis, A. Tamchyna, A. Way, J. van Genabith. Domain adaptation of statistical machine translation with domain-focused web crawling. Language Resources and Evaluation. Vol. 49(1). 2015.
[3] M. Esplà-Gomis, F. Klubička, N. Ljubešić, S. Ortiz-Rojas, V. Papavassiliou, P. Prokopidis. Comparing two acquisition systems for automatically building an English-Croatian parallel corpus from multilingual websites. In LREC 2014.
[4] R. Rubino, T. Pirinen, M. Esplà-Gomis, N. Ljubešić, S. Ortiz-Rojas, V. Papavassiliou, P. Prokopidis, A. Toral. Abu-MaTran at WMT 2015 Translation Task: Morphological Segmentation and Web Crawling. In WMT 2015 (to appear).