Version 10 - History - Resources - ILSP Focused Crawler - ILSP NLP

Version 9 (Prokopis Prokopidis, 2016-02-05 02:35 PM) → Version 10/15 (Prokopis Prokopidis, 2016-02-05 02:37 PM)

# Domain-specific resources acquired with ILSP-FC

ILSP-FC [1] has been used to acquire several domain-specific datasets for training and evaluating domain-specific SMT systems. These datasets include:

* [bilingual corpora](http://panacea-lr.eu/en/info-for-researchers/data-sets/bilingual-aligned-parallel-corpora/) in EN-EL and EN-FR (for the environment and labor legislation domains) that were then used by the PANACEA consortium for domain adaptation SMT experiments [2] and the generation of domain-specific [bilingual glossaries](http://panacea-lr.eu/en/info-for-researchers/data-sets/bilingual-glossaries/);
* [monolingual corpora](http://panacea-lr.eu/en/info-for-researchers/data-sets/monolingual-corpora/) in EL, EN, ES, FR, and IT, in the same domains, used for the creation of domain-specific [ngram lists](http://panacea-lr.eu/en/info-for-researchers/data-sets/monolingual-corpora-n-grams/);
* all combinations of DE, EL, EN, and PT for the [automotive and medical domains](http://qt21.metashare.ilsp.gr/repository/search/?q=qtlp) in QTLaunchPad;
* EN-HR bilingual corpora for the tourism domain [3];
* EN-FI bilingual corpora used for the Abu-MaTran project submissions in WMT 2015 [4].

In addition, experiments in crawling public administration websites for the purposes of ELRC have generated bilingual collections in several language pairs. Examples are available at the following links: [EN-DE](http://nlp.ilsp.gr/elrc/output_bundesregierung.tmx.html); [EN-LV](http://nlp.ilsp.gr/elrc/output_eu2015_en-lv.tmx.html); [EN-GA](http://nlp.ilsp.gr/elrc/output_citizensinformation_en-ga.tmx.html).

# References

1. V. Papavassiliou, P. Prokopidis, G. Thurmair. A modular open-source focused crawler for mining monolingual and bilingual corpora from the web. In the 6th Workshop on Building and Using Comparable Corpora. 2013.

2. P. Pecina, A. Toral, V. Papavassiliou, P. Prokopidis, A. Tamchyna, A. Way, J. van Genabith. Domain adaptation of statistical machine translation with domain-focused web crawling. Language Resources and Evaluation. Vol. 49:1. 2015.

3. M. Esplà-Gomis, F. Klubička, N. Ljubešić, S. Ortiz-Rojas, V. Papavassiliou, P. Prokopidis. Comparing two acquisition systems for automatically building an English-Croatian parallel corpus from multilingual websites. In LREC 2014.

4. R. Rubino, T. Pirinen, M. Esplà-Gomis, N. Ljubešić, S. Ortiz-Rojas, V. Papavassiliou, P. Prokopidis, A. Toral. Abu-MaTran at WMT 2015 Translation Task: Morphological Segmentation and Web Crawling. In WMT 2015.
