ILSP Focused Crawler (ILSP-FC) is a research prototype for acquiring domain-specific monolingual and bilingual corpora. The required input from the user consists of a list of seed URLs pointing to relevant web pages and a list of terms that describe a topic. ILSP-FC integrates modules for text normalization, language identification, document clean-up, text classification, bilingual document alignment (i.e. identification of pairs of documents that are translations of each other) and sentence alignment. If the user does not provide a list of terms, the software can be used as a general crawler.
ILSP-FC is developed by researchers at ILSP/Athena RIC and is currently used in the European Language Resource Coordination (ELRC) Data effort. ELRC Data implements the acquisition of language resources and language processing services and provides them to the language resource repository of the Connecting Europe Facility (CEF) eTranslation platform, which helps European and national public administrations exchange information across language barriers in the EU.
An initial version of the crawler was produced during PANACEA, an EU FP7 project for the acquisition and production of language resources. It was then extended during two further projects: QTLaunchPad, a European Commission-funded collaborative research initiative dedicated to overcoming quality barriers in machine and human translation and in language technologies, and FP7-PEOPLE Abu-MaTran, a project for enhancing industry-academia cooperation in the adoption of machine translation technologies.
More information on how to download and use ILSP-FC can be found in the Documentation.
ILSP-FC is a Java project released under the GNU GPL v3.0 license. It depends on open-source libraries for web mining and for building data-processing workflows. If you would like to try ILSP-FC as a bilingual crawler before installing it, you can use this web service on small websites.
If you use ILSP-FC in scientific work, please cite: Papavassiliou, V., Prokopidis, P., and Thurmair, G. (2013). A modular open-source focused crawler for mining monolingual and bilingual corpora from the web. In Proceedings of the Sixth Workshop on Building and Using Comparable Corpora, pages 43-51. Sofia, Bulgaria: Association for Computational Linguistics (BibTeX)
The pair detection module of ILSP-FC was used for aligning documents in the WMT16 Bilingual Document Alignment Shared Task. The system reached a recall of 91% in the soft scoring setting prepared by the organizers. More details are presented in the system paper: Papavassiliou, V., Prokopidis, P., and Piperidis, S. (2016). The ILSP/ARC submission to the WMT 2016 bilingual document alignment shared task. In Proceedings of the First Conference on Machine Translation. Berlin, Germany: Association for Computational Linguistics (BibTeX)
Please send any questions to ilsp-fc at ilsp-dot-gr.