Projects
- ILSP Focused Crawler
ILSP Focused Crawler (ILSP-FC) is a research prototype for acquiring domain-specific monolingual and bilingual corpora. The user supplies a list of seed URLs pointing to relevant web pages and a list of terms that describe a topic. ILSP-FC integrates modules for text normalization, language identification, document clean-up, text classification, bilingual document alignment (i.e., identification of pairs of documents that are translations of each other) and sentence alignment. If the user does not provide a list of terms, the software can be used as a general crawler....
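The role of the term list can be illustrated with a minimal sketch. This is not ILSP-FC's actual classifier; the function names `relevance_score` and `is_relevant`, the plain occurrence count, and the `threshold` value are all assumptions made for illustration. The sketch only shows the general idea of keeping a page when it mentions enough topic terms, and of accepting every page when no terms are given.

```python
import re
from typing import Iterable


def relevance_score(text: str, terms: Iterable[str]) -> int:
    """Count case-insensitive, whole-word occurrences of the topic terms."""
    lowered = text.lower()
    total = 0
    for term in terms:
        # re.escape keeps multi-word or punctuated terms literal.
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        total += len(re.findall(pattern, lowered))
    return total


def is_relevant(text: str, terms: Iterable[str], threshold: int = 2) -> bool:
    """Decide whether a page should be kept (hypothetical rule).

    With an empty term list every page is kept, which mirrors the
    "general crawler" mode described above.
    """
    terms = list(terms)
    if not terms:
        return True
    return relevance_score(text, terms) >= threshold


if __name__ == "__main__":
    topic_terms = ["solar", "wind turbines"]
    page = "Solar panels and wind turbines: solar energy."
    print(is_relevant(page, topic_terms))
    print(is_relevant("A recipe for bread", topic_terms))
    print(is_relevant("A recipe for bread", []))
```

A real focused crawler would typically weight terms and combine the score with language identification and clean-up before classifying; the flat count here is only the simplest possible stand-in.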