Introduction » History » Version 20

Prokopis Prokopidis, 2016-02-16 12:48 PM

# Introduction

ILSP Focused Crawler (ILSP-FC) is a research prototype for acquiring domain-specific monolingual and bilingual corpora. ILSP-FC integrates modules for text normalization, language identification, document clean-up, metadata extraction, text classification, identification of bitexts (documents that are translations of each other), alignment of segments, and filtering of segment pairs. A detailed description of each module is available at http://aclweb.org/anthology/W/W13/W13-2506.pdf .

# Workflows

The current version of ILSP-FC lets the user [[GettingStarted|run all relevant processes]] in a pipeline or select a specific subset of them (e.g. alignment and merging only).

In a configuration for acquiring monolingual data, ILSP-FC applies the following processes in sequence:

- crawls the web until an expiration criterion is met (i.e. harvests web pages and stores those that are in the targeted language and, if required, relevant to a targeted topic)
- exports the stored data (i.e. stores the downloaded web pages/documents and generates a cesDoc file for each of them, with its content and metadata)
- discards (near) duplicate documents

In a configuration for acquiring parallel data, it applies the following processes in sequence:

- [[Crawl|crawls]] a website with content in the targeted languages (i.e. harvests the website and stores pages that are in the targeted languages and, if required, relevant to a targeted topic)
- [[Export|exports]] the stored data (i.e. stores each downloaded web page/document and generates a cesDoc file with its content and metadata)
- discards (near) [[NearDeduplication|duplicate]] documents
- identifies [[PairDetection|pairs]] of (candidate) parallel documents and generates a cesAlign file for each detected pair
- [[SegmentAlignment|aligns]] the segments in each detected document pair and generates a TMX file for each pair
- [[TMXmerging|merges]] the TMX files of the individual document pairs to create the final output, i.e. a single TMX file that includes all (or a selection of) segment pairs
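
The ordering of the two workflows, and the option to run only a subset of the steps, can be pictured with a small Python sketch. The stage names and the `select_stages` helper are hypothetical and chosen only for illustration; they are not ILSP-FC options or class names.

```python
# Hypothetical stage names that mirror the steps listed above;
# they are not ILSP-FC command-line options or class names.
MONOLINGUAL_STAGES = ["crawl", "export", "dedup"]
PARALLEL_STAGES = MONOLINGUAL_STAGES + ["pair_detect", "align", "tmx_merge"]


def select_stages(pipeline, wanted=None):
    """Return the stages to run, always in pipeline order.

    With wanted=None the whole pipeline is returned; otherwise only the
    requested stages are kept, regardless of the order they were given in.
    """
    if wanted is None:
        return list(pipeline)
    unknown = set(wanted) - set(pipeline)
    if unknown:
        raise ValueError(f"unknown stage(s): {sorted(unknown)}")
    return [stage for stage in pipeline if stage in wanted]


full_run = select_stages(PARALLEL_STAGES)
partial_run = select_stages(PARALLEL_STAGES, wanted={"tmx_merge", "align"})
print(full_run)
print(partial_run)
```

Keeping the stages in one ordered list means that a partial run such as "alignment and merging only" still executes its steps in the same relative order as the full pipeline.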

# Input

For general monolingual crawls, ILSP-FC traverses the web and stores web pages/documents with content in the targeted language. The required input from the user is:

- a list of seed URLs (i.e. a text file with one URL per line)

For focused monolingual crawls (i.e. when the crawler visits, processes and stores web pages that are in the targeted language and related to a targeted domain), the input should include:

- a list of seed URLs pointing to relevant web pages. An example seed URL list for _Environment_ in English can be found at [[ENV_EN_seeds.txt]].
- a list of term triplets (_\<relevance,term,subtopic\>_) that describe the domain and, optionally, subcategories of this domain. This list is required whenever the user aims to acquire domain-specific documents. An example domain definition for the _Environment_ domain in English can be found at [[ENV_EN_topic.txt]].
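
As a rough illustration of what a domain definition carries, the Python sketch below reads term triplets into small records. It assumes one comma-separated `relevance,term,subtopic` triplet per line, an integer relevance, and an optional subtopic; the actual layout of [[ENV_EN_topic.txt]] may differ, and the sample terms and weights are invented. Terms that themselves contain commas are not handled.

```python
from typing import NamedTuple, Optional


class TopicTerm(NamedTuple):
    relevance: int           # assumed to be an integer weight
    term: str
    subtopic: Optional[str]  # None when no subcategory is given


def parse_topic_lines(lines):
    """Parse 'relevance,term,subtopic' lines (an assumed layout) into TopicTerm records."""
    terms = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        parts = [part.strip() for part in line.split(",")]
        if len(parts) not in (2, 3):
            raise ValueError(f"expected 2 or 3 fields, got: {raw!r}")
        relevance, term = int(parts[0]), parts[1]
        subtopic = parts[2] if len(parts) == 3 and parts[2] else None
        terms.append(TopicTerm(relevance, term, subtopic))
    return terms


# Invented sample entries, for illustration only.
sample = [
    "100,climate change,climate",
    "50,recycling,waste",
    "20,pollution",
]
topic = parse_topic_lines(sample)
```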

For general bilingual crawls, the input from the user includes:

- a seed URL list containing URL(s) from a single web site with content in both of the targeted languages (e.g. [[ENV_EN_ES_seed.txt]]). The crawler will only follow links pointing to pages inside this web site. Examples of seed URLs can be found at [[seed_examples.txt]].
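
Before starting a bilingual crawl, a seed list can be checked against the single-web-site requirement with a few lines of Python. This is only a sketch: it assumes that "same web site" means "same host name", which may be stricter or looser than the rule ILSP-FC itself applies, and the helper names and example URLs are hypothetical.

```python
from urllib.parse import urlsplit


def seed_hosts(seed_lines):
    """Return the set of host names found in a one-URL-per-line seed list."""
    hosts = set()
    for raw in seed_lines:
        url = raw.strip()
        if not url:
            continue  # skip blank lines
        host = urlsplit(url).hostname  # lower-cased by urlsplit, None if missing
        if host is None:
            raise ValueError(f"not an absolute URL: {url!r}")
        hosts.add(host)
    return hosts


def is_single_site(seed_lines):
    """True when every seed URL points to the same host (assumed meaning of 'one web site')."""
    return len(seed_hosts(seed_lines)) == 1


# Placeholder URLs, for illustration only.
seeds = ["http://www.example.org/en/", "http://www.example.org/es/"]
print(is_single_site(seeds))
```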

For focused bilingual crawls, the input should also include:

- a list of term triplets (_\<relevance,term,subtopic\>_) that describe the domain and, optionally, subcategories of this domain in both of the targeted languages (i.e. the union of the domain definitions in each language). This list is required whenever the user aims to acquire domain-specific documents. An example domain definition of _Environment_ for the English-Spanish pair can be found at [[ENV_EN_ES_topic.txt]].

Note that when a thematic website is targeted, the domain relevance check can very likely be skipped (i.e. constructing and using a list of terms that define the targeted topic might be redundant).

# Output

- cesDoc (lists)
- cesAlign (lists)
- TMX (lists)
- merged TMX (TMX, HTML, MD)
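
Since TMX is a standard XML format, the segment pairs in a TMX output can be read with general-purpose tools. The sketch below uses only the Python standard library; the sample document is invented, the `read_segment_pairs` helper is hypothetical, and any additional properties that ILSP-FC may write into its TMX files are ignored.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Invented minimal TMX document, for illustration only.
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="0" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Protect the environment.</seg></tuv>
      <tuv xml:lang="es"><seg>Proteja el medio ambiente.</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def read_segment_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) segment pairs for the two given language codes."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                segs[tuv.get(XML_LANG)] = "".join(seg.itertext())
        # Keep only translation units that contain both languages.
        if src_lang in segs and tgt_lang in segs:
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs


pairs = read_segment_pairs(SAMPLE_TMX, "en", "es")
print(pairs)
```

The language codes are matched exactly as written in the file, so the codes passed in must use the same form and case as the `xml:lang` values.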

# Documentation

- [[HowToGet|How to get ILSP-FC]]: Learn about the different ways to get ILSP-FC
- [[DeveloperSetup|Developer Setup]]: Learn how to build ILSP-FC
- [[GettingStarted|Getting Started]]: Learn how to run ILSP-FC
- [[Languages-supported|Languages Supported]]: Learn about supported languages
- [[Resources|Resources]] acquired with ILSP-FC