Introduction » History » Version 24

Vassilis Papavassiliou, 2016-02-16 03:18 PM

# Introduction
ILSP Focused Crawler (ILSP-FC) is a research prototype for acquiring domain-specific monolingual and bilingual corpora. ILSP-FC integrates modules for text normalization, language identification, document clean-up, metadata extraction, text classification, identification of bitexts (documents that are translations of each other), alignment of segments, and filtering of segment pairs. A detailed description of each module is available at http://aclweb.org/anthology/W/W13/W13-2506.pdf . 
# Workflows
The current version of ILSP-FC offers the user the option to [[GettingStarted|run all relevant processes]] in a pipeline or to select a specific subset (e.g. alignment and merging only).
In a configuration for acquiring monolingual data, ILSP-FC applies the following processes (one after the other):
- crawls the web until an expiration criterion is met (i.e. harvests webpages and stores the ones that are in the targeted language, and relevant to a targeted topic if required)
- exports the stored data (i.e. stores the downloaded web pages/documents and, for each one, generates a cesDoc file with its content and metadata)
- discards (near) duplicate documents
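
The page does not say how near duplicates are detected, so the following is only a generic sketch of one common technique (word shingles compared with Jaccard similarity), not the method ILSP-FC implements. The function names, the shingle size and the 0.8 threshold are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

Shingle = Tuple[str, ...]


def shingles(text: str, k: int = 3) -> Set[Shingle]:
    """Return the set of k-word shingles of a text (lowercased, split on whitespace)."""
    tokens = text.lower().split()
    if len(tokens) < k:
        # Texts shorter than k words are represented by a single short shingle.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: Set[Shingle], b: Set[Shingle]) -> float:
    """Jaccard similarity of two shingle sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def discard_near_duplicates(docs: Dict[str, str], threshold: float = 0.8) -> List[str]:
    """Keep the first document of every group whose similarity reaches the threshold."""
    kept_ids: List[str] = []
    kept_shingles: List[Set[Shingle]] = []
    for doc_id, text in docs.items():
        current = shingles(text)
        if all(jaccard(current, other) < threshold for other in kept_shingles):
            kept_ids.append(doc_id)
            kept_shingles.append(current)
    return kept_ids
```

The pairwise comparison is quadratic in the number of documents, which is acceptable for a sketch but not for a large crawl.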
In a configuration for acquiring parallel data, it applies the following processes (one after the other):
- [[Crawl|crawls]] a website with content in the targeted languages (i.e. harvests the website and stores pages that are in the targeted languages, and relevant to a targeted topic if required)
- [[Export|exports]] the stored data (i.e. stores each downloaded web page/document and generates a cesDoc file with its content and metadata)
- discards (near) [[NearDeduplication|duplicate]] documents
- identifies [[PairDetection|pairs]] of (candidate) parallel documents and generates a cesAlign file for each detected pair.
- [[SegmentAlignment|aligns]] the segments in each detected document pair and generates a TMX file for each pair
- [[TMXmerging|merges]] TMX files corresponding to each document pair in order to create the final output, i.e. a TMX that includes all (or a selection of) segment pairs
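
The idea of running either the whole pipeline or only a subset of it (e.g. alignment and merging only) can be sketched as follows. The stage labels are illustrative names derived from the list above; they are assumptions, not ILSP-FC command-line options or class names.

```python
from typing import Iterable, List

# Hypothetical stage labels, in the order given for the parallel-data configuration.
PARALLEL_STAGES = ["crawl", "export", "dedup", "pairdetect", "align", "tmxmerge"]


def select_stages(requested: Iterable[str]) -> List[str]:
    """Return the requested stages in pipeline order, rejecting unknown names."""
    requested = set(requested)
    unknown = requested - set(PARALLEL_STAGES)
    if unknown:
        raise ValueError("unknown stage(s): " + ", ".join(sorted(unknown)))
    # Preserve the pipeline order regardless of the order of the request.
    return [stage for stage in PARALLEL_STAGES if stage in requested]
```

For example, `select_stages(["tmxmerge", "align"])` yields the alignment stage followed by the merging stage, since each stage consumes the output of the one before it.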
# Input

In case of general monolingual crawls, ILSP-FC traverses the web and stores web pages/documents with content in the targeted language. The required input from the user is:

- a list of seed URLs (i.e. a text file with one URL per line).
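
As a rough illustration of the seed list format (one URL per line), the sketch below reads such lines and keeps the ones that look like absolute HTTP(S) URLs. The function name is hypothetical, and silently skipping blank or malformed lines is an assumption; how ILSP-FC itself treats such lines is not stated here.

```python
from typing import Iterable, List
from urllib.parse import urlparse


def parse_seed_lines(lines: Iterable[str]) -> List[str]:
    """Return the well-formed http(s) URLs found in an iterable of text lines."""
    seeds: List[str] = []
    for raw in lines:
        url = raw.strip()
        if not url:
            continue  # skip blank lines
        parts = urlparse(url)
        # Assumption: only absolute http/https URLs are usable as seeds.
        if parts.scheme in ("http", "https") and parts.netloc:
            seeds.append(url)
    return seeds
```

A seed file can be passed directly, e.g. `with open(path, encoding="utf-8") as f: seeds = parse_seed_lines(f)`, where `path` is the user's seed list.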
In case of focused monolingual crawls (i.e. when the crawler visits/processes/stores web pages that are in the targeted language and related to a targeted domain), the input should include:
- a list of seed URLs pointing to relevant web pages. An example seed URL list for _Environment_ in English can be found at [[ENV_EN_seeds.txt]].
- a list of term triplets (_\<relevance,term,subtopic\>_) that describe a domain and, optionally, subcategories of this domain. This list is required only when the user aims to acquire domain-specific documents. An example domain definition can be found at [[ENV_EN_topic.txt]] for the _Environment_ domain in English.
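
To make the triplet structure concrete, the sketch below models a _\<relevance,term,subtopic\>_ entry as a small record and scores a text by summing the relevance weight of every term occurrence. Both the in-memory representation and the scoring rule are assumptions for illustration; the on-disk syntax of the topic file and the classifier actually used by ILSP-FC are not described on this page.

```python
from typing import List, NamedTuple, Optional


class TopicTerm(NamedTuple):
    """One <relevance, term, subtopic> triplet; the subtopic is optional."""
    relevance: int
    term: str
    subtopic: Optional[str] = None


def relevance_score(text: str, terms: List[TopicTerm]) -> int:
    """Sum relevance * occurrences over all terms (plain case-insensitive substring counts)."""
    lowered = text.lower()
    return sum(t.relevance * lowered.count(t.term.lower()) for t in terms)
```

A crawler could compare such a score against a threshold to decide whether a page belongs to the targeted domain; the threshold itself would be another user-chosen parameter.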
In case of general bilingual crawls, the input from the user includes:

- a seed URL list, which should contain URL(s) from only one web site with content in both of the targeted languages (e.g. [[ENV_EN_ES_seed.txt]]). The crawler will follow only links pointing to pages inside this web site. Examples of seed URLs can be found at [[seed_examples.txt]].
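
The restriction to links inside the seed web site can be sketched as a simple host comparison. This is a minimal illustration under the assumption that "inside this web site" means "same host as the seed URL"; ILSP-FC may apply a different rule (for instance, treating subdomains as part of the same site). The function name is hypothetical.

```python
from typing import Iterable, List
from urllib.parse import urljoin, urlparse


def same_site_links(seed_url: str, page_url: str, hrefs: Iterable[str]) -> List[str]:
    """Resolve hrefs against page_url and keep those on the same host as seed_url."""
    site = urlparse(seed_url).netloc.lower()
    kept: List[str] = []
    for href in hrefs:
        absolute = urljoin(page_url, href)  # resolves relative links
        if urlparse(absolute).netloc.lower() == site:
            kept.append(absolute)
    return kept
```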
In case of focused bilingual crawls, the input should also include:

- a list of term triplets (_\<relevance,term,subtopic\>_) that describe a domain and, optionally, subcategories of this domain, in both of the targeted languages (i.e. the union of the domain definitions in each language). This list is required only when the user aims to acquire domain-specific documents. An example domain definition of _Environment_ for the English-Spanish pair can be found at [[ENV_EN_ES_topic.txt]].

Note that when a thematic website is targeted, the domain-relevance check can very likely be skipped (i.e. constructing and using a list of terms that define the targeted topic might be redundant).
# Output
Each module of the tool provides its own output which feeds the next module in the pipeline: 
[[Crawl|Crawl]]: Creates the "run" directories (i.e. directories containing all resources fetched, extracted, used or required for each cycle of the crawl). See the _dest_ setting.

[[Export|Export]]: The output of this module consists of:

* a list of links pointing to XML files following the cesDoc Corpus Encoding Standard (http://www.xces.org/). See this "cesDoc":http://nlp.ilsp.gr/xslt/ilsp-fc/1.xml file for an example in English for the _Environment_ domain.
* a list of links pointing to HTML files (by XSL transformation of each XML) for easier browsing of the collection. As an example, see this "rendered cesDoc":http://nlp.ilsp.gr/xslt/ilsp-fc/1.xml.html file.

[[Pairdetect|Pair detect]]

In the case of a bilingual crawl, the output of ILSP-FC consists of:

* a list of links to XML files following the cesAlign Corpus Encoding Standard for linking cesDoc documents. This example "cesAlign":http://nlp.ilsp.gr/xslt/ilsp-fc/44_98_i.xml file serves as a link between a detected pair of cesDoc documents in "English":http://nlp.ilsp.gr/xslt/ilsp-fc/98.xml and "Spanish":http://nlp.ilsp.gr/xslt/ilsp-fc/44.xml.
* a list of links pointing to HTML files (by XSL transformation of each cesAlign XML) for easier browsing of the collection. As an example, see this "rendered cesAlign":http://nlp.ilsp.gr/xslt/ilsp-fc/44_98_i.xml.html file.
* a list of links to TMX files containing sentence alignments that have been extracted from the detected document pairs. As an example, see this "TMX":http://nlp.ilsp.gr/xslt/ilsp-fc/44_98_i.tmx file.
* a list of links pointing to HTML files (by XSL transformation of each TMX) for easier browsing of the collection. As an example, see this "rendered TMX":http://nlp.ilsp.gr/xslt/ilsp-fc/44_98_i.html file.

cesDoc (lists), cesAlign (lists), TMX (lists), merged TMX (TMX, HTML, MD)
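
Since the final output is a TMX file of segment pairs, a short sketch of how such a file could be read downstream may be useful. The sample below is a hand-written minimal TMX, not actual ILSP-FC output, which may carry additional attributes or properties per translation unit. The function name and the language codes are assumptions.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# ElementTree exposes the predefined xml: prefix under this namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hand-written sample for illustration only.
SAMPLE_TMX = """
<tmx version="1.4">
  <header creationtool="example" creationtoolversion="0" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Water is a scarce resource.</seg></tuv>
      <tuv xml:lang="es"><seg>El agua es un recurso escaso.</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def read_segment_pairs(tmx_text: str, src_lang: str, tgt_lang: str) -> List[Tuple[str, str]]:
    """Return (source, target) segment pairs for translation units that have both languages."""
    root = ET.fromstring(tmx_text.strip())
    pairs: List[Tuple[str, str]] = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline markup.
                segments[tuv.get(XML_LANG)] = "".join(seg.itertext())
        if src_lang in segments and tgt_lang in segments:
            pairs.append((segments[src_lang], segments[tgt_lang]))
    return pairs
```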
# Documentation
- [[HowToGet|How to get ILSP-FC]]: Learn about the different ways to get ILSP-FC
- [[DeveloperSetup|Developer Setup]]: Learn how to build ILSP-FC 
- [[GettingStarted|Getting Started]]: Learn how to run ILSP-FC
- [[Languages-supported|Languages Supported]]: Learn about supported languages
- [[Resources|Resources]] acquired with ILSP-FC