Introduction » History » Version 34

Vassilis Papavassiliou, 2016-02-16 03:53 PM

# Introduction

ILSP Focused Crawler (ILSP-FC) is a research prototype for acquiring domain-specific monolingual and bilingual corpora. ILSP-FC integrates modules for text normalization, language identification, document clean-up, metadata extraction, text classification, identification of bitexts (documents that are translations of each other), alignment of segments, and filtering of segment pairs. A detailed description of each module is available at http://aclweb.org/anthology/W/W13/W13-2506.pdf .
# Workflows
The current version of ILSP-FC lets the user either [[GettingStarted|run all relevant processes]] in a pipeline or select a specific subset (e.g. alignment and merging only).

In a configuration for acquiring monolingual data, ILSP-FC applies the following processes in sequence:

- crawls the web until an expiration criterion is met (i.e. harvests web pages and stores those that are in the targeted language and, if required, relevant to a targeted topic)
- exports the stored data (i.e. stores the downloaded web pages/documents and, for each page, generates a cesDoc file with its content and metadata)
- discards (near) duplicate documents

In a configuration for acquiring parallel data, it applies the following processes in sequence:

- [[Crawl|crawls]] a website with content in the targeted languages (i.e. harvests the website and stores pages that are in the targeted languages and, if required, relevant to a targeted topic)
- [[Export|exports]] the stored data (i.e. stores each downloaded web page/document and generates a cesDoc file with its content and metadata)
- discards (near) [[NearDeduplication|duplicate]] documents
- identifies [[PairDetection|pairs]] of (candidate) parallel documents and generates a cesAlign file for each detected pair
- [[SegmentAlignment|aligns]] the segments in each detected document pair and generates a TMX file for each pair
- [[TMXmerging|merges]] the TMX files of the document pairs to create the final output, i.e. a TMX file that includes all (or a selection of) segment pairs
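The ordering of the parallel-data processes, and the option to run only a subset of them, can be pictured with a short Python sketch. The stage labels and the `select_stages` helper are hypothetical names chosen for illustration; they are not ILSP-FC options or API names.

```python
# Hypothetical stage labels mirroring the parallel-data workflow listed above.
PARALLEL_STAGES = ["crawl", "export", "dedup", "pairdetect", "align", "tmxmerge"]


def select_stages(requested):
    """Return the requested stages in pipeline order, rejecting unknown names."""
    unknown = set(requested) - set(PARALLEL_STAGES)
    if unknown:
        raise ValueError(f"unknown stage(s): {sorted(unknown)}")
    return [stage for stage in PARALLEL_STAGES if stage in requested]


# Alignment and merging only, regardless of the order in which they were requested.
print(select_stages(["tmxmerge", "align"]))
```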
# Input
For general monolingual crawls, ILSP-FC traverses the web and stores web pages/documents with content in the targeted language. The required input from the user is:

- a list of seed URLs (i.e. a text file with one URL per line).
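Before starting a crawl, it can help to sanity-check a seed list. The sketch below reads one URL per line and rejects anything that is not an absolute http(s) URL. The `read_seed_urls` helper and the http(s)-only rule are assumptions made for this example, not part of ILSP-FC; an open seed file can be passed in place of the list of lines.

```python
from urllib.parse import urlparse


def read_seed_urls(lines):
    """Collect seed URLs from an iterable of text lines, one URL per line."""
    urls = []
    for line in lines:
        candidate = line.strip()
        if not candidate:
            continue  # skip blank lines
        parsed = urlparse(candidate)
        # Assumption: only absolute http(s) URLs are useful as seeds.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not an absolute http(s) URL: {candidate!r}")
        urls.append(candidate)
    return urls


seeds = read_seed_urls(["http://example.org/\n", "\n", "https://example.org/news\n"])
print(seeds)
```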

For focused monolingual crawls (i.e. when the crawler visits, processes, and stores only web pages that are in the targeted language and related to a targeted domain), the input should include:

- a list of seed URLs pointing to relevant web pages. An example seed URL list for _Environment_ in English can be found at [[ENV_EN_seeds.txt]].
- a list of term triplets (_\<relevance,term,subtopic\>_) that describe a domain and, optionally, subcategories of this domain. This list is required when the user aims to acquire domain-specific documents. An example domain definition for the _Environment_ domain in English can be found at [[ENV_EN_topic.txt]].
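To give a rough feel for how _\<relevance,term,subtopic\>_ triplets can describe a domain, the sketch below sums the relevance of every term found in a text, per subtopic. The triplets, the sample sentence, and the `score_text` function are made up for illustration, and the naive substring matching is a stand-in: it is not the classifier that ILSP-FC uses, which is described in the paper linked in the Introduction.

```python
from collections import defaultdict

# Made-up triplets in the <relevance, term, subtopic> shape described above.
TRIPLETS = [
    (100, "air pollution", "pollution"),
    (50, "recycling", "waste"),
    (80, "greenhouse gas", "pollution"),
]


def score_text(text, triplets):
    """Sum the relevance of every term found in the text, per subtopic."""
    lowered = text.lower()
    scores = defaultdict(int)
    for relevance, term, subtopic in triplets:
        # Naive substring counting; a real classifier would tokenize and normalize.
        occurrences = lowered.count(term.lower())
        if occurrences:
            scores[subtopic] += relevance * occurrences
    return dict(scores)


sample = "Air pollution and greenhouse gas emissions fell, while recycling rose."
print(score_text(sample, TRIPLETS))  # {'pollution': 180, 'waste': 50}
```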

For general bilingual crawls, the input from the user includes:

- a seed URL list that contains URL(s) from only one web site with content in both of the targeted languages (e.g. [[ENV_EN_ES_seed.txt]]). The crawler will follow only links pointing to pages inside this web site. Examples of seed URLs can be found at [[seed_examples.txt]].
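Because a bilingual seed list should point to a single web site, a quick check of the host names can catch mistakes early. This is a minimal sketch with hypothetical helper names; it compares host names literally, so `www.example.org` and `example.org` count as different sites, which may be stricter than what the crawler itself requires.

```python
from urllib.parse import urlparse


def seed_hosts(urls):
    """Return the set of lower-cased host names used by the seed URLs."""
    return {urlparse(url).hostname for url in urls}


def check_single_site(urls):
    """Raise ValueError unless all seed URLs belong to exactly one host."""
    hosts = seed_hosts(urls)
    # A host of None means the URL had no network location at all.
    if len(hosts) != 1 or None in hosts:
        raise ValueError(f"expected exactly one web site, found: {hosts}")
    return hosts.pop()


print(check_single_site(["http://example.org/en/", "http://example.org/es/"]))
```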

For focused bilingual crawls, the input should also include:

- a list of term triplets (_\<relevance,term,subtopic\>_) that describe a domain and, optionally, subcategories of this domain in both targeted languages (i.e. the union of the domain definitions in each language). This list is required when the user aims to acquire domain-specific documents. An example domain definition of _Environment_ for the English-Spanish pair can be found at [[ENV_EN_ES_topic.txt]].

Note that when a thematic website is targeted, the domain-relevance check can very likely be skipped (i.e. constructing and using a list of terms that define the targeted topic may be redundant).
# Output
Each module of the tool produces its own output, which feeds the next module in the pipeline:

[[Crawl|Crawl]]: creates the "run" directories (i.e. directories containing all resources fetched, extracted, used, or required for each cycle of the crawl); see the _dest_ setting.

[[Export|Export]]:

- a list of links pointing to XML files following the cesDoc Corpus Encoding Standard (http://www.xces.org/). See this "cesDoc":http://nlp.ilsp.gr/xslt/ilsp-fc/1.xml file for an example in English for the _Environment_ domain.
- a list of links pointing to HTML files (produced by XSL transformation of each XML file) for easier browsing of the collection. As an example, see this "rendered cesDoc":http://nlp.ilsp.gr/xslt/ilsp-fc/1.xml.html file.

[[Pairdetection|Pair detection]]:

- a list of links to XML files following the cesAlign Corpus Encoding Standard for linking cesDoc documents. This example "cesAlign":http://nlp.ilsp.gr/xslt/ilsp-fc/44_98_i.xml file serves as a link between a detected pair of cesDoc documents in "English":http://nlp.ilsp.gr/xslt/ilsp-fc/98.xml and "Spanish":http://nlp.ilsp.gr/xslt/ilsp-fc/44.xml.
- a list of links pointing to HTML files (produced by XSL transformation of each cesAlign XML file) for easier browsing of the collection. As an example, see this "rendered cesAlign":http://nlp.ilsp.gr/xslt/ilsp-fc/44_98_i.xml.html file.

[[Segment Alignment|Segment Alignment]]:

- a list of links to TMX files containing sentence alignments that have been extracted from the detected document pairs. As an example, see this "TMX":http://nlp.ilsp.gr/xslt/ilsp-fc/44_98_i.tmx file.
- a list of links pointing to HTML files (produced by XSL transformation of each TMX file) for easier browsing of the collection. As an example, see this "rendered TMX":http://nlp.ilsp.gr/xslt/ilsp-fc/44_98_i.html file.

[[TMX merging|TMX merging]]:

- a TMX file that includes the filtered segment pairs from the generated TMX files. This is the final output of the process (i.e. the parallel corpus). As an example, see this "TMX":http://nlp.ilsp.gr/xslt/ilsp-fc/sample.tmx file.
- an HTML file (produced by XSL transformation of the TMX file). As an example, see this "rendered TMX":http://nlp.ilsp.gr/xslt/ilsp-fc/sample.html file.
- an XML file that contains metadata of the generated corpus. As an example, see this "XML":http://nlp.ilsp.gr/xslt/ilsp-fc/sample.md.xml file.
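As a rough illustration of what merging per-document TMX files involves, the sketch below reads segment pairs from several TMX documents with Python's standard `xml.etree.ElementTree` and keeps each pair once. The function names, the `en`/`es` language codes, and the tiny inline TMX documents (with an empty header that a real TMX file would not have) are assumptions for the example. Dropping exact duplicates is only a stand-in for the filtering that ILSP-FC applies, which is not specified on this page.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_pairs(tmx_text, src_lang, tgt_lang):
    """Yield (source, target) segment texts from one TMX document string."""
    root = ET.fromstring(tmx_text)
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested in inline markup.
                segs[tuv.get(XML_LANG)] = "".join(seg.itertext()).strip()
        if src_lang in segs and tgt_lang in segs:
            yield segs[src_lang], segs[tgt_lang]


def merge_pairs(tmx_texts, src_lang, tgt_lang):
    """Merge pairs from several TMX documents, dropping exact duplicates."""
    seen, merged = set(), []
    for text in tmx_texts:
        for pair in read_pairs(text, src_lang, tgt_lang):
            if pair not in seen:
                seen.add(pair)
                merged.append(pair)
    return merged


def make_tmx(en, es):
    """Build a tiny one-unit TMX document for the demonstration."""
    return (
        '<tmx version="1.4"><header/><body><tu>'
        f'<tuv xml:lang="en"><seg>{en}</seg></tuv>'
        f'<tuv xml:lang="es"><seg>{es}</seg></tuv>'
        "</tu></body></tmx>"
    )


docs = [make_tmx("Hello", "Hola"), make_tmx("Hello", "Hola"), make_tmx("Water", "Agua")]
print(merge_pairs(docs, "en", "es"))  # [('Hello', 'Hola'), ('Water', 'Agua')]
```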
# Documentation
- [[HowToGet|How to get ILSP-FC]]: Learn about the different ways to get ILSP-FC
- [[DeveloperSetup|Developer Setup]]: Learn how to build ILSP-FC
- [[GettingStarted|Getting Started]]: Learn how to run ILSP-FC
- [[Languages-supported|Languages Supported]]: Learn about supported languages
- [[Resources|Resources]]: Resources acquired with ILSP-FC