Introduction » History » Version 36
Prokopis Prokopidis, 2016-02-16 06:09 PM
1 | 19 | Prokopis Prokopidis | # Introduction |
---|---|---|---|
2 | 19 | Prokopis Prokopidis | |
3 | 19 | Prokopis Prokopidis | ILSP Focused Crawler (ILSP-FC) is a research prototype for acquiring domain-specific monolingual and bilingual corpora. ILSP-FC integrates modules for text normalization, language identification, document clean-up, metadata extraction, text classification, identification of bitexts (documents that are translations of each other), alignment of segments, and filtering of segment pairs. A detailed description of each module is available at http://aclweb.org/anthology/W/W13/W13-2506.pdf . |
4 | 19 | Prokopis Prokopidis | |
5 | 19 | Prokopis Prokopidis | # Workflows |
6 | 19 | Prokopis Prokopidis | |
7 | 20 | Prokopis Prokopidis | The current version of ILSP-FC offers the user the option to [[GettingStarted|run all relevant processes]] in a pipeline or to select a specific subset (e.g. alignment and merging only). |
8 | 19 | Prokopis Prokopidis | |
9 | 19 | Prokopis Prokopidis | In a configuration for acquiring monolingual data, ILSP-FC applies the following processes (one after the other): |
10 | 19 | Prokopis Prokopidis | |
11 | 19 | Prokopis Prokopidis | - crawls the web until an expiration criterion is met (i.e. harvests webpages and stores the ones that are in the targeted language, and relevant to a targeted topic if required) |
12 | 19 | Prokopis Prokopidis | - exports the stored data (i.e. stores downloaded web pages/documents and, for each page, generates a cesDoc file with its content and metadata). |
13 | 19 | Prokopis Prokopidis | - discards (near) duplicate documents |
14 | 19 | Prokopis Prokopidis | |
15 | 19 | Prokopis Prokopidis | |
16 | 19 | Prokopis Prokopidis | In a configuration for acquiring parallel data, it applies the following processes (one after the other): |
17 | 19 | Prokopis Prokopidis | |
18 | 19 | Prokopis Prokopidis | - [[Crawl|crawls]] a website with content in the targeted languages (i.e. harvests the website and stores pages that are in the targeted languages, and relevant to a targeted topic if required) |
19 | 19 | Prokopis Prokopidis | - [[Export|exports]] the stored data (i.e. stores each downloaded web page/document and generates a cesDoc file with its content and metadata). |
20 | 19 | Prokopis Prokopidis | - discards (near) [[NearDeduplication|duplicate]] documents |
21 | 19 | Prokopis Prokopidis | - identifies [[PairDetection|pairs]] of (candidate) parallel documents and generates a cesAlign file for each detected pair. |
22 | 19 | Prokopis Prokopidis | - [[SegmentAlignment|aligns]] the segments in each detected document pair and generates a TMX for each document pair |
23 | 19 | Prokopis Prokopidis | - [[TMXmerging|merges]] TMX files corresponding to each document pair in order to create the final output, i.e. a TMX that includes all (or a selection of) segment pairs |
24 | 19 | Prokopis Prokopidis | |
25 | 19 | Prokopis Prokopidis | |
26 | 19 | Prokopis Prokopidis | # Input |
27 | 19 | Prokopis Prokopidis | |
28 | 19 | Prokopis Prokopidis | In general monolingual crawls, ILSP-FC traverses the web and stores web pages/documents with content in the targeted language. The required input from the user is: |
29 | 19 | Prokopis Prokopidis | |
30 | 19 | Prokopis Prokopidis | - a list of seed URLs (i.e. a text file with one URL per text line). |
31 | 19 | Prokopis Prokopidis | |
32 | 19 | Prokopis Prokopidis | In case of focused monolingual crawls (i.e. when the crawler visits/processes/stores web pages that are in the targeted language and related to a targeted domain), the input should include: |
33 | 19 | Prokopis Prokopidis | |
34 | 19 | Prokopis Prokopidis | - a list of seed URLs pointing to relevant web pages. An example seed URL list for _Environment_ in English can be found at [[ENV_EN_seeds.txt]]. |
35 | 19 | Prokopis Prokopidis | - a list of term triplets (_\<relevance,term,subtopic\>_) that describe a domain and, optionally, subcategories of this domain. This list is required when the user aims to acquire domain-specific documents. An example domain definition can be found at [[ENV_EN_topic.txt]] for the _Environment_ domain in English. |
36 | 19 | Prokopis Prokopidis | |
37 | 19 | Prokopis Prokopidis | |
38 | 19 | Prokopis Prokopidis | In case of general bilingual crawling, the input from the user includes: |
39 | 19 | Prokopis Prokopidis | - a seed URL list that contains URL(s) from only one web site with content in both of the targeted languages (e.g. [[ENV_EN_ES_seed.txt]]). The crawler will follow only links pointing to pages inside this web site. Examples of seed URLs can be found at [[seed_examples.txt]]. |
40 | 19 | Prokopis Prokopidis | |
41 | 19 | Prokopis Prokopidis | In case of focused bilingual crawls, the input should also include: |
42 | 19 | Prokopis Prokopidis | |
43 | 19 | Prokopis Prokopidis | - a list of term triplets (_\<relevance,term,subtopic\>_) that describe a domain and, optionally, subcategories of this domain in both of the targeted languages (i.e. the union of the domain definitions in each language). This list is required when the user aims to acquire domain-specific documents. An example domain definition of _Environment_ for the English-Spanish pair can be found at [[ENV_EN_ES_topic.txt]]. |
44 | 19 | Prokopis Prokopidis | Note that when a thematic website is targeted, the domain relevance check can very likely be skipped (i.e. constructing and using a list of terms that define the targeted topic might be redundant). |
45 | 19 | Prokopis Prokopidis | |
46 | 19 | Prokopis Prokopidis | # Output |
47 | 19 | Prokopis Prokopidis | |
48 | 21 | Vassilis Papavassiliou | Each module of the tool provides its own output which feeds the next module in the pipeline: |
49 | 21 | Vassilis Papavassiliou | |
50 | 23 | Vassilis Papavassiliou | [[Crawl|Crawl]]: Creates the "run" directories (i.e. directories containing all resources fetched/extracted/used/required for each cycle of this crawl). (See |
51 | 23 | Vassilis Papavassiliou | setting _dest_) |
52 | 1 | Prokopis Prokopidis | |
53 | 30 | Vassilis Papavassiliou | [[Export|Export]] : |
54 | 30 | Vassilis Papavassiliou | |
55 | 35 | Prokopis Prokopidis | - a list of links pointing to XML files following the cesDoc Corpus Encoding Standard (http://www.xces.org/). See this [cesDoc](http://nlp.ilsp.gr/xslt/ilsp-fc/1.xml) file for an example in English for the _Environment_ domain. |
56 | 35 | Prokopis Prokopidis | - a list of links pointing to HTML files (by XSL transformation of each XML) for easier browsing of the collection. As an example, see this [rendered cesDoc](http://nlp.ilsp.gr/xslt/ilsp-fc/1.xml.html) file. |
57 | 28 | Vassilis Papavassiliou | |
58 | 30 | Vassilis Papavassiliou | [[Pairdetection|Pair detection]] : |
59 | 29 | Vassilis Papavassiliou | |
60 | 35 | Prokopis Prokopidis | - a list of links to XML files following the cesAlign Corpus Encoding Standard for linking cesDoc documents. This example [cesAlign](http://nlp.ilsp.gr/xslt/ilsp-fc/44_98_i.xml) file serves as a link between a detected pair of cesDoc documents in [English](http://nlp.ilsp.gr/xslt/ilsp-fc/98.xml) and [Spanish](http://nlp.ilsp.gr/xslt/ilsp-fc/44.xml). |
61 | 35 | Prokopis Prokopidis | - a list of links pointing to HTML files (by XSL transformation of each cesAlign XML) for easier browsing of the collection. As an example, see this [rendered cesAlign](http://nlp.ilsp.gr/xslt/ilsp-fc/44_98_i.xml.html) file. |
62 | 29 | Vassilis Papavassiliou | |
63 | 23 | Vassilis Papavassiliou | [[Segment Alignment|Segment Alignment]] : |
64 | 29 | Vassilis Papavassiliou | |
65 | 35 | Prokopis Prokopidis | - a list of links to TMX files containing sentence alignments that have been extracted from the detected document pairs. As an example, see this [TMX](http://nlp.ilsp.gr/xslt/ilsp-fc/44_98_i.tmx) file. |
66 | 35 | Prokopis Prokopidis | - a list of links pointing to HTML files (by XSL transformation of each TMX) for easier browsing of the collection. As an example, see this [rendered TMX](http://nlp.ilsp.gr/xslt/ilsp-fc/44_98_i.html) file. |
67 | 30 | Vassilis Papavassiliou | |
68 | 30 | Vassilis Papavassiliou | [[TMX merging|TMX merging]]: |
69 | 27 | Vassilis Papavassiliou | |
70 | 36 | Prokopis Prokopidis | - a TMX file that includes filtered segment pairs of the generated TMX files. This is the final output of the process (i.e. the parallel corpus). As an example, see this [TMX](http://nlp.ilsp.gr/ilsp-fc/merged_tmx/sample_merged_eng-spa_resource.tmx) file. |
71 | 36 | Prokopis Prokopidis | - an HTML file (by XSL transformation of the TMX file). As an example, see this [rendered TMX](http://nlp.ilsp.gr/ilsp-fc/merged_tmx/sample_merged_eng-spa_resource.html) file. |
72 | 36 | Prokopis Prokopidis | - an XML file which contains metadata of the generated corpus. As an example, see this [XML](http://nlp.ilsp.gr/ilsp-fc/merged_tmx/sample_merged_eng-spa_resource.md.xml) file. |
73 | 19 | Prokopis Prokopidis | |
74 | 1 | Prokopis Prokopidis | # Documentation |
75 | 1 | Prokopis Prokopidis | |
76 | 19 | Prokopis Prokopidis | - [[HowToGet|How to get ILSP-FC]]: Learn about the different ways to get ILSP-FC |
77 | 19 | Prokopis Prokopidis | - [[DeveloperSetup|Developer Setup]]: Learn how to build ILSP-FC |
78 | 19 | Prokopis Prokopidis | - [[GettingStarted|Getting Started]]: Learn how to run ILSP-FC |
79 | 19 | Prokopis Prokopidis | - [[Languages-supported|Languages Supported]]: Learn about supported languages |
80 | 19 | Prokopis Prokopidis | - [[Resources|Resources]] acquired with ILSP-FC |
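
The Input section asks for a seed list with one URL per text line and, for bilingual crawls, URLs from only one web site. The sketch below shows one way a user might check a seed list before starting a crawl. It is not part of ILSP-FC: the function names are hypothetical, and treating "one web site" as "one host name" is a simplifying assumption (for instance, `www.example.org` and `example.org` would count as two sites here).

```python
from urllib.parse import urlsplit


def read_seed_urls(lines):
    """Return the non-empty, stripped URLs from an iterable of text lines."""
    return [line.strip() for line in lines if line.strip()]


def seed_hosts(urls):
    """Return the set of host names found in the seed URLs."""
    hosts = set()
    for url in urls:
        host = urlsplit(url).hostname
        if host is None:
            # Relative or scheme-less entries cannot be used as seeds.
            raise ValueError(f"not an absolute URL: {url!r}")
        hosts.add(host.lower())
    return hosts


def is_single_site(urls):
    """True if the list is non-empty and all seed URLs share one host name.

    Assumption: one host name stands for one web site.
    """
    return len(seed_hosts(urls)) == 1
```

A seed file could be checked with `is_single_site(read_seed_urls(open(path, encoding="utf-8")))`, where `path` points to a file such as the seed lists referenced in the Input section.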