Getting Started » History » Version 106
Vassilis Papavassiliou, 2014-08-15 02:51 PM
1 | 1 | Prokopis Prokopidis | h1. Getting Started |
---|---|---|---|
2 | 2 | Prokopis Prokopidis | |
3 | 2 | Prokopis Prokopidis | Once you [[DeveloperSetup|build]] or [[HowToGet|download]] an ilsp-fc runnable jar, you can run it as follows: |
4 | 2 | Prokopis Prokopidis | |
5 | 70 | Prokopis Prokopidis | <pre><code>java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar</code></pre> |
6 | 2 | Prokopis Prokopidis | |
7 | 93 | Vassilis Papavassiliou | h2. Input |
8 | 93 | Vassilis Papavassiliou | |
9 | 88 | Vassilis Papavassiliou | In case of general monolingual crawls the required input from the user is: |
10 | 96 | Vassilis Papavassiliou | * a list of seed URLs (i.e. a text file with one URL per text line). |
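
Since the seed list is a plain text file with one URL per line, it can be sanity-checked before a crawl. The following is a minimal sketch, not part of ilsp-fc: @check_seeds@ is a hypothetical helper, and it assumes that every seed starts with @http://@ or @https://@. It prints each non-empty line that does not consist of exactly one such URL.

```shell
# Hypothetical helper (not shipped with ilsp-fc): report suspicious lines in a seed file.
# Assumption: a valid seed line holds exactly one token that starts with http:// or https://.
check_seeds() {
    awk 'NF > 0 && (NF != 1 || $1 !~ /^https?:\/\//) {
        printf "line %d: %s\n", NR, $0
    }' "$1"
}
```

Calling @check_seeds ENV_EN_seeds.txt@ should print nothing when every line of the file is a single URL.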
11 | 1 | Prokopis Prokopidis | |
12 | 97 | Vassilis Papavassiliou | In case of focused monolingual crawls (i.e. when the crawler visits/processes/stores web pages that are related to a targeted domain), the input should include: |
13 | 95 | Vassilis Papavassiliou | * a list of seed URLs pointing to relevant web pages. An example seed URL list for _Environment_ in English can be found at [[ENV_EN_seeds.txt]]. |
14 | 91 | Prokopis Prokopidis | * a list of term triplets (_<relevance,term,subtopic>_) that describe a domain and, optionally, subcategories of this domain (this list is required when the user aims to acquire domain-specific documents). An example domain definition can be found at [[ENV_EN_topic.txt]] for the _Environment_ domain in English. Details on how to construct/bootstrap such lists and how they are used in text-to-topic classification can be found in this paper: http://www.aclweb.org/anthology/W13-2506.pdf |
15 | 1 | Prokopis Prokopidis | |
16 | 88 | Vassilis Papavassiliou | In case of general bilingual crawling, the input from the user includes: |
17 | 99 | Vassilis Papavassiliou | * a seed URL list which should contain URL(s) from only one web site (e.g. [[ENV_EN_ES_seed.txt]]). The crawler will follow only links pointing to pages inside this web site. However, the user can use the <code> filter </code> parameter (see below) to restrict the crawl to links pointing to pages either inside versions of the top domain of the URL (e.g. http://www.fifa.com/, http://es.fifa.com/, etc.) or in different web sites (i.e. in cases where the translations are hosted on two web sites, e.g. http://www.nrcan.gc.ca and http://www.rncan.gc.ca). Examples of seed URLs can be found at [[seed_examples.txt]]. |
18 | 1 | Prokopis Prokopidis | |
19 | 91 | Prokopis Prokopidis | In case of focused bilingual crawls, the input should also include: |
20 | 98 | Vassilis Papavassiliou | * a list of term triplets (_<relevance,term,subtopic>_) that describe a domain (i.e. this list is required in case the user aims to acquire domain-specific documents) and, optionally, subcategories of this domain in both the targeted languages (i.e. the union of the domain definition in each language). An example domain definition of _Environment_ for the English-Spanish pair can be found at [[ENV_EN_ES_topic.txt]]. |
21 | 78 | Vassilis Papavassiliou | |
22 | 94 | Prokopis Prokopidis | h2. Language support |
23 | 94 | Prokopis Prokopidis | |
24 | 88 | Vassilis Papavassiliou | For both monolingual and bilingual crawling, the set of currently supported languages comprises de, el, en, es, fr, hr, it, ja, and pt. |
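
A wrapper script can reject unsupported language codes before starting a crawl. The following is a minimal sketch under the assumption that the list above is current: @is_supported_lang@ is a hypothetical helper, not an ilsp-fc command, and its list must be updated by hand whenever a language is added.

```shell
# Hypothetical helper: succeed only for the language codes listed in this page.
is_supported_lang() {
    case "$1" in
        de|el|en|es|fr|hr|it|ja|pt) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, @is_supported_lang es && echo ok@ prints @ok@, while an unlisted code makes the function return a non-zero status.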
25 | 91 | Prokopis Prokopidis | |
26 | 91 | Prokopis Prokopidis | In order to add another language, a developer/user should: |
27 | 91 | Prokopis Prokopidis | * verify that the targeted language is supported by the default language identifier (https://code.google.com/p/language-detection/) integrated in the ILSP-FC, |
28 | 91 | Prokopis Prokopidis | * add a line with the appropriate content to the [[langKeys.txt]] file which is included in the ilsp-fc runnable jar, and |
29 | 91 | Prokopis Prokopidis | * add a proper analyser in the <code>gr.ilsp.fmc.utils.AnalyserFactory</code> class of the ilsp-fc source. |
30 | 94 | Prokopis Prokopidis | |
31 | 94 | Prokopis Prokopidis | h2. Other settings |
32 | 73 | Prokopis Prokopidis | |
33 | 102 | Vassilis Papavassiliou | There are several settings that influence the crawling process and can be defined in a configuration file before the crawling process. The default configuration files for monolingual and bilingual crawls are [[FMC_config.xml]] and [[FBC_config.xml]] respectively. They are included in the ilsp-fc runnable jar. |
34 | 1 | Prokopis Prokopidis | |
35 | 40 | Prokopis Prokopidis | Some of the settings can also be overridden using options of the ilsp-fc runnable jar, as follows: |
36 | 38 | Prokopis Prokopidis | |
37 | 84 | Vassilis Papavassiliou | <pre><code> |
38 | 84 | Vassilis Papavassiliou | crawlandexport : Forces the crawler to crawl and export the results. |
39 | 84 | Vassilis Papavassiliou | -a : user agent name (required) |
40 | 30 | Vassilis Papavassiliou | -type : the type of crawl: monolingual (m) or parallel (p). |
41 | 38 | Prokopis Prokopidis | -cfg : the configuration file that will be used instead of the default (see FMC_config.xml and FBC_config.xml above). |
42 | 34 | Vassilis Papavassiliou | -c : the crawl duration in minutes. Since the crawler runs in cycles (during which links stored at the top of |
43 | 34 | Vassilis Papavassiliou | the crawler’s frontier are extracted and new links are examined) it is very likely that the defined time |
44 | 34 | Vassilis Papavassiliou | will expire during a cycle run. Then, the crawler will stop only after the end of the running cycle. |
45 | 34 | Vassilis Papavassiliou | The default value is 10 minutes. |
46 | 65 | Vassilis Papavassiliou | -n : the crawl duration in cycles. The default is 1. It is recommended to use this parameter for testing purposes. |
47 | 34 | Vassilis Papavassiliou | -t : the number of threads that will be used to fetch web pages in parallel. |
48 | 67 | Vassilis Papavassiliou | -f : Forces the crawler to start a new job (required). |
49 | 34 | Vassilis Papavassiliou | -lang : the targeted language in case of monolingual crawling (required). |
50 | 1 | Prokopis Prokopidis | -l1 : the first targeted language in case of bilingual crawling (required). |
51 | 34 | Vassilis Papavassiliou | -l2 : the second targeted language in case of bilingual crawling (required). |
52 | 47 | Vassilis Papavassiliou | -u : the text file that contains the seed URLs that will initialize the crawler. In case of bilingual crawling |
53 | 28 | Vassilis Papavassiliou | the list should contain only 1 or 2 URLs from the same web domain. |
54 | 69 | Vassilis Papavassiliou | -tc : domain definition (a text file that contains a list of term triplets that describe the targeted |
55 | 69 | Vassilis Papavassiliou | domain). If omitted, the crawl will be a "general" one (i.e. module for text-to-domain |
56 | 69 | Vassilis Papavassiliou | classification will not be used). |
57 | 67 | Vassilis Papavassiliou | -k : Forces the crawler to annotate boilerplate content in parsed text. |
58 | 42 | Vassilis Papavassiliou | -filter : A regular expression to filter out URLs which do NOT match this regex. |
59 | 1 | Prokopis Prokopidis | The use of this filter forces the crawler to either focus on a specific |
60 | 1 | Prokopis Prokopidis | web domain (e.g. ".*ec.europa.eu.*"), or on a part of a web domain |
61 | 1 | Prokopis Prokopidis | (e.g. ".*/legislation_summaries/environment.*"). Note that if this filter |
62 | 44 | Vassilis Papavassiliou | is used, only the seed URLs that match this regex will be fetched. |
63 | 92 | Vassilis Papavassiliou | -u_r : This parameter should be used for bilingual crawling when there is an already known pattern in URLs |
64 | 71 | Vassilis Papavassiliou | which implies that one page is the candidate translation of the other. It includes the two strings |
65 | 1 | Prokopis Prokopidis | to be replaced separated by ';'. |
66 | 1 | Prokopis Prokopidis | -d : Forces the crawler to stay in a web site (i.e. starts from a web site and extracts only links to pages |
67 | 1 | Prokopis Prokopidis | inside the same web site). It should be used only for monolingual crawling. |
68 | 84 | Vassilis Papavassiliou | -len : Minimum number of tokens per paragraph. If the length (in terms of tokens) of a paragraph is |
69 | 84 | Vassilis Papavassiliou | less than this value (default is 3) the paragraph will be annotated as "out of interest" and |
70 | 84 | Vassilis Papavassiliou | will not be included into the clean text of the web page. |
71 | 84 | Vassilis Papavassiliou | -mtlen : Minimum number of tokens in cleaned document. If the length (in terms of tokens) of the cleaned |
72 | 84 | Vassilis Papavassiliou | text is less than this value (default is 200), the document will not be stored. |
73 | 106 | Vassilis Papavassiliou | -align : Extracts sentences from the detected document pairs and aligns the extracted sentences |
74 | 106 | Vassilis Papavassiliou | by using an aligner (default is hunalign). |
75 | 106 | Vassilis Papavassiliou | -dict : Uses this dictionary for the sentence alignment. If no argument is given, the default dictionary |
76 | 106 | Vassilis Papavassiliou | of the aligner will be used, if it exists. |
77 | 1 | Prokopis Prokopidis | -xslt : Insert a stylesheet for rendering xml results as html. |
78 | 85 | Vassilis Papavassiliou | -oxslt : Export crawl results with the help of an xslt file for better examination of results. |
79 | 85 | Vassilis Papavassiliou | -dom : Title of the targeted domain (required when domain definition, i.e. tc parameter, is used). |
80 | 85 | Vassilis Papavassiliou | -dest : The directory where the results (i.e. the crawled data) will be stored. |
81 | 85 | Vassilis Papavassiliou | -of : A text file containing a list with the exported XML files (see section Output below). |
82 | 85 | Vassilis Papavassiliou | -ofh : An HTML file containing a list with the generated XML files (see section Output below). |
83 | 85 | Vassilis Papavassiliou | -oft : A text file containing a list with the exported TMX files (see section Output below). |
84 | 85 | Vassilis Papavassiliou | -ofth : An HTML file containing a list with the generated TMX files (see section Output below). |
85 | 22 | Prokopis Prokopidis | </code></pre> |
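
Because only the seed URLs that match the @-filter@ regex are fetched, it can help to preview which seeds a pattern would keep. The following is a rough sketch, not part of ilsp-fc: @preview_filter@ is a hypothetical helper, it assumes that the regex has to match the whole URL (which the leading and trailing @.*@ in the examples above suggest), and it uses @grep -E@, whose syntax only approximates the regex dialect used by the crawler.

```shell
# Hypothetical helper: print the seed URLs that a -filter regex would keep.
# Assumption: the regex must match the entire URL, hence grep -x.
# Usage: preview_filter REGEX SEED_FILE
preview_filter() {
    # grep exits with 1 when nothing matches; treat that as success with empty output.
    grep -E -x -e "$1" -- "$2" || [ $? -eq 1 ]
}
```

For instance, @preview_filter ".*ec.europa.eu.*" ENV_EN_seeds.txt@ lists the seeds that contain @ec.europa.eu@; an empty result suggests that the crawl would have no seed to start from.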
86 | 1 | Prokopis Prokopidis | |
87 | 1 | Prokopis Prokopidis | h2. Run a monolingual crawl |
88 | 1 | Prokopis Prokopidis | |
89 | 22 | Prokopis Prokopidis | <pre><code>java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar crawlandexport -a vpapa@ilsp.gr \ |
90 | 100 | Vassilis Papavassiliou | -cfg FMC_config.xml -type m -c 10 -lang en -of output_test1_list.txt \ |
91 | 100 | Vassilis Papavassiliou | -ofh output_test1_list.txt.html -tc ENV_EN_topic.txt \ |
92 | 100 | Vassilis Papavassiliou | -u ENV_EN_seeds.txt -f -k -dom Environment</code></pre> |
93 | 1 | Prokopis Prokopidis | |
94 | 1 | Prokopis Prokopidis | <pre><code>java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar crawlandexport -a test2 \ |
95 | 85 | Vassilis Papavassiliou | -f -k -type m -c 5 -lang es -of output_test2_list.txt \ |
96 | 100 | Vassilis Papavassiliou | -ofh output_test2_list.txt.html -u seed_examples.txt |
97 | 71 | Vassilis Papavassiliou | </code></pre> |
98 | 71 | Vassilis Papavassiliou | |
99 | 71 | Vassilis Papavassiliou | h2. Run a bilingual crawl |
100 | 71 | Vassilis Papavassiliou | |
101 | 71 | Vassilis Papavassiliou | <pre><code>java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar crawlandexport -a test3 -c 10 -f -k -l1 de -l2 it \ |
102 | 85 | Vassilis Papavassiliou | -of test_HS_DE-IT_output.txt -ofh test_HS_DE-IT_output.txt.html -tc HS_DE-IT_topic.txt \ |
103 | 100 | Vassilis Papavassiliou | -type p -u seed_examples.txt -cfg FBC_config.xml -dom HS -len 0 -mtlen 100 -xslt -oxslt</code></pre> |
104 | 71 | Vassilis Papavassiliou | |
105 | 100 | Vassilis Papavassiliou | <pre><code>java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar crawlandexport -a test4 -c 10 -f -k -l1 es -l2 en \ |
106 | 100 | Vassilis Papavassiliou | -type p -u seed_examples.txt -filter ".*uefa.com.*" \ |
107 | 1 | Prokopis Prokopidis | -len 0 -mtlen 80 -xslt -oxslt -dest "/var/crawl_results/" \ |
108 | 101 | Vassilis Papavassiliou | -of test_U_ES-EN_output.txt -ofh test_U_ES-EN_output.txt.html \ |
109 | 101 | Vassilis Papavassiliou | -oft test_U_ES-EN_output.tmx.txt -ofth test_U_ES-EN_output.tmx.html \ |
110 | 1 | Prokopis Prokopidis | -align -dict </code></pre> |
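
The effect of the @-len@ and @-mtlen@ thresholds used in these commands can be approximated offline on a plain text file. The following is only a sketch: @clean_text_stats@ is a hypothetical helper, not part of ilsp-fc, and it assumes one paragraph per line and whitespace-separated tokens, which may differ from the crawler's own tokenization.

```shell
# Hypothetical illustration of the -len / -mtlen thresholds.
# Usage: clean_text_stats TEXT_FILE LEN MTLEN
# Assumptions: one paragraph per line; tokens are whitespace-separated words.
clean_text_stats() {
    awk -v len="$2" -v mtlen="$3" '
        # Paragraphs with fewer than len tokens are "out of interest" and are skipped.
        NF > 0 && NF >= len { kept++; tokens += NF }
        END {
            # A document whose clean text has fewer than mtlen tokens is not stored.
            printf "kept_paragraphs=%d tokens=%d stored=%s\n",
                   kept, tokens, (tokens >= mtlen ? "yes" : "no")
        }' "$1"
}
```

Running the helper with different values, for example @clean_text_stats page.txt 3 200@ and then @clean_text_stats page.txt 0 80@, gives a rough idea of how lowering the thresholds changes what would be kept.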
111 | 1 | Prokopis Prokopidis | |
112 | 85 | Vassilis Papavassiliou | h2. Output |
113 | 85 | Vassilis Papavassiliou | |
114 | 100 | Vassilis Papavassiliou | The output of the ilsp-fc in the case of a monolingual crawl consists of: |
115 | 100 | Vassilis Papavassiliou | * a list of links pointing to XML files following the cesDOC Corpus Encoding Standard (http://www.xces.org/). See [[cesDOC_file]] for an example in French for the Environment domain. |
116 | 100 | Vassilis Papavassiliou | * a list of links pointing to HTML files (by XSL transformation of each XML) for easier browsing of the collection. See [[rendered_cesDOC_file]]. |
117 | 71 | Vassilis Papavassiliou | |
118 | 100 | Vassilis Papavassiliou | The output of the ilsp-fc in the case of a bilingual crawl consists of: |
119 | 105 | Vassilis Papavassiliou | * a list of links to XML files following the cesAlign Corpus Encoding Standard for linking cesDOC documents. This example [[cesAlign_file]] serves as a link between a detected pair of cesDOC documents in English ([[EN_doc]]) and Spanish ([[ESdoc]]). |
120 | 100 | Vassilis Papavassiliou | * a list of links pointing to HTML files (by XSL transformation of each cesAlign XML) for easier browsing of the collection. See [[rendered_cesAling_file]]. |
121 | 103 | Vassilis Papavassiliou | * a list of links to TMX files containing sentence alignments that have been extracted from the detected document pairs. See [[TMX_file]]. |
122 | 100 | Vassilis Papavassiliou | * a list of links pointing to HTML files (by XSL transformation of each TMX) for easier browsing of the collection. See [[rendered_TMX_file]]. |