Getting Started » History » Version 116
Prokopis Prokopidis, 2014-08-15 03:48 PM
1 | 1 | Prokopis Prokopidis | h1. Getting Started |
---|---|---|---|
2 | 2 | Prokopis Prokopidis | |
3 | 2 | Prokopis Prokopidis | Once you [[DeveloperSetup|build]] or [[HowToGet|download]] an ilsp-fc runnable jar, you can run it like this: |
4 | 2 | Prokopis Prokopidis | |
5 | 70 | Prokopis Prokopidis | <pre><code>java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar</code></pre> |
6 | 2 | Prokopis Prokopidis | |
7 | 93 | Vassilis Papavassiliou | h2. Input |
8 | 93 | Vassilis Papavassiliou | |
9 | 88 | Vassilis Papavassiliou | In case of general monolingual crawls the required input from the user is: |
10 | 96 | Vassilis Papavassiliou | * a list of seed URLs (i.e. a text file with one URL per text line). |
11 | 1 | Prokopis Prokopidis | |
12 | 97 | Vassilis Papavassiliou | In case of focused monolingual crawls (i.e. when the crawler visits/processes/stores web pages that are related to a targeted domain), the input should include: |
13 | 95 | Vassilis Papavassiliou | * a list of seed URLs pointing to relevant web pages. An example seed URL list for _Environment_ in English can be found at [[ENV_EN_seeds.txt]]. |
14 | 91 | Prokopis Prokopidis | * a list of term triplets (_<relevance,term,subtopic>_) that describe a domain and, optionally, subcategories of this domain. This list is required only when the user aims to acquire domain-specific documents. An example domain definition can be found at [[ENV_EN_topic.txt]] for the _Environment_ domain in English. Details on how to construct/bootstrap such lists and how they are used in text-to-topic classification can be found in this paper: http://www.aclweb.org/anthology/W13-2506.pdf |
15 | 1 | Prokopis Prokopidis | |
16 | 88 | Vassilis Papavassiliou | In case of general bilingual crawling, the input from the user includes: |
17 | 99 | Vassilis Papavassiliou | * a seed URL list which should contain URL(s) from only one web site (e.g. [[ENV_EN_ES_seed.txt]]). The crawler will follow only links pointing to pages inside this web site. However, the user could use the <code>filter</code> parameter (see below) to allow visiting only links pointing to pages either inside versions of the top domain of the URL (e.g. http://www.fifa.com/, http://es.fifa.com/, etc.) or in different web sites (i.e. when the translations are hosted on two web sites, e.g. http://www.nrcan.gc.ca and http://www.rncan.gc.ca). Examples of seed URLs can be found at [[seed_examples.txt]]. |
18 | 1 | Prokopis Prokopidis | |
19 | 91 | Prokopis Prokopidis | In case of focused bilingual crawls, the input should also include: |
20 | 98 | Vassilis Papavassiliou | * a list of term triplets (_<relevance,term,subtopic>_) that describe a domain and, optionally, subcategories of this domain in both the targeted languages (i.e. the union of the domain definition in each language). This list is required only when the user aims to acquire domain-specific documents. An example domain definition of _Environment_ for the English-Spanish pair can be found at [[ENV_EN_ES_topic.txt]]. |
21 | 78 | Vassilis Papavassiliou | |
22 | 94 | Prokopis Prokopidis | h2. Language support |
23 | 94 | Prokopis Prokopidis | |
24 | 88 | Vassilis Papavassiliou | For both monolingual and bilingual crawling, the set of currently supported languages comprises de, el, en, es, fr, hr, it, ja, and pt. |
25 | 91 | Prokopis Prokopidis | |
26 | 91 | Prokopis Prokopidis | In order to add another language, a developer/user should: |
27 | 91 | Prokopis Prokopidis | * verify that the targeted language is supported by the default language identifier (https://code.google.com/p/language-detection/) integrated in the ILSP-FC, |
28 | 91 | Prokopis Prokopidis | * add a textline with proper content in the [[langKeys.txt]] file which is included in the ilsp-fc runnable jar, and |
29 | 91 | Prokopis Prokopidis | * add a proper analyser in the <code>gr.ilsp.fmc.utils.AnalyserFactory</code> class of the ilsp-fc source. |
30 | 94 | Prokopis Prokopidis | |
31 | 94 | Prokopis Prokopidis | h2. Other settings |
32 | 73 | Prokopis Prokopidis | |
33 | 102 | Vassilis Papavassiliou | There are several settings that influence the crawling process and can be defined in a configuration file before the crawling process. The default configuration files for monolingual and bilingual crawls are [[FMC_config.xml]] and [[FBC_config.xml]] respectively. They are included in the ilsp-fc runnable jar. |
34 | 1 | Prokopis Prokopidis | |
35 | 40 | Prokopis Prokopidis | Some of the settings can also be overridden using options of the ilsp-fc runnable jar, as follows: |
36 | 38 | Prokopis Prokopidis | |
37 | 84 | Vassilis Papavassiliou | <pre><code> |
38 | 84 | Vassilis Papavassiliou | crawlandexport : Forces the crawler to crawl and export the results. |
39 | 84 | Vassilis Papavassiliou | -a : user agent name (required) |
40 | 30 | Vassilis Papavassiliou | -type : the type of crawling. Crawling for monolingual (m) or parallel (p). |
41 | 38 | Prokopis Prokopidis | -cfg : the configuration file that will be used instead of the default (see FMC_config.xml and FBC_config.xml above). |
42 | 34 | Vassilis Papavassiliou | -c : the crawl duration in minutes. Since the crawler runs in cycles (during which links stored at the top of |
43 | 34 | Vassilis Papavassiliou | the crawler’s frontier are extracted and new links are examined) it is very likely that the defined time |
44 | 34 | Vassilis Papavassiliou | will expire during a cycle run. Then, the crawler will stop only after the end of the running cycle. |
45 | 34 | Vassilis Papavassiliou | The default value is 10 minutes. |
46 | 65 | Vassilis Papavassiliou | -n : the crawl duration in cycles. The default is 1. This parameter is recommended for testing purposes. |
47 | 34 | Vassilis Papavassiliou | -t : the number of threads that will be used to fetch web pages in parallel. |
48 | 67 | Vassilis Papavassiliou | -f : Forces the crawler to start a new job (required). |
49 | 34 | Vassilis Papavassiliou | -lang : the targeted language in case of monolingual crawling (required). |
50 | 1 | Prokopis Prokopidis | -l1 : the first targeted language in case of bilingual crawling (required). |
51 | 34 | Vassilis Papavassiliou | -l2 : the second targeted language in case of bilingual crawling (required). |
52 | 47 | Vassilis Papavassiliou | -u : the text file that contains the seed URLs that will initialize the crawler. In case of bilingual crawling |
53 | 28 | Vassilis Papavassiliou | the list should contain only 1 or 2 URLs from the same web domain. |
54 | 69 | Vassilis Papavassiliou | -tc : domain definition (a text file that contains a list of term triplets that describe the targeted |
55 | 69 | Vassilis Papavassiliou | domain). If omitted, the crawl will be a "general" one (i.e. module for text-to-domain |
56 | 69 | Vassilis Papavassiliou | classification will not be used). |
57 | 67 | Vassilis Papavassiliou | -k : Forces the crawler to annotate boilerplate content in parsed text. |
58 | 42 | Vassilis Papavassiliou | -filter : A regular expression to filter out URLs which do NOT match this regex. |
59 | 1 | Prokopis Prokopidis | The use of this filter forces the crawler to either focus on a specific |
60 | 1 | Prokopis Prokopidis | web domain (e.g. ".*ec.europa.eu.*"), or on a part of a web domain |
61 | 1 | Prokopis Prokopidis | (e.g. ".*/legislation_summaries/environment.*"). Note that if this filter |
62 | 44 | Vassilis Papavassiliou | is used, only the seed URLs that match this regex will be fetched. |
63 | 92 | Vassilis Papavassiliou | -u_r : This parameter should be used for bilingual crawling when there is an already known pattern in URLs |
64 | 71 | Vassilis Papavassiliou | which implies that one page is the candidate translation of the other. It includes the two strings |
65 | 1 | Prokopis Prokopidis | to be replaced separated by ';'. |
66 | 1 | Prokopis Prokopidis | -d : Forces the crawler to stay in a web site (i.e. starts from a web site and extracts only links to pages |
67 | 1 | Prokopis Prokopidis | inside the same web site). It should be used only for monolingual crawling. |
68 | 84 | Vassilis Papavassiliou | -len : Minimum number of tokens per paragraph. If the length (in terms of tokens) of a paragraph is |
69 | 84 | Vassilis Papavassiliou | less than this value (default is 3) the paragraph will be annotated as "out of interest" and |
70 | 84 | Vassilis Papavassiliou | will not be included into the clean text of the web page. |
71 | 84 | Vassilis Papavassiliou | -mtlen : Minimum number of tokens in cleaned document. If the length (in terms of tokens) of the cleaned |
72 | 84 | Vassilis Papavassiliou | text is less than this value (default is 200), the document will not be stored. |
73 | 106 | Vassilis Papavassiliou | -align : Extracts sentences from the detected document pairs and aligns the extracted sentences |
74 | 106 | Vassilis Papavassiliou | by using an aligner (default is hunalign). |
75 | 106 | Vassilis Papavassiliou | -dict : Uses this dictionary for the sentence alignment. If no argument is given, the default dictionary |
76 | 106 | Vassilis Papavassiliou | of the aligner will be used, if one exists. |
77 | 1 | Prokopis Prokopidis | -xslt : Insert a stylesheet for rendering xml results as html. |
78 | 85 | Vassilis Papavassiliou | -oxslt : Export crawl results with the help of an xslt file for better examination of results. |
79 | 85 | Vassilis Papavassiliou | -dom : Title of the targeted domain (required when domain definition, i.e. tc parameter, is used). |
80 | 85 | Vassilis Papavassiliou | -dest : The directory where the results (i.e. the crawled data) will be stored. |
81 | 85 | Vassilis Papavassiliou | -of : A text file containing a list with the exported XML files (see section Output below). |
82 | 85 | Vassilis Papavassiliou | -ofh : An HTML file containing a list with the generated XML files (see section Output below). |
83 | 85 | Vassilis Papavassiliou | -oft : A text file containing a list with the exported TMX files (see section Output below). |
84 | 85 | Vassilis Papavassiliou | -ofth : An HTML file containing a list with the generated TMX files (see section Output below). |
85 | 22 | Prokopis Prokopidis | </code></pre> |
86 | 1 | Prokopis Prokopidis | |
87 | 1 | Prokopis Prokopidis | h2. Run a monolingual crawl |
88 | 1 | Prokopis Prokopidis | |
89 | 22 | Prokopis Prokopidis | <pre><code>java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar crawlandexport -a vpapa@ilsp.gr \ |
90 | 100 | Vassilis Papavassiliou | -cfg FMC_config.xml -type m -c 10 -lang en -of output_test1_list.txt \ |
91 | 100 | Vassilis Papavassiliou | -ofh output_test1_list.txt.html -tc ENV_EN_topic.txt \ |
92 | 100 | Vassilis Papavassiliou | -u ENV_EN_seeds.txt -f -k -dom Environment</code></pre> |
93 | 1 | Prokopis Prokopidis | |
94 | 1 | Prokopis Prokopidis | <pre><code>java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar crawlandexport -a test2 \ |
95 | 85 | Vassilis Papavassiliou | -f -k -type m -c 5 -lang es -of output_test2_list.txt \ |
96 | 100 | Vassilis Papavassiliou | -ofh output_test2_list.txt.html -u seed_examples.txt |
97 | 71 | Vassilis Papavassiliou | </code></pre> |
98 | 71 | Vassilis Papavassiliou | |
99 | 71 | Vassilis Papavassiliou | h2. Run a bilingual crawl |
100 | 71 | Vassilis Papavassiliou | |
101 | 71 | Vassilis Papavassiliou | <pre><code>java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar crawlandexport -a test3 -c 10 -f -k -l1 de -l2 it \ |
102 | 85 | Vassilis Papavassiliou | -of test_HS_DE-IT_output.txt -ofh test_HS_DE-IT_output.txt.html -tc HS_DE-IT_topic.txt \ |
103 | 100 | Vassilis Papavassiliou | -type p -u seed_examples.txt -cfg FBC_config.xml -dom HS -len 0 -mtlen 100 -xslt -oxslt</code></pre> |
104 | 71 | Vassilis Papavassiliou | |
105 | 100 | Vassilis Papavassiliou | <pre><code>java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar crawlandexport -a test4 -c 10 -f -k -l1 es -l2 en \ |
106 | 100 | Vassilis Papavassiliou | -type p -u seed_examples.txt -filter ".*uefa.com.*" \ |
107 | 1 | Prokopis Prokopidis | -len 0 -mtlen 80 -xslt -oxslt -dest "/var/crawl_results/" \ |
108 | 101 | Vassilis Papavassiliou | -of test_U_ES-EN_output.txt -ofh test_U_ES-EN_output.txt.html \ |
109 | 101 | Vassilis Papavassiliou | -oft test_U_ES-EN_output.tmx.txt -ofth test_U_ES-EN_output.tmx.html \ |
110 | 1 | Prokopis Prokopidis | -align -dict </code></pre> |
111 | 1 | Prokopis Prokopidis | |
112 | 85 | Vassilis Papavassiliou | h2. Output |
113 | 85 | Vassilis Papavassiliou | |
114 | 100 | Vassilis Papavassiliou | The output of the ilsp-fc in the case of a monolingual crawl consists of: |
115 | 112 | Vassilis Papavassiliou | * a list of links pointing to XML files following the cesDOC Corpus Encoding Standard (http://www.xces.org/). As an example, see this [[cesDOC_file]] in English for the Environment domain. |
116 | 116 | Prokopis Prokopidis | * a list of links pointing to HTML files (by XSL transformation of each XML) for easier browsing of the collection. As an example, see this "rendered cesDoc":http://nlp.ilsp.gr/xslt/ilsp-fc/1290.xml.html file. |
117 | 71 | Vassilis Papavassiliou | |
118 | 100 | Vassilis Papavassiliou | The output of the ilsp-fc in the case of a bilingual crawl consists of: |
119 | 115 | Prokopis Prokopidis | * a list of links to XML files following the cesAlign Corpus Encoding Standard for linking cesDoc documents. This example "cesAlign":http://nlp.ilsp.gr/xslt/ilsp-fc/16_71_h.xml file serves as a link between a detected pair of cesDoc documents in "English":http://nlp.ilsp.gr/xslt/ilsp-fc/16.xml and "Spanish":http://nlp.ilsp.gr/xslt/ilsp-fc/71.xml. |
120 | 114 | Prokopis Prokopidis | * a list of links pointing to HTML files (by XSL transformation of each cesAlign XML) for easier browsing of the collection. As an example, see this "rendered cesAlign":http://nlp.ilsp.gr/xslt/ilsp-fc/16_71_h.xml.html file. |
121 | 114 | Prokopis Prokopidis | * a list of links to TMX files containing sentence alignments that have been extracted from the detected document pairs. As an example, see this "TMX":http://nlp.ilsp.gr/xslt/ilsp-fc/16_71_h.tmx file. |
122 | 114 | Prokopis Prokopidis | * a list of links pointing to HTML files (by XSL transformation of each TMX) for easier browsing of the collection. As an example, see this "rendered TMX":http://nlp.ilsp.gr/xslt/ilsp-fc/16_71_h.html file. |
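
The description of the <code>-filter</code> option under "Other settings" notes that, when a filter is used, only the seed URLs that match the regex will be fetched. Before starting a crawl it may therefore help to preview which seeds would survive a given filter. The following is a minimal sketch, not part of ilsp-fc: the helper name @check_seeds@ and the sample URLs are hypothetical, and it assumes that a whole-line match with @grep -E -x@ approximates the crawler's regex matching for simple patterns such as the ones shown in this page.

```shell
#!/bin/sh
# Sketch: preview which seed URLs would survive a -filter regex.
# Assumption: grep -E -x (whole-line match) approximates the crawler's
# regex handling for simple patterns; check_seeds is a hypothetical helper.

# check_seeds SEED_FILE FILTER_REGEX
# Prints "keep <url>" or "drop <url>" for every non-empty line of SEED_FILE.
check_seeds() {
    seed_file=$1
    filter=$2
    # The "|| [ -n ... ]" part also handles a last line without a newline.
    while IFS= read -r url || [ -n "$url" ]; do
        if [ -z "$url" ]; then
            continue
        fi
        if printf '%s\n' "$url" | grep -E -x -q -- "$filter"; then
            printf 'keep %s\n' "$url"
        else
            printf 'drop %s\n' "$url"
        fi
    done < "$seed_file"
}

# Example run with a throwaway seed list (illustrative URLs only).
seeds=$(mktemp)
printf '%s\n' 'http://www.uefa.com/' 'http://es.uefa.com/' 'http://www.fifa.com/' > "$seeds"
result=$(check_seeds "$seeds" '.*uefa.com.*')
printf '%s\n' "$result"
rm -f "$seeds"
```

Seeds reported as "drop" would have to be removed from the list, or the filter relaxed, if those sites are meant to be crawled. For patterns that rely on regex features beyond basic wildcards, the result of this check may differ from the crawler's behaviour.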