Vassilis Papavassiliou, 2016-05-31 04:46 PM
1 | 130 | Prokopis Prokopidis | # Getting Started |
---|---|---|---|
2 | 2 | Prokopis Prokopidis | |
3 | 2 | Prokopis Prokopidis | Once you [[DeveloperSetup|build]] or [[HowToGet|download]] an ilsp-fc runnable jar, you can run it like this: |
4 | 2 | Prokopis Prokopidis | |
5 | 70 | Prokopis Prokopidis | <pre><code>java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar</code></pre> |
6 | 2 | Prokopis Prokopidis | |
7 | 134 | Vassilis Papavassiliou | ## Examples of running monolingual crawls |
8 | 134 | Vassilis Papavassiliou | |
9 | 134 | Vassilis Papavassiliou | * Given a seed URL list [[ENV_EN_seeds.txt]], the following example crawls the web for 5 minutes and constructs a collection containing English web pages. |
10 | 134 | Vassilis Papavassiliou | |
11 | 134 | Vassilis Papavassiliou | ``` |
12 | 134 | Vassilis Papavassiliou | java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar \ |
13 | 154 | Vassilis Papavassiliou | -crawl -export -dedup -a test -f -type m -c 5 -lang en -k -u ENV_EN_seeds.txt -oxslt \ |
14 | 154 | Vassilis Papavassiliou | -dest crawlResults -bs "output_test" |
15 | 134 | Vassilis Papavassiliou | ``` |
16 | 134 | Vassilis Papavassiliou | |
17 | 134 | Vassilis Papavassiliou | In this and the other example commands in this documentation, a `log4j.xml` file is used to configure logging. An example `log4j.xml` file can be downloaded from [[Log4j_xml|here]]. |
18 | 134 | Vassilis Papavassiliou | |
19 | 134 | Vassilis Papavassiliou | * Given a seed URL list [[ENV_EN_seeds.txt]] and a topic definition for the _Environment_ domain in English [[ENV_EN_topictxt|ENV_EN_topic.txt]], the following example crawls the web for up to 10 fetch/update loops and constructs a collection containing English web pages related to this domain. |
20 | 134 | Vassilis Papavassiliou | |
21 | 134 | Vassilis Papavassiliou | ``` |
22 | 134 | Vassilis Papavassiliou | java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar \ |
23 | 154 | Vassilis Papavassiliou | -crawl -export -dedup -a test1 -f -type m -n 10 -lang en -k -u ENV_EN_seeds.txt -oxslt \ |
24 | 154 | Vassilis Papavassiliou | -tc ENV_EN_topic.txt -dom Environment -dest crawlResults -bs "output-test" |
25 | 134 | Vassilis Papavassiliou | ``` |
26 | 134 | Vassilis Papavassiliou | |
27 | 134 | Vassilis Papavassiliou | ## Example of running bilingual crawls |
28 | 134 | Vassilis Papavassiliou | |
29 | 151 | Vassilis Papavassiliou | This is a test example to verify that the whole workflow (crawl, export, deduplication, pair detection, alignment) works successfully. |
30 | 134 | Vassilis Papavassiliou | |
31 | 1 | Prokopis Prokopidis | ``` |
32 | 151 | Vassilis Papavassiliou | java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar /opt/ilsp-fc/ilsp-fc-2.2.3-SNAPSHOT-jar-with-dependencies.jar \ |
33 | 151 | Vassilis Papavassiliou | -crawl -export -dedup -pairdetect -align -tmxmerge -f -k -oxslt -type p -n 1 -t 20 -len 0 -mtlen 100 \ |
34 | 151 | Vassilis Papavassiliou | -pdm "aupdih" -segtypes "1:1" -lang "eng;lt;deu;lv" -a test -filter ".*www\.airbaltic\.com.*" \ |
35 | 151 | Vassilis Papavassiliou | -u "/var/www/html/elrc/test-seeds" -dest "/var/www/html/elrc/test" \ |
36 | 151 | Vassilis Papavassiliou | -bs "/var/www/html/elrc/test/output_test" &> "/var/www/html/elrc/test/log_test" |
37 | 1 | Prokopis Prokopidis | ``` |
38 | 151 | Vassilis Papavassiliou | |
39 | 151 | Vassilis Papavassiliou | Seed URLs (the contents of the `test-seeds` file passed with `-u`): |
40 | 151 | Vassilis Papavassiliou | |
41 | 151 | Vassilis Papavassiliou | ``` |
42 | 151 | Vassilis Papavassiliou | https://www.airbaltic.com/lv/bernu-atlaide |
43 | 151 | Vassilis Papavassiliou | https://www.airbaltic.com/lv/profila-registracija |
44 | 151 | Vassilis Papavassiliou | https://www.airbaltic.com/de/ermaessigung-kinder |
45 | 151 | Vassilis Papavassiliou | https://www.airbaltic.com/de/profil-erstellen |
46 | 151 | Vassilis Papavassiliou | https://www.airbaltic.com/en/child-discount |
47 | 151 | Vassilis Papavassiliou | https://www.airbaltic.com/en/create-account |
48 | 151 | Vassilis Papavassiliou | https://www.airbaltic.com/lt/child-discount |
49 | 151 | Vassilis Papavassiliou | https://www.airbaltic.com/lt/sukurti-paskira |
50 | 151 | Vassilis Papavassiliou | ``` |
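
A seed file is a plain text file with one URL per line. The following is a minimal sketch of creating such a file from the shell and checking how many seeds it contains. The relative path `test-seeds` is only an assumption for illustration; the command above reads the file from `/var/www/html/elrc/test-seeds`.

```shell
# Hypothetical local path for the seed file; adjust it to match the -u argument.
seeds_file="test-seeds"

# Write the eight seed URLs listed above, one per line.
cat > "$seeds_file" <<'EOF'
https://www.airbaltic.com/lv/bernu-atlaide
https://www.airbaltic.com/lv/profila-registracija
https://www.airbaltic.com/de/ermaessigung-kinder
https://www.airbaltic.com/de/profil-erstellen
https://www.airbaltic.com/en/child-discount
https://www.airbaltic.com/en/create-account
https://www.airbaltic.com/lt/child-discount
https://www.airbaltic.com/lt/sukurti-paskira
EOF

# Count the non-empty lines as a quick sanity check before starting a crawl.
seed_count=$(grep -c . "$seeds_file")
echo "$seed_count seed URLs in $seeds_file"
```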
51 | 151 | Vassilis Papavassiliou | |
52 | 151 | Vassilis Papavassiliou | |
53 | 152 | Vassilis Papavassiliou | ## Options |
54 | 152 | Vassilis Papavassiliou | The following list covers all command-line options of the ilsp-fc runnable jar. The options supported by each individual module are also documented separately for that module. |
55 | 152 | Vassilis Papavassiliou | |
56 | 152 | Vassilis Papavassiliou | ``` |
57 | 152 | Vassilis Papavassiliou | -a,--agentname <arg> Agent name to identify the person or the organization |
58 | 152 | Vassilis Papavassiliou | responsible for the crawl |
59 | 152 | Vassilis Papavassiliou | -align,--align_sentences <arg> Sentence align document pairs using this aligner (default is |
60 | 152 | Vassilis Papavassiliou | maligna) |
61 | 152 | Vassilis Papavassiliou | -bs,--basename <arg> Basename to be used in generating all files for easier |
62 | 152 | Vassilis Papavassiliou | content navigation |
63 | 152 | Vassilis Papavassiliou | -c,--crawlduration <arg> Maximum crawl duration in minutes |
64 | 152 | Vassilis Papavassiliou | -cc,--creative_commons Force the alignment process to generate a merged TMX with |
65 | 152 | Vassilis Papavassiliou | sentence alignments only from document pairs for which an |
66 | 152 | Vassilis Papavassiliou | open content license has been detected. |
67 | 152 | Vassilis Papavassiliou | -cfg,--config <arg> Path to the XML configuration file |
68 | 152 | Vassilis Papavassiliou | -crawl,--crawl Start a crawl |
69 | 152 | Vassilis Papavassiliou | -d,--stay_in_webdomain Force the monolingual crawler to stay in a specific web |
70 | 152 | Vassilis Papavassiliou | domain |
71 | 152 | Vassilis Papavassiliou | -dbg,--debug Use debug level for logging |
72 | 152 | Vassilis Papavassiliou | -dedup,--deduplicate Deduplicate and discard (near) duplicate documents |
73 | 152 | Vassilis Papavassiliou | -del,--delete_redundant_files Delete redundant crawled documents that have not been |
74 | 152 | Vassilis Papavassiliou | detected as members of a document pair |
75 | 152 | Vassilis Papavassiliou | -dest,--destination <arg> Path to a directory where the acquired/generated resources |
76 | 152 | Vassilis Papavassiliou | will be stored |
77 | 152 | Vassilis Papavassiliou | -doctypes,--doctypes <arg> When creating a merged TMX file, only use sentence alignments |
78 | 152 | Vassilis Papavassiliou | from document pairs that have been identified by specific |
79 | 152 | Vassilis Papavassiliou | methods, e.g. auidh. See the pdm option. |
80 | 152 | Vassilis Papavassiliou | -dom,--domain <arg> A descriptive title for the targeted domain |
81 | 152 | Vassilis Papavassiliou | -export,--export Export crawled documents to cesDoc XML files |
82 | 152 | Vassilis Papavassiliou | -f,--force Force a new crawl. Caution: This will remove any previously |
83 | 152 | Vassilis Papavassiliou | crawled data |
84 | 152 | Vassilis Papavassiliou | -filter,--fetchfilter <arg> Use this regex to force the crawler to crawl only in specific |
85 | 152 | Vassilis Papavassiliou | sub webdomains. Webpages with urls that do not match this |
86 | 152 | Vassilis Papavassiliou | regex will not be fetched. |
87 | 152 | Vassilis Papavassiliou | -h,--help This message |
88 | 152 | Vassilis Papavassiliou | -i,--inputdir <arg> Input directory for deduplication, pairdetection, or |
89 | 152 | Vassilis Papavassiliou | alignment |
90 | 152 | Vassilis Papavassiliou | -ifp,--image_urls Full image URLs (and not only their basenames) will be used |
91 | 152 | Vassilis Papavassiliou | in pair detection with common images |
92 | 152 | Vassilis Papavassiliou | -k,--keepboiler Keep and annotate boilerplate content in parsed text |
93 | 152 | Vassilis Papavassiliou | -l,--loggingAppender <arg> Logging appender (console, DRFA) to use |
94 | 152 | Vassilis Papavassiliou | -lang,--languages <arg> Two or three letter ISO code(s) of target language(s), e.g. |
95 | 152 | Vassilis Papavassiliou | el (for a monolingual crawl for Greek content) or en;el (for |
96 | 152 | Vassilis Papavassiliou | a bilingual crawl) |
97 | 152 | Vassilis Papavassiliou | -len,--length <arg> Minimum number of tokens per text block. Shorter text blocks |
98 | 152 | Vassilis Papavassiliou | will be annotated as "ooi-length" |
99 | 152 | Vassilis Papavassiliou | -mtlen,--minlength <arg> Minimum number of tokens in crawled documents (after |
100 | 152 | Vassilis Papavassiliou | boilerplate detection). Shorter documents will be discarded. |
101 | 152 | Vassilis Papavassiliou | -n,--numloops <arg> Maximum number of fetch/update loops |
102 | 152 | Vassilis Papavassiliou | -oxslt,--offline_xslt Apply an xsl transformation to generate html files during |
103 | 152 | Vassilis Papavassiliou | exporting. |
104 | 152 | Vassilis Papavassiliou | -p_r,--path_replacements <arg> Put the strings to be replaced, separated by ';'. This might |
105 | 152 | Vassilis Papavassiliou | be useful for crawling via the web service |
106 | 152 | Vassilis Papavassiliou | -pairdetect,--pair_detection Detect document pairs in crawled documents |
107 | 152 | Vassilis Papavassiliou | -pdm,--pair_detection_methods <arg> A string forcing the crawler to detect pairs using one or |
108 | 152 | Vassilis Papavassiliou | more specific methods: a (links between documents), u |
109 | 152 | Vassilis Papavassiliou | (patterns in urls), p (common images and similar digit |
110 | 152 | Vassilis Papavassiliou | sequences), i (common images), d (similar digit sequences), h, or m, or l |
111 | 152 | Vassilis Papavassiliou | (high/medium/low similarity of html structure) |
112 | 152 | Vassilis Papavassiliou | -segtypes,--segtypes <arg> When creating a merged TMX file, only use sentence alignments |
113 | 152 | Vassilis Papavassiliou | of specific types, e.g. 1:1 |
114 | 152 | Vassilis Papavassiliou | -storefilter,--storefilter <arg> Use this regex to force the crawler to store only webpages |
115 | 152 | Vassilis Papavassiliou | with urls that match this regex. |
116 | 152 | Vassilis Papavassiliou | -t,--threads <arg> Maximum number of fetcher threads to use |
117 | 152 | Vassilis Papavassiliou | -tc,--topic <arg> Path to a file with the topic definition |
118 | 152 | Vassilis Papavassiliou | -tmxmerge,--tmxmerge Merge aligned segments from each document pair into one tmx |
119 | 152 | Vassilis Papavassiliou | file |
120 | 155 | Vassilis Papavassiliou | -type,--type <arg> Crawl type: m (monolingual) or p (parallel) |
121 | 152 | Vassilis Papavassiliou | -u,--urls <arg> File with seed urls used to initialize the crawl |
122 | 152 | Vassilis Papavassiliou | -u_r,--url_replacements <arg> A string to be replaced, separated by ';'. |
123 | 152 | Vassilis Papavassiliou | ``` |
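
As a rough illustration of the `-filter` option, the sketch below checks which URLs match the regex used in the bilingual example. It relies on `grep -E -x` as a stand-in for the crawler's own regex matching, which is an assumption: the two may differ for more complex patterns. The `www.example.org` URL is hypothetical and is included only to show a URL that would not be fetched.

```shell
# Regex taken from the bilingual crawl example (-filter argument).
filter='.*www\.airbaltic\.com.*'

# Two seed URLs from the example plus one hypothetical off-site URL.
urls='https://www.airbaltic.com/en/child-discount
https://www.airbaltic.com/de/profil-erstellen
https://www.example.org/en/child-discount'

# Count the URLs that match the whole pattern (candidates for fetching).
kept_count=$(printf '%s\n' "$urls" | grep -E -x -c -- "$filter")

# Collect the URLs that do not match (these would be skipped).
dropped=$(printf '%s\n' "$urls" | grep -E -x -v -- "$filter")

echo "matching URLs: $kept_count"
echo "non-matching URLs: $dropped"
```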
124 | 151 | Vassilis Papavassiliou | |
125 | 134 | Vassilis Papavassiliou | |
126 | 130 | Prokopis Prokopidis | ## Other settings |
127 | 73 | Prokopis Prokopidis | |
128 | 102 | Vassilis Papavassiliou | Several settings that influence the crawling process can be defined in a configuration file before a crawl starts. The default configuration files for monolingual and bilingual crawls are [[FMC_config.xml]] and [[FBC_config.xml]] respectively. They are included in the ilsp-fc runnable jar. |
129 | 1 | Prokopis Prokopidis | |
130 | 40 | Prokopis Prokopidis | Some of the settings can also be overridden using options of the ilsp-fc runnable jar, as follows: |
131 | 139 | Vassilis Papavassiliou | |
132 | 146 | Vassilis Papavassiliou | -u_r : url_replacements. In addition to the default patterns, the user can add more patterns, separated by `;`. |
133 | 146 | Vassilis Papavassiliou | |
134 | 149 | Vassilis Papavassiliou | -ifp : image_fullpath. Use the full path of each image, rather than only its file name, to represent the image during pair detection. |
135 | 149 | Vassilis Papavassiliou | |
136 | 146 | Vassilis Papavassiliou | |
137 | 150 | Vassilis Papavassiliou | -doctypes : Defines the types of document pairs from which segment pairs are selected. The recommended value is "aupidh", since pairs of type "m" and "l" (e.g. eng-1_lav-3_m.xml or eng-2_lav-8_l.xml) are only used for testing or examining the tool. |
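
The file names above suggest that the last character before `.xml` encodes the method by which a pair was detected. Under that assumption, the sketch below shows which pair files a `-doctypes` value of "aupidh" would cover. The selection in ilsp-fc itself happens inside the tool; this loop only illustrates the naming convention, and `eng-4_lav-9_u.xml` is a hypothetical file name.

```shell
# Pair types to accept, as in the recommended -doctypes value.
doctypes="aupidh"

selected=""
for f in eng-1_lav-3_m.xml eng-2_lav-8_l.xml eng-4_lav-9_u.xml; do
  # Strip the extension, then keep what follows the last underscore.
  pair_type="${f%.xml}"
  pair_type="${pair_type##*_}"

  # Keep the file only if its type letter occurs in $doctypes.
  case "$doctypes" in
    *"$pair_type"*) selected="${selected:+$selected }$f" ;;
  esac
done

echo "selected pair files: $selected"
```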
138 | 1 | Prokopis Prokopidis | |
139 | 1 | Prokopis Prokopidis | |
140 | 135 | Vassilis Papavassiliou | [//]: # ( ## Input ) |
141 | 135 | Vassilis Papavassiliou | |
142 | 135 | Vassilis Papavassiliou | [//]: # (In case of general monolingual crawls the required input from the user is: ) |
143 | 135 | Vassilis Papavassiliou | [//]: # (* a list of seed URLs (i.e. a text file with one URL per text line). ) |
144 | 135 | Vassilis Papavassiliou | |
145 | 135 | Vassilis Papavassiliou | [//]: # (In case of focused monolingual crawls (i.e. when the crawler visits/processes/stores web pages that are related to a targeted domain), the input should include: ) |
146 | 135 | Vassilis Papavassiliou | [//]: # (* a list of seed URLs pointing to relevant web pages. An example seed URL list for _Environment_ in English can be found at [[ENV_EN_seeds.txt]]. ) |
147 | 135 | Vassilis Papavassiliou | [//]: # (* a list of term triplets (_<relevance,term,subtopic>_) that describe a domain (i.e. this list is required in case the user aims to acquire domain-specific documents) and, optionally, subcategories of this domain. An example domain definition can be found at [[ENV_EN_topic.txt]] for the _Environment_ domain in English. Details on how to construct/bootstrap such lists and how they are used in text to topic classification could be found at this paper http://www.aclweb.org/anthology/W13-2506.pdf ) |
148 | 135 | Vassilis Papavassiliou | |
149 | 135 | Vassilis Papavassiliou | [//]: # (In case of general bilingual crawling, the input from the user includes:) |
150 | 135 | Vassilis Papavassiliou | [//]: # (* a seed URL list which should contain URL(s) from only one web site (e.g. [[ENV_EN_ES_seed.txt]]). The crawler will follow only links pointing to pages inside this web site. However, the user could use the <code> filter </code> parameter (see below) to allow visiting only links pointing to pages either inside versions of the top domain of the URL (e.g. http://www.fifa.com/, http://es.fifa.com/ , etc.) or in different web sites (i.e. in cases the translations are in two web sites e.g. http://www.nrcan.gc.ca and http://www.rncan.gc.ca). Examples of seed URLs can be found at [[seed_examples.txt]]. ) |
151 | 135 | Vassilis Papavassiliou | |
152 | 135 | Vassilis Papavassiliou | [//]: # (In case of focused bilingual crawls, the input should also include: ) |
153 | 135 | Vassilis Papavassiliou | [//]: # (* a list of term triplets (_<relevance,term,subtopic>_) that describe a domain (i.e. this list is required in case the user aims to acquire domain-specific documents) and, optionally, subcategories of this domain in both the targeted languages (i.e. the union of the domain definition in each language). An example domain definition of _Environment_ for the English-Spanish pair can be found at [[ENV_EN_ES_topic.txt]].) |
154 | 135 | Vassilis Papavassiliou | |
155 | 135 | Vassilis Papavassiliou | [//]: # (## Language support ) |
156 | 135 | Vassilis Papavassiliou | |
157 | 135 | Vassilis Papavassiliou | [//]: # (For both monolingual and bilingual crawling, the set of currently supported languages comprises de, el, en, es, fr, ga, hr, it, ja, and pt. ) |
158 | 135 | Vassilis Papavassiliou | |
159 | 135 | Vassilis Papavassiliou | [//]: # (In order to add another language, a developer/user should: ) |
160 | 135 | Vassilis Papavassiliou | [//]: # (* verify that the targeted language is supported by the default language identifier (https://code.google.com/p/language-detection/) integrated in the ILSP-FC, ) |
161 | 135 | Vassilis Papavassiliou | [//]: # (* add a textline with proper content in the [[langKeys.txt]] file which is included in the ilsp-fc runnable jar, and) |
162 | 135 | Vassilis Papavassiliou | [//]: # (* add a proper analyser in the <code>gr.ilsp.fmc.utils.AnalyserFactory</code> class of the ilsp-fc source. ) |
163 | 135 | Vassilis Papavassiliou | |
164 | 135 | Vassilis Papavassiliou | |
165 | 135 | Vassilis Papavassiliou | |
166 | 135 | Vassilis Papavassiliou | [//]: # (## Run a monolingual crawl ) |
167 | 135 | Vassilis Papavassiliou | |
168 | 136 | Vassilis Papavassiliou | [//]: # ( <pre><code>java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar crawlandexport -a vpapa@ilsp.gr -cfg FMC_config.xml -type m -c 10 -lang en -of output_test1_list.txt -ofh output_test1_list.txt.html -tc ENV_EN_topic.txt -u ENV_EN_seeds.txt -f -k -dom Environment</code></pre> ) |
169 | 1 | Prokopis Prokopidis | |
170 | 136 | Vassilis Papavassiliou | [//]: # ( <pre><code>java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar crawlandexport -a test2 -f -k -type m -c 5 -lang es -of output_test2_list.txt -ofh output_test2_list.txt.html -u seed_examples.txt </code></pre> ) |
171 | 1 | Prokopis Prokopidis | |
172 | 136 | Vassilis Papavassiliou | [//]: # (## Run a bilingual crawl ) |
173 | 71 | Vassilis Papavassiliou | |
174 | 136 | Vassilis Papavassiliou | [//]: # ( <pre><code>java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar crawlandexport -a test3 -c 10 -f -k -l1 de -l2 it -of test_HS_DE-IT_output.txt -ofh test_HS_DE-IT_output.txt.html -tc HS_DE-IT_topic.txt -type p -u seed_examples.txt -cfg FBC_config.xml -dom HS -len 0 -mtlen 100 -xslt -oxslt</code></pre> ) |
175 | 1 | Prokopis Prokopidis | |
176 | 136 | Vassilis Papavassiliou | [//]: # ( <pre><code>java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar crawlandexport -a test4 -c 10 -f -k -l1 es -l2 en -type p -u seed_examples.txt -filter ".*uefa.com.*" -len 0 -mtlen 80 -xslt -oxslt -dest "/var/crawl_results/" -of test_U_ES-EN_output.txt -ofh test_U_ES-EN_output.txt.html -oft test_U_ES-EN_output.tmx.txt -ofth test_U_ES-EN_output.tmx.html -align hunalign -dict </code></pre> ) |
177 | 123 | Prokopis Prokopidis | |
178 | 136 | Vassilis Papavassiliou | [//]: # ( <pre><code>java -jar ilsp-fc-2.2-jar-with-dependencies.jar crawlandexport -f -a abumatran -type p -align maligna -l1 en -l2 fr -u seed_examples.txt -filter ".*(nrcan|rncan).*" -n 2 -xslt -oxslt -of output_demo_EN-FR.txt -ofh output_demo_EN-FR.txt.html -oft output_demo_EN-FR.tmx.txt -ofth output_demo_EN-FR.tmx.html </code></pre>) |
179 | 122 | Prokopis Prokopidis | |
180 | 136 | Vassilis Papavassiliou | [//]: # ( ## Output ) |
181 | 85 | Vassilis Papavassiliou | |
182 | 136 | Vassilis Papavassiliou | [//]: # (The output of the ilsp-fc in the case of a monolingual crawl consists of: ) |
183 | 136 | Vassilis Papavassiliou | [//]: # (* a list of links pointing to XML files following the cesDOC Corpus Encoding Standard (http://www.xces.org/). See this "cesDoc":http://nlp.ilsp.gr/xslt/ilsp-fc/1.xml file for an example in English for the _Environment_ domain. ) |
184 | 136 | Vassilis Papavassiliou | [//]: # (* a list of links pointing to HTML files (by XSL transformation of each XML) for easier browsing of the collection. As an example, see this "rendered cesDoc":http://nlp.ilsp.gr/xslt/ilsp-fc/1.xml.html file. ) |
185 | 71 | Vassilis Papavassiliou | |
186 | 136 | Vassilis Papavassiliou | [//]: # (The output of the ilsp-fc in the case of a bilingual crawl consists of: ) |
187 | 136 | Vassilis Papavassiliou | [//]: # (* a list of links to XML files following the cesAlign Corpus Encoding Standard for linking cesDoc documents. This example "cesAlign":http://nlp.ilsp.gr/xslt/ilsp-fc/44_98_i.xml file serves as a link between a detected pair of cesDoc documents in "English":http://nlp.ilsp.gr/xslt/ilsp-fc/98.xml and "Spanish":http://nlp.ilsp.gr/xslt/ilsp-fc/44.xml.) |
188 | 136 | Vassilis Papavassiliou | [//]: # (* a list of links pointing to HTML files (by XSL transformation of each cesAlign XML) for easier browsing of the collection. As an example, see this "rendered cesAlign":http://nlp.ilsp.gr/xslt/ilsp-fc/44_98_i.xml.html file.) |
189 | 136 | Vassilis Papavassiliou | [//]: # (* a list of links to TMX files containing sentence alignments that have been extracted from the detected document pairs. As an example, see this "TMX":http://nlp.ilsp.gr/xslt/ilsp-fc/44_98_i.tmx file.) |
190 | 136 | Vassilis Papavassiliou | [//]: # (* a list of links pointing to HTML files (by XSL transformation of each TMX) for easier browsing of the collection. As an example, see this "rendered TMX":http://nlp.ilsp.gr/xslt/ilsp-fc/44_98_i.html file.) |
191 | 137 | Vassilis Papavassiliou | |
192 | 137 | Vassilis Papavassiliou | |
193 | 137 | Vassilis Papavassiliou | ## Running modules of the ILSP-FC |
194 | 137 | Vassilis Papavassiliou | |
195 | 137 | Vassilis Papavassiliou | The ILSP-FC, in a configuration for acquiring parallel data, applies the following processes (one after the other): |
196 | 137 | Vassilis Papavassiliou | * [[Crawl|Crawl]] |
197 | 137 | Vassilis Papavassiliou | * [[Export|Export]] |
198 | 137 | Vassilis Papavassiliou | * [[NearDeduplication|Near Deduplication]] |
199 | 137 | Vassilis Papavassiliou | * [[PairDetection|Pair Detection]] |
200 | 137 | Vassilis Papavassiliou | * [[SegmentAlignment|Segment Alignment]] |
201 | 137 | Vassilis Papavassiliou | * [[TMXmerging|TMX Merging]] |