Getting Started » History » Version 157

Vassilis Papavassiliou, 2016-05-31 03:59 PM

# Getting Started

Once you [[DeveloperSetup|build]] or [[HowToGet|download]] an ilsp-fc runnable jar, you can run it like this:

<pre><code>java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar</code></pre>

## Examples of running monolingual crawls

* Given a seed URL list [[ENV_EN_seeds.txt]], the following example crawls the web for 5 minutes and constructs a collection containing English web pages.

```
java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar \
-crawl -export -dedup -a test -f -type m -c 5 -lang en -k -u ENV_EN_seeds.txt -oxslt \
-dest crawlResults -bs "output_test"
```

In this and the other example commands in this documentation, a `log4j.xml` file sets the logging configuration. An example `log4j.xml` file can be downloaded from [[Log4j_xml|here]].

* Given a seed URL list [[ENV_EN_seeds.txt]] and a topic definition for the _Environment_ domain in English [[ENV_EN_topictxt|ENV_EN_topic.txt]], the following example crawls the web for 10 cycles and constructs a collection containing English web pages related to this domain.

```
java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar \
-crawl -export -dedup -a test1 -f -type m -n 10 -lang en -k -u ENV_EN_seeds.txt -oxslt \
-tc ENV_EN_topic.txt -dom Environment -dest crawlResults -bs "output-test"
```

## Example of running bilingual crawls

This is a test example to verify that the whole workflow (crawl, export, deduplication, pair detection, alignment) works successfully.

```
java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar /opt/ilsp-fc/ilsp-fc-2.2.3-SNAPSHOT-jar-with-dependencies.jar \
-crawl -export -dedup -pairdetect -align -tmxmerge -f -k -oxslt -type p -n 1 -t 20 -len 0 -mtlen 100 \
-pdm "aupdih" -segtypes "1:1" -lang "eng;lt;deu;lv" -a test -filter ".*www\.airbaltic\.com.*" \
-u "/var/www/html/elrc/test-seeds" -dest "/var/www/html/elrc/test" \
-bs "/var/www/html/elrc/test/output_test" &> "/var/www/html/elrc/test/log_test"
```

Seed URLs:

```
https://www.airbaltic.com/lv/bernu-atlaide
https://www.airbaltic.com/lv/profila-registracija
https://www.airbaltic.com/de/ermaessigung-kinder
https://www.airbaltic.com/de/profil-erstellen
https://www.airbaltic.com/en/child-discount
https://www.airbaltic.com/en/create-account
https://www.airbaltic.com/lt/child-discount
https://www.airbaltic.com/lt/sukurti-paskira
```
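
The seed list above is a plain text file with one URL per line, and the `-filter` regex in the example command decides which URLs may be fetched. As a minimal sketch, the file can be written with a here-document and checked against that regex before a crawl is started. The temporary file location and the `grep -E` pre-check are assumptions for illustration and not part of ilsp-fc; the crawler evaluates the regex itself, and `grep -E` only approximates that for simple patterns such as this one.

```shell
set -eu

# Hypothetical location; the example command reads /var/www/html/elrc/test-seeds.
seed_file="$(mktemp)"

# One seed URL per line, as in the list above.
cat > "$seed_file" <<'EOF'
https://www.airbaltic.com/lv/bernu-atlaide
https://www.airbaltic.com/lv/profila-registracija
https://www.airbaltic.com/de/ermaessigung-kinder
https://www.airbaltic.com/de/profil-erstellen
https://www.airbaltic.com/en/child-discount
https://www.airbaltic.com/en/create-account
https://www.airbaltic.com/lt/child-discount
https://www.airbaltic.com/lt/sukurti-paskira
EOF

# Same regex as the -filter argument of the example command.
filter='.*www\.airbaltic\.com.*'

# Count non-empty lines, then lines that match the filter as a whole (-x).
total=$(grep -c . "$seed_file")
matching=$(grep -cxE "$filter" "$seed_file" || true)

echo "seed URLs: $total, matching filter: $matching"
```

If the two counts differ, some seeds would be rejected by the filter and are worth reviewing before the crawl.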

## Options
Several options control the processes that are applied. Besides the comprehensive list below, you can also consult the options supported by each individual module.

```
 -a,--agentname <arg>                  Agent name to identify the person or the organization
                                       responsible for the crawl
 -align,--align_sentences <arg>        Sentence align document pairs using this aligner (default is
                                       maligna)
 -bs,--basename <arg>                  Basename to be used in generating all files for easier
                                       content navigation
 -c,--crawlduration <arg>              Maximum crawl duration in minutes
 -cc,--creative_commons                Force the alignment process to generate a merged TMX with
                                       sentence alignments only from document pairs for which an
                                       open content license has been detected.
 -cfg,--config <arg>                   Path to the XML configuration file
 -crawl,--crawl                        Start a crawl
 -d,--stay_in_webdomain                Force the monolingual crawler to stay in a specific web
                                       domain
 -dbg,--debug                          Use debug level for logging
 -dedup,--deduplicate                  Deduplicate and discard (near) duplicate documents
 -del,--delete_redundant_files         Delete redundant crawled documents that have not been
                                       detected as members of a document pair
 -dest,--destination <arg>             Path to a directory where the acquired/generated resources
                                       will be stored
 -doctypes,--pairDetectMethods <arg>   When creating a merged TMX file, only use sentence alignments
                                       from document pairs that have been identified by specific
                                       methods, e.g. auidh. See the pdm option.
 -dom,--domain <arg>                   A descriptive title for the targeted domain
 -export,--export                      Export crawled documents to cesDoc XML files
 -f,--force                            Force a new crawl. Caution: This will remove any previously
                                       crawled data
 -filter,--fetchfilter <arg>           Use this regex to force the crawler to crawl only in specific
                                       sub webdomains. Webpages with urls that do not match this
                                       regex will not be fetched.
 -h,--help                             This message
 -i,--inputdir <arg>                   Input directory for deduplication, pairdetection, or
                                       alignment
 -ifp,--image_urls                     Full image URLs (and not only their basenames) will be used
                                       in pair detection with common images
 -k,--keepboiler                       Keep and annotate boilerplate content in parsed text
 -l,--loggingAppender <arg>            Logging appender (console, DRFA) to use
 -lang,--languages <arg>               Two or three letter ISO code(s) of target language(s), e.g.
                                       el (for a monolingual crawl for Greek content) or en;el (for
                                       a bilingual crawl)
 -len,--length <arg>                   Minimum number of tokens per text block. Shorter text blocks
                                       will be annotated as "ooi-length"
 -mtlen,--minlength <arg>              Minimum number of tokens in crawled documents (after
                                       boilerplate detection). Shorter documents will be discarded.
 -n,--numloops <arg>                   Maximum number of fetch/update loops
 -oxslt,--offline_xslt                 Apply an xsl transformation to generate html files during
                                       exporting.
 -p_r,--path_replacements <arg>        Put the strings to be replaced, separated by ';'. This might
                                       be useful for crawling via the web service
 -pairdetect,--pair_detection          Detect document pairs in crawled documents
 -pdm,--pair_detection_methods <arg>   A string forcing the crawler to detect pairs using one or
                                       more specific methods: a (links between documents), u
                                       (patterns in urls), p (common images and similar digit
                                       sequences), i (common images), d (similar digit sequences),
                                       h, or m, or l (high/medium/low similarity of html structure)
 -segtypes,--segtypes <arg>            When creating a merged TMX file, only use sentence alignments
                                       of specific types, e.g. 1:1
 -storefilter,--storefilter <arg>      Use this regex to force the crawler to store only webpages
                                       with urls that match this regex.
 -t,--threads <arg>                    Maximum number of fetcher threads to use
 -tc,--topic <arg>                     Path to a file with the topic definition
 -tmxmerge,--tmxmerge                  Merge aligned segments from each document pair into one tmx
                                       file
 -type,--type <arg>                    Crawl type: m (monolingual) or p (parallel)
 -u,--urls <arg>                       File with seed urls used to initialize the crawl
 -u_r,--url_replacements <arg>         A string to be replaced, separated by ';'.
```
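
The options are combined on a single command line, which quickly becomes long. The following is a minimal sketch of assembling the first monolingual example from this page step by step and printing it for review instead of running it. The variable names are hypothetical, the jar name keeps the `X.Y.Z` placeholder, and only options from the list above are used.

```shell
set -eu

# Hypothetical values; adjust them to your own setup.
jar="ilsp-fc-X.Y.Z-jar-with-dependencies.jar"   # placeholder version, as in the examples
log4j="/opt/ilsp-fc/log4j.xml"
seeds="ENV_EN_seeds.txt"
dest="crawlResults"
base="output_test"
minutes=5

# Build the argument list in the positional parameters, one group at a time.
set -- java "-Dlog4j.configuration=file:${log4j}" -jar "$jar"
set -- "$@" -crawl -export -dedup    # processes to apply
set -- "$@" -a test -f               # agent name; -f removes previously crawled data
set -- "$@" -type m -lang en         # monolingual crawl for English
set -- "$@" -c "$minutes"            # maximum crawl duration in minutes
set -- "$@" -k -oxslt                # keep boilerplate; generate html files during export
set -- "$@" -u "$seeds" -dest "$dest" -bs "$base"

# Dry run: print one argument per line.
# To start the crawl, replace the next line with: "$@"
printf '%s\n' "$@"
```

Keeping each group of options on its own line makes it easier to spot a missing separator between two options, which is easy to overlook in a command wrapped with trailing backslashes.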

## Other settings

Several settings influence the crawling process and can be defined in a configuration file before the crawl starts. The default configuration files for monolingual and bilingual crawls are [[FMC_config.xml]] and [[FBC_config.xml]] respectively. They are included in the ilsp-fc runnable jar.

Some of the settings can also be overridden using options of the ilsp-fc runnable jar, as follows:

* `-storefilter`: a regular expression. Webpages with URLs that do NOT match this regex are visited, fetched and processed, but not stored.
* `-u_r` (url_replacements): besides the default patterns, the user can add more patterns, separated by `;`.
* `-ifp` (image_fullpath): in pair detection, represent each image by its full path instead of its name only.
* `-doctypes`: defines the types of the document pairs from which the segment pairs will be selected. The recommended value is "aupidh", since pairs of type "m" and "l" (e.g. eng-1_lav-3_m.xml or eng-2_lav-8_l.xml) are only used for testing or examining the tool.

[//]: # ( ## Input )
[//]: # (In case of general monolingual crawls the required input from the user is: )
[//]: # (* a list of seed URLs (i.e. a text file with one URL per text line). )
[//]: # (In case of focused monolingual crawls (i.e. when the crawler visits/processes/stores web pages that are related to a targeted domain), the input should include: ) 
[//]: # (* a list of seed URLs pointing to relevant web pages. An example seed URL list for _Environment_ in English can be found at [[ENV_EN_seeds.txt]]. )
[//]: # (* a list of term triplets (_<relevance,term,subtopic>_) that describe a domain (i.e. this list is required in case the user aims to acquire domain-specific documents) and, optionally, subcategories of this domain. An example domain definition can be found at [[ENV_EN_topic.txt]] for the _Environment_ domain in English. Details on how to construct/bootstrap such lists and how they are used in text to topic classification could be found at this paper http://www.aclweb.org/anthology/W13-2506.pdf )
[//]: # (In case of general bilingual crawling, the input from the user includes:)
[//]: # (* a seed URL list which should contain URL(s) from only one web site (e.g. [[ENV_EN_ES_seed.txt]]). The crawler will follow only links pointing to pages inside this web site. However, the user could use the <code> filter </code> parameter (see below) to allow visiting only links pointing to pages either inside versions of the top domain of the URL (e.g. http://www.fifa.com/,  http://es.fifa.com/ , etc.) or in different web sites (i.e. in cases the translations are in two web sites e.g. http://www.nrcan.gc.ca and http://www.rncan.gc.ca). Examples of seed URLs can be found at [[seed_examples.txt]]. )
[//]: # (In case of focused bilingual crawls, the input should also include: )
[//]: # (* a list of term triplets (_<relevance,term,subtopic>_) that describe a domain (i.e. this list is required in case the user aims to acquire domain-specific documents) and, optionally, subcategories of this domain in both the targeted languages (i.e. the union of the domain definition in each language). An example domain definition of  _Environment_ for the English-Spanish pair can be found at [[ENV_EN_ES_topic.txt]].)
[//]: # (## Language support )
[//]: # (For both monolingual and bilingual crawling, the set of currently supported languages comprises de, el, en, es, fr, ga, hr, it, ja, and pt. )
[//]: # (In order to add another language, a developer/user should: )
[//]: # (* verify that the targeted language is supported by the default language identifier (https://code.google.com/p/language-detection/) integrated in the ILSP-FC, )
[//]: # (* add a textline with proper content in the [[langKeys.txt]] file which is included in the ilsp-fc runnable jar, and)
[//]: # (* add a proper analyser in the <code>gr.ilsp.fmc.utils.AnalyserFactory</code> class of the ilsp-fc source.  )
[//]: # (## Run a monolingual crawl )
[//]: # ( <pre><code>java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar crawlandexport -a vpapa@ilsp.gr -cfg FMC_config.xml -type m -c 10 -lang en -of output_test1_list.txt -ofh output_test1_list.txt.html -tc ENV_EN_topic.txt  -u ENV_EN_seeds.txt -f -k -dom Environment</code></pre> )
[//]: # ( <pre><code>java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar crawlandexport -a test2 -f -k -type m -c 5 -lang es -of output_test2_list.txt -ofh output_test2_list.txt.html -u seed_examples.txt  </code></pre> )
[//]: # (## Run a bilingual crawl )
[//]: # ( <pre><code>java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar crawlandexport -a test3 -c 10 -f -k -l1 de -l2 it -of test_HS_DE-IT_output.txt -ofh test_HS_DE-IT_output.txt.html -tc HS_DE-IT_topic.txt -type p -u seed_examples.txt -cfg FBC_config.xml -dom HS -len 0 -mtlen 100 -xslt -oxslt</code></pre> )
[//]: # ( <pre><code>java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar crawlandexport -a test4 -c 10 -f -k -l1 es -l2 en -type p -u seed_examples.txt -filter ".*uefa.com.*" -len 0 -mtlen 80 -xslt -oxslt -dest "/var/crawl_results/" -of test_U_ES-EN_output.txt -ofh test_U_ES-EN_output.txt.html -oft test_U_ES-EN_output.tmx.txt -ofth test_U_ES-EN_output.tmx.html -align  hunalign -dict </code></pre> )
[//]: # ( <pre><code>java -jar ilsp-fc-2.2-jar-with-dependencies.jar crawlandexport -f -a abumatran -type p -align maligna -l1 en -l2 fr -u seed_examples.txt -filter ".*(nrcan|rncan).*" -n 2 -xslt -oxslt -of output_demo_EN-FR.txt -ofh output_demo_EN-FR.txt.html -oft output_demo_EN-FR.tmx.txt -ofth output_demo_EN-FR.tmx.html </code></pre>)
[//]: # ( ## Output )
[//]: # (The output of the ilsp-fc in the case of a monolingual crawl consists of: )
[//]: # (* a list of links pointing to XML files following the cesDOC Corpus Encoding Standard (http://www.xces.org/). See this "cesDoc":http://nlp.ilsp.gr/xslt/ilsp-fc/1.xml file for an example in English for the _Environment_ domain. )
[//]: # (* a list of links pointing to HTML files (by XSL transformation of each XML) for easier browsing of the collection. As an example, see this "rendered cesDoc":http://nlp.ilsp.gr/xslt/ilsp-fc/1.xml.html file. )
[//]: # (The output of the ilsp-fc in the case of a bilingual crawl consists of: )
[//]: # (* a list of links to XML files following the cesAlign Corpus Encoding Standard for linking cesDoc documents. This example "cesAlign":http://nlp.ilsp.gr/xslt/ilsp-fc/44_98_i.xml file serves as a link between a detected pair of cesDoc documents in "English":http://nlp.ilsp.gr/xslt/ilsp-fc/98.xml and "Spanish":http://nlp.ilsp.gr/xslt/ilsp-fc/44.xml.)
[//]: # (* a list of links pointing to HTML files (by XSL transformation of each cesAlign XML) for easier browsing of the collection. As an example, see this "rendered cesAlign":http://nlp.ilsp.gr/xslt/ilsp-fc/44_98_i.xml.html file.)
[//]: # (* a list of links to TMX files containing sentence alignments that have been extracted from the detected document pairs. As an example, see this "TMX":http://nlp.ilsp.gr/xslt/ilsp-fc/44_98_i.tmx file.)
[//]: # (* a list of links pointing to HTML files (by XSL transformation of each TMX) for easier browsing of the collection. As an example, see this "rendered TMX":http://nlp.ilsp.gr/xslt/ilsp-fc/44_98_i.html file.)

## Running modules of the ILSP-FC

The ILSP-FC, in a configuration for acquiring parallel data, applies the following processes (one after the other):
* [[Crawl|Crawl]]
* [[Export|Export]]
* [[NearDeduplication|Near Deduplication]]
* [[PairDetection|Pair Detection]]
* [[SegmentAlignment|Segment Alignment]]
* [[TMXmerging|TMX Merging]]
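
Since the processes run one after the other, one way to stop after a given stage is to pass the flag of that stage together with the flags of all earlier stages. The following is a minimal sketch under that assumption, using the flags from the bilingual example on this page (`-crawl -export -dedup -pairdetect -align -tmxmerge`). The function name `flags_up_to` is hypothetical and not part of ilsp-fc; the pages of the individual modules remain the reference for running a module on its own.

```shell
set -eu

# Print the process flags for all stages up to and including the given one.
# The stage name is the flag without its leading dash, e.g. "pairdetect".
flags_up_to() {
  last="$1"
  out=""
  for flag in -crawl -export -dedup -pairdetect -align -tmxmerge; do
    out="${out:+$out }$flag"
    if [ "$flag" = "-$last" ]; then
      printf '%s\n' "$out"
      return 0
    fi
  done
  echo "unknown stage: $last" >&2
  return 1
}

flags_up_to pairdetect   # prints: -crawl -export -dedup -pairdetect
```

The printed flags can then be added to a `java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar` command along with the remaining options, such as `-type p`, `-lang`, `-u` and `-dest`.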