Getting Started » History » Version 148

Vassilis Papavassiliou, 2016-02-16 07:53 PM

# Getting Started

Once you [[DeveloperSetup|build]] or [[HowToGet|download]] an ilsp-fc runnable jar, you can run it like this:

<pre><code>java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar</code></pre>

## Examples of running monolingual crawls

* Given a seed URL list [[ENV_EN_seeds.txt]], the following example crawls the web for 5 minutes and constructs a collection containing English web pages.

```
java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar \
-crawl -export -dedup -a test -f -type m -c 5 -lang en -k -u ENV_EN_seeds.txt -xslt -oxslt \
-dest crawlResults -of output-test-list.txt -ofh output-test-list.txt.html
```

In this and the other example commands in this documentation, a `log4j.xml` file is used to set logging configuration details. An example `log4j.xml` file can be downloaded from [[Log4j_xml|here]].
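
The seed list passed with `-u` is a plain text file with one URL per line. A minimal sketch for creating such a file from the shell follows; the two URLs and the file name `my_seeds.txt` are placeholders, not files shipped with ilsp-fc, and `count_seeds` is a hypothetical helper.

```shell
# Write a seed list with one URL per line (placeholder URLs).
seed_file="my_seeds.txt"
cat > "$seed_file" <<'EOF'
http://www.example.org/environment/
http://www.example.org/en/climate.html
EOF

# Count the lines that look like http(s) URLs.
count_seeds() {
  grep -c -E '^https?://' "$1"
}

count_seeds "$seed_file"
```

The resulting file can then be given to the crawler with `-u my_seeds.txt`.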

* Given a seed URL list [[ENV_EN_seeds.txt]] and a topic definition for the _Environment_ domain in English [[ENV_EN_topictxt|ENV_EN_topic.txt]], the following example crawls the web for 10 cycles and constructs a collection containing English web pages related to this domain.

```
java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar \
-crawl -export -dedup -a test1 -f -type m -n 10 -lang en -k -u ENV_EN_seeds.txt -xslt -oxslt \
-tc ENV_EN_topic.txt -dom Environment -dest crawlResults -of output-test1-list.txt -ofh output-test1-list.txt.html
```

## Example of running bilingual crawls

```
java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar /opt/ilsp-fc/ilsp-fc-X.Y.Z-jar-with-dependencies.jar \
-crawl -export -dedup -pairdetect -align -tmxmerge -f -k -xslt -oxslt -type p -n 10 -t 20 -len 0 -mtlen 80 \
-lang "en;es" -doctypes "auidh" -segtypes "1:1" -a test -u ENV_EN_ES_seed.txt \
-dest "crawlResults" -of "output_xml_list.txt" -ofh "output_xml_list.html" \
-oft "output_tmx_list.tmx.txt" -ofth "output_tmx_list.tmx.html" -tmx "output.tmx" -metadata
```

## Other settings

Several settings influence the crawling process and can be defined in a configuration file before the crawl starts. The default configuration files for monolingual and bilingual crawls are [[FMC_config.xml]] and [[FBC_config.xml]] respectively. They are included in the ilsp-fc runnable jar.

Some of the settings can also be overridden using options of the ilsp-fc runnable jar, as follows:

-crawl	:	Runs the crawling process.

-f	:	Forces the crawler to start a new job.

-type	:	The type of crawl: monolingual (m) or parallel (p).

-lang	:	The ISO codes of the targeted languages, separated by ";".

-cfg	:	The full path to a configuration file that can be used to override default parameters.

-a	:	User agent name. For bilingual crawls, a name similar to the targeted site is suggested.

-u	:	The full path of a text file containing the seed URLs that will initialize the crawler. For bilingual crawls, the list should contain the URL of the main page of the targeted website, or other URLs of this website.

-filter	:	A regular expression; URLs which do NOT match this regex are filtered out. This filter makes the crawler focus on a specific web domain (e.g. ".*ec.europa.eu.*"), on a part of a web domain (e.g. ".*/legislation_summaries/environment.*"), or on different web sites (e.g. when the translations are on two web sites such as http://www.nrcan.gc.ca and http://www.rncan.gc.ca). Note that if this filter is used, only the seed URLs that match this regex will be fetched.
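
The effect of `-filter` can be previewed on a list of URLs before starting a crawl. The sketch below uses `grep -E -x` as a stand-in for the crawler's regex matching, on the assumption that the pattern has to match the whole URL (which the leading and trailing `.*` in the example patterns suggest). Regex dialects differ, so treat it as an approximation. `keep_matching_urls` is a hypothetical helper and the URLs are made up.

```shell
# Read URLs on stdin and print only those that match the filter regex;
# all other URLs would be filtered out.
# Assumption: whole-URL matching, approximated here with grep -E -x.
keep_matching_urls() {
  grep -E -x -- "$1"
}

printf '%s\n' \
  'http://ec.europa.eu/environment/index_en.htm' \
  'http://www.example.org/news.html' \
  'http://ec.europa.eu/legislation_summaries/environment/index_es.htm' |
  keep_matching_urls '.*ec.europa.eu.*'
```

Only the two ec.europa.eu URLs are printed. Running the same check on the seed file shows which seeds would still be fetched with the filter in place.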

-n	:	The crawl duration in cycles. Since the crawler runs in cycles (during which links stored at the top of the crawler’s frontier are extracted and new links are examined), it is suggested to use this parameter either for testing purposes or with a large number (e.g. 100) to "verify" that the crawler will visit the entire website.

-c	:	The crawl duration in minutes. Since the crawler runs in cycles (during which links stored at the top of the crawler’s frontier are extracted and new links are examined), it is very likely that the defined time will expire during a cycle run. In that case, the crawler stops only after the running cycle ends.
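
The interaction between the time limit and the cycles can be illustrated with a small simulation. This is only a sketch of the stopping rule described for `-c`, not the crawler's code: the function name and the per-cycle durations are invented, and the rule assumed is that the limit is checked between cycles, so a cycle that is already running always completes.

```shell
# Simulate a crawl limited to $1 minutes; the remaining arguments are the
# durations (in minutes) of successive cycles.
# Prints the number of cycles run and the total elapsed minutes.
simulate_time_limit() {
  limit="$1"; shift
  elapsed=0
  cycles=0
  for duration in "$@"; do
    # The limit is only checked before a new cycle starts.
    if [ "$elapsed" -ge "$limit" ]; then
      break
    fi
    elapsed=$((elapsed + duration))
    cycles=$((cycles + 1))
  done
  echo "$cycles cycles, $elapsed minutes"
}

simulate_time_limit 5 2 2 3 4
```

With a 5 minute limit, the third cycle starts at minute 4 and finishes at minute 7, so the call prints `3 cycles, 7 minutes`: the crawl overruns the limit by the remainder of the running cycle.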

-dest	:	The directory where the results (i.e. the crawled data) will be stored. The tool creates the file structure dest/agent/crawl-id (where dest and agent stand for the arguments of the -dest and -a parameters respectively, and crawl-id is generated automatically). In this directory, the tool creates the "run" directories (i.e. directories containing all resources fetched/extracted/used/required for each cycle of this crawl). In addition, a pdf directory for storing acquired pdf files is created.
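
Since crawl-id is generated automatically, a script usually has to look up the directory of the latest crawl. The following is a minimal sketch that assumes only the dest/agent/crawl-id layout described for `-dest`. `latest_crawl_dir` is a hypothetical helper, and choosing the most recently modified directory is an assumption about how to pick among several crawls.

```shell
# Usage: latest_crawl_dir DEST AGENT
# Prints the most recently modified crawl-id directory under DEST/AGENT,
# or nothing if there is none.
latest_crawl_dir() {
  newest=""
  for dir in "$1/$2"/*/; do
    [ -d "$dir" ] || continue   # the glob did not match any directory
    if [ -z "$newest" ] || [ "$dir" -nt "$newest" ]; then
      newest="$dir"
    fi
  done
  if [ -n "$newest" ]; then
    printf '%s\n' "$newest"
  fi
}

latest_crawl_dir crawlResults test
```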

-t	:	The number of threads that will be used to fetch web pages in parallel.

-k	:	Forces the crawler to annotate boilerplate content in parsed text.

-len	:	Minimum number of tokens per paragraph. If the length (in tokens) of a paragraph is less than this value, the paragraph will be annotated as "out of interest" and will not be included in the clean text of the web page.

-mtlen	:	Minimum number of tokens in the cleaned document. If the length (in tokens) of the cleaned text is less than this value, the document will not be stored.
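
How the two thresholds combine can be sketched with awk. The sketch treats each input line as a paragraph and counts whitespace-separated words as tokens; both are simplifying assumptions, since the crawler's own paragraph segmentation and tokenization are not described here. `clean_text` is a hypothetical helper, not part of ilsp-fc.

```shell
# Usage: clean_text LEN MTLEN < paragraphs.txt
# Keeps paragraphs with at least LEN tokens, then prints the kept text only
# if it has at least MTLEN tokens in total (otherwise prints nothing).
clean_text() {
  awk -v len="$1" -v mtlen="$2" '
    NF > 0 && NF >= len { kept[++n] = $0; total += NF }
    END {
      if (total >= mtlen) {
        for (i = 1; i <= n; i++) print kept[i]
      }
    }'
}

printf '%s\n' \
  'Home' \
  'Air quality in European cities has improved over the last decade.' \
  'Contact us' \
  'Emissions from road transport remain the main source of nitrogen dioxide.' |
  clean_text 5 20
```

Here the two short navigation lines fall below the paragraph threshold, and the two remaining paragraphs (11 tokens each) together pass the document threshold of 20. With a document threshold of 80, as in the bilingual example, nothing would be printed.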

-tc	:	The full path of the topic file (a text file that contains a list of term triplets that describe the targeted topic). An example domain definition of "Environment" for the English-Spanish pair can be found at http://nlp.ilsp.gr/redmine/projects/ilsp-fc/wiki/ENV_EN_ES_topic. If omitted, the crawl will be a "general" one (i.e. the module for text-to-domain classification will not be used).

-dom	:	Title of the targeted domain (required when a domain definition, i.e. the -tc parameter, is used).

-storefilter	:	A regular expression to discard (i.e. visit/fetch/process but do not store) web pages with URLs which do NOT match this regex.

-d	:	Forces the crawler to stay in a web site (i.e. it starts from a web site and extracts only links to pages inside the same web site). It should be used only for monolingual crawling.

-export	:	Runs the export process.

-of	:	The full path of a text file containing a list of the full paths of the exported cesDoc or cesAlign files.

-xslt	:	If present, inserts a stylesheet for rendering XML results as HTML.

-oxslt	:	If present, exports crawl results with the help of an XSLT file for easier examination of results.

-ofh	:	The full path of an HTML file containing a list of links pointing to HTML files (produced by XSL transformation of each XML) for easier browsing of the collection.

-dedup	:	Runs (near) deduplication.

-pairdetect	:	Runs identification of candidate parallel documents.

-meth	:	Methods to be used for pair detection, given as a string which contains a for checking links, u for checking URLs for patterns, p for combining common images and digits, i for using common images, d for examining digit sequences, s for examining structures.
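
Because `-meth` takes a string of one-letter codes, a wrapper script can check the value before launching a long crawl. A minimal sketch follows; `valid_meth` is a hypothetical helper, and it assumes that the six letters listed for `-meth` are the only accepted codes, which this page does not state explicitly.

```shell
# Succeeds if the argument is a non-empty string made only of the letters
# a, u, p, i, d and s (assumed to be the complete set of method codes).
valid_meth() {
  case "$1" in
    ''|*[!aupids]*) return 1 ;;
    *) return 0 ;;
  esac
}

if valid_meth "aupids"; then
  echo "method string accepted"
fi
```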

-u_r	:	URL replacements (url_replacements). Besides the default patterns, the user can add more patterns, separated by ";".

-align	:	Runs segment alignment.

-oft	:	The full path of a text file containing a list of the full paths of the generated TMX files.

-ofth	:	The full path of an HTML file containing a list of links pointing to the transformed versions of the generated TMX files.

[//]: # ( ## Input )

[//]: # (In case of general monolingual crawls the required input from the user is: )
[//]: # (* a list of seed URLs (i.e. a text file with one URL per text line). )

[//]: # (In case of focused monolingual crawls (i.e. when the crawler visits/processes/stores web pages that are related to a targeted domain), the input should include: ) 
[//]: # (* a list of seed URLs pointing to relevant web pages. An example seed URL list for _Environment_ in English can be found at [[ENV_EN_seeds.txt]]. )
[//]: # (* a list of term triplets (_<relevance,term,subtopic>_) that describe a domain (i.e. this list is required in case the user aims to acquire domain-specific documents) and, optionally, subcategories of this domain. An example domain definition can be found at [[ENV_EN_topic.txt]] for the _Environment_ domain in English. Details on how to construct/bootstrap such lists and how they are used in text to topic classification could be found at this paper http://www.aclweb.org/anthology/W13-2506.pdf )

[//]: # (In case of general bilingual crawling, the input from the user includes:)
[//]: # (* a seed URL list which should contain URL(s) from only one web site (e.g. [[ENV_EN_ES_seed.txt]]). The crawler will follow only links pointing to pages inside this web site. However, the user could use the <code> filter </code> parameter (see below) to allow visiting only links pointing to pages either inside versions of the top domain of the URL (e.g. http://www.fifa.com/,  http://es.fifa.com/ , etc.) or in different web sites (i.e. in cases the translations are in two web sites e.g. http://www.nrcan.gc.ca and http://www.rncan.gc.ca). Examples of seed URLs can be found at [[seed_examples.txt]]. )

[//]: # (In case of focused bilingual crawls, the input should also include: )
[//]: # (* a list of term triplets (_<relevance,term,subtopic>_) that describe a domain (i.e. this list is required in case the user aims to acquire domain-specific documents) and, optionally, subcategories of this domain in both the targeted languages (i.e. the union of the domain definition in each language). An example domain definition of  _Environment_ for the English-Spanish pair can be found at [[ENV_EN_ES_topic.txt]].)

[//]: # (## Language support )

[//]: # (For both monolingual and bilingual crawling, the set of currently supported languages comprises de, el, en, es, fr, ga, hr, it, ja, and pt. )

[//]: # (In order to add another language, a developer/user should: )
[//]: # (* verify that the targeted language is supported by the default language identifier (https://code.google.com/p/language-detection/) integrated in the ILSP-FC, )
[//]: # (* add a textline with proper content in the [[langKeys.txt]] file which is included in the ilsp-fc runnable jar, and)
[//]: # (* add a proper analyser in the <code>gr.ilsp.fmc.utils.AnalyserFactory</code> class of the ilsp-fc source.  )

[//]: # (## Run a monolingual crawl )

[//]: # ( <pre><code>java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar crawlandexport -a vpapa@ilsp.gr -cfg FMC_config.xml -type m -c 10 -lang en -of output_test1_list.txt -ofh output_test1_list.txt.html -tc ENV_EN_topic.txt  -u ENV_EN_seeds.txt -f -k -dom Environment</code></pre> )

[//]: # ( <pre><code>java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar crawlandexport -a test2 -f -k -type m -c 5 -lang es -of output_test2_list.txt -ofh output_test2_list.txt.html -u seed_examples.txt  </code></pre> )

[//]: # (## Run a bilingual crawl )

[//]: # ( <pre><code>java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar crawlandexport -a test3 -c 10 -f -k -l1 de -l2 it -of test_HS_DE-IT_output.txt -ofh test_HS_DE-IT_output.txt.html -tc HS_DE-IT_topic.txt -type p -u seed_examples.txt -cfg FBC_config.xml -dom HS -len 0 -mtlen 100 -xslt -oxslt</code></pre> )

[//]: # ( <pre><code>java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar crawlandexport -a test4 -c 10 -f -k -l1 es -l2 en -type p -u seed_examples.txt -filter ".*uefa.com.*" -len 0 -mtlen 80 -xslt -oxslt -dest "/var/crawl_results/" -of test_U_ES-EN_output.txt -ofh test_U_ES-EN_output.txt.html -oft test_U_ES-EN_output.tmx.txt -ofth test_U_ES-EN_output.tmx.html -align  hunalign -dict </code></pre> )

[//]: # ( <pre><code>java -jar ilsp-fc-2.2-jar-with-dependencies.jar crawlandexport -f -a abumatran -type p -align maligna -l1 en -l2 fr -u seed_examples.txt -filter ".*(nrcan|rncan).*" -n 2 -xslt -oxslt -of output_demo_EN-FR.txt -ofh output_demo_EN-FR.txt.html -oft output_demo_EN-FR.tmx.txt -ofth output_demo_EN-FR.tmx.html </code></pre>)

[//]: # ( ## Output )

[//]: # (The output of the ilsp-fc in the case of a monolingual crawl consists of: )
[//]: # (* a list of links pointing to XML files following the cesDOC Corpus Encoding Standard (http://www.xces.org/). See this "cesDoc":http://nlp.ilsp.gr/xslt/ilsp-fc/1.xml file for an example in English for the _Environment_ domain. )
[//]: # (* a list of links pointing to HTML files (by XSL transformation of each XML) for easier browsing of the collection. As an example, see this "rendered cesDoc":http://nlp.ilsp.gr/xslt/ilsp-fc/1.xml.html file. )

[//]: # (The output of the ilsp-fc in the case of a bilingual crawl consists of: )
[//]: # (* a list of links to XML files following the cesAlign Corpus Encoding Standard for linking cesDoc documents. This example "cesAlign":http://nlp.ilsp.gr/xslt/ilsp-fc/44_98_i.xml file serves as a link between a detected pair of cesDoc documents in "English":http://nlp.ilsp.gr/xslt/ilsp-fc/98.xml and "Spanish":http://nlp.ilsp.gr/xslt/ilsp-fc/44.xml.)
[//]: # (* a list of links pointing to HTML files (by XSL transformation of each cesAlign XML) for easier browsing of the collection. As an example, see this "rendered cesAlign":http://nlp.ilsp.gr/xslt/ilsp-fc/44_98_i.xml.html file.)
[//]: # (* a list of links to TMX files containing sentence alignments that have been extracted from the detected document pairs. As an example, see this "TMX":http://nlp.ilsp.gr/xslt/ilsp-fc/44_98_i.tmx file.)
[//]: # (* a list of links pointing to HTML files (by XSL transformation of each TMX) for easier browsing of the collection. As an example, see this "rendered TMX":http://nlp.ilsp.gr/xslt/ilsp-fc/44_98_i.html file.)

## Running modules of the ILSP-FC

The ILSP-FC, in a configuration for acquiring parallel data, applies the following processes (one after the other):
* [[Crawl|Crawl]]
* [[Export|Export]]
* [[NearDeduplication|Near Deduplication]]
* [[PairDetection|Pair Detection]]
* [[SegmentAlignment|Segment Alignment]]
* [[TMXmerging|TMX Merging]]