Getting Started » History » Version 82

Vassilis Papavassiliou, 2014-08-13 09:54 AM

h1. Getting Started

Once you [[DeveloperSetup|build]] or [[HowToGet|download]] an ilsp-fc runnable jar, you can run it like this:

<pre><code>java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar</code></pre>

For monolingual crawls, the required input from the user is:
* a list of seed URLs pointing to relevant web pages. An example seed URL list for _Environment_ in English can be found at [[ENV_EN_seeds.txt]].

For focused crawls (i.e. crawls in which the crawler aims to visit, process, and store only web pages related to a targeted domain), the input should also include:
* a list of term triplets (_<relevance,term,subtopic>_) that describe the domain and, optionally, subcategories of this domain. This list is required only when the user aims to acquire domain-specific documents. An example domain definition can be found at [[ENV_EN_topic.txt]] for the _Environment_ domain in English.
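
The exact file layout of a domain definition is not described on this page, so the sketch below assumes a hypothetical one-triplet-per-line layout, <code>relevance:term=subtopic</code>, with the subtopic optional. Check [[ENV_EN_topic.txt]] for the real layout before relying on it. The helper name <code>parse_triplets</code> is illustrative and not part of ilsp-fc. It only shows how such a list could be split into its three fields for a quick sanity check:

```shell
# ASSUMPTION: one triplet per line, written as  relevance:term=subtopic
# (e.g. "100:air pollution=pollution"). This layout is hypothetical.
# parse_triplets FILE: print "relevance<TAB>term<TAB>subtopic" for each line;
# the exit status is non-zero if any non-blank line has no ":" separator.
parse_triplets() {
  awk '
    /^[ \t]*$/ { next }                      # skip blank lines
    {
      colon = index($0, ":")
      if (colon == 0) { bad = 1; next }      # no relevance separator: flag and skip
      relevance = substr($0, 1, colon - 1)
      rest = substr($0, colon + 1)
      eq = index(rest, "=")
      if (eq > 0) { term = substr(rest, 1, eq - 1); subtopic = substr(rest, eq + 1) }
      else        { term = rest; subtopic = "" }
      printf "%s\t%s\t%s\n", relevance, term, subtopic
    }
    END { exit bad }
  ' "$1"
}
```

Running <code>parse_triplets my_topic.txt</code> on a file in the assumed layout prints one tab-separated triplet per line, which makes missing relevance values easy to spot.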

For bilingual crawls, the input from the user includes:
* a seed URL list which should contain URL(s) from only one web site (e.g. [[ENV_EN_ES_seed.txt]]). By default, the crawler follows only links pointing to pages inside this web site. However, the user can use the <code>filter</code> parameter (see below) to allow visiting only links pointing to pages either inside versions of the top domain of the URL (e.g. http://www.fifa.com/, http://es.fifa.com/, etc.) or in different web sites (i.e. when the translations are hosted on two web sites, e.g. http://www.nrcan.gc.ca and http://www.rncan.gc.ca).
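
Since a bilingual seed list should contain URLs from only one web site, it can be worth checking the list before starting a crawl. The following is a minimal sketch, assuming one URL per line and treating "same web site" as "same host name" (an assumption; the crawler's own notion of a web site may differ). The function names <code>seed_hosts</code> and <code>seeds_from_one_site</code> are hypothetical helpers, not part of ilsp-fc:

```shell
# seed_hosts FILE: print the distinct host names found in a seed URL list.
# ASSUMPTION: the list holds one URL per line.
seed_hosts() {
  # Strip the scheme, then everything from the first "/", then lower-case the host.
  sed -e 's|^[a-zA-Z][a-zA-Z0-9+.-]*://||' -e 's|/.*$||' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | grep -v '^[[:space:]]*$' \
    | sort -u
}

# seeds_from_one_site FILE: succeed only if all seed URLs share one host name.
seeds_from_one_site() {
  [ "$(seed_hosts "$1" | awk 'END { print NR }')" -eq 1 ]
}
```

For example, <code>seeds_from_one_site seed_uefa.txt || echo "seeds span several hosts"</code> prints a warning when the list mixes hosts. A list that mixes http://www.fifa.com/ and http://es.fifa.com/ fails this strict check, which is the situation where the <code>filter</code> parameter becomes relevant.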

For focused crawls, the input should also include:
* a list of term triplets (_<relevance,term,subtopic>_) that describe the domain and, optionally, subcategories of this domain in both targeted languages (i.e. the union of the domain definitions in each language). This list is required only when the user aims to acquire domain-specific documents. An example domain definition for the English-Spanish pair can be found at [[ENV_EN_ES_topic.txt]].

For both monolingual and bilingual crawling, the currently supported languages are de, el, en, es, fr, hr, it, ja, and pt.
To add another language, the user should:
a) check that the targeted language is supported by the default language identifier (https://code.google.com/p/language-detection/) integrated in the ILSP-FC,
b) add a line with the proper content to the [[langKeys.txt]] file, which is included in the ilsp-fc runnable jar, and
c) add a proper analyser to the <code>gr.ilsp.fmc.utils.AnalyserFactory</code> class of the Java project.

There are several settings that influence the crawling process and can be defined in a configuration file before the crawl starts. The default configuration file is [[crawler_config.xml]], which is included in the ilsp-fc runnable jar. Two typical customized examples are [[FMC_config.xml]] for monolingual crawls and [[FBC_config.xml]] for bilingual crawls.

Some of the settings can also be overridden using options of the ilsp-fc runnable jar, as follows:

<pre><code>-a : user agent name (required)
-type : the type of crawling: monolingual (m) or parallel (p).
-cfg : the configuration file that will be used instead of the default (see crawler_config.xml above).
-c : the crawl duration in minutes. Since the crawler runs in cycles (during which links stored at the top of
     the crawler’s frontier are extracted and new links are examined), it is very likely that the defined time
     will expire during a cycle run. In that case, the crawler will stop only after the end of the running cycle.
     The default value is 10 minutes.
-n : the crawl duration in cycles. The default is 1. This parameter is recommended for testing purposes.
-t : the number of threads that will be used to fetch web pages in parallel.
-f : Forces the crawler to start a new job (required).
-lang : the targeted language in case of monolingual crawling (required).
-l1 : the first targeted language in case of bilingual crawling (required).
-l2 : the second targeted language in case of bilingual crawling (required).
-u : the text file that contains the seed URLs that will initialize the crawler. In case of bilingual crawling
     the list should contain only 1 or 2 URLs from the same web domain.
-tc : domain definition (a text file that contains a list of term triplets that describe the targeted
      domain). If omitted, the crawl will be a "general" one (i.e. the module for text-to-domain
      classification will not be used).
-k : Forces the crawler to annotate boilerplate content in parsed text.
-filter : A regular expression to filter out URLs which do NOT match this regex.
          The use of this filter forces the crawler to focus either on a specific
          web domain (e.g. ".*ec.europa.eu.*") or on a part of a web domain
          (e.g. ".*/legislation_summaries/environment.*"). Note that if this filter
          is used, only the seed URLs that match this regex will be fetched.
-u_r : This parameter should be used for bilingual crawling when there is an already known pattern in URLs
       which implies that one page is the candidate translation of the other. It includes the two strings
       to be replaced, separated by ';'.
-d : Forces the crawler to stay in a web site (i.e. starts from a web site and extracts only links to pages
     inside the same web site). It should be used only for monolingual crawling.
-mtlen : Minimum number of tokens in the cleaned document. If the length (in terms of tokens) of the cleaned
         text is less than this value, the document will not be stored.
-xslt : Insert a stylesheet for rendering XML results as HTML.
-oxslt : Export crawl results with the help of an XSLT file for better examination of results.
-dom : Title of the targeted domain (required when a domain definition, i.e. the tc parameter, is used).
-dest : The directory where the results (i.e. the crawled data) will be stored.
-of : A text file containing a list of the exported XML files (see section Output below).
-ofh : An HTML file containing a list of the generated XML files (see section Output below).
</code></pre>
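
The <code>-u_r</code> description can be illustrated with a small sketch. It assumes that the option value has the form <code>A;B</code> and that the candidate translation of a URL is obtained by replacing the first occurrence of <code>A</code> with <code>B</code>. Whether ilsp-fc replaces one or all occurrences, or also applies the replacement in the opposite direction, is not stated here, so treat this as an assumption. The function <code>candidate_url</code> and the example.com URLs are hypothetical:

```shell
# candidate_url URL "A;B": print URL with the first occurrence of A replaced by B.
# Returns non-zero (and prints nothing) when A does not occur in URL.
candidate_url() {
  url=$1
  from=${2%%;*}   # text before the first ';'
  to=${2#*;}      # text after the first ';'
  case $url in
    *"$from"*)
      # prefix before the first match + replacement + suffix after the first match
      printf '%s%s%s\n' "${url%%"$from"*}" "$to" "${url#*"$from"}"
      ;;
    *)
      return 1
      ;;
  esac
}
```

For instance, <code>candidate_url 'http://example.com/en/page.html' '/en/;/es/'</code> yields the URL of the presumed Spanish counterpart of the English page.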

h2. Run a monolingual crawl

<pre><code>java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar crawlandexport -a vpapa@ilsp.gr \
                -cfg FMC_config.xml -t 10 -type m -c 10 -lang de -of output_test1_list.txt \
                -ofh output_test1_list.txt.html -tc Automotive-seed-terms-de.txt \
                -u Automotive-seed-urls.txt -f -k -dom Automotive</code></pre>

<pre><code>java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar crawlandexport -a test2 \
                -t 10 -f -k -type m -c 5 -lang es -of output_test2_list.txt \
                -ofh output_test2_list.txt.html -u Automotive-seed-urls.txt</code></pre>
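
As a rough illustration of what the <code>-mtlen</code> threshold from the option list means, the sketch below counts whitespace-separated tokens in a text file and reports whether the file would pass a given minimum. The tokenization used by ilsp-fc is not described on this page, so whitespace splitting is an assumption, and <code>long_enough</code> is a hypothetical helper:

```shell
# long_enough FILE MIN: succeed if FILE holds at least MIN tokens.
# ASSUMPTION: a token is a whitespace-separated word, as counted by wc -w.
long_enough() {
  tokens=$(wc -w < "$1" | tr -d '[:space:]')
  [ "$tokens" -ge "$2" ]
}
```

A document with fewer tokens than the minimum fails the check, mirroring the rule that such documents are not stored.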

h2. Run a bilingual crawl

<pre><code>java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar crawlandexport -a test3 -c 10 -f -k -l1 de -l2 it \
                -t 10 -of test_HS_DE-IT_output.txt -ofh test_HS_DE-IT_output.txt.html -tc HS_DE-IT_topic.txt \
                -type p -u seed_suva.txt -cfg FBC_config.xml -dom HS</code></pre>

<pre><code>java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar crawlandexport -a test4 -c 11 -f -k -l1 es -l2 pt \
                -t 10 -of test_F_ES-PT_output.txt -ofh test_F_ES-PT_output.txt.html \
                -type p -u seed_uefa.txt -filter ".*uefa.com.*"</code></pre>
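
Before passing a regular expression to <code>-filter</code>, it can help to preview which URLs it would keep. The sketch below is only an approximation: it assumes the filter must match the whole URL (which the leading and trailing <code>.*</code> in the examples suggest) and uses <code>grep -E -x</code>, whose extended regular expressions behave like Java regular expressions for simple patterns such as these. <code>keep_matching</code> is a hypothetical helper:

```shell
# keep_matching REGEX: read URLs on stdin (one per line) and print only those
# that REGEX matches in full; -x makes grep require a whole-line match.
keep_matching() {
  grep -E -x -- "$1"
}
```

For example, <code>keep_matching '.*uefa.com.*' < seed_uefa.txt</code> lists the seed URLs that would survive the filter. Note that an unescaped <code>.</code> matches any character, so a stricter pattern would be <code>.*uefa\.com.*</code>.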

h2. Output

The output of the ilsp-fc in the case of a monolingual crawl is a list of links pointing to XML files following the cesDOC Corpus Encoding Standard (http://www.xces.org/). See [[cesDOC_file]] for an example in French for the Environment domain.

The output of the ilsp-fc in the case of a bilingual crawl is a list of links to XML files following the cesAlign Corpus Encoding Standard for linking cesDOC documents. This example [[cesAlign_file]] serves as a link between a pair of cesDOC documents in English and Greek.