Getting Started » History » Version 47

Vassilis Papavassiliou, 2012-12-04 04:45 PM

h1. Getting Started

Once you [[DeveloperSetup|build]] or [[HowToGet|download]] an ilsp-fc runnable jar, you can run it as follows:

<pre><code>java -jar ilsp-fc-1.1-jar-with-dependencies.jar</code></pre>

The required input from the user consists of:
* a list of term triplets (_<relevance,term,subtopic>_) that describe a domain and, optionally, subcategories of this domain. An example domain definition for the _Environment_ domain in English can be found at [[ENV_EN_topic.txt]].
* a list of seed URLs pointing to relevant web pages. An example seed URL list for the _Environment_ domain in English can be found at [[ENV_EN_seeds.txt]].
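
The triplet structure described above can be illustrated with a small Java sketch. It assumes, purely for illustration, that each line of a topic file holds the three fields separated by commas in the order relevance, term, subtopic; the actual layout of [[ENV_EN_topic.txt]] may differ, so check that file before relying on this. The class @TermTriplet@, its method @parseLine@ and the sample line are hypothetical and not part of ilsp-fc.

```java
/** Hypothetical holder for one <relevance,term,subtopic> triplet; not an ilsp-fc class. */
public class TermTriplet {
    public final int relevance;
    public final String term;
    public final String subtopic; // empty when no subcategory is given

    public TermTriplet(int relevance, String term, String subtopic) {
        this.relevance = relevance;
        this.term = term;
        this.subtopic = subtopic;
    }

    /**
     * Parses one line, ASSUMING a comma-separated "relevance,term,subtopic" layout.
     * The subtopic field is optional, mirroring the optional subcategories.
     */
    public static TermTriplet parseLine(String line) {
        String[] parts = line.split(",", 3);
        if (parts.length < 2) {
            throw new IllegalArgumentException("Expected at least relevance and term: " + line);
        }
        int relevance = Integer.parseInt(parts[0].trim());
        String term = parts[1].trim();
        String subtopic = parts.length == 3 ? parts[2].trim() : "";
        return new TermTriplet(relevance, term, subtopic);
    }

    public static void main(String[] args) {
        // Invented sample line, used only to exercise the parser.
        TermTriplet t = parseLine("100, renewable energy, energy");
        System.out.println(t.relevance + " | " + t.term + " | " + t.subtopic);
    }
}
```

If the real file uses a different separator, only the @split@ call in @parseLine@ would need to change.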

For bilingual crawling, the required input from the user consists of:
* a list of term triplets (_<relevance,term,subtopic>_) that describe a domain and, optionally, subcategories of this domain in both targeted languages (i.e. the union of the domain definitions in each language). An example domain definition for the English-Spanish pair can be found at [[ENV_EN_ES_topic.txt]].
* a seed URL list, which should contain only one URL (e.g. [[ENV_EN_ES_seed.txt]]). The crawler will follow only links to pages inside versions of the top domain of that URL (e.g. http://www.fifa.com/, http://es.fifa.com/, etc.).
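
The "same top domain" restriction for bilingual crawls can be pictured with the following Java sketch. It is not the crawler's actual logic: it naively takes the last two labels of the host name as the top domain, which is an assumption that fails for suffixes such as @co.uk@. The class @TopDomainCheck@ and its methods are hypothetical names.

```java
import java.net.URI;

/** Illustrative check only; not taken from the ilsp-fc sources. */
public class TopDomainCheck {

    /** Returns the last two host labels, e.g. "fifa.com" for "http://es.fifa.com/". */
    public static String topDomain(String url) {
        String host = URI.create(url.trim()).getHost();
        if (host == null) {
            throw new IllegalArgumentException("No host in URL: " + url);
        }
        String[] labels = host.toLowerCase().split("\\.");
        int n = labels.length;
        return n <= 2 ? host.toLowerCase() : labels[n - 2] + "." + labels[n - 1];
    }

    /** True when both URLs share the same (naively computed) top domain. */
    public static boolean sameTopDomain(String seedUrl, String candidateUrl) {
        return topDomain(seedUrl).equals(topDomain(candidateUrl));
    }

    public static void main(String[] args) {
        String seed = "http://www.fifa.com/";
        System.out.println(sameTopDomain(seed, "http://es.fifa.com/"));
        System.out.println(sameTopDomain(seed, "http://www.example.org/"));
    }
}
```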

Several settings influence the crawling process and can be defined in a configuration file before the crawl starts. The default configuration file is [[crawler_config.xml]], which is included in the ilsp-fc runnable jar. Two typical customized examples are [[FMC_config.xml]] for monolingual crawls and [[FBC_config.xml]] for bilingual crawls.

Some of the settings can also be overridden using options of the ilsp-fc runnable jar, as follows:

<pre><code>-a : the user agent name (required).
-type : the type of crawling: monolingual (m) or parallel (p).
-cfg : the configuration file that will be used instead of the default (see crawler_config.xml above).
-c : the crawl duration in minutes. Since the crawler runs in cycles (during which links stored at the top of
     the crawler’s frontier are extracted and new links are examined), it is very likely that the defined time
     will expire during a cycle run. In that case, the crawler will stop only after the end of the running cycle.
     The default value is 10 minutes.
-n : the crawl duration in cycles. The default is 1. This parameter is recommended for testing purposes.
-t : the number of threads that will be used to fetch web pages in parallel.
-f : forces the crawler to start a new job (required).
-lang : the targeted language in case of monolingual crawling (required).
-l1 : the first targeted language in case of bilingual crawling (required).
-l2 : the second targeted language in case of bilingual crawling (required).
-u : the text file that contains the seed URLs that will initialize the crawler. In case of bilingual crawling
     the list should contain only 1 or 2 URLs from the same web domain.
-tc : the file that contains the domain definition (the list of term triplets).
-k : forces the crawler to annotate boilerplate content in parsed text.
-filter : a regular expression; URLs that do NOT match it are filtered out.
          The use of this filter forces the crawler to focus either on a specific
          web domain (e.g. ".*ec.europa.eu.*") or on a part of a web domain
          (e.g. ".*/legislation_summaries/environment.*"). Note that if this filter
          is used, only the seed URLs that match this regex will be fetched.
-d : forces the crawler to stay in a web site (i.e. starts from a web site and extracts only links to pages
     inside the same web site). It should be used only for monolingual crawling.
</code></pre>
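
To get a feeling for what a @-filter@ expression keeps, the regex can be tried against a few URLs with @java.util.regex@ before starting a crawl. The sketch below assumes that the crawler requires the whole URL to match (the leading and trailing @.*@ in the examples above suggest this, but it is not confirmed here). The class @UrlFilterSketch@ and the sample URLs are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Illustrative helper; not part of ilsp-fc. */
public class UrlFilterSketch {

    /** Keeps only the URLs that match the regex as a whole (assumed semantics). */
    public static List<String> keepMatching(String regex, List<String> urls) {
        Pattern pattern = Pattern.compile(regex);
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            if (pattern.matcher(url).matches()) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Sample URLs invented for this example.
        List<String> urls = Arrays.asList(
                "http://ec.europa.eu/environment/index_en.htm",
                "http://www.example.org/environment/");
        System.out.println(keepMatching(".*ec.europa.eu.*", urls));
    }
}
```

Note that an unescaped @.@ in a regular expression matches any character, so @.*ec\.europa\.eu.*@ is the stricter way to write the first filter.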

h2. Run a monolingual crawl

<pre><code>java -jar ilsp-fc-1.1-jar-with-dependencies.jar crawlandexport -a vpapa@ilsp.gr \
                -cfg FMC_config.xml -t 10 -type m -c 10 -lang de -of output_test1_list.txt \
                -ofh output_test1_list.txt.html -tc Automotive-seed-terms-de.txt \
                -u Automotive-seed-urls.txt -xslt -f -k</code></pre>
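
When the crawler has to be started from another Java program, the command above can be assembled as an argument list and passed to @ProcessBuilder@. The following is a minimal sketch under that assumption; it only mirrors the options of the command above, and the class @MonolingualCrawlLauncher@ and its methods are hypothetical names, not an ilsp-fc API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative launcher; mirrors the monolingual command line shown above. */
public class MonolingualCrawlLauncher {

    /** Builds the argument list; option names are copied from the example command. */
    public static List<String> buildCommand(String jar, String agent, String lang,
            String topicFile, String seedFile, int minutes, int threads, String outputList) {
        return new ArrayList<>(Arrays.asList(
                "java", "-jar", jar, "crawlandexport",
                "-a", agent,
                "-cfg", "FMC_config.xml",
                "-t", String.valueOf(threads),
                "-type", "m",
                "-c", String.valueOf(minutes),
                "-lang", lang,
                "-of", outputList,
                "-ofh", outputList + ".html",
                "-tc", topicFile,
                "-u", seedFile,
                "-xslt", "-f", "-k"));
    }

    /** Starts the crawl and waits for it; requires the jar and input files to exist. */
    public static int run(List<String> command) throws Exception {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) {
        List<String> command = buildCommand("ilsp-fc-1.1-jar-with-dependencies.jar",
                "vpapa@ilsp.gr", "de", "Automotive-seed-terms-de.txt",
                "Automotive-seed-urls.txt", 10, 10, "output_test1_list.txt");
        // Only print the command here; call run(command) to actually launch it.
        System.out.println(String.join(" ", command));
    }
}
```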

h2. Run a bilingual crawl

<pre><code>java -jar ilsp-fc-1.1-jar-with-dependencies.jar crawlandexport -a test1 -c 10 -f -k -l1 de -l2 it \
                -t 10 -of test_HS_DE-IT_output.txt -ofh test_HS_DE-IT_output.txt.html -tc HS_DE-IT_topic.txt \
                -type p -xslt -u seed_suva.txt -cfg FBC_config.xml</code></pre>
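
The two invocations differ mainly in which language options are required: @-lang@ for @-type m@, and @-l1@ plus @-l2@ for @-type p@, with @-a@ and @-f@ required in both cases according to the option list. A wrapper script could check this before launching a crawl. The sketch below is one possible way to do so; the class @CrawlOptionCheck@ is hypothetical and the crawler may perform its own, different validation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative pre-flight check based on the options marked "(required)" above. */
public class CrawlOptionCheck {

    /** Returns the required options that are absent for the given crawl type ("m" or "p"). */
    public static List<String> missingOptions(String type, Set<String> given) {
        List<String> required = new ArrayList<>(Arrays.asList("-a", "-f"));
        if ("m".equals(type)) {
            required.add("-lang");
        } else if ("p".equals(type)) {
            required.add("-l1");
            required.add("-l2");
        } else {
            throw new IllegalArgumentException("Unknown crawl type: " + type);
        }
        List<String> missing = new ArrayList<>();
        for (String option : required) {
            if (!given.contains(option)) {
                missing.add(option);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Set<String> given = new HashSet<>(Arrays.asList("-a", "-f", "-l1"));
        System.out.println("Missing for bilingual crawl: " + missingOptions("p", given));
    }
}
```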

h2. Example of Java code

<pre>
<code class="java">
package gr.ilsp.fmc.classifier;

public enum ClassifierCounters {
    CLASSIFIER_DOCUMENTS_PASSED,   // successfully classified a document
    CLASSIFIER_DOCUMENTS_FAILED,   // failed to classify a document
    CLASSIFIER_DOCUMENTS_ABORTED,
    CLASSIFIER_TIME
}</code></pre>
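
As an illustration of how such counters could be tallied, the sketch below keeps one running total per enum constant in an @EnumMap@. This is an assumption made for the example only; it does not show how ilsp-fc itself updates these counters. The enum is repeated inside the hypothetical class @CounterTally@ so that the example compiles on its own.

```java
import java.util.EnumMap;
import java.util.Map;

/** Illustrative tally; not the mechanism used by ilsp-fc. */
public class CounterTally {

    /** Copy of the enum shown above, nested here to keep the example self-contained. */
    public enum ClassifierCounters {
        CLASSIFIER_DOCUMENTS_PASSED,
        CLASSIFIER_DOCUMENTS_FAILED,
        CLASSIFIER_DOCUMENTS_ABORTED,
        CLASSIFIER_TIME
    }

    private final Map<ClassifierCounters, Long> counts = new EnumMap<>(ClassifierCounters.class);

    /** Adds the given amount to a counter, starting from zero on first use. */
    public void increment(ClassifierCounters counter, long amount) {
        Long current = counts.get(counter);
        counts.put(counter, (current == null ? 0L : current) + amount);
    }

    /** Returns the current value of a counter, or zero if it was never incremented. */
    public long get(ClassifierCounters counter) {
        Long current = counts.get(counter);
        return current == null ? 0L : current;
    }

    public static void main(String[] args) {
        CounterTally tally = new CounterTally();
        tally.increment(ClassifierCounters.CLASSIFIER_DOCUMENTS_PASSED, 1);
        tally.increment(ClassifierCounters.CLASSIFIER_DOCUMENTS_PASSED, 1);
        tally.increment(ClassifierCounters.CLASSIFIER_DOCUMENTS_FAILED, 1);
        System.out.println("passed=" + tally.get(ClassifierCounters.CLASSIFIER_DOCUMENTS_PASSED));
    }
}
```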

<pre>
<code class="xml">
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
        <agent>
                <email>yourmail@mail.com</email>
                <web_address>www.youraddress.com</web_address>
        </agent>
</configuration></code></pre>
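
A configuration fragment like the one above can be inspected with the XML parser that ships with the JDK, for example to confirm the agent details before a crawl. The sketch below reads the @email@ and @web_address@ elements from an XML string; the class @AgentConfigReader@ is a hypothetical helper and does not reflect how ilsp-fc loads its configuration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

/** Illustrative reader for the agent section; not part of ilsp-fc. */
public class AgentConfigReader {

    /** Returns {email, web_address} from the given configuration XML. */
    public static String[] readAgent(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            String email = doc.getElementsByTagName("email").item(0).getTextContent().trim();
            String web = doc.getElementsByTagName("web_address").item(0).getTextContent().trim();
            return new String[] { email, web };
        } catch (Exception e) {
            // Wrap parser and I/O failures so callers do not need checked exceptions.
            throw new IllegalStateException("Could not read agent configuration", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<configuration><agent>"
                + "<email>yourmail@mail.com</email>"
                + "<web_address>www.youraddress.com</web_address>"
                + "</agent></configuration>";
        String[] agent = readAgent(xml);
        System.out.println(agent[0] + " / " + agent[1]);
    }
}
```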