Getting Started » History » Version 37

Vassilis Papavassiliou, 2012-12-03 05:27 PM

h1. Getting Started

Once you [[DeveloperSetup|build]] or [[HowToGet|download]] an ilsp-fc runnable jar, you can run it like this:

<pre><code>java -jar ilsp-fc-1.1-jar-with-dependencies.jar</code></pre>

Several settings influence the crawling process. They can be defined in a configuration file (the default is [[crawler_config.xml]]) before the crawl starts. Some of them can also be set on the command line when running the ilsp-fc runnable jar, as follows:

<pre><code>-a : the user agent name (required).
-type : the type of crawl: monolingual (m) or parallel (p).
-cfg : the configuration file to use instead of the default (see crawler_config.xml above). It is
       recommended to use [[FMC_config.xml]] for a monolingual crawl and [[FBC_config.xml]] for a bilingual crawl.
-c : the crawl duration in minutes. The crawler runs in cycles (during which links stored at the top of
     the crawler’s frontier are extracted and new links are examined), so the defined time is very likely
     to expire during a cycle. In that case, the crawler stops only after the running cycle ends.
     The default value is 10 minutes.
-n : the crawl duration in cycles. The default is 1. It is recommended to use this parameter for testing purposes.
-t : the number of threads used to fetch web pages in parallel.
-f : forces the crawler to start a new job (required).
-lang : the targeted language in case of monolingual crawling (required).
-l1 : the first targeted language in case of bilingual crawling (required).
-l2 : the second targeted language in case of bilingual crawling (required).

-u : the text file that contains the seed URLs that will initialize the crawler. In case of bilingual crawling,
     the list should contain only 1 or 2 URLs from the same web domain.
-tc : the domain definition (passed as a text file in the examples below).
-d : forces the crawler to stay in a web site (i.e. it starts from a web site and extracts only links to pages
     inside the same web site). It should be used only for monolingual crawling.
</code></pre>
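
The stopping rule described for @-c@ can be illustrated with a small simulation. The sketch below is not ilsp-fc code: the class name, the method name and the fixed cycle length are assumptions made purely for illustration. It checks the time budget only between cycles, so a cycle that is still running when the budget expires is allowed to finish.

```java
public class CycleBudgetSketch {

    /**
     * Simulates a crawler that checks its time budget only between cycles.
     * Hypothetical helper: every cycle is assumed to take the same time.
     *
     * @return a two-element array: {cycles run, total minutes elapsed}
     */
    public static long[] run(long budgetMinutes, long minutesPerCycle) {
        if (minutesPerCycle <= 0) {
            throw new IllegalArgumentException("minutesPerCycle must be positive");
        }
        long elapsed = 0;
        long cycles = 0;
        // A new cycle starts only while the budget has not yet expired;
        // once started, the cycle always runs to completion.
        while (elapsed < budgetMinutes) {
            elapsed += minutesPerCycle;
            cycles++;
        }
        return new long[] {cycles, elapsed};
    }

    public static void main(String[] args) {
        long[] result = run(10, 4);
        // The third cycle starts at minute 8 and ends at minute 12,
        // two minutes after the 10-minute budget expired.
        System.out.println("cycles=" + result[0] + ", elapsed=" + result[1] + " min");
    }
}
```

With a 10-minute budget and 4-minute cycles, the simulated crawl runs three cycles and stops at minute 12. A real crawl can overrun its budget in the same way, by at most the length of one cycle.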

h2. Run a monolingual crawl

<pre><code>java -jar ilsp-fc-1.1-jar-with-dependencies.jar crawlandexport -a vpapa@ilsp.gr \
                -cfg FMC_config.xml -t 10 -type m -c 10 -lang de -of output_test1_list.txt \
                -ofh output_test1_list.txt.html -tc Automotive-seed-terms-de.txt \
                -u Automotive-seed-urls.txt -xslt -f -k</code></pre>
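
The command above supplies @-a@, @-f@ and @-lang@, which the option list marks as required for a monolingual crawl. A pre-flight check along the following lines could catch a missing option before the jar is launched. This is a hedged sketch and not part of ilsp-fc: the class and method names are hypothetical, and the required sets are taken only from the option descriptions on this page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RequiredOptionsSketch {

    /**
     * Returns the required options that are absent from the argument list.
     * The type is "m" (monolingual) or "p" (parallel), as for -type.
     */
    public static List<String> missing(String type, List<String> args) {
        // -a and -f are marked as required regardless of the crawl type.
        List<String> required = new ArrayList<>(Arrays.asList("-a", "-f"));
        if ("m".equals(type)) {
            required.add("-lang");
        } else if ("p".equals(type)) {
            required.add("-l1");
            required.add("-l2");
        } else {
            throw new IllegalArgumentException("type must be m or p: " + type);
        }
        List<String> absent = new ArrayList<>();
        for (String option : required) {
            if (!args.contains(option)) {
                absent.add(option);
            }
        }
        return absent;
    }

    public static void main(String[] args) {
        List<String> mono = Arrays.asList(
                "crawlandexport", "-a", "vpapa@ilsp.gr", "-type", "m",
                "-c", "10", "-lang", "de", "-f");
        System.out.println(missing("m", mono)); // prints []

        List<String> incomplete = Arrays.asList(
                "crawlandexport", "-a", "test1", "-type", "p", "-l1", "de", "-f");
        System.out.println(missing("p", incomplete)); // prints [-l2]
    }
}
```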

h2. Run a bilingual crawl

<pre><code>java -jar ilsp-fc-1.1-jar-with-dependencies.jar crawlandexport -a test1 -c 10 -f -k -l1 de -l2 it \
                -t 10 -of test_HS_DE-IT_output.txt -ofh test_HS_DE-IT_output.txt.html -tc HS_DE-IT_topic.txt \
                -type p -xslt -u seed_suva.txt -cfg FBC_config.xml</code></pre>
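
For a bilingual crawl, the seed file passed with @-u@ in the command above should contain only 1 or 2 URLs from the same web domain. A minimal sketch of such a check follows. It is not part of ilsp-fc: the class and method names are hypothetical, and "same web domain" is approximated here as "identical host name", which is a simplifying assumption.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;

public class SeedListSketch {

    /**
     * Returns true if the list holds 1 or 2 URLs that all share one host.
     * Assumption: "same web domain" is treated as "same host name".
     */
    public static boolean validBilingualSeeds(List<String> urls) {
        if (urls.isEmpty() || urls.size() > 2) {
            return false;
        }
        String firstHost = null;
        for (String url : urls) {
            String host;
            try {
                host = URI.create(url.trim()).getHost();
            } catch (IllegalArgumentException e) {
                return false; // not a syntactically valid URI
            }
            if (host == null) {
                return false; // no host component, e.g. a bare word
            }
            if (firstHost == null) {
                firstHost = host;
            } else if (!firstHost.equalsIgnoreCase(host)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The example.com URLs are placeholders, not real seed URLs.
        System.out.println(validBilingualSeeds(Arrays.asList(
                "http://www.example.com/de/", "http://www.example.com/it/"))); // prints true
        System.out.println(validBilingualSeeds(Arrays.asList(
                "http://www.example.com/", "http://www.example.org/")));       // prints false
    }
}
```

The lines of a real seed file could be loaded with @java.nio.file.Files.readAllLines@ and passed to the method after dropping empty lines.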

h2. Example of Java code

<pre>
<code class="java">
package gr.ilsp.fmc.classifier;

public enum ClassifierCounters {
    CLASSIFIER_DOCUMENTS_PASSED,   // successfully classified a document
    CLASSIFIER_DOCUMENTS_FAILED,   // failed to classify a document
    CLASSIFIER_DOCUMENTS_ABORTED,
    CLASSIFIER_TIME
}</code></pre>

<pre>
<code class="xml">
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
        <agent>
                <email>yourmail@mail.com</email>
                <web_address>www.youraddress.com</web_address>
        </agent>
</configuration></code></pre>
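
A configuration fragment like the one above can be read with the DOM parser that ships with the JDK. The sketch below is illustrative only and does not reflect how ilsp-fc itself loads its configuration; the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class AgentConfigSketch {

    /**
     * Returns the text of the first <email> element, or null if there is none.
     * Parsing problems are rethrown as IllegalArgumentException.
     */
    public static String agentEmail(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList emails = doc.getElementsByTagName("email");
            return emails.getLength() == 0
                    ? null
                    : emails.item(0).getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse configuration", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<configuration><agent>"
                + "<email>yourmail@mail.com</email>"
                + "<web_address>www.youraddress.com</web_address>"
                + "</agent></configuration>";
        System.out.println(agentEmail(xml)); // prints yourmail@mail.com
    }
}
```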