Getting Started » History » Version 34
Vassilis Papavassiliou, 2012-10-26 05:10 PM
1 | 1 | Prokopis Prokopidis | h1. Getting Started |
---|---|---|---|
2 | 2 | Prokopis Prokopidis | |
3 | 2 | Prokopis Prokopidis | Once you [[DeveloperSetup|build]] or [[HowToGet|download]] an ilsp-fc runnable jar, you can run it as follows: |
4 | 2 | Prokopis Prokopidis | |
5 | 11 | Prokopis Prokopidis | <pre><code>java -jar ilsp-fc-1.1-jar-with-dependencies.jar</code></pre> |
6 | 2 | Prokopis Prokopidis | |
7 | 23 | Prokopis Prokopidis | Several settings influence the crawling process. They can be defined in the configuration file (the default file is [[crawler_config.xml]]) before the crawl starts. Some of them can also be set on the command line when running the ilsp-fc runnable jar, as follows: |
8 | 15 | Vassilis Papavassiliou | |
9 | 30 | Vassilis Papavassiliou | <pre><code>-a : the user agent name (required). |
10 | 30 | Vassilis Papavassiliou | -type : the type of crawl: monolingual (m) or parallel (p). |
11 | 30 | Vassilis Papavassiliou | -cfg : the configuration file to use instead of the default (see crawler_config.xml above). It is |
12 | 33 | Vassilis Papavassiliou | recommended to use [[FMC_crawler.xml]] for a monolingual crawl and [[FBC_crawler.xml]] for a bilingual crawl. |
13 | 34 | Vassilis Papavassiliou | -c : the crawl duration in minutes. Since the crawler runs in cycles (during which links stored at the top of |
14 | 34 | Vassilis Papavassiliou | the crawler’s frontier are extracted and new links are examined), the defined time is very likely to |
15 | 34 | Vassilis Papavassiliou | expire during a cycle. In that case, the crawler stops only after the running cycle has finished. |
16 | 34 | Vassilis Papavassiliou | The default value is 10 minutes. |
17 | 34 | Vassilis Papavassiliou | -n : the crawl duration in cycles. The default is 1. This parameter is recommended for testing purposes. |
18 | 34 | Vassilis Papavassiliou | -t : the number of threads that will be used to fetch web pages in parallel. |
19 | 34 | Vassilis Papavassiliou | -f : forces the crawler to start a new job (required) |
20 | 34 | Vassilis Papavassiliou | -lang : the targeted language in case of monolingual crawling (required). |
21 | 34 | Vassilis Papavassiliou | -l1 : the first targeted language in case of bilingual crawling (required). |
22 | 34 | Vassilis Papavassiliou | -l2 : the second targeted language in case of bilingual crawling (required). |
23 | 30 | Vassilis Papavassiliou | |
24 | 28 | Vassilis Papavassiliou | -u : the text file that contains the seed URLs that will initialize the crawler. In case of bilingual crawling, |
25 | 28 | Vassilis Papavassiliou | the list should contain only 1 or 2 URLs from the same web domain. |
26 | 28 | Vassilis Papavassiliou | -tc : domain definition |
27 | 22 | Prokopis Prokopidis | </code></pre> |
28 | 1 | Prokopis Prokopidis | |
29 | 1 | Prokopis Prokopidis | h2. Run a monolingual crawl |
30 | 1 | Prokopis Prokopidis | |
31 | 22 | Prokopis Prokopidis | <pre><code>java -jar ilsp-fc-1.1-jar-with-dependencies.jar crawlandexport -a vpapa@ilsp.gr \ |
32 | 22 | Prokopis Prokopidis | -cfg FMC_config.xml -t 10 -type m -c 10 -lang de -of output_test1_list.txt \ |
33 | 22 | Prokopis Prokopidis | -ofh output_test1_list.txt.html -tc Automotive-seed-terms-de.txt \ |
34 | 22 | Prokopis Prokopidis | -u Automotive-seed-urls.txt -xslt -f -k</code></pre> |
35 | 2 | Prokopis Prokopidis | |
36 | 1 | Prokopis Prokopidis | h2. Run a bilingual crawl |
37 | 12 | Vassilis Papavassiliou | |
38 | 29 | Vassilis Papavassiliou | <pre><code>java -jar ilsp-fc-1.1-jar-with-dependencies.jar crawlandexport -a test1 -c 10 -f -k -l1 de -l2 it \ |
39 | 29 | Vassilis Papavassiliou | -t 10 -of test_HS_DE-IT_output.txt -ofh test_HS_DE-IT_output.txt.html -tc HS_DE-IT_topic.txt \ |
40 | 29 | Vassilis Papavassiliou | -type p -xslt -u seed_suva.txt -cfg FBC_config.xml</code></pre> |
41 | 12 | Vassilis Papavassiliou | |
42 | 2 | Prokopis Prokopidis | |
43 | 2 | Prokopis Prokopidis | h2. Example of Java code |
44 | 9 | Prokopis Prokopidis | |
45 | 2 | Prokopis Prokopidis | <pre> |
46 | 2 | Prokopis Prokopidis | <code class="java"> |
47 | 2 | Prokopis Prokopidis | package gr.ilsp.fmc.classifier; |
48 | 2 | Prokopis Prokopidis | |
49 | 2 | Prokopis Prokopidis | public enum ClassifierCounters { |
50 | 2 | Prokopis Prokopidis | CLASSIFIER_DOCUMENTS_PASSED, // successfully classified a document. |
51 | 1 | Prokopis Prokopidis | CLASSIFIER_DOCUMENTS_FAILED, // failed to classify a document |
52 | 2 | Prokopis Prokopidis | CLASSIFIER_DOCUMENTS_ABORTED, |
53 | 2 | Prokopis Prokopidis | CLASSIFIER_TIME |
54 | 8 | Prokopis Prokopidis | }</code></pre> |
55 | 2 | Prokopis Prokopidis | |
56 | 2 | Prokopis Prokopidis | <pre> |
57 | 1 | Prokopis Prokopidis | <code class="xml"> |
58 | 2 | Prokopis Prokopidis | <?xml version="1.0" encoding="UTF-8"?> |
59 | 2 | Prokopis Prokopidis | <configuration> |
60 | 2 | Prokopis Prokopidis | <agent> |
61 | 2 | Prokopis Prokopidis | <email>yourmail@mail.com</email> |
62 | 2 | Prokopis Prokopidis | <web_address>www.youraddress.com</web_address> |
63 | 2 | Prokopis Prokopidis | </agent> |
64 | 8 | Prokopis Prokopidis | </configuration></code></pre> |
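
The description of the -c option above says that the crawler stops only after the running cycle ends, so a crawl can last longer than the requested number of minutes. The following is a minimal sketch of that behaviour, not ilsp-fc code: the class @CrawlDurationSketch@, its methods, and the assumption that every cycle takes the same fixed number of minutes are all hypothetical and serve only to illustrate the timing rule.

```java
public class CrawlDurationSketch {

    /**
     * Counts how many cycles run when the time budget is checked only
     * between cycles. Assumption: every cycle takes cycleMinutes.
     */
    static int cyclesRun(long budgetMinutes, long cycleMinutes) {
        if (cycleMinutes <= 0) {
            throw new IllegalArgumentException("cycleMinutes must be positive");
        }
        long elapsed = 0; // simulated clock, in minutes
        int cycles = 0;
        // A new cycle starts only while the budget has not yet expired.
        while (elapsed < budgetMinutes) {
            elapsed += cycleMinutes; // a started cycle always runs to completion
            cycles++;
        }
        return cycles;
    }

    /** Total simulated minutes spent, which may exceed the budget. */
    static long minutesSpent(long budgetMinutes, long cycleMinutes) {
        return cyclesRun(budgetMinutes, cycleMinutes) * cycleMinutes;
    }

    public static void main(String[] args) {
        // Budget of 10 minutes (the documented default for -c) with
        // hypothetical 4-minute cycles: the third cycle starts at minute 8
        // and finishes at minute 12.
        System.out.println(cyclesRun(10, 4) + " cycles, "
                + minutesSpent(10, 4) + " minutes");
        // prints "3 cycles, 12 minutes"
    }
}
```

In this simulation a 10-minute budget with 4-minute cycles ends after 12 minutes, because the cycle that began at minute 8 is allowed to finish. Real cycle lengths vary, so the overrun of an actual crawl depends on how long its last cycle takes.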