Getting Started » History » Version 51
Vassilis Papavassiliou, 2012-12-05 10:28 AM
1 | 1 | Prokopis Prokopidis | h1. Getting Started |
---|---|---|---|
2 | 2 | Prokopis Prokopidis | |
3 | 2 | Prokopis Prokopidis | Once you [[DeveloperSetup|build]] or [[HowToGet|download]] an ilsp-fc runnable jar, you can run it as follows: |
4 | 2 | Prokopis Prokopidis | |
5 | 11 | Prokopis Prokopidis | <pre><code>java -jar ilsp-fc-1.1-jar-with-dependencies.jar</code></pre> |
6 | 2 | Prokopis Prokopidis | |
7 | 45 | Vassilis Papavassiliou | The required input from the user consists of: |
8 | 46 | Vassilis Papavassiliou | * a list of term triplets (_<relevance,term,subtopic>_) that describe a domain and, optionally, subcategories of this domain. An example domain definition can be found at [[ENV_EN_topic.txt]] for the _Environment_ domain in English. |
9 | 47 | Vassilis Papavassiliou | * a list of seed URLs pointing to relevant web pages. An example seed URL list for _Environment_ in English can be found at [[ENV_EN_seeds.txt]]. |
10 | 1 | Prokopis Prokopidis | |
11 | 47 | Vassilis Papavassiliou | For bilingual crawling, the required input from the user consists of: |
12 | 47 | Vassilis Papavassiliou | * a list of term triplets (_<relevance,term,subtopic>_) that describe a domain and, optionally, subcategories of this domain in both targeted languages (i.e. the union of the domain definitions in each language). An example domain definition for the English-Spanish pair can be found at [[ENV_EN_ES_topic.txt]]. |
13 | 47 | Vassilis Papavassiliou | * a seed URL list which should contain only one URL (e.g. [[ENV_EN_ES_seed.txt]]). The crawler will only follow links to pages inside versions of the top domain of that URL (e.g. http://www.fifa.com/, http://es.fifa.com/, etc.). |
14 | 47 | Vassilis Papavassiliou | |
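Before starting a bilingual crawl, it can help to see which hosts the seed list actually points at. The following shell sketch is not part of ilsp-fc: the function name @seed_hosts@ is made up for illustration, and it assumes the seed file holds one URL per line. It prints the distinct host names so that you can check whether they are language versions of the same top domain.

```shell
# Hypothetical helper (not part of ilsp-fc): list the distinct hosts in a seed file.
# Assumes one URL per line; blank lines are ignored.
seed_hosts() {
  grep -v '^[[:space:]]*$' "$1" \
    | sed -E 's#^[A-Za-z][A-Za-z0-9+.-]*://([^/:]+).*#\1#' \
    | sort -u
}
```

For example, @seed_hosts ENV_EN_ES_seed.txt@ would print one host per line for the example seed file mentioned above.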
15 | 41 | Prokopis Prokopidis | Several settings influence the crawling process and can be defined in a configuration file before the crawl starts. The default configuration file is [[crawler_config.xml]] and is included in the ilsp-fc runnable jar. Two typical customized examples are [[FMC_config.xml]] for monolingual crawls and [[FBC_config.xml]] for bilingual crawls. |
16 | 1 | Prokopis Prokopidis | |
17 | 40 | Prokopis Prokopidis | Some of the settings can also be overridden using options of the ilsp-fc runnable jar, as follows: |
18 | 38 | Prokopis Prokopidis | |
19 | 30 | Vassilis Papavassiliou | <pre><code>-a : user agent name (required) |
20 | 30 | Vassilis Papavassiliou | -type : the type of crawl: monolingual (m) or parallel (p). |
21 | 38 | Prokopis Prokopidis | -cfg : the configuration file that will be used instead of the default (see crawler_config.xml above). |
22 | 34 | Vassilis Papavassiliou | -c : the crawl duration in minutes. Since the crawler runs in cycles (during which links stored at the top of |
23 | 34 | Vassilis Papavassiliou | the crawler’s frontier are extracted and new links are examined) it is very likely that the defined time |
24 | 34 | Vassilis Papavassiliou | will expire during a cycle run. Then, the crawler will stop only after the end of the running cycle. |
25 | 34 | Vassilis Papavassiliou | The default value is 10 minutes. |
26 | 34 | Vassilis Papavassiliou | -n : the crawl duration in cycles. The default is 1. This parameter is recommended for testing purposes. |
27 | 34 | Vassilis Papavassiliou | -t : the number of threads that will be used to fetch web pages in parallel. |
28 | 34 | Vassilis Papavassiliou | -f : forces the crawler to start a new job (required) |
29 | 34 | Vassilis Papavassiliou | -lang : the targeted language in case of monolingual crawling (required). |
30 | 1 | Prokopis Prokopidis | -l1 : the first targeted language in case of bilingual crawling (required). |
31 | 34 | Vassilis Papavassiliou | -l2 : the second targeted language in case of bilingual crawling (required). |
32 | 47 | Vassilis Papavassiliou | -u : the text file that contains the seed URLs that will initialize the crawler. In case of bilingual crawling |
33 | 28 | Vassilis Papavassiliou | the list should contain only 1 or 2 URLs from the same web domain. |
34 | 1 | Prokopis Prokopidis | -tc : the text file that contains the domain definition (the list of term triplets). |
35 | 42 | Vassilis Papavassiliou | -k : forces the crawler to annotate boilerplate content in parsed text. |
36 | 42 | Vassilis Papavassiliou | -filter : a regular expression; URLs that do NOT match it are filtered out. |
37 | 44 | Vassilis Papavassiliou | The use of this filter forces the crawler to focus either on a specific |
38 | 44 | Vassilis Papavassiliou | web domain (e.g. ".*ec.europa.eu.*"), or on a part of a web domain |
39 | 44 | Vassilis Papavassiliou | (e.g. ".*/legislation_summaries/environment.*"). Note that if this filter |
40 | 44 | Vassilis Papavassiliou | is used, only the seed URLs that match this regex will be fetched. |
41 | 37 | Vassilis Papavassiliou | -d : forces the crawler to stay in a web site (i.e. starts from a web site and extracts only links to pages |
42 | 37 | Vassilis Papavassiliou | inside the same web site). It should be used only for monolingual crawling. |
43 | 22 | Prokopis Prokopidis | </code></pre> |
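Since only the seed URLs that match the @-filter@ regex are fetched, it may be worth previewing which seeds survive a given filter before launching a crawl. The sketch below is not part of ilsp-fc, and @preview_filter@ is a hypothetical name. It assumes one URL per line in the seed file and uses @grep -E -x@ (whole-line matching) as an approximation; whether ilsp-fc applies a full match is an assumption, and Java regex syntax differs from POSIX extended regex for advanced constructs. For simple patterns wrapped in @.*@, like the ones in the option description, the two behave the same.

```shell
# Hypothetical helper (not part of ilsp-fc): print the seed URLs that a
# -filter regex would keep. $1 is the regex, $2 is the seed URL file.
# grep -x requires the whole line to match the pattern.
preview_filter() {
  grep -E -x -- "$1" "$2"
}
```

For example, @preview_filter '.*ec.europa.eu.*' ENV_EN_seeds.txt@ lists the seeds that match the first filter shown above. Empty output means no seed URL would be fetched with that filter.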
44 | 1 | Prokopis Prokopidis | |
45 | 1 | Prokopis Prokopidis | h2. Run a monolingual crawl |
46 | 1 | Prokopis Prokopidis | |
47 | 22 | Prokopis Prokopidis | <pre><code>java -jar ilsp-fc-1.1-jar-with-dependencies.jar crawlandexport -a vpapa@ilsp.gr \ |
48 | 22 | Prokopis Prokopidis | -cfg FMC_config.xml -t 10 -type m -c 10 -lang de -of output_test1_list.txt \ |
49 | 22 | Prokopis Prokopidis | -ofh output_test1_list.txt.html -tc Automotive-seed-terms-de.txt \ |
50 | 22 | Prokopis Prokopidis | -u Automotive-seed-urls.txt -xslt -f -k</code></pre> |
51 | 2 | Prokopis Prokopidis | |
52 | 1 | Prokopis Prokopidis | h2. Run a bilingual crawl |
53 | 12 | Vassilis Papavassiliou | |
54 | 29 | Vassilis Papavassiliou | <pre><code>java -jar ilsp-fc-1.1-jar-with-dependencies.jar crawlandexport -a test1 -c 10 -f -k -l1 de -l2 it \ |
55 | 29 | Vassilis Papavassiliou | -t 10 -of test_HS_DE-IT_output.txt -ofh test_HS_DE-IT_output.txt.html -tc HS_DE-IT_topic.txt \ |
56 | 29 | Vassilis Papavassiliou | -type p -xslt -u seed_suva.txt -cfg FBC_config.xml</code></pre> |
57 | 12 | Vassilis Papavassiliou | |
58 | 48 | Vassilis Papavassiliou | h2. Output |
59 | 51 | Vassilis Papavassiliou | |
60 | 50 | Vassilis Papavassiliou | For a monolingual crawl, the output of ilsp-fc is a list of links pointing to XML files that follow the cesDoc Corpus Encoding Standard (http://www.xces.org/). See http://nlp.ilsp.gr/nlp/examples/2547.xml for an example in French for the Environment domain. |
61 | 50 | Vassilis Papavassiliou | For a bilingual crawl, the output of ilsp-fc is a list of links to XML files that follow the cesAlign Corpus Encoding Standard for linking (parts of) cesDoc documents. The example at http://nlp.ilsp.gr/panacea/xces-xslt/202_225.xml links a pair of documents in English and Greek. |
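A quick sanity check after a crawl is to count how many documents (or document pairs) ended up in the output list. The sketch below is not part of ilsp-fc. The name @count_output_links@ is hypothetical, and it assumes that the file passed to @-of@ contains one link per line.

```shell
# Hypothetical helper (not part of ilsp-fc): count the entries in an output list.
# Assumes the list file given to -of holds one link per line; blank lines are skipped.
count_output_links() {
  grep -c -v '^[[:space:]]*$' "$1"
}
```

For the monolingual example above, this would be called as @count_output_links output_test1_list.txt@.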