# Crawl

In general, the crawler initializes its frontier (i.e. the list of pages to be visited) from a seed URL list, fetches the web pages, extracts links from the fetched pages, adds these links to the frontier, and so on. During this process, modules for page fetching, content normalization, boilerplate removal, metadata extraction, text classification, and link extraction and prioritization are used. Users can configure several settings that determine the fetching process or the strictness of the text classifier by modifying the default configuration file [[FBC_config.xml]] when crawling for multilingual data, or [[FMC_config.xml]] when acquiring monolingual data.
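
The following is a minimal sketch of this generic loop, assuming a simple breadth-first strategy; it is only illustrative and does not correspond to ILSP-FC's actual classes (fetch and extractLinks are hypothetical placeholders for the fetching and link extraction modules):

```
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

// Illustrative breadth-first frontier loop; the real crawler additionally applies
// content normalization, boilerplate removal, classification and link prioritization.
public class FrontierLoop {

    public static void crawl(List<String> seedUrls, int maxPages) {
        Queue<String> frontier = new ArrayDeque<>(seedUrls); // pages to be visited
        Set<String> visited = new HashSet<>();
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            if (!visited.add(url)) {
                continue; // already fetched
            }
            String page = fetch(url);                 // hypothetical fetcher
            for (String link : extractLinks(page)) {  // hypothetical link extractor
                if (!visited.contains(link)) {
                    frontier.add(link);
                }
            }
        }
    }

    // Placeholders standing in for the page fetching and link extraction modules.
    static String fetch(String url) { return ""; }
    static List<String> extractLinks(String html) { return List.of(); }
}
```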

The required input from the user consists of a list of seed URLs to initiate the crawler's frontier, and a list of terms that describe a targeted topic. If the user does not provide a list of terms, the software can be used as a general crawler.
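
For reference, a seed URL file is a plain text file; assuming the common one-URL-per-line format, a file for the bilingual www.esteri.it crawl shown further below might look like this (the specific paths are only illustrative):

```
http://www.esteri.it/mae/en/
http://www.esteri.it/mae/it/
```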

The following example starts a new crawl for acquiring multilingual data. In the destination defined by the -dest option, a directory (the crawl directory) is created; its name is based on the argument of the -a option and the date. Inside this crawl directory, an auto-generated child directory denotes the crawl id. Within this auto-generated directory, the data acquired in each crawl cycle is stored in run directories.

```
java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar /opt/ilsp-fc/ilsp-fc-2.2.3-SNAPSHOT-jar-with-dependencies.jar -crawl -f -k -a test -type p \
-lang "L1;L2;L3;L4" -n 2 -t 20 -len 0 -mtlen 100 -dest (fullpath of the destination to store crawl results) -u (fullpath of the text file containing the seed URLs) \
-filter (regex to control URLs to be visited) -tc (full path of topic file) -dom (title of targeted topic) &>"log_crawl"
```
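
For example, assuming -a test and -dest /opt/crawls (both hypothetical values), the resulting layout might look as follows; the crawl directory and crawl id names are auto-generated, so the exact names shown here, as well as the placement of the pdf directory (see the -dest option below), are only illustrative:

```
/opt/crawls/              <- argument of -dest
  test_2016-05-31/        <- crawl directory (agent name plus date)
    1/                    <- auto-generated crawl id directory
      run0/               <- data acquired in the first crawl cycle
      run1/               <- data acquired in the second crawl cycle
      pdf/                <- acquired pdf files
```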

By changing the arguments of options -type and -lang to "m" and "L1" respectively, the same command can be used for acquiring monolingual data, as shown below.
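
For instance, with "eng" (chosen here only for illustration) as the targeted language:

```
java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar /opt/ilsp-fc/ilsp-fc-2.2.3-SNAPSHOT-jar-with-dependencies.jar -crawl -f -k -a test -type m \
-lang "eng" -n 2 -t 20 -len 0 -mtlen 100 -dest (fullpath of the destination to store crawl results) -u (fullpath of the text file containing the seed URLs) \
-filter (regex to control URLs to be visited) -tc (full path of topic file) -dom (title of targeted topic) &>"log_crawl"
```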

This is an example of running a crawl:

```
java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar /opt/ilsp-fc/ilsp-fc-2.2.3-jar-with-dependencies.jar \
-crawl -f -type p -lang "eng;ita" -a www_esteri_it \
-u "/home/user/seeds/eng-ita-seeds" -filter ".*www\.esteri\.it\/.*" \
-n 100 -t 20 -len 0 -mtlen 100 -k \
-dest "/var/www/html/tests/eng-ita/" \
&> "/var/www/html/tests/eng-ita/log-crawl_www_esteri_it_eng-ita"
```

## Options

```
-crawl : Runs the crawling process.

-f : Forces the crawler to start a new job.

-type : The type of crawling: acquisition of monolingual (m) or parallel (p) data.

-lang : The ISO codes of the targeted languages, separated by ";".

-cfg : The full path to a configuration file that can be used to override default parameters.

-a : User agent name. It is recommended to use a name similar to that of the targeted site.

-u : Full path of the text file that contains the seed URLs that will initialize the crawler.
     In the case of bilingual crawling, the list should contain the URL of the main page of the
     targeted website, or other URLs from this website.

-filter : A regular expression for filtering out URLs that do NOT match it. This filter forces
     the crawler to focus on a specific web domain (e.g. ".*ec.europa.eu.*"), on part of a web
     domain (e.g. ".*/legislation_summaries/environment.*"), or on a set of websites (e.g. when
     the translations are hosted on two websites such as http://www.nrcan.gc.ca and
     http://www.rncan.gc.ca). Note that if this filter is used, only the seed URLs that match
     this regex will be fetched. See the example after this list.

-n : The crawl duration in cycles. Since the crawler runs in cycles (during which links stored
     at the top of the crawler's frontier are extracted and new links are examined), it is
     recommended to use this parameter either for testing purposes or, with a large value
     (e.g. 100), to ensure that the crawler visits the entire website.

-dest : The directory where the results (i.e. the crawled data) will be stored. The tool will
     create the file structure dest/agent/crawl-id (where dest and agent stand for the arguments
     of the dest and agent parameters respectively, and crawl-id is generated automatically).
     In this directory, the tool will create the "run" directories (i.e. directories containing
     all resources fetched/extracted/used/required for each cycle of this crawl). In addition,
     a pdf directory for storing acquired pdf files will be created.

-t : The number of threads that will be used to fetch web pages in parallel.

-k : Forces the crawler to annotate boilerplate content in parsed text.

-len : Minimum number of tokens per paragraph. If the length (in tokens) of a paragraph is less
     than this value, the paragraph will be annotated as "out of interest" and will not be
     included in the clean text of the web page.

-mtlen : Minimum number of tokens in the cleaned document. If the length (in tokens) of the
     cleaned text is less than this value, the document will not be stored.

-tc : Full path of the topic file (a text file that contains a list of term triplets describing
     the targeted topic). An example domain definition of "Environment" for the English-Spanish
     pair can be found at http://nlp.ilsp.gr/redmine/projects/ilsp-fc/wiki/ENV_EN_ES_topic.
     If omitted, the crawl will be a "general" one (i.e. the module for text-to-domain
     classification will not be used).

-dom : Title of the targeted domain (required when a domain definition, i.e. the -tc parameter,
     is used).

-storefilter : A regular expression for discarding (i.e. visiting/fetching/processing but NOT
     storing) web pages whose URLs do not match it. See the example after this list.
```
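
As a hypothetical example of combining the two regex options, the following arguments make the crawler traverse all pages of the www.esteri.it domain while storing only those whose URLs contain /en/ or /it/:

```
-filter ".*www\.esteri\.it\/.*" -storefilter ".*\/(en|it)\/.*"
```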