Crawler config » History » Version 1
Prokopis Prokopidis, 2012-10-26 11:24 AM
<pre><code class="xml"><?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <agent>
    <email>yourmail@mail.com</email>
    <web_address>www.youraddress.com</web_address>
  </agent>
  <classifier>
    <min_content_terms>
      <value>2</value>
      <description>Minimum number of terms that must exist in the clean content of each web page for the page to be stored.</description>
    </min_content_terms>
    <min_unique_content_terms>
      <value>2</value>
      <description>Minimum number of unique terms that must exist in the clean content</description>
    </min_unique_content_terms>
    <max_depth>
      <value>10</value>
      <description>Maximum depth to crawl before abandoning a specific path. Depth is increased every time a link is extracted from a non-relevant web page.</description>
    </max_depth>
  </classifier>
  <fetcher>
    <fetch_buffer_size>
      <description>Max number of URLs to fetch per run</description>
      <value>512</value>
    </fetch_buffer_size>
    <socket_timeout>
      <value>10000</value>
      <description>Socket timeout in milliseconds (per URL)</description>
    </socket_timeout>
    <connection_timeout>
      <value>10000</value>
      <description>Connection timeout in milliseconds (per URL)</description>
    </connection_timeout>
    <max_retry_count>
      <value>2</value>
      <description>Max number of attempts to fetch a web page before giving up</description>
    </max_retry_count>
    <min_response_rate>
      <value>0</value>
      <description>Min bytes per second for fetching a web page</description>
    </min_response_rate>
    <valid_mime_types>
      <mime_type value="text/html" />
      <mime_type value="text/plain" />
      <mime_type value="application/xhtml+xml" />
      <description>Accepted MIME types</description>
    </valid_mime_types>
    <crawl_delay>
      <value>1500</value>
      <description>Delay in milliseconds between requests</description>
    </crawl_delay>
    <max_content_size>
      <value>531072</value>
      <description>Max content size (bytes) for downloading a web page</description>
    </max_content_size>
    <max_requests_per_run>
      <value>512</value>
      <description>Max fetch set size per run (sets are made up of URLs from the same host)</description>
    </max_requests_per_run>
    <max_requests_per_host_per_run>
      <value>512</value>
      <description>Max URLs from a specific host per run</description>
    </max_requests_per_host_per_run>
    <max_connections_per_host>
      <value>32</value>
      <description>Max number of fetching threads for each host</description>
    </max_connections_per_host>
    <max_fetched_per_host>
      <value>500000</value>
      <description>Max web pages to fetch per host</description>
    </max_fetched_per_host>
    <max_redirects>
      <value>5</value>
      <description>Max number of redirects</description>
    </max_redirects>
    <request_timeout>
      <value>600000</value>
      <description>Max time to wait for Fetcher to get all URLs in a run</description>
    </request_timeout>
  </fetcher>
</configuration>
</code></pre>
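
Each setting in the configuration above keeps its data in a `<value>` child, and the accepted MIME types are `value` attributes on `<mime_type>` elements. The following minimal Python sketch shows one way a script could read them, assuming that layout. The helper names `read_setting` and `read_mime_types` are hypothetical and not part of the crawler, and the embedded XML is a trimmed excerpt of the file, not the full configuration.

```python
import xml.etree.ElementTree as ET

# Trimmed excerpt of the configuration, embedded so the sketch runs on its own.
# In practice the root would come from ET.parse(<path to the config file>).getroot().
SAMPLE_CONFIG = b"""<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <classifier>
    <max_depth>
      <value>10</value>
    </max_depth>
  </classifier>
  <fetcher>
    <socket_timeout>
      <value>10000</value>
    </socket_timeout>
    <valid_mime_types>
      <mime_type value="text/html" />
      <mime_type value="text/plain" />
      <mime_type value="application/xhtml+xml" />
    </valid_mime_types>
    <crawl_delay>
      <value>1500</value>
    </crawl_delay>
    <max_redirects>
      <value>5</value>
    </max_redirects>
  </fetcher>
</configuration>
"""


def read_setting(root, section, name, cast=int):
    """Return the <value> text of <section>/<name>, converted with `cast`.

    Hypothetical helper: raises KeyError if the setting or its <value> is missing.
    """
    text = root.findtext(f"{section}/{name}/value")
    if text is None:
        raise KeyError(f"{section}/{name} has no <value> element")
    return cast(text.strip())


def read_mime_types(root):
    """Return the accepted MIME types in document order (hypothetical helper)."""
    return [m.get("value") for m in root.iterfind("fetcher/valid_mime_types/mime_type")]


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE_CONFIG)
    print("crawl delay (ms):", read_setting(root, "fetcher", "crawl_delay"))
    print("max depth:", read_setting(root, "classifier", "max_depth"))
    print("MIME types:", read_mime_types(root))
```

Looking settings up by path keeps the reader independent of the order of `<value>` and `<description>` inside each element, which differs between settings in the file.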