FMC config » History » Version 4
Vassilis Papavassiliou, 2014-08-15 12:54 PM
1 | 3 | Vassilis Papavassiliou | <pre><code class="xml"> |
---|---|---|---|
2 | 1 | Vassilis Papavassiliou | <?xml version="1.0" encoding="UTF-8"?> |
3 | 1 | Vassilis Papavassiliou | <configuration> |
4 | 1 | Vassilis Papavassiliou | <agent> |
5 | 1 | Vassilis Papavassiliou | <email>yourmail@mail.com</email> |
6 | 1 | Vassilis Papavassiliou | <web_address>www.youraddress.com</web_address> |
7 | 1 | Vassilis Papavassiliou | </agent> |
8 | 1 | Vassilis Papavassiliou | <classifier> |
9 | 1 | Vassilis Papavassiliou | <min_content_terms> |
10 | 1 | Vassilis Papavassiliou | <value>4</value> |
11 | 1 | Vassilis Papavassiliou | <description>Minimum number of terms that must exist in the clean |
12 | 4 | Vassilis Papavassiliou | content of each web page in order for it to be stored. This number |
13 | 4 | Vassilis Papavassiliou | is multiplied by the median value of the terms' weights, and |
14 | 4 | Vassilis Papavassiliou | the result is the threshold for the absolute relevance score.</description> |
15 | 1 | Vassilis Papavassiliou | </min_content_terms> |
16 | 1 | Vassilis Papavassiliou | <min_unique_content_terms> |
17 | 4 | Vassilis Papavassiliou | <value>4</value> |
18 | 1 | Vassilis Papavassiliou | <description>Minimum number of unique terms that must exist in the clean content</description> |
19 | 1 | Vassilis Papavassiliou | </min_unique_content_terms> |
20 | 4 | Vassilis Papavassiliou | <relative_relevance_threshold> |
21 | 4 | Vassilis Papavassiliou | <value>0.2</value> |
22 | 4 | Vassilis Papavassiliou | <description>The absolute relevance score is divided by the length |
23 | 4 | Vassilis Papavassiliou | (in tokens) of the clean content of a document, and the |
24 | 4 | Vassilis Papavassiliou | resulting relative relevance score is compared with this value.</description> |
25 | 4 | Vassilis Papavassiliou | </relative_relevance_threshold> |
26 | 1 | Vassilis Papavassiliou | <max_depth> |
27 | 4 | Vassilis Papavassiliou | <value>4</value> |
28 | 1 | Vassilis Papavassiliou | <description>Maximum depth to crawl before abandoning a specific path. Depth |
29 | 1 | Vassilis Papavassiliou | is increased every time a link is extracted from a non-relevant web page.</description> |
30 | 1 | Vassilis Papavassiliou | </max_depth> |
31 | 1 | Vassilis Papavassiliou | </classifier> |
32 | 1 | Vassilis Papavassiliou | <fetcher> |
33 | 1 | Vassilis Papavassiliou | <fetch_buffer_size> |
34 | 1 | Vassilis Papavassiliou | <description>Max number of URLs to fetch per run</description> |
35 | 1 | Vassilis Papavassiliou | <value>512</value> |
36 | 1 | Vassilis Papavassiliou | </fetch_buffer_size> |
37 | 1 | Vassilis Papavassiliou | <socket_timeout> |
38 | 1 | Vassilis Papavassiliou | <value>10000</value> |
39 | 1 | Vassilis Papavassiliou | <description>Socket timeout in milliseconds (per URL)</description> |
40 | 1 | Vassilis Papavassiliou | </socket_timeout> |
41 | 1 | Vassilis Papavassiliou | <connection_timeout> |
42 | 1 | Vassilis Papavassiliou | <value>10000</value> |
43 | 1 | Vassilis Papavassiliou | <description>Connection timeout in milliseconds (per URL)</description> |
44 | 1 | Vassilis Papavassiliou | </connection_timeout> |
45 | 1 | Vassilis Papavassiliou | <max_retry_count> |
46 | 1 | Vassilis Papavassiliou | <value>2</value> |
47 | 1 | Vassilis Papavassiliou | <description>Max number of attempts to fetch a Web page before giving up</description> |
48 | 1 | Vassilis Papavassiliou | </max_retry_count> |
49 | 1 | Vassilis Papavassiliou | <min_response_rate> |
50 | 1 | Vassilis Papavassiliou | <value>0</value> |
51 | 1 | Vassilis Papavassiliou | <description>Min bytes per second for fetching a web page</description> |
52 | 1 | Vassilis Papavassiliou | </min_response_rate> |
53 | 1 | Vassilis Papavassiliou | <valid_mime_types> |
54 | 1 | Vassilis Papavassiliou | <mime_type value="text/html" /> |
55 | 1 | Vassilis Papavassiliou | <mime_type value="text/plain" /> |
56 | 1 | Vassilis Papavassiliou | <mime_type value="application/xhtml+xml" /> |
57 | 4 | Vassilis Papavassiliou | <!--<mime_type value="application/pdf" /> |
58 | 4 | Vassilis Papavassiliou | <mime_type value="application/x-pdf" /> --> |
59 | 1 | Vassilis Papavassiliou | <description>Accepted mime types</description> |
60 | 1 | Vassilis Papavassiliou | </valid_mime_types> |
61 | 1 | Vassilis Papavassiliou | <crawl_delay> |
62 | 1 | Vassilis Papavassiliou | <value>1500</value> |
63 | 1 | Vassilis Papavassiliou | <description>Delay in milliseconds between requests</description> |
64 | 1 | Vassilis Papavassiliou | </crawl_delay> |
65 | 1 | Vassilis Papavassiliou | <max_content_size> |
66 | 1 | Vassilis Papavassiliou | <value>531072</value> |
67 | 1 | Vassilis Papavassiliou | <description>Max content size (bytes) for downloading a web page</description> |
68 | 1 | Vassilis Papavassiliou | </max_content_size> |
69 | 1 | Vassilis Papavassiliou | <max_requests_per_run> |
70 | 1 | Vassilis Papavassiliou | <value>512</value> |
71 | 1 | Vassilis Papavassiliou | <description>Max fetch set size per run (sets are made up of URLs from the same host)</description> |
72 | 1 | Vassilis Papavassiliou | </max_requests_per_run> |
73 | 1 | Vassilis Papavassiliou | <max_requests_per_host_per_run> |
74 | 1 | Vassilis Papavassiliou | <value>512</value> |
75 | 1 | Vassilis Papavassiliou | <description>Max URLs from a specific host per run</description> |
76 | 1 | Vassilis Papavassiliou | </max_requests_per_host_per_run> |
77 | 1 | Vassilis Papavassiliou | <max_connections_per_host> |
78 | 1 | Vassilis Papavassiliou | <value>32</value> |
79 | 1 | Vassilis Papavassiliou | <description>Max number of fetching threads for each host</description> |
80 | 1 | Vassilis Papavassiliou | </max_connections_per_host> |
81 | 1 | Vassilis Papavassiliou | <max_fetched_per_host> |
82 | 1 | Vassilis Papavassiliou | <value>1000</value> |
83 | 1 | Vassilis Papavassiliou | <description>Max web pages to fetch per host</description> |
84 | 1 | Vassilis Papavassiliou | </max_fetched_per_host> |
85 | 1 | Vassilis Papavassiliou | <max_redirects> |
86 | 1 | Vassilis Papavassiliou | <value>5</value> |
87 | 1 | Vassilis Papavassiliou | <description>Max number of redirects</description> |
88 | 1 | Vassilis Papavassiliou | </max_redirects> |
89 | 1 | Vassilis Papavassiliou | <request_timeout> |
90 | 1 | Vassilis Papavassiliou | <value>600000</value> |
91 | 1 | Vassilis Papavassiliou | <description>Max time to wait for Fetcher to get all URLs in a run</description> |
92 | 1 | Vassilis Papavassiliou | </request_timeout> |
93 | 1 | Vassilis Papavassiliou | </fetcher> |
94 | 1 | Vassilis Papavassiliou | </configuration> |
95 | 3 | Vassilis Papavassiliou | </code></pre> |
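
The `min_content_terms` and `relative_relevance_threshold` descriptions in the configuration above define two thresholds. The following Python sketch is one possible reading of them, not the crawler's actual code. It reads the two values from a trimmed copy of the `<classifier>` section and applies both checks. The function names, the example term weights, and the use of `>=` for both comparisons are assumptions made for illustration; the configuration only states that the scores are compared with the thresholds.

```python
import statistics
import xml.etree.ElementTree as ET

# Trimmed copy of the <classifier> section (descriptions omitted).
CONFIG = """<configuration>
  <classifier>
    <min_content_terms><value>4</value></min_content_terms>
    <relative_relevance_threshold><value>0.2</value></relative_relevance_threshold>
  </classifier>
</configuration>"""


def read_value(root, path):
    """Return the text of the <value> child of the element at `path`."""
    node = root.find(f"{path}/value")
    if node is None or node.text is None:
        raise KeyError(path)
    return node.text.strip()


def thresholds(root, term_weights):
    """Return (absolute_threshold, relative_threshold).

    absolute_threshold = min_content_terms * median(term weights),
    as stated in the min_content_terms description.
    `term_weights` stands in for the weights of the topic terms
    (hypothetical input; the config does not define where they come from).
    """
    min_terms = int(read_value(root, "classifier/min_content_terms"))
    relative = float(read_value(root, "classifier/relative_relevance_threshold"))
    return min_terms * statistics.median(term_weights), relative


def is_relevant(abs_score, token_count, abs_threshold, rel_threshold):
    """Apply both checks; `>=` is an assumption, not confirmed by the source."""
    if token_count <= 0:
        return False
    relative_score = abs_score / token_count
    return abs_score >= abs_threshold and relative_score >= rel_threshold


root = ET.fromstring(CONFIG)
abs_t, rel_t = thresholds(root, [100, 50, 50, 25])  # illustrative weights
print(abs_t, rel_t)
print(is_relevant(250, 1000, abs_t, rel_t))
print(is_relevant(250, 2000, abs_t, rel_t))
```

With these illustrative weights the median is 50, so the absolute threshold is 4 × 50 = 200. A page scoring 250 over 1000 tokens has a relative score of 0.25 and passes both checks, while the same score spread over 2000 tokens (0.125) falls below the 0.2 relative threshold.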