FBC config » History » Version 3
Vassilis Papavassiliou, 2014-08-15 11:40 AM
1 | 2 | Vassilis Papavassiliou | <pre><code class="xml"> |
---|---|---|---|
2 | 1 | Vassilis Papavassiliou | <?xml version="1.0" encoding="UTF-8"?> |
3 | 1 | Vassilis Papavassiliou | <configuration> |
4 | 1 | Vassilis Papavassiliou | <agent> |
5 | 1 | Vassilis Papavassiliou | <email>yourmail@mail.com</email> |
6 | 1 | Vassilis Papavassiliou | <web_address>www.youraddress.com</web_address> |
7 | 1 | Vassilis Papavassiliou | </agent> |
8 | 1 | Vassilis Papavassiliou | <classifier> |
9 | 1 | Vassilis Papavassiliou | <min_content_terms> |
10 | 1 | Vassilis Papavassiliou | <value>2</value> |
11 | 1 | Vassilis Papavassiliou | <description>Minimum number of terms that must exist in the clean |
12 | 3 | Vassilis Papavassiliou | content of each web page for it to be stored. This number |
13 | 3 | Vassilis Papavassiliou | is multiplied by the median value of the terms' weights, and |
14 | 3 | Vassilis Papavassiliou | the result is the threshold for the absolute relevance score.</description> |
15 | 1 | Vassilis Papavassiliou | </min_content_terms> |
16 | 1 | Vassilis Papavassiliou | <min_unique_content_terms> |
17 | 1 | Vassilis Papavassiliou | <value>2</value> |
18 | 1 | Vassilis Papavassiliou | <description>Minimum number of unique terms that must exist in the clean content</description> |
19 | 1 | Vassilis Papavassiliou | </min_unique_content_terms> |
20 | 3 | Vassilis Papavassiliou | <relative_relevance_threshold> |
21 | 3 | Vassilis Papavassiliou | <value>0.2</value> |
22 | 3 | Vassilis Papavassiliou | <description>The absolute relevance score is divided by the length |
23 | 3 | Vassilis Papavassiliou | (in tokens) of the clean content of a document, and the resulting |
24 | 3 | Vassilis Papavassiliou | relative relevance score is compared with this value.</description> |
25 | 3 | Vassilis Papavassiliou | </relative_relevance_threshold> |
26 | 1 | Vassilis Papavassiliou | <max_depth> |
27 | 1 | Vassilis Papavassiliou | <value>10</value> |
28 | 1 | Vassilis Papavassiliou | <description>Maximum depth to crawl before abandoning a specific path. Depth |
29 | 1 | Vassilis Papavassiliou | is increased every time a link is extracted from a non-relevant web page.</description> |
30 | 1 | Vassilis Papavassiliou | </max_depth> |
31 | 1 | Vassilis Papavassiliou | </classifier> |
32 | 3 | Vassilis Papavassiliou | <aligner> |
33 | 3 | Vassilis Papavassiliou | <win_align_path> |
34 | 3 | Vassilis Papavassiliou | <value>hunalign-1.1/win/hunalign.exe</value> |
35 | 3 | Vassilis Papavassiliou | <description>Relative path to the hunalign executable for Windows. |
36 | 3 | Vassilis Papavassiliou | The main hunalign directory is expected to be next to the crawler's jar.</description> |
37 | 3 | Vassilis Papavassiliou | </win_align_path> |
38 | 3 | Vassilis Papavassiliou | <lin_align_path> |
39 | 3 | Vassilis Papavassiliou | <value>hunalign-1.1/linux/src/hunalign/hunalign</value> |
40 | 3 | Vassilis Papavassiliou | <description>Relative path to the hunalign executable for Linux. |
41 | 3 | Vassilis Papavassiliou | The main hunalign directory is expected to be next to the crawler's jar.</description> |
42 | 3 | Vassilis Papavassiliou | </lin_align_path> |
43 | 3 | Vassilis Papavassiliou | <align_dict> |
44 | 3 | Vassilis Papavassiliou | <value>hunalign-1.1/dict</value> |
45 | 3 | Vassilis Papavassiliou | <description>Relative path to the hunalign dictionaries. |
46 | 3 | Vassilis Papavassiliou | The main hunalign directory is expected to be next to the crawler's jar.</description> |
47 | 3 | Vassilis Papavassiliou | </align_dict> |
48 | 3 | Vassilis Papavassiliou | </aligner> |
49 | 1 | Vassilis Papavassiliou | <fetcher> |
50 | 1 | Vassilis Papavassiliou | <fetch_buffer_size> |
51 | 1 | Vassilis Papavassiliou | <description>Max number of URLs to fetch per run</description> |
52 | 1 | Vassilis Papavassiliou | <value>512</value> |
53 | 1 | Vassilis Papavassiliou | </fetch_buffer_size> |
54 | 1 | Vassilis Papavassiliou | <socket_timeout> |
55 | 3 | Vassilis Papavassiliou | <value>1000</value> |
56 | 1 | Vassilis Papavassiliou | <description>Socket timeout in milliseconds (per URL)</description> |
57 | 1 | Vassilis Papavassiliou | </socket_timeout> |
58 | 1 | Vassilis Papavassiliou | <connection_timeout> |
59 | 3 | Vassilis Papavassiliou | <value>1000</value> |
60 | 1 | Vassilis Papavassiliou | <description>Connection timeout in milliseconds (per URL)</description> |
61 | 1 | Vassilis Papavassiliou | </connection_timeout> |
62 | 1 | Vassilis Papavassiliou | <max_retry_count> |
63 | 1 | Vassilis Papavassiliou | <value>2</value> |
64 | 1 | Vassilis Papavassiliou | <description>Max number of attempts to fetch a Web page before giving up</description> |
65 | 1 | Vassilis Papavassiliou | </max_retry_count> |
66 | 1 | Vassilis Papavassiliou | <min_response_rate> |
67 | 1 | Vassilis Papavassiliou | <value>0</value> |
68 | 1 | Vassilis Papavassiliou | <description>Minimum response rate, in bytes per second, for fetching a web page</description> |
69 | 1 | Vassilis Papavassiliou | </min_response_rate> |
70 | 1 | Vassilis Papavassiliou | <valid_mime_types> |
71 | 1 | Vassilis Papavassiliou | <mime_type value="text/html" /> |
72 | 1 | Vassilis Papavassiliou | <mime_type value="text/plain" /> |
73 | 1 | Vassilis Papavassiliou | <mime_type value="application/xhtml+xml" /> |
74 | 1 | Vassilis Papavassiliou | <description>Accepted mime types</description> |
75 | 1 | Vassilis Papavassiliou | </valid_mime_types> |
76 | 1 | Vassilis Papavassiliou | <crawl_delay> |
77 | 3 | Vassilis Papavassiliou | <value>1000</value> |
78 | 1 | Vassilis Papavassiliou | <description>Delay in milliseconds between requests</description> |
79 | 1 | Vassilis Papavassiliou | </crawl_delay> |
80 | 1 | Vassilis Papavassiliou | <max_content_size> |
81 | 1 | Vassilis Papavassiliou | <value>531072</value> |
82 | 1 | Vassilis Papavassiliou | <description>Max content size (bytes) for downloading a web page</description> |
83 | 1 | Vassilis Papavassiliou | </max_content_size> |
84 | 1 | Vassilis Papavassiliou | <max_requests_per_run> |
85 | 1 | Vassilis Papavassiliou | <value>512</value> |
86 | 1 | Vassilis Papavassiliou | <description>Max fetch set size per run (sets are made up of URLs from the same host)</description> |
87 | 1 | Vassilis Papavassiliou | </max_requests_per_run> |
88 | 1 | Vassilis Papavassiliou | <max_requests_per_host_per_run> |
89 | 1 | Vassilis Papavassiliou | <value>512</value> |
90 | 1 | Vassilis Papavassiliou | <description>Max URLs from a specific host per run</description> |
91 | 1 | Vassilis Papavassiliou | </max_requests_per_host_per_run> |
92 | 1 | Vassilis Papavassiliou | <max_connections_per_host> |
93 | 3 | Vassilis Papavassiliou | <value>100</value> |
94 | 1 | Vassilis Papavassiliou | <description>Max number of fetching threads for each host</description> |
95 | 1 | Vassilis Papavassiliou | </max_connections_per_host> |
96 | 1 | Vassilis Papavassiliou | <max_fetched_per_host> |
97 | 1 | Vassilis Papavassiliou | <value>10000000</value> |
98 | 1 | Vassilis Papavassiliou | <description>Max web pages to fetch per host</description> |
99 | 1 | Vassilis Papavassiliou | </max_fetched_per_host> |
100 | 1 | Vassilis Papavassiliou | <max_redirects> |
101 | 1 | Vassilis Papavassiliou | <value>5</value> |
102 | 1 | Vassilis Papavassiliou | <description>Max number of redirects</description> |
103 | 1 | Vassilis Papavassiliou | </max_redirects> |
104 | 1 | Vassilis Papavassiliou | <request_timeout> |
105 | 1 | Vassilis Papavassiliou | <value>600000</value> |
106 | 1 | Vassilis Papavassiliou | <description>Max time to wait for Fetcher to get all URLs in a run</description> |
107 | 1 | Vassilis Papavassiliou | </request_timeout> |
108 | 1 | Vassilis Papavassiliou | </fetcher> |
109 | 1 | Vassilis Papavassiliou | </configuration> |
110 | 2 | Vassilis Papavassiliou | </code></pre> |
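
The two relevance checks described under `<classifier>` can be illustrated with a small Python sketch. It is not taken from the crawler's source. It only follows the descriptions of `min_content_terms` and `relative_relevance_threshold` above. The function names, the sample term weights, the use of `>=` for the comparisons, and the requirement that both checks pass are assumptions made for illustration. How the absolute relevance score itself is computed is not specified on this page, so it is passed in as a plain number.

```python
from statistics import median


def absolute_threshold(min_content_terms, term_weights):
    """Threshold for the absolute relevance score:
    min_content_terms multiplied by the median of the terms' weights."""
    return min_content_terms * median(term_weights)


def is_relevant(abs_score, n_tokens, min_content_terms,
                term_weights, relative_threshold):
    """Apply both checks (assumption: a page must pass both, using >=)."""
    if n_tokens <= 0:
        # No clean content, so no relative score can be computed.
        return False
    passes_absolute = abs_score >= absolute_threshold(min_content_terms,
                                                      term_weights)
    # Relative score: absolute score divided by the length in tokens.
    passes_relative = (abs_score / n_tokens) >= relative_threshold
    return passes_absolute and passes_relative


# Hypothetical topic-term weights; real values come from the topic definition.
weights = [10, 50, 100]

# With min_content_terms = 2 the absolute threshold is 2 * median(weights).
print(absolute_threshold(2, weights))
# Same absolute score, different document lengths.
print(is_relevant(150, 500, 2, weights, 0.2))
print(is_relevant(150, 1000, 2, weights, 0.2))
```

With the default values shown in the configuration (`2` and `0.2`), a longer document needs a proportionally higher absolute score to pass the relative check, which is the point of dividing by the token count.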