Crawler config » History » Version 1

Prokopis Prokopidis, 2012-10-26 11:24 AM


<?xml version="1.0" encoding="UTF-8"?>

Email: yourmail@mail.com
Web address: www.youraddress.com

Minimum number of terms that must exist in the clean content of each web page in order for it to be stored: 2

Minimum number of unique terms that must exist in the clean content: 2

Maximum depth to crawl before abandoning a specific path (depth is increased every time a link is extracted from a non-relevant web page): 10
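The two term thresholds can be sketched as a simple storage filter. This is a minimal illustration, not the crawler's actual code; the function name and whitespace-based tokenization are assumptions.

```python
import re

# Thresholds from the configuration above.
MIN_TERMS = 2
MIN_UNIQUE_TERMS = 2

def should_store(clean_text):
    """Hypothetical check: store a page only if its clean content has
    enough terms overall and enough distinct terms."""
    terms = re.findall(r"\w+", clean_text.lower())
    return len(terms) >= MIN_TERMS and len(set(terms)) >= MIN_UNIQUE_TERMS
```

With these settings a page whose clean text is a single repeated word would be rejected, since it has only one unique term.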




Max number of URLs to fetch per run: 512

Socket timeout in milliseconds (per URL): 10000

Connection timeout in milliseconds (per URL): 10000

Max number of attempts to fetch a web page before giving up: 2

Min bytes per second for fetching a web page: 0
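The timeout and retry settings can be sketched with Python's standard library. This is only an illustration of the semantics (a 10000 ms timeout per URL, at most 2 attempts); the `fetch` helper is an assumption, not part of the crawler.

```python
import socket
import urllib.request
import urllib.error

# Values from the configuration above.
TIMEOUT_MS = 10000
MAX_ATTEMPTS = 2

def fetch(url):
    """Hypothetical fetch loop: retry a failed request until the
    configured number of attempts is exhausted, then give up."""
    for _ in range(MAX_ATTEMPTS):
        try:
            with urllib.request.urlopen(url, timeout=TIMEOUT_MS / 1000.0) as resp:
                return resp.read()
        except (urllib.error.URLError, socket.timeout):
            continue  # try again, up to MAX_ATTEMPTS
    return None
```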





Accepted MIME types:

Delay in milliseconds between requests: 1500

Max content size (bytes) for downloading a web page: 531072

Max fetch set size per run (sets are made of URLs from the same host): 512

Max URLs from a specific host per run: 512

Max number of fetching threads for each host: 32
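The politeness delay and content-size cap can be sketched as follows. The `RateLimiter` class and `polite_fetch` helper are assumptions for illustration; the crawler may implement these settings differently.

```python
import time
import urllib.request

# Values from the configuration above.
REQUEST_DELAY_MS = 1500
MAX_CONTENT_BYTES = 531072

class RateLimiter:
    """Hypothetical limiter: enforce a minimum delay between requests."""
    def __init__(self, delay_ms):
        self.delay = delay_ms / 1000.0
        self.last = None

    def wait(self):
        # Sleep just long enough to keep the configured spacing.
        if self.last is not None:
            remaining = self.delay - (time.monotonic() - self.last)
            if remaining > 0:
                time.sleep(remaining)
        self.last = time.monotonic()

limiter = RateLimiter(REQUEST_DELAY_MS)

def polite_fetch(url):
    limiter.wait()
    with urllib.request.urlopen(url) as resp:
        return resp.read(MAX_CONTENT_BYTES)  # cap the download size
```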



Max web pages to fetch per host: 500000

Max number of redirects: 5

Max time in milliseconds to wait for the Fetcher to get all URLs in a run: 600000
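The redirect cap can be illustrated with Python's standard library, whose `HTTPRedirectHandler` exposes a `max_redirections` attribute. This only mirrors the semantics of the setting above; the crawler itself may enforce the limit differently.

```python
import urllib.request

class CappedRedirects(urllib.request.HTTPRedirectHandler):
    # Stop following redirects after 5 hops, matching the configuration.
    max_redirections = 5

# An opener built with this handler raises an error once the cap is hit.
opener = urllib.request.build_opener(CappedRedirects)
```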