Version 3 - History - FBC config - FBC config - ILSP Focused Crawler - ILSP NLP

FBC config » History » Version 3

Vassilis Papavassiliou, 2014-08-15 11:40 AM

<?xml version="1.0" encoding="UTF-8"?>

yourmail@mail.com
www.youraddress.com

2
Minimum number of terms that must exist in the clean
content of each web page for the page to be stored. This number
is multiplied by the median value of the terms' weights, and
the result is the threshold for the absolute relevance score.
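
The threshold described above can be sketched as follows. This is only an illustration: the function name `absolute_threshold` and the sample weights are hypothetical and are not taken from the crawler's code.

```python
from statistics import median

def absolute_threshold(min_terms, term_weights):
    """Multiply the configured minimum number of terms by the median term weight."""
    return min_terms * median(term_weights)

# Hypothetical term weights from a topic definition, for illustration only.
sample_weights = [10, 50, 100]
threshold = absolute_threshold(2, sample_weights)  # 2 * 50
```

A page whose absolute relevance score falls below this threshold would presumably not be stored.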

2
Minimum number of unique terms that must exist in the clean content

0.2
The absolute relevance score is divided by the length
(in tokens) of the clean content of a document, and the
resulting relative relevance score is compared with this value
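
A minimal sketch of this check, assuming a page passes when its relative score is at least the configured value (the text does not say whether the comparison is strict). The function name and arguments are hypothetical.

```python
def passes_relative_threshold(absolute_score, token_count, min_relative=0.2):
    """Compare the score per token of clean content with the configured minimum."""
    if token_count == 0:
        # An empty document has no defined relative score; treat it as failing.
        return False
    return absolute_score / token_count >= min_relative
```

With the default of 0.2, a page with an absolute score of 100 and 400 tokens (relative score 0.25) would pass, while the same score spread over 1000 tokens (0.1) would not.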

10
Maximum depth to crawl before abandoning a specific path. Depth
is increased every time a link is extracted from a non-relevant web page.
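
The depth rule can be sketched as below. Two points are assumptions, not statements from the configuration: that the depth stays unchanged for links extracted from relevant pages, and that a path is abandoned once its depth exceeds the maximum. All names are hypothetical.

```python
MAX_DEPTH = 10  # value from the configuration above

def next_depth(parent_depth, parent_is_relevant):
    """Depth assigned to a link extracted from a parent page."""
    # Assumption: only links from non-relevant pages increase the depth.
    return parent_depth if parent_is_relevant else parent_depth + 1

def should_follow(depth, max_depth=MAX_DEPTH):
    """Assumption: a path is abandoned once its depth exceeds max_depth."""
    return depth <= max_depth
```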

hunalign-1.1/win/hunalign.exe
Relative path to the hunalign executable for Windows.
The main hunalign directory is expected to be next to the crawler's jar.

hunalign-1.1/linux/src/hunalign/hunalign
Relative path to the hunalign executable for Linux.
The main hunalign directory is expected to be next to the crawler's jar.

hunalign-1.1/dict
Relative path to the hunalign dictionaries.
The main hunalign directory is expected to be next to the crawler's jar.
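
To illustrate how the three relative paths above relate to the jar location, here is a small sketch that resolves them against a directory. The function `resolve_hunalign` and the `jar_dir` argument are hypothetical; the crawler itself is distributed as a jar and may resolve these paths differently.

```python
import os
import platform

# Relative paths copied from the configuration above.
HUNALIGN_EXECUTABLES = {
    "Windows": "hunalign-1.1/win/hunalign.exe",
    "Linux": "hunalign-1.1/linux/src/hunalign/hunalign",
}
HUNALIGN_DICT = "hunalign-1.1/dict"

def resolve_hunalign(jar_dir, system=None):
    """Return (executable_path, dictionary_path) relative to the jar's directory."""
    system = system or platform.system()
    if system not in HUNALIGN_EXECUTABLES:
        raise ValueError("no hunalign path configured for " + system)
    executable = os.path.join(jar_dir, *HUNALIGN_EXECUTABLES[system].split("/"))
    dictionaries = os.path.join(jar_dir, *HUNALIGN_DICT.split("/"))
    return executable, dictionaries
```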

Max number of URLs to fetch per run
512

1000
Socket timeout in milliseconds (per URL)

1000
Connection timeout in milliseconds (per URL)

2
Max number of attempts to fetch a Web page before giving up
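
The retry limit can be pictured with the sketch below. The `fetch` callable is a hypothetical stand-in for whatever actually downloads a page, and treating `OSError` as the retryable failure is an assumption.

```python
def fetch_with_retries(fetch, url, max_attempts=2):
    """Call fetch(url) up to max_attempts times, re-raising the last failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for _ in range(max_attempts):
        try:
            return fetch(url)
        except OSError as err:  # assumption: network errors surface as OSError
            last_error = err
    raise last_error
```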

0
Min bytes per second for fetching a web page

Accepted MIME types

1000
Delay in milliseconds between requests

531072
Max content size (bytes) for downloading a web page

512
Max fetch set size per run (sets are made up of URLs from the same host)

512
Max URLs from a specific host per run
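
As a rough picture of host-based fetch sets, the sketch below groups URLs by host and caps each group. It is only an illustration under the assumption that excess URLs are simply left for a later run; `build_fetch_sets` is a hypothetical name.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def build_fetch_sets(urls, max_per_host=512):
    """Group URLs by host, keeping at most max_per_host URLs for each host."""
    sets = defaultdict(list)
    for url in urls:
        host = urlsplit(url).hostname
        if host is None:
            continue  # skip URLs without a recognisable host
        if len(sets[host]) < max_per_host:
            sets[host].append(url)
    return dict(sets)
```

For example, with `max_per_host=1`, two URLs on `a.example` and one on `b.example` yield one URL per host.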

100
Max number of fetching threads for each host

10000000
Max web pages to fetch per host

5
Max number of redirects

600000
Max time to wait for Fetcher to get all URLs in a run
