Version 17 of 167, Vassilis Papavassiliou, 2012-10-25 02:38 PM


h1. Getting Started

Once you have built or downloaded an ilsp-fc runnable jar, you can run it like this:

java -jar ilsp-fc-1.1-jar-with-dependencies.jar

Several settings influence the crawling process. They are defined in the configuration file (crawler_config.xml by default) before the crawl starts. Some of them can also be set on the command line when running the ilsp-fc runnable jar, as follows:

-a : the user agent name.
-t : the crawl duration in minutes. The crawler runs in cycles (during which links stored at the top of the crawler’s frontier are extracted and new links are examined), so the defined time will very likely expire in the middle of a cycle. In that case, the crawler stops only after the running cycle has finished. The default value is 10 minutes.
-n : the crawl duration in cycles.
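
The interaction between the time limit and the cycle structure can be illustrated with a minimal sketch. This is not the ilsp-fc implementation: the class name @CrawlLoopSketch@, the method @runCycles@ and its parameters are hypothetical, and a fake clock stands in for real crawl cycles. It only shows why a crawl can run longer than the requested duration when the deadline is checked between cycles.

```java
import java.util.function.LongSupplier;

// Hypothetical sketch; names and structure are assumptions, not ilsp-fc code.
public class CrawlLoopSketch {

    /**
     * Runs cycles until the time budget or the cycle budget is used up.
     * The deadline is checked only between cycles, so a cycle that starts
     * before the deadline always runs to completion.
     *
     * @return the number of completed cycles
     */
    public static int runCycles(long budgetMillis, int maxCycles,
                                LongSupplier clock, Runnable cycle) {
        long start = clock.getAsLong();
        int completed = 0;
        while (completed < maxCycles && clock.getAsLong() - start < budgetMillis) {
            cycle.run();
            completed++;
        }
        return completed;
    }

    public static void main(String[] args) {
        long minute = 60_000L;
        long[] fakeNow = {0L}; // simulated clock, advanced by each cycle

        // 10-minute budget, each simulated cycle takes 4 minutes.
        int cycles = runCycles(10 * minute, Integer.MAX_VALUE,
                () -> fakeNow[0], () -> fakeNow[0] += 4 * minute);

        // prints: 3 cycles, 12 minutes
        System.out.println(cycles + " cycles, " + fakeNow[0] / minute + " minutes");
    }
}
```

In this simulation the third cycle starts at minute 8, before the 10-minute limit, and therefore finishes at minute 12.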

h2. Run a monolingual crawl

java -jar ilsp-fc-1.1-jar-with-dependencies.jar crawlandexport -a vpapa@ilsp.gr -cfg FMC_config.xml -t 10 -type m -c 10 -lang de -of output_test1_list.txt -ofh output_test1_list.txt.html -tc Automotive-seed-terms-de.txt -u Automotive-seed-urls.txt -xslt -f -k

h2. Run a bilingual crawl

java -jar ilsp-fc-1.1-jar-with-dependencies.jar crawlandexport -a test1 -c 10 -f -k -l1 de -l2 it -t 10 -of test_HS_DE-IT_output.txt -ofh test_HS_DE-IT_output.txt.html -tc HS_DE-IT_topic.txt -type p -xslt -u seed_suva.txt -cfg FBC_config.xml
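
If the crawl needs to be started from another Java program rather than from a shell, the same command line can be assembled as a list of arguments. The sketch below is only an illustration: the class @CrawlLauncherSketch@ and its methods are hypothetical, the flags and file names are copied verbatim from the bilingual example above, and only the two languages and the duration are made configurable.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; not part of ilsp-fc.
public class CrawlLauncherSketch {

    /** Builds the argument list of the bilingual example for the given languages and duration. */
    public static List<String> bilingualCommand(String jar, String lang1, String lang2, int minutes) {
        return new ArrayList<>(Arrays.asList(
                "java", "-jar", jar, "crawlandexport",
                "-a", "test1", "-c", "10", "-f", "-k",
                "-l1", lang1, "-l2", lang2,
                "-t", String.valueOf(minutes),
                "-of", "test_HS_DE-IT_output.txt",
                "-ofh", "test_HS_DE-IT_output.txt.html",
                "-tc", "HS_DE-IT_topic.txt",
                "-type", "p", "-xslt",
                "-u", "seed_suva.txt",
                "-cfg", "FBC_config.xml"));
    }

    /** Starts the command, forwards its output to this process, and returns its exit code. */
    public static int launch(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) {
        List<String> cmd = bilingualCommand("ilsp-fc-1.1-jar-with-dependencies.jar", "de", "it", 10);
        // Only print the command here; call launch(cmd) to actually run the crawl.
        System.out.println(String.join(" ", cmd));
    }
}
```

Passing the arguments as a list avoids shell quoting issues; @launch@ requires the jar and the referenced input files to be present in the working directory.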

h2. Example of Java code

package gr.ilsp.fmc.classifier;

public enum ClassifierCounters {
    CLASSIFIER_DOCUMENTS_PASSED,  // successfully classified a document
    CLASSIFIER_DOCUMENTS_FAILED,  // failed to classify a document
    CLASSIFIER_DOCUMENTS_ABORTED,
    CLASSIFIER_TIME
}
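
As an illustration of how such an enum could be used, the following sketch tallies values per counter in an @EnumMap@. It is an assumption made for the sake of the example, not a description of how ilsp-fc consumes these counters: the class @CounterTallySketch@ is hypothetical, and the enum is repeated as a nested type only so that the sketch compiles on its own.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical usage sketch; not part of ilsp-fc.
public class CounterTallySketch {

    // Local copy of the enum shown above, nested here to keep the sketch self-contained.
    public enum ClassifierCounters {
        CLASSIFIER_DOCUMENTS_PASSED,
        CLASSIFIER_DOCUMENTS_FAILED,
        CLASSIFIER_DOCUMENTS_ABORTED,
        CLASSIFIER_TIME
    }

    private final Map<ClassifierCounters, Long> counts =
            new EnumMap<>(ClassifierCounters.class);

    /** Adds the given amount to a counter, creating it on first use. */
    public void increment(ClassifierCounters counter, long amount) {
        counts.merge(counter, amount, Long::sum);
    }

    /** Returns the current value of a counter, or 0 if it was never incremented. */
    public long get(ClassifierCounters counter) {
        return counts.getOrDefault(counter, 0L);
    }

    public static void main(String[] args) {
        CounterTallySketch tally = new CounterTallySketch();
        tally.increment(ClassifierCounters.CLASSIFIER_DOCUMENTS_PASSED, 1);
        tally.increment(ClassifierCounters.CLASSIFIER_DOCUMENTS_PASSED, 1);
        tally.increment(ClassifierCounters.CLASSIFIER_DOCUMENTS_FAILED, 1);
        System.out.println("passed=" + tally.get(ClassifierCounters.CLASSIFIER_DOCUMENTS_PASSED)
                + " failed=" + tally.get(ClassifierCounters.CLASSIFIER_DOCUMENTS_FAILED));
    }
}
```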

<?xml version="1.0" encoding="UTF-8"?>


yourmail@mail.com
www.youraddress.com