Version 38 - History - Getting Started - ILSP Focused Crawler - ILSP NLP


Version 37 (Vassilis Papavassiliou, 2012-12-03 05:27 PM) → Version 38/167 (Prokopis Prokopidis, 2012-12-03 05:34 PM)

h1. Getting Started

Once you [[DeveloperSetup|build]] or [[HowToGet|download]] an ilsp-fc runnable jar, you can run it like this:

<pre><code>java -jar ilsp-fc-1.1-jar-with-dependencies.jar</code></pre>

Several settings influence the crawling process and can be defined in the configuration file before the crawl starts. The default configuration file is [[crawler_config.xml]], which is included in the ilsp-fc runnable jar. Two typical examples are [[FMC_config.xml]] for monolingual crawls and [[FBC_config.xml]] for bilingual crawls.

Some of the settings can also be overridden in the command that runs the ilsp-fc runnable jar, as follows:

<pre><code>-a : the user agent name (required).
-type : the type of crawl: monolingual (m) or parallel (p).
-cfg : the configuration file that will be used instead of the default (see crawler_config.xml above).
It is recommended to use [[FMC_config.xml]] for monolingual crawls and [[FBC_config.xml]] for bilingual crawls.
-c : the crawl duration in minutes. Since the crawler runs in cycles (during which links stored at the top of
the crawler’s frontier are extracted and new links are examined) it is very likely that the defined time
will expire during a cycle run. Then, the crawler will stop only after the end of the running cycle.
The default value is 10 minutes.
-n : the crawl duration in cycles. The default is 1. It is recommended to use this parameter for testing purposes.
-t : the number of threads that will be used to fetch web pages in parallel.
-f : forces the crawler to start a new job (required)
-lang : the targeted language in case of monolingual crawling (required).
-l1 : the first targeted language in case of bilingual crawling (required).
-l2 : the second targeted language in case of bilingual crawling (required).

-u : the text file that contains the seed URLs that will initialize the crawler. In case of bilingual crawling
the list should contain only 1 or 2 URLs from the same web domain.
-tc : the file that contains the domain (topic) definition.
-d : forces the crawler to stay in a web site (i.e. starts from a web site and extracts only links to pages
inside the same web site). It should be used only for monolingual crawling.
</code></pre>
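
The interaction between the -c time limit and the crawler's cycles can be illustrated with a minimal sketch. This is not the ilsp-fc implementation: the class CycleLoopSketch, the method cyclesRun and the fixed per-cycle duration are hypothetical, chosen only to show that when the limit is checked between cycles, a crawl can overrun the requested duration by up to one cycle.

```java
final class CycleLoopSketch {

    /**
     * Returns how many cycles run when the time limit is checked only
     * between cycles. Both arguments are hypothetical illustration values.
     */
    static int cyclesRun(long limitMinutes, long minutesPerCycle) {
        if (minutesPerCycle <= 0) {
            throw new IllegalArgumentException("minutesPerCycle must be positive");
        }
        long elapsed = 0;
        int cycles = 0;
        while (elapsed < limitMinutes) {
            // A cycle that has started always runs to completion,
            // even if the limit expires while it is running.
            elapsed += minutesPerCycle;
            cycles++;
        }
        return cycles;
    }

    public static void main(String[] args) {
        int cycles = cyclesRun(10, 4);
        System.out.println(cycles + " cycles, " + (cycles * 4) + " minutes elapsed");
    }
}
```

With a 10-minute limit and cycles that each take 4 minutes, the sketch runs 3 cycles and stops after 12 minutes, because the limit expires during the third cycle.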

h2. Run a monolingual crawl

<pre><code>java -jar ilsp-fc-1.1-jar-with-dependencies.jar crawlandexport -a vpapa@ilsp.gr \
-cfg FMC_config.xml -t 10 -type m -c 10 -lang de -of output_test1_list.txt \
-ofh output_test1_list.txt.html -tc Automotive-seed-terms-de.txt \
-u Automotive-seed-urls.txt -xslt -f -k</code></pre>
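
If the crawl is launched from another program, the argument list can be assembled and checked before the jar is started. The sketch below is an assumption-laden illustration, not part of ilsp-fc: the class MonolingualCommandSketch and its method buildArgs are hypothetical, and only options that appear in the listing above are used. It rejects empty values for the options marked as required and always adds -f.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class MonolingualCommandSketch {

    /** Builds the argument list for a monolingual crawl (hypothetical helper). */
    static List<String> buildArgs(String userAgent, String lang, String seedUrlFile,
                                  int minutes, int threads) {
        requireNonEmpty(userAgent, "-a (user agent name) is required");
        requireNonEmpty(lang, "-lang (targeted language) is required");
        requireNonEmpty(seedUrlFile, "a seed URL file is expected for -u");

        List<String> cmd = new ArrayList<>(Arrays.asList(
                "java", "-jar", "ilsp-fc-1.1-jar-with-dependencies.jar", "crawlandexport"));
        cmd.addAll(Arrays.asList("-a", userAgent));
        cmd.addAll(Arrays.asList("-type", "m"));            // m = monolingual
        cmd.addAll(Arrays.asList("-lang", lang));
        cmd.addAll(Arrays.asList("-u", seedUrlFile));
        cmd.addAll(Arrays.asList("-c", String.valueOf(minutes)));
        cmd.addAll(Arrays.asList("-t", String.valueOf(threads)));
        cmd.add("-f");                                       // listed as required: start a new job
        return cmd;
    }

    private static void requireNonEmpty(String value, String message) {
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException(message);
        }
    }

    public static void main(String[] args) {
        // Prints the command line; it does not start the crawler.
        System.out.println(String.join(" ",
                buildArgs("test1", "de", "Automotive-seed-urls.txt", 10, 10)));
    }
}
```

The resulting list could be passed to java.lang.ProcessBuilder; the sketch only prints it so that it can be inspected first.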

h2. Run a bilingual crawl

<pre><code>java -jar ilsp-fc-1.1-jar-with-dependencies.jar crawlandexport -a test1 -c 10 -f -k -l1 de -l2 it \
-t 10 -of test_HS_DE-IT_output.txt -ofh test_HS_DE-IT_output.txt.html -tc HS_DE-IT_topic.txt \
-type p -xslt -u seed_suva.txt -cfg FBC_config.xml</code></pre>
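
For bilingual crawls the seed list is expected to contain only 1 or 2 URLs from the same web domain. A seed file can be checked against this constraint before the crawl is started, as in the following sketch. The class BilingualSeedCheckSketch and the method isValidSeedList are hypothetical, and "same web domain" is approximated here by comparing host names, which is an assumption of this example rather than the crawler's own rule. The URLs are placeholders.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.List;

final class BilingualSeedCheckSketch {

    /** Returns true if the lines hold 1 or 2 URLs that share one host name. */
    static boolean isValidSeedList(List<String> lines) {
        String firstHost = null;
        int count = 0;
        for (String line : lines) {
            String url = line.trim();
            if (url.isEmpty()) {
                continue; // ignore blank lines
            }
            String host;
            try {
                host = new URI(url).getHost();
            } catch (URISyntaxException e) {
                return false; // not a parseable URL
            }
            if (host == null) {
                return false; // e.g. the scheme (http://) is missing
            }
            count++;
            if (firstHost == null) {
                firstHost = host;
            } else if (!firstHost.equalsIgnoreCase(host)) {
                return false; // URLs point to different hosts
            }
        }
        return count == 1 || count == 2;
    }

    public static void main(String[] args) {
        List<String> seeds = Arrays.asList(
                "http://www.example.org/de/",
                "http://www.example.org/it/");
        System.out.println(isValidSeedList(seeds));
    }
}
```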

h2. Example of Java code

<pre>
<code class="java">
package gr.ilsp.fmc.classifier;

public enum ClassifierCounters {
    CLASSIFIER_DOCUMENTS_PASSED, // successfully classified a document
    CLASSIFIER_DOCUMENTS_FAILED, // failed to classify a document
    CLASSIFIER_DOCUMENTS_ABORTED,
    CLASSIFIER_TIME
}</code></pre>
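
An enum like this can serve as the key of a per-counter tally. The sketch below shows one possible way to do that with an EnumMap. How ilsp-fc itself consumes these counters is not described on this page, so the class CounterTallySketch and its methods are hypothetical, and the enum is repeated (without its package declaration) only to keep the example self-contained.

```java
import java.util.EnumMap;
import java.util.Map;

// Copy of the enum above, repeated so that the sketch compiles on its own.
enum ClassifierCounters {
    CLASSIFIER_DOCUMENTS_PASSED,
    CLASSIFIER_DOCUMENTS_FAILED,
    CLASSIFIER_DOCUMENTS_ABORTED,
    CLASSIFIER_TIME
}

final class CounterTallySketch {

    private final Map<ClassifierCounters, Long> totals =
            new EnumMap<>(ClassifierCounters.class);

    /** Adds the given amount to one counter. */
    void increment(ClassifierCounters counter, long amount) {
        totals.merge(counter, amount, Long::sum);
    }

    /** Returns the current total, or 0 if the counter was never incremented. */
    long get(ClassifierCounters counter) {
        return totals.getOrDefault(counter, 0L);
    }

    public static void main(String[] args) {
        CounterTallySketch tally = new CounterTallySketch();
        tally.increment(ClassifierCounters.CLASSIFIER_DOCUMENTS_PASSED, 1);
        tally.increment(ClassifierCounters.CLASSIFIER_DOCUMENTS_PASSED, 1);
        tally.increment(ClassifierCounters.CLASSIFIER_TIME, 35); // unit is an assumption
        System.out.println(tally.get(ClassifierCounters.CLASSIFIER_DOCUMENTS_PASSED));
    }
}
```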

<pre>
<code class="xml">
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <agent>
        <email>yourmail@mail.com</email>
        <web_address>www.youraddress.com</web_address>
    </agent>
</configuration></code></pre>
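
A configuration fragment like the one above can be read with the XML parser that ships with the JDK. The following is a minimal sketch, assuming only the element names shown in the fragment; the class AgentConfigSketch and the method firstElementText are hypothetical, and this is not how ilsp-fc itself loads its configuration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

final class AgentConfigSketch {

    /** Returns the trimmed text of the first element with the given name, or null if absent. */
    static String firstElementText(String xml, String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName(tagName);
            return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
        } catch (Exception e) {
            // Covers parser configuration, I/O and malformed-XML errors.
            throw new IllegalArgumentException("configuration is not parseable XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<configuration><agent>"
                + "<email>yourmail@mail.com</email>"
                + "<web_address>www.youraddress.com</web_address>"
                + "</agent></configuration>";
        System.out.println(firstElementText(xml, "email"));
        System.out.println(firstElementText(xml, "web_address"));
    }
}
```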
