Version 24 (Vassilis Papavassiliou, 2012-10-26 03:56 PM) → Version 25/167 (Vassilis Papavassiliou, 2012-10-26 03:58 PM)
h1. Getting Started
Once you [[DeveloperSetup|build]] or [[HowToGet|download]] an ilsp-fc runnable jar, you can run it as follows:
<pre><code>java -jar ilsp-fc-1.1-jar-with-dependencies.jar</code></pre>
Several settings influence the crawling process. They can be defined in the configuration file (the default file is [[crawler_config.xml]]) before the crawl starts. Some of them can also be set on the command line when running the ilsp-fc runnable jar:
<pre><code>-a :user agent name
-c :the crawl duration in minutes. Since the crawler runs in cycles
(during which links stored at the top of the crawler’s frontier
are extracted and new links are examined) it is very likely that
the defined time will expire during a cycle run. Then, the crawler
will stop only after the end of the running cycle.
The default value is 10 minutes.
-n :the crawl duration in cycles.
-t :the number of threads that will be used to fetch web pages in parallel.
-type : the type of crawling: monolingual (m) or parallel (p).
-lang : the targeted language in case of monolingual crawling.
-l1 : the first targeted language in case of bilingual crawling.
-l2 : the second targeted language in case of bilingual crawling.
</code></pre>
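The stopping rule described for @-c@ can be illustrated with a minimal sketch. It is not taken from the ilsp-fc source: the class @CrawlBudgetSketch@, the method @cyclesRun@ and the cycle durations are hypothetical, and the only assumption is the one stated above, namely that the time budget is checked between cycles, so a cycle that has started always runs to its end.

```java
// Illustrative sketch only; not part of ilsp-fc. All names are hypothetical.
final class CrawlBudgetSketch {

    /**
     * Counts the cycles that run when the time budget is checked only
     * before a cycle starts, so a running cycle always completes.
     */
    static int cyclesRun(int[] cycleMinutes, int budgetMinutes) {
        int elapsed = 0;
        int cycles = 0;
        for (int minutes : cycleMinutes) {
            if (elapsed >= budgetMinutes) {
                break; // budget already spent: no new cycle is started
            }
            elapsed += minutes; // the cycle finishes even if it overshoots the budget
            cycles++;
        }
        return cycles;
    }

    public static void main(String[] args) {
        int[] cycleMinutes = {4, 4, 4, 4}; // assumed duration of each cycle
        // With -c 10 the third cycle starts at minute 8 and ends at minute 12.
        System.out.println(cyclesRun(cycleMinutes, 10)); // prints 3
    }
}
```

In this example the crawl lasts 12 minutes although 10 were requested, which is the overshoot the description of @-c@ warns about.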
h2. Run a monolingual crawl
<pre><code>java -jar ilsp-fc-1.1-jar-with-dependencies.jar crawlandexport -a vpapa@ilsp.gr \
-cfg FMC_config.xml -t 10 -type m -c 10 -lang de -of output_test1_list.txt \
-ofh output_test1_list.txt.html -tc Automotive-seed-terms-de.txt \
-u Automotive-seed-urls.txt -xslt -f -k</code></pre>
h2. Run a bilingual crawl
<pre><code>java -jar ilsp-fc-1.1-jar-with-dependencies.jar crawlandexport -a test1 -c 10 -f -k -l1 de -l2 it -t 10 -of test_HS_DE-IT_output.txt -ofh test_HS_DE-IT_output.txt.html -tc HS_DE-IT_topic.txt -type p -xslt -u seed_suva.txt -cfg FBC_config.xml</code></pre>
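The two commands above differ mainly in their language options: @-type m@ goes with @-lang@, while @-type p@ goes with @-l1@ and @-l2@. The following sketch shows one way a wrapper script might assemble these arguments consistently. The class @CrawlArgsSketch@ and the method @languageArgs@ are hypothetical helpers, not part of ilsp-fc; only the option names come from the list above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; not part of ilsp-fc. All names are hypothetical.
final class CrawlArgsSketch {

    /** Builds the -type option together with its matching language options. */
    static List<String> languageArgs(String type, String... langs) {
        List<String> args = new ArrayList<>(Arrays.asList("-type", type));
        if ("m".equals(type) && langs.length == 1) {
            // monolingual crawl: one targeted language
            args.addAll(Arrays.asList("-lang", langs[0]));
        } else if ("p".equals(type) && langs.length == 2) {
            // parallel (bilingual) crawl: two targeted languages
            args.addAll(Arrays.asList("-l1", langs[0], "-l2", langs[1]));
        } else {
            throw new IllegalArgumentException(
                "type m needs one language, type p needs two");
        }
        return args;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", languageArgs("m", "de")));
        System.out.println(String.join(" ", languageArgs("p", "de", "it")));
    }
}
```

The resulting list can be appended to the remaining options (for example @-a@, @-c@ and @-t@) before the jar is launched.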
h2. Example of Java code
<pre>
<code class="java">
package gr.ilsp.fmc.classifier;
public enum ClassifierCounters {
CLASSIFIER_DOCUMENTS_PASSED, // successfully classified a document.
CLASSIFIER_DOCUMENTS_FAILED, // failed to classify a document
CLASSIFIER_DOCUMENTS_ABORTED,
CLASSIFIER_TIME
}</code></pre>
<pre>
<code class="xml">
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<agent>
<email>yourmail@mail.com</email>
<web_address>www.youraddress.com</web_address>
</agent>
</configuration></code></pre>
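To check what a configuration fragment like the one above contains, the agent fields can be read with the standard Java XML API. This is a minimal sketch and says nothing about how ilsp-fc itself loads its configuration. The class @AgentConfigSketch@ and the method @agentField@ are hypothetical; the element names are the ones shown in the fragment.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustrative sketch only; not part of ilsp-fc. All names are hypothetical.
final class AgentConfigSketch {

    /** Returns the text of the first element named tag, or null if it is absent. */
    static String agentField(String xml, String tag) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName(tag);
            return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
        } catch (Exception e) {
            // malformed XML or parser setup failure
            throw new IllegalArgumentException("cannot parse configuration", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<configuration><agent>"
                + "<email>yourmail@mail.com</email>"
                + "<web_address>www.youraddress.com</web_address>"
                + "</agent></configuration>";
        System.out.println(agentField(xml, "email"));       // prints yourmail@mail.com
        System.out.println(agentField(xml, "web_address")); // prints www.youraddress.com
    }
}
```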