Version 51 - History - Getting Started - Getting Started - ILSP Focused Crawler - ILSP NLP

Vassilis Papavassiliou, 2012-12-05 10:28 AM

h1. Getting Started

Once you build or download an ilsp-fc runnable jar, you can run it like this:

java -jar ilsp-fc-1.1-jar-with-dependencies.jar

The required input from the user consists of:
* a list of term triplets that describe a domain and, optionally, subcategories of this domain. An example domain definition for the Environment domain in English can be found at ENV_EN_topic.txt.
* a list of seed URLs pointing to relevant web pages. An example seed URL list for Environment in English can be found at ENV_EN_seeds.txt.

For bilingual crawling, the required input from the user consists of:
* a list of term triplets that describe a domain and, optionally, subcategories of this domain in both targeted languages (i.e. the union of the domain definitions in each language). An example domain definition for the English-Spanish pair can be found at ENV_EN_ES_topic.txt.
* a seed URL list, which should contain only one URL (e.g. ENV_EN_ES_seed.txt). The crawler will visit only links pointing to pages inside versions of the top domain of the URL (e.g. http://www.fifa.com/, http://es.fifa.com/, etc.).
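
The same-top-domain constraint on bilingual seeds can be checked before a crawl starts. The following is a minimal Python sketch, not part of ilsp-fc: the helper names @top_domain@ and @same_top_domain@ are hypothetical, and the "last two host labels" rule is a simplifying assumption that gives wrong answers for multi-part suffixes such as .co.uk.

```python
from __future__ import annotations

from urllib.parse import urlparse


def top_domain(url: str) -> str:
    """Return the last two labels of the host, e.g. 'fifa.com'.

    Assumption: a naive rule that ignores multi-part suffixes like .co.uk.
    """
    host = urlparse(url).hostname or ""
    labels = host.lower().split(".")
    return ".".join(labels[-2:])


def same_top_domain(urls: list[str]) -> bool:
    """True when the list is non-empty and every URL shares one top domain."""
    domains = {top_domain(u) for u in urls}
    return len(domains) == 1 and "" not in domains


# The two example hosts mentioned in the text above.
seeds = ["http://www.fifa.com/", "http://es.fifa.com/"]
print(same_top_domain(seeds))
```

A check like this only mirrors the description above; how the crawler itself decides which hosts are versions of the same top domain is not specified here.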

Several settings influence the crawling process and can be defined in a configuration file before a crawl starts. The default configuration file is crawler_config.xml, which is included in the ilsp-fc runnable jar. Two typical customized examples are FMC_config.xml for monolingual crawls and FBC_config.xml for bilingual crawls.

Some of the settings can also be overridden using options of the ilsp-fc runnable jar, as follows:

-a : the user agent name (required).
-type : the type of crawl: monolingual (m) or parallel (p).
-cfg : the configuration file that will be used instead of the default (see crawler_config.xml above).
-c : the crawl duration in minutes. The crawler runs in cycles (during which links stored at the top of the crawler’s frontier are extracted and new links are examined), so the defined time will very likely expire during a cycle. In that case, the crawler stops only after the running cycle ends. The default value is 10 minutes.
-n : the crawl duration in cycles. The default is 1. This parameter is recommended for testing purposes.
-t : the number of threads that will be used to fetch web pages in parallel.
-f : forces the crawler to start a new job (required).
-lang : the targeted language in case of monolingual crawling (required).
-l1 : the first targeted language in case of bilingual crawling (required).
-l2 : the second targeted language in case of bilingual crawling (required).
-u : the text file that contains the seed URLs that will initialize the crawler. In case of bilingual crawling, the list should contain only 1 or 2 URLs from the same web domain.
-tc : the text file that contains the domain definition.
-k : forces the crawler to annotate boilerplate content in parsed text.
-filter : a regular expression; URLs that do NOT match it are filtered out. Using this filter forces the crawler to focus either on a specific web domain (e.g. ".ec.europa.eu.") or on a part of a web domain (e.g. "./legislation_summaries/environment."). Note that if this filter is used, only the seed URLs that match this regex will be fetched.
-d : forces the crawler to stay in a web site (i.e. it starts from a web site and extracts only links to pages inside the same web site). It should be used only for monolingual crawling.
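
The effect of the -filter option can be previewed on a list of candidate URLs. This is a minimal Python sketch under stated assumptions: the crawler's own regex dialect and matching mode are not documented here, so the sketch assumes the pattern must match the whole URL (@re.fullmatch@). The two patterns with explicit @.*@ wildcards and the sample URLs are illustrative, not the exact ones quoted above.

```python
import re

# Hypothetical patterns: one for a whole web domain, one for a part of it.
DOMAIN_FILTER = re.compile(r".*ec\.europa\.eu.*")
SECTION_FILTER = re.compile(r".*/legislation_summaries/environment.*")


def keep_urls(urls, pattern):
    """Keep only URLs that the pattern matches over their full length.

    Assumption: full-match semantics; the crawler may behave differently.
    """
    return [u for u in urls if pattern.fullmatch(u)]


# Illustrative URLs only.
candidates = [
    "http://ec.europa.eu/legislation_summaries/environment/index_en.htm",
    "http://ec.europa.eu/news/index_en.htm",
    "http://www.example.org/environment.html",
]

print(keep_urls(candidates, DOMAIN_FILTER))
print(keep_urls(candidates, SECTION_FILTER))
```

Since seed URLs that fail the filter are not fetched, running the seed list through a check like this before a crawl can catch a filter that would leave the crawler with nothing to start from.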

h2. Run a monolingual crawl

java -jar ilsp-fc-1.1-jar-with-dependencies.jar crawlandexport -a vpapa@ilsp.gr \
-cfg FMC_config.xml -t 10 -type m -c 10 -lang de -of output_test1_list.txt \
-ofh output_test1_list.txt.html -tc Automotive-seed-terms-de.txt \
-u Automotive-seed-urls.txt -xslt -f -k
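
The -c 10 value in the command above is a time budget, and, as noted in the option description, a cycle that is running when the budget expires is allowed to finish. A toy Python sketch of that timing follows; @simulate_crawl@ is a hypothetical helper and the cycle durations are made-up numbers, not measurements from the crawler.

```python
def simulate_crawl(budget_minutes, cycle_minutes):
    """Return (cycles_run, elapsed_minutes) for a time-limited crawl.

    A new cycle starts only while the budget has not expired; a cycle
    that is already running is always allowed to finish.
    """
    elapsed = 0.0
    cycles = 0
    for duration in cycle_minutes:
        if elapsed >= budget_minutes:
            break
        elapsed += duration
        cycles += 1
    return cycles, elapsed


# Made-up cycle durations in minutes, for illustration only.
print(simulate_crawl(10, [3, 4, 5, 6]))
```

With these numbers the third cycle starts at minute 7 and ends at minute 12, so the crawl runs 2 minutes past the 10-minute budget.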

h2. Run a bilingual crawl

java -jar ilsp-fc-1.1-jar-with-dependencies.jar crawlandexport -a test1 -c 10 -f -k -l1 de -l2 it \
-t 10 -of test_HS_DE-IT_output.txt -ofh test_HS_DE-IT_output.txt.html -tc HS_DE-IT_topic.txt \
-type p -xslt -u seed_suva.txt -cfg FBC_config.xml
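
When crawls are launched from scripts, the two command lines above can be assembled programmatically. The following is a minimal Python sketch: @build_command@ is a hypothetical helper, and it uses only options described in the option list on this page. The output-related options that appear in the examples (-of, -ofh, -xslt) are not described in that list, so the sketch leaves them out; append them by hand if needed.

```python
JAR = "ilsp-fc-1.1-jar-with-dependencies.jar"


def build_command(agent, crawl_type, topic_file, seed_file, langs,
                  minutes=10, threads=10, config=None):
    """Assemble an argument list using only options shown on this page."""
    if crawl_type == "m":
        if len(langs) != 1:
            raise ValueError("monolingual crawl needs exactly one language")
        lang_args = ["-lang", langs[0]]
    elif crawl_type == "p":
        if len(langs) != 2:
            raise ValueError("bilingual crawl needs exactly two languages")
        lang_args = ["-l1", langs[0], "-l2", langs[1]]
    else:
        raise ValueError("crawl_type must be 'm' or 'p'")

    cmd = ["java", "-jar", JAR, "crawlandexport",
           "-a", agent, "-type", crawl_type, "-f",
           "-c", str(minutes), "-t", str(threads),
           "-tc", topic_file, "-u", seed_file]
    cmd += lang_args
    if config is not None:
        cmd += ["-cfg", config]
    return cmd


cmd = build_command("test1", "p", "HS_DE-IT_topic.txt", "seed_suva.txt",
                    ["de", "it"], config="FBC_config.xml")
print(" ".join(cmd))
# To launch the crawl: subprocess.run(cmd, check=True)
```

Building a list rather than a single string keeps file names with spaces intact when the list is later passed to @subprocess.run@.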

h2. Output

For a monolingual crawl, the output of ilsp-fc is a list of links pointing to XML files that follow the cesDoc Corpus Encoding Standard (http://www.xces.org/). See http://nlp.ilsp.gr/nlp/examples/2547.xml for an example in French for the Environment domain.
For a bilingual crawl, the output of ilsp-fc is a list of links to XML files that follow the cesAlign Corpus Encoding Standard for linking (parts of) cesDoc documents. The example at http://nlp.ilsp.gr/panacea/xces-xslt/202_225.xml links a pair of documents in English and Greek.
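
A quick sanity check of a finished crawl is to walk the output list and confirm that each entry is a well-formed XML file with the expected root element. The sketch below is a minimal Python example under several assumptions: the list is assumed to hold one entry per line, only entries that are existing local files are inspected, and the expected root names (cesDoc for monolingual output, cesAlign for bilingual output) are inferred from the standard names above rather than confirmed. The helper names are hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def root_tag(xml_path):
    """Return the root element name of an XML file, without any namespace."""
    root = ET.parse(xml_path).getroot()
    return root.tag.rsplit("}", 1)[-1]


def summarize_output(list_file):
    """Map each local path in the list file to its root element name.

    Assumption: one entry per line; blank lines and entries that are not
    existing local files (e.g. http links) are skipped.
    """
    summary = {}
    for line in Path(list_file).read_text(encoding="utf-8").splitlines():
        entry = line.strip()
        if entry and Path(entry).is_file():
            summary[entry] = root_tag(entry)
    return summary


if __name__ == "__main__":
    # File name taken from the monolingual example command on this page.
    listing = Path("output_test1_list.txt")
    if listing.is_file():
        for path, tag in summarize_output(listing).items():
            print(tag, path)
```

A file that is not well-formed XML makes @ET.parse@ raise @ParseError@, which is usually the desired signal in a check like this.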
