Version 112 - History - Getting Started - Getting Started - ILSP Focused Crawler - ILSP NLP

Vassilis Papavassiliou, 2014-08-15 03:23 PM

h1. Getting Started

Once you have built or downloaded an ilsp-fc runnable jar, you can run it as follows:

java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar

h2. Input

For general monolingual crawls, the only required input from the user is:
* a list of seed URLs (i.e. a text file with one URL per line).
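
Since the seed list is plain text, it can be prepared from the shell. The following is a minimal sketch: the file name @my_seeds.txt@ and the URLs are placeholders, and the sanity check is our own convention rather than something ilsp-fc requires.

```shell
#!/bin/sh
# Write a seed list with one URL per line (placeholder URLs).
seed_file=my_seeds.txt
cat > "$seed_file" <<'EOF'
http://www.example.org/
http://www.example.org/environment/index.html
EOF

# Count lines that do not look like an absolute http(s) URL.
# grep -c prints 0 and exits non-zero when nothing is counted, hence "|| true".
bad_lines=$(grep -Evc '^https?://[^[:space:]]+$' "$seed_file" || true)
echo "suspicious lines: ${bad_lines}"
```

A count of 0 only means that every line has the shape of a URL; it does not check that the pages exist.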

In case of focused monolingual crawls (i.e. when the crawler visits/processes/stores web pages that are related to a targeted domain), the input should include:

* a list of seed URLs pointing to relevant web pages. An example seed URL list for Environment in English can be found at ENV_EN_seeds.txt.
* a list of term triplets that describe a domain and, optionally, subcategories of this domain. This list is required only when the user aims to acquire domain-specific documents. An example domain definition for the Environment domain in English can be found at ENV_EN_topic.txt. Details on how to construct or bootstrap such lists, and on how they are used in text-to-topic classification, can be found in this paper: http://www.aclweb.org/anthology/W13-2506.pdf

In case of general bilingual crawling, the input from the user includes:
* a seed URL list which should contain URL(s) from only one web site (e.g. ENV_EN_ES_seed.txt). The crawler will follow only links pointing to pages inside this web site. However, the user can use the filter parameter (see below) to restrict the crawl to links pointing to pages either inside versions of the top domain of the URL (e.g. http://www.fifa.com/, http://es.fifa.com/, etc.) or in different web sites (i.e. when the translations are hosted on two web sites, e.g. http://www.nrcan.gc.ca and http://www.rncan.gc.ca). Examples of seed URLs can be found at seed_examples.txt.

In case of focused bilingual crawls, the input should also include:
* a list of term triplets that describe a domain and, optionally, subcategories of this domain in both targeted languages (i.e. the union of the domain definitions in each language). This list is required only when the user aims to acquire domain-specific documents. An example domain definition of Environment for the English-Spanish pair can be found at ENV_EN_ES_topic.txt.

h2. Language support

For both monolingual and bilingual crawling, the set of currently supported languages comprises de, el, en, es, fr, hr, it, ja, and pt.
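
Before starting a crawl, a wrapper script can check a language code against this list. This is a minimal sketch: @is_supported_lang@ is a hypothetical helper, not part of ilsp-fc, and the list is copied from the sentence above, so it applies to this version only.

```shell
#!/bin/sh
# Language codes listed in this section for this version of the crawler.
SUPPORTED_LANGS="de el en es fr hr it ja pt"

# Return 0 if the given code is in the list, 1 otherwise.
is_supported_lang() {
  case " $SUPPORTED_LANGS " in
    *" $1 "*) return 0 ;;
    *) return 1 ;;
  esac
}

for code in en nl; do
  if is_supported_lang "$code"; then
    echo "$code: supported"
  else
    echo "$code: not supported in this version"
  fi
done
```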

In order to add another language, a developer/user should:
* verify that the targeted language is supported by the default language identifier (https://code.google.com/p/language-detection/) integrated in the ILSP-FC,
* add a line with proper content to the langKeys.txt file, which is included in the ilsp-fc runnable jar, and
* add a proper analyser in the gr.ilsp.fmc.utils.AnalyserFactory class of the ilsp-fc source.

h2. Other settings

There are several settings that influence the crawling process and can be defined in a configuration file before the crawling process. The default configuration files for monolingual and bilingual crawls are FMC_config.xml and FBC_config.xml respectively. They are included in the ilsp-fc runnable jar.

Some of the settings can also be overridden using options of the ilsp-fc runnable jar, as follows:

crawlandexport : Forces the crawler to crawl and export the results.
-a : user agent name (required)
-type : the type of crawling: monolingual (m) or parallel (p).
-cfg : the configuration file that will be used instead of the default (see the default configuration files above).
-c : the crawl duration in minutes. Since the crawler runs in cycles (during which links stored at the top of
the crawler’s frontier are extracted and new links are examined) it is very likely that the defined time
will expire during a cycle run. Then, the crawler will stop only after the end of the running cycle.
The default value is 10 minutes.
-n : the crawl duration in cycles. The default is 1. It is proposed to use this parameter for testing purposes.
-t : the number of threads that will be used to fetch web pages in parallel.
-f : Forces the crawler to start a new job (required).
-lang : the targeted language in case of monolingual crawling (required).
-l1 : the first targeted language in case of bilingual crawling (required).
-l2 : the second targeted language in case of bilingual crawling (required).
-u : the text file that contains the seed URLs that will initialize the crawler. In case of bilingual crawling
the list should contain only 1 or 2 URLs from the same web domain.

-tc : domain definition (a text file that contains a list of term triplets that describe the targeted
domain). If omitted, the crawl will be a "general" one (i.e. module for text-to-domain
classification will not be used).
-k : Forces the crawler to annotate boilerplate content in parsed text.
-filter : A regular expression to filter out URLs which do NOT match this regex.
The use of this filter forces the crawler to focus either on a specific
web domain (e.g. ".*ec.europa.eu.*") or on a part of a web domain
(e.g. ".*/legislation_summaries/environment.*"). Note that if this filter
is used, only the seed URLs that match this regex will be fetched.
-u_r : This parameter should be used for bilingual crawling when there is an already known pattern in URLs
which implies that one page is the candidate translation of the other. It includes the two strings
to be replaced separated by ';'.
-d : Forces the crawler to stay in a web site (i.e. starts from a web site and extracts only links to pages
inside the same web site). It should be used only for monolingual crawling.
-len : Minimum number of tokens per paragraph. If the length (in terms of tokens) of a paragraph is
less than this value (default is 3) the paragraph will be annotated as "out of interest" and
will not be included into the clean text of the web page.
-mtlen : Minimum number of tokens in cleaned document. If the length (in terms of tokens) of the cleaned
text is less than this value (default is 200), the document will not be stored.
-align : Extracts sentences from the detected document pairs and aligns the extracted sentences
by using an aligner (default is hunalign).
-dict : Uses this dictionary for the sentence alignment. If no argument is given, the default dictionary
of the aligner will be used, if it exists.
-xslt : Insert a stylesheet for rendering xml results as html.
-oxslt : Export crawl results with the help of an xslt file for better examination of results.
-dom : Title of the targeted domain (required when domain definition, i.e. tc parameter, is used).
-dest : The directory where the results (i.e. the crawled data) will be stored.
-of : A text file containing a list with the exported XML files (see section Output below).
-ofh : An HTML file containing a list with the generated XML files (see section Output below).
-oft : A text file containing a list with the exported TMX files (see section Output below).
-ofth : An HTML file containing a list with the generated TMX files (see section Output below).
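
Because only seed URLs that match the -filter regex are fetched, it can help to test a pattern against the seed list before starting a crawl. The sketch below does this with @grep@. The pattern and URLs are placeholders, and @grep@'s regex dialect is assumed to be close enough to the one used by the crawler for simple patterns. The pattern starts and ends with @.*@ so that it behaves the same whether a regex engine matches whole URLs or substrings.

```shell
#!/bin/sh
# Hypothetical filter and seed URLs, for illustration only.
filter='.*\.example\.org/environment/.*'

seeds='http://www.example.org/environment/index.html
http://www.example.org/sports/index.html
http://es.example.org/environment/inicio.html'

# -E selects extended regular expressions, -x requires the whole line to match,
# -v inverts the selection.
kept=$(printf '%s\n' "$seeds" | grep -Ex "$filter")
dropped=$(printf '%s\n' "$seeds" | grep -Evx "$filter")

echo "would pass the filter:"
echo "$kept"
echo "would be filtered out:"
echo "$dropped"
```

With these placeholder values, the two @/environment/@ URLs pass and the @/sports/@ URL is filtered out.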

h2. Run a monolingual crawl

java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar crawlandexport -a vpapa@ilsp.gr \
-cfg FMC_config.xml -type m -c 10 -lang en -of output_test1_list.txt \
-ofh output_test1_list.txt.html -tc ENV_EN_topic.txt \
-u ENV_EN_seeds.txt -f -k -dom Environment

java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar crawlandexport -a test2 \
-f -k -type m -c 5 -lang es -of output_test2_list.txt \
-ofh output_test2_list.txt.html -u seed_examples.txt
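
When the same crawl is repeated with different languages or seed lists, the command can be assembled from variables. The following is a minimal sketch that uses only options described above. The variable names and values are placeholders, and the script prints the command instead of running it.

```shell
#!/bin/sh
# Placeholder values; adjust to your own setup.
jar="ilsp-fc-X.Y.Z-jar-with-dependencies.jar"
agent="my-crawler-agent"   # -a: user agent name (required)
lang="en"                  # -lang: targeted language (required)
seeds="my_seeds.txt"       # -u: seed URL list
minutes=5                  # -c: crawl duration in minutes

# Store the command in the positional parameters so each value stays one word.
set -- java -jar "$jar" crawlandexport \
  -a "$agent" -type m -f -lang "$lang" \
  -u "$seeds" -c "$minutes" \
  -of "output_${lang}_list.txt" -ofh "output_${lang}_list.txt.html"

# Dry run: print the command instead of executing it.
printf '%s ' "$@"
echo
# To actually start the crawl, replace the two lines above with: "$@"
```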

h2. Run a bilingual crawl

java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar crawlandexport -a test3 -c 10 -f -k -l1 de -l2 it \
-of test_HS_DE-IT_output.txt -ofh test_HS_DE-IT_output.txt.html -tc HS_DE-IT_topic.txt \
-type p -u seed_examples.txt -cfg FBC_config.xml -dom HS -len 0 -mtlen 100 -xslt -oxslt

java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar crawlandexport -a test4 -c 10 -f -k -l1 es -l2 en \
-type p -u seed_examples.txt -filter ".*uefa.com.*" \
-len 0 -mtlen 80 -xslt -oxslt -dest "/var/crawl_results/" \
-of test_U_ES-EN_output.txt -ofh test_U_ES-EN_output.txt.html \
-oft test_U_ES-EN_output.tmx.txt -ofth test_U_ES-EN_output.tmx.html \
-align -dict
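
For sites where the two language versions differ only by a fixed URL fragment, the -u_r option takes the two strings separated by ';'. The sketch below illustrates the idea of deriving a candidate translation URL from such a pair. @candidate_translation@, the URL and the pair @/en/;/es/@ are hypothetical, and the way ilsp-fc applies the pair internally may differ.

```shell
#!/bin/sh
# Derive a candidate translation URL from a "-u_r"-style pair "from;to".
candidate_translation() {
  url=$1
  from=${2%%;*}   # text before the first ';'
  to=${2#*;}      # text after the first ';'
  # '|' is the sed delimiter because URLs contain '/'.
  # sed treats "from" as a regular expression, so characters such as '.'
  # would need escaping in real use.
  printf '%s\n' "$url" | sed "s|$from|$to|"
}

candidate_translation "http://www.example.org/en/about.html" "/en/;/es/"
# prints http://www.example.org/es/about.html
```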

h2. Output

The output of the ilsp-fc in the case of a monolingual crawl consists of:
* a list of links pointing to XML files following the cesDOC Corpus Encoding Standard (http://www.xces.org/). See this cesDOC_file for an example in English for the Environment domain.
* a list of links pointing to HTML files (by XSL transformation of each XML) for easier browsing of the collection. As an example, see this rendered_cesDOC_file.

The output of the ilsp-fc in the case of a bilingual crawl consists of:
* a list of links to XML files following the cesAlign Corpus Encoding Standard for linking cesDoc documents. This example "cesAlign":http://nlp.ilsp.gr/ file serves as a link between a detected pair of cesDoc documents in "English":http://nlp.ilsp.gr/ and "Spanish":http://nlp.ilsp.gr/.
* a list of links pointing to HTML files (by XSL transformation of each cesAlign XML) for easier browsing of the collection. As an example, see this "rendered cesAlign":http://nlp.ilsp.gr file.
* a list of links to TMX files containing sentence alignments that have been extracted from the detected document pairs. As an example, see this "TMX":http://nlp.ilsp.gr file.
* a list of links pointing to HTML files (by XSL transformation of each TMX) for easier browsing of the collection. As an example, see this "rendered TMX":http://nlp.ilsp.gr file.
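
After a crawl, the list files can be used to check that the exported files are actually on disk. The sketch below assumes that a list file such as the one given with -of holds one file path per line, which should be verified against real output. @count_missing@ is a hypothetical helper, and the demo uses throwaway files instead of real crawl results.

```shell
#!/bin/sh
# Print how many files named in a list file (one path per line, an assumption
# about the list format) do not exist on disk.
count_missing() {
  missing=0
  while IFS= read -r path; do
    [ -z "$path" ] && continue      # skip empty lines
    [ -f "$path" ] || missing=$((missing + 1))
  done < "$1"
  echo "$missing"
}

# Self-contained demo: one listed file exists, the other does not.
demo_dir=$(mktemp -d)
touch "$demo_dir/1.xml"
printf '%s\n' "$demo_dir/1.xml" "$demo_dir/2.xml" > "$demo_dir/list.txt"
echo "missing files: $(count_missing "$demo_dir/list.txt")"
```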
