Version 140 - History - Getting Started - ILSP Focused Crawler - ILSP NLP

Getting Started » History » Version 140

Version 139 (Vassilis Papavassiliou, 2016-02-16 06:55 PM) → Version 140/167 (Vassilis Papavassiliou, 2016-02-16 07:01 PM)

# Getting Started

Once you [[DeveloperSetup|build]] or [[HowToGet|download]] an ilsp-fc runnable jar, you can run it like this

<pre><code>java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar</code></pre>

## Examples of running monolingual crawls

* Given a seed URL list [[ENV_EN_seeds.txt]], the following example crawls the web for 5 minutes and constructs a collection containing English web pages.

```
java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar \
-crawl -export -dedup -a test -f -type m -c 5 -lang en -k -u ENV_EN_seeds.txt -xslt -oxslt\
-dest crawlResults -of output-test-list.txt -ofh output-test-list.txt.html
```

In this and other example commands in this documentation, a `log4j.xml` file is being used to set logging configuration details. An example `log4j.xml` file can be downloaded from [[Log4j_xml|here]].

* Given a seed URL list [[ENV_EN_seeds.txt]] and a topic definition for the _Environment_ domain in Engish [[ENV_EN_topictxt|ENV_EN_topic.txt]], the following example crawls the web for 10 cycles and constructs a collection containing English web pages related to this domain.

```
java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar \
-crawl -export -dedup -a test1 -f -type m -n 10 -lang en -k -u seed-examples.txt -xslt -oxslt \
-tc ENV-EN-topic.txt -dom Environment -dest crawlResults -of output-test1-list.txt -ofh output-test1-list.txt.html
```

## Example of running bilingual crawls

```
java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar /opt/ilsp-fc/ilsp-fc-X.Y.Z-jar-with-dependencies.jar \
-crawl -export -dedup -pairdetect -align -tmxmerge -f -k -xslt -oxslt -type p -n 10 -t 20 -len 0 -mtlen 80 \
-lang "en;es" -doctypes "auidh" -segtypes "1:1" -a test -u ENV_EN_ES_seed.txt \
-dest "crawlResults" -of "output_xml_list.txt" -ofh "output_xml_list.html" \
-oft "output_tmx_list.tmx.txt" -ofth "output_tmx_list.tmx.html" -tmx "output.tmx" -metadata
```

## Other settings

There are several settings that influence the crawling process and can be defined in a configuration file before the crawling process. The default configuration files for monolingual and bilingual crawls are [[FMC_config.xml]] and [[FBC_config.xml]] respectively. They are included in the ilsp-fc runnable jar.

Some of the settings can also be overriden using options of the ilsp-fc runnable jar, as follows:

-crawl : For applying crawling process.

-f : Forces the crawler to start a new job.

-type : The type of crawling. Crawling for monolingual (m) or parallel (p).

-lang : The language iso codes of the targeted languages separated by ";".

-cfg : The full path to a configuration file that can be used to override default parameters.

-a : User agent name. It is proposed to use a name similar to the targeted site in case of bilingual crawls. site.

-u : The fullpath of text file that contains the seed URLs that will initialize the crawler. In case of bilingual crawling the list should contain the URL of the main page of the targeted website, or (of course) other URLs of this website.

-filter : A regular expression to filter out URLs which do NOT match this regex.

The use of this filter forces the crawler to either focus on a specific web domain (i.e. ".*ec.europa.eu.*"), or on a part of a web domain (e.g.".*/legislation_summaries/environment.*") or in different web sites (i.e. in cases the translations are in two web sites e.g. http://www.nrcan.gc.ca and http://www.rncan.gc.ca). Note that if this filter is used, only the seed URLs that match this regex will be fetched.

-n : The crawl duration in cycles. Since the crawler runs in cycles (during which links stored at the top of the crawler’s frontier are extracted and new links are examined) it is proposed to use this parameter either for testing purposes or selecting a large number (i.e. 100) to "verify" that the crawler will visit the entire website.

-c : the crawl duration in minutes. Since the crawler runs in cycles (during which links stored at the top of the crawler’s frontier are extracted and new links are examined) it is very likely that the defined time will expire during a cycle run. Then, the crawler will stop only after the end of the running cycle.

-dest : The directory where the results (i.e. the crawled data) will be stored. The tool will create the file structure dest/agent/crawl-id (where dest and agent stand for the arguments of parameters dest and agent respectively and crawl-id is generated automatically). In this directory, the tool will create the "run" directories (i.e. directories containing all resources fetched/extracted/used/required for each cycle of this crawl. In addition a pdf directory for storing acquired pdf files will be created.

-t : The number of threads that will be used to fetch web pages in parallel.

-k : Forces the crawler to annotate boilerplate content in parsed text.

-len : Minimum number of tokens per paragraph. If the length (in terms of tokens) of a paragraph is

less than this value the paragraph will be annotated as "out of interest" and will not be included into the clean text of the web page.

-mtlen : Minimum number of tokens in cleaned document. If the length (in terms of tokens) of the cleaned text is less than this value, the document will not be stored.

-tc : The fullpath of topic file (a text file that contains a list of term triplets that describe the targeted topic). An example domain definition of "Environment" for the English-Spanish pair can be found at http://nlp.ilsp.gr/redmine/projects/ilsp-fc/wiki/ENV_EN_ES_topic. If omitted, the crawl will be a "general" one (i.e. module for text-to-domain classification will not be used).

-dom : Title of the targeted domain (required when domain definition, i.e. tc parameter, is used).

-storefilter: A regular expression to discard (i.e. visit/fetch/process but do not store) webpages with URLs which do NOT match this regex.

-d : Forces the crawler to stay in a web site (i.e. starts from a web site and extracts only links to pages inside the same web site). It should be used only for monolingual crawling.

[//]: # ( ## Input )

[//]: # (In case of general monolingual crawls the required input from the user is: )
[//]: # (* a list of seed URLs (i.e. a text file with one URL per text line). )

[//]: # (In case of focused monolingual crawls (i.e. when the crawler visits/processes/stores web pages that are related to a targeted domain), the input should include: )
[//]: # (* a list of seed URLs pointing to relevant web pages. An example seed URL list for _Environment_ in English can be found at [[ENV_EN_seeds.txt]]. )
[//]: # (* a list of term triplets (_<relevance,term,subtopic>_) that describe a domain (i.e. this list is required in case the user aims to acquire domain-specific documents) and, optionally, subcategories of this domain. An example domain definition can be found at [[ENV_EN_topic.txt]] for the _Environment_ domain in English. Details on how to construct/bootstrap such lists and how they are used in text to topic classification could be found at this paper http://www.aclweb.org/anthology/W13-2506.pdf )

[//]: # (In case of general bilingual crawling, the input from the user includes:)
[//]: # (* a seed URL list which should contain URL(s) from only one web site (e.g. [[ENV_EN_ES_seed.txt]]). The crawler will follow only links pointing to pages inside this web site. However, the user could use the <code> filter </code> parameter (see below) to allow visiting only links pointing to pages either inside versions of the top domain of the URL (e.g. http://www.fifa.com/, http://es.fifa.com/ , etc.) or in different web sites (i.e. in cases the translations are in two web sites e.g. http://www.nrcan.gc.ca and http://www.rncan.gc.ca). Examples of seed URLs can be found at [[seed_examples.txt]]. )

[//]: # (In case of focused bilingual crawls, the input should also include: )
[//]: # (* a list of term triplets (_<relevance,term,subtopic>_) that describe a domain (i.e. this list is required in case the user aims to acquire domain-specific documents) and, optionally, subcategories of this domain in both the targeted languages (i.e. the union of the domain definition in each language). An example domain definition of _Environment_ for the English-Spanish pair can be found at [[ENV_EN_ES_topic.txt]].)

[//]: # (## Language support )

[//]: # (For both monolingual and bilingual crawling, the set of currently supported languages comprises de, el, en, es, fr, ga, hr, it, ja, and pt. )

[//]: # (In order to add another language, a developer/user should: )
[//]: # (* verify that the targeted language is supported by the default language identifier (https://code.google.com/p/language-detection/) integrated in the ILSP-FC, )
[//]: # (* add a textline with proper content in the [[langKeys.txt]] file which is included in the ilsp-fc runnable jar, and)
[//]: # (* add a proper analyser in the <code>gr.ilsp.fmc.utils.AnalyserFactory</code> class of the ilsp-fc source. )

[//]: # (## Run a monolingual crawl )

[//]: # ( <pre><code>java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar crawlandexport -a vpapa@ilsp.gr -cfg FMC_config.xml -type m -c 10 -lang en -of output_test1_list.txt -ofh output_test1_list.txt.html -tc ENV_EN_topic.txt -u ENV_EN_seeds.txt -f -k -dom Environment</code></pre> )

[//]: # ( <pre><code>java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar crawlandexport -a test2 -f -k -type m -c 5 -lang es -of output_test2_list.txt -ofh output_test2_list.txt.html -u seed_examples.txt </code></pre> )

[//]: # (## Run a bilingual crawl )

[//]: # ( <pre><code>java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar crawlandexport -a test3 -c 10 -f -k -l1 de -l2 it -of test_HS_DE-IT_output.txt -ofh test_HS_DE-IT_output.txt.html -tc HS_DE-IT_topic.txt -type p -u seed_examples.txt -cfg FBC_config.xml -dom HS -len 0 -mtlen 100 -xslt -oxslt</code></pre> )

[//]: # ( <pre><code>java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar crawlandexport -a test4 -c 10 -f -k -l1 es -l2 en -type p -u seed_examples.txt -filter ".*uefa.com.*" -len 0 -mtlen 80 -xslt -oxslt -dest "/var/crawl_results/" -of test_U_ES-EN_output.txt -ofh test_U_ES-EN_output.txt.html -oft test_U_ES-EN_output.tmx.txt -ofth test_U_ES-EN_output.tmx.html -align hunalign -dict </code></pre> )

[//]: # ( <pre><code>java -jar ilsp-fc-2.2-jar-with-dependencies.jar crawlandexport -f -a abumatran -type p -align maligna -l1 en -l2 fr -u seed_examples.txt -filter ".*(nrcan|rncan).*" -n 2 -xslt -oxslt -of output_demo_EN-FR.txt -ofh output_demo_EN-FR.txt.html -oft output_demo_EN-FR.tmx.txt -ofth output_demo_EN-FR.tmx.html </code></pre>)

[//]: # ( ## Output )

[//]: # (The output of the ilsp-fc in the case of a monolingual crawl consists of: )
[//]: # (* a list of links pointing to XML files following the cesDOC Corpus Encoding Standard (http://www.xces.org/). See this "cesDoc":http://nlp.ilsp.gr/xslt/ilsp-fc/1.xml file for an example in English for the _Environment_ domain. )
[//]: # (* a list of links pointing to HTML files (by XSL transformation of each XML) for easier browsing of the collection. As an example, see this "rendered cesDoc":http://nlp.ilsp.gr/xslt/ilsp-fc/1.xml.html file. )

[//]: # (The output of the ilsp-fc in the case of a bilingual crawl consists of: )
[//]: # (* a list of links to XML files following the cesAlign Corpus Encoding Standard for linking cesDoc documents. This example "cesAlign":http://nlp.ilsp.gr/xslt/ilsp-fc/44_98_i.xml file serves as a link between a detected pair of cesDoc documents in "English":http://nlp.ilsp.gr/xslt/ilsp-fc/98.xml and "Spanish":http://nlp.ilsp.gr/xslt/ilsp-fc/44.xml.)
[//]: # (* a list of links pointing to HTML files (by XSL transformation of each cesAlign XML) for easier browsing of the collection. As an example, see this "rendered cesAlign":http://nlp.ilsp.gr/xslt/ilsp-fc/44_98_i.xml.html file.)
[//]: # (* a list of links to TMX files containing sentence alignments that have been extracted from the detected document pairs. As an example, see this "TMX":http://nlp.ilsp.gr/xslt/ilsp-fc/44_98_i.tmx file.)
[//]: # (* a list of links pointing to HTML files (by XSL transformation of each TMX) for easier browsing of the collection. As an example, see this "rendered TMX":http://nlp.ilsp.gr/xslt/ilsp-fc/44_98_i.html file.)

## Running modules of the ILSP-FC

The ILSP-FC, in a configuration for acquiring parallel data, applies the following processes (one after the other):
* [[Crawl|Crawl]]
* [[Export|Export]]
* [[NearDeduplication|Near Deduplication]]
* [[PairDetection|Pair Detection]]
* [[SegmentAlignment|Segment Alignment]]
* [[TMXmerging|TMX Merging]]

Project

General

Profile

ILSP Focused Crawler

Getting Started » History » Version 140