Getting Started
Once you have built or downloaded an ilsp-fc runnable jar, you can run it like this:
java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar
Examples of running monolingual crawls
- Given a seed URL list ENV_EN_seeds.txt, the following example crawls the web for 5 minutes and constructs a collection containing English web pages.
java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar \
  -crawl -export -dedup -a test -f -type m -c 5 -lang en -k -u ENV_EN_seeds.txt -xslt -oxslt \
  -dest crawlResults -of output-test-list.txt -ofh output-test-list.txt.html
In this and other example commands in this documentation, a log4j.xml file is used to set logging configuration details. An example log4j.xml file can be downloaded from here.
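If the example file is not available, a minimal log4j 1.x configuration along the following lines should work. This is a sketch, not the project's official file; it logs INFO-level messages to the console, and you can adjust appenders and levels as needed:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
<log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/">
  <!-- Console appender with a timestamped pattern -->
  <appender name="console" class="org.apache.log4j.ConsoleAppender">
    <layout class="org.apache.log4j.PatternLayout">
      <param name="ConversionPattern" value="%d{ISO8601} %-5p [%t] %c - %m%n"/>
    </layout>
  </appender>
  <!-- Root logger: INFO and above go to the console -->
  <root>
    <priority value="info"/>
    <appender-ref ref="console"/>
  </root>
</log4j:configuration>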
- Given a seed URL list ENV_EN_seeds.txt and a topic definition for the Environment domain in English, ENV_EN_topic.txt, the following example crawls the web for 10 cycles and constructs a collection containing English web pages related to this domain.
java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar \
  -crawl -export -dedup -a test1 -f -type m -n 10 -lang en -k -u ENV_EN_seeds.txt -xslt -oxslt \
  -tc ENV_EN_topic.txt -dom Environment -dest crawlResults -of output-test1-list.txt -ofh output-test1-list.txt.html
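For reference, the seed URL list passed with -u is a plain text file with one URL per line. A hypothetical ENV_EN_seeds.txt could look like this (the URLs are illustrative only, not shipped with ilsp-fc):

http://ec.europa.eu/environment/index_en.htm
http://www.eea.europa.eu/
http://europa.eu/pol/env/index_en.htm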
Example of running a bilingual crawl
- Given a seed URL list ENV_EN_ES_seed.txt, the following example crawls the web for 10 cycles using 20 fetching threads, constructs an English-Spanish collection, detects candidate document pairs, aligns their segments, and merges the alignments into a single TMX file (output.tmx).
java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar /opt/ilsp-fc/ilsp-fc-X.Y.Z-jar-with-dependencies.jar \
  -crawl -export -dedup -pairdetect -align -tmxmerge -f -k -xslt -oxslt -type p -n 10 -t 20 -len 0 -mtlen 80 \
  -lang "en;es" -doctypes "auidh" -segtypes "1:1" -a test -u ENV_EN_ES_seed.txt \
  -dest "crawlResults" -of "output_xml_list.txt" -ofh "output_xml_list.html" \
  -oft "output_tmx_list.tmx.txt" -ofth "output_tmx_list.tmx.html" -tmx "output.tmx" -metadata
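As noted for the -u option below, the seed list for a bilingual crawl should contain only one or two URLs from the same web domain. A hypothetical ENV_EN_ES_seed.txt, with illustrative URLs pointing to the English and Spanish versions of the same site:

http://ec.europa.eu/environment/index_en.htm
http://ec.europa.eu/environment/index_es.htm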
Other settings
There are several settings that influence the crawling process; they can be defined in a configuration file before crawling starts. The default configuration files for monolingual and bilingual crawls are FMC_config.xml and FBC_config.xml, respectively; both are included in the ilsp-fc runnable jar.
Some of these settings can also be overridden with command-line options of the ilsp-fc runnable jar, as follows:
-a : user agent name (required)
-type : the type of crawling: monolingual (m) or parallel (p).
-cfg : the configuration file that will be used instead of the default (see the default configuration files above).
-c : the crawl duration in minutes. Since the crawler runs in cycles (during which links stored at the top of
the crawler’s frontier are extracted and new links are examined), the defined time will very likely
expire during a cycle run; in that case, the crawler stops only at the end of the running cycle.
The default value is 10 minutes.
-n : the crawl duration in cycles. The default is 1. This parameter is recommended for testing purposes.
-t : the number of threads that will be used to fetch web pages in parallel.
-f : Forces the crawler to start a new job (required).
-lang : the targeted language in case of monolingual crawling (required).
-l1 : the first targeted language in case of bilingual crawling (required).
-l2 : the second targeted language in case of bilingual crawling (required).
-u : the text file that contains the seed URLs that will initialize the crawler. In case of bilingual crawling,
the list should contain only 1 or 2 URLs from the same web domain.
-tc : domain definition (a text file that contains a list of term triplets that describe the targeted
domain). If omitted, the crawl will be a "general" one (i.e. the text-to-domain
classification module will not be used).
-k : Forces the crawler to annotate boilerplate content in parsed text.
-filter : A regular expression for filtering URLs; URLs that do NOT match this regex are discarded.
The use of this filter forces the crawler to focus either on a specific
web domain (e.g. ".ec.europa.eu.") or on a part of a web domain
(e.g. "./legislation_summaries/environment."). Note that if this filter
is used, only the seed URLs that match this regex will be fetched (see the combined example after this list).
-u_r : This parameter should be used for bilingual crawling when there is an already known pattern in URLs
which implies that one page is the candidate translation of the other. It consists of the two strings
to be replaced, separated by ';' (see the combined example after this list).
-d : Forces the crawler to stay in a web site (i.e. starts from a web site and extracts only links to pages
inside the same web site). It should be used only for monolingual crawling.
-len : Minimum number of tokens per paragraph. If the length (in terms of tokens) of a paragraph is
less than this value (default is 3), the paragraph will be annotated as "out of interest" and
will not be included in the clean text of the web page.
-mtlen : Minimum number of tokens in cleaned document. If the length (in terms of tokens) of the cleaned
text is less than this value (default is 200), the document will not be stored.
-align : Name of aligner to be used for sentence alignment (default is maligna).
-dict : A dictionary for sentence alignment if hunalign is used. If omitted, the default L1-L2 dictionary
of hunalign will be used, if it exists.
-xslt : Insert a stylesheet for rendering xml results as html.
-oxslt : Export crawl results with the help of an xslt file for better examination of results.
-dom : Title of the targeted domain (required when a domain definition, i.e. the -tc parameter, is used).
-dest : The directory where the results (i.e. the crawled data) will be stored.
-of : A text file containing a list with the exported XML files (see section Output below).
-ofh : An HTML file containing a list with the generated XML files (see section Output below).
-oft : A text file containing a list with the exported TMX files (see section Output below).
-ofth : An HTML file containing a list with the generated TMX files (see section Output below).
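As referenced in the -filter and -u_r descriptions above, the following hypothetical command combines several of these options for a focused monolingual crawl that stays within one web site and keeps only URLs under a given path. The regex and file names are illustrative, not shipped with ilsp-fc:

java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar \
  -crawl -export -dedup -a test2 -f -type m -n 5 -lang en -k -d \
  -filter ".*ec.europa.eu.*environment.*" -u ENV_EN_seeds.txt -xslt -oxslt \
  -dest crawlResults -of output-test2-list.txt -ofh output-test2-list.txt.html

Similarly, for a bilingual crawl of a site where the English and Spanish pages differ only in a path component, one could pass, for example, -u_r "/en/;/es/" (an illustrative pattern, following the "two strings separated by ';'" format described above).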
Running modules of the ILSP-FC
The ILSP-FC, in a configuration for acquiring parallel data, applies the following processes (one after the other):
* Crawl
* Export
* Near Deduplication
* Pair Detection
* Segment Alignment
* TMX Merging
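These stages correspond to the -crawl, -export, -dedup, -pairdetect, -align and -tmxmerge options used in the bilingual example above. Assuming the stages can be selected independently by passing only the corresponding options (a reasonable reading of the flags, though not verified here), a crawl-and-export-only monolingual run might look like:

java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar \
  -crawl -export -a test3 -f -type m -n 2 -lang en -u ENV_EN_seeds.txt \
  -dest crawlResults -of output-test3-list.txt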