Getting Started » History » Version 155

Vassilis Papavassiliou, 2016-05-31 03:09 PM

Getting Started

Once you build or download an ilsp-fc runnable jar, you can run it like this:

java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar

Examples of running monolingual crawls

Given a seed URL list ENV_EN_seeds.txt, the following example crawls the web for 5 minutes and constructs a collection containing English web pages.

java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar \
-crawl -export -dedup -a test -f -type m -c 5 -lang en -k -u ENV_EN_seeds.txt -oxslt \
-dest crawlResults -bs "output_test"

In this and the other example commands in this documentation, a log4j.xml file is used to configure logging. An example log4j.xml file can be downloaded from here.

Given a seed URL list ENV_EN_seeds.txt and a topic definition for the Environment domain in English, ENV_EN_topic.txt, the following example crawls the web for 10 cycles and constructs a collection containing English web pages related to this domain.

java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar \
-crawl -export -dedup -a test1 -f -type m -n 10 -lang en -k -u ENV_EN_seeds.txt -oxslt \
-tc ENV_EN_topic.txt -dom Environment -dest crawlResults -bs "output-test"

Example of running bilingual crawls

This is a test example to verify that the whole workflow (crawl, export, deduplication, pair detection, alignment) works successfully.

java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar /opt/ilsp-fc/ilsp-fc-2.2.3-SNAPSHOT-jar-with-dependencies.jar \
-crawl -export -dedup -pairdetect -align -tmxmerge -f -k -oxslt -type p -n 1 -t 20 -len 0 -mtlen 100  \
-pdm "aupdih" -segtypes "1:1" -lang "eng;lt;deu;lv" -a test -filter ".*www\.airbaltic\.com.*" \
-u "/var/www/html/elrc/test-seeds" -dest "/var/www/html/elrc/test" \
-bs "/var/www/html/elrc/test/output_test" &> "/var/www/html/elrc/test/log_test"

Seed URLs:

https://www.airbaltic.com/lv/bernu-atlaide
https://www.airbaltic.com/lv/profila-registracija
https://www.airbaltic.com/de/ermaessigung-kinder
https://www.airbaltic.com/de/profil-erstellen
https://www.airbaltic.com/en/child-discount
https://www.airbaltic.com/en/create-account
https://www.airbaltic.com/lt/child-discount
https://www.airbaltic.com/lt/sukurti-paskira
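
One way to prepare the seed list passed with -u is to write the URLs to a plain text file. The snippet below is a minimal sketch: it assumes the seed file holds one URL per line, and it writes to a temporary directory (SEED_DIR and SEEDS are illustrative names) rather than to /var/www/html/elrc/test-seeds, so that it runs without special permissions.

```shell
# Sketch: write the seed URLs to a text file (assumed format: one URL per line).
# SEED_DIR and SEEDS are illustrative names; a temporary directory is used here
# instead of /var/www/html/elrc so the snippet runs without special permissions.
SEED_DIR="$(mktemp -d)"
SEEDS="$SEED_DIR/test-seeds"

cat > "$SEEDS" <<'EOF'
https://www.airbaltic.com/lv/bernu-atlaide
https://www.airbaltic.com/lv/profila-registracija
https://www.airbaltic.com/de/ermaessigung-kinder
https://www.airbaltic.com/de/profil-erstellen
https://www.airbaltic.com/en/child-discount
https://www.airbaltic.com/en/create-account
https://www.airbaltic.com/lt/child-discount
https://www.airbaltic.com/lt/sukurti-paskira
EOF

# Quick sanity check: print the number of seeds written.
wc -l < "$SEEDS"
```

The resulting path ("$SEEDS" in this sketch) would then be the value given to the -u option.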

Options

Several options control the applied processes. In addition to the comprehensive list below, you can also consult the options supported by each module.

 -a,--agentname <arg>                  Agent name to identify the person or the organization
                                       responsible for the crawl
 -align,--align_sentences <arg>        Sentence align document pairs using this aligner (default is
                                       maligna)
 -bs,--basename <arg>                  Basename to be used in generating all files for easier
                                       content navigation
 -c,--crawlduration <arg>              Maximum crawl duration in minutes
 -cc,--creative_commons                Force the alignment process to generate a merged TMX with
                                       sentence alignments only from document pairs for which an
                                       open content license has been detected.
 -cfg,--config <arg>                   Path to the XML configuration file
 -crawl,--crawl                        Start a crawl
 -d,--stay_in_webdomain                Force the monolingual crawler to stay in a specific web
                                       domain
 -dbg,--debug                          Use debug level for logging
 -dedup,--deduplicate                  Deduplicate and discard (near) duplicate documents
 -del,--delete_redundant_files         Delete redundant crawled documents that have not been
                                       detected as members of a document pair
 -dest,--destination <arg>             Path to a directory where the acquired/generated resources
                                       will be stored
 -pdm,--pairDetectMethods <arg>        When creating a merged TMX file, only use sentence alignments
                                       from document pairs that have been identified by specific
                                       methods, e.g. auidh. See the pdm option.
 -dom,--domain <arg>                   A descriptive title for the targeted domain
 -export,--export                      Export crawled documents to cesDoc XML files
 -f,--force                            Force a new crawl. Caution: This will remove any previously
                                       crawled data
 -filter,--fetchfilter <arg>           Use this regex to force the crawler to crawl only in specific
                                       sub webdomains. Webpages with urls that do not match this
                                       regex will not be fetched.
 -h,--help                             This message
 -i,--inputdir <arg>                   Input directory for deduplication, pairdetection, or
                                       alignment
 -ifp,--image_urls                     Full image URLs (and not only their basenames) will be used
                                       in pair detection with common images
 -k,--keepboiler                       Keep and annotate boilerplate content in parsed text
 -l,--loggingAppender <arg>            Logging appender (console, DRFA) to use
 -lang,--languages <arg>               Two or three letter ISO code(s) of target language(s), e.g.
                                       el (for a monolingual crawl for Greek content) or en;el (for
                                       a bilingual crawl)
 -len,--length <arg>                   Minimum number of tokens per text block. Shorter text blocks
                                       will be annotated as "ooi-length"
 -mtlen,--minlength <arg>              Minimum number of tokens in crawled documents (after
                                       boilerplate detection). Shorter documents will be discarded.
 -n,--numloops <arg>                   Maximum number of fetch/update loops
 -oxslt,--offline_xslt                 Apply an xsl transformation to generate html files during
                                       exporting.
 -p_r,--path_replacements <arg>        Put the strings to be replaced, separated by ';'. This might
                                       be useful for crawling via the web service
 -pairdetect,--pair_detection          Detect document pairs in crawled documents
 -pdm,--pair_detection_methods <arg>   A string forcing the crawler to detect pairs using one or
                                       more specific methods: a (links between documents), u
                                       (patterns in urls), p (common images and similar digit
                                       sequences), i (common images), d (similar digit sequences),
                                       h, m, or l (high/medium/low similarity of html structure)
 -segtypes,--segtypes <arg>            When creating a merged TMX file, only use sentence alignments
                                       of specific types, e.g. 1:1
 -storefilter,--storefilter <arg>      Use this regex to force the crawler to store only webpages
                                       with urls that match this regex.
 -t,--threads <arg>                    Maximum number of fetcher threads to use
 -tc,--topic <arg>                     Path to a file with the topic definition
 -tmxmerge,--tmxmerge                  Merge aligned segments from each document pair into one tmx
                                       file
 -type,--type <arg>                    Crawl type: m (monolingual) or p (parallel)
 -u,--urls <arg>                       File with seed urls used to initialize the crawl
 -u_r,--url_replacements <arg>         A string to be replaced, separated by ';'.

Other settings

There are several settings that influence the crawling process and can be defined in a configuration file before the crawling process. The default configuration files for monolingual and bilingual crawls are FMC_config.xml and FBC_config.xml respectively. They are included in the ilsp-fc runnable jar.

Some of the settings can also be overridden using options of the ilsp-fc runnable jar, as follows:

-type : The type of crawl: monolingual (m) or parallel (p).

-lang : The ISO codes of the targeted languages, separated by ";".

-cfg : The full path to a configuration file that can be used to override default parameters.

-a : User agent name. For bilingual crawls, a name similar to the targeted site is recommended.

-u : The full path of the text file that contains the seed URLs that will initialize the crawler. For bilingual crawling, the list should contain the URL of the main page of the targeted website, or other URLs of this website.

-filter : A regular expression; URLs that do NOT match it are filtered out.
This filter forces the crawler to focus on a specific web domain (e.g. ".ec.europa.eu."), on a part of a web domain (e.g. "./legislation_summaries/environment."), or on several web sites (e.g. when the translations are hosted on two web sites, such as http://www.nrcan.gc.ca and http://www.rncan.gc.ca). Note that if this filter is used, only the seed URLs that match this regex will be fetched.
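
Because seed URLs that do not match the filter are not fetched, it can help to preview the effect of a regex on a seed list before starting a crawl. The following is only a sketch: it uses grep -E as a stand-in for the crawler's own regex matching (assumed to agree for simple patterns like this one), the pattern is the one from the bilingual example, and the example.com URL is a hypothetical entry added to show a non-match.

```shell
# Sketch: preview which seed URLs match a -filter regex before starting a crawl.
# Assumption: grep -E (POSIX extended regex) agrees with the crawler's regex
# engine for simple patterns such as this one; the crawler itself is not invoked.
FILTER='.*www\.airbaltic\.com.*'

# Hypothetical seed list; the example.com URL is added only to show a non-match.
SEEDS_FILE="$(mktemp)"
cat > "$SEEDS_FILE" <<'EOF'
https://www.airbaltic.com/en/child-discount
https://www.airbaltic.com/de/profil-erstellen
https://www.example.com/en/news
EOF

# -x requires the whole line (URL) to match; with the leading and trailing ".*"
# this gives the same result as a substring search for the host name.
echo "Seeds that match the filter:"
grep -Ex "$FILTER" "$SEEDS_FILE"

echo "Seeds that do not match the filter:"
grep -Evx "$FILTER" "$SEEDS_FILE"

MATCHED="$(grep -Ecx "$FILTER" "$SEEDS_FILE")"
SKIPPED="$(grep -Ecvx "$FILTER" "$SEEDS_FILE")"
echo "matched=$MATCHED skipped=$SKIPPED"
```

If the second list is not empty, either the filter or the seed list probably needs adjusting before the crawl is started.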

-n : The crawl duration in cycles. Since the crawler runs in cycles (during which links stored at the top of the crawler’s frontier are extracted and new links are examined), it is recommended to use this parameter either for testing purposes or with a large value (e.g. 100) to make sure that the crawler visits the entire website.

-c : The crawl duration in minutes. Since the crawler runs in cycles (during which links stored at the top of the crawler’s frontier are extracted and new links are examined), it is very likely that the defined time will expire during a cycle. In that case, the crawler stops only after the running cycle has finished.
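
The practical consequence is that a crawl can run somewhat longer than the value given to -c. The arithmetic below is a simplified model of this, not the crawler's actual scheduling code: the per-cycle durations are hypothetical, and it is assumed that a new cycle starts only while the time limit has not yet expired.

```shell
# Sketch: why a crawl started with -c may run longer than the requested minutes.
# The per-cycle durations below are hypothetical; real cycles vary in length.
LIMIT_MIN=5                 # value passed to -c
CYCLE_MINUTES="2 2 3 4"     # hypothetical duration of each cycle, in minutes

ELAPSED=0
CYCLES_RUN=0
for d in $CYCLE_MINUTES; do
  # Assumption: a new cycle starts only while the time limit has not expired.
  if [ "$ELAPSED" -ge "$LIMIT_MIN" ]; then
    break
  fi
  # A cycle that has started always runs to completion.
  ELAPSED=$((ELAPSED + d))
  CYCLES_RUN=$((CYCLES_RUN + 1))
done

echo "limit=${LIMIT_MIN}min cycles_run=$CYCLES_RUN elapsed=${ELAPSED}min"
```

With these numbers the limit expires during the third cycle, so the modelled crawl ends after 7 minutes instead of 5.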

-dest : The directory where the results (i.e. the crawled data) will be stored. The tool will create the file structure dest/agent/crawl-id (where dest and agent stand for the arguments of the -dest and -a parameters respectively, and crawl-id is generated automatically). In this directory, the tool will create the "run" directories (i.e. directories containing all resources fetched/extracted/used/required for each cycle of this crawl). In addition, a pdf directory for storing acquired PDF files will be created.

-t : The number of threads that will be used to fetch web pages in parallel.

-k : Forces the crawler to annotate boilerplate content in parsed text.

-len : Minimum number of tokens per paragraph. If the length (in terms of tokens) of a paragraph is less than this value, the paragraph will be annotated as "out of interest" and will not be included in the clean text of the web page.

-mtlen : Minimum number of tokens in the cleaned document. If the length (in terms of tokens) of the cleaned text is less than this value, the document will not be stored.
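
The interaction of the two thresholds can be illustrated with a small sketch. It is not the crawler's implementation: whitespace-separated words stand in for tokens (the crawler's tokenizer may differ), and the threshold values and the page text are hypothetical.

```shell
# Sketch: how the -len and -mtlen thresholds interact, using whitespace-separated
# words as a stand-in for tokens (an assumption; the crawler's tokenizer may differ).
LEN=4       # hypothetical -len value: minimum tokens per paragraph
MTLEN=10    # hypothetical -mtlen value: minimum tokens in the cleaned document

# Hypothetical page text, one paragraph per line.
PAGE="$(mktemp)"
cat > "$PAGE" <<'EOF'
Home
Children under two years travel at a reduced fare
Read more
The discount applies to all routes operated by the airline
EOF

CLEAN_TOKENS=0
while IFS= read -r paragraph; do
  n="$(printf '%s\n' "$paragraph" | wc -w | tr -d ' ')"
  if [ "$n" -lt "$LEN" ]; then
    # Too short: excluded from the clean text.
    echo "out of interest ($n tokens): $paragraph"
  else
    echo "kept ($n tokens): $paragraph"
    CLEAN_TOKENS=$((CLEAN_TOKENS + n))
  fi
done < "$PAGE"

# The document is stored only if its clean text is long enough.
if [ "$CLEAN_TOKENS" -lt "$MTLEN" ]; then
  STORED=no
else
  STORED=yes
fi
echo "clean_tokens=$CLEAN_TOKENS stored=$STORED"
```

Here the two short navigation-like paragraphs fall below the paragraph threshold, while the remaining clean text is long enough for the document to be stored.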

-tc : The full path of the topic file (a text file that contains a list of term triplets that describe the targeted topic). An example domain definition of "Environment" for the English-Spanish pair can be found at http://nlp.ilsp.gr/redmine/projects/ilsp-fc/wiki/ENV_EN_ES_topic. If omitted, the crawl will be a "general" one (i.e. the module for text-to-domain classification will not be used).

-dom : Title of the targeted domain (required when a domain definition, i.e. the tc parameter, is used).

-storefilter : A regular expression to discard (i.e. visit/fetch/process but do not store) webpages with URLs which do NOT match this regex.

-d : Forces the crawler to stay in a web site (i.e. starts from a web site and extracts only links to pages inside the same web site). It should be used only for monolingual crawling.

-export : Exports crawled documents to cesDoc XML files.

-of : The full path of a text file containing a list with the full paths of the exported cesDoc files, or cesAlign files.

-xslt : If present, inserts a stylesheet for rendering XML results as HTML.

-oxslt : If present, exports crawl results with the help of an XSLT file for easier examination of results.

-ofh : The full path of an HTML file containing a list of links pointing to HTML files (generated by XSL transformation of each XML file) for easier browsing of the collection.

-dedup : Runs (near) deduplication.

-pairdetect : Identifies candidate parallel documents.

-meth : Methods to be used for pair detection. Provide a string containing one or more of: a (checking links), u (checking URLs for patterns), p (combining common images and digits), i (using common images), d (examining digit sequences), s (examining structures).

-u_r : url_replacements. Besides the default patterns, the user can add more patterns separated by ";".

-ifp : image_fullpath. Uses the full path of an image, rather than its name only, to represent the image in pair detection.

-del : Delete redundant files. Deletes cesDoc files that have not been paired.

-align : Runs segment alignment.

-oft : The full path of a text file containing a list with the full paths of the generated TMX files.

-ofth : The full path of an HTML file containing a list of links pointing to the generated transformed TMX files.

-tmxmerge : for merging generated TMX files (i.e. construct a bilingual corpus).

-doctypes : Defines the types of the document pairs from which the segment pairs will be selected. The proposed value is "aupidh" since pairs of type "m" and "l" (e.g. eng-1_lav-3_m.xml or eng-2_lav-8_l.xml) are only used for testing or examining the tool.

-thres : Thresholds for 0:1 alignments per type. It should be of the same length as the types parameter. If a TMX of type X contains more 0:1 segment pairs than the corresponding threshold, it will not be selected.

-segtypes : Types of segment alignments that will be selected for the final output. The value "1:1" (default) is proposed. If omitted, segments of all types will be processed. Otherwise, put the segment types separated by ";" (e.g. 1:1;1:2;2:1).

-tmx : A TMX file that includes the filtered segment pairs of the generated TMX files. This is the final output of the process (i.e. the parallel corpus).

-cc : If present, only document pairs for which a license has been detected will be selected for the merged TMX.

-metadata : Generates an XML file which contains metadata of the generated corpus.

Running modules of the ILSP-FC

The ILSP-FC, in a configuration for acquiring parallel data, applies the following processes (one after the other):
* Crawl
* Export
* Near Deduplication
* Pair Detection
* Segment Alignment
* TMX Merging
