Getting Started

Once you have built or downloaded an ilsp-fc runnable jar, you can run it like this:

java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar

Examples of running monolingual crawls

  • Given a seed URL list ENV_EN_seeds.txt, the following example crawls the web for 5 minutes and constructs a collection containing English web pages.
java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar \
-crawl -export -dedup -a test -f -type m -cdm 5 -lang en -k -u ENV_EN_seeds.txt -oxslt \
-dest crawlResults -bs "output_test"

In this and the other example commands in this documentation, a log4j.xml file is used to configure logging. An example log4j.xml file can be downloaded from here.

  • Given a seed URL list ENV_EN_seeds.txt and a topic definition for the Environment domain in English, ENV_EN_topic.txt, the following example crawls the web for 10 cycles and constructs a collection of English web pages related to this domain.
java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar \
-crawl -export -dedup -a test1 -f -type m -cdl 10 -lang en -k -u ENV_EN_seeds.txt -oxslt \
-tc ENV_EN_topic.txt -dom Environment -dest crawlResults -bs "output-test"

Example of running bilingual crawls

This is a test example to verify that the whole workflow (crawl, export, deduplication, pair detection, alignment) works successfully.

java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar /opt/ilsp-fc/ilsp-fc-2.2.4-SNAPSHOT-jar-with-dependencies.jar \
-crawl -export -dedup -pairdetect -align -tmxmerge -f -k -oxslt -type p -cdl 1 -t 20 -len 0 -mtlen 100  \
-pdm "aupdih" -segtypes "1:1" -lang "eng;lt;deu;lv" -a test -filter ".*www\.airbaltic\.com.*" \
-u "/var/www/html/elrc/test-seeds" -dest "/var/www/html/elrc/test" \
-bs "/var/www/html/elrc/test/output_test" &> "/var/www/html/elrc/test/log_test"

Seed URLs:

https://www.airbaltic.com/lv/bernu-atlaide
https://www.airbaltic.com/lv/profila-registracija
https://www.airbaltic.com/de/ermaessigung-kinder
https://www.airbaltic.com/de/profil-erstellen
https://www.airbaltic.com/en/child-discount
https://www.airbaltic.com/en/create-account
https://www.airbaltic.com/lt/child-discount
https://www.airbaltic.com/lt/sukurti-paskira
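
Before starting the crawl, it can be useful to save the seed list to a file and check it against the regex passed to -filter. The following is a minimal sketch of such a check, not part of ILSP-FC itself. The file name test-seeds in the current directory is an assumption (the example command reads /var/www/html/elrc/test-seeds), and grep -E only approximates the regex matching done by the crawler, although the two should agree for a pattern as simple as this one.

```shell
# Write the seed URLs to a local file (hypothetical location; it must match
# the path given to -u when the crawl is started).
seeds_file="test-seeds"
cat > "$seeds_file" <<'EOF'
https://www.airbaltic.com/lv/bernu-atlaide
https://www.airbaltic.com/lv/profila-registracija
https://www.airbaltic.com/de/ermaessigung-kinder
https://www.airbaltic.com/de/profil-erstellen
https://www.airbaltic.com/en/child-discount
https://www.airbaltic.com/en/create-account
https://www.airbaltic.com/lt/child-discount
https://www.airbaltic.com/lt/sukurti-paskira
EOF

# Same pattern as the -filter argument in the example command.
filter='.*www\.airbaltic\.com.*'

# Count all seeds, then the seeds whose URL matches the filter.
total=$(grep -c . "$seeds_file")
matching=$(grep -c -E "$filter" "$seeds_file")
echo "seeds: $total, matching filter: $matching"
```

If the two counts differ, some seed URLs do not match the filter regex and, according to the description of -filter, would not be fetched.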

Options

Several options control the processes that are applied. In addition to the comprehensive list below, you can consult the options supported by each module.

 -a,--agentname <arg>                  Agent name to identify the person or the organization
                                       responsible for the crawl
 -align,--align_sentences <arg>        Sentence align document pairs using this aligner (default is
                                       maligna)
 -bs,--basename <arg>                  Basename to be used in generating all files for easier
                                       content navigation
 -cdm,--crawlduration <arg>            Maximum crawl duration in minutes
 -cc,--creative_commons                Force the alignment process to generate a merged TMX with
                                       sentence alignments only from document pairs for which an
                                       open content license has been detected.
 -cfg,--config <arg>                   Path to the XML configuration file
 -crawl,--crawl                        Start a crawl
 -d,--stay_in_webdomain                Force the monolingual crawler to stay in a specific web
                                       domain
 -dbg,--debug                          Use debug level for logging
 -dedup,--deduplicate                  Deduplicate and discard (near) duplicate documents
 -del,--delete_redundant_files         Delete redundant crawled documents that have not been
                                       detected as members of a document pair
 -dest,--destination <arg>             Path to a directory where the acquired/generated resources
                                       will be stored
 -pdm,--pairDetectMethods <arg>        When creating a merged TMX file, only use sentence alignments
                                       from document pairs that have been identified by specific
                                       methods, e.g. auidh. See the pdm option.
 -dom,--domain <arg>                   A descriptive title for the targeted domain
 -export,--export                      Export crawled documents to cesDoc XML files
 -f,--force                            Force a new crawl. Caution: This will remove any previously
                                       crawled data
 -filter,--fetchfilter <arg>           Use this regex to force the crawler to crawl only in specific
                                       sub webdomains. Webpages with urls that do not match this
                                       regex will not be fetched.
 -h,--help                             This message
 -i,--inputdir <arg>                   Input directory for deduplication, pairdetection, or
                                       alignment
 -ifp,--image_urls                     Full image URLs (and not only their basenames) will be used
                                       in pair detection with common images
 -k,--keepboiler                       Keep and annotate boilerplate content in parsed text
 -l,--loggingAppender <arg>            Logging appender (console, DRFA) to use
 -lang,--languages <arg>               Two or three letter ISO code(s) of target language(s), e.g.
                                       el (for a monolingual crawl for Greek content) or eng;el (for
                                       a bilingual crawl)
 -len,--length <arg>                   Minimum number of tokens per text block. Shorter text blocks
                                       will be annotated as "ooi-length"
 -mtlen,--minlength <arg>              Minimum number of tokens in crawled documents (after
                                       boilerplate detection). Shorter documents will be discarded.
 -cdl,--numloops <arg>                 Maximum number of fetch/update loops
 -oxslt,--offline_xslt                 Apply an xsl transformation to generate html files during
                                       exporting.
 -p_r,--path_replacements <arg>        Put the strings to be replaced, separated by ';'. This might
                                       be useful for crawling via the web service
 -pairdetect,--pair_detection          Detect document pairs in crawled documents
 -pdm,--pair_detection_methods <arg>   A string forcing the crawler to detect pairs using one or
                                       more specific methods: a (links between documents), u
                                       (patterns in urls), p (common images and similar digit
                                       sequences), i (common images), d (similar digit sequences),
                                       h, m, or l (high/medium/low similarity of html structure)
 -segtypes,--segtypes <arg>            When creating a merged TMX file, only use sentence alignments
                                       of specific types, e.g. 1:1
 -storefilter,--storefilter <arg>      Use this regex to force the crawler to store only webpages
                                       with urls that match this regex.
 -t,--threads <arg>                    Maximum number of fetcher threads to use
 -tc,--topic <arg>                     Path to a file with the topic definition
 -tmxmerge,--tmxmerge                  Merge aligned segments from each document pair into one tmx
                                       file
 -type,--type <arg>                    Crawl type: m (monolingual) or p (parallel)
 -u,--urls <arg>                       File with seed urls used to initialize the crawl
 -u_r,--url_replacements <arg>         A string to be replaced, separated by ';'.
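
The value of -pdm is a string of single-letter method codes, such as "aupdih" in the bilingual example. As a reading aid, the sketch below expands such a string into the descriptions given in the option list. The function name describe_pdm is hypothetical and the script is not part of ILSP-FC; it only restates the letter meanings documented above.

```shell
# Print one line per method letter in a -pdm string, using the meanings
# given in the option list. describe_pdm is a hypothetical helper.
describe_pdm() {
  methods=$1
  while [ -n "$methods" ]; do
    rest=${methods#?}            # everything after the first character
    letter=${methods%"$rest"}    # the first character
    case $letter in
      a) echo "a: links between documents" ;;
      u) echo "u: patterns in urls" ;;
      p) echo "p: common images and similar digit sequences" ;;
      i) echo "i: common images" ;;
      d) echo "d: similar digit sequences" ;;
      h) echo "h: high similarity of html structure" ;;
      m) echo "m: medium similarity of html structure" ;;
      l) echo "l: low similarity of html structure" ;;
      *) echo "$letter: not listed in the documentation" ;;
    esac
    methods=$rest
  done
}

describe_pdm "aupdih"
```

Calling describe_pdm with the string from the bilingual example prints six lines, one for each of the methods a, u, p, d, i and h.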

Running modules of the ILSP-FC

The ILSP-FC, in a configuration for acquiring parallel data, applies the following processes (one after the other):
* Crawl
* Export
* Near Deduplication
* Pair Detection
* Segment Alignment
* TMX Merging
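
Each of these processes corresponds to one flag in the option list: -crawl, -export, -dedup, -pairdetect, -align and -tmxmerge. The sketch below assembles the flags for every process up to a chosen one, which may help when building a command that stops early in the pipeline, as the monolingual examples do after deduplication. The function name stage_flags is hypothetical, and the assumption that any leading part of the pipeline can be run on its own is based only on the examples in this documentation.

```shell
# Print the stage flags for all processes up to and including the one named
# in $1, in the order listed above. Flag names come from the option list;
# stage_flags itself is a hypothetical helper.
stage_flags() {
  last=$1
  flags=""
  for stage in crawl export dedup pairdetect align tmxmerge; do
    flags="$flags -$stage"
    if [ "$stage" = "$last" ]; then
      echo "${flags# }"
      return 0
    fi
  done
  echo "unknown stage: $last" >&2
  return 1
}

stage_flags dedup
stage_flags tmxmerge
```

The output can be placed after the jar name in a command line, followed by the remaining options (for example -type, -lang, -u and -dest) that the chosen processes need.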