Getting Started
Once you have built or downloaded an ilsp-fc runnable jar, you can run it like this:
java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar
Examples of running monolingual crawls
- Given a seed URL list ENV_EN_seeds.txt, the following example crawls the web for 5 minutes and constructs a collection containing English web pages.
java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar \
-crawl -export -dedup -a test -f -type m -cdm 5 -lang en -k -u ENV_EN_seeds.txt -oxslt \
-dest crawlResults -bs "output_test"
In this and other example commands in this documentation, a log4j.xml
file is used to set logging configuration details. An example log4j.xml
file can be downloaded from here.
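If you do not have such a file at hand, a minimal log4j 1.x configuration along the following lines can serve as the log4j.xml referenced above. This is a sketch, not the file shipped with ILSP-FC: the appender name, pattern, and log level are assumptions you can adjust (the -l option above suggests console and DRFA appenders are supported).

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
<log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/">
  <!-- Console appender with a simple timestamped pattern -->
  <appender name="console" class="org.apache.log4j.ConsoleAppender">
    <layout class="org.apache.log4j.PatternLayout">
      <param name="ConversionPattern" value="%d{ISO8601} %-5p [%c{1}] %m%n"/>
    </layout>
  </appender>
  <root>
    <priority value="info"/>
    <appender-ref ref="console"/>
  </root>
</log4j:configuration>
```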
- Given a seed URL list ENV_EN_seeds.txt and a topic definition for the Environment domain in English ENV_EN_topic.txt, the following example crawls the web for 10 fetch/update cycles and constructs a collection containing English web pages related to this domain.
java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar \
-crawl -export -dedup -a test1 -f -type m -cdl 10 -lang en -k -u ENV_EN_seeds.txt -oxslt \
-tc ENV_EN_topic.txt -dom Environment -dest crawlResults -bs "output-test"
Example of running a bilingual crawl
This is a test example to verify that the whole workflow (crawl, export, deduplication, pair detection, alignment, TMX merging) works successfully.
java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar /opt/ilsp-fc/ilsp-fc-2.2.4-SNAPSHOT-jar-with-dependencies.jar \
-crawl -export -dedup -pairdetect -align -tmxmerge -f -k -oxslt -type p -cdl 1 -t 20 -len 0 -mtlen 100 \
-pdm "aupdih" -segtypes "1:1" -lang "eng;lt;deu;lv" -a test -filter ".*www\.airbaltic\.com.*" \
-u "/var/www/html/elrc/test-seeds" -dest "/var/www/html/elrc/test" \
-bs "/var/www/html/elrc/test/output_test" &> "/var/www/html/elrc/test/log_test"
Seed URLs:
https://www.airbaltic.com/lv/bernu-atlaide
https://www.airbaltic.com/lv/profila-registracija
https://www.airbaltic.com/de/ermaessigung-kinder
https://www.airbaltic.com/de/profil-erstellen
https://www.airbaltic.com/en/child-discount
https://www.airbaltic.com/en/create-account
https://www.airbaltic.com/lt/child-discount
https://www.airbaltic.com/lt/sukurti-paskira
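The URLs above can be saved into a plain-text seed file and passed to the crawler with -u. A minimal sketch (the local filename test-seeds.txt is an assumption; the bilingual example above reads its seeds from /var/www/html/elrc/test-seeds):

```shell
# Write the eight airBaltic seed URLs (two per language) to a seed file,
# one URL per line. The filename test-seeds.txt is illustrative.
cat > test-seeds.txt <<'EOF'
https://www.airbaltic.com/lv/bernu-atlaide
https://www.airbaltic.com/lv/profila-registracija
https://www.airbaltic.com/de/ermaessigung-kinder
https://www.airbaltic.com/de/profil-erstellen
https://www.airbaltic.com/en/child-discount
https://www.airbaltic.com/en/create-account
https://www.airbaltic.com/lt/child-discount
https://www.airbaltic.com/lt/sukurti-paskira
EOF
wc -l test-seeds.txt
```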
Options
There are several options controlling the applied processes. In addition to the following comprehensive list, you can also consult the options supported by each individual module.
-a,--agentname <arg> Agent name to identify the person or the organization
responsible for the crawl
-align,--align_sentences <arg> Sentence align document pairs using this aligner (default is
maligna)
-bs,--basename <arg> Basename to be used in generating all files for easier
content navigation
-cdm,--crawlduration <arg> Maximum crawl duration in minutes
-cc,--creative_commons Force the alignment process to generate a merged TMX with
sentence alignments only from document pairs for which an
open content license has been detected.
-cfg,--config <arg> Path to the XML configuration file
-crawl,--crawl Start a crawl
-d,--stay_in_webdomain Force the monolingual crawler to stay in a specific web
domain
-dbg,--debug Use debug level for logging
-dedup,--deduplicate Deduplicate and discard (near) duplicate documents
-del,--delete_redundant_files Delete redundant crawled documents that have not been
detected as members of a document pair
-dest,--destination <arg> Path to a directory where the acquired/generated resources
will be stored
-pdm,--pairDetectMethods <arg> When creating a merged TMX file, only use sentence alignments
from document pairs that have been identified by specific
methods, e.g. auidh. See the pdm option.
-dom,--domain <arg> A descriptive title for the targeted domain
-export,--export Export crawled documents to cesDoc XML files
-f,--force Force a new crawl. Caution: This will remove any previously
crawled data
-filter,--fetchfilter <arg> Use this regex to force the crawler to crawl only in specific
sub webdomains. Webpages with urls that do not match this
regex will not be fetched.
-h,--help This message
-i,--inputdir <arg> Input directory for deduplication, pairdetection, or
alignment
-ifp,--image_urls Full image URLs (and not only their basenames) will be used
in pair detection with common images
-k,--keepboiler Keep and annotate boilerplate content in parsed text
-l,--loggingAppender <arg> Logging appender (console, DRFA) to use
-lang,--languages <arg> Two or three letter ISO code(s) of target language(s), e.g.
el (for a monolingual crawl for Greek content) or eng;el (for
a bilingual crawl)
-len,--length <arg> Minimum number of tokens per text block. Shorter text blocks
will be annotated as "ooi-length"
-mtlen,--minlength <arg> Minimum number of tokens in crawled documents (after
boilerplate detection). Shorter documents will be discarded.
-cdl,--numloops <arg> Maximum number of fetch/update loops
-oxslt,--offline_xslt Apply an XSL transformation to generate HTML files during
export.
-p_r,--path_replacements <arg> Strings to be replaced, separated by ';'. This might be
useful when crawling via the web service
-pairdetect,--pair_detection Detect document pairs in crawled documents
-pdm,--pair_detection_methods <arg> A string forcing the crawler to detect pairs using one or
more specific methods: a (links between documents), u
(patterns in urls), p (common images and similar digit
sequences), i (common images), d (similar digit sequences),
h, m, or l (high/medium/low similarity of html structure)
-segtypes,--segtypes <arg> When creating a merged TMX file, only use sentence alignments
of specific types, e.g. 1:1
-storefilter,--storefilter <arg> Use this regex to force the crawler to store only webpages
with urls that match this regex.
-t,--threads <arg> Maximum number of fetcher threads to use
-tc,--topic <arg> Path to a file with the topic definition
-tmxmerge,--tmxmerge Merge aligned segments from each document pair into one tmx
file
-type,--type <arg> Crawl type: m (monolingual) or p (parallel)
-u,--urls <arg> File with seed urls used to initialize the crawl
-u_r,--url_replacements <arg> Strings to be replaced, separated by ';'.
Running modules of the ILSP-FC
The ILSP-FC, in a configuration for acquiring parallel data, applies the following processes one after the other:
- [[Crawl|Crawl]]
- [[Export|Export]]
- [[NearDeduplication|Near Deduplication]]
- [[PairDetection|Pair Detection]]
- [[SegmentAlignment|Segment Alignment]]
- [[TMXmerging|TMX Merging]]
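The later steps can also be run on previously acquired data: the -i option above indicates that deduplication, pair detection, and alignment accept an input directory from an earlier run. A hedged sketch of such an invocation (the paths, the language pair, and this exact flag combination are illustrative assumptions, not a documented command):

```shell
# Sketch: re-run pair detection, alignment, and TMX merging over an existing
# crawl directory. All paths and flag values below are illustrative.
java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml \
  -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar \
  -pairdetect -align -tmxmerge -type p -lang "eng;lv" \
  -pdm "aupdih" -segtypes "1:1" \
  -i crawlResults -dest crawlResults -bs "output_test"
```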