Getting Started
Once you have built or downloaded an ilsp-fc runnable jar, you can run it like this:
java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar
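To verify that the jar runs and to list all supported options, you can call it with the -h flag (replace X.Y.Z with the actual version, e.g. 2.2.3):

java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar -h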
Examples of running monolingual crawls
- Given a seed URL list ENV_EN_seeds.txt, the following example crawls the web for 5 minutes and constructs a collection containing English web pages.
java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar \
  -crawl -export -dedup -a test -f -type m -c 5 -lang en -k -u ENV_EN_seeds.txt -oxslt \
  -dest crawlResults -bs "output_test"
In this and the other example commands in this documentation, a log4j.xml file is used to set logging configuration details. An example log4j.xml file can be downloaded from here.
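The seed URL list passed with -u is a plain text file with one starting URL per line (as in the bilingual example further below). A minimal sketch of creating one; the URLs here are placeholders that you should replace with pages in your target language and domain:

cat > ENV_EN_seeds.txt <<'EOF'
http://example.org/environment/air-quality.html
http://example.org/environment/climate-change.html
EOF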
- Given a seed URL list ENV_EN_seeds.txt and a topic definition in English for the Environment domain, ENV_EN_topic.txt, the following example crawls the web for 10 cycles and constructs a collection containing English web pages related to this domain.
java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar \
  -crawl -export -dedup -a test1 -f -type m -n 10 -lang en -k -u ENV_EN_seeds.txt -oxslt \
  -tc ENV_EN_topic.txt -dom Environment -dest crawlResults -bs "output-test"
Example of running a bilingual crawl
This is a test example to verify that the whole workflow (crawl, export, deduplication, pair detection, alignment) completes successfully.
java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar /opt/ilsp-fc/ilsp-fc-2.2.3-SNAPSHOT-jar-with-dependencies.jar \
  -crawl -export -dedup -pairdetect -align -tmxmerge -f -k -oxslt -type p -n 1 -t 20 -len 0 -mtlen 100 \
  -pdm "aupdih" -segtypes "1:1" -lang "eng;lt;deu;lv" -a test -filter ".*www\.airbaltic\.com.*" \
  -u "/var/www/html/elrc/test-seeds" -dest "/var/www/html/elrc/test" \
  -bs "/var/www/html/elrc/test/output_test" &> "/var/www/html/elrc/test/log_test"
Seed URLs:
https://www.airbaltic.com/lv/bernu-atlaide
https://www.airbaltic.com/lv/profila-registracija
https://www.airbaltic.com/de/ermaessigung-kinder
https://www.airbaltic.com/de/profil-erstellen
https://www.airbaltic.com/en/child-discount
https://www.airbaltic.com/en/create-account
https://www.airbaltic.com/lt/child-discount
https://www.airbaltic.com/lt/sukurti-paskira
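As a sketch, the working directory and the seed file referenced by the command above can be prepared like this (the paths are those used in the example; the seed URLs are the ones listed above):

mkdir -p /var/www/html/elrc/test
cat > /var/www/html/elrc/test-seeds <<'EOF'
https://www.airbaltic.com/lv/bernu-atlaide
https://www.airbaltic.com/lv/profila-registracija
https://www.airbaltic.com/de/ermaessigung-kinder
https://www.airbaltic.com/de/profil-erstellen
https://www.airbaltic.com/en/child-discount
https://www.airbaltic.com/en/create-account
https://www.airbaltic.com/lt/child-discount
https://www.airbaltic.com/lt/sukurti-paskira
EOF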
Options
There are several options controlling the applied processes. Besides the following comprehensive list, you can also consult the options supported by each individual module. An example combining a few of these options follows the list.
-a,--agentname <arg>  Agent name to identify the person or the organization responsible for the crawl
-align,--align_sentences <arg>  Sentence-align document pairs using this aligner (default is maligna)
-bs,--basename <arg>  Basename to be used in generating all files, for easier content navigation
-c,--crawlduration <arg>  Maximum crawl duration in minutes
-cc,--creative_commons  Force the alignment process to generate a merged TMX with sentence alignments only from document pairs for which an open content license has been detected
-cfg,--config <arg>  Path to the XML configuration file
-crawl,--crawl  Start a crawl
-d,--stay_in_webdomain  Force the monolingual crawler to stay in a specific web domain
-dbg,--debug  Use debug level for logging
-dedup,--deduplicate  Deduplicate and discard (near-)duplicate documents
-del,--delete_redundant_files  Delete redundant crawled documents that have not been detected as members of a document pair
-dest,--destination <arg>  Path to a directory where the acquired/generated resources will be stored
-pdm,--pairDetectMethods <arg>  When creating a merged TMX file, only use sentence alignments from document pairs that have been identified by specific methods, e.g. auidh. See the pdm option below
-dom,--domain <arg>  A descriptive title for the targeted domain
-export,--export  Export crawled documents to cesDoc XML files
-f,--force  Force a new crawl. Caution: this will remove any previously crawled data
-filter,--fetchfilter <arg>  Use this regex to force the crawler to crawl only in specific sub-webdomains. Web pages with URLs that do not match this regex will not be fetched
-h,--help  Print this message
-i,--inputdir <arg>  Input directory for deduplication, pair detection, or alignment
-ifp,--image_urls  Use full image URLs (and not only their basenames) in pair detection with common images
-k,--keepboiler  Keep and annotate boilerplate content in parsed text
-l,--loggingAppender <arg>  Logging appender (console, DRFA) to use
-lang,--languages <arg>  Two- or three-letter ISO code(s) of the target language(s), e.g. el (for a monolingual crawl for Greek content) or eng;el (for a bilingual crawl)
-len,--length <arg>  Minimum number of tokens per text block. Shorter text blocks will be annotated as "ooi-length"
-mtlen,--minlength <arg>  Minimum number of tokens in crawled documents (after boilerplate detection). Shorter documents will be discarded
-n,--numloops <arg>  Maximum number of fetch/update loops
-oxslt,--offline_xslt  Apply an XSL transformation to generate HTML files during exporting
-p_r,--path_replacements <arg>  Strings to be replaced, separated by ';'. This might be useful for crawling via the web service
-pairdetect,--pair_detection  Detect document pairs in crawled documents
-pdm,--pair_detection_methods <arg>  A string forcing the crawler to detect pairs using one or more specific methods: a (links between documents), u (patterns in URLs), p (common images and similar digit sequences), i (common images), d (similar digit sequences), h, m or l (high/medium/low similarity of HTML structure)
-segtypes,--segtypes <arg>  When creating a merged TMX file, only use sentence alignments of specific types, e.g. 1:1
-storefilter,--storefilter <arg>  Use this regex to force the crawler to store only web pages with URLs that match this regex
-t,--threads <arg>  Maximum number of fetcher threads to use
-tc,--topic <arg>  Path to a file with the topic definition
-tmxmerge,--tmxmerge  Merge aligned segments from each document pair into one TMX file
-type,--type <arg>  Crawl type: m (monolingual) or p (parallel)
-u,--urls <arg>  File with seed URLs used to initialize the crawl
-u_r,--url_replacements <arg>  Strings to be replaced, separated by ';'
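For instance, combining -d with a fetch filter keeps a monolingual crawl inside a single web domain. A sketch, assuming a placeholder domain example.org and the seed list from the first example:

java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar \
  -crawl -export -dedup -a test -f -type m -c 5 -lang en -k -oxslt \
  -d -filter ".*example\.org.*" \
  -u ENV_EN_seeds.txt -dest crawlResults -bs "output_test"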
Running modules of the ILSP-FC
In a configuration for acquiring parallel data, the ILSP-FC applies the following processes one after the other (a sketch for re-running individual modules follows the list):
* Crawl
* Export
* Near Deduplication
* Pair Detection
* Segment Alignment
* TMX Merging
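These steps can also be run separately on existing data. As a sketch (the paths and language pair are illustrative, and the exact flag combinations each module supports are described in the per-module option pages), pair detection, segment alignment and TMX merging could be re-applied to the output of a previous parallel crawl by pointing -i at the directory holding the exported documents:

java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar \
  -pairdetect -align -tmxmerge -type p -lang "eng;deu" \
  -i crawlResults -dest crawlResults -bs "output_test"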