Version 114 - History - Getting Started - ILSP Focused Crawler - ILSP NLP

Getting Started » History » Version 114

Prokopis Prokopidis, 2014-08-15 03:35 PM

-Prokopis Prokopidis
+h1. Getting Started
 Prokopis Prokopidis
-Prokopis Prokopidis
+Once you [[DeveloperSetup|build]] or [[HowToGet|download]] an ilsp-fc runnable jar, you can run it like this:
 Prokopis Prokopidis
-Prokopis Prokopidis
+<pre><code>java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar</code></pre>
 Prokopis Prokopidis
-Vassilis Papavassiliou
+h2. Input
 Vassilis Papavassiliou
-Vassilis Papavassiliou
+For general monolingual crawls, the required input from the user is:
-Vassilis Papavassiliou
+* a list of seed URLs (i.e. a text file with one URL per text line).
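Before starting a crawl, it can help to sanity-check the seed file. The following is a minimal Python sketch, not part of ilsp-fc: the function name <code>read_seed_urls</code> and the rule that blank lines are skipped are assumptions made for illustration. It only verifies that every non-empty line is an absolute http(s) URL.

```python
from urllib.parse import urlparse


def read_seed_urls(text):
    """Return the seed URLs found in `text`, one per non-empty line.

    Hypothetical helper: ilsp-fc itself may accept or reject other inputs.
    """
    urls = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        candidate = line.strip()
        if not candidate:
            continue  # skipping blank lines is an assumption of this sketch
        parsed = urlparse(candidate)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(
                f"line {line_number}: not an absolute http(s) URL: {candidate!r}"
            )
        urls.append(candidate)
    return urls


# Example seed list content (placeholder URLs).
sample = "http://www.example.org/environment/\nhttps://www.example.org/climate\n"
print(read_seed_urls(sample))
```

In practice the text would come from the seed file itself, e.g. by reading the file passed to the crawler with <code>-u</code>.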
 Prokopis Prokopidis
-Vassilis Papavassiliou
+In case of focused monolingual crawls (i.e. when the crawler visits/processes/stores web pages that are related to a targeted domain), the input should include:
-Vassilis Papavassiliou
+* a list of seed URLs pointing to relevant web pages. An example seed URL list for _Environment_ in English can be found at [[ENV_EN_seeds.txt]].
-Prokopis Prokopidis
+* a list of term triplets (_<relevance,term,subtopic>_) that describe a domain and, optionally, subcategories of this domain. This list is required when the user aims to acquire domain-specific documents. An example domain definition can be found at [[ENV_EN_topic.txt]] for the _Environment_ domain in English. Details on how to construct/bootstrap such lists and how they are used in text-to-topic classification can be found in this paper: http://www.aclweb.org/anthology/W13-2506.pdf
 Prokopis Prokopidis
-Vassilis Papavassiliou
+In case of general bilingual crawling, the input from the user includes:
-Vassilis Papavassiliou
+* a seed URL list which should contain URL(s) from only one web site (e.g. [[ENV_EN_ES_seed.txt]]). The crawler will follow only links pointing to pages inside this web site. However, the user can use the <code>filter</code> parameter (see below) to restrict the crawl to links pointing to pages either inside versions of the top domain of the URL (e.g. http://www.fifa.com/, http://es.fifa.com/, etc.) or in different web sites (i.e. in cases where the translations are in two web sites, e.g. http://www.nrcan.gc.ca and http://www.rncan.gc.ca). Examples of seed URLs can be found at [[seed_examples.txt]].
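A quick way to see whether a bilingual seed list stays within one web site is to compare the host names of its URLs. The sketch below is illustrative Python and not part of ilsp-fc. The helper names <code>seed_hosts</code> and <code>spans_several_hosts</code> are hypothetical, and treating "web site" as "host name" is an assumption; the crawler may define a web site differently.

```python
from urllib.parse import urlparse


def seed_hosts(urls):
    """Return the sorted distinct host names used by the seed URLs."""
    hosts = set()
    for url in urls:
        host = urlparse(url).hostname  # urlparse lower-cases the host name
        if host is None:
            raise ValueError(f"no host name in seed URL: {url!r}")
        hosts.add(host)
    return sorted(hosts)


def spans_several_hosts(urls):
    """True when the seeds use more than one host name.

    Assumption of this sketch: this is the situation in which the
    filter parameter described above becomes relevant.
    """
    return len(seed_hosts(urls)) > 1


seeds = ["http://www.fifa.com/", "http://es.fifa.com/"]
print(seed_hosts(seeds), spans_several_hosts(seeds))
```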
 Prokopis Prokopidis
-Prokopis Prokopidis
+In case of focused bilingual crawls, the input should also include:
-Vassilis Papavassiliou
+* a list of term triplets (_<relevance,term,subtopic>_) that describe a domain and, optionally, subcategories of this domain in both targeted languages (i.e. the union of the domain definitions in each language). This list is required when the user aims to acquire domain-specific documents. An example domain definition of _Environment_ for the English-Spanish pair can be found at [[ENV_EN_ES_topic.txt]].
 Vassilis Papavassiliou
-Prokopis Prokopidis
+h2. Language support
 Prokopis Prokopidis
-Vassilis Papavassiliou
+For both monolingual and bilingual crawling, the set of currently supported languages comprises de, el, en, es, fr, hr, it, ja, and pt.
 Prokopis Prokopidis
-Prokopis Prokopidis
+In order to add another language, a developer/user should:
-Prokopis Prokopidis
+* verify that the targeted language is supported by the default language identifier (https://code.google.com/p/language-detection/) integrated in the ILSP-FC,
-Prokopis Prokopidis
+* add a line with the proper content to the [[langKeys.txt]] file, which is included in the ilsp-fc runnable jar, and
-Prokopis Prokopidis
+* add a proper analyser in the <code>gr.ilsp.fmc.utils.AnalyserFactory</code> class of the ilsp-fc source.
 Prokopis Prokopidis
-Prokopis Prokopidis
+h2. Other settings
 Prokopis Prokopidis
-Vassilis Papavassiliou
+Several settings influence the crawling process and can be defined in a configuration file before the crawl starts. The default configuration files for monolingual and bilingual crawls are [[FMC_config.xml]] and [[FBC_config.xml]] respectively. They are included in the ilsp-fc runnable jar.
 Prokopis Prokopidis
-Prokopis Prokopidis
+Some of the settings can also be overridden using options of the ilsp-fc runnable jar, as follows:
 Prokopis Prokopidis
-Vassilis Papavassiliou
+<pre><code>
-Vassilis Papavassiliou
+crawlandexport : Forces the crawler to crawl and export the results.
-Vassilis Papavassiliou
+-a : user agent name (required)
-Vassilis Papavassiliou
+-type : the type of crawl: monolingual (m) or parallel (p).
-Prokopis Prokopidis
+-cfg : the configuration file that will be used instead of the default (see FMC_config.xml and FBC_config.xml above).
-Vassilis Papavassiliou
+-c : the crawl duration in minutes. Since the crawler runs in cycles (during which links stored at the top of
-Vassilis Papavassiliou
+     the crawler’s frontier are extracted and new links are examined) it is very likely that the defined time
-Vassilis Papavassiliou
+     will expire during a cycle run. Then, the crawler will stop only after the end of the running cycle.
-Vassilis Papavassiliou
+     The default value is 10 minutes.
-Vassilis Papavassiliou
+-n : the crawl duration in cycles. The default is 1. This parameter is recommended for testing purposes.
-Vassilis Papavassiliou
+-t : the number of threads that will be used to fetch web pages in parallel.
-Vassilis Papavassiliou
+-f : Forces the crawler to start a new job (required).
-Vassilis Papavassiliou
+-lang : the targeted language in case of monolingual crawling (required).
-Prokopis Prokopidis
+-l1 : the first targeted language in case of bilingual crawling (required).
-Vassilis Papavassiliou
+-l2 : the second targeted language in case of bilingual crawling (required).
-Vassilis Papavassiliou
+-u : the text file that contains the seed URLs that will initialize the crawler. In case of bilingual crawling
-Vassilis Papavassiliou
+     the list should contain only 1 or 2 URLs from the same web domain.
-Vassilis Papavassiliou
+-tc : domain definition (a text file that contains a list of term triplets that describe the targeted
-Vassilis Papavassiliou
+      domain). If omitted, the crawl will be a "general" one (i.e. module for text-to-domain
-Vassilis Papavassiliou
+      classification will not be used).
-Vassilis Papavassiliou
+-k : Forces the crawler to annotate boilerplate content in parsed text.
-Vassilis Papavassiliou
+-filter : A regular expression to filter out URLs which do NOT match this regex.
-Prokopis Prokopidis
+          The use of this filter forces the crawler to either focus on a specific
-Prokopis Prokopidis
+          web domain (e.g. ".*ec.europa.eu.*"), or on a part of a web domain
-Prokopis Prokopidis
+          (e.g.".*/legislation_summaries/environment.*"). Note that if this filter
-Vassilis Papavassiliou
+          is used, only the seed URLs that match this regex will be fetched.
-Vassilis Papavassiliou
+-u_r : This parameter should be used for bilingual crawling when there is an already known pattern in URLs
-Vassilis Papavassiliou
+       which implies that one page is the candidate translation of the other. It includes the two strings
-Prokopis Prokopidis
+       to be replaced separated by ';'.
-Prokopis Prokopidis
+-d : Forces the crawler to stay in a web site (i.e. starts from a web site and extracts only links to pages
-Prokopis Prokopidis
+     inside the same web site). It should be used only for monolingual crawling.
-Vassilis Papavassiliou
+-len : Minimum number of tokens per paragraph. If the length (in terms of tokens) of a paragraph is
-Vassilis Papavassiliou
+       less than this value (default is 3) the paragraph will be annotated as "out of interest" and
-Vassilis Papavassiliou
+       will not be included into the clean text of the web page.
-Vassilis Papavassiliou
+-mtlen : Minimum number of tokens in cleaned document. If the length (in terms of tokens) of the cleaned
-Vassilis Papavassiliou
+        text is less than this value (default is 200), the document will not be stored.
-Vassilis Papavassiliou
+-align : Extracts sentences from the detected document pairs and aligns the extracted sentences
-Vassilis Papavassiliou
+         by using an aligner (default is hunalign).
-Vassilis Papavassiliou
+-dict :  Uses this dictionary for the sentence alignment. If no argument is given, the default dictionary
-Vassilis Papavassiliou
+         of the aligner will be used, if one exists.
-Prokopis Prokopidis
+-xslt : Insert a stylesheet for rendering xml results as html.
-Vassilis Papavassiliou
+-oxslt : Export crawl results with the help of an xslt file for better examination of results.
-Vassilis Papavassiliou
+-dom : Title of the targeted domain (required when domain definition, i.e. tc parameter, is used).
-Vassilis Papavassiliou
+-dest : The directory where the results (i.e. the crawled data) will be stored.
-Vassilis Papavassiliou
+-of : A text file containing a list with the exported XML files (see section Output below).
-Vassilis Papavassiliou
+-ofh : An HTML file containing a list with the generated XML files (see section Output below).
-Vassilis Papavassiliou
+-oft : A text file containing a list with the exported TMX files (see section Output below).
-Vassilis Papavassiliou
+-ofth : An HTML file containing a list with the generated TMX files (see section Output below).
-Prokopis Prokopidis
+</code></pre>
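To illustrate how the <code>-filter</code> option selects URLs, here is a small Python sketch applying the two example patterns from the list above. It assumes that the expression must match the whole URL, as the leading and trailing <code>.*</code> in the examples suggest. ilsp-fc is a Java program, so its regex dialect and matching rules may differ from Python's. The helper name <code>passes_filter</code> and the sample URLs are illustrative only.

```python
import re


def passes_filter(url, pattern):
    """True if the whole URL matches `pattern` (full-match semantics assumed)."""
    return re.fullmatch(pattern, url) is not None


# The two example patterns from the option description.
domain_filter = r".*ec.europa.eu.*"
section_filter = r".*/legislation_summaries/environment.*"

# Sample URLs chosen for illustration.
urls = [
    "http://ec.europa.eu/environment/index_en.htm",
    "http://europa.eu/legislation_summaries/environment/index_en.htm",
    "http://www.example.org/",
]
for url in urls:
    print(url, passes_filter(url, domain_filter), passes_filter(url, section_filter))
```

Note that an unescaped dot matches any character, so <code>".*ec.europa.eu.*"</code> would also accept a host such as <code>ecXeuropa.eu</code>. Escaping the dots (<code>".*ec\.europa\.eu.*"</code>) makes the pattern stricter.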
 Prokopis Prokopidis
-Prokopis Prokopidis
+h2. Run a monolingual crawl
 Prokopis Prokopidis
-Prokopis Prokopidis
+<pre><code>java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar crawlandexport -a vpapa@ilsp.gr \
-Vassilis Papavassiliou
+                -cfg FMC_config.xml -type m -c 10 -lang en -of output_test1_list.txt \
-Vassilis Papavassiliou
+                -ofh output_test1_list.txt.html -tc ENV_EN_topic.txt  \
-Vassilis Papavassiliou
+                -u ENV_EN_seeds.txt -f -k -dom Environment</code></pre>
 Prokopis Prokopidis
-Prokopis Prokopidis
+<pre><code>java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar crawlandexport -a test2 \
-Vassilis Papavassiliou
+                -f -k -type m -c 5 -lang es -of output_test2_list.txt \
-Vassilis Papavassiliou
+                -ofh output_test2_list.txt.html -u seed_examples.txt
-Vassilis Papavassiliou
+                </code></pre>
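When crawls are launched from a script, the command line can be assembled programmatically. The following Python sketch is not part of ilsp-fc and only combines options documented above. The function name <code>build_monolingual_command</code> and its parameters are hypothetical, and the jar name keeps the X.Y.Z placeholder.

```python
import shlex


def build_monolingual_command(jar, agent, lang, seeds, minutes=10,
                              topic=None, domain=None, output_list=None):
    """Assemble the argument list for a monolingual crawl (illustrative helper)."""
    args = ["java", "-jar", jar, "crawlandexport",
            "-a", agent, "-f", "-type", "m",
            "-c", str(minutes), "-lang", lang, "-u", seeds]
    if topic is not None:
        # The option list states that -dom is required when -tc is used.
        if domain is None:
            raise ValueError("-dom is required when -tc is used")
        args += ["-tc", topic, "-dom", domain]
    if output_list is not None:
        args += ["-of", output_list, "-ofh", output_list + ".html"]
    return args


command = build_monolingual_command(
    "ilsp-fc-X.Y.Z-jar-with-dependencies.jar", "test2", "es",
    "seed_examples.txt", minutes=5, output_list="output_test2_list.txt")
print(shlex.join(command))
# To actually start the crawl: subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems, for example with regular expressions given to <code>-filter</code>.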
 Vassilis Papavassiliou
-Vassilis Papavassiliou
+h2. Run a bilingual crawl
 Vassilis Papavassiliou
-Vassilis Papavassiliou
+<pre><code>java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar crawlandexport -a test3 -c 10 -f -k -l1 de -l2 it \
-Vassilis Papavassiliou
+                -of test_HS_DE-IT_output.txt -ofh test_HS_DE-IT_output.txt.html -tc HS_DE-IT_topic.txt \
-Vassilis Papavassiliou
+                -type p -u seed_examples.txt -cfg FBC_config.xml -dom HS -len 0 -mtlen 100 -xslt -oxslt</code></pre>
 Vassilis Papavassiliou
-Vassilis Papavassiliou
+<pre><code>java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar crawlandexport -a test4 -c 10 -f -k -l1 es -l2 en \
-Vassilis Papavassiliou
+                -type p -u seed_examples.txt -filter ".*uefa.com.*" \
-Prokopis Prokopidis
+                -len 0 -mtlen 80 -xslt -oxslt -dest "/var/crawl_results/" \
-Vassilis Papavassiliou
+                -of test_U_ES-EN_output.txt -ofh test_U_ES-EN_output.txt.html \
-Vassilis Papavassiliou
+                -oft test_U_ES-EN_output.tmx.txt -ofth test_U_ES-EN_output.tmx.html \
-Prokopis Prokopidis
+                -align -dict </code></pre>
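The bilingual examples above lower the <code>-len</code> and <code>-mtlen</code> thresholds described under Other settings. The Python sketch below illustrates how the two thresholds interact, following that description. It is not the crawler's implementation: tokens are approximated by whitespace splitting, and the function name <code>clean_document</code> is hypothetical.

```python
def clean_document(paragraphs, min_paragraph_tokens=3, min_document_tokens=200):
    """Return the kept paragraphs, or None when the cleaned text is too short.

    Whitespace tokenization is an assumption of this sketch.
    """
    # -len: paragraphs shorter than the threshold are left out of the clean text.
    kept = [p for p in paragraphs if len(p.split()) >= min_paragraph_tokens]
    # -mtlen: documents whose clean text is shorter than the threshold are not stored.
    total_tokens = sum(len(p.split()) for p in kept)
    if total_tokens < min_document_tokens:
        return None
    return kept


page = ["Home", "Contact us", " ".join(["word"] * 10)]
print(clean_document(page))                              # defaults: too short to store
print(clean_document(page, min_document_tokens=5))       # only the long paragraph is kept
print(clean_document(page, min_paragraph_tokens=0, min_document_tokens=12))
```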
 Prokopis Prokopidis
-Vassilis Papavassiliou
+h2. Output
 Vassilis Papavassiliou
-Vassilis Papavassiliou
+The output of the ilsp-fc in the case of a monolingual crawl consists of:
-Vassilis Papavassiliou
+* a list of links pointing to XML files following the cesDOC Corpus Encoding Standard (http://www.xces.org/). See this [[cesDOC_file]] for an example in English for the Environment domain.
-Prokopis Prokopidis
+* a list of links pointing to HTML files (by XSL transformation of each XML) for easier browsing of the collection. As an example, see this [[rendered_cesDOC_file]].
 Vassilis Papavassiliou
-Vassilis Papavassiliou
+The output of the ilsp-fc in the case of a bilingual crawl consists of:
-Prokopis Prokopidis
+* a list of links to XML files following the cesAlign Corpus Encoding Standard for linking cesDoc documents. This example "cesAlign":http://nlp.ilsp.gr/ file serves as a link between a detected pair of cesDoc documents in "English":http://nlp.ilsp.gr/ and "Spanish":http://nlp.ilsp.gr/.
-Prokopis Prokopidis
+* a list of links pointing to HTML files (by XSL transformation of each cesAlign XML) for easier browsing of the collection. As an example, see this "rendered cesAlign":http://nlp.ilsp.gr/xslt/ilsp-fc/16_71_h.xml.html file.
-Prokopis Prokopidis
+* a list of links to TMX files containing sentence alignments that have been extracted from the detected document pairs. As an example, see this "TMX":http://nlp.ilsp.gr/xslt/ilsp-fc/16_71_h.tmx file.
-Prokopis Prokopidis
+* a list of links pointing to HTML files (by XSL transformation of each TMX) for easier browsing of the collection. As an example, see this "rendered TMX":http://nlp.ilsp.gr/xslt/ilsp-fc/16_71_h.html file.
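The exported TMX files can be inspected with any XML library. The following Python sketch reads translation units with the standard library. It assumes plain TMX 1.4 markup (<code>tu</code>, <code>tuv</code> and <code>seg</code> elements without an XML namespace, and the language given in <code>xml:lang</code>). The files produced by ilsp-fc may carry additional attributes or properties that are ignored here. The function name and the sample content are illustrative.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under its namespace-qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_translation_units(tmx_text):
    """Return one {language: segment text} dict per <tu> element."""
    root = ET.fromstring(tmx_text)
    units = []
    for tu in root.iter("tu"):
        unit = {}
        for tuv in tu.iter("tuv"):
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None:
                unit[lang] = "".join(seg.itertext()).strip()
        units.append(unit)
    return units


# Minimal hand-written sample; header attributes are omitted for brevity.
sample_tmx = """<tmx version="1.4">
  <header/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Environment</seg></tuv>
      <tuv xml:lang="es"><seg>Medio ambiente</seg></tuv>
    </tu>
  </body>
</tmx>"""

for unit in read_translation_units(sample_tmx):
    print(unit)
```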
