Version 3 - History - Export - ILSP Focused Crawler - ILSP NLP

Export » History » Version 3

Version 2 (Vassilis Papavassiliou, 2016-05-31 03:39 PM) → Version 3/9 (Vassilis Papavassiliou, 2016-05-31 03:55 PM)

# Export

After crawling, the export process comes. The Exporter module generates a cesDoc file for each stored web document. Each file contains metadata (e.g. language, domain, URL, etc.) about the corresponding document inside a header element. Moreover, a `<body>` element contains the content of the document segmented in paragraphs. Apart from normalized text, each paragraph element `<p>` is enriched with attributes providing more information about the process outcome. Once a crawl has finished, the acquired data can be exported with:

~~~
java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar /opt/ilsp-fc/ilsp-fc-2.2.3-jar-with-dependencies.jar /opt/ilsp-fc/ilsp-fc-2.2.2-jar-with-dependencies.jar \
-export -lang "en;it" -len 0 -mtlen 100 \
-i (crawlpath up to the auto-generated dir) -tc (full path of topic file) \
-dom (title of targeted topic) -bs -of (fullpath and basename on which all files for easier content navigation will be generated) of file with paths of generated cesDoc) \
-ofh (fullpath of file with paths of generated transformed cesDoc) \
-oxslt &>"/var/www/html/tests/eng-ita/log-export_www_esteri_it_eng-ita"
~~~

-export : for exporting process
-i -of : crawlpath up to fullpath of text file containing a list with fullpaths of the auto-generated dir by the crawl module exported cesDoc files.
-lang -xslt : two or three letter ISO code(s) of target language(s),
e.g. el (for Insert a monolingual crawl stylesheet for Greek content) or en;el (for a bilingual crawl)
CesDoc files will be generated only for crawled web documents that are in the targeted language(s) rendering xml results as html.
-tc -ofh : fullpath of topic HTML file (a text file that contains containing a list of term triplets that describe with the targeted topic). generated XML files.
-dom : title of the targeted domain (required when domain definition, i.e. tc parameter, is used).
-bs : Basename to be used in generating all files for easier content navigation
-oxslt : Export crawl results with the help of an xslt file for better examination of results.
-len : Minimum number of tokens per paragraph. If the length (in terms of tokens) of a paragraph is
less than this value the paragraph will be annotated as "out of interest" and will not be included into the clean text of the web page.
-mtlen : Minimum number of tokens in cleaned document. If the length (in terms of tokens) of the cleaned text is less than this value, the document will not be stored.

The tool will create the directory "xml" next to the "run" directories. In this directory the downloaded documents (html, pdf) and the generated XML files will be stored. Each file is named by the language iso code (i.e. eng for an English document) followed by a unique id (e.g. eng-1.xml and eng-1.html, eng-2.xml and eng-2.html, etc.)

Project

General

Profile

ILSP Focused Crawler

Export » History » Version 3