Version 2 - History - Export - ILSP Focused Crawler - ILSP NLP

Export » History » Version 2

Vassilis Papavassiliou, 2016-05-31 03:39 PM

# Export

Export is the step that follows crawling. The Exporter module generates a cesDoc file for each stored web document. Each file holds metadata about the corresponding document (e.g. language, domain, URL) inside a header element, and a `<body>` element holds the content of the document segmented into paragraphs. Besides the normalized text, each paragraph element `<p>` carries attributes that give more information about the outcome of processing. Once a crawl has finished, the acquired data can be exported with:

~~~
java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar /opt/ilsp-fc/ilsp-fc-2.2.2-jar-with-dependencies.jar \
-export -lang "en;it" -len 0 -mtlen 100 \
-i (crawlpath up to the auto-generated dir) -tc (full path of topic file) \
-dom (title of targeted topic) -of (fullpath of file with paths of generated cesDoc) \
-ofh (fullpath of file with paths of generated transformed cesDoc) \
-oxslt &>"/var/www/html/tests/eng-ita/log-export_www_esteri_it_eng-ita"
~~~

- `-export`: run the export process.
- `-of`: full path of a text file that lists the full paths of the exported cesDoc files.
- `-xslt`: insert a stylesheet so that the XML results are rendered as HTML.
- `-ofh`: full path of an HTML file that lists the generated XML files.
- `-oxslt`: export the crawl results with the help of an XSLT file, which makes the results easier to examine.

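To illustrate how the `-of` list and the exported cesDoc files might be consumed afterwards, here is a minimal Python sketch. It is not part of the crawler. It assumes that the list file holds one path per line and that the `<body>` and `<p>` elements may carry an XML namespace, so elements are matched by local name only. The function names `read_export_list` and `body_paragraphs` are hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def local_name(tag):
    """Reduce a namespaced tag such as '{uri}p' to its local name 'p'."""
    return tag.rsplit("}", 1)[-1]


def read_export_list(list_file):
    """Return the paths listed in the -of file (assumed: one path per line)."""
    text = Path(list_file).read_text(encoding="utf-8")
    return [Path(line.strip()) for line in text.splitlines() if line.strip()]


def body_paragraphs(cesdoc_path):
    """Return (attributes, text) for every <p> element found inside <body>."""
    root = ET.parse(cesdoc_path).getroot()
    paragraphs = []
    for element in root.iter():
        if local_name(element.tag) != "body":
            continue
        for child in element.iter():
            if local_name(child.tag) == "p":
                text = "".join(child.itertext()).strip()
                paragraphs.append((dict(child.attrib), text))
    return paragraphs
```

Calling `body_paragraphs(path)` for each `path` returned by `read_export_list(...)` gives the paragraph text together with whatever attributes the Exporter attached to each `<p>`.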
The tool creates a directory named "xml" next to the "run" directories. This directory stores the downloaded documents (HTML, PDF) and the generated XML files. Each file is named with the ISO code of its language (e.g. eng for an English document) followed by a unique id (e.g. eng-1.xml and eng-1.html, eng-2.xml and eng-2.html, etc.).
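The naming scheme makes it straightforward to pair each downloaded document with its generated XML file. The following Python sketch groups file names by language code and id. It assumes a three-letter lowercase language code and a numeric id, as in the examples above; the pattern and the function name `group_exported_files` are illustrative and not part of the crawler.

```python
import re
from collections import defaultdict

# Assumed shape: <three-letter language code>-<numeric id>.<extension>
NAME_PATTERN = re.compile(r"^([a-z]{3})-(\d+)\.(\w+)$")


def group_exported_files(filenames):
    """Map (language, id) to a dict of extension -> file name."""
    groups = defaultdict(dict)
    for name in filenames:
        match = NAME_PATTERN.match(name)
        if match is None:
            continue  # skip files that do not follow the naming scheme
        lang, doc_id, extension = match.groups()
        groups[(lang, int(doc_id))][extension] = name
    return dict(groups)
```

For example, passing the result of `os.listdir` on the "xml" directory yields one entry per exported document, so a document whose entry lacks an `xml` key has no generated cesDoc file.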
