Export » History » Version 2

Vassilis Papavassiliou, 2016-05-31 03:39 PM

# Export

Once a crawl has finished, the acquired data can be exported. The Exporter module generates a cesDoc file for each stored web document. Each file holds metadata about the corresponding document (e.g. language, domain, URL) inside a header element, while a `<body>` element holds the document content segmented into paragraphs. Besides the normalized text, each paragraph element `<p>` carries attributes that provide more information about the process outcome. The export is run with:

~~~
java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar /opt/ilsp-fc/ilsp-fc-2.2.2-jar-with-dependencies.jar \
-export -lang "en;it" -len 0 -mtlen 100 \
-i (crawlpath up to the auto-generated dir) -tc (full path of topic file) \
-dom (title of targeted topic) -of (fullpath of file with paths of generated cesDoc) \
-ofh (fullpath of file with paths of generated transformed cesDoc) \
-oxslt &>"/var/www/html/tests/eng-ita/log-export_www_esteri_it_eng-ita"
~~~
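When the export step is driven from a script, it can help to assemble the invocation as an argument list rather than as one long string. The following is a minimal Python sketch under stated assumptions: the function name `build_export_command` and all paths passed to it are hypothetical, the jar and log4j locations are the ones shown in the command above, and the `-len` and `-mtlen` values are copied from that command without interpretation. Nothing is executed; the list is only printed.

```python
import shlex

# Locations taken from the command shown on this page.
JAR = "/opt/ilsp-fc/ilsp-fc-2.2.2-jar-with-dependencies.jar"
LOG4J = "file:/opt/ilsp-fc/log4j.xml"


def build_export_command(crawl_dir, topic_file, domain, of_path, ofh_path,
                         langs=("en", "it")):
    """Return the export invocation as an argument list (nothing is run)."""
    return [
        "java", f"-Dlog4j.configuration={LOG4J}",
        "-jar", JAR,
        "-export",
        "-lang", ";".join(langs),       # e.g. "en;it"
        "-len", "0", "-mtlen", "100",   # values copied from the page's command
        "-i", crawl_dir,                # crawl path up to the auto-generated dir
        "-tc", topic_file,              # full path of the topic file
        "-dom", domain,                 # title of the targeted topic
        "-of", of_path,                 # file listing the generated cesDoc files
        "-ofh", ofh_path,               # file listing the transformed cesDoc files
        "-oxslt",
    ]


# All argument values below are hypothetical placeholders.
cmd = build_export_command(
    crawl_dir="/data/crawls/example/auto-generated-dir",
    topic_file="/data/topics/example-topic.txt",
    domain="example topic",
    of_path="/data/out/cesdoc-list.txt",
    ofh_path="/data/out/cesdoc-list.html",
)
print(shlex.join(cmd))  # shell-quoted form, suitable for logging
```

Passing such a list to `subprocess.run` avoids shell quoting problems with values like `en;it`; the `&>` redirection in the original command would then be replaced by the caller's own handling of the process output.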
-export	:	runs the export process.
-of	:	full path of a text file listing the full paths of the exported cesDoc files.
-xslt	:	inserts a stylesheet for rendering the XML results as HTML.
-ofh	:	full path of an HTML file listing the generated XML files.
-oxslt	:	exports the crawl results with the help of an XSLT file, for easier examination of the results.

The tool creates the directory "xml" next to the "run" directories. The downloaded documents (html, pdf) and the generated XML files are stored there. Each file is named with the ISO language code (e.g. eng for an English document) followed by a unique id (e.g. eng-1.xml and eng-1.html, eng-2.xml and eng-2.html).
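As a rough illustration of how the exported files might be consumed, the Python sketch below splits an exported file name into its language code and id, and collects the `<p>` elements found inside `<body>` together with their attributes. Several things here are assumptions and not taken from this page: the helper names, the three-lowercase-letter pattern for the language code, the idea that pdf files follow the same naming scheme, and the sample document, whose element and attribute names other than `<body>` and `<p>` are made up. Real cesDoc files may use an XML namespace, so tags are compared by local name only.

```python
import re
import xml.etree.ElementTree as ET

# Assumed naming pattern: three-letter language code, dash, numeric id.
NAME_RE = re.compile(r"^([a-z]{3})-(\d+)\.(xml|html|pdf)$")


def parse_export_name(filename):
    """Split 'eng-1.xml' into ('eng', 1, 'xml'); return None for other names."""
    match = NAME_RE.match(filename)
    if match is None:
        return None
    return match.group(1), int(match.group(2)), match.group(3)


def local_name(tag):
    """Drop a namespace prefix of the form '{uri}name', if present."""
    return tag.rsplit("}", 1)[-1]


def read_paragraphs(xml_text):
    """Return (attributes, text) for every <p> element inside <body>."""
    root = ET.fromstring(xml_text)
    paragraphs = []
    for elem in root.iter():
        if local_name(elem.tag) != "body":
            continue
        for child in elem.iter():
            if local_name(child.tag) == "p":
                paragraphs.append((dict(child.attrib), (child.text or "").strip()))
    return paragraphs


# Hypothetical, heavily simplified stand-in for an exported file.
SAMPLE = """<cesDoc>
  <cesHeader><language>eng</language></cesHeader>
  <text><body>
    <p id="p1">First paragraph.</p>
    <p id="p2" note="hypothetical">Second paragraph.</p>
  </body></text>
</cesDoc>"""

print(parse_export_name("eng-1.xml"))
for attrs, text in read_paragraphs(SAMPLE):
    print(attrs, text)
```

Keeping the attributes alongside the text makes it possible to filter paragraphs later on whatever per-paragraph information the Exporter has recorded.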