Version 1 (Prokopis Prokopidis, 2016-02-16 12:28 PM) → Version 2/9 (Vassilis Papavassiliou, 2016-05-31 03:39 PM)
# Export
After crawling has completed, the export process follows. The Exporter module generates a cesDoc file for each stored web document. Each file contains metadata about the corresponding document (e.g. language, domain, URL) inside a header element, while a `<body>` element contains the content of the document segmented into paragraphs. Apart from normalized text, each paragraph element `<p>` is enriched with attributes providing more information about the outcome of the process. Once a crawl has finished, the acquired data can be exported with:
~~~
java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar /opt/ilsp-fc/ilsp-fc-2.2.2-jar-with-dependencies.jar \
-export -lang "en;it" -len 0 -mtlen 100 \
-i (crawlpath up to the auto-generated dir) -tc (full path of topic file) \
-dom (title of targeted topic) -of (fullpath of file with paths of generated cesDoc) \
-ofh (fullpath of file with paths of generated transformed cesDoc) \
-oxslt -xslt &>"/var/www/html/tests/eng-ita/log-export_www_esteri_it_eng-ita"
~~~
* `-export` : run the export process.
* `-of` : full path of a text file that will contain the full paths of the exported cesDoc files.
* `-xslt` : insert a stylesheet reference for rendering the XML results as HTML.
* `-ofh` : full path of an HTML file that will contain a list of the generated XML files.
* `-oxslt` : export an XSLT file alongside the crawl results for easier examination.
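To make the description above concrete, a generated cesDoc file roughly follows the shape sketched below. This is an illustrative sketch only: the header layout and the paragraph attribute names (`id`, `crawlinfo`, `type`) are assumptions, and the exact output varies with the ILSP-FC version.

~~~
<cesDoc version="1.0">
  <cesHeader>
    <!-- document metadata: language, domain, source URL, etc. -->
  </cesHeader>
  <text>
    <body>
      <!-- attribute names here are hypothetical -->
      <p id="p1" crawlinfo="..." type="...">Normalized paragraph text.</p>
      <p id="p2" crawlinfo="..." type="...">Another paragraph.</p>
    </body>
  </text>
</cesDoc>
~~~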
The tool creates a directory named `xml` next to the `run` directories, in which the downloaded documents (HTML, PDF) and the generated XML files are stored. Each file is named after the ISO language code of the document (e.g. `eng` for an English document) followed by a unique id (e.g. `eng-1.xml` and `eng-1.html`, `eng-2.xml` and `eng-2.html`, etc.).
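Exported cesDoc files are plain XML, so they can be post-processed with any XML library. The sketch below extracts the paragraph texts from a cesDoc file; the element layout (a body of `<p>` elements) follows the description above, but the exact element names in real ILSP-FC output are an assumption, and the sample document here is synthetic.

~~~python
# Sketch: extract paragraph texts from an exported cesDoc file.
# The <p>-based layout is assumed from the description above.
import os
import tempfile
import xml.etree.ElementTree as ET

def paragraphs(cesdoc_path):
    """Return the normalized text of every <p> element in a cesDoc file."""
    tree = ET.parse(cesdoc_path)
    # Strip any namespace prefix so the lookup works whether or not
    # the file declares an XCES namespace.
    return [
        "".join(el.itertext()).strip()
        for el in tree.getroot().iter()
        if el.tag.split("}")[-1] == "p"
    ]

# Synthetic stand-in for a file from the xml/ directory, e.g. eng-1.xml.
sample = """<cesDoc version="1.0">
  <cesHeader><title>example</title></cesHeader>
  <text><body>
    <p id="p1">First paragraph.</p>
    <p id="p2">Second paragraph.</p>
  </body></text>
</cesDoc>"""

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "eng-1.xml")
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)
    paras = paragraphs(path)
    print(len(paras))  # 2
    print(paras[0])    # First paragraph.
~~~

The same function can be mapped over the list of paths written by the `-of` option to process an entire crawl.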