Export » History » Version 8

Vassilis Papavassiliou, 2016-05-31 04:02 PM

# Export

Export is the step that follows crawling. The Exporter module generates a cesDoc file for each stored web document. Each file holds metadata about the corresponding document (e.g. language, domain, URL) in a header element, and the content of the document, segmented into paragraphs, in a `<body>` element. In addition to the normalized text, each paragraph element `<p>` carries attributes that provide more information about the processing outcome. Once a crawl has finished, the acquired data can be exported with:

~~~
java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar /opt/ilsp-fc/ilsp-fc-2.2.3-jar-with-dependencies.jar \
-export -lang "en;it" -len 0 -mtlen 100 \
-i (crawlpath up to the auto-generated dir) -tc (full path of topic file) \
-dom (title of targeted topic) -bs (fullpath and basename on which all files for easier content navigation will be generated) \
-oxslt &>"/var/www/html/tests/eng-ita/log-export_www_esteri_it_eng-ita"
~~~
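
The placeholders in the command above can be filled in from a small wrapper. The following is a minimal sketch, not part of ilsp-fc: the function name `build_export_args`, its argument order, and the example paths are hypothetical, and it assumes that `-tc` and `-dom` can be left out when no topic definition was used for the crawl. It prints the arguments one per line instead of running `java`, so the result can be inspected first.

~~~shell
#!/bin/sh
# Sketch: assemble the arguments of the export command.
# Usage: build_export_args LANGS CRAWL_DIR BASENAME [TOPIC_FILE DOMAIN_TITLE]
# The function name and argument order are illustrative only.
build_export_args() {
  langs=$1
  crawl_dir=$2
  basename_prefix=$3
  topic_file=${4:-}
  domain_title=${5:-}

  # Fixed part of the command, taken from the invocation shown above.
  set -- java "-Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml" \
    -jar /opt/ilsp-fc/ilsp-fc-2.2.3-jar-with-dependencies.jar \
    -export -lang "$langs" -len 0 -mtlen 100 \
    -i "$crawl_dir"

  # Assumption: -tc is optional; when it is given, -dom must accompany it.
  if [ -n "$topic_file" ]; then
    if [ -z "$domain_title" ]; then
      echo "error: -dom is required when -tc is used" >&2
      return 1
    fi
    set -- "$@" -tc "$topic_file" -dom "$domain_title"
  fi

  set -- "$@" -bs "$basename_prefix" -oxslt
  printf '%s\n' "$@"
}

# Example with hypothetical paths; prints the arguments, one per line.
build_export_args "en;it" /data/crawls/example /data/out/example \
  /data/topics/example.txt "Example topic"
~~~

Printing the arguments rather than executing them keeps the sketch safe to try on a machine where the jar is not installed.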

## Options

```
-export :     run the export process

-i      :     crawl path up to the directory auto-generated by the crawl module

-lang   :     two- or three-letter ISO code(s) of the target language(s),
              e.g. el (for a monolingual crawl for Greek content) or en;el (for a bilingual crawl).
              cesDoc files are generated only for crawled web documents in the targeted language(s).

-tc     :     full path of the topic file (a text file containing a list of term triplets that describe the targeted topic)

-dom    :     title of the targeted domain (required when a domain definition, i.e. the -tc parameter, is used)

-bs     :     full path and basename used to generate all files for easier content navigation

-oxslt  :     export crawl results with the help of an XSLT file for easier examination of the results

-len    :     minimum number of tokens per paragraph. A paragraph with fewer tokens than this value
              is annotated as "out of interest" and is not included in the clean text of the web page.

-mtlen  :     minimum number of tokens in the cleaned document. If the cleaned text has fewer tokens
              than this value, the document is not stored.
```
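
To make the interplay of `-len` and `-mtlen` concrete, the sketch below mimics the two thresholds on a plain-text file that holds one paragraph per line. It is not the Exporter's implementation: the whitespace tokenization, the one-paragraph-per-line input, and the function name `filter_doc` are assumptions made for illustration.

```shell
#!/bin/sh
# filter_doc MIN_PAR_TOKENS MIN_DOC_TOKENS FILE
# Reads FILE (one paragraph per line), keeps only paragraphs with at least
# MIN_PAR_TOKENS whitespace-separated tokens, and prints the kept text only
# if it contains at least MIN_DOC_TOKENS tokens in total.
# Returns a non-zero status when the cleaned text is too short.
filter_doc() {
  awk -v len="$1" -v mtlen="$2" '
    NF > 0 && NF >= len { kept[++n] = $0; total += NF }  # paragraph is long enough
    END {
      if (total < mtlen) exit 1      # cleaned text too short: nothing is stored
      for (i = 1; i <= n; i++) print kept[i]
    }
  ' "$3"
}

# Example: a short navigation label is dropped, the longer paragraph is kept.
demo=$(mktemp)
printf '%s\n' 'Home' 'This paragraph has exactly six tokens.' > "$demo"
filter_doc 3 5 "$demo"
rm -f "$demo"
```

With `-len 0`, as in the command above, no paragraph is discarded for being too short, so only the `-mtlen` threshold decides whether a document is kept.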

The tool creates an "xml" directory next to the "run" directories and stores the downloaded documents (html, pdf) and the generated XML files in it. Each file is named with the ISO code of its language (e.g. eng for an English document) followed by a unique id (e.g. eng-1.xml and eng-1.html, eng-2.xml and eng-2.html, etc.).
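
Given this naming scheme, the exported files can be summarised per language with a few lines of shell. This is a sketch that assumes the "xml" directory contains files named `<lang>-<id>.xml` as described above; the function name `count_by_lang` is illustrative, and any other XML files whose names contain a hyphen would be counted as well.

```shell
#!/bin/sh
# count_by_lang XML_DIR
# Prints "<lang> <count>" for the XML files in XML_DIR, relying on the
# <lang>-<id>.xml naming scheme.
count_by_lang() {
  for f in "$1"/*-*.xml; do
    [ -e "$f" ] || continue          # no match: the glob stays literal, skip it
    name=${f##*/}                    # strip the directory part
    printf '%s\n' "${name%%-*}"      # keep the language prefix before the first "-"
  done | sort | uniq -c | awk '{ print $2, $1 }'
}

# Usage (the path is a placeholder for the "xml" directory of a crawl):
# count_by_lang /path/to/crawl/xml
```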