Export » History » Version 9
Vassilis Papavassiliou, 2021-05-07 03:35 PM
1 | 1 | Prokopis Prokopidis | # Export |
---|---|---|---|
2 | 1 | Prokopis Prokopidis | |
3 | 1 | Prokopis Prokopidis | The export step follows crawling. The Exporter module generates a cesDoc file for each stored web document. Each file holds metadata about the corresponding document (e.g. language, domain, URL) in a header element, and the content of the document, segmented into paragraphs, in a `<body>` element. Besides the normalized text, each paragraph element `<p>` carries attributes that describe the outcome of processing. Once a crawl has finished, the acquired data can be exported with: |
4 | 1 | Prokopis Prokopidis | |
5 | 1 | Prokopis Prokopidis | ~~~ |
6 | 9 | Vassilis Papavassiliou | java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar /opt/ilsp-fc/ilsp-fc-2.2.4-jar-with-dependencies.jar \ |
7 | 1 | Prokopis Prokopidis | -export -lang "en;it" -len 0 -mtlen 100 \ |
8 | 1 | Prokopis Prokopidis | -i (crawlpath up to the auto-generated dir) -tc (full path of topic file) \ |
9 | 3 | Vassilis Papavassiliou | -dom (title of the targeted domain) -bs (full path and basename for the files generated for easier content navigation) \ |
10 | 2 | Vassilis Papavassiliou | -oxslt &>"/var/www/html/tests/eng-ita/log-export_www_esteri_it_eng-ita" |
11 | 1 | Prokopis Prokopidis | ~~~ |
12 | 1 | Prokopis Prokopidis | |
13 | 6 | Vassilis Papavassiliou | ## Options |
14 | 6 | Vassilis Papavassiliou | |
15 | 5 | Vassilis Papavassiliou | ``` |
16 | 8 | Vassilis Papavassiliou | -export : run the export process |
17 | 4 | Vassilis Papavassiliou | |
18 | 8 | Vassilis Papavassiliou | -i : crawl path up to the directory auto-generated by the crawl module |
19 | 4 | Vassilis Papavassiliou | |
20 | 8 | Vassilis Papavassiliou | -lang : two- or three-letter ISO code(s) of the target language(s), |
21 | 8 | Vassilis Papavassiliou | e.g. el (for a monolingual crawl for Greek content) or en;el (for a bilingual crawl). |
22 | 8 | Vassilis Papavassiliou | cesDoc files are generated only for crawled web documents in the targeted language(s). |
23 | 4 | Vassilis Papavassiliou | |
24 | 8 | Vassilis Papavassiliou | -tc : full path of the topic file (a text file that contains a list of term triplets describing the targeted topic). |
25 | 4 | Vassilis Papavassiliou | |
26 | 8 | Vassilis Papavassiliou | -dom : title of the targeted domain (required when a domain definition, i.e. the -tc parameter, is used). |
27 | 4 | Vassilis Papavassiliou | |
28 | 8 | Vassilis Papavassiliou | -bs : Basename to be used in generating all files for easier content navigation |
29 | 4 | Vassilis Papavassiliou | |
30 | 8 | Vassilis Papavassiliou | -oxslt : export crawl results with the help of an XSLT file, so that the results are easier to examine. |
31 | 3 | Vassilis Papavassiliou | |
32 | 8 | Vassilis Papavassiliou | -len : minimum number of tokens per paragraph. If a paragraph contains fewer tokens |
33 | 8 | Vassilis Papavassiliou | than this value, it is annotated as "out of interest" |
34 | 8 | Vassilis Papavassiliou | and is not included in the clean text of the web page. |
35 | 1 | Prokopis Prokopidis | |
36 | 8 | Vassilis Papavassiliou | -mtlen : minimum number of tokens in the cleaned document. If the cleaned text contains fewer tokens |
37 | 8 | Vassilis Papavassiliou | than this value, the document is not stored. |
38 | 5 | Vassilis Papavassiliou | |
39 | 5 | Vassilis Papavassiliou | ``` |
40 | 1 | Prokopis Prokopidis | |
41 | 1 | Prokopis Prokopidis | The tool creates a directory named "xml" next to the "run" directories. The downloaded documents (html, pdf) and the generated XML files are stored in this directory. Each file is named with the ISO code of its language (e.g. eng for an English document) followed by a unique id (e.g. eng-1.xml and eng-1.html, eng-2.xml and eng-2.html, etc.).
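
The naming scheme described above makes it easy to check an export from a script. The following is a minimal sketch, not part of ilsp-fc: it assumes that every exported file in the "xml" directory is named `<language code>-<id>.<extension>`, and it groups the files by language and id so that a download without a matching cesDoc file stands out. The function names `group_exported_files` and `missing_cesdoc` are hypothetical, and any file that does not fit the assumed pattern is skipped.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed naming pattern: <language code>-<numeric id>.<extension>, e.g. eng-1.xml
EXPORT_NAME = re.compile(r"^([a-z]{2,3})-(\d+)\.(\w+)$")


def group_exported_files(xml_dir):
    """Map (language, id) to the sorted list of extensions found for that id."""
    groups = defaultdict(list)
    for path in Path(xml_dir).iterdir():
        match = EXPORT_NAME.match(path.name)
        # Files that do not follow the assumed pattern are ignored.
        if path.is_file() and match:
            lang, doc_id, ext = match.groups()
            groups[(lang, int(doc_id))].append(ext)
    return {key: sorted(exts) for key, exts in sorted(groups.items())}


def missing_cesdoc(groups):
    """Return the (language, id) keys that have no .xml file."""
    return [key for key, exts in groups.items() if "xml" not in exts]
```

A call such as `missing_cesdoc(group_exported_files("/path/to/crawl/xml"))`, where the path is a placeholder for the actual "xml" directory, would then list the documents that were downloaded but have no corresponding XML file.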