Prokopis Prokopidis, 2016-02-16 12:29 PM
# Near Deduplication

Examines the cesDoc files in a directory and removes (near) duplicates.

~~~
java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar /opt/ilsp-fc/ilsp-fc-2.2.2-jar-with-dependencies.jar \
-dedup -lang "en;it" -oxslt -xslt \
-o (crawlpath up to the auto-generated xml dir) \
-of (fullpath of file with paths of generated cesDoc) \
-ofh (fullpath of file with paths of generated transformed cesDoc) \
&>"/var/www/html/tests/eng-ita/log-dedup_www_esteri_it_eng-ita"
~~~

-dedup : enables (near) deduplication.

Further supported parameters are documented in gr.ilsp.fc.dedup.DeduplicatorOptions:

-m : deduplication method: 1 for deduplication based on lists of words and quantized frequencies, 2 for deduplication based on common paragraphs, and 0 (default) for applying both methods.

-mtl : minimum length of a token. Tokens shorter than MIN_TOK_LEN (default is 3) are excluded from the content and from the list of words.

-mpl : minimum length of a paragraph in tokens. Paragraphs with fewer than MIN_PAR_LEN (default is 3) tokens are excluded from the content.

-ithr : intersection threshold for paragraphs. Two documents are considered duplicates when the ratio of their common paragraphs to the shorter of the two documents exceeds this threshold.

-int : input type. Type of the input files; the default is xml, and txt is also supported.

-ex : exclude files. List of cesDoc files (separated by ;) to be excluded from deduplication.
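
To illustrate how -mtl, -mpl and -ithr might interact in the paragraph-based method (-m 2), here is a minimal Python sketch. It is not the ilsp-fc implementation: the function names, the splitting of paragraphs on blank lines, the whitespace tokenization, counting the ratio over distinct paragraphs, applying the -mpl check after the -mtl filter, and the threshold value 0.7 are all assumptions made for the example.

```python
def content_paragraphs(text, min_tok_len=3, min_par_len=3):
    """Return the set of normalized paragraphs that pass both length filters.

    Assumptions (not taken from ilsp-fc): paragraphs are separated by blank
    lines, tokens are whitespace-separated and lowercased, and the paragraph
    length is measured after short tokens have been dropped.
    """
    kept = set()
    for paragraph in text.split("\n\n"):
        # Analogue of -mtl: drop tokens shorter than min_tok_len characters.
        tokens = [t for t in paragraph.lower().split() if len(t) >= min_tok_len]
        # Analogue of -mpl: drop paragraphs with fewer than min_par_len tokens.
        if len(tokens) >= min_par_len:
            kept.add(" ".join(tokens))
    return kept


def overlap_ratio(text_a, text_b, min_tok_len=3, min_par_len=3):
    """Ratio of common paragraphs to the paragraph count of the shorter document."""
    pars_a = content_paragraphs(text_a, min_tok_len, min_par_len)
    pars_b = content_paragraphs(text_b, min_tok_len, min_par_len)
    shorter = min(len(pars_a), len(pars_b))
    if shorter == 0:
        # A document with no usable paragraphs cannot match anything.
        return 0.0
    return len(pars_a & pars_b) / shorter


def is_near_duplicate(text_a, text_b, threshold, min_tok_len=3, min_par_len=3):
    """Analogue of -ithr: duplicates when the ratio is above the threshold."""
    return overlap_ratio(text_a, text_b, min_tok_len, min_par_len) > threshold


# Hypothetical example documents; the last paragraph of doc_a is too short
# to survive the filters, so only its first two paragraphs are compared.
doc_a = (
    "The ministry published the new report today.\n\n"
    "Contact us at the main office in Rome.\n\n"
    "Ok."
)
doc_b = (
    "The ministry published the new report today.\n\n"
    "Contact us at the main office in Rome.\n\n"
    "An entirely different closing paragraph appears here."
)

if __name__ == "__main__":
    print(overlap_ratio(doc_a, doc_b))
    print(is_near_duplicate(doc_a, doc_b, threshold=0.7))
```

Dividing by the shorter document means that a page fully contained in a longer one is flagged as a duplicate even when the longer page has extra paragraphs.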