Near Deduplication » History » Version 4
Vassilis Papavassiliou, 2016-05-31 04:56 PM
1 | 1 | Prokopis Prokopidis | # Near Deduplication |
---|---|---|---|
2 | 1 | Prokopis Prokopidis | |
3 | 1 | Prokopis Prokopidis | Examines the cesDoc files in a directory and removes (near-)duplicates. |
4 | 2 | Vassilis Papavassiliou | It processes the exported cesDoc files (so the argument of option -i should be the crawlpath up to the "xml" directory) and detects near duplicates. |
5 | 2 | Vassilis Papavassiliou | It also creates a text file (based on the argument of option -bs) that lists the full paths of the remaining cesDoc files. |
6 | 2 | Vassilis Papavassiliou | If the -oxslt option is used, an HTML file with links pointing to the XSLT transformations is generated too. |
7 | 1 | Prokopis Prokopidis | |
8 | 1 | Prokopis Prokopidis | ~~~ |
9 | 2 | Vassilis Papavassiliou | java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar /opt/ilsp-fc/ilsp-fc-2.2.3-jar-with-dependencies.jar \ |
10 | 3 | Vassilis Papavassiliou | -dedup -i (crawlpath up to the "xml" dir) -oxslt -bs (fullpath of file with paths of generated cesDoc) \ |
11 | 1 | Prokopis Prokopidis | &>"/var/www/html/tests/eng-ita/log-dedup_www_esteri_it_eng-ita" |
12 | 1 | Prokopis Prokopidis | ~~~ |
13 | 1 | Prokopis Prokopidis | |
14 | 3 | Vassilis Papavassiliou | ## Options |
15 | 1 | Prokopis Prokopidis | |
16 | 4 | Vassilis Papavassiliou | ``` |
17 | 3 | Vassilis Papavassiliou | -dedup : for (near) deduplication. |
18 | 1 | Prokopis Prokopidis | |
19 | 1 | Prokopis Prokopidis | -m : Method type for deduplication: 1 for deduplication based on lists of words and quantized frequencies, 2 for deduplication based on common paragraphs, and 0 for applying both methods (default). |
20 | 1 | Prokopis Prokopidis | |
21 | 1 | Prokopis Prokopidis | -mtl : minimum length of a token. Tokens shorter than MIN_TOK_LEN (default is 3) are excluded from the content and are not included in the list of words. |
22 | 1 | Prokopis Prokopidis | |
23 | 1 | Prokopis Prokopidis | -mpl : minimum length of a paragraph in tokens. Paragraphs with fewer than MIN_PAR_LEN (default is 3) tokens are excluded from the content. |
24 | 1 | Prokopis Prokopidis | |
25 | 1 | Prokopis Prokopidis | -ithr : intersection of paragraphs. Documents for which the ratio of their common paragraphs |
26 | 1 | Prokopis Prokopidis | to the paragraphs of the shortest of them is more than this threshold are considered duplicates. |
27 | 1 | Prokopis Prokopidis | |
28 | 1 | Prokopis Prokopidis | -int : inputType. Type of input files; default is xml, txt is also supported. |
29 | 1 | Prokopis Prokopidis | |
30 | 1 | Prokopis Prokopidis | -ex : exclude files. List of cesDoc files (separated by ;) to be excluded from deduplication. |
31 | 4 | Vassilis Papavassiliou | |
32 | 4 | Vassilis Papavassiliou | -bs : Basename to be used in generating all files for easier content navigation. |
33 | 4 | Vassilis Papavassiliou | |
34 | 4 | Vassilis Papavassiliou | -oxslt : Export crawl results with the help of an XSLT file for better examination of results. |
35 | 4 | Vassilis Papavassiliou | ``` |
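
To make the `-mpl` and `-ithr` descriptions more concrete, the following is a minimal Python sketch of a paragraph-intersection check. It is not the tool's actual implementation: the function names `shared_paragraph_ratio` and `is_near_duplicate` are hypothetical, paragraphs are compared by exact string match, tokens are split on whitespace, and "the shortest of them" is assumed to mean the document with fewer remaining paragraphs.

```python
def shared_paragraph_ratio(doc_a, doc_b, min_par_len=3):
    """Ratio of common paragraphs to the paragraph count of the shorter document.

    doc_a and doc_b are lists of paragraph strings. This is an illustrative
    sketch, not the code used by the -dedup option.
    """
    def keep(paragraphs):
        # Mirror the -mpl description: drop paragraphs with fewer than
        # min_par_len tokens (whitespace tokenization is an assumption).
        return {p.strip() for p in paragraphs if len(p.split()) >= min_par_len}

    a, b = keep(doc_a), keep(doc_b)
    if not a or not b:
        # Nothing left to compare, so report no overlap.
        return 0.0
    # Assumption: "shortest" means the document with fewer kept paragraphs.
    return len(a & b) / min(len(a), len(b))


def is_near_duplicate(doc_a, doc_b, threshold, min_par_len=3):
    """Mirror the -ithr description: duplicates if the ratio exceeds the threshold."""
    return shared_paragraph_ratio(doc_a, doc_b, min_par_len) > threshold


doc_a = [
    "The ministry announced a new programme.",
    "Contact us",  # only 2 tokens, dropped when min_par_len is 3
    "Applications close in June this year.",
]
doc_b = [
    "The ministry announced a new programme.",
    "Applications close in June this year.",
    "A third unrelated paragraph appears here.",
]

# 0.7 is an arbitrary threshold chosen for illustration only.
print(is_near_duplicate(doc_a, doc_b, threshold=0.7))
```

Dividing by the shorter document's paragraph count means a short page that is fully contained in a longer one still scores high, which matches the wording of the `-ithr` option.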