Near Deduplication » History » Version 6
Vassilis Papavassiliou, 2021-05-07 03:35 PM
1 | 1 | Prokopis Prokopidis | # Near Deduplication |
---|---|---|---|
2 | 1 | Prokopis Prokopidis | |
3 | 1 | Prokopis Prokopidis | Examines the cesDoc files in a directory and removes (near) duplicates. |
4 | 5 | Vassilis Papavassiliou | It processes the exported cesDoc files (so the argument of option -i should be the crawlpath up to the "xml" directory), detects near duplicates, |
5 | 5 | Vassilis Papavassiliou | and discards the shorter file (in terms of tokens) of each pair of near duplicates. |
6 | 2 | Vassilis Papavassiliou | It also creates a text file (based on the argument of option -bs) that lists the full paths of the remaining cesDoc files. |
7 | 2 | Vassilis Papavassiliou | If the -oxslt option is used, an HTML file with links pointing to the XSLT transformations is generated too. |
8 | 1 | Prokopis Prokopidis | |
9 | 1 | Prokopis Prokopidis | ~~~ |
10 | 6 | Vassilis Papavassiliou | java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar /opt/ilsp-fc/ilsp-fc-2.2.4-jar-with-dependencies.jar \ |
11 | 3 | Vassilis Papavassiliou | -dedup -i (crawlpath up to the "xml" dir) -oxslt -bs (fullpath of file with paths of generated cesDoc) \ |
12 | 1 | Prokopis Prokopidis | &>"/var/www/html/tests/eng-ita/log-dedup_www_esteri_it_eng-ita" |
13 | 1 | Prokopis Prokopidis | ~~~ |
14 | 1 | Prokopis Prokopidis | |
15 | 3 | Vassilis Papavassiliou | ## Options |
16 | 1 | Prokopis Prokopidis | |
17 | 4 | Vassilis Papavassiliou | ``` |
18 | 5 | Vassilis Papavassiliou | -dedup : for (near) deduplication. |
19 | 1 | Prokopis Prokopidis | |
20 | 5 | Vassilis Papavassiliou | -i : crawlpath up to the "xml" dir generated by the export module |
21 | 5 | Vassilis Papavassiliou | |
22 | 5 | Vassilis Papavassiliou | -bs : Basename to be used in generating all files for easier content navigation |
23 | 5 | Vassilis Papavassiliou | |
24 | 5 | Vassilis Papavassiliou | -oxslt : Export crawl results with the help of an xslt file for better examination of results. |
25 | 5 | Vassilis Papavassiliou | |
26 | 5 | Vassilis Papavassiliou | -ex : exclude files. List of cesDoc files (separated by ;) to be excluded from deduplication. |
27 | 5 | Vassilis Papavassiliou | |
28 | 5 | Vassilis Papavassiliou | ``` |
29 | 5 | Vassilis Papavassiliou | |
30 | 5 | Vassilis Papavassiliou | ## Other options |
31 | 5 | Vassilis Papavassiliou | |
32 | 5 | Vassilis Papavassiliou | |
33 | 1 | Prokopis Prokopidis | -m : Method type for deduplication: 1 for deduplication based on lists of words and quantized frequencies, 2 for deduplication based on common paragraphs, and 0 for applying both methods (default). |
34 | 1 | Prokopis Prokopidis | |
35 | 1 | Prokopis Prokopidis | -mtl : minimum length of a token. Tokens shorter than MIN_TOK_LEN (default is 3) are excluded from the content and are not included in the list of words. |
36 | 1 | Prokopis Prokopidis | |
37 | 1 | Prokopis Prokopidis | -mpl : minimum length of a paragraph in tokens. Paragraphs with fewer than MIN_PAR_LEN (default is 3) tokens are excluded from the content. |
38 | 4 | Vassilis Papavassiliou | |
39 | 4 | Vassilis Papavassiliou | -ithr : intersection of paragraphs. Two documents are considered duplicates if the ratio of their common paragraphs |
40 | 4 | Vassilis Papavassiliou | to the shorter of the two documents is greater than this threshold. |
41 | 4 | Vassilis Papavassiliou | |
42 | 4 | Vassilis Papavassiliou | -int : inputType. Type of input files; the default is xml, and txt is also supported. |
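
The common-paragraph method (-m 2) and the roles of -mpl and -ithr can be illustrated with a minimal Python sketch. This is not the ILSP-FC implementation. It assumes that paragraphs are compared as exact strings, that tokens are whitespace-separated, and that the "shorter" document in the ratio is the one with fewer retained paragraphs. All function names are hypothetical, and the threshold value 0.7 is a placeholder, since this page does not state the default of -ithr.

```python
from typing import List, Optional, Set

MIN_PAR_LEN = 3  # mirrors the documented default of -mpl
ITHR = 0.7       # placeholder only; the default of -ithr is not given on this page


def kept_paragraphs(paragraphs: List[str], min_par_len: int = MIN_PAR_LEN) -> Set[str]:
    """Distinct paragraphs with at least min_par_len whitespace-separated tokens."""
    return {p.strip() for p in paragraphs if len(p.split()) >= min_par_len}


def paragraph_overlap(doc_a: List[str], doc_b: List[str],
                      min_par_len: int = MIN_PAR_LEN) -> float:
    """Ratio of common paragraphs to the paragraph count of the shorter document."""
    a = kept_paragraphs(doc_a, min_par_len)
    b = kept_paragraphs(doc_b, min_par_len)
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0  # nothing left to compare after filtering
    return len(a & b) / shorter


def token_count(doc: List[str]) -> int:
    """Total number of whitespace-separated tokens in a document."""
    return sum(len(p.split()) for p in doc)


def file_to_discard(name_a: str, doc_a: List[str],
                    name_b: str, doc_b: List[str],
                    threshold: float = ITHR) -> Optional[str]:
    """Name of the shorter file (in tokens) if the pair is a near duplicate, else None."""
    if paragraph_overlap(doc_a, doc_b) > threshold:
        return name_a if token_count(doc_a) < token_count(doc_b) else name_b
    return None
```

With two documents that share all of their retained paragraphs, `file_to_discard` returns the name of the one with fewer tokens, which matches the behaviour described at the top of this page. For unrelated documents it returns `None` and both files are kept.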