Near Deduplication » History » Version 5

Vassilis Papavassiliou, 2016-05-31 05:10 PM

# Near Deduplication

Examines the cesDoc files in a directory and removes (near) duplicates.
It processes the exported cesDoc files (so the argument of option -i should be the crawl path up to the "xml" directory), detects near duplicates,
and discards the shorter file (in terms of tokens) in each pair of near duplicates.
It also creates a text file (based on the argument of option -bs) listing the full paths of the remaining cesDoc files.
If the -oxslt option is used, an HTML file with links pointing to the XSLT transformations is generated too.

~~~
java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar /opt/ilsp-fc/ilsp-fc-2.2.3-jar-with-dependencies.jar \
-dedup -i (crawlpath up to the "xml" dir) -oxslt -bs (fullpath of file with paths of generated cesDoc) \
 &>"/var/www/html/tests/eng-ita/log-dedup_www_esteri_it_eng-ita"
~~~
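
The pair rule described above (drop the shorter file of each near-duplicate pair, then list the survivors) can be sketched in Python as follows. This is only an illustration, not the ilsp-fc implementation: the function names, the whitespace tokenization, the `is_near_duplicate` callback, and the one-path-per-line list format are all assumptions.

```python
from itertools import combinations
from typing import Callable, Dict, List


def token_count(text: str) -> int:
    # Assumption: whitespace tokenization; the real tool may tokenize differently.
    return len(text.split())


def remaining_after_dedup(
    docs: Dict[str, str],
    is_near_duplicate: Callable[[str, str], bool],
) -> List[str]:
    """Return the paths left after dropping the shorter file of each near-duplicate pair."""
    discarded = set()
    for a, b in combinations(sorted(docs), 2):
        if a in discarded or b in discarded:
            continue
        if is_near_duplicate(docs[a], docs[b]):
            # Discard the file with fewer tokens; on a tie, keep the first path.
            shorter = a if token_count(docs[a]) < token_count(docs[b]) else b
            discarded.add(shorter)
    return [p for p in sorted(docs) if p not in discarded]


def write_path_list(paths: List[str], list_file: str) -> None:
    # Assumption: one full path per line in the generated list file.
    with open(list_file, "w", encoding="utf-8") as out:
        for p in paths:
            out.write(p + "\n")
```

Any pairwise similarity test can be passed in as `is_near_duplicate`, so the selection rule stays separate from the detection method.
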

## Options

```
-dedup     :     Run (near) deduplication.

-i         :     Crawl path up to the "xml" dir generated by the export module.

-bs        :     Basename to be used in generating all files for easier content navigation.

-oxslt     :     Export crawl results with the help of an XSLT file for better examination of results.

-ex        :     Exclude files. List of cesDoc files (separated by ;) to be excluded from deduplication.

```
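
When the tool is driven from a script, the options above can be assembled into an argument list. The sketch below uses only the flags listed in this section; the helper name `build_dedup_command` and its parameters are hypothetical, and the jar path and directories are whatever the caller supplies.

```python
from typing import List, Optional, Sequence


def build_dedup_command(
    jar: str,
    xml_dir: str,
    basename: str,
    export_xslt: bool = False,
    exclude: Optional[Sequence[str]] = None,
) -> List[str]:
    """Assemble an argument list for the -dedup mode using the options listed above."""
    cmd = ["java", "-jar", jar, "-dedup", "-i", xml_dir, "-bs", basename]
    if export_xslt:
        # -oxslt is a switch and takes no value.
        cmd.append("-oxslt")
    if exclude:
        # -ex takes a single argument: cesDoc files separated by ';'.
        cmd += ["-ex", ";".join(exclude)]
    return cmd
```

Passing the result to `subprocess.run(cmd, check=True)` as a list, rather than as a shell string, keeps the `;` separators in the -ex argument from being interpreted by the shell.
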

## Other options

-m : Method type for deduplication: 1 for deduplication based on lists of words and quantized frequencies, 2 for deduplication based on common paragraphs, and 0 for applying both methods (default).

-mtl : Minimum length of a token. Tokens shorter than MIN_TOK_LEN (default is 3) are excluded from the content and are not included in the list of words.

-mpl : Minimum length of a paragraph in tokens. Paragraphs with fewer than MIN_PAR_LEN (default is 3) tokens are excluded from the content.

-ithr : Intersection of paragraphs. Documents for which the ratio of their common paragraphs to the shorter of the two documents is more than this threshold are considered duplicates.

-int : Input type. Type of input files; the default is xml, and txt is also supported.
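
To make the interaction of -mtl, -mpl and -ithr concrete, here is a minimal Python sketch of a paragraph-intersection check in the spirit of method 2. It is not the tool's code and rests on several assumptions: paragraphs are separated by newlines, tokens are lowercased whitespace-separated strings whose length is counted in characters, short tokens are removed before paragraph length is measured, and the "shorter" document is the one with fewer retained paragraphs. No default for the threshold is stated above, so it is a required argument here.

```python
from typing import Set


def content_paragraphs(text: str, min_tok_len: int = 3, min_par_len: int = 3) -> Set[str]:
    """Normalize a document into a set of paragraphs, applying -mtl and -mpl style filters."""
    kept = set()
    for par in text.split("\n"):
        # Drop tokens shorter than min_tok_len (assumption: length in characters).
        tokens = [t for t in par.lower().split() if len(t) >= min_tok_len]
        # Drop paragraphs with fewer than min_par_len remaining tokens.
        if len(tokens) >= min_par_len:
            kept.add(" ".join(tokens))
    return kept


def paragraph_overlap(doc_a: str, doc_b: str) -> float:
    """Ratio of shared paragraphs to the paragraph count of the shorter document."""
    pars_a, pars_b = content_paragraphs(doc_a), content_paragraphs(doc_b)
    smaller = min(len(pars_a), len(pars_b))
    if smaller == 0:
        return 0.0
    return len(pars_a & pars_b) / smaller


def is_duplicate_by_paragraphs(doc_a: str, doc_b: str, threshold: float) -> bool:
    # Mirrors the -ithr rule: duplicates when the ratio is more than the threshold.
    return paragraph_overlap(doc_a, doc_b) > threshold
```

Dividing by the smaller paragraph count means a short page that is fully contained in a longer one still scores 1.0, which matches the idea of discarding the shorter file of the pair.
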