Version 3 (Vassilis Papavassiliou, 2016-05-31 04:54 PM) → Version 4/6 (Vassilis Papavassiliou, 2016-05-31 04:56 PM)
# Near Deduplication
Examines the cesDoc files in a directory and removes (near-)duplicates.
It processes the exported cesDoc files, so the argument of option -i should be the crawl path up to the "xml" directory.
It also creates a text file, named after the argument of option -bs, that lists the full paths of the remaining cesDoc files.
If the -oxslt option is used, an HTML file with links to the XSL transformations of these files is generated too.
~~~
java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar /opt/ilsp-fc/ilsp-fc-2.2.3-jar-with-dependencies.jar \
-dedup -i (crawlpath up to the "xml" dir) -oxslt -bs (fullpath of file with paths of generated cesDoc) \
&>"/var/www/html/tests/eng-ita/log-dedup_www_esteri_it_eng-ita"
~~~
## Options
```
-dedup : for (near) deduplication.
-m : Method type for deduplication: 1 for deduplication based on lists of words and quantized frequencies, 2 for deduplication based on common paragraphs, and 0 for applying both methods (default).
-mtl : minimum length of a token. Tokens shorter than MIN_TOK_LEN (default is 3) are excluded from the content and from the list of words.
-mpl : minimum length of a paragraph in tokens. Paragraphs with fewer than MIN_PAR_LEN (default is 3) tokens are excluded from the content.
-ithr : intersection of paragraphs. Documents for which the ratio of their common paragraphs to the paragraphs of the shorter document exceeds this threshold are considered duplicates.
-int : inputType. Type of input files, default is xml, also supports txt.
-ex : exclude files. List of CesDocFiles (separated by ;) to be excluded from deduplication.
-bs : Basename to be used in generating all files for easier content navigation
-oxslt : Export crawl results with the help of an xslt file for better examination of results.
```
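The interplay of -mtl, -mpl and -ithr for the paragraph-based method can be illustrated with a small Python sketch. It is not taken from ilsp-fc: the function names are hypothetical, and whitespace tokenization, one paragraph per line, exact string matching of paragraphs, and measuring document length by the number of retained paragraphs are all assumptions made for the example. The threshold is passed in explicitly because no default is stated above.

```python
def paragraphs(text, min_tok_len=3, min_par_len=3):
    """Return the set of paragraphs that survive both length filters.

    Assumptions (not from ilsp-fc): one paragraph per line,
    tokens are separated by whitespace.
    """
    kept = set()
    for par in text.split("\n"):
        # -mtl: drop tokens shorter than min_tok_len characters
        tokens = [t for t in par.split() if len(t) >= min_tok_len]
        # -mpl: drop paragraphs with fewer than min_par_len remaining tokens
        if len(tokens) >= min_par_len:
            kept.add(" ".join(tokens))
    return kept


def paragraph_overlap(doc_a, doc_b, **filters):
    """Ratio of common paragraphs to the paragraphs of the shorter document."""
    pars_a = paragraphs(doc_a, **filters)
    pars_b = paragraphs(doc_b, **filters)
    shorter = min(len(pars_a), len(pars_b))
    if shorter == 0:
        # Nothing left to compare after filtering
        return 0.0
    return len(pars_a & pars_b) / shorter


def is_near_duplicate(doc_a, doc_b, threshold, **filters):
    """-ithr: documents whose overlap exceeds the threshold count as duplicates."""
    return paragraph_overlap(doc_a, doc_b, **filters) > threshold


if __name__ == "__main__":
    full = ("the quick brown fox\n"
            "jumps over the lazy dog\n"
            "an extra closing paragraph here")
    part = "the quick brown fox\njumps over the lazy dog"
    # 0.7 is an arbitrary threshold chosen only for this demonstration
    print(paragraph_overlap(full, part), is_near_duplicate(full, part, 0.7))
```

Dividing by the size of the shorter document means that a page fully contained in a longer one is flagged even when the longer page has a lot of additional content.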