Prokopis Prokopidis, 2016-02-16 12:29 PM


Near Deduplication

Examines the cesDoc files in a directory and removes duplicates and near-duplicates.

java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar /opt/ilsp-fc/ilsp-fc-2.2.2-jar-with-dependencies.jar \
-dedup -lang "en;it" -oxslt -xslt \
-o (crawlpath up to the auto-generated xml dir)  \
-of (fullpath of file with paths of generated cesDoc) \
-ofh (fullpath of file with paths of generated transformed cesDoc) \
 &>"/var/www/html/tests/eng-ita/log-dedup_www_esteri_it_eng-ita"

-dedup : for (near) deduplication.

Further supported parameters are documented in gr.ilsp.fc.dedup.DeduplicatorOptions:

-m : Method type for deduplication: 1 for deduplication based on lists of words and quantized frequencies, 2 for deduplication based on common paragraphs, and 0 for applying both methods (default).
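
The idea behind method 1 can be pictured with a small Python sketch. This is an illustration only, not the ILSP-FC implementation: the function name, the whitespace tokenization, the lowercasing, and the bucketing scheme (`levels`) are all assumptions made for the example.

```python
from collections import Counter


def quantized_signature(text, min_tok_len=3, levels=4):
    """Return a set of (word, bucket) pairs as a rough document fingerprint.

    Hypothetical helper: the real tool may tokenize and quantize differently.
    """
    # Keep only tokens that are long enough (assumed: length in characters).
    tokens = [t.lower() for t in text.split() if len(t) >= min_tok_len]
    counts = Counter(tokens)
    if not counts:
        return frozenset()
    top = max(counts.values())
    # Map each raw count to a bucket in 0..levels-1, relative to the top count,
    # so that small differences in frequency do not change the fingerprint.
    return frozenset((w, (c * levels - 1) // top) for w, c in counts.items())
```

Under these assumptions, two documents with the same words in a different order, or differing only in very short tokens, yield the same signature and would be candidates for removal.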

-mtl : minimum length of a token. Tokens shorter than MIN_TOK_LEN (default is 3) are excluded from the content and are not included in the list of words.

-mpl : minimum length of a paragraph in tokens. Paragraphs with fewer than MIN_PAR_LEN (default is 3) tokens are excluded from the content.
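
The combined effect of -mtl and -mpl can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the tool's code: it assumes whitespace tokenization, token length measured in characters, and paragraph length counted after short tokens have been dropped. The function name `filter_content` is hypothetical.

```python
MIN_TOK_LEN = 3  # default given above for -mtl
MIN_PAR_LEN = 3  # default given above for -mpl


def filter_content(paragraphs, min_tok_len=MIN_TOK_LEN, min_par_len=MIN_PAR_LEN):
    """Return the token lists of the paragraphs that survive both filters."""
    kept = []
    for par in paragraphs:
        # Drop tokens shorter than the minimum token length.
        tokens = [t for t in par.split() if len(t) >= min_tok_len]
        # Drop paragraphs that are left with too few tokens (assumed order).
        if len(tokens) >= min_par_len:
            kept.append(tokens)
    return kept
```

For example, with the defaults, the paragraph "Rome is the capital of Italy" is kept as four tokens, while a navigation fragment such as "Read more" is discarded.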

-ithr : intersection of paragraphs. Documents for which the ratio of their common paragraphs
to the shortest of them exceeds this threshold are considered duplicates.
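
One way to read the -ithr test is sketched below in Python. It is an assumption-laden illustration rather than the tool's implementation: it treats each document as a list of paragraph strings, compares paragraphs by exact equality, and measures the "shortest" document by its number of distinct paragraphs. The function names and the threshold values used in the usage note are hypothetical.

```python
def paragraph_overlap(doc_a, doc_b):
    """Ratio of shared paragraphs to the paragraph count of the shorter document."""
    set_a, set_b = set(doc_a), set(doc_b)
    shorter = min(len(set_a), len(set_b))
    if shorter == 0:
        return 0.0  # an empty document shares nothing
    return len(set_a & set_b) / shorter


def is_near_duplicate(doc_a, doc_b, threshold):
    """True when the overlap ratio is strictly above the threshold."""
    return paragraph_overlap(doc_a, doc_b) > threshold
```

With documents of four and three paragraphs that share two of them, the ratio is 2/3, so the pair counts as duplicates at a threshold of 0.5 but not at 0.7.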

-int : inputType. Type of input files. The default is xml; txt is also supported.

-ex : exclude files. List of cesDoc files (separated by ;) to be excluded from deduplication.