Pair Detection » History » Version 1

Prokopis Prokopidis, 2016-02-16 12:30 PM

# Pair Detection

Pair detection identifies pairs of parallel documents based on the website graph, specific patterns in URLs, occurrences of common images, similarity of digit sequences, and similarity of structure. For each detected pair, a cesAlign file is generated. The basename of a cesAlign file consists of the basenames of the paired documents and the identifier of the method that provided the pair (e.g. eng-12_ell-18_x.xml, where x is one of a, u, p, i, d, h, m or l). This file holds references to the paired documents.
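
As an illustration of the naming scheme, the sketch below splits a cesAlign file name into its parts. It assumes names follow exactly the pattern of the example above: two document basenames and a one-letter method identifier, joined by underscores, with document basenames that contain no underscores. The function name `parse_cesalign_name` is hypothetical and not part of ILSP-FC.

```shell
#!/bin/sh
# Hypothetical helper, not part of ILSP-FC.
# Splits a name such as eng-12_ell-18_x.xml into doc1, doc2 and method.
# Assumes the document basenames themselves contain no underscores.
parse_cesalign_name() {
  name=${1##*/}        # drop any leading directory
  name=${name%.xml}    # drop the .xml extension
  method=${name##*_}   # text after the last underscore
  rest=${name%_*}      # everything before the last underscore
  doc1=${rest%%_*}     # text before the first underscore
  doc2=${rest#*_}      # text after the first underscore
}

parse_cesalign_name "eng-12_ell-18_a.xml"
echo "first=$doc1 second=$doc2 method=$method"
# prints: first=eng-12 second=ell-18 method=a
```

Only POSIX parameter expansion is used, so the helper should behave the same in `sh` and `bash`.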

```
java -cp /opt/ilsp-fc/ilsp-fc-2.2.2-jar-with-dependencies.jar gr.ilsp.fc.bitext.PairDetector \
-meth "aupids" -lang "en;it" -xslt -oxslt \
-i (crawlpath up to the auto-generated xml dir) \
-o (crawlpath up to the auto-generated xml dir) \
-of (fullpath of file with paths of generated cesAlign) \
-ofh (fullpath of file with paths of generated transformed cesAlign) \
 &>"/var/www/html/tests/eng-ita/log-pairdetect_www_esteri_it_eng-ita"
```
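
The parenthesized placeholders in the command above have to be replaced with real paths before it can run. One possible way to do that is sketched below, with the paths held in variables and the command printed for inspection before it is executed. All paths except the jar location are hypothetical, the `RUN` switch is an invention of this sketch, and only flags that appear in the command above are used.

```shell
#!/bin/sh
# Hypothetical paths; adjust them to your own crawl.
JAR=/opt/ilsp-fc/ilsp-fc-2.2.2-jar-with-dependencies.jar
XML_DIR=/path/to/crawl/xml                    # assumed auto-generated xml dir
PAIRS_LIST=/path/to/output/pairs.txt          # assumed target for -of
PAIRS_LIST_HTML=/path/to/output/pairs_html.txt  # assumed target for -ofh
LOG=/path/to/output/log-pairdetect            # assumed log file

# Store the command in the positional parameters so that each
# argument stays a single word, even if a path contains spaces.
set -- java -cp "$JAR" gr.ilsp.fc.bitext.PairDetector \
  -meth "aupids" -lang "en;it" -xslt -oxslt \
  -i "$XML_DIR" -o "$XML_DIR" \
  -of "$PAIRS_LIST" -ofh "$PAIRS_LIST_HTML"

# Dry run by default: only print the command.
# Set RUN=1 to execute it, sending stdout and stderr to the log file.
if [ "${RUN:-0}" = "1" ]; then
  "$@" >"$LOG" 2>&1
else
  printf '%s\n' "$*"
fi
```

The redirection `>"$LOG" 2>&1` is the portable equivalent of the bash-only `&>` used in the original command.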

-meth   : methods to be used for pair detection. Provide a string containing any of the following letters: a for checking links, u for checking URLs for patterns, p for combining common images and digits, i for using common images, d for examining digit sequences, s for examining structures.

-ifp   : image_fullpath. Represent each image by its full path, instead of its name only, during pair detection.

-u_r   : url_replacements. In addition to the default patterns, the user can add more patterns, separated by ;

-del   : delete redundant files. Deletes cesDoc files that have not been paired.
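
Because the `-meth` value is a plain string of method letters, a wrapper script could check it before launching the tool. The sketch below accepts only the six letters listed for `-meth` above (a, u, p, i, d, s). It is a hypothetical pre-flight check, not part of ILSP-FC, and how the tool itself treats other letters or an empty string is not stated here.

```shell
#!/bin/sh
# Hypothetical pre-flight check, not part of ILSP-FC.
# Succeeds only if the argument is non-empty and consists solely of
# the method letters documented for -meth: a, u, p, i, d, s.
is_valid_meth() {
  case "$1" in
    "" | *[!aupids]*) return 1 ;;  # empty, or contains another character
    *) return 0 ;;
  esac
}

if is_valid_meth "aupids"; then
  echo "accepted: aupids"
fi
is_valid_meth "aux" || echo "rejected: aux"
```

The check is deliberately strict: the letters h, m and l, which appear as identifiers in cesAlign file names, are rejected because they are not listed as `-meth` values.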