Pair Detection » History » Version 2
Vassilis Papavassiliou, 2016-02-16 07:49 PM
1 | 1 | Prokopis Prokopidis | # Pair Detection |
---|---|---|---|
2 | 1 | Prokopis Prokopidis | |
3 | 1 | Prokopis Prokopidis | This module detects pairs of parallel documents based on the website graph, specific patterns in URLs, occurrences of common images, similarity of digit sequences, and similarity of structure. For each detected pair, a cesAlign file is generated. The basename of a cesAlign file consists of the basenames of the paired documents and the identifier of the method that provided the pair (e.g. eng-12_ell-18_x.xml, where x is one of a, u, p, i, d, h, m or l). The file holds references to the paired documents. |
4 | 1 | Prokopis Prokopidis | |
5 | 1 | Prokopis Prokopidis | ``` |
6 | 1 | Prokopis Prokopidis | java -cp /opt/ilsp-fc/ilsp-fc-2.2.2-jar-with-dependencies.jar gr.ilsp.fc.bitext.PairDetector \ |
7 | 1 | Prokopis Prokopidis | -meth "aupids" -lang "en;it" -xslt -oxslt \ |
8 | 1 | Prokopis Prokopidis | -i (crawlpath up to the auto-generated xml dir) \ |
9 | 1 | Prokopis Prokopidis | -o (crawlpath up to the auto-generated xml dir) \ |
10 | 1 | Prokopis Prokopidis | -of (fullpath of file with paths of generated cesAlign) \ |
11 | 1 | Prokopis Prokopidis | -ofh (fullpath of file with paths of generated transformed cesAlign) \ |
12 | 1 | Prokopis Prokopidis | &>"/var/www/html/tests/eng-ita/log-pairdetect_www_esteri_it_eng-ita" |
13 | 1 | Prokopis Prokopidis | ``` |
14 | 1 | Prokopis Prokopidis | |
15 | 1 | Prokopis Prokopidis | -meth : methods to be used for pair detection. Pass a string made up of one or more of the following letters: a (check links), u (check URLs for patterns), p (combine common images and digits), i (use common images), d (examine digit sequences), s (examine structures). |
16 | 1 | Prokopis Prokopidis | |
17 | 2 | Vassilis Papavassiliou | -ifp : image_fullpath. During pair detection, represent each image by its full path instead of its name only. |
18 | 1 | Prokopis Prokopidis | |
19 | 1 | Prokopis Prokopidis | -u_r : url_replacements. In addition to the default patterns, the user can supply more patterns, separated by ; |
20 | 1 | Prokopis Prokopidis | |
21 | 1 | Prokopis Prokopidis | -del : delete redundant files. Deletes cesDoc files that have not been paired.
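
The naming convention for cesAlign files and the letters accepted by -meth can be checked with a few lines of shell. The sketch below is only an illustration and not part of ilsp-fc: the function names `parse_cesalign_name` and `valid_meth` are hypothetical, the parser assumes that document basenames contain no underscores (as in eng-12_ell-18_x.xml), and the validator assumes that only the letters documented for -meth above (a, u, p, i, d, s) are acceptable.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; they are not shipped with ilsp-fc.

# Split a cesAlign file name such as eng-12_ell-18_a.xml into the two
# document basenames and the method identifier.
# Assumption: the document basenames themselves contain no underscores.
parse_cesalign_name() {
  local name="${1##*/}"        # drop any leading directory
  name="${name%.xml}"          # drop the .xml extension
  local method="${name##*_}"   # text after the last underscore
  local docs="${name%_*}"      # everything before the last underscore
  local doc1="${docs%%_*}"     # first document basename
  local doc2="${docs#*_}"      # second document basename
  printf '%s %s %s\n' "$doc1" "$doc2" "$method"
}

# Succeed only if the argument is a non-empty string built from the
# letters documented for -meth (assumed here to be a, u, p, i, d, s).
valid_meth() {
  case "$1" in
    ''|*[!aupids]*) return 1 ;;
    *) return 0 ;;
  esac
}

parse_cesalign_name "eng-12_ell-18_a.xml"
valid_meth "aupids" && echo "meth ok"
```

Splitting on the last underscore first keeps the method identifier separate even when the file name is given with a directory prefix. If document basenames in a given crawl can contain underscores, the split between the two documents would need a different rule.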