Pair Detection » History » Version 6

Vassilis Papavassiliou, 2016-05-31 05:23 PM

# Pair Detection
The pair detection module detects pairs of parallel documents based on the website graph, specific patterns in URLs, occurrences of common images, similarity of digit sequences, and similarity of structure. For each detected pair, a cesAlign file is generated. The basename of a cesAlign file consists of the basenames of the paired documents and the identifier of the method that provided the pair (e.g. eng-12_ell-18_x.xml, where x stands for a, u, p, i, d, h, m or l). This file holds references to the paired documents.
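
The naming scheme described above can be taken apart again with plain shell parameter expansion. The following is a minimal sketch, not part of the tool: the function name `parse_cesalign_name` is hypothetical, and it assumes the layout `<doc1>_<doc2>_<method>.xml` with no underscores inside the document basenames, as in the example filename.

```shell
# Split a cesAlign filename such as eng-12_ell-18_u.xml into its parts.
# Assumption: <doc1>_<doc2>_<method>.xml, with no "_" inside <doc1> or <doc2>.
parse_cesalign_name() {
  name=${1##*/}        # drop any leading directory
  name=${name%.xml}    # drop the extension
  method=${name##*_}   # text after the last underscore: method identifier
  rest=${name%_*}      # everything before the last underscore
  doc2=${rest##*_}     # basename of the second paired document
  doc1=${rest%_*}      # basename of the first paired document
  printf '%s %s %s\n' "$doc1" "$doc2" "$method"
}

parse_cesalign_name "eng-12_ell-18_u.xml"
```

For the filename in the example, the function prints the two document basenames followed by the method letter, which makes it easy to count how many pairs each method contributed.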
```
java -cp /opt/ilsp-fc/ilsp-fc-2.2.3-jar-with-dependencies.jar gr.ilsp.fc.bitext.PairDetector \
-pdm "aupidh" -lang "en;it" -oxslt -i (crawlpath up to the auto-generated xml dir) \
-bs (fullpath and basename on which all files for easier content navigation will be generated) \
&>"/var/www/html/tests/eng-ita/log-pairdetect_www_esteri_it_eng-ita"
```
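
When the same invocation is repeated for several crawls, it may help to wrap it in a small function. The sketch below is one possible way to do that, under stated assumptions: the function name `run_pair_detection` and the `JAVA` override variable are hypothetical, the jar path is the one used in the command above, and the crawl directory, basename and log file are passed in by the caller. It uses `> file 2>&1`, the portable equivalent of the bash-specific `&>` redirection.

```shell
# Jar location taken from the command above; adjust to your installation.
JAR=/opt/ilsp-fc/ilsp-fc-2.2.3-jar-with-dependencies.jar

# Hypothetical wrapper around the PairDetector call.
#   $1: crawl path up to the auto-generated xml dir
#   $2: full path and basename for the generated files
#   $3: log file that receives both stdout and stderr
run_pair_detection() {
  "${JAVA:-java}" -cp "$JAR" gr.ilsp.fc.bitext.PairDetector \
    -pdm "aupidh" -lang "en;it" -oxslt -i "$1" -bs "$2" > "$3" 2>&1
}

# Usage (the three arguments are placeholders for your own paths):
# run_pair_detection CRAWL_DIR OUTPUT_BASENAME LOG_FILE
```

Setting `JAVA=echo` before calling the function gives a dry run that only writes the assembled arguments to the log file.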
## Options
```
-i      : Crawl path up to the directory auto-generated by the crawl module.

-lang   : Two- or three-letter ISO code(s) of the target language(s),
          e.g. el (for a monolingual crawl for Greek content) or en;el (for a bilingual crawl).
          CesDoc files are generated only for crawled web documents in the targeted language(s).

-pdm    : Methods to be used for pair detection, given as a string of letters:
          a for checking links, u for checking URLs for patterns,
          p for combining common images and digits, i for using common images,
          d for examining digit sequences, h for examining structures.

-bs     : Basename used when generating all files, for easier content navigation.

-oxslt  : Export crawl results with the help of an XSLT file, for easier examination of the results.

-ifp    : image_fullpath. Represent each image by its full path, instead of its name only, during pair detection.

-u_r    : url_replacements. Besides the default patterns, the user can add more patterns, separated by ;

-del    : Delete redundant files, i.e. cesDoc files that have not been paired.
```
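
Since the `-pdm` value is a free-form string of method letters, a quick sanity check before launching a long run can catch typos. The sketch below is a hypothetical helper, not part of the tool. It assumes that only the six letters listed for `-pdm` above (a, u, p, i, d, h) are valid; the tool itself may accept others.

```shell
# Return success only if the argument is a non-empty string made up
# exclusively of the method letters documented for -pdm.
# Assumption: a, u, p, i, d and h are the only accepted letters.
valid_pdm() {
  case $1 in
    ''|*[!aupidh]*) return 1 ;;   # empty, or contains an undocumented letter
    *) return 0 ;;
  esac
}

if valid_pdm "aupidh"; then
  echo "pdm value looks fine"
fi
```

The check does not reject repeated letters, and it says nothing about how the tool treats the order of the letters.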