Vassilis Papavassiliou, 2016-05-31 05:23 PM
1 | 1 | Prokopis Prokopidis | # Pair Detection |
---|---|---|---|
2 | 1 | Prokopis Prokopidis | |
3 | 1 | Prokopis Prokopidis | This module detects pairs of parallel documents based on the website graph, specific patterns in URLs, occurrences of common images, similarity of digit sequences, and similarity of structure. For each detected pair, a cesAlign file is generated. The basename of a cesAlign file consists of the basenames of the paired documents and the identifier of the method that provided the pair (e.g. eng-12_ell-18_x.xml, where x stands for a, u, p, i, d, h, m or l). This file holds references to the paired documents. |
4 | 1 | Prokopis Prokopidis | |
5 | 1 | Prokopis Prokopidis | ``` |
6 | 3 | Vassilis Papavassiliou | java -cp /opt/ilsp-fc/ilsp-fc-2.2.3-jar-with-dependencies.jar gr.ilsp.fc.bitext.PairDetector \ |
7 | 3 | Vassilis Papavassiliou | -pdm "aupidh" -lang "en;it" -oxslt -i (crawlpath up to the auto-generated xml dir) \ |
8 | 3 | Vassilis Papavassiliou | -bs (fullpath and basename on which all files for easier content navigation will be generated) \ |
9 | 1 | Prokopis Prokopidis | &>"/var/www/html/tests/eng-ita/log-pairdetect_www_esteri_it_eng-ita" |
10 | 1 | Prokopis Prokopidis | ``` |
11 | 1 | Prokopis Prokopidis | |
12 | 3 | Vassilis Papavassiliou | ## Options |
13 | 3 | Vassilis Papavassiliou | |
14 | 4 | Vassilis Papavassiliou | ``` |
15 | 6 | Vassilis Papavassiliou | -i : crawl path, up to the directory auto-generated by the crawl module |
16 | 6 | Vassilis Papavassiliou | |
17 | 6 | Vassilis Papavassiliou | -lang : two- or three-letter ISO code(s) of the target language(s), |
18 | 6 | Vassilis Papavassiliou | e.g. el (for a monolingual crawl for Greek content) or en;el (for a bilingual crawl) |
19 | 6 | Vassilis Papavassiliou | CesDoc files will be generated only for crawled web documents that are in the targeted language(s) |
20 | 6 | Vassilis Papavassiliou | |
21 | 5 | Vassilis Papavassiliou | -pdm : methods to be used for pair detection. A string containing one or more of: a for checking links, |
22 | 5 | Vassilis Papavassiliou | u for checking URLs for patterns, p for combining common images and digits, i for using common images, |
23 | 5 | Vassilis Papavassiliou | d for examining digit sequences, h for examining structure. |
24 | 6 | Vassilis Papavassiliou | |
25 | 6 | Vassilis Papavassiliou | -bs : full path and basename used for all files generated for easier content navigation |
26 | 6 | Vassilis Papavassiliou | |
27 | 6 | Vassilis Papavassiliou | -oxslt : export crawl results using an XSLT file, so that the results are easier to examine. |
28 | 1 | Prokopis Prokopidis | |
29 | 5 | Vassilis Papavassiliou | -ifp : image_fullpath. During pair detection, represent each image by its full path instead of its name only. |
30 | 1 | Prokopis Prokopidis | |
31 | 5 | Vassilis Papavassiliou | -u_r : url_replacements. In addition to the default patterns, the user can add more patterns, separated by ; |
32 | 1 | Prokopis Prokopidis | |
33 | 5 | Vassilis Papavassiliou | -del : delete redundant files. Deletes cesDoc files that have not been paired. |
34 | 4 | Vassilis Papavassiliou | |
35 | 4 | Vassilis Papavassiliou | ``` |
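
The cesAlign naming scheme described at the top of this page (for example eng-12_ell-18_x.xml) can be unpacked with a few lines of shell when inspecting the output of a run. The sketch below is only an illustration and is not part of ilsp-fc: the function name `describe_pair` is hypothetical, it assumes that document basenames contain no underscores, and it assumes that the method identifier in the file name uses the same letters as the `-pdm` option. The identifiers m and l are not described on this page, so they are reported as such.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ilsp-fc): split a cesAlign file name such as
# eng-12_ell-18_u.xml into the two document basenames and the method identifier.
# Assumption: document basenames themselves contain no underscores.
describe_pair() {
  local name="${1##*/}"   # drop any leading directory
  name="${name%.xml}"     # drop the .xml extension
  local doc1 doc2 method desc
  IFS=_ read -r doc1 doc2 method <<< "$name"
  # Assumption: identifiers match the letters listed for the -pdm option.
  case "$method" in
    a) desc="links" ;;
    u) desc="URL patterns" ;;
    p) desc="common images and digits" ;;
    i) desc="common images" ;;
    d) desc="digit sequences" ;;
    h) desc="structure" ;;
    *) desc="not described on this page" ;;
  esac
  # Tab-separated: first document, second document, identifier, description.
  printf '%s\t%s\t%s\t%s\n' "$doc1" "$doc2" "$method" "$desc"
}
```

Calling `describe_pair eng-12_ell-18_u.xml` prints the four tab-separated fields `eng-12`, `ell-18`, `u` and `URL patterns`; looping over the cesAlign files of a run and piping the result through `cut -f3 | sort | uniq -c` would give a rough count of pairs per method.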