TMX merging » History » Version 7

Vassilis Papavassiliou, 2016-05-31 05:52 PM

# TMX merging

This module merges the generated TMX files into a single TMX file, which is the final output of the process (i.e. the bilingual corpus). Segment pairs can be filtered by selecting which types of document pairs and which types of segment alignments to include. The module also extracts metadata about the final corpus.

```
# ilsp-fc 2.2.2
java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar /opt/ilsp-fc/ilsp-fc-2.2.2-jar-with-dependencies.jar \
-tmxmerge -lang "L1;L2" -oxslt -doctypes "aupdih" -segtypes "1:1" \
-tmx (fullpath of the merged TMX to be constructed) \
&>"log-tmxmerge"

# ilsp-fc 2.2.3-SNAPSHOT
java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar /opt/ilsp-fc/ilsp-fc-2.2.3-SNAPSHOT-jar-with-dependencies.jar -tmxmerge -lang "L1;L2" \
-i (input) -oxslt -pdm "aupdih" -segtypes "1:1" -bs (baseName for output files) &>"log-tmxmerge"
```
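
The values of `-lang` and `-segtypes` contain `;`, which the shell treats as a command separator, so they need to stay quoted. The following is a minimal sketch of assembling the second invocation above from variables and printing it for inspection before running it. The variable names and the paths `/data/tmx` and `/data/out/corpus` are hypothetical; only flags that appear in the commands above are used.

```shell
# Hypothetical values; replace them with your own paths and languages.
JAR=/opt/ilsp-fc/ilsp-fc-2.2.3-SNAPSHOT-jar-with-dependencies.jar
LANGS="L1;L2"              # value for -lang, quoted because of ';'
INPUT=/data/tmx            # hypothetical input directory (-i)
BASENAME=/data/out/corpus  # hypothetical base name for output files (-bs)

# Store the command in the positional parameters so that every
# argument stays a single word, even if it contains ';' or spaces.
set -- java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar "$JAR" \
    -tmxmerge -lang "$LANGS" -i "$INPUT" -oxslt \
    -pdm "aupdih" -segtypes "1:1" -bs "$BASENAME"

# Dry run: print one argument per line instead of executing.
printf '%s\n' "$@"

# To execute for real, replace the printf line with:
#   "$@" >log-tmxmerge 2>&1
```

Printing one argument per line makes it easy to confirm that `L1;L2` reached the command as a single argument.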

## Options

```
-tmxmerge     :     Merges the generated TMX files (i.e. constructs a bilingual corpus).

-i            :     Full path of the input file or directory. It can be either a directory that contains the TMX files to be merged,
                    or a text file with the full paths of such directories (one directory per line).

-pdm          :     Defines the types of the document pairs from which the segment pairs will be selected.
                    The proposed value is "aupidh", since pairs of type "m" and "l" (e.g. eng-1_lav-3_m.xml or eng-2_lav-8_l.xml)
                    are only used for testing or examining the tool.

-thres        :     Thresholds for 0:1 alignments per type. It should have the same length as the types parameter. If a TMX of type X contains
                    more 0:1 segment pairs than the corresponding threshold, it will not be selected.

-segtypes     :     Types of segment alignments that will be selected for the final output. A suggested value is "1:1".
                    Multiple segment types can be separated by ";" (e.g. 1:1;1:2;2:1).

-tmx          :     A TMX file that includes the filtered segment pairs of the generated TMX files. This is the final output of the process (i.e. the parallel corpus).

-cc           :     If present, only document pairs for which a license has been detected will be selected for the merged TMX.

-cfg          :     The full path to a configuration file that can be used to override default parameters.

-keepdup      :     Keeps duplicate TUs and annotates them.

-keepem       :     Keeps TUs even if one of their TUVs does not contain any letter, and annotates them.

-keepiden     :     Keeps TUs even if their TUVs are identical after removing non-letters, and annotates them.

-ksn          :     Keeps only TUs whose TUVs contain the same digits.

-maxlr        :     Maximum ratio of length (in characters) in a TU.

-minlr        :     Minimum ratio of length (in characters) in a TU.

-mpa          :     Minimum percentage of 0:1 alignments in a TMX for it to be accepted.

-mtuvl        :     Minimum length, in tokens, of an acceptable TUV.

-iso6393      :     If present, three-letter language codes are used. Otherwise, two-letter language codes are used in the generated TMX files.
```
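
The `-i` option also accepts a text file that lists directories, one full path per line. The following is a minimal sketch of generating such a file, assuming a hypothetical layout in which every immediate subdirectory of a common root holds TMX files. The directory names `crawl-a` and `crawl-b` and the file name `tmx-dirs.txt` are illustrative only, and the demo root is created with `mktemp` so the snippet can be tried safely.

```shell
# Demo setup (hypothetical layout): a root with one subdirectory per crawl.
ROOT=$(mktemp -d)
mkdir -p "$ROOT/crawl-a" "$ROOT/crawl-b"
LIST="$ROOT/tmx-dirs.txt"

# Write the full path of each immediate subdirectory, one per line, sorted.
# -mindepth/-maxdepth are supported by GNU and BSD find.
find "$ROOT" -mindepth 1 -maxdepth 1 -type d | sort > "$LIST"

# Show the list that would be passed as the value of -i.
cat "$LIST"
```

With a real directory tree, set `ROOT` to an absolute path so that the listed entries are full paths, then pass the resulting file as the `-i` value.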