Getting Started » History » Version 153
Vassilis Papavassiliou, 2016-05-30 01:33 PM
1 | 130 | Prokopis Prokopidis | # Getting Started |
---|---|---|---|
2 | 2 | Prokopis Prokopidis | |
3 | 2 | Prokopis Prokopidis | Once you [[DeveloperSetup|build]] or [[HowToGet|download]] an ilsp-fc runnable jar, you can run it like this: |
4 | 2 | Prokopis Prokopidis | |
5 | 70 | Prokopis Prokopidis | <pre><code>java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar</code></pre> |
6 | 2 | Prokopis Prokopidis | |
7 | 134 | Vassilis Papavassiliou | ## Examples of running monolingual crawls |
8 | 134 | Vassilis Papavassiliou | |
9 | 134 | Vassilis Papavassiliou | * Given a seed URL list [[ENV_EN_seeds.txt]], the following example crawls the web for 5 minutes and constructs a collection containing English web pages. |
10 | 134 | Vassilis Papavassiliou | |
11 | 134 | Vassilis Papavassiliou | ``` |
12 | 134 | Vassilis Papavassiliou | java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar \ |
13 | 134 | Vassilis Papavassiliou | -crawl -export -dedup -a test -f -type m -c 5 -lang en -k -u ENV_EN_seeds.txt -xslt -oxslt \ |
14 | 134 | Vassilis Papavassiliou | -dest crawlResults -of output-test-list.txt -ofh output-test-list.txt.html |
15 | 134 | Vassilis Papavassiliou | ``` |
16 | 134 | Vassilis Papavassiliou | |
17 | 134 | Vassilis Papavassiliou | In this and other example commands in this documentation, a `log4j.xml` file sets the logging configuration. An example `log4j.xml` file can be downloaded from [[Log4j_xml|here]]. |
18 | 134 | Vassilis Papavassiliou | |
19 | 134 | Vassilis Papavassiliou | * Given a seed URL list [[ENV_EN_seeds.txt]] and a topic definition for the _Environment_ domain in English [[ENV_EN_topictxt|ENV_EN_topic.txt]], the following example crawls the web for 10 cycles and constructs a collection containing English web pages related to this domain. |
20 | 134 | Vassilis Papavassiliou | |
21 | 134 | Vassilis Papavassiliou | ``` |
22 | 134 | Vassilis Papavassiliou | java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar \ |
23 | 134 | Vassilis Papavassiliou | -crawl -export -dedup -a test1 -f -type m -n 10 -lang en -k -u ENV_EN_seeds.txt -xslt -oxslt \ |
24 | 134 | Vassilis Papavassiliou | -tc ENV_EN_topic.txt -dom Environment -dest crawlResults -of output-test1-list.txt -ofh output-test1-list.txt.html |
25 | 134 | Vassilis Papavassiliou | ``` |
26 | 134 | Vassilis Papavassiliou | |
27 | 134 | Vassilis Papavassiliou | ## Example of running bilingual crawls |
28 | 134 | Vassilis Papavassiliou | |
29 | 151 | Vassilis Papavassiliou | This is a test example to verify that the whole workflow (crawl, export, deduplication, pair detection, alignment) works successfully. |
30 | 134 | Vassilis Papavassiliou | |
31 | 1 | Prokopis Prokopidis | ``` |
32 | 151 | Vassilis Papavassiliou | java -Dlog4j.configuration=file:/opt/ilsp-fc/log4j.xml -jar /opt/ilsp-fc/ilsp-fc-2.2.3-SNAPSHOT-jar-with-dependencies.jar \ |
33 | 151 | Vassilis Papavassiliou | -crawl -export -dedup -pairdetect -align -tmxmerge -f -k -oxslt -type p -n 1 -t 20 -len 0 -mtlen 100 \ |
34 | 151 | Vassilis Papavassiliou | -pdm "aupdih" -segtypes "1:1" -lang "eng;lt;deu;lv" -a test -filter ".*www\.airbaltic\.com.*" \ |
35 | 151 | Vassilis Papavassiliou | -u "/var/www/html/elrc/test-seeds" -dest "/var/www/html/elrc/test" \ |
36 | 151 | Vassilis Papavassiliou | -bs "/var/www/html/elrc/test/output_test" &> "/var/www/html/elrc/test/log_test" |
37 | 1 | Prokopis Prokopidis | ``` |
38 | 151 | Vassilis Papavassiliou | |
39 | 151 | Vassilis Papavassiliou | Seed URLs: |
40 | 151 | Vassilis Papavassiliou | |
41 | 151 | Vassilis Papavassiliou | ``` |
42 | 151 | Vassilis Papavassiliou | https://www.airbaltic.com/lv/bernu-atlaide |
43 | 151 | Vassilis Papavassiliou | https://www.airbaltic.com/lv/profila-registracija |
44 | 151 | Vassilis Papavassiliou | https://www.airbaltic.com/de/ermaessigung-kinder |
45 | 151 | Vassilis Papavassiliou | https://www.airbaltic.com/de/profil-erstellen |
46 | 151 | Vassilis Papavassiliou | https://www.airbaltic.com/en/child-discount |
47 | 151 | Vassilis Papavassiliou | https://www.airbaltic.com/en/create-account |
48 | 151 | Vassilis Papavassiliou | https://www.airbaltic.com/lt/child-discount |
49 | 151 | Vassilis Papavassiliou | https://www.airbaltic.com/lt/sukurti-paskira |
50 | 151 | Vassilis Papavassiliou | ``` |
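The seed list is a plain text file with one URL per line. As a minimal sketch, the following writes the URLs above to a file and confirms the count. The file name `test-seeds` in the current directory is an assumption for illustration only; the example command above reads `/var/www/html/elrc/test-seeds`.

```shell
# Assumed location for illustration; the example command reads /var/www/html/elrc/test-seeds.
seed_file="./test-seeds"

# Write one seed URL per line.
cat > "$seed_file" <<'EOF'
https://www.airbaltic.com/lv/bernu-atlaide
https://www.airbaltic.com/lv/profila-registracija
https://www.airbaltic.com/de/ermaessigung-kinder
https://www.airbaltic.com/de/profil-erstellen
https://www.airbaltic.com/en/child-discount
https://www.airbaltic.com/en/create-account
https://www.airbaltic.com/lt/child-discount
https://www.airbaltic.com/lt/sukurti-paskira
EOF

# Count the non-empty lines to confirm that every URL was written.
seed_count=$(grep -c . "$seed_file")
echo "Wrote $seed_count seed URLs to $seed_file"
```

The resulting path can then be passed to the `-u` option.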
51 | 151 | Vassilis Papavassiliou | |
52 | 151 | Vassilis Papavassiliou | |
53 | 152 | Vassilis Papavassiliou | ## Options |
54 | 152 | Vassilis Papavassiliou | Several options control the applied processes. Besides the comprehensive list that follows, the page of each module lists the options that the module supports. |
55 | 152 | Vassilis Papavassiliou | |
56 | 152 | Vassilis Papavassiliou | ``` |
57 | 152 | Vassilis Papavassiliou | -a,--agentname <arg> Agent name to identify the person or the organization |
58 | 152 | Vassilis Papavassiliou | responsible for the crawl |
59 | 152 | Vassilis Papavassiliou | -align,--align_sentences <arg> Sentence align document pairs using this aligner (default is |
60 | 152 | Vassilis Papavassiliou | maligna) |
61 | 152 | Vassilis Papavassiliou | -bs,--basename <arg> Basename to be used in generating all files for easier |
62 | 152 | Vassilis Papavassiliou | content navigation |
63 | 152 | Vassilis Papavassiliou | -c,--crawlduration <arg> Maximum crawl duration in minutes |
64 | 152 | Vassilis Papavassiliou | -cc,--creative_commons Force the alignment process to generate a merged TMX with |
65 | 152 | Vassilis Papavassiliou | sentence alignments only from document pairs for which an |
66 | 152 | Vassilis Papavassiliou | open content license has been detected. |
67 | 152 | Vassilis Papavassiliou | -cfg,--config <arg> Path to the XML configuration file |
68 | 152 | Vassilis Papavassiliou | -crawl,--crawl Start a crawl |
69 | 152 | Vassilis Papavassiliou | -d,--stay_in_webdomain Force the monolingual crawler to stay in a specific web |
70 | 152 | Vassilis Papavassiliou | domain |
71 | 152 | Vassilis Papavassiliou | -dbg,--debug Use debug level for logging |
72 | 152 | Vassilis Papavassiliou | -dedup,--deduplicate Deduplicate and discard (near) duplicate documents |
73 | 152 | Vassilis Papavassiliou | -del,--delete_redundant_files Delete redundant crawled documents that have not been |
74 | 152 | Vassilis Papavassiliou | detected as members of a document pair |
75 | 152 | Vassilis Papavassiliou | -dest,--destination <arg> Path to a directory where the acquired/generated resources |
76 | 152 | Vassilis Papavassiliou | will be stored |
77 | 152 | Vassilis Papavassiliou | -pdm,--pairDetectMethods <arg> When creating a merged TMX file, only use sentence alignments |
78 | 152 | Vassilis Papavassiliou | from document pairs that have been identified by specific |
79 | 152 | Vassilis Papavassiliou | methods, e.g. auidh. See the pdm option. |
80 | 152 | Vassilis Papavassiliou | -dom,--domain <arg> A descriptive title for the targeted domain |
81 | 152 | Vassilis Papavassiliou | -export,--export Export crawled documents to cesDoc XML files |
82 | 152 | Vassilis Papavassiliou | -f,--force Force a new crawl. Caution: This will remove any previously |
83 | 152 | Vassilis Papavassiliou | crawled data |
84 | 152 | Vassilis Papavassiliou | -filter,--fetchfilter <arg> Use this regex to force the crawler to crawl only in specific |
85 | 152 | Vassilis Papavassiliou | sub webdomains. Webpages with urls that do not match this |
86 | 152 | Vassilis Papavassiliou | regex will not be fetched. |
87 | 152 | Vassilis Papavassiliou | -h,--help This message |
88 | 152 | Vassilis Papavassiliou | -i,--inputdir <arg> Input directory for deduplication, pairdetection, or |
89 | 152 | Vassilis Papavassiliou | alignment |
90 | 152 | Vassilis Papavassiliou | -ifp,--image_urls Full image URLs (and not only their basenames) will be used |
91 | 152 | Vassilis Papavassiliou | in pair detection with common images |
92 | 152 | Vassilis Papavassiliou | -k,--keepboiler Keep and annotate boilerplate content in parsed text |
93 | 152 | Vassilis Papavassiliou | -l,--loggingAppender <arg> Logging appender (console, DRFA) to use |
94 | 152 | Vassilis Papavassiliou | -lang,--languages <arg> Two or three letter ISO code(s) of target language(s), e.g. |
95 | 152 | Vassilis Papavassiliou | el (for a monolingual crawl for Greek content) or en;el (for |
96 | 152 | Vassilis Papavassiliou | a bilingual crawl) |
97 | 152 | Vassilis Papavassiliou | -len,--length <arg> Minimum number of tokens per text block. Shorter text blocks |
98 | 152 | Vassilis Papavassiliou | will be annotated as "ooi-length" |
99 | 152 | Vassilis Papavassiliou | -mtlen,--minlength <arg> Minimum number of tokens in crawled documents (after |
100 | 152 | Vassilis Papavassiliou | boilerplate detection). Shorter documents will be discarded. |
101 | 152 | Vassilis Papavassiliou | -n,--numloops <arg> Maximum number of fetch/update loops |
102 | 152 | Vassilis Papavassiliou | -oxslt,--offline_xslt Apply an xsl transformation to generate html files during |
103 | 152 | Vassilis Papavassiliou | exporting. |
104 | 152 | Vassilis Papavassiliou | -p_r,--path_replacements <arg> Put the strings to be replaced, separated by ';'. This might |
105 | 152 | Vassilis Papavassiliou | be useful for crawling via the web service |
106 | 152 | Vassilis Papavassiliou | -pairdetect,--pair_detection Detect document pairs in crawled documents |
107 | 152 | Vassilis Papavassiliou | -pdm,--pair_detection_methods <arg> A string forcing the crawler to detect pairs using one or |
108 | 152 | Vassilis Papavassiliou | more specific methods: a (links between documents), u |
109 | 152 | Vassilis Papavassiliou | (patterns in urls), p (common images and similar digit |
110 | 152 | Vassilis Papavassiliou | sequences), i (common images), d (similar digit sequences), h, m, or l |
111 | 152 | Vassilis Papavassiliou | (high/medium/low similarity of html structure) |
112 | 152 | Vassilis Papavassiliou | -segtypes,--segtypes <arg> When creating a merged TMX file, only use sentence alignments |
113 | 152 | Vassilis Papavassiliou | of specific types, e.g. 1:1 |
114 | 152 | Vassilis Papavassiliou | -storefilter,--storefilter <arg> Use this regex to force the crawler to store only webpages |
115 | 152 | Vassilis Papavassiliou | with urls that match this regex. |
116 | 152 | Vassilis Papavassiliou | -t,--threads <arg> Maximum number of fetcher threads to use |
117 | 152 | Vassilis Papavassiliou | -tc,--topic <arg> Path to a file with the topic definition |
118 | 152 | Vassilis Papavassiliou | -tmxmerge,--tmxmerge Merge aligned segments from each document pair into one tmx |
119 | 152 | Vassilis Papavassiliou | file |
120 | 152 | Vassilis Papavassiliou | -type,--type <arg> Crawl type: m (monolingual) or p (parallel) or q |
121 | 152 | Vassilis Papavassiliou | (comparable) |
122 | 152 | Vassilis Papavassiliou | -u,--urls <arg> File with seed urls used to initialize the crawl |
123 | 152 | Vassilis Papavassiliou | -u_r,--url_replacements <arg> A string to be replaced, separated by ';'. |
124 | 152 | Vassilis Papavassiliou | ``` |
125 | 151 | Vassilis Papavassiliou | |
126 | 134 | Vassilis Papavassiliou | |
127 | 130 | Prokopis Prokopidis | ## Other settings |
128 | 73 | Prokopis Prokopidis | |
129 | 102 | Vassilis Papavassiliou | There are several settings that influence the crawling process and can be defined in a configuration file before the crawling process. The default configuration files for monolingual and bilingual crawls are [[FMC_config.xml]] and [[FBC_config.xml]] respectively. They are included in the ilsp-fc runnable jar. |
130 | 1 | Prokopis Prokopidis | |
131 | 40 | Prokopis Prokopidis | Some of the settings can also be overridden using options of the ilsp-fc runnable jar, as follows: |
132 | 38 | Prokopis Prokopidis | |
133 | 139 | Vassilis Papavassiliou | -crawl : Runs the crawling process. |
134 | 34 | Vassilis Papavassiliou | |
135 | 139 | Vassilis Papavassiliou | -f : Forces the crawler to start a new job. |
136 | 139 | Vassilis Papavassiliou | |
137 | 139 | Vassilis Papavassiliou | -type : The type of crawl: monolingual (m), parallel (p), or comparable (q). |
138 | 139 | Vassilis Papavassiliou | |
139 | 139 | Vassilis Papavassiliou | -lang : The ISO codes of the targeted languages, separated by ";". |
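Because `;` also separates commands in most shells, a value with more than one language has to be quoted on the command line. A minimal sketch follows, in which `show_args` is a hypothetical helper that stands in for the crawler and prints each argument it receives on its own line:

```shell
# Hypothetical stand-in for the crawler: prints every received argument on a separate line.
show_args() {
  for arg in "$@"; do
    printf '%s\n' "$arg"
  done
}

# Quoted value: the shell passes en;el to the program as a single argument.
show_args -lang "en;el"
```

Without the quotes, the shell would end the command after `en` and try to run `el` as a separate command.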
140 | 139 | Vassilis Papavassiliou | |
141 | 139 | Vassilis Papavassiliou | -cfg : The full path to a configuration file that can be used to override default parameters. |
142 | 139 | Vassilis Papavassiliou | |
143 | 140 | Vassilis Papavassiliou | -a : User agent name. For bilingual crawls, a name similar to the targeted site is recommended. |
144 | 139 | Vassilis Papavassiliou | |
145 | 140 | Vassilis Papavassiliou | -u : The full path of the text file that contains the seed URLs that will initialize the crawler. For bilingual crawling, the list should contain the URL of the main page of the targeted website, or other URLs of this website. |
146 | 139 | Vassilis Papavassiliou | |
147 | 139 | Vassilis Papavassiliou | -filter : A regular expression that URLs must match in order to be fetched; URLs that do NOT match this regex are filtered out. |
148 | 140 | Vassilis Papavassiliou | This filter forces the crawler to focus on a specific web domain (e.g. ".*ec.europa.eu.*"), on a part of a web domain (e.g. ".*/legislation_summaries/environment.*"), or on different web sites (e.g. when the translations are in two web sites such as http://www.nrcan.gc.ca and http://www.rncan.gc.ca). Note that if this filter is used, only the seed URLs that match this regex will be fetched. |
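Since seed URLs that do not match the filter are not fetched, it can help to check a seed list against the regex before starting a crawl. The sketch below approximates that check with `grep -Ex`. It assumes that the tool matches the regex against the whole URL (as the leading and trailing `.*` in the examples suggest) and that the pattern uses only syntax shared by Java regexes and POSIX extended regexes. The seed list is illustrative, and `www.example.com` is a placeholder.

```shell
# Filter regex taken from the bilingual example command.
filter='.*www\.airbaltic\.com.*'

# Illustrative seed list: two URLs from the bilingual example and one unrelated placeholder URL.
seeds='https://www.airbaltic.com/en/child-discount
https://www.airbaltic.com/de/profil-erstellen
https://www.example.com/'

# -E selects extended regexes; -x requires the whole line to match the pattern.
kept=$(printf '%s\n' "$seeds" | grep -Ex "$filter")

# Print the seed URLs that would pass the filter.
printf '%s\n' "$kept"
```

Only the two airbaltic URLs are printed; the placeholder URL is dropped because it does not match the pattern.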
149 | 1 | Prokopis Prokopidis | |
150 | 140 | Vassilis Papavassiliou | -n : The crawl duration in cycles. Since the crawler runs in cycles (during which links stored at the top of the crawler’s frontier are extracted and new links are examined), this parameter is best used either for testing or, with a large value (e.g. 100), to make it likely that the crawler visits the entire website. |
151 | 1 | Prokopis Prokopidis | |
152 | 140 | Vassilis Papavassiliou | -c : The crawl duration in minutes. Since the crawler runs in cycles (during which links stored at the top of the crawler’s frontier are extracted and new links are examined), the defined time will very likely expire during a cycle. In that case, the crawler stops only after the running cycle ends. |
153 | 140 | Vassilis Papavassiliou | |
154 | 141 | Vassilis Papavassiliou | -dest : The directory where the results (i.e. the crawled data) will be stored. The tool will create the file structure dest/agent/crawl-id (where dest and agent stand for the arguments of the -dest and -a options respectively, and crawl-id is generated automatically). In this directory, the tool will create the "run" directories (i.e. directories containing all resources fetched/extracted/used/required for each cycle of this crawl). In addition, a pdf directory for storing acquired pdf files will be created. |
155 | 139 | Vassilis Papavassiliou | |
156 | 140 | Vassilis Papavassiliou | -t : The number of threads that will be used to fetch web pages in parallel. |
157 | 1 | Prokopis Prokopidis | |
158 | 140 | Vassilis Papavassiliou | -k : Forces the crawler to annotate boilerplate content in parsed text. |
159 | 139 | Vassilis Papavassiliou | |
160 | 139 | Vassilis Papavassiliou | -len : Minimum number of tokens per paragraph. If the length (in terms of tokens) of a paragraph is |
161 | 140 | Vassilis Papavassiliou | less than this value the paragraph will be annotated as "out of interest" and will not be included into the clean text of the web page. |
162 | 1 | Prokopis Prokopidis | |
163 | 1 | Prokopis Prokopidis | -mtlen : Minimum number of tokens in cleaned document. If the length (in terms of tokens) of the cleaned text is less than this value, the document will not be stored. |
164 | 139 | Vassilis Papavassiliou | |
165 | 140 | Vassilis Papavassiliou | -tc : The full path of the topic file (a text file that contains a list of term triplets that describe the targeted topic). An example domain definition of "Environment" for the English-Spanish pair can be found at http://nlp.ilsp.gr/redmine/projects/ilsp-fc/wiki/ENV_EN_ES_topic. If omitted, the crawl will be a "general" one (i.e. the module for text-to-domain classification will not be used). |
166 | 139 | Vassilis Papavassiliou | |
167 | 139 | Vassilis Papavassiliou | -dom : Title of the targeted domain (required when domain definition, i.e. tc parameter, is used). |
168 | 139 | Vassilis Papavassiliou | |
169 | 139 | Vassilis Papavassiliou | -storefilter : A regular expression to discard (i.e. visit/fetch/process but do not store) webpages with URLs which do NOT match this regex. |
170 | 140 | Vassilis Papavassiliou | |
171 | 1 | Prokopis Prokopidis | -d : Forces the crawler to stay in a web site (i.e. starts from a web site and extracts only links to pages inside the same web site). It should be used only for monolingual crawling. |
172 | 141 | Vassilis Papavassiliou | |
173 | 141 | Vassilis Papavassiliou | |
174 | 143 | Vassilis Papavassiliou | -export : Runs the export process. |
175 | 141 | Vassilis Papavassiliou | |
176 | 146 | Vassilis Papavassiliou | -of : The full path of the text file containing a list of full paths of the exported cesDoc or cesAlign files. |
177 | 142 | Vassilis Papavassiliou | |
178 | 143 | Vassilis Papavassiliou | -xslt : If present, inserts a stylesheet for rendering XML results as HTML. |
179 | 1 | Prokopis Prokopidis | |
180 | 143 | Vassilis Papavassiliou | -oxslt : If present, applies an XSL transformation during export to generate HTML files for easier examination of the results. |
181 | 142 | Vassilis Papavassiliou | |
182 | 143 | Vassilis Papavassiliou | -ofh : The full path of the HTML file containing a list of links pointing to HTML files (by XSL transformation of each XML) for easier browsing of the collection. |
183 | 1 | Prokopis Prokopidis | |
184 | 147 | Vassilis Papavassiliou | -dedup : Runs (near) deduplication. |
185 | 147 | Vassilis Papavassiliou | |
186 | 148 | Vassilis Papavassiliou | -pairdetect : Runs the identification of candidate parallel documents. |
187 | 146 | Vassilis Papavassiliou | |
188 | 148 | Vassilis Papavassiliou | -meth : Methods to be used for pair detection, given as a string of letters: a (links between documents), u (patterns in URLs), p (common images and digits combined), i (common images), d (digit sequences), s (structures). |
189 | 146 | Vassilis Papavassiliou | |
190 | 146 | Vassilis Papavassiliou | -u_r : url_replacements. Besides the default patterns, the user can add more patterns separated by ";". |
191 | 146 | Vassilis Papavassiliou | |
192 | 149 | Vassilis Papavassiliou | -ifp : image_fullpath. Represents each image by its full path, instead of its name only, during pair detection. |
193 | 149 | Vassilis Papavassiliou | |
194 | 149 | Vassilis Papavassiliou | -del : delete redundant files. Deletes cesDoc files that have not been paired |
195 | 146 | Vassilis Papavassiliou | |
196 | 145 | Vassilis Papavassiliou | |
197 | 145 | Vassilis Papavassiliou | -align : Runs segment alignment. |
198 | 145 | Vassilis Papavassiliou | |
199 | 145 | Vassilis Papavassiliou | -oft : The full path of the text file containing a list of full paths of the generated TMX files. |
200 | 145 | Vassilis Papavassiliou | |
201 | 145 | Vassilis Papavassiliou | -ofth : The full path of the HTML file containing a list of links pointing to the generated transformed TMX files. |
202 | 143 | Vassilis Papavassiliou | |
203 | 150 | Vassilis Papavassiliou | -tmxmerge : Merges the generated TMX files (i.e. constructs a bilingual corpus). |
204 | 150 | Vassilis Papavassiliou | |
205 | 150 | Vassilis Papavassiliou | -doctypes : Defines the types of the document pairs from which the segment pairs will be selected. The proposed value is "aupidh" since pairs of type "m" and "l" (e.g. eng-1_lav-3_m.xml or eng-2_lav-8_l.xml) are only used for testing or examining the tool. |
206 | 150 | Vassilis Papavassiliou | |
207 | 150 | Vassilis Papavassiliou | -thres : Thresholds for 0:1 alignments per type. The list should have the same length as the types parameter. If a TMX of type X contains more 0:1 segment pairs than the corresponding threshold, it will not be selected. |
208 | 150 | Vassilis Papavassiliou | |
209 | 150 | Vassilis Papavassiliou | -segtypes : Types of segment alignments that will be selected for the final output. The value "1:1" (default) is proposed. If omitted, segments of all types will be processed. Otherwise, list the segment types separated by ";" (e.g. 1:1;1:2;2:1). |
210 | 150 | Vassilis Papavassiliou | |
211 | 150 | Vassilis Papavassiliou | -tmx : A TMX file that includes the filtered segment pairs of the generated TMX files. This is the final output of the process (i.e. the parallel corpus). |
212 | 150 | Vassilis Papavassiliou | |
213 | 150 | Vassilis Papavassiliou | -cc : If present, only document pairs for which a license has been detected will be selected for the merged TMX. |
214 | 150 | Vassilis Papavassiliou | |
215 | 150 | Vassilis Papavassiliou | -metadata : Generates an XML file which contains metadata of the generated corpus. |
216 | 1 | Prokopis Prokopidis | |
217 | 1 | Prokopis Prokopidis | |
218 | 135 | Vassilis Papavassiliou | [//]: # ( ## Input ) |
219 | 135 | Vassilis Papavassiliou | |
220 | 135 | Vassilis Papavassiliou | [//]: # (In case of general monolingual crawls the required input from the user is: ) |
221 | 135 | Vassilis Papavassiliou | [//]: # (* a list of seed URLs (i.e. a text file with one URL per text line). ) |
222 | 135 | Vassilis Papavassiliou | |
223 | 135 | Vassilis Papavassiliou | [//]: # (In case of focused monolingual crawls (i.e. when the crawler visits/processes/stores web pages that are related to a targeted domain), the input should include: ) |
224 | 135 | Vassilis Papavassiliou | [//]: # (* a list of seed URLs pointing to relevant web pages. An example seed URL list for _Environment_ in English can be found at [[ENV_EN_seeds.txt]]. ) |
225 | 135 | Vassilis Papavassiliou | [//]: # (* a list of term triplets (_<relevance,term,subtopic>_) that describe a domain (i.e. this list is required in case the user aims to acquire domain-specific documents) and, optionally, subcategories of this domain. An example domain definition can be found at [[ENV_EN_topic.txt]] for the _Environment_ domain in English. Details on how to construct/bootstrap such lists and how they are used in text to topic classification could be found at this paper http://www.aclweb.org/anthology/W13-2506.pdf ) |
226 | 135 | Vassilis Papavassiliou | |
227 | 135 | Vassilis Papavassiliou | [//]: # (In case of general bilingual crawling, the input from the user includes:) |
228 | 135 | Vassilis Papavassiliou | [//]: # (* a seed URL list which should contain URL(s) from only one web site (e.g. [[ENV_EN_ES_seed.txt]]). The crawler will follow only links pointing to pages inside this web site. However, the user could use the <code> filter </code> parameter (see below) to allow visiting only links pointing to pages either inside versions of the top domain of the URL (e.g. http://www.fifa.com/, http://es.fifa.com/ , etc.) or in different web sites (i.e. in cases the translations are in two web sites e.g. http://www.nrcan.gc.ca and http://www.rncan.gc.ca). Examples of seed URLs can be found at [[seed_examples.txt]]. ) |
229 | 135 | Vassilis Papavassiliou | |
230 | 135 | Vassilis Papavassiliou | [//]: # (In case of focused bilingual crawls, the input should also include: ) |
231 | 135 | Vassilis Papavassiliou | [//]: # (* a list of term triplets (_<relevance,term,subtopic>_) that describe a domain (i.e. this list is required in case the user aims to acquire domain-specific documents) and, optionally, subcategories of this domain in both the targeted languages (i.e. the union of the domain definition in each language). An example domain definition of _Environment_ for the English-Spanish pair can be found at [[ENV_EN_ES_topic.txt]].) |
232 | 135 | Vassilis Papavassiliou | |
233 | 135 | Vassilis Papavassiliou | [//]: # (## Language support ) |
234 | 135 | Vassilis Papavassiliou | |
235 | 135 | Vassilis Papavassiliou | [//]: # (For both monolingual and bilingual crawling, the set of currently supported languages comprises de, el, en, es, fr, ga, hr, it, ja, and pt. ) |
236 | 135 | Vassilis Papavassiliou | |
237 | 135 | Vassilis Papavassiliou | [//]: # (In order to add another language, a developer/user should: ) |
238 | 135 | Vassilis Papavassiliou | [//]: # (* verify that the targeted language is supported by the default language identifier (https://code.google.com/p/language-detection/) integrated in the ILSP-FC, ) |
239 | 135 | Vassilis Papavassiliou | [//]: # (* add a textline with proper content in the [[langKeys.txt]] file which is included in the ilsp-fc runnable jar, and) |
240 | 135 | Vassilis Papavassiliou | [//]: # (* add a proper analyser in the <code>gr.ilsp.fmc.utils.AnalyserFactory</code> class of the ilsp-fc source. ) |
241 | 135 | Vassilis Papavassiliou | |
242 | 135 | Vassilis Papavassiliou | |
243 | 135 | Vassilis Papavassiliou | |
244 | 135 | Vassilis Papavassiliou | [//]: # (## Run a monolingual crawl ) |
245 | 135 | Vassilis Papavassiliou | |
246 | 136 | Vassilis Papavassiliou | [//]: # ( <pre><code>java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar crawlandexport -a vpapa@ilsp.gr -cfg FMC_config.xml -type m -c 10 -lang en -of output_test1_list.txt -ofh output_test1_list.txt.html -tc ENV_EN_topic.txt -u ENV_EN_seeds.txt -f -k -dom Environment</code></pre> ) |
247 | 1 | Prokopis Prokopidis | |
248 | 136 | Vassilis Papavassiliou | [//]: # ( <pre><code>java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar crawlandexport -a test2 -f -k -type m -c 5 -lang es -of output_test2_list.txt -ofh output_test2_list.txt.html -u seed_examples.txt </code></pre> ) |
249 | 1 | Prokopis Prokopidis | |
250 | 136 | Vassilis Papavassiliou | [//]: # (## Run a bilingual crawl ) |
251 | 71 | Vassilis Papavassiliou | |
252 | 136 | Vassilis Papavassiliou | [//]: # ( <pre><code>java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar crawlandexport -a test3 -c 10 -f -k -l1 de -l2 it -of test_HS_DE-IT_output.txt -ofh test_HS_DE-IT_output.txt.html -tc HS_DE-IT_topic.txt -type p -u seed_examples.txt -cfg FBC_config.xml -dom HS -len 0 -mtlen 100 -xslt -oxslt</code></pre> ) |
253 | 1 | Prokopis Prokopidis | |
254 | 136 | Vassilis Papavassiliou | [//]: # ( <pre><code>java -jar ilsp-fc-X.Y.Z-jar-with-dependencies.jar crawlandexport -a test4 -c 10 -f -k -l1 es -l2 en -type p -u seed_examples.txt -filter ".*uefa.com.*" -len 0 -mtlen 80 -xslt -oxslt -dest "/var/crawl_results/" -of test_U_ES-EN_output.txt -ofh test_U_ES-EN_output.txt.html -oft test_U_ES-EN_output.tmx.txt -ofth test_U_ES-EN_output.tmx.html -align hunalign -dict </code></pre> ) |
255 | 123 | Prokopis Prokopidis | |
256 | 136 | Vassilis Papavassiliou | [//]: # ( <pre><code>java -jar ilsp-fc-2.2-jar-with-dependencies.jar crawlandexport -f -a abumatran -type p -align maligna -l1 en -l2 fr -u seed_examples.txt -filter ".*(nrcan|rncan).*" -n 2 -xslt -oxslt -of output_demo_EN-FR.txt -ofh output_demo_EN-FR.txt.html -oft output_demo_EN-FR.tmx.txt -ofth output_demo_EN-FR.tmx.html </code></pre>) |
257 | 122 | Prokopis Prokopidis | |
258 | 136 | Vassilis Papavassiliou | [//]: # ( ## Output ) |
259 | 85 | Vassilis Papavassiliou | |
260 | 136 | Vassilis Papavassiliou | [//]: # (The output of the ilsp-fc in the case of a monolingual crawl consists of: ) |
261 | 136 | Vassilis Papavassiliou | [//]: # (* a list of links pointing to XML files following the cesDOC Corpus Encoding Standard (http://www.xces.org/). See this "cesDoc":http://nlp.ilsp.gr/xslt/ilsp-fc/1.xml file for an example in English for the _Environment_ domain. ) |
262 | 136 | Vassilis Papavassiliou | [//]: # (* a list of links pointing to HTML files (by XSL transformation of each XML) for easier browsing of the collection. As an example, see this "rendered cesDoc":http://nlp.ilsp.gr/xslt/ilsp-fc/1.xml.html file. ) |
263 | 71 | Vassilis Papavassiliou | |
264 | 136 | Vassilis Papavassiliou | [//]: # (The output of the ilsp-fc in the case of a bilingual crawl consists of: ) |
265 | 136 | Vassilis Papavassiliou | [//]: # (* a list of links to XML files following the cesAlign Corpus Encoding Standard for linking cesDoc documents. This example "cesAlign":http://nlp.ilsp.gr/xslt/ilsp-fc/44_98_i.xml file serves as a link between a detected pair of cesDoc documents in "English":http://nlp.ilsp.gr/xslt/ilsp-fc/98.xml and "Spanish":http://nlp.ilsp.gr/xslt/ilsp-fc/44.xml.) |
266 | 136 | Vassilis Papavassiliou | [//]: # (* a list of links pointing to HTML files (by XSL transformation of each cesAlign XML) for easier browsing of the collection. As an example, see this "rendered cesAlign":http://nlp.ilsp.gr/xslt/ilsp-fc/44_98_i.xml.html file.) |
267 | 136 | Vassilis Papavassiliou | [//]: # (* a list of links to TMX files containing sentence alignments that have been extracted from the detected document pairs. As an example, see this "TMX":http://nlp.ilsp.gr/xslt/ilsp-fc/44_98_i.tmx file.) |
268 | 136 | Vassilis Papavassiliou | [//]: # (* a list of links pointing to HTML files (by XSL transformation of each TMX) for easier browsing of the collection. As an example, see this "rendered TMX":http://nlp.ilsp.gr/xslt/ilsp-fc/44_98_i.html file.) |
269 | 137 | Vassilis Papavassiliou | |
270 | 137 | Vassilis Papavassiliou | |
271 | 137 | Vassilis Papavassiliou | ## Running modules of the ILSP-FC |
272 | 137 | Vassilis Papavassiliou | |
273 | 137 | Vassilis Papavassiliou | The ILSP-FC, in a configuration for acquiring parallel data, applies the following processes (one after the other): |
274 | 137 | Vassilis Papavassiliou | * [[Crawl|Crawl]] |
275 | 137 | Vassilis Papavassiliou | * [[Export|Export]] |
276 | 137 | Vassilis Papavassiliou | * [[NearDeduplication|Near Deduplication]] |
277 | 137 | Vassilis Papavassiliou | * [[PairDetection|Pair Detection]] |
278 | 137 | Vassilis Papavassiliou | * [[SegmentAlignment|Segment Alignment]] |
279 | 137 | Vassilis Papavassiliou | * [[TMXmerging|TMX Merging]] |