Prokopis Prokopidis, 2016-12-06 12:28 PM
# Topic Definitions

In the context of ILSP-FC, a topic definition is a list of terms that "define" a topic. The list is provided at runtime as a text file in which each line describes one term using the following fields, in order:

* a weight for the term (an integer, see below)
* a `:` separator
* a (possibly multi-word) term to be searched for in each web document fetched by the crawler (in its main, non-boilerplate content, but also in its title and keywords)
* a `=` separator
* the topic or topics that the term corresponds to, separated by `;`

If more than one language is targeted, i.e. if you run a multilingual crawl, the following fields are also needed:
- a `>` separator
- a 2-letter ISO language code for the language of the term
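
The line format described above can be read with a few string splits. The following is a minimal sketch, not ILSP-FC's own parser: the names `TopicTerm` and `parse_topic_line` are hypothetical, and the `vaccine` line in the usage example is made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TopicTerm:
    weight: int
    term: str
    topics: List[str]
    language: Optional[str] = None  # only set for multilingual definitions


def parse_topic_line(line: str) -> TopicTerm:
    """Parse one line of the form weight:term=topic1;topic2[>lang]."""
    # The weight is everything before the first ":".
    weight_text, rest = line.strip().split(":", 1)
    # The topics follow the last "=", so the term keeps any earlier characters.
    term, separator, tail = rest.rpartition("=")
    if not separator:
        raise ValueError(f"missing '=' separator in line: {line!r}")
    # The optional language code follows ">".
    topics_text, _, language = tail.partition(">")
    topics = [t.strip() for t in topics_text.split(";") if t.strip()]
    return TopicTerm(int(weight_text), term.strip(), topics, language.strip() or None)
```

For example, `parse_topic_line("100:vaccine=health;medicine>en")` yields a weight of 100, the term `vaccine`, the topics `health` and `medicine`, and the language `en`, while a monolingual line leaves `language` as `None`.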

This list is an external resource that must be created before running the crawl.

One way to create it is to search for an available term list and manually adapt it to the format above. Suppose, for example, that a user needs to crawl for the topic "vaccines". Relevant lists can be downloaded from https://www.vaccines.gov/more_info/glossary/index.html, http://www.cdc.gov/vaccines/terms/glossary.html and/or http://www.lexicon.com.gr/el/main.php. One issue with this approach is assigning an appropriate weight to each term. A simple heuristic is to assign the same weight (e.g. 100) to every term, and then assign lower weights to terms that may be ambiguous, i.e. terms that may correspond to multiple topics.
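
The weighting heuristic above can be sketched in a few lines. This is an illustration only: the function name `build_topic_definition`, the lower weight of 50 and the sample terms are assumptions, not values prescribed by ILSP-FC.

```python
def build_topic_definition(terms, topic, ambiguous=(), default_weight=100, ambiguous_weight=50):
    """Return topic definition lines in the weight:term=topic format.

    Every term gets default_weight, except the terms listed in `ambiguous`,
    which get the lower ambiguous_weight (an assumed value).
    """
    ambiguous_terms = set(ambiguous)
    lines = []
    for term in terms:
        weight = ambiguous_weight if term in ambiguous_terms else default_weight
        lines.append(f"{weight}:{term}={topic}")
    return lines


# Hypothetical terms for the "vaccines" topic; "shot" is treated as ambiguous.
definition = build_topic_definition(["vaccine", "immunity", "shot"], "vaccines", ambiguous=["shot"])
print("\n".join(definition))
```

The resulting lines can be written to a UTF-8 text file and reviewed by hand before the crawl.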

Here is an extract from a topic definition for the `environment` topic, for a monolingual crawl of Greek content.

```
70:"πράσινο" σήμα=περιβάλλον
70:άγρια ζώα=περιβάλλον
100:άγρια φυτά και ζώα=περιβάλλον
70:άγριο θηλαστικό=περιβάλλον
50:αγροτική καταστροφή=περιβάλλον
50:άδεια για κυνήγι=περιβάλλον
50:άδεια θήρας=περιβάλλον
50:άδεια κυνηγίου=περιβάλλον
50:άδεια μπαταρία=περιβάλλον
100:άδεια ρύπανσης=περιβάλλον
25:αειφόρος ανάπτυξη=περιβάλλον
25:αέρια εξάτμισης αυτοκινήτων=περιβάλλον
100:αέριο που προκαλεί το φαινόμενο του θερμοκηπίου=περιβάλλον
100:αέριο που φθείρει το στρώμα του όζοντος=περιβάλλον
```

Classification works as follows:
- The topic definition is passed through an analyzer for the target language and is kept in the following normalized form (weight, normalized term, topic, language, original term):

      70 πρασιν σημ περιβάλλον ell "πράσινο" σήμα
      70 αγρ ζωα περιβάλλον ell άγρια ζώα
      100 αγρ φυτ ζωα περιβάλλον ell άγρια φυτά και ζώα
      70 αγρι θηλαστ περιβάλλον ell άγριο θηλαστικό
      50 αγροτικ καταστροφ περιβάλλον ell αγροτική καταστροφή
      50 αδει θηρ περιβάλλον ell άδεια θήρας
      50 αδει κυνηγ περιβάλλον ell άδεια για κυνήγι
      50 αδει μπαταρ περιβάλλον ell άδεια μπαταρία
      100 αδει ρυπανσ περιβάλλον ell άδεια ρύπανσης
      25 αειφορ αναπτυξ περιβάλλον ell αειφόρος ανάπτυξη
      25 αερ εξατμισ αυτοκινητ περιβάλλον ell αέρια εξάτμισης αυτοκινήτων
      100 αερι προκαλ φαινομεν θερμοκηπ περιβάλλον ell αέριο που προκαλεί το φαινόμενο του θερμοκηπίου
      100 αερι φθειρ στρωμ οζοντ περιβάλλον ell αέριο που φθείρει το στρώμα του όζοντος

- The main content of each web page that has passed through the language identifier and has been found to be "in a targeted language" undergoes the same processing, which yields the normalized main content.

- The normalized main content is compared with the normalized terms and the score is calculated (see section 3.5, Text Classifier, in http://aclweb.org/anthology/W/W13/W13-2506.pdf).
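
The normalize-then-compare steps above can be illustrated with a deliberately simplified sketch. It is not the ILSP-FC classifier: the toy `normalize` function only lowercases and strips accents and punctuation (the real analyzer also stems words and drops function words, as the normalized listing shows), and the score here is simply the sum of weight times occurrence count, which is an assumption; the actual formula is the one given in the paper cited above.

```python
import re
import unicodedata


def normalize(text):
    """Toy normalizer: lowercase, strip accents and punctuation, tokenize."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    without_accents = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return re.findall(r"\w+", without_accents)


def count_occurrences(term_tokens, content_tokens):
    """Count how often the token sequence of a term appears in the content."""
    n = len(term_tokens)
    if n == 0:
        return 0
    return sum(1 for i in range(len(content_tokens) - n + 1)
               if content_tokens[i:i + n] == term_tokens)


def score_content(content, weighted_terms):
    """Return (absolute score, number of unique terms found) for a text."""
    content_tokens = normalize(content)
    total, unique_found = 0, 0
    for weight, term in weighted_terms:
        hits = count_occurrences(normalize(term), content_tokens)
        if hits:
            unique_found += 1
            total += weight * hits
    return total, unique_found


terms = [(70, "άγρια ζώα"), (50, "άδεια θήρας"), (100, "άδεια ρύπανσης")]
text = "Τα άγρια ζώα και η άδεια θήρας: άγρια ζώα παντού"
print(score_content(text, terms))  # (190, 2)
```

In this made-up sentence, "άγρια ζώα" occurs twice and "άδεια θήρας" once, so the sketch returns a score of 2 × 70 + 50 = 190 with two unique terms found.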

The required thresholds are defined in the configuration file:

    <classifier>
        <min_content_terms>
            <value>4</value>
            <description>Minimum number of terms that must exist in clean
            content of each web page in order to be stored. This number
            is multiplied with the median value of the terms' weights and
            the result is the threshold for the absolute relevance score.</description>
        </min_content_terms>
        <min_unique_content_terms>
            <value>4</value>
            <description>Minimum unique terms that must exist in clean content</description>
        </min_unique_content_terms>
        <relative_relevance_threshold>
            <value>0.2</value>
            <description>The absolute relevance score is divided by the length
            (in terms of tokens) of the clean content of a document and the
            calculated relative relevance score is compared with this value</description>
        </relative_relevance_threshold>
    </classifier>
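
Going by the descriptions in this configuration, the three checks could be combined as in the sketch below. The function name `passes_thresholds`, the use of `>=` for each comparison and the requirement that all three checks pass are assumptions; the weights are those of the `environment` extract above.

```python
from statistics import median


def passes_thresholds(absolute_score, unique_terms_found, content_length, term_weights,
                      min_content_terms=4, min_unique_content_terms=4,
                      relative_relevance_threshold=0.2):
    """Apply the three classifier thresholds to one document (assumed logic)."""
    # Absolute threshold: min_content_terms times the median term weight.
    absolute_threshold = min_content_terms * median(term_weights)
    if absolute_score < absolute_threshold:
        return False
    if unique_terms_found < min_unique_content_terms:
        return False
    if content_length == 0:
        return False
    # Relative score: absolute score divided by the content length in tokens.
    return absolute_score / content_length >= relative_relevance_threshold


# Weights of the 14 terms in the extract; their median is 60, so the
# absolute threshold with min_content_terms = 4 is 240.
weights = [70, 70, 100, 70, 50, 50, 50, 50, 50, 100, 25, 25, 100, 100]
print(passes_thresholds(300, 4, 1000, weights))  # True: 300 >= 240, 4 >= 4, 0.3 >= 0.2
```

Under these assumptions, a 2000-token document with the same score of 300 would be rejected, because its relative relevance score of 0.15 falls below 0.2.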