Topic Definitions » History » Version 5
Prokopis Prokopidis, 2016-12-06 12:31 PM
1 | 1 | Prokopis Prokopidis | # Topic Definitions |
---|---|---|---|
2 | 1 | Prokopis Prokopidis | |
3 | 1 | Prokopis Prokopidis | A topic definition in the context of ILSP-FC is a list of terms that "define" a topic. This list is provided at runtime in a text file where each line contains a term with the following fields: |
4 | 1 | Prokopis Prokopidis | |
5 | 1 | Prokopis Prokopidis | * weight (an integer, see below) for the term |
6 | 1 | Prokopis Prokopidis | * a `:` separator |
7 | 1 | Prokopis Prokopidis | * a (multi-word) term to be searched in a web document fetched by the crawler (in its main, non-boilerplate content, but also in its title and keywords) |
8 | 1 | Prokopis Prokopidis | * a `=` separator |
9 | 1 | Prokopis Prokopidis | * a string for one or more topics that the term corresponds to, separated by ";" |
10 | 1 | Prokopis Prokopidis | |
11 | 1 | Prokopis Prokopidis | If more than one language is targeted, i.e. if you run a multilingual crawl, the following fields are also needed: |
12 | 1 | Prokopis Prokopidis | * a `>` separator |
13 | 1 | Prokopis Prokopidis | * a 2-letter ISO language code for the language of the term |
14 | 1 | Prokopis Prokopidis | |
15 | 1 | Prokopis Prokopidis | This list is an external resource that must be created before running the crawl. |
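The line format described above can be illustrated with a small parser. This is only a sketch for checking a topic definition file before a crawl, not the parser ILSP-FC itself uses; the names `TopicTerm` and `parse_topic_line` are hypothetical, and the sketch assumes that the first `:` ends the weight and the last `=` starts the topic field.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TopicTerm:
    weight: int
    term: str
    topics: List[str]
    lang: Optional[str] = None  # only present in multilingual definitions


def parse_topic_line(line: str) -> TopicTerm:
    """Parse one 'weight:term=topic1;topic2[>lang]' line (hypothetical helper)."""
    # Assumption: the first ':' ends the weight field.
    weight_str, rest = line.strip().split(":", 1)
    # Assumption: the last '=' separates the term from its topics.
    term, topics_part = rest.rsplit("=", 1)
    lang = None
    if ">" in topics_part:  # optional language field for multilingual crawls
        topics_part, lang = topics_part.rsplit(">", 1)
        lang = lang.strip()
    topics = [t.strip() for t in topics_part.split(";") if t.strip()]
    return TopicTerm(int(weight_str), term.strip(), topics, lang)
```

For example, `parse_topic_line('70:άγρια ζώα=περιβάλλον')` gives a term with weight 70, one topic and no language. A made-up multilingual line such as `100:vaccine=health;medicine>en` gives two topics and the language code `en`.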
16 | 1 | Prokopis Prokopidis | |
17 | 3 | Prokopis Prokopidis | One way to create it is to search for an available term list and manually modify it according to the format above. Suppose for example that a user needs to search for the topic "vaccines". Relevant lists can be downloaded from https://www.vaccines.gov/more_info/glossary/index.html, http://www.cdc.gov/vaccines/terms/glossary.html and/or http://www.lexicon.com.gr/el/main.php. One issue with this approach is assigning appropriate weights to each term. A simple heuristic is to assign the same weight (e.g. 100) to each term and assign lower weights to terms that may be ambiguous, i.e. terms that may correspond to multiple topics. |
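The weighting heuristic can be sketched as follows. The function name `build_definition`, the lower weight of 50 and the sample terms are assumptions chosen for illustration; only the base weight of 100 comes from the text above.

```python
def build_definition(terms, topic, ambiguous=(), base_weight=100, ambiguous_weight=50):
    """Return topic definition lines in the 'weight:term=topic' format.

    Terms listed in `ambiguous` get the lower weight (50 is an assumed value);
    all other terms get the same base weight.
    """
    ambiguous = set(ambiguous)
    lines = []
    for term in terms:
        weight = ambiguous_weight if term in ambiguous else base_weight
        lines.append(f"{weight}:{term}={topic}")
    return lines
```

Calling `build_definition(["vaccine", "immunization", "shot"], "vaccines", ambiguous={"shot"})` would give the ambiguous term "shot" a weight of 50 and the other two terms a weight of 100. The resulting lines can then be written to the text file that is passed to the crawler.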
18 | 1 | Prokopis Prokopidis | |
19 | 2 | Prokopis Prokopidis | Here's an extract for the `environment` topic for a monolingual crawl for Greek content. |
20 | 1 | Prokopis Prokopidis | |
21 | 2 | Prokopis Prokopidis | ``` |
22 | 1 | Prokopis Prokopidis | 70:"πράσινο" σήμα=περιβάλλον |
23 | 1 | Prokopis Prokopidis | 70:άγρια ζώα=περιβάλλον |
24 | 1 | Prokopis Prokopidis | 100:άγρια φυτά και ζώα=περιβάλλον |
25 | 1 | Prokopis Prokopidis | 70:άγριο θηλαστικό=περιβάλλον |
26 | 1 | Prokopis Prokopidis | 50:αγροτική καταστροφή=περιβάλλον |
27 | 1 | Prokopis Prokopidis | 50:άδεια για κυνήγι=περιβάλλον |
28 | 1 | Prokopis Prokopidis | 50:άδεια θήρας=περιβάλλον |
29 | 1 | Prokopis Prokopidis | 50:άδεια κυνηγίου=περιβάλλον |
30 | 1 | Prokopis Prokopidis | 50:άδεια μπαταρία=περιβάλλον |
31 | 1 | Prokopis Prokopidis | 100:άδεια ρύπανσης=περιβάλλον |
32 | 1 | Prokopis Prokopidis | 25:αειφόρος ανάπτυξη=περιβάλλον |
33 | 1 | Prokopis Prokopidis | 25:αέρια εξάτμισης αυτοκινήτων=περιβάλλον |
34 | 1 | Prokopis Prokopidis | 100:αέριο που προκαλεί το φαινόμενο του θερμοκηπίου=περιβάλλον |
35 | 1 | Prokopis Prokopidis | 100:αέριο που φθείρει το στρώμα του όζοντος=περιβάλλον |
36 | 2 | Prokopis Prokopidis | ``` |
37 | 1 | Prokopis Prokopidis | |
38 | 5 | Prokopis Prokopidis | At runtime, the crawler's classifier will be initialized with a stemmed version of the topic definition: |
39 | 5 | Prokopis Prokopidis | |
40 | 4 | Prokopis Prokopidis | |
41 | 4 | Prokopis Prokopidis | ``` |
42 | 1 | Prokopis Prokopidis | 70 πρασιν σημ περιβάλλον ell "πράσινο" σήμα |
43 | 1 | Prokopis Prokopidis | 70 αγρ ζωα περιβάλλον ell άγρια ζώα |
44 | 1 | Prokopis Prokopidis | 100 αγρ φυτ ζωα περιβάλλον ell άγρια φυτά και ζώα |
45 | 1 | Prokopis Prokopidis | 70 αγρι θηλαστ περιβάλλον ell άγριο θηλαστικό |
46 | 1 | Prokopis Prokopidis | 50 αγροτικ καταστροφ περιβάλλον ell αγροτική καταστροφή |
47 | 1 | Prokopis Prokopidis | 50 αδει θηρ περιβάλλον ell άδεια θήρας |
48 | 1 | Prokopis Prokopidis | 50 αδει κυνηγ περιβάλλον ell άδεια για κυνήγι |
49 | 1 | Prokopis Prokopidis | 50 αδει μπαταρ περιβάλλον ell άδεια μπαταρία |
50 | 1 | Prokopis Prokopidis | 100 αδει ρυπανσ περιβάλλον ell άδεια ρύπανσης |
51 | 1 | Prokopis Prokopidis | 25 αειφορ αναπτυξ περιβάλλον ell αειφόρος ανάπτυξη |
52 | 1 | Prokopis Prokopidis | 25 αερ εξατμισ αυτοκινητ περιβάλλον ell αέρια εξάτμισης αυτοκινήτων |
53 | 1 | Prokopis Prokopidis | 100 αερι προκαλ φαινομεν θερμοκηπ περιβάλλον ell αέριο που προκαλεί το φαινόμενο του θερμοκηπίου |
54 | 1 | Prokopis Prokopidis | 100 αερι φθειρ στρωμ οζοντ περιβάλλον ell αέριο που φθείρει το στρώμα του όζοντος |
55 | 4 | Prokopis Prokopidis | ``` |
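Part of this normalization can be approximated with the sketch below: lowercasing, removing accents and punctuation, and dropping function words. It does not reproduce the stemming step, so it yields `πρασινο σημα` rather than the stem sequence `πρασιν σημ` shown above. The stopword set and the function name `normalize_term` are hypothetical and are not taken from ILSP-FC.

```python
import re
import unicodedata

# Hypothetical, accent-free stopword list; the crawler's own list is not shown here.
STOPWORDS = {"και", "που", "το", "του", "για"}


def normalize_term(term: str) -> str:
    """Lowercase, strip accents and punctuation, and drop stopwords (no stemming)."""
    # NFD splits accented letters into a base letter plus combining marks.
    text = unicodedata.normalize("NFD", term.lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    tokens = re.findall(r"\w+", text)  # keeps word characters, drops quotes etc.
    return " ".join(t for t in tokens if t not in STOPWORDS)
```

A stemmer for the target language would then be applied to each remaining token to obtain the forms listed above.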
56 | 1 | Prokopis Prokopidis | |
57 | 1 | Prokopis Prokopidis | - The main content of each web page that has passed through the language identifier and has been found to be in a targeted language undergoes the same processing, which yields the normalized main content. |
58 | 1 | Prokopis Prokopidis | |
59 | 1 | Prokopis Prokopidis | - The normalized main content is compared with the normalized terms and the score is calculated |
60 | 1 | Prokopis Prokopidis | (see section 3.5 Text Classifier in http://aclweb.org/anthology/W/W13/W13-2506.pdf). |
61 | 1 | Prokopis Prokopidis | |
62 | 1 | Prokopis Prokopidis | |
63 | 1 | Prokopis Prokopidis | The required thresholds are defined in the configuration file: |
64 | 1 | Prokopis Prokopidis | |
65 | 1 | Prokopis Prokopidis | <classifier> |
66 | 1 | Prokopis Prokopidis | <min_content_terms> |
67 | 1 | Prokopis Prokopidis | <value>4</value> |
68 | 1 | Prokopis Prokopidis | <description>Minimum number of terms that must exist in clean |
69 | 1 | Prokopis Prokopidis | content of each web page in order to be stored. This number |
70 | 1 | Prokopis Prokopidis | is multiplied with the median value of the terms' weights and |
71 | 1 | Prokopis Prokopidis | the result is the threshold for the absolute relevance score.</description> |
72 | 1 | Prokopis Prokopidis | </min_content_terms> |
73 | 1 | Prokopis Prokopidis | <min_unique_content_terms> |
74 | 1 | Prokopis Prokopidis | <value>4</value> |
75 | 1 | Prokopis Prokopidis | <description>Minimum unique terms that must exist in clean content</description> |
76 | 1 | Prokopis Prokopidis | </min_unique_content_terms> |
77 | 1 | Prokopis Prokopidis | <relative_relevance_threshold> |
78 | 1 | Prokopis Prokopidis | <value>0.2</value> |
79 | 1 | Prokopis Prokopidis | <description>The absolute relevance score is divided by the length |
80 | 1 | Prokopis Prokopidis | (in terms of tokens) of the clean content of a document and the |
81 | 1 | Prokopis Prokopidis | calculated relative relevance score is compared with this value</description> |
82 | 1 | Prokopis Prokopidis | </relative_relevance_threshold> |
83 | 1 | Prokopis Prokopidis | |
84 | 1 | Prokopis Prokopidis | </classifier> |
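The way the three thresholds interact can be sketched as below. This follows only the descriptions in the configuration above and is not the crawler's actual classifier. The sketch assumes that the absolute relevance score is the sum of the weights of all term occurrences found in the content (the cited paper describes the real scoring) and that scores equal to a threshold pass. The function name `passes_thresholds` and its input format are hypothetical.

```python
import statistics


def passes_thresholds(matches, n_tokens, all_weights,
                      min_content_terms=4,
                      min_unique_content_terms=4,
                      relative_relevance_threshold=0.2):
    """Decide whether a page would be kept under the configured thresholds.

    matches:     list of (term, weight) pairs, one per occurrence found in the
                 clean content (assumed input format)
    n_tokens:    length of the clean content in tokens
    all_weights: weights of all terms in the topic definition
    """
    if n_tokens <= 0:
        return False
    # Assumption: absolute score = sum of the weights of all matched occurrences.
    absolute_score = sum(weight for _, weight in matches)
    # From the config: min_content_terms times the median of the term weights.
    absolute_threshold = min_content_terms * statistics.median(all_weights)
    unique_terms = len({term for term, _ in matches})
    # From the config: absolute score divided by the content length in tokens.
    relative_score = absolute_score / n_tokens
    return (absolute_score >= absolute_threshold
            and unique_terms >= min_unique_content_terms
            and relative_score >= relative_relevance_threshold)
```

With the 14 weights of the Greek extract above, the median is 60, so the default `min_content_terms` of 4 gives an absolute relevance threshold of 240.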