Version 20 - History - Introduction - ILSP Focused Crawler - ILSP NLP

Prokopis Prokopidis, 2016-02-16 12:48 PM

# Introduction

ILSP Focused Crawler (ILSP-FC) is a research prototype for acquiring domain-specific monolingual and bilingual corpora. ILSP-FC integrates modules for text normalization, language identification, document clean-up, metadata extraction, text classification, identification of bitexts (documents that are translations of each other), alignment of segments, and filtering of segment pairs. A detailed description of each module is available at http://aclweb.org/anthology/W/W13/W13-2506.pdf .

# Workflows

The current version of ILSP-FC lets the user [[GettingStarted|run all relevant processes]] in a pipeline or select a specific subset of them (e.g. alignment and merging only).

In a configuration for acquiring monolingual data, ILSP-FC applies the following processes in sequence:

- crawls the web until an expiration criterion is met (i.e. harvests web pages and stores those that are in the targeted language and, if required, relevant to a targeted topic)
- exports the stored data (i.e. stores the downloaded web pages/documents and, for each one, generates a cesDoc file with its content and metadata)
- discards (near) duplicate documents

In a configuration for acquiring parallel data, it applies the following processes in sequence:

- [[Crawl|crawls]] a website with content in the targeted languages (i.e. harvests the website and stores pages that are in the targeted languages and, if required, relevant to a targeted topic)
- [[Export|exports]] the stored data (i.e. stores each downloaded web page/document and generates a cesDoc file with its content and metadata)
- discards (near) [[NearDeduplication|duplicate]] documents
- identifies [[PairDetection|pairs]] of (candidate) parallel documents and generates a cesAlign file for each detected pair
- [[SegmentAlignment|aligns]] the segments in each detected document pair and generates a TMX file for each pair
- [[TMXmerging|merges]] the TMX files of all document pairs to create the final output, i.e. a TMX that includes all (or a selection of) segment pairs
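
The choice between running the whole pipeline and running a subset of it can be pictured with a small Python sketch. The stage labels below are made up for this illustration and are not ILSP-FC command-line options; the only thing taken from the list above is the order of the processes.

```python
from typing import List, Sequence

# Labels for the parallel-data processes, in the order listed above.
# These names are assumptions of this sketch, not ILSP-FC option names.
PARALLEL_STAGES = ["crawl", "export", "dedup", "pairdetect", "align", "tmxmerge"]


def select_stages(requested: Sequence[str],
                  all_stages: Sequence[str] = PARALLEL_STAGES) -> List[str]:
    """Return the requested stages in pipeline order, rejecting unknown names."""
    unknown = [name for name in requested if name not in all_stages]
    if unknown:
        raise ValueError(f"unknown stage(s): {unknown}")
    wanted = set(requested)
    # Iterating over all_stages keeps the pipeline order regardless of
    # the order in which the user named the stages.
    return [name for name in all_stages if name in wanted]


if __name__ == "__main__":
    # "Alignment and merging only", as in the example above.
    print(select_stages(["tmxmerge", "align"]))
```

Keeping the stage order in one place means a subset is always run in the same sequence as the full pipeline.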

# Input

In general monolingual crawls, ILSP-FC traverses the web and stores web pages/documents with content in the targeted language. The required input from the user is:

- a list of seed URLs (i.e. a text file with one URL per line).
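
As a rough illustration of the seed list format, the following Python sketch parses the content of such a file. Skipping blank lines and insisting on absolute http(s) URLs are choices made for this sketch, not documented ILSP-FC behaviour, and the example.org URLs are placeholders.

```python
from typing import List
from urllib.parse import urlparse


def read_seed_urls(text: str) -> List[str]:
    """Parse seed-file content: one URL per line, blank lines skipped."""
    urls = []
    for line in text.splitlines():
        url = line.strip()
        if not url:
            continue
        parsed = urlparse(url)
        # Assumption of this sketch: seeds must be absolute http(s) URLs.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not an absolute http(s) URL: {url!r}")
        urls.append(url)
    return urls


if __name__ == "__main__":
    sample = "http://www.example.org/a\n\nhttps://www.example.org/b\n"
    print(read_seed_urls(sample))
```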

In focused monolingual crawls (i.e. when the crawler visits, processes and stores web pages that are in the targeted language and related to a targeted domain), the input should include:

- a list of seed URLs pointing to relevant web pages. An example seed URL list for _Environment_ in English can be found at [[ENV_EN_seeds.txt]].
- a list of term triplets (_\<relevance,term,subtopic\>_) that describe a domain and, optionally, subcategories of this domain. This list is required when the user aims to acquire domain-specific documents. An example domain definition for the _Environment_ domain in English can be found at [[ENV_EN_topic.txt]].
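
To make the idea of a term triplet concrete, here is a minimal Python sketch that parses a domain definition. It assumes a hypothetical layout of one comma-separated `relevance,term,subtopic` triplet per line, with an integer relevance and an optional subtopic; the actual layout of files such as [[ENV_EN_topic.txt]] may differ and should be checked there. The terms in the sample are invented.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TermTriplet:
    relevance: int           # assumed to be an integer weight
    term: str
    subtopic: Optional[str]  # None when no subcategory is given


def parse_topic_definition(text: str) -> List[TermTriplet]:
    """Parse lines of the assumed form 'relevance,term[,subtopic]'."""
    triplets = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        parts = [part.strip() for part in line.split(",")]
        if len(parts) not in (2, 3) or not parts[1]:
            raise ValueError(f"line {line_no}: expected relevance,term[,subtopic]")
        subtopic = parts[2] if len(parts) == 3 and parts[2] else None
        triplets.append(TermTriplet(int(parts[0]), parts[1], subtopic))
    return triplets


if __name__ == "__main__":
    sample = "100,air pollution,pollution\n50,recycling\n"
    for triplet in parse_topic_definition(sample):
        print(triplet)
```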

In general bilingual crawls, the input from the user includes:

- a seed URL list that contains URL(s) from only one web site with content in both of the targeted languages (e.g. [[ENV_EN_ES_seed.txt]]). The crawler will follow only links pointing to pages inside this web site. Examples of seed URLs can be found at [[seed_examples.txt]].
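
The single-site constraint on bilingual seeds can be checked before a crawl starts. The Python sketch below treats "the same web site" as "the same hostname", which is an assumption of this example; ILSP-FC's own rule (for instance regarding subdomains) may be different. The function names and example.org URLs are hypothetical.

```python
from typing import Iterable
from urllib.parse import urlparse


def single_site_host(seed_urls: Iterable[str]) -> str:
    """Return the one hostname shared by all seeds, or raise ValueError."""
    hosts = {urlparse(url).hostname for url in seed_urls}
    if len(hosts) != 1 or None in hosts:
        raise ValueError(f"seeds must come from exactly one web site, got: {hosts}")
    return hosts.pop()


def is_internal_link(link: str, site_host: str) -> bool:
    """True if the link points to a page on the seed web site."""
    # urlparse lowercases hostnames, so the comparison is case-insensitive
    # as long as site_host came from single_site_host().
    return urlparse(link).hostname == site_host


if __name__ == "__main__":
    host = single_site_host([
        "http://www.example.org/en/index.html",
        "http://www.example.org/es/index.html",
    ])
    print(host, is_internal_link("http://www.example.org/es/agua.html", host))
```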

In focused bilingual crawls, the input should also include:

- a list of term triplets (_\<relevance,term,subtopic\>_) that describe a domain and, optionally, subcategories of this domain, in both of the targeted languages (i.e. the union of the domain definitions in each language). This list is required when the user aims to acquire domain-specific documents. An example domain definition of _Environment_ for the English-Spanish pair can be found at [[ENV_EN_ES_topic.txt]].

Note that when a thematic website is targeted, the domain-relevance check can very likely be skipped (i.e. constructing and using a list of terms that define the targeted topic might be redundant).

# Output

- cesDoc (lists)
- cesAlign (lists)
- TMX (lists)
- merged TMX (TMX, HTML, MD)
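
For readers who want to consume the TMX output, the following Python sketch extracts segment pairs from TMX content with the standard library. It relies only on the generic TMX structure (`tu`, `tuv` with `xml:lang`, `seg`); it does not reflect any ILSP-FC-specific attributes or properties, and the sample document is invented.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_segment_pairs(tmx_text: str, lang1: str, lang2: str) -> List[Tuple[str, str]]:
    """Return (lang1 segment, lang2 segment) tuples for each translation unit."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            lang = tuv.get(XML_LANG)
            seg = tuv.find("seg")
            if lang and seg is not None:
                # itertext() also collects text inside inline markup.
                segments[lang.lower()] = "".join(seg.itertext()).strip()
        if lang1.lower() in segments and lang2.lower() in segments:
            pairs.append((segments[lang1.lower()], segments[lang2.lower()]))
    return pairs


SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="sketch" creationtoolversion="0" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Clean water</seg></tuv>
      <tuv xml:lang="es"><seg>Agua limpia</seg></tuv>
    </tu>
  </body>
</tmx>"""

if __name__ == "__main__":
    print(read_segment_pairs(SAMPLE_TMX, "en", "es"))
```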

# Documentation

- [[HowToGet|How to get ILSP-FC]]: Learn about the different ways to get ILSP-FC
- [[DeveloperSetup|Developer Setup]]: Learn how to build ILSP-FC
- [[GettingStarted|Getting Started]]: Learn how to run ILSP-FC
- [[Languages-supported|Languages Supported]]: Learn about supported languages
- [[Resources|Resources]]: Browse resources acquired with ILSP-FC
