
Introduction

ILSP Focused Crawler (ILSP-FC) is a research prototype for acquiring domain-specific monolingual and bilingual corpora. ILSP-FC integrates modules for text normalization, language identification, document clean-up, metadata extraction, text classification, identification of bitexts (documents that are translations of each other), alignment of segments, and filtering of segment pairs. A detailed description of each module is available at http://aclweb.org/anthology/W/W13/W13-2506.pdf .

Workflows

The current version of ILSP-FC offers the user the option to run all relevant processes in a pipeline or to select a specific process (e.g. export, deduplication, or pair detection).

In a configuration for acquiring monolingual data, ILSP-FC applies the following processes (one after the other):

  • crawls the web until an expiration criterion is met (i.e. harvests webpages and stores the ones that are in the targeted language, and relevant to a targeted topic if required)
  • exports the stored data (i.e. stores downloaded web pages/documents and for each page generates a cesDoc file with its content and metadata).
  • discards (near) duplicate documents
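The near-duplicate step can be illustrated with a minimal sketch based on word shingling and Jaccard similarity. This is only an illustration of the general technique; ILSP-FC's actual deduplication module may use a different method (see the paper linked above), and the threshold value here is arbitrary.

```python
# Illustrative near-duplicate detection via word shingles and Jaccard
# similarity; NOT ILSP-FC's actual implementation.

def shingles(text, n=3):
    """Return the set of n-word shingles of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def deduplicate(docs, threshold=0.7):
    """Keep a document only if its shingle overlap with every
    previously kept document stays below the threshold."""
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc)
        if all(jaccard(s, k) < threshold for k in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept
```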

In a configuration for acquiring parallel data, it applies the following processes (one after the other):

  • crawls a website with content in the targeted languages (i.e. harvests the website and stores pages that are in the targeted languages, and relevant to a targeted topic if required)
  • exports the stored data (i.e. stores each downloaded web page/document and generates a cesDoc file with its content and metadata).
  • discards (near) duplicate documents
  • identifies pairs of (candidate) parallel documents and generates a cesAlign file for each detected pair.
  • aligns the segments in each detected document pair and generates a TMX for each document pair
  • merges TMX files corresponding to each document pair in order to create the final output, i.e. a TMX that includes all (or a selection of) segment pairs
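One common heuristic for the pair-detection step is to pair pages whose URLs differ only in a language code. The sketch below shows that idea purely for illustration; ILSP-FC's pair detection combines richer evidence (see the paper linked in the Introduction), and the URL scheme and language codes here are assumptions.

```python
import re

# Illustrative heuristic: pair URLs that become identical once the
# language code in the path is removed. Only a sketch; ILSP-FC uses
# additional features for pair detection.

def normalize(url, lang_codes=("en", "es")):
    """Collapse a /en/ or /es/ path segment so that translations of
    the same page map to the same key."""
    pattern = r"/(?:%s)(?=/)" % "|".join(lang_codes)
    return re.sub(pattern, "/LANG", url)

def detect_pairs(urls):
    """Group URLs by their normalized form; groups with more than one
    member are candidate parallel documents."""
    groups = {}
    for url in urls:
        groups.setdefault(normalize(url), []).append(url)
    return [g for g in groups.values() if len(g) > 1]
```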

Input

In case of general monolingual crawls, ILSP-FC traverses the web and stores web pages/documents with content in the targeted language. The required input from the user is:

  • a list of seed URLs (i.e. a text file with one URL per text line).

In case of focused monolingual crawls (i.e. when the crawler visits/processes/stores web pages that are in the targeted language and related to a targeted domain), the input should include:

  • a list of seed URLs pointing to relevant web pages. An example seed URL list for Environment in English can be found at ENV_EN_seeds.txt.
  • a list of term triplets (<relevance,term,subtopic>) that describe a domain (i.e. this list is required in case the user aims to acquire domain-specific documents) and, optionally, subcategories of this domain. An example domain definition can be found at ENV_EN_topic.txt for the Environment domain in English.
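The triplet format can be illustrated with a hypothetical domain definition, one <relevance,term,subtopic> triplet per line. The terms and relevance weights below are invented for illustration; consult ENV_EN_topic.txt for a real domain definition.

```
# hypothetical example; see ENV_EN_topic.txt for an actual definition
10,renewable energy,energy
8,solar panel,energy
9,water pollution,pollution
7,waste management,waste
```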

In case of general bilingual crawling, the input from the user includes:

  • a seed URL list, which should contain URL(s) from only one website with content in both of the targeted languages (e.g. ENV_EN_ES_seed.txt). The crawler will follow only links pointing to pages inside this website. Examples of seed URLs can be found at seed_examples.txt.

In case of focused bilingual crawls, the input should also include:

  • a list of term triplets (<relevance,term,subtopic>) that describe a domain (i.e. this list is required in case the user aims to acquire domain-specific documents) and, optionally, subcategories of this domain in both the targeted languages (i.e. the union of the domain definition in each language). An example domain definition of Environment for the English-Spanish pair can be found at ENV_EN_ES_topic.txt. Note that in case a thematic website is targeted, the domain-relevance check can most likely be skipped (i.e. constructing and using a list of terms that define the targeted topic might be redundant).

Output

Each module of the tool provides its own output which feeds the next module in the pipeline:

Crawl: Creates the "run" directories (i.e. directories containing all resources fetched/extracted/used/required for each cycle of this crawl). (See setting dest)

Export:

  • a list of links pointing to XML files following the cesDoc Corpus Encoding Standard (http://www.xces.org/). See this cesDoc file for an example in English for the Environment domain.
  • a list of links pointing to HTML files (by XSL transformation of each XML) for easier browsing of the collection. As an example, see this rendered cesDoc file.
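A cesDoc file pairs the extracted text with its metadata. The following rough sketch generates a minimal cesDoc-like record with Python's standard library; the element names are simplified assumptions, and the real XCES schema (http://www.xces.org/) defines a much richer header.

```python
import xml.etree.ElementTree as ET

# Sketch of emitting a minimal cesDoc-like record; the actual cesDoc
# schema defines many more header elements than shown here.

def build_cesdoc(url, language, paragraphs):
    """Serialize a simplified cesDoc-style XML string for one page."""
    root = ET.Element("cesDoc", version="0.4")
    header = ET.SubElement(root, "cesHeader")
    ET.SubElement(header, "sourceDesc").text = url
    ET.SubElement(header, "language").text = language
    body = ET.SubElement(ET.SubElement(root, "text"), "body")
    for text in paragraphs:
        ET.SubElement(body, "p").text = text
    return ET.tostring(root, encoding="unicode")
```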

Pair detect:

  • a list of links to XML files following the cesAlign Corpus Encoding Standard for linking cesDoc documents. This example cesAlign file serves as a link between a detected pair of cesDoc documents in English and Spanish.
  • a list of links pointing to HTML files (by XSL transformation of each cesAlign XML) for easier browsing of the collection. As an example, see this rendered cesAlign file.

Segment Alignment:

  • a list of links to TMX files containing sentence alignments that have been extracted from the detected document pairs. As an example, see this TMX file.
  • a list of links pointing to HTML files (by XSL transformation of each TMX) for easier browsing of the collection. As an example, see this rendered TMX file.
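For reference, a TMX file stores each segment pair as a translation unit (`tu`) with one variant (`tuv`) per language. The fragment below is a hypothetical English-Spanish pair in TMX 1.4 layout, not content taken from the linked example file:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0"
          segtype="sentence" o-tmf="plain" adminlang="en"
          srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Climate change affects water resources.</seg></tuv>
      <tuv xml:lang="es"><seg>El cambio climático afecta a los recursos hídricos.</seg></tuv>
    </tu>
  </body>
</tmx>
```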

TMX merging:

  • a TMX file that includes filtered segment pairs of the generated TMX files. This is the final output of the process (i.e. the parallel corpus). As an example, see this TMX file.
  • an HTML file (by XSL transformation of the TMX file). As an example, see this rendered TMX file.
  • an XML file which contains metadata of the generated corpus. As an example, see this XML file.
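The merging step described above can be sketched with the standard library: parse each per-pair TMX, keep every distinct (language, segment) combination once, and emit one merged body. This is only an illustration; ILSP-FC's merger also applies its own filtering criteria to the segment pairs, and the sketch omits the TMX header for brevity.

```python
import xml.etree.ElementTree as ET

# Sketch: merge several TMX documents into one, keeping each distinct
# segment pair once. Header handling and ILSP-FC's quality filtering
# are omitted.

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def merge_tmx(tmx_strings):
    """Return a merged <tmx> element containing the union of all
    translation units, deduplicated by (language, segment) content."""
    merged = ET.Element("tmx", version="1.4")
    body = ET.SubElement(merged, "body")
    seen = set()
    for data in tmx_strings:
        for tu in ET.fromstring(data).iter("tu"):
            key = tuple(sorted((tuv.get(XML_LANG), tuv.findtext("seg"))
                               for tuv in tu.iter("tuv")))
            if key not in seen:
                seen.add(key)
                body.append(tu)
    return merged
```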

Documentation