News

ILSP Focused Crawler: ILSP-FC 2.2.3 has been released

Added by Prokopis Prokopidis over 8 years ago

ILSP-FC 2.2.3 has been released. The source code is available from the Files section of this site. A runnable jar is also available from http://nlp.ilsp.gr/ilsp-fc/ilsp-fc-2.2.3-jar-with-dependencies.jar.

Major changes include:

  • It is now possible to construct bilingual collections from a web domain for all pairs of the targeted languages by running the whole pipeline once. For details, see the example for running bilingual crawls on the Getting Started page of the wiki: http://nlp.ilsp.gr/redmine/projects/ilsp-fc/wiki/Getting_Started/
  • During the merging process that creates one TMX file from a bilingual crawl, identical translation units (TUs), TUs with identical translation unit variants (TUVs), TUVs with no letters, and TUs with different digits are optionally annotated as such
  • All files generated for easier content navigation are now created on the basis of a user-provided basename, so options such as "of" and "ofh" are no longer used
  • Bugs in, among other places, the PairDetector and the TMXMerger have been fixed
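
The annotation categories listed above can be illustrated with a minimal sketch. This is not the TMXMerger code: the class name, method names, and annotation labels are hypothetical, and the digit check assumes that "different digits" means the two variants do not contain the same digits in the same order.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative only: names and labels are hypothetical, not ILSP-FC API. */
final class TuAnnotationSketch {

    /** Both variants carry the same text once surrounding whitespace is ignored. */
    static boolean hasIdenticalTuvs(String source, String target) {
        return source.trim().equals(target.trim());
    }

    /** The segment contains no letter in any script. */
    static boolean hasNoLetters(String segment) {
        return segment.codePoints().noneMatch(Character::isLetter);
    }

    /** Digits of a segment, in order of appearance. */
    static String digitsOf(String segment) {
        StringBuilder digits = new StringBuilder();
        segment.codePoints().filter(Character::isDigit).forEach(digits::appendCodePoint);
        return digits.toString();
    }

    /** Returns the labels that apply to one source/target pair. */
    static List<String> annotate(String source, String target, Set<String> seen) {
        List<String> notes = new ArrayList<>();
        // Set.add returns false when the same pair was recorded earlier.
        if (!seen.add(source + "\t" + target)) {
            notes.add("duplicate-tu");
        }
        if (hasIdenticalTuvs(source, target)) {
            notes.add("identical-tuvs");
        }
        if (hasNoLetters(source) || hasNoLetters(target)) {
            notes.add("no-letters");
        }
        // Assumption: digits must match as an ordered sequence.
        if (!digitsOf(source).equals(digitsOf(target))) {
            notes.add("different-digits");
        }
        return notes;
    }

    public static void main(String[] args) {
        Set<String> seen = new HashSet<>();
        System.out.println(annotate("Room 12", "Zimmer 12", seen)); // []
        System.out.println(annotate("Room 12", "Zimmer 12", seen)); // [duplicate-tu]
        System.out.println(annotate("2016", "2016", seen));         // [identical-tuvs, no-letters]
        System.out.println(annotate("Room 12", "Zimmer 21", seen)); // [different-digits]
    }
}
```

Returning labels instead of dropping pairs mirrors the "optionally annotated" wording: a later step can decide which labels to filter on.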

ILSP Focused Crawler: ILSP-FC 2.2.2 is released

Added by Prokopis Prokopidis almost 9 years ago

ILSP-FC 2.2.2 has been released. The source code is available from the Files section of this site. A runnable jar is also available from http://nlp.ilsp.gr/ilsp-fc/ilsp-fc-2.2.2-jar-with-dependencies.jar.

Major changes include:

  • Added classes for the generation of ELRC-SHARE (http://lr-coordination.eu/) compatible metadata descriptors

  • It is now possible to generate a merged TMX file that includes all segment pairs that are extracted 1) from document pairs detected by specific methods and 2) from document pairs for which the number of 0:1 or 1:0 alignments is smaller than predefined thresholds (one threshold per type). Identical TUs, TUs with identical TUVs, and TUVs with no letters are filtered out during the merging process.

  • Sub-processes (incl. crawl, export, deduplication, pair detection, sentence alignment, tmx merging) can now be called independently. See http://nlp.ilsp.gr/redmine/projects/ilsp-fc/wiki for more details

  • Three-letter language codes are used in extracted datasets and metadata. Methods for converting between two-letter and three-letter language codes are available in the utils package.

  • The aligner can now skip paragraphs that have been detected as ooi-lang

  • New methods for pair detection promote document pairs with common digits and hreflang attributes in links connecting them. De-duplication cannot remove pages participating in pairs resulting from these methods

  • New language-dependent abbreviation lists appear to reduce wrong sentence splits before alignment for many languages

  • Analyzers for new languages and language detection profiles have been added. See http://nlp.ilsp.gr/redmine/projects/ilsp-fc/wiki/Languages-supported for more details.

  • The package structure was reorganized in several places, with gr.ilsp.fc as the top package

  • ILSP-FC requires Java >= 1.7

  • Fixed bugs in, among other places, the XSLTransformer, the ContentNormalizer, the PDFExtractor, the license extractor and the LangDetector

  • ArticleExtractor and NumOfWords methods of integrated Boilerpipe are combined for better boilerplate removal

  • Dependencies for several libraries incl. pdfbox, lucene and bixo were upgraded
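
The conversion between two-letter and three-letter language codes mentioned in the list above can be sketched with the standard java.util.Locale class. This is not the code in the ILSP-FC utils package; the class and method names are hypothetical. Note that Java returns the ISO 639-2/T form (for example "deu" for German), and whether ILSP-FC uses the same variant is an assumption here.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.MissingResourceException;

/** Illustrative only: not the ILSP-FC utils API. */
final class LangCodeSketch {

    // Reverse lookup table built once from the two-letter codes known to the JDK.
    private static final Map<String, String> ISO3_TO_ISO2 = new HashMap<>();

    static {
        for (String iso2 : Locale.getISOLanguages()) {
            try {
                // putIfAbsent keeps the first entry when an old and a new
                // two-letter code share one three-letter code.
                ISO3_TO_ISO2.putIfAbsent(Locale.forLanguageTag(iso2).getISO3Language(), iso2);
            } catch (MissingResourceException e) {
                // No three-letter code available for this entry; skip it.
            }
        }
    }

    /** Two-letter to three-letter code, e.g. "el" to "ell". */
    static String toIso3(String iso2) {
        return Locale.forLanguageTag(iso2).getISO3Language();
    }

    /** Three-letter to two-letter code, e.g. "ell" to "el". */
    static String toIso2(String iso3) {
        String iso2 = ISO3_TO_ISO2.get(iso3);
        if (iso2 == null) {
            throw new IllegalArgumentException("Unknown three-letter code: " + iso3);
        }
        return iso2;
    }

    public static void main(String[] args) {
        System.out.println(toIso3("el"));  // ell
        System.out.println(toIso2("deu")); // de
    }
}
```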

ILSP Focused Crawler: ILSP-FC 2.2 is released

Added by Prokopis Prokopidis over 10 years ago

ILSP-FC 2.2 is released. It's available for download from the Files section of this site (/redmine/projects/ilsp-fc/files).

This version integrates a class that calls HunAlign for extracting sentence alignments from crawled document pairs. The class can be easily extended for the integration of other aligners. Sentence alignments are stored in TMX files.

ILSP Focused Crawler: ILSP-FC 2.01 is released

Added by Prokopis Prokopidis over 10 years ago

ILSP-FC 2.01 is released. It's available for download from the Files section of this site (/redmine/projects/ilsp-fc/files).

This version integrates an advanced PDF extraction module and new configuration options for improving pair detection based on URL similarity.
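
One common way to pair documents by URL similarity is to mask the language marker in each URL and compare what remains. The sketch below shows that idea only. It is not the ILSP-FC pair detector and does not reflect its configuration options; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

/** Illustrative only: a generic URL-matching heuristic, not ILSP-FC code. */
final class UrlPairSketch {

    /**
     * Replaces the language code with "*" wherever it stands as a whole token,
     * i.e. is not surrounded by other letters or digits.
     */
    static String maskLanguage(String url, String lang) {
        String token = "(?<![A-Za-z0-9])" + Pattern.quote(lang) + "(?![A-Za-z0-9])";
        return url.replaceAll(token, "*");
    }

    /** Two distinct URLs are candidates when they agree after masking. */
    static boolean looksLikePair(String url1, String lang1, String url2, String lang2) {
        return !url1.equals(url2)
                && maskLanguage(url1, lang1).equals(maskLanguage(url2, lang2));
    }

    public static void main(String[] args) {
        System.out.println(looksLikePair(
                "http://example.com/en/about.html", "en",
                "http://example.com/el/about.html", "el")); // true
        System.out.println(looksLikePair(
                "http://example.com/en/about.html", "en",
                "http://example.com/el/contact.html", "el")); // false
    }
}
```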

ILSP Focused Crawler: ILSP-FC 1.2.2 is released

Added by Vassilis Papavassiliou over 11 years ago

ILSP-FC 1.2.2 is released. It's available for download from the Files section of this site (/redmine/projects/ilsp-fc/files).

This version fixes errors in the pair detection module (output list of cesAlign docs, pairing based on URLs).

ILSP Focused Crawler: ILSP-FC 1.2.1 released

Added by Prokopis Prokopidis over 11 years ago

ILSP-FC 1.2.1 is released. It's available for download from the Files section of this site (/redmine/projects/ilsp-fc/files).

This version allows a user to export crawl results with the help of an XSLT file, which makes the results easier to examine. The crawler also saves open-content license information in the exported metadata when this information is available in source pages.

ILSP Focused Crawler: ILSP-FC 1.2 released

Added by Prokopis Prokopidis over 11 years ago

ILSP-FC 1.2 is released. It's available for download from the Files section of this site (/redmine/projects/ilsp-fc/files).

This version fixes errors concerning the creation of output directories during crawl jobs.
