ILSP-FC 2.2.2 is released

Added by Prokopis Prokopidis about 8 years ago

ILSP-FC 2.2.2 has been released. The source code is available from the Files section of this site. A runnable jar is also available from http://nlp.ilsp.gr/ilsp-fc/ilsp-fc-2.2.2-jar-with-dependencies.jar.

Major changes include:

  • Added classes for the generation of ELRC-SHARE (http://lr-coordination.eu/) compatible metadata descriptors

  • It is now possible to generate a merged TMX file that includes all segment pairs extracted 1) from document pairs detected by specific methods and 2) from document pairs for which the number of 0:1 or 1:0 alignments is smaller than predefined thresholds (one threshold per type). Identical TUs, TUs with identical TUVs, and TUs containing TUVs with no letters are filtered out during the merging process.

  • Sub-processes (incl. crawl, export, deduplication, pair detection, sentence alignment, TMX merging) can now be called independently. See http://nlp.ilsp.gr/redmine/projects/ilsp-fc/wiki for more details.

  • Three-letter language codes are now used in extracted datasets and metadata. Methods for converting between two-letter and three-letter language codes are available in the utils package.

  • The aligner can now skip paragraphs that have been detected as ooi-lang

  • New methods for pair detection promote document pairs that share common digits or are connected by links with hreflang attributes. De-duplication cannot remove pages that participate in pairs detected by these methods.

  • New, language-dependent abbreviation lists seem to reduce the number of wrong sentence splits before alignment for many languages

  • Analyzers for new languages and language detection profiles have been added. See http://nlp.ilsp.gr/redmine/projects/ilsp-fc/wiki/Languages-supported for more details.

  • The package structure was reorganized in several places, with gr.ilsp.fc as the top package

  • ILSP-FC requires Java >= 1.7

  • Fixed bugs in, among other places, the XSLTransformer, the ContentNormalizer, the PDFExtractor, the license extractor and the LangDetector

  • The ArticleExtractor and NumOfWords methods of the integrated Boilerpipe are now combined for better boilerplate removal

  • Dependencies for several libraries incl. pdfbox, lucene and bixo were upgraded
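
The filtering applied during TMX merging (see the merged TMX item above) can be illustrated with a minimal sketch. This is not the ILSP-FC implementation: the class TmxMergeFilterSketch and its SegmentPair type are hypothetical, and the sketch assumes that "TUs with identical TUVs" means a TU whose source and target segments are the same string, and that a TU is dropped when either of its TUVs contains no letters.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; names do not come from the ILSP-FC code base.
class TmxMergeFilterSketch {

    /** Minimal stand-in for a translation unit (TU) with two variants (TUVs). */
    static final class SegmentPair {
        final String source;
        final String target;

        SegmentPair(String source, String target) {
            this.source = source;
            this.target = target;
        }
    }

    /** Returns true if the text contains at least one Unicode letter. */
    static boolean hasLetter(String text) {
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (Character.isLetter(codePoint)) {
                return true;
            }
            i += Character.charCount(codePoint);
        }
        return false;
    }

    /** Keeps the first occurrence of each acceptable pair, in input order. */
    static List<SegmentPair> filter(List<SegmentPair> pairs) {
        Set<String> seen = new HashSet<>();
        List<SegmentPair> kept = new ArrayList<>();
        for (SegmentPair pair : pairs) {
            String source = pair.source.trim();
            String target = pair.target.trim();
            // Assumed reading of "TUs with identical TUVs": source equals target.
            if (source.equals(target)) {
                continue;
            }
            // Assumed reading of "TUVs with no letters": drop the TU if either side has none.
            if (!hasLetter(source) || !hasLetter(target)) {
                continue;
            }
            // Identical TUs: same source and same target seen before.
            String key = source + '\u0000' + target;
            if (seen.add(key)) {
                kept.add(new SegmentPair(source, target));
            }
        }
        return kept;
    }
}
```

The thresholds on 0:1 and 1:0 alignments operate at the document-pair level and are not covered by this sketch.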

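For readers who only need the kind of two-letter to three-letter language code conversion mentioned in the list above, the following sketch relies on the standard java.util.Locale class rather than on the ILSP-FC utils package, whose method names are not given here. The class LangCodeSketch is hypothetical, and the three-letter codes returned by the JDK may differ from the ones ILSP-FC writes for some languages.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.MissingResourceException;

// Hypothetical helper built on the JDK only; not part of ILSP-FC.
class LangCodeSketch {

    // Reverse lookup table, filled once from the JDK's list of two-letter codes.
    private static final Map<String, String> THREE_TO_TWO = new HashMap<>();

    static {
        for (String two : Locale.getISOLanguages()) {
            try {
                String three = Locale.forLanguageTag(two).getISO3Language();
                if (!three.isEmpty()) {
                    THREE_TO_TWO.put(three, two);
                }
            } catch (MissingResourceException e) {
                // The JDK has no three-letter code for this language; skip it.
            }
        }
    }

    /** Two-letter code to three-letter code, e.g. "en" to "eng". */
    static String toThreeLetter(String twoLetterCode) {
        return Locale.forLanguageTag(twoLetterCode).getISO3Language();
    }

    /** Three-letter code to two-letter code, or null if the code is unknown. */
    static String toTwoLetter(String threeLetterCode) {
        return THREE_TO_TWO.get(threeLetterCode);
    }
}
```
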