ILSP-FC 2.2.2 is released
ILSP-FC 2.2.2 has been released. The source code is available from the Files section of this site. A runnable jar is also available from http://nlp.ilsp.gr/ilsp-fc/ilsp-fc-2.2.2-jar-with-dependencies.jar.
Major changes include:
Added classes for the generation of ELRC-SHARE (http://lr-coordination.eu/) compatible metadata descriptors
It is now possible to generate a merged TMX file that includes all segment pairs extracted (1) from document pairs detected by specific methods and (2) from document pairs for which the number of 0:1 or 1:0 alignments is smaller than predefined thresholds (one threshold per alignment type). Identical TUs, TUs with identical TUVs, and TUVs with no letters are filtered out during the merging process.
Sub-processes (incl. crawl, export, deduplication, pair detection, sentence alignment, TMX merging) can now be called independently. See http://nlp.ilsp.gr/redmine/projects/ilsp-fc/wiki for more details.
Three-letter language codes are used in extracted datasets and metadata. Methods for converting between two-letter and three-letter language codes are available in the utils package.
The aligner can now skip paragraphs that have been detected as ooi-lang
New methods for pair detection promote document pairs that share digits or that are connected by links with hreflang attributes. De-duplication does not remove pages that participate in pairs resulting from these methods.
New language-dependent abbreviation lists seem to reduce wrong sentence splits before alignment for many languages.
Analyzers for new languages and language detection profiles have been added. See http://nlp.ilsp.gr/redmine/projects/ilsp-fc/wiki/Languages-supported for more details.
The package structure was reorganized in several places, with gr.ilsp.fc as the top package
ILSP-FC requires Java >= 1.7
Fixed bugs in, among other places, the XSLTransformer, the ContentNormalizer, the PDFExtractor, the license extractor and the LangDetector
The ArticleExtractor and NumOfWords methods of the integrated Boilerpipe are now combined for better boilerplate removal
Dependencies for several libraries incl. pdfbox, lucene and bixo were upgraded
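The filtering applied during TMX merging (identical TUs, TUs with identical TUVs, TUVs with no letters) can be illustrated with a minimal sketch. This is not the ILSP-FC implementation: the class TuFilterSketch, the Tu type and the filter method are hypothetical names introduced here, and the sketch assumes that a TU is dropped as soon as either of its TUVs has no letters.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only; names and behaviour are assumptions, not ILSP-FC code.
class TuFilterSketch {

    /** Hypothetical translation unit with one source and one target TUV. */
    static final class Tu {
        final String source;
        final String target;

        Tu(String source, String target) {
            this.source = source;
            this.target = target;
        }
    }

    /** Returns true if the segment contains at least one letter in any script. */
    static boolean hasLetter(String segment) {
        for (int i = 0; i < segment.length(); ) {
            int cp = segment.codePointAt(i);
            if (Character.isLetter(cp)) {
                return true;
            }
            i += Character.charCount(cp);
        }
        return false;
    }

    /** Keeps the first occurrence of each TU that passes all three filters. */
    static List<Tu> filter(List<Tu> units) {
        Set<String> seen = new HashSet<>();
        List<Tu> kept = new ArrayList<>();
        for (Tu tu : units) {
            String src = tu.source.trim();
            String tgt = tu.target.trim();
            if (src.equals(tgt)) {
                continue; // TU with identical TUVs
            }
            if (!hasLetter(src) || !hasLetter(tgt)) {
                continue; // assumption: drop the TU if either TUV has no letters
            }
            if (!seen.add(src + '\u0000' + tgt)) {
                continue; // identical TU already kept
            }
            kept.add(tu);
        }
        return kept;
    }
}
```

Using a set keyed on both segments keeps the pass linear in the number of TUs. A real merger would probably normalize whitespace and case before comparing.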
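For readers who need the two-letter to three-letter language code mapping outside ILSP-FC, the sketch below shows one way to do it with the standard java.util.Locale class. It does not use or reproduce the methods in the ILSP-FC utils package; LangCodeSketch, toThree and toTwo are hypothetical names chosen for this example.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.MissingResourceException;

// Illustrative only; relies on the JDK's ISO 639 tables, not on ILSP-FC.
class LangCodeSketch {

    // Reverse lookup built once from the two-letter codes known to the JDK.
    private static final Map<String, String> THREE_TO_TWO = new HashMap<>();

    static {
        for (String two : Locale.getISOLanguages()) {
            try {
                String three = Locale.forLanguageTag(two).getISO3Language();
                if (!three.isEmpty() && !THREE_TO_TWO.containsKey(three)) {
                    THREE_TO_TWO.put(three, two);
                }
            } catch (MissingResourceException e) {
                // No three-letter code is available for this language; skip it.
            }
        }
    }

    /** Two-letter to three-letter code; may throw MissingResourceException. */
    static String toThree(String two) {
        return Locale.forLanguageTag(two).getISO3Language();
    }

    /** Three-letter to two-letter code, or null if the code is unknown. */
    static String toTwo(String three) {
        return THREE_TO_TWO.get(three.toLowerCase(Locale.ROOT));
    }
}
```

For example, toThree("el") yields "ell" and toTwo("deu") yields "de".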