ILSP Focused Crawler: ILSP-FC 2.2.4 has been released
Added by Vassilis Papavassiliou over 3 years ago
ILSP-FC 2.2.4 has been released. A runnable jar is available from http://nlp.ilsp.gr/ilsp-fc/ilsp-fc-2.2.4-jar-with-dependencies.jar.
Added by Prokopis Prokopidis over 8 years ago
ILSP-FC 2.2.3 has been released. The source code is available from the Files section of this site. A runnable jar is also available from http://nlp.ilsp.gr/ilsp-fc/ilsp-fc-2.2.3-jar-with-dependencies.jar.
Major changes include:
Added by Prokopis Prokopidis over 8 years ago
ILSP-FC 2.2.2 has been released. The source code is available from the Files section of this site. A runnable jar is also available from http://nlp.ilsp.gr/ilsp-fc/ilsp-fc-2.2.2-jar-with-dependencies.jar.
Major changes include:
Added classes for the generation of ELRC-SHARE (http://lr-coordination.eu/) compatible metadata descriptors
It is now possible to generate a merged TMX file that includes all segment pairs extracted 1) from document pairs detected by specific methods and 2) from document pairs for which the number of 0:1 or 1:0 alignments is smaller than predefined thresholds (one threshold per type). Identical TUs, TUs with identical TUVs, and TUVs with no letters are filtered out during the merging process.
Sub-processes (incl. crawl, export, deduplication, pair detection, sentence alignment, TMX merging) can now be called independently. See http://nlp.ilsp.gr/redmine/projects/ilsp-fc/wiki for more details.
Three-letter language codes are used in extracted datasets and metadata. Methods for converting between two-letter and three-letter language codes are available in the utils package.
The aligner can now skip paragraphs that have been detected as ooi-lang
New methods for pair detection promote document pairs with common digits and hreflang attributes in links connecting them. De-duplication cannot remove pages participating in pairs resulting from these methods
New language-dependent abbreviation lists appear to reduce wrong sentence splits before alignment for many languages
Analyzers for new languages and language detection profiles have been added. See http://nlp.ilsp.gr/redmine/projects/ilsp-fc/wiki/Languages-supported for more details.
The package structure was reorganized in several places, with gr.ilsp.fc as the top package
ILSP-FC requires Java >= 1.7
Fixed bugs in, among other places, the XSLTransformer, the ContentNormalizer, the PDFExtractor, the license extractor and the LangDetector
ArticleExtractor and NumOfWords methods of integrated Boilerpipe are combined for better boilerplate removal
Dependencies for several libraries incl. pdfbox, lucene and bixo were upgraded
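The filtering applied during TMX merging, as described in the list above, can be illustrated with a minimal sketch. It is not ILSP-FC code: the class and method names (TmxMergeFilterSketch, SegmentPair, filter, hasLetter) are hypothetical, and the three filters are one reading of the release note, namely that a repeated source/target pair counts as an identical TU, a pair whose source equals its target counts as a TU with identical TUVs, and a segment without any Unicode letter counts as a TUV with no letters. The code avoids newer language features so that it stays within the Java >= 1.7 requirement.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration of the merge-time filters; not part of ILSP-FC. */
final class TmxMergeFilterSketch {

    /** A translation unit (TU) reduced to its two variants (TUVs). */
    static final class SegmentPair {
        final String source;
        final String target;

        SegmentPair(String source, String target) {
            this.source = source;
            this.target = target;
        }
    }

    /** Returns true if the text contains at least one Unicode letter. */
    static boolean hasLetter(String text) {
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (Character.isLetter(codePoint)) {
                return true;
            }
            i += Character.charCount(codePoint);
        }
        return false;
    }

    /** Keeps the first occurrence of each pair that passes all three filters. */
    static List<SegmentPair> filter(List<SegmentPair> pairs) {
        Set<List<String>> seen = new HashSet<List<String>>();
        List<SegmentPair> kept = new ArrayList<SegmentPair>();
        for (SegmentPair pair : pairs) {
            // Assumed reading of "TUVs with no letters": either side lacks letters.
            if (!hasLetter(pair.source) || !hasLetter(pair.target)) {
                continue;
            }
            // Assumed reading of "TUs with identical TUVs": source equals target.
            if (pair.source.equals(pair.target)) {
                continue;
            }
            // Assumed reading of "identical TUs": same source and same target seen before.
            if (!seen.add(Arrays.asList(pair.source, pair.target))) {
                continue;
            }
            kept.add(pair);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<SegmentPair> merged = filter(Arrays.asList(
                new SegmentPair("Good morning", "Bonjour"),
                new SegmentPair("Good morning", "Bonjour"),   // repeated TU
                new SegmentPair("ILSP-FC", "ILSP-FC"),        // identical TUVs
                new SegmentPair("2015", "2015 - 2016"),       // no letters
                new SegmentPair("Thank you", "Merci")));
        for (SegmentPair pair : merged) {
            System.out.println(pair.source + "\t" + pair.target);
        }
    }
}
```

Whether the actual merger compares segments exactly or after normalization is not stated in the release note; the sketch compares them exactly.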
Added by Vassilis Papavassiliou about 10 years ago
ILSP-FC 2.2.1 has been released. It's available for download from the Files section of this site.
Changes
Added by Prokopis Prokopidis over 10 years ago
ILSP-FC 2.2 is released. It's available for download from the "Files":/redmine/projects/ilsp-fc/files section of this site.
This version integrates a class that calls HunAlign for extracting sentence alignments from crawled document pairs. The class can be easily extended for the integration of other aligners. Sentence alignments are stored in TMX files.
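One way to picture an aligner class that can be extended for other aligners is a small interface with interchangeable implementations. The sketch below is purely illustrative and does not reproduce ILSP-FC's classes: SentenceAligner and PositionalAligner are hypothetical names, and the toy implementation simply pairs sentences by position instead of calling HunAlign or any other external tool.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of a pluggable aligner design; not ILSP-FC's API. */
final class AlignerSketch {

    /** Assumed extension point: each aligner turns two sentence lists into pairs. */
    interface SentenceAligner {
        List<String[]> align(List<String> source, List<String> target);
    }

    /** Toy aligner that pairs sentences by position and drops any surplus sentences. */
    static final class PositionalAligner implements SentenceAligner {
        public List<String[]> align(List<String> source, List<String> target) {
            int count = Math.min(source.size(), target.size());
            List<String[]> pairs = new ArrayList<String[]>();
            for (int i = 0; i < count; i++) {
                pairs.add(new String[] { source.get(i), target.get(i) });
            }
            return pairs;
        }
    }

    public static void main(String[] args) {
        // A wrapper around an external aligner would implement the same interface.
        SentenceAligner aligner = new PositionalAligner();
        List<String[]> pairs = aligner.align(
                Arrays.asList("Hello.", "How are you?"),
                Arrays.asList("Bonjour.", "Comment allez-vous ?"));
        for (String[] pair : pairs) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}
```

With a design along these lines, the code that writes alignments to TMX would depend only on the interface, so swapping aligners would not affect it.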
Added by Prokopis Prokopidis over 10 years ago
ILSP-FC 2.01 is released. It's available for download from the "Files":/redmine/projects/ilsp-fc/files section of this site.
This version integrates an advanced PDF extraction module and new configuration options for improving pair detection based on URL similarity.
Added by Prokopis Prokopidis over 11 years ago
ILSP-FC 1.2.3 is released. It's available for download from the Files section of this site.
This version improves pair detection based on URL similarity.
Added by Vassilis Papavassiliou over 11 years ago
ILSP-FC 1.2.2 is released. It's available for download from the "Files":/redmine/projects/ilsp-fc/files section of this site.
This version fixes errors concerning the pair detection module (output list of cesAlign docs, pairing based on URLs).
Added by Prokopis Prokopidis over 11 years ago
ILSP-FC 1.2.1 is released. It's available for download from the "Files":/redmine/projects/ilsp-fc/files section of this site.
This version allows a user to export crawl results with the help of an xslt file for better examination of results. The crawler also saves open-content license information in the exported metadata, when this information is available in source pages.
Added by Prokopis Prokopidis over 11 years ago
ILSP-FC 1.2 is released. It's available for download from the "Files":/redmine/projects/ilsp-fc/files section of this site.
This version fixes errors concerning the creation of output directories during crawl jobs.