This page contains a set of parallel and monolingual corpora generated from the Global Voices multilingual group of websites, where volunteers publish and translate news stories in more than 40 languages.
The original content from the Global Voices websites is made available by the authors and publishers under a Creative Commons Attribution license. The content was crawled in 2015-2016 by researchers at the NLP group of the Institute for Language and Speech Processing. Documents that are translations of each other were paired on the basis of their link information. After document pairing, segment alignments were extracted with the hunalign sentence aligner. The results of the automatic alignment at document and segment level are distributed from this page under a Creative Commons Attribution 4.0 license.
The crawl resulted in a set of 174,629 documents, 86.51% of which were involved in at least one document pair. Overall, 302,617 document pairs and 8,356,943 segment alignments were automatically generated for 756 language pairs, with 27.62 segment alignments per document pair on average. Language pairs that combine any of the top 10 languages in the collection (ranked by number of paragraphs in the monolingual corpora) with any other language contribute 94.09% of the segment alignments.
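As a quick sanity check, the average reported above can be recomputed from the published totals. The short Python sketch below uses only the figures stated in the text. The variable names are illustrative, and the count of paired documents is an estimate, because the 86.51% share is itself rounded.

```python
# Totals as stated in the corpus description.
documents = 174_629
paired_share = 0.8651            # 86.51% of documents are in at least one pair
document_pairs = 302_617
segment_alignments = 8_356_943

# Average number of segment alignments per document pair.
avg_alignments = segment_alignments / document_pairs
print(f"{avg_alignments:.2f}")   # 27.62

# Estimated number of documents in at least one pair (approximate,
# since the percentage is rounded to two decimals).
paired_documents = documents * paired_share
print(round(paired_documents))
```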
You can download the segment alignments for each language pair as a single TMX file and/or browse the datasets by following the links in the tables below. The first table contains links for language pairs involving the top 10 languages in the collection. The second table contains links for all language pairs. Hover over a cell to get brief information on the size of the language pair. Darker cells indicate language pairs with more segment alignments. All counts exclude 0:1 and 1:0 segment alignments.
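Once a TMX file has been downloaded, its aligned segments can be read with a standard XML parser. The following Python sketch is one possible way to do this, not an official loader for this collection. It relies on the `tu`, `tuv` and `seg` elements defined by the TMX standard. The function name `read_tmx_pairs`, the inline sample, and the language codes `en` and `el` are assumptions made for illustration, and the codes used in the distributed files may differ.

```python
import io
import xml.etree.ElementTree as ET

# Fully qualified name of the xml:lang attribute as exposed by ElementTree.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(source, src_lang, tgt_lang):
    """Yield (source_text, target_text) tuples from a TMX path or file object.

    Hypothetical helper: language codes are matched exactly after
    lowercasing, which is an assumption about how the files are tagged.
    """
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        segments = {}
        for tuv in elem.iter("tuv"):
            # TMX 1.4 uses xml:lang; older files may use a plain lang attribute.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                segments[lang] = "".join(seg.itertext()).strip()
        if src_lang in segments and tgt_lang in segments:
            yield segments[src_lang], segments[tgt_lang]
        elem.clear()  # release memory for translation units already processed


# Tiny made-up sample standing in for a downloaded file.
SAMPLE = """<tmx version="1.4">
  <header/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Hello world.</seg></tuv>
      <tuv xml:lang="el"><seg>Γεια σου κόσμε.</seg></tuv>
    </tu>
  </body>
</tmx>""".encode("utf-8")

pairs = list(read_tmx_pairs(io.BytesIO(SAMPLE), "en", "el"))
for src, tgt in pairs:
    print(src, "|||", tgt)
```

To process a real download, pass the path of the TMX file instead of the `io.BytesIO` object. Incremental parsing with `iterparse` keeps memory use low for the larger language pairs.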
You can download archives of monolingual corpora for each language (with each web document exported as an XML file) by clicking on the language names in the table headers. Each archive contains a list of the filenames involved in all document pairs.
If you make use of this resource in your research, please cite the following paper: Prokopis Prokopidis, Vassilis Papavassiliou and Stelios Piperidis. 2016. Parallel Global Voices: a collection of multilingual corpora with citizen media stories. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). (bib)