Parallel Global Voices

This page contains a set of parallel and monolingual corpora generated from the Global Voices multilingual group of websites, where volunteers publish and translate news stories in more than 40 languages.

The original content from the Global Voices websites is made available by its authors and publishers under a Creative Commons Attribution license. The content was crawled in 2015-2016 by researchers at the NLP group of the Institute for Language and Speech Processing. Documents that are translations of each other were paired on the basis of their link information. After document pairing, segment alignments were extracted with the maligna sentence aligner. The results of the automatic alignment at document and segment level are distributed from this page under a Creative Commons Attribution 4.0 license.
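To make the idea of pairing by link information concrete, here is a minimal sketch. It is not the pipeline used to build this resource: the `Doc` record, its `links` field and the `pair_documents` function are hypothetical names, and the sketch assumes that each crawled page exposes the URLs of its versions in other languages.

```python
from typing import Iterable, NamedTuple, Set, Tuple


class Doc(NamedTuple):
    """Hypothetical record for one crawled page (not the resource's format)."""
    url: str
    lang: str
    links: frozenset  # assumed: URLs of other-language versions found on the page


def pair_documents(docs: Iterable[Doc]) -> Set[Tuple[str, str]]:
    """Return unordered URL pairs where one page links to a crawled
    page written in a different language."""
    docs = list(docs)
    by_url = {d.url: d for d in docs}
    pairs = set()
    for doc in docs:
        for target_url in doc.links:
            other = by_url.get(target_url)
            # Skip links to pages that were not crawled or share the language.
            if other is not None and other.lang != doc.lang:
                pairs.add(tuple(sorted((doc.url, other.url))))
    return pairs
```

Storing each pair as a sorted tuple means that a link found in either direction yields the same pair, so a document pair is counted only once.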

The crawl resulted in a set of 174,629 documents, 86.51% of which were involved in at least one document pair. Overall, 302,617 document pairs and 8,356,943 segment alignments were automatically generated for 756 language pairs, with 27.62 segment alignments per document pair on average. The top 10 languages in the collection are ranked by the number of paragraphs in their monolingual corpora. Language pairs that combine any of these 10 languages with any other language contribute 94.09% of the segment alignments.
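The figures above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in the paragraph; the variable names are illustrative.

```python
# Figures quoted in the paragraph above.
documents = 174_629
paired_share = 0.8651            # 86.51% of documents are in at least one pair
document_pairs = 302_617
segment_alignments = 8_356_943

# Approximate number of documents involved in at least one document pair.
paired_documents = round(documents * paired_share)

# Average number of segment alignments per document pair.
avg_alignments = segment_alignments / document_pairs

print(f"documents in at least one pair: about {paired_documents:,}")
print(f"segment alignments per document pair: {avg_alignments:.2f}")
```

The second line printed matches the reported average of 27.62, and the first shows that roughly 151,000 documents take part in at least one pair.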

You can download the segment alignments for each language pair as a single TMX file and/or browse the datasets by following the links in the tables below. The first table contains links for language pairs involving the top 10 languages in the collection. The second table contains links for all language pairs. Hover over a cell to get brief information on the size of the data for that language pair. Darker cells point to language pairs with more segment alignments. All counts exclude 0:1 and 1:0 segment alignments.
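Once a TMX file has been downloaded, its segment alignments can be read with the Python standard library. The sketch below assumes that the files follow the usual TMX layout (`tu` elements containing one `tuv` per language, each with an `xml:lang` attribute and a `seg` child) and use no default XML namespace. The function name `read_tmx_pairs` is illustrative, and the language codes actually used inside the files are not specified on this page.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(source):
    """Yield one {language: text} dict per translation unit.

    `source` is a file path or a binary file object containing TMX.
    Assumes the standard tu/tuv/seg nesting without a default namespace.
    """
    tree = ET.parse(source)
    for tu in tree.getroot().iter("tu"):
        unit = {}
        for tuv in tu.findall("tuv"):
            # Older TMX versions use a plain "lang" attribute instead.
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang is not None and seg is not None:
                # itertext() also collects text nested in inline markup.
                unit[lang] = "".join(seg.itertext()).strip()
        yield unit
```

For example, `for unit in read_tmx_pairs("eng-fra.tmx"): print(unit)` would print one dictionary per aligned segment, where `eng-fra.tmx` is a hypothetical local filename for a downloaded file.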

You can download an archive of the monolingual corpus for each language (with each web document exported as an XML file) by clicking on the language names in the table headers. Each archive also contains a list of the filenames involved in all document pairs.

If you make use of this resource in your research, please cite the following paper: Prokopis Prokopidis, Vassilis Papavassiliou and Stelios Piperidis. 2016. Parallel Global Voices: a collection of multilingual corpora with citizen media stories. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016).

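First table: language pairs involving the top 10 languages. Each row lists the pairs that combine the row language with the languages that follow it in the header.

Bangla English French Italian Malagasy Portuguese Russian Spanish Chinese-simplified Chinese-traditional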
Bangla ben-eng ben-fra ben-ita ben-mlg ben-por ben-rus ben-spa ben-zhs ben-zht
English eng-fra eng-ita eng-mlg eng-por eng-rus eng-spa eng-zhs eng-zht
French fra-ita fra-mlg fra-por fra-rus fra-spa fra-zhs fra-zht
Italian ita-mlg ita-por ita-rus ita-spa ita-zhs ita-zht
Malagasy mlg-por mlg-rus mlg-spa mlg-zhs mlg-zht
Portuguese por-rus por-spa por-zhs por-zht
Russian rus-spa rus-zhs rus-zht
Spanish spa-zhs spa-zht
Chinese-simplified zhs-zht

Second table: all language pairs. A dash marks a language pair without a link.

Amharic Arabic Aymara Bangla Bulgarian Catalan Czech Danish German Greek English Esperanto Farsi Filipino French Hebrew Hindi Hungarian Indonesian Italian Japanese Khmer Korean Macedonian Malagasy Burmese Dutch Odia Polish Portuguese Romanian Russian Spanish Albanian Serbian Swahili Swedish Turkish Urdu Chinese-simplified Chinese-traditional
Amharic amh-ara amh-aym amh-ben - amh-cat - amh-dan amh-deu amh-ell amh-eng amh-epo amh-fas - amh-fra - amh-hin amh-hun amh-ind amh-ita amh-jpn amh-khm amh-kor amh-mkd amh-mlg amh-mya amh-nld - amh-pol amh-por - amh-rus amh-spa amh-sqi amh-srp amh-swa amh-swe amh-tur amh-urd amh-zhs amh-zht
Arabic ara-aym ara-ben ara-bul ara-cat ara-ces ara-dan ara-deu ara-ell ara-eng ara-epo ara-fas ara-fil ara-fra ara-heb ara-hin ara-hun ara-ind ara-ita ara-jpn ara-khm ara-kor ara-mkd ara-mlg ara-mya ara-nld ara-ori ara-pol ara-por ara-rum ara-rus ara-spa ara-sqi ara-srp ara-swa ara-swe ara-tur ara-urd ara-zhs ara-zht
Aymara aym-ben aym-bul aym-cat aym-ces aym-dan aym-deu aym-ell aym-eng aym-epo aym-fas aym-fil aym-fra - - aym-hun aym-ind aym-ita aym-jpn aym-khm aym-kor aym-mkd aym-mlg aym-mya aym-nld aym-ori aym-pol aym-por aym-rum aym-rus aym-spa aym-sqi aym-srp aym-swa aym-swe aym-tur aym-urd aym-zhs aym-zht
Bangla ben-bul ben-cat ben-ces ben-dan ben-deu ben-ell ben-eng ben-epo ben-fas ben-fil ben-fra ben-heb ben-hin ben-hun ben-ind ben-ita ben-jpn ben-khm ben-kor ben-mkd ben-mlg ben-mya ben-nld ben-ori ben-pol ben-por ben-rum ben-rus ben-spa ben-sqi ben-srp ben-swa ben-swe ben-tur ben-urd ben-zhs ben-zht
Bulgarian - bul-cat bul-ces bul-dan bul-deu bul-ell bul-eng bul-epo bul-fas bul-fil bul-fra - - bul-hun bul-ind bul-ita bul-jpn - bul-kor bul-mkd bul-mlg bul-mya bul-nld - bul-pol bul-por bul-rum bul-rus bul-spa bul-sqi bul-srp bul-swa bul-swe bul-tur bul-urd bul-zhs bul-zht
Catalan cat-ces cat-dan cat-deu cat-ell cat-eng cat-epo cat-fas cat-fil cat-fra cat-heb cat-hin cat-hun cat-ind cat-ita cat-jpn cat-khm cat-kor cat-mkd cat-mlg cat-mya cat-nld cat-ori cat-pol cat-por cat-rum cat-rus cat-spa cat-sqi cat-srp cat-swa cat-swe cat-tur cat-urd cat-zhs cat-zht
Czech - ces-dan ces-deu ces-ell ces-eng ces-epo ces-fas ces-fil ces-fra ces-heb - ces-hun ces-ind ces-ita ces-jpn - ces-kor ces-mkd ces-mlg ces-mya ces-nld ces-ori ces-pol ces-por ces-rum ces-rus ces-spa ces-sqi ces-srp ces-swa ces-swe ces-tur ces-urd ces-zhs ces-zht
Danish dan-deu dan-ell dan-eng dan-epo dan-fas dan-fil dan-fra dan-heb - dan-hun dan-ind dan-ita dan-jpn dan-khm dan-kor dan-mkd dan-mlg dan-mya dan-nld - dan-pol dan-por - dan-rus dan-spa dan-sqi dan-srp dan-swa dan-swe dan-tur dan-urd dan-zhs dan-zht
German deu-ell deu-eng deu-epo deu-fas deu-fil deu-fra deu-heb deu-hin deu-hun deu-ind deu-ita deu-jpn deu-khm deu-kor deu-mkd deu-mlg deu-mya deu-nld deu-ori deu-pol deu-por deu-rum deu-rus deu-spa deu-sqi deu-srp deu-swa deu-swe deu-tur deu-urd deu-zhs deu-zht
Greek ell-eng ell-epo ell-fas ell-fil ell-fra ell-heb ell-hin ell-hun ell-ind ell-ita ell-jpn ell-khm ell-kor ell-mkd ell-mlg ell-mya ell-nld ell-ori ell-pol ell-por ell-rum ell-rus ell-spa ell-sqi ell-srp ell-swa ell-swe ell-tur ell-urd ell-zhs ell-zht
English eng-epo eng-fas eng-fil eng-fra eng-heb eng-hin eng-hun eng-ind eng-ita eng-jpn eng-khm eng-kor eng-mkd eng-mlg eng-mya eng-nld eng-ori eng-pol eng-por eng-rum eng-rus eng-spa eng-sqi eng-srp eng-swa eng-swe eng-tur eng-urd eng-zhs eng-zht
Esperanto epo-fas epo-fil epo-fra - epo-hin epo-hun epo-ind epo-ita epo-jpn - - epo-mkd epo-mlg - epo-nld - epo-pol epo-por - epo-rus epo-spa - epo-srp epo-swa epo-swe - epo-urd epo-zhs epo-zht
Farsi fas-fil fas-fra fas-heb fas-hin fas-hun fas-ind fas-ita fas-jpn fas-khm fas-kor fas-mkd fas-mlg fas-mya fas-nld fas-ori fas-pol fas-por fas-rum fas-rus fas-spa fas-sqi fas-srp fas-swa fas-swe fas-tur fas-urd fas-zhs fas-zht
Filipino - fil-fra - - fil-hun fil-ind fil-ita fil-jpn fil-khm fil-kor fil-mkd fil-mlg fil-mya fil-nld - fil-pol fil-por - fil-rus fil-spa fil-sqi fil-srp fil-swa fil-swe fil-tur fil-urd fil-zhs fil-zht
French fra-heb fra-hin fra-hun fra-ind fra-ita fra-jpn fra-khm fra-kor fra-mkd fra-mlg fra-mya fra-nld fra-ori fra-pol fra-por fra-rum fra-rus fra-spa fra-sqi fra-srp fra-swa fra-swe fra-tur fra-urd fra-zhs fra-zht
Hebrew - - - - - - - - heb-ita heb-jpn - - heb-mkd heb-mlg heb-mya heb-nld - heb-pol heb-por - heb-rus heb-spa - - - heb-swe - heb-urd heb-zhs heb-zht
Hindi - - - - - - hin-hun hin-ind hin-ita hin-jpn hin-khm - hin-mkd hin-mlg - hin-nld hin-ori hin-pol hin-por - hin-rus hin-spa hin-sqi hin-srp hin-swa hin-swe - hin-urd hin-zhs hin-zht
Hungarian - hun-ind hun-ita hun-jpn hun-khm hun-kor hun-mkd hun-mlg hun-mya hun-nld - hun-pol hun-por hun-rum hun-rus hun-spa hun-sqi hun-srp hun-swa hun-swe hun-tur hun-urd hun-zhs hun-zht
Indonesian - ind-ita ind-jpn - ind-kor ind-mkd ind-mlg ind-mya ind-nld - ind-pol ind-por ind-rum ind-rus ind-spa ind-sqi ind-srp ind-swa ind-swe ind-tur ind-urd ind-zhs ind-zht
Italian ita-jpn ita-khm ita-kor ita-mkd ita-mlg ita-mya ita-nld ita-ori ita-pol ita-por ita-rum ita-rus ita-spa ita-sqi ita-srp ita-swa ita-swe ita-tur ita-urd ita-zhs ita-zht
Japanese jpn-khm jpn-kor jpn-mkd jpn-mlg jpn-mya jpn-nld jpn-ori jpn-pol jpn-por jpn-rum jpn-rus jpn-spa jpn-sqi jpn-srp jpn-swa jpn-swe jpn-tur jpn-urd jpn-zhs jpn-zht
Khmer - - - - - khm-kor khm-mkd khm-mlg - khm-nld - - khm-por - khm-rus khm-spa - khm-srp khm-swa - - khm-urd khm-zhs khm-zht
Korean - - - kor-mkd kor-mlg kor-mya kor-nld - kor-pol kor-por kor-rum kor-rus kor-spa kor-sqi kor-srp kor-swa kor-swe kor-tur kor-urd kor-zhs kor-zht
Macedonian mkd-mlg mkd-mya mkd-nld mkd-ori mkd-pol mkd-por mkd-rum mkd-rus mkd-spa mkd-sqi mkd-srp mkd-swa mkd-swe mkd-tur mkd-urd mkd-zhs mkd-zht
Malagasy mlg-mya mlg-nld mlg-ori mlg-pol mlg-por mlg-rum mlg-rus mlg-spa mlg-sqi mlg-srp mlg-swa mlg-swe mlg-tur mlg-urd mlg-zhs mlg-zht
Burmese - - - mya-nld - mya-pol mya-por - mya-rus mya-spa mya-sqi mya-srp mya-swa mya-swe mya-tur mya-urd mya-zhs mya-zht
Dutch nld-ori nld-pol nld-por nld-rum nld-rus nld-spa nld-sqi nld-srp nld-swa nld-swe nld-tur nld-urd nld-zhs nld-zht
Odia - - - - - - - - - - - ori-pol ori-por - ori-rus ori-spa ori-sqi - - - ori-tur ori-urd ori-zhs ori-zht
Polish - pol-por pol-rum pol-rus pol-spa pol-sqi pol-srp pol-swa pol-swe pol-tur pol-urd pol-zhs pol-zht
Portuguese por-rum por-rus por-spa por-sqi por-srp por-swa por-swe por-tur por-urd por-zhs por-zht
Romanian - - - - - - - - - rum-rus rum-spa - rum-srp rum-swa - rum-tur - rum-zhs rum-zht
Russian rus-spa rus-sqi rus-srp rus-swa rus-swe rus-tur rus-urd rus-zhs rus-zht
Spanish spa-sqi spa-srp spa-swa spa-swe spa-tur spa-urd spa-zhs