The African Language Data Problem — and the Initiatives Solving It
Voice AI needs data. But for most African languages, that data barely exists. Here's a deep look at the datasets being built — from Google's WAXAL to African Voices — and why they matter for the future of AI on the continent.
Why Data Is the Bottleneck for African AI
Building a great voice AI model is not primarily a compute problem. It is a data problem.
Large speech models need hundreds — often thousands — of hours of clean, accurately transcribed audio to learn the phonetics, prosody, and vocabulary of a language. For English, Mandarin, and Spanish, those hours exist in abundance: podcasts, news archives, audiobooks, academic corpora.
For most African languages, that infrastructure simply does not exist yet.
Africa has over 2,000 distinct languages, yet the overwhelming majority are not represented in any publicly available speech dataset. The result: AI assistants that cannot understand Hausa, TTS systems that mangle Yoruba tones, and speech-to-text tools that go blank when spoken to in Igbo.
This is starting to change — fast. Here is a look at the most important data initiatives reshaping the landscape.
Google WAXAL: 21 Languages, 11,000 Hours
In February 2026, Google Research Africa announced WAXAL — a name taken from the Wolof word for "speak." The announcement was a significant moment for African language AI.
WAXAL covers 21 African languages and contains:
- Over 11,000 hours of speech data from nearly 2 million individual recordings
- Approximately 1,250 hours of transcribed speech for Automatic Speech Recognition (ASR)
- Over 20 hours of studio-quality recordings for Text-to-Speech (TTS) voice synthesis
Languages included span the continent, among them: Acholi, Akan, Dagaare, Dagbani, Dholuo, Ewe, Fante, Fulani, Hausa, Igbo, Kikuyu, Lingala, Luganda, Malagasy, Masaaba, Nyankole, Rukiga, Shona, Swahili, and Yoruba.
How it was collected
What makes WAXAL particularly interesting is its methodology. Rather than scraping the internet, Google's partners — Makerere University in Uganda, the University of Ghana, and Digital Umuganda in Rwanda — asked participants to describe pictures in their native languages. This captures spontaneous, natural speech rather than read text, which is far more useful for training conversational AI.
Professional voice actors were also recorded in studios to create the clean, consistent audio needed specifically for TTS synthesis.
Critically, the dataset is released under an open licence and is freely available on Hugging Face. The partner institutions retain ownership of the data they collected — a model of data sovereignty that others should follow.
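For developers, "freely available on Hugging Face" means the data can be pulled with the standard `datasets` library. The sketch below shows the general shape of a streaming load; note that the repository id and configuration names here are placeholders, not the dataset's actual identifiers — check the Hugging Face Hub for the real ones.

```python
def load_waxal_asr(language: str = "hausa", split: str = "train"):
    """Stream an ASR subset from the Hugging Face Hub (sketch).

    NOTE: "google/waxal" is a placeholder repository id used for
    illustration -- look up the actual WAXAL dataset name and its
    per-language configuration names on the Hub before using this.
    """
    from datasets import load_dataset  # pip install datasets

    # streaming=True iterates examples lazily instead of downloading
    # thousands of hours of audio up front.
    return load_dataset(
        "google/waxal",   # placeholder repo id
        name=language,
        split=split,
        streaming=True,
    )

# Typical usage (requires network access; field names will vary by dataset):
#   ds = load_waxal_asr("yoruba")
#   example = next(iter(ds))
```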
African Voices: 3,000+ Hours of Nigerian and Malian Speech
African Voices is a large-scale multilingual speech dataset developed by Data Science Nigeria, with support from the Bill & Melinda Gates Foundation.
It focuses on five languages with a strong emphasis on Nigeria:
| Language | Country |
|---|---|
| Hausa | Nigeria |
| Igbo | Nigeria |
| Yorùbá | Nigeria |
| Nigerian Pidgin | Nigeria |
| Bambara | Mali |
The dataset contains over 3,000 hours of transcribed audio, collected through community-centred, ethical protocols designed to respect linguistic and cultural diversity. Data was gathered through both scripted and spontaneous speech methods.
For anyone building speech technology specifically for Nigeria, African Voices is arguably the most directly relevant dataset available today. It is publicly downloadable, and the authors provide a BibTeX reference for academic citation.
The Broader Ecosystem: Masakhane and Beyond
The initiatives above are not isolated. They sit within a growing ecosystem of African-led NLP research.
Masakhane — a grassroots community whose name means "we build together" in Nguni — has produced some of the foundational datasets for African languages:
- MasakhaNER: Named Entity Recognition datasets for 10 African languages including Hausa, Igbo, Naija Pidgin, Swahili, and Yorùbá
- Machine translation corpora for Yorùbá-English, Luganda-English, and other pairs
- A growing research community that has published dozens of peer-reviewed papers
Zindi Africa has also catalogued dozens of labelled datasets for tasks ranging from text classification (Swahili news, Chichewa news) to sentiment analysis (Tunisian Arabizi) to speech recognition (Wolof ASR, Kinyarwanda ASR from Mozilla Common Voice).
What This Means for Developers Building African Voice Apps
The existence of these datasets changes what is technically possible right now.
If you are building a voice product for a Nigerian market:
- ASR for Hausa, Igbo, Yoruba, and Pidgin is now achievable with reasonable quality by fine-tuning Whisper models on African Voices or WAXAL data
- TTS for the same languages is ready today via the 9jaLingo API — no dataset collection required on your side
- Custom voice cloning lets you give your product a locally recognisable voice without months of data labelling
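A small but unavoidable step in any fine-tuning pipeline is audio preparation: Whisper-family models expect 16 kHz mono input, while field recordings often arrive at other sample rates. Here is a minimal, dependency-light sketch of resampling via linear interpolation (production pipelines would typically use a proper resampler such as the ones in `librosa` or `torchaudio` instead):

```python
import numpy as np

def resample_linear(audio: np.ndarray, src_rate: int, dst_rate: int = 16_000) -> np.ndarray:
    """Resample a mono waveform to dst_rate via linear interpolation.

    A rough sketch -- fine for prototyping, but a polyphase or sinc
    resampler gives better quality for training data.
    """
    if src_rate == dst_rate:
        return audio.astype(np.float32)
    duration = len(audio) / src_rate
    n_out = int(round(duration * dst_rate))
    # Positions of the output samples mapped onto the input index axis.
    src_positions = np.linspace(0, len(audio) - 1, num=n_out)
    return np.interp(src_positions, np.arange(len(audio)), audio).astype(np.float32)

# Example: one second of a 440 Hz tone recorded at 44.1 kHz.
sr_in = 44_100
t = np.arange(sr_in) / sr_in
tone = np.sin(2 * np.pi * 440 * t).astype(np.float32)

resampled = resample_linear(tone, sr_in)
print(len(resampled))  # 16000 samples for one second of audio
```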
The gap between what was possible five years ago and what is possible today is enormous. Five years ago, building a production Hausa speech product would have required millions of dollars in data collection. Today, you can prototype one in an afternoon.
The Road Ahead
The data flywheel is spinning faster every year. More data → better models → more users → more data. But there are still large gaps:
- Tone marking in Yoruba and Igbo remains inconsistently handled in most datasets
- Code-switching corpora (Pidgin + English in the same utterance) are rare
- Languages like Efik, Tiv, Igala, and hundreds of others have almost no labelled data
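The tone-marking problem above is partly a Unicode normalization problem: the same Yorùbá word can be stored with precomposed accented characters (NFC) or with base letters plus combining tone marks (NFD), and the two are not byte-identical. A dataset that mixes both silently fragments its vocabulary. A quick illustration using only the standard library:

```python
import unicodedata

# Force the literal into NFC so the comparison below is deterministic
# regardless of how this file was saved.
composed = unicodedata.normalize("NFC", "Yorùbá")     # precomposed ù, á
decomposed = unicodedata.normalize("NFD", composed)   # u + combining grave, a + combining acute

# Visually identical, but different code-point sequences.
print(composed == decomposed)           # False
print(len(composed), len(decomposed))   # 6 8

def normalize_text(s: str) -> str:
    """Normalize transcripts to NFC before training or evaluation."""
    return unicodedata.normalize("NFC", s)

print(normalize_text(decomposed) == composed)  # True
```

Running every transcript through a single normalization form before training (and before computing metrics like WER) is a cheap way to avoid this entire class of bug.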
If you are a linguist, researcher, or native speaker who wants to contribute to this effort, organisations like Masakhane, Data Science Nigeria, and Mozilla Common Voice are always looking for contributors.
The infrastructure for African language AI is being built right now. The datasets above are the foundation — and they are free for anyone to use.
9jaLingo Team
Research · 9jaLingo