Grammar Checking

The 100 million+ words of Irish downloaded by the crawler have been instrumental in the development of my grammar checker An Gramadóir, and many other NLP applications for Irish. Other language groups have used the Crúbadán data for developing grammar checkers, including for Afrikaans (Petri Jooste), Breton (Thierry Vignaud), Cornish (Edi Werner), Occitan (Bruno Gallart), and Walon (Pablo Saratxaga).

Diacritic Restoration

Using Unicode, it is possible to create electronic documents in most languages with all of the proper diacritical marks and extended characters. Nevertheless, for various reasons speakers of many languages do not do this when writing emails or blogs or producing other documents for consumption on the web. In Lingala, for example, there are tone markings as well as two extended vowels, the "open e" (ɛ) and the "open o" (ɔ). On the web, tone marks are generally omitted and these vowels are written as "e" and "o" respectively, so "abɔkɔ́lɛ́kɛ́" becomes "abokoleke". This can cause ambiguities for people reading Lingala texts, and also limits the usefulness of web texts for statistical purposes. To improve this situation, I wrote a script called "charlifter" that performs statistical diacritic restoration on web texts. This greatly enhances the usefulness of the Crúbadán corpora. The charlifter is also an application of the web crawler, in that the statistical language models it uses are created from the (rare) texts found by the crawler that use diacritical marks and extended characters correctly. My student Michael Schade has written a Firefox add-on that allows one to use this technology anywhere on the web: email, blogs, chats, etc.
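
To make the idea of statistical diacritic restoration concrete, here is a minimal sketch in Python. It is not the charlifter algorithm, whose details are not described here; it assumes the simplest possible model, a lookup from each stripped word to the marked form seen most often in correctly written training text. The function names and the `EXTENDED_TO_ASCII` table are hypothetical.

```python
import unicodedata
from collections import Counter, defaultdict

# Hypothetical table for extended characters that have no Unicode
# decomposition (the Lingala "open e" and "open o" from the text above).
EXTENDED_TO_ASCII = {"ɛ": "e", "ɔ": "o", "Ɛ": "E", "Ɔ": "O"}

def strip_diacritics(word):
    """Remove combining marks and map extended vowels to plain letters."""
    decomposed = unicodedata.normalize("NFD", word)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return "".join(EXTENDED_TO_ASCII.get(ch, ch) for ch in no_marks)

def train(correct_text):
    """Count which fully marked form each stripped form came from."""
    model = defaultdict(Counter)
    for word in correct_text.split():
        model[strip_diacritics(word)][word] += 1
    return model

def restore(text, model):
    """Replace each word by its most frequent marked form, if one is known."""
    restored = []
    for word in text.split():
        candidates = model.get(word)
        restored.append(candidates.most_common(1)[0][0] if candidates else word)
    return " ".join(restored)
```

Trained on text containing "abɔkɔ́lɛ́kɛ́", this sketch would map the stripped form "abokoleke" back to it. A word-level lookup cannot resolve a stripped form that corresponds to several marked words, so a practical system would presumably also need to use the surrounding context.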

Language Recognition

The n-gram statistics gathered from the corpora for each language provide a powerful and effective language recognition algorithm. Of course particular care must be given to language pairs with very similar n-gram profiles; for more on this, see my blog post 1000 Languages on the Web. A corpus of Crúbadán 3-grams is now available (under the GPLv3) as part of the Natural Language Toolkit (NLTK); see their corpus download page.
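
As a rough illustration of n-gram language recognition, the sketch below builds character 3-gram profiles and picks the language whose profile is closest to the input by cosine similarity. This is one common approach and an assumption on my part, not necessarily the method used with the Crúbadán data; the function names are hypothetical.

```python
import math
from collections import Counter

def trigram_profile(text):
    """Count overlapping character 3-grams, with spaces marking word edges."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(p, q):
    """Cosine similarity between two 3-gram count vectors."""
    dot = sum(count * q[gram] for gram, count in p.items())
    norm_p = math.sqrt(sum(c * c for c in p.values()))
    norm_q = math.sqrt(sum(c * c for c in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

def identify(text, profiles):
    """Return the label of the profile most similar to the text."""
    sample = trigram_profile(text)
    return max(profiles, key=lambda lang: cosine(sample, profiles[lang]))
```

Here `profiles` is a dictionary from language label to a profile built with `trigram_profile` from that language's corpus. For language pairs with very similar profiles, a plain similarity score like this one is unlikely to be enough on its own.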

Hearing Testing

I've provided Mongolian corpus data to Drs. Richard Harris and Shawn Nissen at Brigham Young University for their project Development of digitally recorded Mongolian Speech Audiometry Materials, the aim of which is to produce low-cost hearing tests for Mongolian speakers. I provided additional material for work-in-progress on Samoan, Tagalog, and several South African languages. See the Master's Theses based on this work here and here.


Dasher is a free software package developed at the University of Cambridge that allows efficient text entry without a keyboard. It uses a language model trained on text corpora to help it make predictions; the Dasher developers are training 129 new language models using the Crúbadán corpora.
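
To give a feel for how a corpus-trained language model can drive predictions, here is a small sketch of a character-level model that suggests likely next characters from the previous two. It is only an illustration under that assumption and is not Dasher's actual model; the function names are hypothetical.

```python
from collections import Counter, defaultdict

def train_char_model(text, order=2):
    """Map each context of `order` characters to counts of the next character."""
    model = defaultdict(Counter)
    for i in range(len(text) - order):
        model[text[i:i + order]][text[i + order]] += 1
    return model

def predict_next(model, context, order=2, k=3):
    """Return up to k likely next characters with their relative frequencies."""
    counts = model.get(context[-order:])
    if not counts:
        return []  # context never seen in the training text
    total = sum(counts.values())
    return [(ch, n / total) for ch, n in counts.most_common(k)]
```

The larger and cleaner the training corpus, the better such estimates become, which is where a web corpus for a language with few other electronic resources can help.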


The Crúbadán corpora have proved useful to a number of lexicographical projects:

Spell Checking

New word lists

I've written a series of statistical filters that can be applied to the web corpora to generate word lists that speed the process of developing a new spell checker. These techniques have been applied to create the following spell checking packages:
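
The filters themselves are not described on this page, but the general idea can be sketched as follows, assuming one plausible filter: keep only tokens that occur often enough and in more than one document, on the theory that typos and foreign-language noise tend to be rare or confined to a single source. The function name and thresholds are hypothetical.

```python
from collections import Counter

def candidate_words(documents, min_count=3, min_docs=2):
    """Keep alphabetic tokens that are frequent and spread across documents."""
    total = Counter()     # occurrences of each token over the whole corpus
    doc_freq = Counter()  # number of documents each token appears in
    for doc in documents:
        tokens = [t for t in doc.lower().split() if t.isalpha()]
        total.update(tokens)
        doc_freq.update(set(tokens))
    return sorted(
        w for w in total
        if total[w] >= min_count and doc_freq[w] >= min_docs
    )
```

A list produced this way is only a starting point; it would still need review by a speaker of the language before going into a spell checker.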

Abandoned projects or works in progress

Please contact me if you speak one of these languages and would be willing to help.

  • Balochi, with Mostafa Daneshvar.
  • Chechen, with Sarah Slye, Steve Massey, et al.
  • Chhattisgarhi, with Ravishankar Shrivastava.
  • Cornish, with Edi Werner and Paul Bowden.
  • Dzongkha, with Tshering Cigay Dorji.
  • Guaraní, with Iván Prieto Corvalán.
  • Hausa, with Mustapha Abubakar.
  • Igbo, with Chinedu Uchechukwu and Ogechi Nnadi.
  • Itzgründisch, with Sabine Emmy Eller.
  • Kapampangan, with José Navarro.
  • Kikongo, with Anderson Sunda-Meya.
  • Limburgish, with Kenneth Rohde Christiansen.
  • Luganda, with San Emmanuel James and Jackson Ssekiryango.
  • Nawat, with Alan King.
  • Samoan, with Chris Bickers.
  • Sindhi, with Abdul Rahim Nizamani.
  • Sundanese, with Mang Jamal.
  • Tahitian, with Christin Livine.
  • Tigrinya and Tigré, with Merhawie Woldezion.
  • Tongan, with Brian Romanowski.
  • Yoruba, with Tope Faro.

Abandoned projects, word lists now available elsewhere

  • Asturian, with Ricardo Mones Lastra (update: extension available here).
  • Basque, with Alberto Fernández (update: hunspell package now available from
  • Bislama, with Eric Brandell (update: GPL word lists available from
  • Friulian, with Andrea Tami (update: extensive word list now available from
  • Papiamentu, with Peter Damiana (update: Firefox addon available here).

Other Projects

I've provided data to hundreds of researchers working on computational or pure linguistics research projects for many languages. I used to track them all here but that's become more trouble than it's worth. Here is a list of some of the applications in any case:

  • Dialect discrimination
  • Computational Morphology
  • Syntactic analysis (computational and theoretical)
  • Lemmatization and IR
  • Language ID
  • Optical Character Recognition
  • Sociolinguistics and social media
  • Language learning
  • Word sense disambiguation
  • Predictive text and autocorrect
  • Comparative phonology
  • Lexicography
  • Machine translation
  • Selectional preferences
  • Crossword generation
  • POS tagging
  • Speech recognition
  • Speech synthesis
  • Psycholinguistics
  • Semantic networks
  • Diacritic restoration
  • Spelling and grammar checking

Please contact me (kscanne at gmail dot com) if you are interested in applying these techniques to a new language.