The 100 million+ words of Irish downloaded by the crawler have been
instrumental in the development of
my grammar checker An Gramadóir,
and many other NLP applications for Irish.
Other language groups have used the Crúbadán data to develop grammar checkers, including for Afrikaans (Petri Jooste), Breton (Thierry Vignaud), Cornish (Edi Werner), Occitan (Bruno Gallart), and Walloon (Pablo Saratxaga).
Using Unicode it is possible to create electronic documents in most
languages with all of the proper diacritical marks and extended characters.
Nevertheless, for various reasons speakers of many languages do not do
this when writing
emails or blogs or producing other documents for consumption on the web. In Lingala,
for example, there are tone markings as well as two extended vowels,
the "open e" (ɛ), and the "open o" (ɔ). On the web, tone marks are generally omitted and these vowels are written as "e" and "o" respectively, so
"abɔkɔ́lɛ́kɛ́" becomes "abokoleke". This can cause ambiguities for people reading
Lingala texts, and also limits the usefulness of web
texts for statistical purposes. To improve this situation, I wrote a script
called "charlifter" that performs statistical diacritic
restoration on web texts. This greatly enhances the usefulness of the
Crúbadán corpora. The charlifter is also an application of the web crawler,
in that the statistical language models it uses are created
from the (rare) texts found by the crawler that use diacritical marks
and extended characters correctly. My student Michael Schade has written a Firefox add-on called accentuate.us that allows one to use this technology anywhere on the web: email, blogs, chats, etc.
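As a rough illustration of the idea of statistical diacritic restoration, the sketch below learns, for each stripped word, the most frequent fully marked spelling seen in correctly written training text. This is a deliberately simple word-level lookup and is not charlifter's actual algorithm; the function names and the small `EXTENDED` table are assumptions made for the example.

```python
import unicodedata
from collections import Counter, defaultdict

# Assumed mapping for extended letters that Unicode decomposition cannot
# reduce to a base letter (here only the two Lingala vowels from the text).
EXTENDED = str.maketrans({"ɛ": "e", "ɔ": "o", "Ɛ": "E", "Ɔ": "O"})

def strip_marks(word):
    """Drop combining marks (tones, accents) and map extended vowels to e/o."""
    decomposed = unicodedata.normalize("NFD", word)
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return no_marks.translate(EXTENDED)

def train(tokens):
    """For each stripped form, count how often each fully marked form occurs."""
    model = defaultdict(Counter)
    for tok in tokens:
        model[strip_marks(tok)][unicodedata.normalize("NFC", tok)] += 1
    return model

def restore(word, model):
    """Return the most frequent marked form seen in training, else the word unchanged."""
    candidates = model.get(strip_marks(word))
    if not candidates:
        return word
    return candidates.most_common(1)[0][0]

# Toy training data standing in for the (rare) correctly marked web texts.
model = train(["abɔkɔ́lɛ́kɛ́", "abɔkɔ́lɛ́kɛ́", "moto"])
restored = restore("abokoleke", model)
```

A word-level lookup like this cannot resolve genuinely ambiguous forms or handle unseen words; a practical system would need context and some form of backoff.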
The n-gram statistics gathered from the corpora for each language
provide a powerful and effective language recognition algorithm.
Of course particular care must be given to language pairs with very
similar n-gram profiles; for more on this, see my blog post
1000 Languages on the Web. A corpus of Crúbadán 3-grams is now
available (under the GPLv3) as part of the Natural Language Toolkit (NLTK);
see their corpus download page.
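To make the n-gram approach concrete, here is a minimal sketch of language recognition that compares character 3-gram counts using cosine similarity. It is only one of several possible scoring schemes and is not necessarily the method used with the Crúbadán data; the two tiny profiles are toy stand-ins for real corpus statistics, and all function names are assumptions.

```python
import math
from collections import Counter

def trigrams(text):
    """Count overlapping character 3-grams, with a space marking word edges."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(a, b):
    """Cosine similarity between two 3-gram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def identify(text, profiles):
    """Return the language code whose profile is most similar to the text."""
    query = trigrams(text)
    return max(profiles, key=lambda lang: cosine(query, profiles[lang]))

# Toy profiles; real ones would be built from a large corpus per language.
profiles = {
    "ga": trigrams("tá an aimsir go maith inniu agus beidh sé go breá amárach"),
    "en": trigrams("the weather is good today and it will be fine tomorrow"),
}
guess = identify("tá sé go maith", profiles)
```

With profiles this small the scores are fragile, and closely related languages would need far more data or additional features to separate reliably.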
I've provided Mongolian corpus data to Drs. Richard Harris and Shawn Nissen at Brigham Young University for their project Development of digitally recorded Mongolian Speech Audiometry Materials, the aim of which is to produce low-cost hearing tests for Mongolian speakers. I provided additional material for work-in-progress on Samoan, Tagalog, and several South African languages. See the Master's Theses based on this work here and here.
Dasher is a
free software package developed at the University of Cambridge that
allows efficient text entry without a keyboard.
It uses a language model trained on text corpora to help it
make predictions; the Dasher developers are
training 129 new language models using the Crúbadán corpora.
The Crúbadán corpora have proved useful to a number of lexicographical projects.
New word lists
I've written a series of
statistical filters that can be applied to the web corpora
to generate word lists that speed the process of
developing a new spell checker.
These techniques have been applied to create the following
spell checking packages:
- hunspell-as (Assamese). With Amitakhya Phukan.
- hunspell-az (Azerbaijani). With Metin Amiroff.
- Mozilla (Bosnian). With Mirsad Čirkić, based on earlier work of Ninoslav Jurković, Samir Ribić, and Vedran Ljubović.
- hunspell-csb (Kashubian). With Roman Drzeżdżon and Piotr Formella.
- Mozilla (Diola-Kasa). With Outi Sane.
- Mozilla (Diola-Fogny). With Outi Sane.
- hunspell-fy (Frisian). With Eeltje de Vries.
- (Scottish Gaelic). With Michael Bauer.
- (Manx Gaelic). Using earlier work of Alastair McKinstry.
- (Hiligaynon). With Francis Dimzon.
- (Haitian Creole). With Jean Came Poulard and LogiPam.
- (Kurdish). With Erdal Ronahi and Rêzan Tovjîn.
- aspell-ky (Kirghiz). With Ilyas Bakirov.
- (Lingala). With Denis Jacquerye.
- (Malagasy). With Rado Ramarotafika.
- Mozilla (Marshallese). With Marco Mora.
- (Mongolian). With Sanlig Badral. See the announcement (in Mongolian). I'm "Профессор Доктор Кэвин Сканнелл".
- (Chichewa). With Soyapi Mumba and Edmond Kachale.
- Mozilla (Oromo). With Belayneh Melka and Dawit Boka.
- (Kinyarwanda). With Steve Murphy and Philibert Ndandali.
- (Songhay). With Abdoul Cisse and Mohomodou Houssouba.
- hunspell-so (Somali). With Mohamed I. Mursal. Packaged as a Mozilla add-on. See the announcement (English).
- (Tetum). With Peter Gossner.
- (Turkmen). With Jumamurat Bayjan.
- (Tagalog). With Ramil Sagum.
- (Setswana). With Thapelo Otlogetswe.
- (Tok Pisin). With Helge Søgaard.
- (Xhosa). Crúbadán word list is the basis for the translate.org.za spell checker.
- hunspell-zu (Zulu, experimental). Crúbadán word list is the basis for the translate.org.za spell checker, which includes a rich affix file by Friedel Wolff.
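The statistical filters themselves are not described here, so the following is only a hypothetical sketch of what filtering a web corpus into a candidate word list might look like. It keeps tokens that occur often enough, appear in more than one document, and use only letters from an expected alphabet. The function name, thresholds, and alphabet are all assumptions for illustration.

```python
import re
import unicodedata
from collections import Counter

def candidate_words(documents, alphabet, min_count=3, min_docs=2):
    """Return sorted tokens passing frequency, document-spread and alphabet filters."""
    allowed = set(alphabet)
    total = Counter()   # total occurrences of each token
    spread = Counter()  # number of documents containing each token
    for doc in documents:
        # Normalize so accented letters are single code points matched by \w.
        text = unicodedata.normalize("NFC", doc.lower())
        tokens = re.findall(r"\w+", text)
        total.update(tokens)
        spread.update(set(tokens))
    return sorted(
        w for w in total
        if total[w] >= min_count
        and spread[w] >= min_docs
        and set(w) <= allowed
    )

# Toy documents; the stray English word and the junk token are filtered out.
docs = ["teach mór teach beag", "teach mór the", "mór teach xyz9"]
words = candidate_words(docs, "abcdefghilmnoprstuáéíóú")
```

Filters of this kind only produce candidates; the resulting list would still need review by speakers of the language before being packaged as a spell checker.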
Abandoned projects or works in progress
Please contact me if you speak one of these languages and would be willing to help.
- Balochi, with Mostafa Daneshvar.
- Chechen, with Sarah Slye, Steve Massey, et al.
- Chhattisgarhi, with Ravishankar Shrivastava.
- Cornish, with Edi Werner and Paul Bowden.
- Dzongkha, with Tshering Cigay Dorji.
- Guaraní, with Iván Prieto Corvalán.
- Hausa, with Mustapha Abubakar.
- Igbo, with Chinedu Uchechukwu and Ogechi Nnadi.
- Itzgründisch, with Sabine Emmy Eller.
- Kapampangan, with José Navarro.
- Kikongo, with Anderson Sunda-Meya.
- Limburgish, with Kenneth Rohde Christiansen.
- Luganda, with San Emmanuel James and Jackson Ssekiryango.
- Nawat, with Alan King.
- Samoan, with Chris Bickers.
- Sindhi, with Abdul Rahim Nizamani.
- Sundanese, with Mang Jamal.
- Tahitian, with Christin Livine.
- Tigrinya and Tigré, with Merhawie Woldezion.
- Tongan, with Brian Romanowski.
- Yoruba, with Tope Faro.
Abandoned projects, word lists now available elsewhere
- Asturian, with Ricardo Mones Lastra (update: OpenOffice.org extension available here).
- Basque, with Alberto Fernández (update: hunspell package now available from euskadi.net).
- Bislama, with Eric Brandell (update: GPL word lists available from swtech.com.au).
- Friulian, with Andrea Tami (update: extensive word list now available from digilander.libero.it).
- Papiamentu, with Peter Damiana (update: Firefox addon available here).
I've provided data to hundreds of researchers working on computational or
pure linguistics research projects for many languages.
I used to track them all here but that's become more trouble than it's
worth. Here is a list of some of the applications in any case:
- Dialect discrimination
- Computational Morphology
- Syntactic analysis (computational and theoretical)
- Lemmatization and IR
- Language ID
- Optical Character Recognition
- Sociolinguistics and social media
- Language learning
- Word sense disambiguation
- Predictive text and autocorrect
- Comparative phonology
- Machine translation
- Selectional preferences
- Crossword generation
- POS tagging
- Speech recognition
- Speech synthesis
- Semantic networks
- Diacritic restoration
- Spelling and grammar checking
Please contact me (kscanne at gmail dot com)
if you are interested in applying
these techniques to a new language.