Project Overview

Statistical techniques are a key part of most modern natural language processing systems. Unfortunately, such techniques require large bodies of text, and in the past corpus development has proved to be quite expensive. As a result, substantial corpora exist primarily for languages such as English, French, and German, where there is a market-driven need for NLP tools.

The idea behind this project is to exploit the vast quantities of text freely available on the web as a way of bringing the benefits of statistical NLP to languages with small numbers of speakers and/or limited computational resources. Initially it was deployed for the six Celtic languages, but over the last ten years I've added support for more than 2000 languages from all parts of the world. You can find information about all of these languages, along with downloadable datasets, on the Downloads Page. There is also information on various tools that have been developed using these corpora on the Applications Page.

I gave a presentation on this work at the WAC3 conference in Louvain-la-Neuve, Belgium in September of 2007. Here is the conference paper, which is the one we'd ask you to cite if you make use of this work: The Crúbadán Project: Corpus building for under-resourced languages. I am grateful to the organizers Cédrick Fairon, Adam Kilgarriff, and Gilles-Maurice de Schryver for the invitation and to the Université Catholique de Louvain for financial support that made the trip possible. You can read the slides from my main talk and also some remarks I made during the WAC panel discussion.

The word crúbadán literally means "crawler" in Irish, but with the additional (appropriate in this context) connotation of unwanted or clumsy "pawing", from the root crúb ("paw"). Several people have asked me how it is pronounced; you can now listen to the word as it's spoken by the wonderful Irish speech synthesizer