#AnGaodhal: Training the First Irish-English Bilingual OCR Model

This blog post introduces two new publicly available Optical Character Recognition (OCR) models for the Irish language and the core dataset on which they were trained — the first bilingual Irish-English newspaper, An Gaodhal, produced monthly from 1881 to 1898 by an Irishman living in Brooklyn, New York, and available in the University of Galway Library.

The title page of An Gaodhal (Vol. 4, No. 4, February 1885)

Models

The first of these OCR models is a monolingual Irish (Gaeilge) model, one of the first to be made available publicly. The second model is the first ever to combine multilingual and multiscript functionality in a single OCR model: it extracts Irish texts in Cló Gaelach script and English texts in Roman script that are printed on the same page. The models were trained with the software called Transkribus, and are available on the Transkribus website.

An Gaodhal Irish (Gaeilge) Monolingual Model v.2 (ID# 61350)
164,015 training tokens
publicly available

An Gaodhal Irish (Gaeilge) - English Bilingual Model v.1 (ID# 51080)
54,406 training tokens
publication pending

For pages of English language content, we applied Transkribus Print M1 (ID# 39995), which has been trained on over 5 million tokens and reflects the historical typographical conventions of the corpus.

Key Features of the Dataset

The points below identify key features of the project dataset that shaped our OCR model design decisions and that the project team discussed at length before any OCR training or application began. If your dataset shares any of these features, you may find our experience useful in planning your own workflow.

Chiefly print, with some hand-written marginalia
Chiefly artisanal printing, produced by an amateur printer working in a domestic setting in late-nineteenth century Brooklyn, yielding printing errors and idiosyncratic edits
Newspaper with two-column layout
Fine-print advertisements: images can be upscaled before running layout recognition (we learned this too late to apply it and so lost time inputting text by hand)
Bilingual content — Irish & English — both reading left to right
Three main dialects of Irish with variable non-standard spelling
Two different scripts or typefaces, usually but not strictly confined to their respective languages (when Irish type ran short, Roman type was substituted); limited font variants (italics only in English)

Model Training Challenges

Two aspects of the dataset, its layout deviations and its variation in spelling and character use, had a major impact on the project's efficiency, on its accuracy, and on how we determined future applications of the models.

Layout deviations in the dataset

Different table layouts and mix of orthographies.

Some elements of the newspaper — including small tables appearing between text regions, varying line directionality, and curved or acrostic texts — needed layout analysis (LA) bounding boxes and/or text input to be generated by hand. Given the limitations of our project resources, the number of such instances was too small to merit spending more time on training workable layout models. Instead, to maximise the accuracy of OCR and so minimise human correction of OCR output, all LA elements were either automatically generated or manually applied and then fully reviewed before OCR models were run. In applying layout analysis, we adopted the default baseline recognition settings provided in Transkribus.

For the manual drawing of lines and word boxes, it is best to follow the directionality of the line but to draw word boxes in reverse order. This will render the word boxes in the appropriate order in the relevant line.

Ensuring that line polygons capture all diacritics and punctuation demands vigilance to minimise the human correction burden. Within the confines of our project resources, we limited the tagging of text regions to page numbers, paragraphs, and marginalia. Where required, tags for ‘gaps’ or for words that were ‘supplied’ or ‘unclear’ were applied. If you are conducting an HTR training workflow, note that lines featuring the ‘unclear’ tag are excluded from ground truth.
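To see in advance which lines an ‘unclear’ tag would remove from ground truth, one could scan an exported PAGE XML file. The following is a minimal sketch, not part of the project's workflow. It assumes that tags are stored in each TextLine's `custom` attribute in the form `unclear {offset:0; length:4;}`, and the function name `lines_without_unclear` is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: a tag appears in the TextLine "custom" attribute as
# 'unclear {offset:..; length:..;}' next to other entries such as readingOrder.
UNCLEAR = re.compile(r"\bunclear\s*\{")


def lines_without_unclear(page_xml):
    """Return (line id, text) pairs for lines that carry no 'unclear' tag."""
    root = ET.fromstring(page_xml)
    kept = []
    # The "{*}" namespace wildcard needs Python 3.8 or later.
    for line in root.findall(".//{*}TextLine"):
        if UNCLEAR.search(line.get("custom", "")):
            continue  # this line would be dropped from ground truth
        # Direct child only, so word-level TextEquiv elements are ignored.
        text_el = line.find("./{*}TextEquiv/{*}Unicode")
        text = text_el.text if text_el is not None and text_el.text else ""
        kept.append((line.get("id"), text))
    return kept
```

Comparing the number of lines returned with the total number of TextLine elements gives a quick estimate of how much training material the tag costs.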

Word-level corrections, often required for further Natural Language Processing (NLP) work, are slower than line-level corrections, especially where the page contains a high volume of words. At the time of writing, word-level corrections can only be applied in the Expert client on your desktop (see the spanner icon for options). Before saving changes and exporting files, it is advisable to refresh the ID elements (via the Expert client on your desktop, under the ‘Layout’ menu tab) to ensure all identifiers run sequentially.

Different spellings and changes in character use

"The Gaelic Alphabet" frequently reprinted in An Gaodhal.

Historically, the Irish language has been printed in two different orthographies: Irish type, known as Cló Gaelach, which originates in the scribal tradition; and Roman type. Cló Gaelach uses two kinds of diacritics: acute accents on vowels (ÁáÉéÍíÓóÚú); and dotted consonants (ḂḃĊċḊḋḞḟĠġṀṁṖṗṠṡṪṫ), the dots indicating a grammatical feature called lenition. Where Irish appears in Roman type, dotted consonants are replaced by Bh, Ch, Dh, Fh, etc. To ensure that such nuances of typesetting and spelling conventions in a given printed artefact are preserved in text extraction, the two new OCR models were trained to match a single Unicode character to each printed glyph; in other words, substituting Ḃ, Ċ, and Ḋ with Bh, Ch, and Dh, etc. was eschewed. The chief creator of our dataset rarely adopted such substitutions and, in Gaeilge texts, chose to adhere to the relevant orthography, spelling some English words phonetically, e.g. “Nuaḋ Ġorc” for New York.
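Because the models keep one Unicode character per printed glyph, any conversion to Roman-type digraphs is left to downstream users. The sketch below shows one way such a conversion might be done after the fact. It is an illustration under stated assumptions rather than part of the project's pipeline: the names `DOTTED_TO_DIGRAPH` and `to_roman_digraphs` are hypothetical, and capitals are always mapped to the title-case form (Ḃ to Bh), which would need adjusting for text set entirely in capitals.

```python
import unicodedata

COMBINING_DOT_ABOVE = "\u0307"

# Build the table from base letters so that every key is guaranteed to be
# the single precomposed code point (for example U+1E0B for dotted d).
DOTTED_TO_DIGRAPH = {}
for base in "BCDFGMPST":
    for letter in (base, base.lower()):
        dotted = unicodedata.normalize("NFC", letter + COMBINING_DOT_ABOVE)
        assert len(dotted) == 1  # all nine consonants have precomposed forms
        DOTTED_TO_DIGRAPH[dotted] = letter + "h"

_TABLE = str.maketrans(DOTTED_TO_DIGRAPH)


def to_roman_digraphs(text):
    """Replace dotted (lenited) consonants with consonant + h."""
    # NFC first, so input typed as letter + combining dot is also caught.
    return unicodedata.normalize("NFC", text).translate(_TABLE)


print(to_roman_digraphs("Nua\u1e0b \u0120orc"))  # Nuadh Ghorc
```

Normalising to NFC before translating matters when transcribers enter a dotted consonant as a base letter followed by a combining dot, since the two encodings look identical on screen.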

In our two models, the selected Unicode characters do not replicate exactly the design of Cló Gaelach (such as Gaelchló provides); rather, in deference to long-standing practice, Roman type characters, including those with diacritics, were chosen, thus ensuring interoperability between this dataset and others, e.g. the Historical Irish Corpus by the Royal Irish Academy. There is no fully integrated keyboard for Cló Gaelach available, so input of the required Unicode characters relied on the customisation of the virtual keyboard embedded in Transkribus.

Original Cló Gaelach script and Unicode output.

We found that our bilingual model sometimes struggled to recognise infrequently used letters in Cló Gaelach texts including 'h / H' and 'p / P'. Sometimes it rendered 'p' as 'd'. In time, with further application to other texts, we expect such errors to become less frequent.
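Confusions of this kind can be tallied by comparing corrected lines with raw OCR output. The following is a rough sketch using Python's standard `difflib` module, not the method used in the project. The function name `char_confusions` and the sample word pairs are hypothetical, and only one-for-one character substitutions are counted.

```python
from collections import Counter
from difflib import SequenceMatcher


def char_confusions(pairs):
    """Count (expected, recognised) character substitutions.

    `pairs` is an iterable of (ground_truth, ocr_output) strings.
    Insertions, deletions, and unequal-length replacements are ignored.
    """
    counts = Counter()
    for truth, ocr in pairs:
        matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op == "replace" and (i2 - i1) == (j2 - j1):
                counts.update(zip(truth[i1:i2], ocr[j1:j2]))
    return counts


# Hypothetical sample: two words in which 'p' was recognised as 'd'.
sample = [("capall", "cadall"), ("páipéar", "dáipéar")]
print(char_confusions(sample).most_common(3))
```

Sorting the resulting counts shows which letters might benefit most from additional training examples.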

History of Irish Language Literacy and Print

Although the quantity of printed material in Irish in the centuries prior to the appearance of An Gaodhal was small in comparison to many languages, a recent cataloguing of titles published in Irish between the sixteenth and nineteenth centuries that was produced by Richard Sharpe and Micheál Hoyne (Clóliosta, Dublin, 2020) lists over a thousand entries, several of them with multiple editions. Prior to the 1880s, the most common genres for printing in Irish were religious texts (both Catholic and Protestant), academic texts, and so-called Gaelic columns in otherwise English-only newspapers in which a relatively small amount of content (usually letters, songs, or poetry) was printed in Irish. An Gaodhal thus appeared at a time when printing in Irish was taking place, but not on a mass scale, so the newspaper’s production represented an energetic undertaking in the face of headwinds.

The challenges An Gaodhal faced were varied. The economics of audience size over printing costs, particularly for a newspaper printed for a transatlantic audience, drove its founder and editor to forgo any income from his work on the paper. The debate over the choice of type, whether Roman or Cló Gaelach, had long been a heated one; for those who insisted that Cló Gaelach was the only proper type for expressing Irish, there was the immediate challenge of procuring such a unique typeface — a matter of availability, not cost, as it could be purchased for the same price as Roman type. Even where Roman type was selected, any printer choosing to produce Irish texts in the nineteenth century or earlier faced difficulties in finding a sufficiently large, paying audience and, in a diasporic context, sufficiently fluent typesetters or compositors. The absence of mass literacy in Irish prior to the twentieth century combines with these challenges for printing to make the appearance of An Gaodhal and of similar undertakings in its aftermath especially notable: they represent the first steps in creating a media landscape in the Irish language, an impact foretold in the ambition expressed in An Gaodhal “to have the ‘million’ readers yet” (Vol. 9, No. 8 (January 1893), 236).

Finally, it is significant that An Gaodhal was created in Brooklyn, then a city in its own right and a major locus of a global diasporic community of Irish migrants (and Irish-speaking migrants) several centuries in the making. The outward migration from Ireland from the 1840s onward featured significant numbers from communities along the west coast of Ireland where Irish was predominant — in Munster, Connacht, and west Ulster — as opposed to more anglicized regions of the north and east. This shift in migration patterns made it likely that the cohorts arriving in places like the United States, Australia, Canada, and the United Kingdom in this period spoke Irish at higher rates than their immediate predecessors of the early nineteenth century. Although the largely anglicized contexts to which the majority of these migrants moved would, over time, weaken the strength of Irish language practice among them, we find that, for a time — in fact, coinciding with the two-decade heights of An Gaodhal’s reach — the center of gravity for Irish-speaking was located abroad almost as much as it was found in Ireland. The story of An Gaodhal highlights how, perhaps counterintuitively, the diasporic context had a lasting impact on the survival of the Irish language into the twenty-first century.

The Dataset in Context

Our core dataset, the monthly bilingual Irish-English newspaper An Gaodhal, was established and edited by Micheál Ó Lócháin (also known as Michael J. Logan). The first four issues of the newspaper were printed commercially and at a loss. To save the enterprise, Ó Lócháin took on the task of typesetting and printing the newspaper himself, most likely in his own home. His amateur efforts were likely assisted by his youngest child, Edward, who went on to become a professional printer. Over the next seventeen years, Ó Lócháin continued to issue the paper, supported by a transnational network of contributors. His commitment combined with the appetite among readers to achieve 1,200 subscriptions within the first year, growing to 3,000 at its peak, five times the number achieved by the contemporaneous Dublin-based Irisleabhar na Gaedhilge (also known as The Gaelic Journal).

As one might expect from an ethnic newspaper emerging in a diasporic setting, contributors to An Gaodhal and its readers welcomed the arrival of a new forum in which to identify their community of ‘Éire Mhór’ (Greater Ireland) and celebrate it. Nationalist politics at home in Ireland amplified that sense of pride, which extended to the use of Cló Gaelach throughout the newspaper to distinguish Irish expression from the English nation, its language and Roman type, and British imperialism. Indeed, the Irish type used in the newspaper, modelled on Watts type, was newly cast in the United States to avoid purchasing a set cast in a London foundry.

There is a palpable sense of energy and excitement in the newspaper as many of its contributors and readers were then gaining literacy in Irish for the first time. The standard of written Irish varied accordingly, as did the spelling, which had yet to be standardised. Combined with the use of three differing dialects, this variation means that the emerging corpus of texts, however small at 1.86 million tokens, yields a welcome diversity in the prospective training data. To date, the adaptability of the models we have developed supports this inference. To facilitate further developments in this field, the project data are distributed openly and include: full text (ALTO XML), which is corrected manually; a BART-based bilingual OCR post-correction model and the dataset with which it was trained; and a paper published in the LT4HALA @ LREC-COLING 2024 proceedings.
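For readers who want to work with the full text, the following is a minimal sketch of reading line text from an ALTO XML file with Python's standard library. It relies only on the standard ALTO structure, in which each TextLine holds String elements with a CONTENT attribute. The function name `alto_lines` is hypothetical, and joining words with single spaces is a simplification that ignores explicit spacing and hyphenation elements.

```python
import xml.etree.ElementTree as ET


def alto_lines(alto_xml):
    """Return the text of each TextLine in an ALTO document, in file order."""
    root = ET.fromstring(alto_xml)
    lines = []
    # The "{*}" wildcard (Python 3.8 or later) matches any ALTO schema version.
    for text_line in root.findall(".//{*}TextLine"):
        words = [s.get("CONTENT", "") for s in text_line.findall("./{*}String")]
        lines.append(" ".join(words))
    return lines
```

To read from disk, pass the file's bytes, for example `alto_lines(open(path, "rb").read())`, where `path` points to one exported page.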

The only complete series of the newspaper survives in the University of Galway Library where it has been digitally accessible since 2021 as high-resolution scans. This set was compiled, bound, and annotated by the Philadelphia-based scholar of Irish folklore and sean-nós song, Rev. Daniel J. Murphy, and forms part of his manuscript archive, which is also held in Galway. We extracted Rev. Murphy's annotations from our corpus marginalia in order to facilitate future linkage between them and his manuscripts containing 1,200 sean-nós songs transcribed contemporaneously in Philadelphia and northeastern Pennsylvania, some of which appeared in An Gaodhal. Together, these corpora predate the National Folklore Collection of Ireland and capture the Gaelic memory of Ireland before the Great Hunger of the 1840s, which devastated Irish-speaking communities. Generating integrated resources from these artefacts will go a long way toward realising the faith and hope of their dedicated creators. It was this particular potential for the integration of multi-modal resources that inspired the present project, which aims to build on the pioneering work of Fionnuala Uí Fhlannagáin who delivered the first comprehensive study of the rich content, reach, and impact of An Gaodhal (Uí Fhlannagáin, 1990).

Intentions and Ambitions

Our primary intention was to produce accurate text extraction for An Gaodhal amounting to 2,298 pages of late-nineteenth century printed texts in Irish and English (381 pages feature Irish mostly, 896 English mostly, and 1,019 both languages together), nearly 22% of which contained valuable marginalia. To that end, we understood we would need to develop a bilingual model from scratch, as none was available to us. We elected to train an Irish-only model and then used that model to train a bilingual Irish-English model. You can trace each step of that process in the project README and learn more about the OCR post-correction work in our LT4HALA 2024 paper.

At the outset of the present project in January 2023, there were no publicly available OCR models attuned to Cló Gaelach and pre-standardised spelling of the Irish language, in either monolingual or multilingual contexts; in bilingual or multilingual contexts, Irish appears most frequently alongside English, reflecting their co-existence in Ireland for centuries. The only related project in existence was a Cló Gaelach training dataset for Tesseract OCR software, published by Scannell et al. (2020). In November 2023, an Irish-only model for texts in either Cló Gaelach or Roman typefaces was made public on the Transkribus OCR platform (Farrell, 2023). The methodology that produced this model differs considerably from the approach discussed herein in respect of the treatment of different typefaces, the treatment of individual printed glyphs, and the broader span of centuries represented in the corpus upon which the model was trained.

In time, we hope to apply our bilingual model to another Irish-English bilingual monthly newspaper, An Stoc, produced from 1917 to 1931 by the Professor of Irish at University of Galway, Tomás Ó Máille. An Stoc was printed commercially in Cló Newman, which is similar to the Cló Watts style of type used in An Gaodhal. With a diasporic readership of its own, An Stoc relates to many of the same communities represented in An Gaodhal. It also connects with the recently digitised audio archive of over 500 tracks of wax cylinder recordings produced by Prof. Ó Máille.

The team continues to work on a digitally enhanced edition of An Gaodhal by exploring the application of Named Entity Recognition (NER) for historical Irish data.

For more on this project, you can listen to this seven-minute report on RTÉ Radio One’s The History Show, and follow #AnGaodhal for updates.

Deirdre Ní Chonghaile

Oksana Dereza (Digital Library Developer, University of Galway Library)

Nicholas Wolf

Funding

The project is funded by the Robert D. L. Gardiner Foundation, the Irish Institute of New York, Glucksman Ireland House at New York University, and the University of Galway.
