Showing posts from August, 2024

#AnGaodhal: Training the First Irish-English Bilingual OCR Model

This blog post introduces two new publicly available Optical Character Recognition (OCR) models for the Irish language and the core dataset on which they were trained: the first bilingual Irish-English newspaper, An Gaodhal, produced monthly from 1881 to 1898 by an Irishman living in Brooklyn, New York, and available in the University of Galway Library.

The title page of An Gaodhal (Vol. 4, No. 4, February 1885)

Models

The first of these OCR models is a monolingual Irish (Gaeilge) model, one of the first to be made available publicly. The second model is the first ever to combine multilingual and multiscript functionality in a single OCR model: it extracts Irish texts in Cló Gaelach script and English texts in Roman script that are printed on the same page. The models were trained with the software called Transkribus, and are available on the Transkribus website.

An Gaodhal Irish (Gaeilge) Monolingual Model v.2 (ID# 61350)
164,015 training tokens
publicly available

An Gao