#AnGaodhal: Training the First Irish-English Bilingual OCR Model
This blog post introduces two new publicly available Optical Character Recognition (OCR) models for the Irish language and the core dataset on which they were trained: the first bilingual Irish-English newspaper, An Gaodhal, produced monthly from 1881 to 1898 by an Irishman living in Brooklyn, New York, and available in the University of Galway Library.

The title page of An Gaodhal (Vol. 4, No. 4, February 1885)

Models

The first of these OCR models is a monolingual Irish (Gaeilge) model, one of the first to be made available publicly. The second model is the first ever to combine multilingual and multiscript functionality in a single OCR model: it extracts Irish texts in Cló Gaelach script and English texts in Roman script that are printed on the same page. The models were trained with the software called Transkribus, and are available on the Transkribus website.

An Gaodhal Irish (Gaeilge) Monolingual Model v.2 (ID# 61350) 164,015 training tokens publicly availab...