Since the summer, the Open and Digital Research Team in the Library have been testing and refining an Optical Character Recognition (OCR) pipeline. The Library’s OCR pipeline aims to convert scanned images of textual archival records into a format that is machine-readable, at scale. This is an especially important enhancement to the University’s digitised heritage collections, making its holdings more searchable and supporting diverse areas of research interest.
The first test case for the pipeline has been the Irish emigrant letters from the Kerby A. Miller Collection, which are published online in the Library’s Digital Collections and in Imirce, a standalone digital repository for materials relating to Irish emigrants to North America. In December 2024, the first batch of OCR-enriched material was launched on both the Digital Collections and Imirce, and further heritage collections are scheduled for processing in 2025.
What this means for users
The Digital Collections and Imirce now support both a global (site-wide) search and a local search per letter through the IIIF Viewer that reads the full text of the Irish emigrant letters from the Kerby A. Miller Collection (Archival ID p155).
Letters from the p155 collection curated for online publication include a typed transcript of the letter produced by Miller or associates, a reproduction (photocopy) of a handwritten letter, or a combination of these types, where available. This collection was selected for testing the OCR pipeline because most of it is represented by typed transcripts, and typed text is especially well suited to machine recognition.
Archivists and other information professionals are essential in the process of making important data discoverable to researchers. When cataloguing textual archival materials, items are manually reviewed and enriched with metadata to support search and discovery, including the selection of subject keywords from controlled vocabularies. Combining this practice of indexing the contents of each item with the recent addition of a full-text search opens the archival materials to even more dynamic and (possibly) surprising content discoveries by researchers and members of the public alike.
Global (site-wide) search
Global search refers to searching across the collection(s) for search terms included in the item metadata fields, as well as the full letter texts.
Below are examples of returned results for the search terms shamrock (69 search results in the Digital Collections) and grave (138 search results in Imirce). The list of search results mixes letters that have the search term in a metadata field with letters that have it in the full text of the letter.
[Image: Search results for the search term shamrock in the Digital Collections, 19/12/2024.]
[Image: Search results for the search term grave in Imirce, 19/12/2024.]
Local (IIIF viewer) search
The local search in the IIIF Viewer shows users exactly where the search term appears in the letter, with a highlight on the page. To use it, type the search term again into the search bar at the bottom of the page, as shown in the examples below.
[Image: Search results for the search term shamrock on a p155 collection item in the IIIF Viewer in the Digital Collections, 19/12/2024.]
[Image: Search results for the search term grave on a p155 collection item in the IIIF Viewer in Imirce, 19/12/2024.]
The full machine transcript is included in a metadata field in the IIIF Viewer. The Item View page also includes a ‘Machine Transcript’ tab to show the transcript text for the page being reviewed. In both cases, this text can be highlighted to copy and paste for research purposes.
[Image: The 'Machine Transcript' metadata field, with highlighted text for copy/pasting, shown on a p155 collection item in the IIIF Viewer in the Digital Collections, 19/12/2024.]
[Image: The 'Machine Transcript' tab, with highlighted text for copy/pasting, shown on a p155 collection item in the Item View in Imirce, 19/12/2024.]
Technical Details about OCR and Search
The p155 collection is published to the Library’s Digital Asset Management software (DAMs) in JPEG format. For the OCR pipeline, the JPEG files were processed using Amazon Textract. This OCR software can recognise both typewritten and handwritten text and output the recognised text in JSON format. For ingestion and use in the DAMs, these outputs were then converted into ALTO XML and plain text files.
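As a rough illustration of the plain text conversion step, the sketch below joins the text of the LINE blocks in a Textract-style JSON response. The sample response is hand-written for illustration and the function name is hypothetical. This is not the Library's actual conversion code, and the ALTO XML conversion is not shown.

```python
# Hypothetical, hand-written sample shaped like an Amazon Textract response.
# Real responses contain many more fields (geometry, ids, relationships).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Dear Mother,", "Confidence": 98.2},
        {"BlockType": "WORD", "Text": "Dear", "Confidence": 99.1},
        {"BlockType": "WORD", "Text": "Mother,", "Confidence": 97.3},
        {"BlockType": "LINE", "Text": "I send you a shamrock.", "Confidence": 95.4},
    ]
}


def textract_to_plain_text(response: dict) -> str:
    """Return the recognised text, one LINE block per line of output."""
    lines = [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]
    return "\n".join(lines)


print(textract_to_plain_text(sample_response))
```

WORD blocks are skipped here because each LINE block already carries the text of its words.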
The outputs from Amazon Textract were assessed according to a mean confidence score, which is the average of the confidence scores for every line on a page. Only pages with a mean confidence score of more than 75% were selected for ingestion to the DAMs. This score does not show how accurate the recognition results are, but how confident the machine is in its predictions. A higher confidence score is better, although there is no direct correspondence with accuracy.
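The page-level check described above can be sketched as follows. This is a minimal illustration that assumes the mean is taken over the Confidence values of LINE blocks in a Textract-style response for a single page; the function names and sample pages are hypothetical and the Library's own implementation may differ.

```python
from statistics import mean

# Threshold from the text: pages must score more than 75% to be ingested.
CONFIDENCE_THRESHOLD = 75.0


def mean_line_confidence(response: dict):
    """Average the confidence of every LINE block; None if the page has no lines."""
    scores = [
        block["Confidence"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE" and "Confidence" in block
    ]
    return mean(scores) if scores else None


def passes_confidence_test(response: dict, threshold: float = CONFIDENCE_THRESHOLD) -> bool:
    """A page passes only if its mean line confidence is strictly above the threshold."""
    score = mean_line_confidence(response)
    return score is not None and score > threshold


# Hypothetical pages: one typed, one handwritten.
typed_page = {"Blocks": [
    {"BlockType": "LINE", "Text": "Dear Mother,", "Confidence": 98.0},
    {"BlockType": "LINE", "Text": "I am well.", "Confidence": 96.0},
    {"BlockType": "LINE", "Text": "Your son, John", "Confidence": 94.0},
]}
handwritten_page = {"Blocks": [
    {"BlockType": "LINE", "Text": "Deer Mothr", "Confidence": 60.0},
    {"BlockType": "LINE", "Text": "I am wel", "Confidence": 70.0},
    {"BlockType": "LINE", "Text": "Your son", "Confidence": 80.0},
]}

print(mean_line_confidence(typed_page), passes_confidence_test(typed_page))
print(mean_line_confidence(handwritten_page), passes_confidence_test(handwritten_page))
```

Note that a page can pass with some low-confidence lines, or fail with some high-confidence ones, because only the average is compared with the threshold.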
When rolling out the OCR pipeline at scale, it is not feasible to assess each output file individually for accuracy and correct it where necessary. For this reason, the output text included in the Digital Collections and Imirce is specifically identified as a “Machine Transcript”, a label that acknowledges the possibility of some errors in the machine text recognition.
For more context about the first OCR test case, the figures for the p155 collection were as follows:
- 4,012 items (15,077 individual assets) processed overall – this is the total number of Irish emigrant letters published online to date as part of the Imirce project
- 3,886 items (12,876 individual assets) successfully processed
- 3,526 items (11,860 individual assets) have machine transcripts available to support full text search in the Digital Collections and Imirce
- 360 items (1,016 individual assets) set aside for copyright reasons - these can only be consulted in the Library
- 2,885 individual assets failed the 75% confidence test and most of these assets were pages of handwritten text
To get the most out of the global (site-wide) search, it may be useful to know the following:
- It is not case sensitive, for example: a query for the term christmas will find both Christmas and christmas.
- It looks for whole words, for example: if you search christ, the results will only show Christ or christ, but not Christmas.
- To search for a part of the word, use the * wildcard, for example: searching christ* (with the * wildcard included) will yield results for both Christ and Christmas.
- The search allows for related terms to be found based on the stem of the selected term, for example: a query for the term organisation will return results for this exact term, as well as related terms such as organised and organising.
- To search for a search term AND another search term, use the + symbol, for example: searching for shamrock + patrick + parade will return results where all these terms are used, but not in any specific arrangement.
[Image: Search results for the search query shamrock + patrick + parade in the Digital Collections, 19/12/2024.]
- Adding quotation marks to a specific set of related terms will search the collection for only the instances where these terms exist in the order listed between the quotation marks, for example: searching for "gaelic scholars" will show only the letters where these terms appear and in the specified order.
[Image: Search results for the search query "gaelic scholars" in the Digital Collections, 19/12/2024.]
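For readers who find it easier to see the rules in action, the toy matcher below imitates some of the behaviours listed above: case-insensitive matching, whole-word matching, the * wildcard, the + operator and quoted phrases. It is an illustrative sketch only. The function names are hypothetical, stemming is not modelled, and it is not the search engine behind the Digital Collections or Imirce.

```python
import re


def tokenize(text: str) -> list:
    """Lower-case the text and split it into whole words."""
    return re.findall(r"[a-z0-9']+", text.lower())


def term_matches(term: str, words: list) -> bool:
    """Match a whole word, or a prefix when the term ends with the * wildcard."""
    term = term.lower()
    if term.endswith("*"):
        prefix = term[:-1]
        return any(word.startswith(prefix) for word in words)
    return term in words


def phrase_matches(phrase: str, words: list) -> bool:
    """Match the words of a quoted phrase in the given order, side by side."""
    target = tokenize(phrase)
    if not target:
        return False
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))


def matches(query: str, text: str) -> bool:
    """Apply the toy rules to one query and one letter text."""
    words = tokenize(text)
    query = query.strip()
    if len(query) >= 2 and query.startswith('"') and query.endswith('"'):
        return phrase_matches(query[1:-1], words)
    # Terms joined with + must all be present, in any order.
    terms = [t.strip() for t in query.split("+") if t.strip()]
    return all(term_matches(t, words) for t in terms)


# Hypothetical letter text for demonstration.
letter = "We wore the shamrock at the Christmas parade for Saint Patrick."
for q in ["christmas", "christ", "christ*", "shamrock + patrick + parade", '"christmas parade"']:
    print(q, "->", matches(q, letter))
```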
To get the most out of the local (IIIF Viewer) search, it may be useful to know the following:
- Stop words have been excluded from the search in the IIIF Viewer, so it is not possible to search for the following extremely common terms: a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with.
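The effect of the stop word list can be illustrated with a short sketch. The list is copied from the bullet above; the function name is hypothetical and the code only imitates the behaviour described, rather than reproducing how the IIIF Viewer search is implemented.

```python
# Stop words as listed above for the IIIF Viewer search.
STOP_WORDS = frozenset(
    "a an and are as at be but by for if in into is it no not of on or such "
    "that the their then there these they this to was will with".split()
)


def searchable_terms(query: str) -> list:
    """Return the query words that remain once stop words are dropped."""
    return [word for word in query.lower().split() if word not in STOP_WORDS]


print(searchable_terms("the grave of my father"))
print(searchable_terms("and then there was"))
```

A query made up entirely of stop words leaves nothing to search for, which is why such queries return no highlights in the viewer.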
Authors
Oksana Dereza is the Digital Library Developer in the University of Galway Library. Since 2022, she has been part of the An Gaodhal project, which focuses on developing OCR models for Cló Gaelach in a bilingual Irish-English context.