Larissa Will, Mannheim University Library: Automated Text Recognition (OCR) (July 2024)

Larissa Will is a consultant for research data management and digitization at the University Library's Research Data Centre, where she advises researchers in the (digital) humanities on research data. Her main areas of expertise are automated text recognition of historical manuscripts and prints and the creation and management of digital exhibitions. She joined the University Library in 2021 as a member of the OCR-BW project. Before that, she completed Bachelor's and Master's degrees in Culture and Economy at the University of Mannheim, with history and business administration as her subjects.

What is your current research topic?

Within the OCR-BW cooperation project, we set up the OCR competence center together with Tübingen University Library. During the project, we worked intensively on automatic text recognition and advised and supported researchers, archives, libraries and other institutions in Baden-Württemberg and beyond in using automatic text recognition and transcription software. The competence center has continued to exist since the project ended, and I now advise researchers at the University of Mannheim as well as external parties. In addition to classic one-on-one consultations, my colleagues and I offer open consultation hours and workshops, and we regularly attend conferences to exchange information on current developments.

For those who have not yet delved deeply into the topic of Data Science: How would you explain to a child what you are working on?

Imagine you have a picture of a book or a piece of paper with words written on it. You want to have these words in your computer so that you can edit or search them, for example. You could easily find your favorite part of the book or send the book to your friend. But the computer can't just read the image like you can. It only sees a collection of dots and colors.

This is where OCR comes into the picture. OCR stands for “Optical Character Recognition”. It's like a magic spell that teaches the computer to understand the dots on the image and convert them into text that the computer itself can read.

Everyone talks about Data Science – how would you describe the importance of the topic for yourself in three words?

Enabling new insights

What points of contact with Data Science does your work have? Which methods do you already use, and which would be interesting for you in the future?

In my work, I create the basis for generating data from printed or handwritten texts. Simply making a text searchable does not yet produce research data, because the recognition results are often still imperfect. Through targeted, so-called work-specific retraining of the neural networks, however, the results can be improved to the point where the transcriptions reach research data quality. They are then suitable for various analyses, text mining and digital editions.

How high is the value of Data Science for your work? Would your research even be possible without Data Science?

Data science is of great importance for my work. The low-threshold provision of full texts already makes an important contribution to improving accessibility, but it is only a starting point. The generated full texts provide the basis for a variety of evaluation methods such as linguistic and literary text analysis and the identification of patterns. In the humanities in particular, this makes it possible for the first time to systematically search and analyze large text corpora. Trends and patterns can be discovered that would not have been noticed before, and completely new questions can be asked of the source material. This also enables interdisciplinary cooperation, e.g. between history and business informatics.

What development opportunities do you see for the topic of Data Science in relation to your field?

I hope that artificial intelligence will make targeted data extraction from full texts easier in the future, so that structured output, e.g. in table form, can be obtained from unstructured text without the effort this has required so far. Combining OCR with natural language processing (NLP) techniques to better understand and analyze the content of texts is another promising development.
