On the Historiographic Authority of Machine Learning Systems

Session 4B

Author: Dominic Weber (University of Bern; University of Basel)

Published: September 12, 2024

DOI: 10.5281/zenodo.13907672

Abstract

The integration of Machine Learning in historical research has significantly altered the approach to sources, data and workflows. Historians now use Machine Learning applications such as Handwritten Text Recognition (HTR) and Natural Language Processing (NLP) to manage large corpora, enhancing research capabilities but also introducing challenges in combining machine-generated and manually created data without propagating errors. The reliability of machine-generated data is a central concern, paralleling issues found in traditional transcription and edition practices. The concept of factoids highlights the fragmentation and recontextualization of data in digital history. Evaluating Machine Learning systems, particularly through tools like CERberus for HTR, emphasises the need for qualitative error analysis to support historical research. The article proposes three strategic directions for digital history: defining clear needs to manage data pragmatically, enhancing transparency to improve data reuse and interoperability, and advancing data criticism and hermeneutics. These directions aim to refine the methods and practices of digital historians, ensuring that Machine Learning outputs are critically assessed and effectively integrated into historical scholarship.

Keywords

Machine Learning, Methodology, Epistemology, Facticity, Evaluation

Introduction

Over the last few years, Machine Learning applications have become increasingly popular in the humanities and social sciences in general, and therefore also in history. Handwritten Text Recognition (HTR) and various Natural Language Processing (NLP) tasks are now commonly employed in a plethora of research projects of various sizes. Even for PhD projects it is now feasible to study large corpora such as serial legal sources, which could not be processed entirely by hand. This acceleration of research processes implies fundamental changes to how we think about sources, data, research and workflows.

In history, Machine Learning systems are typically used to speed up the production of research data. As the output of these applications is never entirely accurate or correct, this raises the question of how historians can use machine-generated data together with manually created data without propagating errors and uncertainties to downstream tasks and investigations.

Facticity

The question of the combined usability of machine-generated and manually generated data is also a question of the reliability, or facticity, of data. Data generated by humans are not necessarily complete and correct either, as they are a product of human perception. Creating transcriptions, for example, depends on the respective transcription guidelines and on individual text understanding, which can lead to errors. Nevertheless, we consider transcriptions by experts to be correct and use them for historical research. This issue is even more evident in the field of editions. Even very old editions with methodological challenges are valued for their core content. Errors may exist, but because of the editors’ expertise they are largely accepted and the output is treated as authorised. This pragmatic approach enables efficient historical research: historians trust their ability to detect and correct errors in the course of their research.

Francesco Beretta represents data, information, and knowledge as a pyramid: data form the base, historical information (created from data through conceptual models and critical methods) forms the middle, and historical knowledge (produced from historical information through theories, statistical models and heuristics) forms the top (Beretta 2023, fig. 3). Interestingly, however, he makes an important distinction regarding digital data: “Digital data do not belong to the epistemic layer of data, but to the layer of information, of which they constitute the technical carrier” (Translation: DW. Original text: “[L]es données numériques n’appartiennent pas à la strate épistémique des données, mais bien à celle de l’information dont elles constituent le support informatique.” Beretta 2023, 18).

Andreas Fickers adds that digitization transforms the nature of sources, affecting the concept of the original (Fickers 2020, 162). Sources are preprocessed using HTR/OCR and various NLP strategies. The resulting digital data are already processed historical information. This shift from analog to digital means that what we extract from sources is not just given but constructed (Beretta 2023, 26). Analog historical research, which relies on handwritten archival documents, also depends on transcriptions or editions to conduct research pragmatically; and here, too, data becomes information. The main difference is that with the generation of digital data, the (often linear) structure of sources is typically dissolved in favour of a highly fragmented and hyperconnected structure (for hyperconnectivity see Fickers 2022, 51–54; for the underlying concept of hypertextual systems see Landow 2006, 53–58; for a more extensive discussion of digital representations of fragmented texts see Weber 2021). This is partly due to the way sources are processed into historical information using digital tools and methods, but it is inherently connected with issues of storing, retrieving, and presenting digital data – in a very technical sense.

The concept of factoids, introduced by Michele Pasin and John Bradley, is central to this argument. They define factoids as pieces of information about one or more persons in a primary source. These factoids are then represented in a semantic network of subject-predicate-object triples (Pasin and Bradley 2015, 89–90). This involves extracting statements from their original context, placing them in a new context, and outsourcing verification to later steps. Therefore, factoids can be contradictory. Francesco Beretta applies this idea to historical science, viewing the aggregation of factoids as a process aiming for the best possible approximation of facticity (Beretta 2023, 20). The challenge is to verify machine output sufficiently for historical research and to assess the usefulness of the factoid concept. Evaluating machine learning models and their outputs is crucial for this.
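
To make the factoid idea more tangible, the following minimal sketch (in Python, with entirely hypothetical names and values) represents two contradictory statements about the same person as subject-predicate-object triples, each retaining a pointer to the source passage it was extracted from; verification is deliberately deferred, as in Pasin and Bradley’s model.

```python
# A minimal, hypothetical sketch of the factoid idea: statements are kept as
# subject-predicate-object triples together with the source passage they were
# extracted from. Contradictory factoids may coexist; verification happens later.

from dataclasses import dataclass

@dataclass
class Factoid:
    subject: str    # e.g. an identifier for a person
    predicate: str  # the asserted relation
    obj: str        # the asserted value
    source: str     # reference to the passage the statement was taken from

factoids = [
    Factoid("person:anna_muster", "occupation", "glassmaker", "register_1612, fol. 14r"),
    # A second, contradictory statement from another source is kept as well:
    Factoid("person:anna_muster", "occupation", "innkeeper", "court_record_1615, p. 3"),
]

for f in factoids:
    print(f"{f.subject} --{f.predicate}--> {f.obj}  [{f.source}]")
```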

Qualifying Error Rates

Evaluating the output of a machine learning system is not trivial. Models can be evaluated using various calculated scores, which is done continuously during the training process. However, these performance metrics are statistical measures that generally refer to the model and are based on a set of test data. Even the probabilities output by machine learning systems when applied to new data are purely computational figures, only partially suitable for quality assurance. This verification is further complicated by the potentially vast scale of the output. Therefore, historical science must find a pragmatic way to translate statistical evaluation metrics into qualitative statements and identify systematic sources of error.

In automatic handwriting recognition, models are typically evaluated using the character error rate (CER) and word error rate (WER). These metrics only tell us the percentage of characters or words incorrectly recognised compared to a ground truth. They do not reveal how these errors are distributed, which is important when comparing automatic and manual transcriptions. For more detailed HTR model evaluation, CERberus is being developed (Haverals 2023). This tool compares ground truth with HTR output from the same source. Instead of calculating just the character error rate, it breaks the differences down further. Errors are categorised into missing, excess, and incorrectly recognised characters. Additionally, a separate CER is calculated for all characters and Unicode blocks in the text and aggregated into confusion statistics that identify the most frequently confused characters. Confusion plots are generated to show the most common errors for each character. These metrics do not pinpoint specific errors, but they provide a more precise analysis of the model’s behaviour. CERberus cannot evaluate entirely new HTR output for which no comparison text exists, but it is a valuable tool for Digital History, revealing which character forms are often confused and guiding model improvement or post-processing strategies.
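
As a simplified illustration of this kind of qualitative evaluation, the following sketch computes a plain character error rate and tallies the most frequent character confusions between a ground-truth transcription and HTR output. The example strings are invented, and the code does not reproduce CERberus itself, which offers a far richer analysis (Unicode blocks, confusion plots, and so on).

```python
# Simplified illustration: compute the character error rate (CER) and count the
# most frequent character confusions between ground truth and HTR output.
# The example strings are invented; this is not CERberus's API.

from collections import Counter
from difflib import SequenceMatcher

def cer(ground_truth: str, prediction: str) -> float:
    """Levenshtein distance between the strings, normalised by ground truth length."""
    m, n = len(ground_truth), len(prediction)
    dist = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dist[0] = dist[0], i
        for j in range(1, n + 1):
            cost = 0 if ground_truth[i - 1] == prediction[j - 1] else 1
            prev, dist[j] = dist[j], min(dist[j] + 1, dist[j - 1] + 1, prev + cost)
    return dist[n] / max(m, 1)

def confusions(ground_truth: str, prediction: str) -> Counter:
    """Count character substitutions in aligned 'replace' regions (approximate)."""
    counts = Counter()
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, ground_truth, prediction).get_opcodes():
        if tag == "replace":
            for gt_char, pred_char in zip(ground_truth[i1:i2], prediction[j1:j2]):
                counts[(gt_char, pred_char)] += 1
    return counts

gt = "Item ein huss an der gassen"   # hypothetical ground-truth transcription
htr = "Jtem ein hnss an der gassen"  # hypothetical HTR output
print(f"CER: {cer(gt, htr):.2%}")
print("Most frequent confusions:", confusions(gt, htr).most_common(3))
```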

In other machine learning applications, such as named entity recognition (NER), different metrics are important, and these likewise require detailed analysis of error sources. Evaluating NER is more complex than evaluating HTR because it involves categorising longer text sections based on context. Precision (how many of the recognised positives are true positives) and recall (how many of the actual positives are recognised) are combined into the F1-score to indicate model performance. Fu et al. proposed evaluating NER with a set of eight annotation attributes that influence model performance. These attributes are divided into local properties (entity length, sentence length, unknown word density, entity density) and aggregated attributes (annotation consistency and frequency at the token and entity levels) (Fu, Liu, and Neubig 2020, 3). Buckets of data points on which a model performs particularly well or poorly are created and evaluated separately (Fu, Liu, and Neubig 2020, 1). This analysis identifies the conditions affecting model performance and can guide further training steps and dataset expansion.
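
As a reminder of how these scores relate, the sketch below computes precision, recall, and F1 over sets of predicted and gold entity spans. The spans are hypothetical, and a bucket-wise evaluation in the spirit of Fu et al. would repeat the same calculation per attribute group (for example, short versus long entities).

```python
# A minimal sketch of span-level NER evaluation on hypothetical data: entities
# are (start, end, label) tuples, and only an exact match counts as a true positive.

gold = {(0, 2, "PER"), (10, 12, "LOC"), (20, 23, "ORG")}       # hypothetical gold annotations
predicted = {(0, 2, "PER"), (10, 12, "ORG"), (30, 31, "PER")}  # hypothetical model output

true_positives = len(gold & predicted)
precision = true_positives / len(predicted) if predicted else 0.0  # share of predictions that are correct
recall = true_positives / len(gold) if gold else 0.0               # share of gold entities that were found
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(f"precision={precision:.2f}  recall={recall:.2f}  F1={f1:.2f}")
# A bucket-wise evaluation in the spirit of Fu et al. would repeat this
# calculation per group of entities, e.g. short vs. long entity names.
```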

The qualitative error analysis presented here does not solve the question of authorizing machine learning output for historical research. Instead, it provides tools to assess models more precisely and analyse training and test datasets. Such investigations extend the crucial source criticism in historical science to digital datasets and the algorithms and models involved in their creation. This requires historians to expand their traditional methods to include new, less familiar areas.

Three Strategic Directions

In this final part of the article, the questions and problem areas raised above are consolidated, and three strategic directions for digital history are derived from them. These are suggestions for how the theory, methodology, and practice of Digital History could evolve to address and mitigate the identified problems. The three directions should not be viewed in isolation or as mutually exclusive; rather, they are interdependent and work together to meet these challenges.

Direction 1: Formulating Clear Needs

When data is collected or processed into information in the historical research process, a certain pragmatism is involved. Ideally, a project would transcribe an entire source collection fully and consistently with the same thoroughness, but in practice a compromise is usually struck between completeness, correctness, and pragmatism. Often, for one’s own research purposes, it is sufficient to transcribe a source only to the extent that its meaning can be understood. This compromise has not yet fully carried over into Digital History. Even if a good CER is achieved, there is pressure to justify how these potential errors are managed in the subsequent research process. This skepticism is not fundamentally bad, and the epistemological consequences of erroneous machine learning output are worth discussing. Nonetheless, the resulting text is usually quite readable and usable.

Thus, I argue that digital history must more clearly define and communicate its needs. However, it must be remembered that Digital History also faces broader demands. Especially in machine learning-supported research, the demand for data interoperability is rightly emphasised. Incomplete or erroneous datasets are, of course, less reusable by other research projects.

Direction 2: Creating Transparency

The second direction for digital history is to move towards greater transparency. The issue of reusability and interoperability of datasets from the first strategic direction can be at least partially mitigated by transparency.

As Hodel et al. convincingly argued, it is extremely sensible and desirable for projects using HTR to publish their training data. This allows for gradual development towards models that can generalise as broadly as possible (Hodel et al. 2021, 7–8). If a CERberus error analysis is conducted for HTR that goes beyond the mere CER, it makes sense to publish this alongside the data and the model. With this information, it is easier to assess whether it might be worthwhile to include this dataset in one’s own training material. Similarly, when NER models are published, an extended evaluation according to Fu et al. helps to better assess the performance of a model for one’s own dataset.

Pasin and Bradley, in their prosopographic graph database, indicate the provenance of each data point and who captured it (Pasin and Bradley 2015, 91–92). This principle could also be interesting for Digital History: the metadata of published research data could indicate whether it was generated manually or by a machine, ideally with information about the model used or, for manually generated data, the annotating person. Models provide a confidence estimate with each prediction, indicating how likely that prediction is to be correct. The most probable prediction would be treated as the first factoid. The second or even third most probable prediction from the system could provide additional factoids that can be incorporated into the source representation. These additional pieces of information can support the further research process by allowing inconsistencies and errors to be better assessed and weighed.
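
To illustrate what such provenance-aware research data could look like, the sketch below records, for a single data point, one manually created factoid and the two most probable machine-generated predictions, each with its origin and confidence value. All field names, identifiers, and model names are hypothetical and not part of any existing standard.

```python
# A hypothetical sketch of provenance-aware research data: each factoid records
# whether it was created manually or by a model, which model, and with what
# confidence. The two most probable machine predictions are kept as separate
# factoids instead of discarding everything but the top-ranked output.

import json

factoids = [
    {
        "subject": "person:hans_glaser",
        "predicate": "place_of_origin",
        "object": "Basel",
        "origin": {"type": "manual", "annotator": "DW"},
    },
    {
        "subject": "person:hans_glaser",
        "predicate": "place_of_origin",
        "object": "Basel",
        "origin": {"type": "model", "model": "ner-model-v2", "rank": 1, "confidence": 0.87},
    },
    {
        "subject": "person:hans_glaser",
        "predicate": "place_of_origin",
        "object": "Liestal",
        "origin": {"type": "model", "model": "ner-model-v2", "rank": 2, "confidence": 0.09},
    },
]

print(json.dumps(factoids, indent=2, ensure_ascii=False))
```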

Direction 3: Data Criticism and Data Hermeneutics

The shift to digital history requires an evaluation and adjustment of our hermeneutic methods. This ongoing discourse is not new, and Torsten Hiltmann has identified three broad directions: first, the debate about extending source criticism to data, algorithms, and interfaces; second, the call for computer-assisted methods to support text understanding; and third, the theorization of data hermeneutics, or the “understanding of and with data” (Hiltmann 2024, 208).

Even though these discourse strands cannot be sharply separated, the focus here is primarily on data criticism and hermeneutics. The former can fundamentally orient itself towards classical source criticism. Since digital data is not given but constructed, it is crucial to discuss by whom, for what purpose, and how data was generated. This is no easy task, especially when datasets are poorly documented. Therefore, the call for data and model criticism is closely linked to the plea for more transparency in data and model publication.

In the move towards data hermeneutics, a thorough rethinking of the factoid principle can be fruitful. If, as suggested above, the second or even third most likely predictions of a model are included as factoids in the publication of research data, this opens up additional perspectives on the sources underlying the data. From these new standpoints, the data – and thus the sources – can be analyzed and understood more thoroughly. Additionally, this allows for a more informed critique of the data, and extensive transparency also mitigates the “black box” problem of interpretation described by Silke Schwandt (Schwandt 2022). If we more precisely describe and reflect on how we generate digital data from sources as historians, we will find that our methods are algorithmic (Schwandt 2022, 81–82). This insight can also support the understanding of how machine learning applications work. Data hermeneutics thus requires both a critical reflection of our methods and a more transparent approach to data and metadata.

References

Beretta, Francesco. 2023. “Données ouvertes liées et recherche historique : un changement de paradigme.” Humanités numériques, no. 7 (July). https://doi.org/10.4000/revuehn.3349.
Fickers, Andreas. 2020. “Update für die Hermeneutik. Geschichtswissenschaft auf dem Weg zur digitalen Forensik?” Zeithistorische Forschungen 1: 157–68. https://doi.org/10.14765/ZZF.DOK-1765.
———. 2022. “What the D does to history: Das digitale Zeitalter als neues historisches Zeitregime?” In Digital History: Konzepte, Methoden und Kritiken Digitaler Geschichtswissenschaft, edited by Karoline Dominika Döring, Stefan Haas, Mareike König, and Jörg Wettlaufer, 45–64. De Gruyter Oldenbourg. https://doi.org/10.1515/9783110757101-003.
Fu, Jinlan, Pengfei Liu, and Graham Neubig. 2020. “Interpretable Multi-dataset Evaluation for Named Entity Recognition.” arXiv. https://doi.org/10.48550/arXiv.2011.06854.
Haverals, Wouter. 2023. “CERberus: Guardian Against Character Errors.” https://github.com/WHaverals/CERberus.
Hiltmann, Torsten. 2024. “Hermeneutik in Zeiten der KI: Large Language Models als hermeneutische Instrumente in den Geschichtswissenschaften.” In KI:Text, 201–32. De Gruyter. https://doi.org/10.1515/9783111351490-014.
Hodel, Tobias, David Schoch, Christa Schneider, and Jake Purcell. 2021. “General Models for Handwritten Text Recognition: Feasibility and State-of-the Art. German Kurrent as an Example.” Journal of Open Humanities Data 7. https://doi.org/10.5334/johd.46.
Landow, George P. 2006. Hypertext 3.0: Critical Theory and New Media in an Era of Globalization. 3rd ed. Baltimore: Johns Hopkins University Press.
Pasin, Michele, and John Bradley. 2015. “Factoid-Based Prosopography and Computer Ontologies: Towards an Integrated Approach.” Digital Scholarship in the Humanities 30 (1): 86–97. https://doi.org/10.1093/llc/fqt037.
Schwandt, Silke. 2022. “Opening the Black Box of Interpretation: Digital History Practices as Models of Knowledge.” History and Theory 61 (4): 77–85. https://doi.org/10.1111/hith.12281.
Weber, Dominic. 2021. “Klassifizieren – Verknüpfen – Abbilden. Herausforderungen Der Digitalen Repräsentation Hypertextueller Systeme Am Beispiel Des Klingentaler Jahrzeitenbuchs H.” Master’s thesis, Basel: University of Basel. https://github.com/DominicWeber/jahrzeitenbuch-h.

Reuse

CC BY-SA 4.0

Citation

BibTeX citation:
@misc{weber2024,
  author = {Weber, Dominic},
  editor = {Baudry, Jérôme and Burkart, Lucas and Joyeux-Prunel,
    Béatrice and Kurmann, Eliane and Mähr, Moritz and Natale, Enrico and
    Sibille, Christiane and Twente, Moritz},
  title = {On the {Historiographic} {Authority} of {Machine} {Learning}
    {Systems}},
  date = {2024-09-12},
  url = {https://digihistch24.github.io/submissions/465/},
  doi = {10.5281/zenodo.13907672},
  langid = {en},
  abstract = {The integration of Machine Learning in historical research
    has significantly altered the approach to sources, data and
    workflows. Historians now use Machine Learning applications such as
    Handwritten Text Recognition (HTR) and Natural Language Processing
    (NLP) to manage large corpora, enhancing research capabilities but
    also introducing challenges in combining machine-generated and
    manually created data without propagating errors. The reliability of
    machine-generated data is a central concern, paralleling issues
    found in traditional transcription and edition practices. The
    concept of factoids highlights the fragmentation and
    recontextualization of data in digital history. Evaluating Machine
    Learning systems, particularly through tools like CERberus for HTR,
    emphasises the need for qualitative error analysis to support
    historical research. The article proposes three strategic directions
    for digital history: defining clear needs to manage data
    pragmatically, enhancing transparency to improve data reuse and
    interoperability, and advancing data criticism and hermeneutics.
    These directions aim to refine the methods and practices of digital
    historians, ensuring that Machine Learning outputs are critically
    assessed and effectively integrated into historical scholarship.}
}
For attribution, please cite this work as:
Weber, Dominic. 2024. “On the Historiographic Authority of Machine Learning Systems.” Edited by Jérôme Baudry, Lucas Burkart, Béatrice Joyeux-Prunel, Eliane Kurmann, Moritz Mähr, Enrico Natale, Christiane Sibille, and Moritz Twente. Digital History Switzerland 2024: Book of Abstracts. https://doi.org/10.5281/zenodo.13907672.