On the Historiographic Authority of Machine Learning Systems
The integration of Machine Learning in historical research has significantly altered the approach to sources, data and workflows. Historians now use Machine Learning applications such as Handwritten Text Recognition (HTR) and Natural Language Processing (NLP) to manage large corpora, enhancing research capabilities but also introducing challenges in combining machine-generated and manually created data without propagating errors. The reliability of machine-generated data is a central concern, paralleling issues found in traditional transcription and edition practices. The concept of factoids highlights the fragmentation and recontextualization of data in digital history. Evaluating Machine Learning systems, particularly through tools like CERberus for HTR, emphasises the need for qualitative error analysis to support historical research. The article proposes three strategic directions for digital history: defining clear needs to manage data pragmatically, enhancing transparency to improve data reuse and interoperability, and advancing data criticism and hermeneutics. These directions aim to refine the methods and practices of digital historians, ensuring that Machine Learning outputs are critically assessed and effectively integrated into historical scholarship.
Machine Learning, Methodology, Epistemology, Facticity, Evaluation
Introduction
Over the last few years, Machine Learning applications have become increasingly popular in the humanities and social sciences in general, and therefore also in history. Handwritten Text Recognition (HTR) and various tasks of Natural Language Processing (NLP) are now commonly employed in a plethora of research projects of various sizes. Even for PhD projects it is now feasible to research large corpora such as serial legal sources, which could not be studied entirely by hand. This acceleration of research processes implies fundamental changes to how we think about sources, data, research and workflows.
In history, Machine Learning systems are typically used to speed up the production of research data. As the output of these applications is never entirely accurate or correct, this raises the question of how historians can use machine-generated data together with manually created data without propagating errors and uncertainties to downstream tasks and investigations.
Facticity
The question of the combined usability of machine-generated and manually generated data is also a question of the reliability or facticity of data. Data generated by humans are not necessarily complete and correct either, as they are a product of human perception. For example, creating transcriptions depends on the respective transcription guidelines and on individual text understanding, which can lead to errors. Nevertheless, we consider transcriptions by experts as correct and use them for historical research. This issue is even more evident in the field of editions. Even very old editions with methodological challenges are valued for their core content. Errors may exist, but they are largely accepted due to the expertise of the editors, and the output is treated as authorised. This pragmatic approach enables efficient historical research: historians trust their ability to detect and correct errors during research.
Francesco Beretta represents data, information, and knowledge as a pyramid: data form the base, historical information (created from data through conceptual models and critical methods) forms the middle, and historical knowledge (produced from historical information through theories, statistical models and heuristics) forms the top (Beretta 2023, fig. 3). Interestingly, however, he makes an important distinction regarding digital data: “Digital data do not belong to the epistemic stratum of data, but to that of information, of which they constitute the technical carrier” (Translation: DW. Original text: “[L]es données numériques n’appartiennent pas à la strate épistémique des données, mais bien à celle de l’information dont elles constituent le support informatique.” Beretta 2023, 18).
Andreas Fickers adds that digitization transforms the nature of sources, affecting the concept of the original (Fickers 2020, 162). Sources are preprocessed using HTR/OCR and various NLP strategies. The resulting digital data are already processed historical information. This shift from analog to digital means that what we extract from sources is not just given but constructed (Beretta 2023, 26). Analog historical research, which relies on handwritten archival documents, also depends on transcriptions or editions to conduct research pragmatically; and here, too, data becomes information. The main difference is that with the generation of digital data, the (often linear) structure of sources is typically dissolved in favour of a highly fragmented and hyperconnected structure (for hyperconnectivity, see Fickers 2022, 51–54; for the underlying concept of hypertextual systems, see Landow 2006, 53–58; for a more extensive discussion of digital representations of fragmented texts, see Weber 2021). This is partly due to the way sources are processed into historical information using digital tools and methods, but it is inherently connected with issues of storing, retrieving, and presenting digital data – in a very technical sense.
The concept of factoids, introduced by Michele Pasin and John Bradley, is central to this argument. They define factoids as pieces of information about one or more persons in a primary source. These factoids are then represented in a semantic network of subject-predicate-object triples (Pasin and Bradley 2015, 89–90). This involves extracting statements from their original context, placing them in a new context, and deferring verification to later steps. Factoids can therefore be contradictory. Francesco Beretta applies this idea to historical science, viewing the aggregation of factoids as a process aiming for the best possible approximation of facticity (Beretta 2023, 20). The challenge is to verify machine output sufficiently for historical research and to assess the usefulness of the factoid concept. Evaluating machine learning models and their outputs is crucial for this.
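To make the factoid model more tangible, here is a minimal, purely illustrative sketch in Python (not the data model used by Pasin and Bradley): each factoid is a subject-predicate-object triple carrying a reference to the source it was extracted from, and two factoids about the same person may well contradict each other. All identifiers and source references are invented.

```python
from dataclasses import dataclass

@dataclass
class Factoid:
    """One statement extracted from a primary source; field names are invented."""
    subject: str    # identifier of the person the statement is about
    predicate: str  # the asserted relation
    obj: str        # the asserted value
    source: str     # reference to the primary source (placeholder values below)

# Two contradictory factoids about the same (fictional) person; deciding which
# statement holds is deferred to a later step of the research process.
factoids = [
    Factoid("person:example_1", "occupation", "weaver", "source A, fol. 12r"),
    Factoid("person:example_1", "occupation", "day labourer", "source B, fol. 3v"),
]

for f in factoids:
    print(f"{f.subject} --{f.predicate}--> {f.obj}  [{f.source}]")
```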
Qualifying Error Rates
Evaluating the output of a machine learning system is not trivial. Models can be evaluated using various calculated scores, and this is done continuously during the training process. However, these performance metrics are statistical measures that refer to the model as a whole and are computed on a set of test data. Even the probabilities that machine learning systems output when applied to new data are purely computational figures and only partially suitable for quality assurance. Verification is further complicated by the potentially vast scale of the output. Therefore, historical science must find a pragmatic way to translate statistical evaluation metrics into qualitative statements and to identify systematic sources of error.
In automatic handwriting recognition, models are typically evaluated using the character error rate (CER) or word error rate (WER). These metrics only tell us the percentage of characters or words incorrectly recognised compared to a ground truth. They do not reveal how these errors are distributed, which is important when comparing automatic and manual transcriptions. For a more detailed evaluation of HTR models, CERberus is being developed (Haverals 2023). This tool compares ground truth with HTR output from the same source. Instead of calculating just the character error rate, it breaks down the differences further: errors are categorised into missing, excess, and incorrectly recognised characters. Additionally, a separate CER is calculated for all characters and Unicode blocks in the text and aggregated into confusion statistics that identify the most frequently confused characters, and confusion plots are generated to show the most common errors for each character. These metrics do not pinpoint individual errors but provide a more precise analysis of the model’s behaviour. CERberus cannot evaluate entirely new HTR output without a comparison text, but it is a valuable tool for Digital History, revealing which character forms are often confused and guiding model improvement or post-processing strategies.
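To illustrate the kind of character-level analysis described here, the following sketch computes a character error rate from a Levenshtein alignment and counts which characters are confused most often. It is a rough approximation of the idea behind CERberus, not the tool itself, and the ground truth and HTR output strings are invented.

```python
from collections import Counter

def align(gt: str, hyp: str):
    """Levenshtein alignment between a ground truth string and HTR output.

    Returns the number of substitutions, insertions and deletions, plus a
    Counter of confused character pairs (ground truth character, prediction).
    """
    n, m = len(gt), len(hyp)
    # dp[i][j] = edit distance between gt[:i] and hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if gt[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # match or substitution
    # Walk back through the table to categorise the individual errors.
    subs = ins = dels = 0
    confusions = Counter()
    i, j = n, m
    while i > 0 or j > 0:
        cost = 1
        if i > 0 and j > 0:
            cost = 0 if gt[i - 1] == hyp[j - 1] else 1
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + cost:
            if cost:
                subs += 1
                confusions[(gt[i - 1], hyp[j - 1])] += 1
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return subs, ins, dels, confusions

ground_truth = "Anno domini millesimo"   # invented example line
htr_output = "Armo dornini millesimo"    # invented toy HTR output
s, i, d, confusions = align(ground_truth, htr_output)
cer = (s + i + d) / len(ground_truth)
print(f"CER: {cer:.2%} ({s} substitutions, {i} insertions, {d} deletions)")
print("Most frequent confusions:", confusions.most_common(3))
```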
In other machine learning applications, such as named entity recognition (NER), different metrics are important and a detailed analysis of error sources is required. Evaluating NER is more complex than HTR because it involves categorising longer text spans based on their context. Precision (how many recognised positives are true positives) and recall (how many actual positives are recognised) are combined into the F1-score to indicate model performance. Fu et al. proposed evaluating NER with a set of eight annotation attributes that influence model performance. These attributes are divided into local properties (entity length, sentence length, unknown word density, entity density) and aggregated attributes (annotation consistency and frequency at the token and entity levels) (Fu, Liu, and Neubig 2020, 3). Buckets of data points on which a model performs particularly well or poorly are created and evaluated separately (Fu, Liu, and Neubig 2020, 1). This analysis identifies the conditions that affect model performance and can guide further training steps and the expansion of datasets.
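The following sketch shows, with invented toy spans, how entity-level precision, recall and F1 are computed and how a simple bucketing by entity length could look; it follows the spirit of Fu et al.’s attribute-based buckets rather than their actual implementation.

```python
from collections import defaultdict

# Gold-standard and predicted entities per document as (start, end, label)
# token spans. The spans below are invented purely for illustration.
gold = [{(0, 2, "PER"), (5, 6, "LOC"), (10, 14, "ORG")}]
pred = [{(0, 2, "PER"), (5, 6, "ORG"), (10, 14, "ORG")}]

def precision_recall_f1(gold_docs, pred_docs):
    """Entity-level scores over exact span-and-label matches."""
    tp = sum(len(g & p) for g, p in zip(gold_docs, pred_docs))
    n_pred = sum(len(p) for p in pred_docs)
    n_gold = sum(len(g) for g in gold_docs)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print("overall (precision, recall, F1):", precision_recall_f1(gold, pred))

# Bucket the gold entities by one attribute (entity length in tokens) and
# report recall per bucket -- a crude stand-in for the attribute-based
# bucket analysis described by Fu et al.
buckets = defaultdict(lambda: [0, 0])   # entity length -> [matched, total]
for g, p in zip(gold, pred):
    for entity in g:
        length = entity[1] - entity[0]
        buckets[length][1] += 1
        if entity in p:
            buckets[length][0] += 1
for length, (hit, total) in sorted(buckets.items()):
    print(f"entities of length {length}: recall {hit / total:.2f} ({hit}/{total})")
```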
The qualitative error analysis presented here does not solve the question of authorising machine learning output for historical research. Instead, it provides tools to assess models more precisely and to analyse training and test datasets. Such investigations extend source criticism, a core practice of historical science, to digital datasets and to the algorithms and models involved in their creation. This requires historians to expand their traditional methods into new, less familiar areas.
Three Strategic Directions
This final part of the article consolidates the questions and problem areas raised above and derives three strategic directions for Digital History from them. These are suggestions for how the theory, methodology, and practice of Digital History could evolve to address and mitigate the identified problem areas. The three directions should not be viewed in isolation or as mutually exclusive; rather, they are interdependent and should work together to meet these challenges.
Direction 1: Formulating Clear Needs
When data is collected or processed into information in the historical research process, a certain pragmatism is involved. Ideally, a research project would transcribe its entire source collection fully and with consistent thoroughness, but in practice a compromise is usually found between completeness, correctness, and pragmatism. Often, for one’s own research purposes, it is sufficient to transcribe a source only to the extent that its meaning can be understood. This compromise has not fully carried over into Digital History. Even if a good CER is achieved, there is pressure to justify how the remaining errors are managed in the subsequent research process. This scepticism is not fundamentally bad, and the epistemological consequences of erroneous machine learning output are worth discussing. Nonetheless, the resulting text is usually quite readable and usable.
Thus, I argue that Digital History must define and communicate its needs more clearly. However, it must be remembered that Digital History also faces broader demands. Especially in machine learning-supported research, the demand for data interoperability is rightly emphasised, and incomplete or erroneous datasets are, of course, less reusable by other research projects.
Direction 2: Creating Transparency
The second direction for Digital History is to move towards greater transparency. The issue of the reusability and interoperability of datasets, raised under the first strategic direction, can be at least partially mitigated by transparency.
As Hodel et al. convincingly argued, it is extremely sensible and desirable for projects using HTR to publish their training data. This allows for gradual development towards models that can generalise as broadly as possible (Hodel et al. 2021, 7–8). If a CERberus error analysis is conducted for HTR that goes beyond the mere CER, it makes sense to publish this alongside the data and the model. With this information, it is easier to assess whether it might be worthwhile to include this dataset in one’s own training material. Similarly, when NER models are published, an extended evaluation according to Fu et al. helps to better assess the performance of a model for one’s own dataset.
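What publishing such an extended evaluation alongside a model and its training data could look like is sketched below as a hypothetical metadata record; all field names and values are invented for illustration and follow no existing standard.

```python
import json

# Hypothetical metadata record published alongside an HTR model and its
# training data. All field names and values are invented for illustration.
model_card = {
    "model": "htr-court-records-v2",
    "training_data": "https://example.org/datasets/court-records-ground-truth",
    "ground_truth_size": {"pages": 450, "lines": 12800},
    "evaluation": {
        "cer": 0.041,
        "errors_by_type": {"substitutions": 0.62, "insertions": 0.17, "deletions": 0.21},
        "most_confused_characters": [["e", "c"], ["n", "u"], ["ſ", "f"]],
    },
}
print(json.dumps(model_card, ensure_ascii=False, indent=2))
```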
Pasin and Bradley, in their prosopographic graph database, indicate the provenance of each data point and who captured it (Pasin and Bradley 2015, 91–92). This principle could also be interesting for Digital History: the metadata of published research data could indicate whether it was generated manually or by a machine, ideally with information about the model used or, for manually generated data, the annotating person. Models provide a confidence estimate with each prediction, indicating how likely the prediction is to be correct. The most probable prediction would be treated as the first factoid, while the second or even third most probable predictions of the system could provide additional factoids to be incorporated into the source representation. These additional pieces of information support the further research process by allowing inconsistencies and errors to be better assessed and balanced.
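As a sketch of how this could be recorded in published research data, the following hypothetical record (all field names and values invented) keeps the most probable prediction as the primary factoid, the lower-ranked predictions as additional factoids, and a provenance block stating whether the data point was generated by a model or a person.

```python
# Hypothetical record for a single machine-generated data point. The most
# probable prediction serves as the primary factoid, lower-ranked predictions
# are kept as additional factoids, and the provenance block states how the
# data point was created. All field names are invented for illustration.
record = {
    "value": "Basel",                     # most probable prediction (first factoid)
    "confidence": 0.87,                   # confidence estimate reported by the model
    "alternatives": [                     # second and third most probable predictions
        {"value": "Bafel", "confidence": 0.09},
        {"value": "Basol", "confidence": 0.02},
    ],
    "provenance": {
        "created_by": "model",            # "model" or "human"
        "model": "htr-court-records-v2",  # identifier of the model, if machine-generated
        "annotator": None,                # name or ID of the annotating person, if manual
    },
}

# Later users of the data can decide how much weight to give the alternatives,
# for example by only considering them when the top prediction is uncertain:
if record["confidence"] < 0.9:
    print([alt["value"] for alt in record["alternatives"]])
```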
Direction 3: Data Criticism and Data Hermeneutics
The shift to digital history requires an evaluation and adjustment of our hermeneutic methods. This ongoing discourse is not new, and Torsten Hiltmann has identified three broad directions: first, the debate about extending source criticism to data, algorithms, and interfaces; second, the call for computer-assisted methods to support text understanding; and third, the theorization of data hermeneutics, or the “understanding of and with data” (Hiltmann 2024, 208).
Even though these discourse strands cannot be sharply separated, the focus here is primarily on data criticism and hermeneutics. The former can fundamentally orient itself towards classical source criticism. Since digital data is not given but constructed, it is crucial to discuss by whom, for what purpose, and how data was generated. This is no easy task, especially when datasets are poorly documented. Therefore, the call for data and model criticism is closely linked to the plea for more transparency in data and model publication.
In the move towards data hermeneutics, a thorough rethinking of the factoid principle can be fruitful. If, as suggested above, the second or even third most likely predictions of a model are included as factoids in the publication of research data, this opens up additional perspectives on the sources underlying the data. From these new standpoints, the data – and thus the sources – can be analysed and understood more thoroughly. Additionally, this allows for a more informed critique of the data, and extensive transparency also mitigates the “black box” problem of interpretation described by Silke Schwandt (Schwandt 2022). If we describe and reflect more precisely on how we, as historians, generate digital data from sources, we will find that our methods are algorithmic (Schwandt 2022, 81–82). This insight can also support the understanding of how machine learning applications work. Data hermeneutics thus requires both a critical reflection of our methods and a more transparent approach to data and metadata.
References
Citation
@misc{weber2024,
author = {Weber, Dominic},
editor = {Baudry, Jérôme and Burkart, Lucas and Joyeux-Prunel,
Béatrice and Kurmann, Eliane and Mähr, Moritz and Natale, Enrico and
Sibille, Christiane and Twente, Moritz},
title = {On the {Historiographic} {Authority} of {Machine} {Learning}
{Systems}},
date = {2024-07-23},
url = {https://digihistch24.github.io/book-of-abstracts/submissions/465/},
langid = {en},
}