DigiHistCH24
From words to numbers. Methodological perspectives on large scale Named Entity Linking

Session 7A
Authors and affiliations

  • Tarun Chadha, ETH Zürich IT Services
  • Gentiana Rashiti, ETH Zürich Library
  • Christiane Sibille, ETH Zürich Library
  • Agnieszka Ilnicka, ETH Zürich IT Services

Published: September 13, 2024
Modified: October 13, 2024
DOI: 10.5281/zenodo.13907910

Abstract
Named Entity Linking (NEL) describes the recognition, disambiguation, and linking of so-called «Named Entities» (such as people, places, and organizations) in text. Machine-assisted linking of entities helps to identify historical actors in large source corpora and thus contributes significantly to digital approaches in historical research. However, applying NEL to historical data presents particular challenges, ranging from poor OCR and alternate spellings to the under-representation of people from historical texts in contemporary databases. Because the direct context of an entity often offers only sparse specific information, we are developing a robust, modular, and scalable workflow in which we «embed» people by the context in which they appear. This provides more information, enabling disambiguation even when only limited data is present and making it possible to apply NEL to large text corpora. Such techniques have been used and described in works such as Nozza et al. (2019) and Vasilyev et al. (2022). By developing this pipeline and the corresponding embedding knowledge base(s) of historical entities, we want to enable the use of such methods in the Swiss GLAM landscape.
Keywords

Machine Learning, Named Entity Linking, Named Entity Recognition, Historical Data, Natural Language Processing

Introduction

Named entity recognition, disambiguation, and linking are pivotal methods in Natural Language Processing (NLP) applied to historical research. Applying them to historical texts presents particular and complex challenges (Bunout, Ehrmann, and Clavert 2023; Luthra et al. 2022; Ehrmann et al. 2023). The methods must cope with context-dependent meanings of named entities as well as with polysemy, homonymy, and naming variations.

Historically, solutions ranged from basic string matching to intricate rule-based heuristics. While these methods are still widely used, they often fall short in terms of scalability, generalization, and accuracy, particularly when compared to current machine-learning techniques. Recent work has shifted towards contextual embeddings, which have markedly improved accuracy in these tasks, as shown by Yamada et al. (2016), Ganea and Hofmann (2017), and Chen et al. (2020).

Vector embeddings are an essential tool in NLP for representing words, and longer passages such as sentences, as numerical vectors. When applied appropriately, they capture the meaning a word takes on in the context in which it appears. For instance, in sentences such as «I opened an account at the bank» and «Beavers build dams in river banks,» the word «bank» would be embedded differently. Conversely, the vector embeddings for the sentences «I sat down on the chair» and «I lowered myself onto the seat» would be «close» in the vector space, as they express similar content.
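The notion of two embeddings being «close» is commonly made concrete with cosine similarity. The following is a minimal sketch of that measure only: the three-dimensional vectors are hand-made toy values chosen for illustration, not the output of any embedding model, and all variable names are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (illustrative values only, NOT produced by a real model).
chair_sentence = [0.9, 0.1, 0.2]  # «I sat down on the chair»
seat_sentence = [0.8, 0.2, 0.3]   # «I lowered myself onto the seat»
bank_sentence = [0.1, 0.9, 0.1]   # «I opened an account at the bank»

sim_close = cosine_similarity(chair_sentence, seat_sentence)
sim_far = cosine_similarity(chair_sentence, bank_sentence)

print(f"chair vs. seat: {sim_close:.2f}")
print(f"chair vs. bank: {sim_far:.2f}")
```

With a real embedding model the vectors would have hundreds of dimensions, but the comparison step would look the same.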

For linking named entities in a text, e.g. persons, this means that we embed them based on the context in which they appear. Suppose a name in the text matches two candidate persons equally well on first name, last name, and time period, and that one candidate is an architect and the other a medical doctor. If the name appears in an article about architecture, we can take this semantic context into account as an additional parameter when calculating the most likely match.
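The architect/doctor case can be sketched as a ranking of candidates by the similarity between the mention's context and each candidate's description. This is an illustrative sketch, not the project's implementation: a simple bag-of-words count vector stands in for a learned embedding, and the candidate identifiers and descriptions are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Stand-in for an embedding: token counts of the lowercased text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(context, candidates):
    """Return (candidate_id, score) pairs, best match first."""
    context_vector = bag_of_words(context)
    scores = {
        candidate_id: cosine(context_vector, bag_of_words(description))
        for candidate_id, description in candidates.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical candidates sharing the same name and time period.
candidates = {
    "person-A": "architect who designed public buildings and bridges",
    "person-B": "medical doctor who worked in a hospital",
}
mention_context = "the article discusses buildings designed by the architect"

ranking = rank_candidates(mention_context, candidates)
print(ranking[0][0])
```

In practice the context score would be one signal among several, combined with name and date matching rather than replacing them.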

Methods

In our presentation, we will show the current state of our project, which aims to create a robust and scalable pipeline for applying embeddings-based NEL to historical texts. Our work focuses on three key aspects. Firstly, we present an embeddings-based linking and disambiguation workflow applied to a historical corpus of Swiss magazines (E-Periodica). As reference knowledge bases, it uses Wikipedia, the Gemeinsame Normdatei (GND, the German Authority File), and, since our primary use cases deal with historical material from Switzerland, the Historical Dictionary of Switzerland (HDS). This part aims to develop a performant and modular pipeline that recognizes named entities in retro-digitized texts and links them to so-called authority files (Normdaten) such as the GND. With this workflow, we will help to identify historical actors in source material and contribute to the in-depth FAIRification of large datasets through persistent identifiers on the text level. The pipeline is modular with respect to the embedding model. This enables performance comparisons across different embedding models and leaves room for future models that capture semantic similarities better than current popular open-source models such as BERT.
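One way to read «modular with respect to the embedding model» is that the linking step receives the embedding function as a parameter, so models can be swapped without touching the rest of the pipeline. The sketch below illustrates that design idea under stated assumptions: `make_hashing_embedder`, `link`, and the toy knowledge base are hypothetical names, and the hashing embedder is a deliberately crude stand-in for a model such as BERT.

```python
import math
import zlib
from typing import Callable, Dict, List, Tuple

# Any function mapping text to a fixed-length vector can serve as the model.
Embedder = Callable[[str], List[float]]


def make_hashing_embedder(dim: int) -> Embedder:
    """Build a toy embedder that hashes each token into one of `dim` buckets."""
    def embed(text: str) -> List[float]:
        vector = [0.0] * dim
        for token in text.lower().split():
            # crc32 is deterministic across runs, unlike the built-in hash().
            vector[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
        return vector
    return embed


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def link(context: str, knowledge_base: Dict[str, str],
         embed: Embedder) -> Tuple[str, float]:
    """Return the knowledge-base entry whose description best fits the context."""
    context_vector = embed(context)
    scored = [
        (entity_id, cosine(context_vector, embed(description)))
        for entity_id, description in knowledge_base.items()
    ]
    return max(scored, key=lambda pair: pair[1])


# Hypothetical knowledge-base entries; real ones would carry GND or HDS identifiers.
toy_kb = {
    "entity-1": "architect of railway stations",
    "entity-2": "physician and professor of medicine",
}

# Swap the embedding model without changing the linking code.
results = {
    dim: link("architect of railway stations", toy_kb, make_hashing_embedder(dim))
    for dim in (16, 256)
}
```

Because `link` depends only on the `Embedder` signature, comparing models reduces to running the same corpus through the same function with different embedders.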

Secondly, we plan to use this case study to reflect upon the interpretation of metrics provided by algorithmic models and their relevance in historical research methodology. We will focus on three key areas: Contextual Sensitivity, Ambiguity Resolution, and Computational Efficiency. Examining these areas gives insight into the models’ operational capabilities, particularly in large-scale historical text analysis. Given the challenges of retro-digitized historical data (OCR quality, heterogeneous contents in large collections, etc.), it is necessary not only to select models and methods suited to the specific needs of such material but also to create representative ground truth data for OCR, NER, and NEL. Furthermore, scale considerations drive our case study, as some of our use cases consist of millions of pages.
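As a concrete instance of the metrics in question, linking output is often compared with ground truth using precision, recall, and F1. The sketch below assumes a simple mention-to-entity mapping; the mention and entity identifiers are hypothetical, and the abstract does not specify which metrics the project reports.

```python
def evaluate_links(gold, predicted):
    """Compare predicted links with ground truth.

    Both arguments map a mention id to an entity id; a predicted value of
    None means the system declined to link that mention.
    """
    true_positives = sum(
        1 for mention, entity in predicted.items()
        if entity is not None and gold.get(mention) == entity
    )
    n_predicted = sum(1 for entity in predicted.values() if entity is not None)
    n_gold = sum(1 for entity in gold.values() if entity is not None)

    precision = true_positives / n_predicted if n_predicted else 0.0
    recall = true_positives / n_gold if n_gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical ground truth and system output for four mentions.
gold = {"m1": "E1", "m2": "E2", "m3": "E3", "m4": "E4"}
predicted = {"m1": "E1", "m2": "E9", "m3": "E3", "m4": None}

metrics = evaluate_links(gold, predicted)
print(metrics)
```

Here two of three predicted links are correct and two of four gold links are found. Whether a wrong link or a missing link is the more costly error is a question for the historical research context, which is part of what interpreting such metrics involves.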

Finally, we will discuss the role of GLAM (galleries, libraries, archives, and museums) institutions as drivers of change and facilitators, especially when it comes to the use of their collections as data (Padilla et al. 2023).

Pipeline of end-to-end Named Entity Linking.

Conclusion

Current NEL solutions are not yet accurate or scalable enough. At the same time, such enrichment will become a standard process for GLAM institutions, allowing them to offer enriched data layers to their users as a service. This raises several challenges: the technical challenge of improving the linking workflow itself, the challenge of documenting the workflow in a transparent and reproducible form, and finally, the methodological challenge of negotiating and interpreting the results at the intersection of GLAM institutions, data science, and historical research.

References

Bunout, Estelle, Maud Ehrmann, and Frédéric Clavert, eds. 2023. Reflections on Tools, Methods and Epistemology. Berlin, Boston: De Gruyter Oldenbourg. https://doi.org/10.1515/9783110729214.
Chen, Haotian, Andrej Zukov-Gregoric, Xi David Li, and Sahil Wadhwa. 2020. “Contextualized End-to-End Neural Entity Linking.” https://arxiv.org/abs/1911.03834.
Ehrmann, Maud, Ahmed Hamdi, Elvys Linhares Pontes, Matteo Romanello, and Antoine Doucet. 2023. “Named Entity Recognition and Classification in Historical Documents: A Survey.” ACM Comput. Surv. 56 (2). https://doi.org/10.1145/3604931.
Ganea, Octavian-Eugen, and Thomas Hofmann. 2017. “Deep Joint Entity Disambiguation with Local Neural Attention.” In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, edited by Martha Palmer, Rebecca Hwa, and Sebastian Riedel, 2619–29. Copenhagen, Denmark: Association for Computational Linguistics. https://doi.org/10.18653/v1/D17-1277.
Luthra, Mrinalini, Konstantin Todorov, Charles Jeurgens, and Giovanni Colavizza. 2022. “Unsilencing Colonial Archives via Automated Entity Recognition.” https://arxiv.org/abs/2210.02194.
Nozza, Debora, Cezar Sas, Elisabetta Fersini, and Enza Messina. 2019. “Word Embeddings for Unsupervised Named Entity Linking.” In Knowledge Science, Engineering and Management, edited by Christos Douligeris, Dimitris Karagiannis, and Dimitris Apostolou, 115–32. Cham: Springer International Publishing.
Padilla, Thomas, Hannah Scates Kettler, Stewart Varner, and Yasmeen Shorish. 2023. “Vancouver Statement on Collections as Data.” Zenodo. https://doi.org/10.5281/zenodo.8342171.
Vasilyev, Oleg, Alex Dauenhauer, Vedant Dharnidharka, and John Bohannon. 2022. “Named Entity Linking with Entity Representation by Multiple Embeddings.” https://arxiv.org/abs/2205.10498.
Yamada, Ikuya, Hiroyuki Shindo, Hideaki Takeda, and Yoshiyasu Takefuji. 2016. “Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation.” In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, edited by Stefan Riezler and Yoav Goldberg, 250–59. Berlin, Germany: Association for Computational Linguistics. https://doi.org/10.18653/v1/K16-1025.

Reuse

CC BY-SA 4.0

Citation

BibTeX citation:
@misc{chadha2024,
  author = {Chadha, Tarun and Rashiti, Gentiana and Sibille, Christiane
    and Ilnicka, Agnieszka},
  editor = {Baudry, Jérôme and Burkart, Lucas and Joyeux-Prunel,
    Béatrice and Kurmann, Eliane and Mähr, Moritz and Natale, Enrico and
    Sibille, Christiane and Twente, Moritz},
  title = {From Words to Numbers. {Methodological} Perspectives on Large
    Scale {Named} {Entity} {Linking}},
  date = {2024-09-13},
  url = {https://digihistch24.github.io/submissions/486/},
  doi = {10.5281/zenodo.13907910},
  langid = {en},
}
For attribution, please cite this work as:
Chadha, Tarun, Gentiana Rashiti, Christiane Sibille, and Agnieszka Ilnicka. 2024. “From Words to Numbers. Methodological Perspectives on Large Scale Named Entity Linking.” Edited by Jérôme Baudry, Lucas Burkart, Béatrice Joyeux-Prunel, Eliane Kurmann, Moritz Mähr, Enrico Natale, Christiane Sibille, and Moritz Twente. Digital History Switzerland 2024: Book of Abstracts. https://doi.org/10.5281/zenodo.13907910.