DigiHistCH24
  • Home
  • Book of Abstracts
  • Conference Program
  • Call for Contributions
  • About

Impresso 2: Connecting Historical Digitised Newspapers and Radio. A Challenge at the Crossroads of History, User Interfaces and Natural Language Processing.

  • Home
  • Book of Abstracts
    • Data-Driven Approaches to Studying the History of Museums on the Web: Challenges and Opportunities for New Discoveries
    • On a solid ground. Building software for a 120-year-old research project applying modern engineering practices
    • Tables are tricky. Testing Text Encoding Initiative (TEI) Guidelines for FAIR upcycling of digitised historical statistics.
    • Training engineering students through a digital humanities project: Techn’hom Time Machine
    • From manual work to artificial intelligence: developments in data literacy using the example of the Repertorium Academicum Germanicum (2001-2024)
    • A handful of pixels of blood
    • Impresso 2: Connecting Historical Digitised Newspapers and Radio. A Challenge at the Crossroads of History, User Interfaces and Natural Language Processing.
    • Learning to Read Digital? Constellations of Correspondence Project and Humanist Perspectives on the Aggregated 19th-century Finnish Letter Metadata
    • Teaching the use of Automated Text Recognition online. Ad fontes goes ATR
    • Geovistory, a LOD Research Infrastructure for Historical Sciences
    • Using GIS to Analyze the Development of Public Urban Green Spaces in Hamburg and Marseille (1945 - 1973)
    • Belpop, a history-computer project to study the population of a town during early industrialization
    • Contributing to a Paradigm Shift in Historical Research by Teaching Digital Methods to Master’s Students
    • Revealing the Structure of Land Ownership through the Automatic Vectorisation of Swiss Cadastral Plans
    • Rockefeller fellows as heralds of globalization: the circulation of elites, knowledge, and practices of modernization (1920–1970s): global history, database connection, and teaching experience
    • Theory and Practice of Historical Data Versioning
    • Towards Computational Historiographical Modeling
    • Efficacy of Chat GPT Correlations vs. Co-occurrence Networks in Deciphering Chinese History
    • Data Literacy and the Role of Libraries
    • 20 godparents and 3 wives – studying migrant glassworkers in post-medieval Estonia
    • From record cards to the dynamics of real estate transactions: Working with automatically extracted information from Basel’s historical land register, 1400-1700
    • When the Data Becomes Meta: Quality Control for Digitized Ancient Heritage Collections
    • On the Historiographic Authority of Machine Learning Systems
    • Films as sources and as means of communication for knowledge gained from historical research
    • Develop Yourself! Development according to the Rockefeller Foundation (1913 – 2013)
    • AI-assisted Search for Digitized Publication Archives
    • Digital Film Collection Literacy – Critical Research Interfaces for the “Encyclopaedia Cinematographica”
    • From Source-Criticism to System-Criticism, Born Digital Objects, Forensic Methods, and Digital Literacy for All
    • Connecting floras and herbaria before 1850 – challenges and lessons learned in digital history of biodiversity
    • A Digital History of Internationalization. Operationalizing Concepts and Exploring Millions of Patent Documents
    • From words to numbers. Methodological perspectives on large scale Named Entity Linking
    • Go Digital, They Said. It Will Be Fun, They Said. Teaching DH Methods for Historical Research
    • Unveiling Historical Depth: Semantic annotation of the Panorama of the Battle of Murten
    • When Literacy Goes Digital: Rethinking the Ethics and Politics of Digitisation
  • Conference Program
    • Schedule
    • Keynote
    • Practical Information
    • Event Digital History Network
    • Event SSH ORD
  • Call for Contributions
    • Key Dates
    • Evaluation Criteria
    • Submission Guidelines
  • About
    • Code of Conduct
    • Terms and Conditions

On this page

  • Introduction
  • Media Monitoring of the Past: The Impresso Projects
  • Challenges in Enriching and Integrating Historical Digitised Newspaper and Radio Archives
    • The Impresso Corpus: A Silo-Breaking, Transmedia and Transnational Corpus
    • Enriching and Connecting Historical Media Collections
    • Unlocking the Dynamics of Historical Media Ecosystems
    • Interfaces
  • References
  • Edit this page
  • Report an issue

Impresso 2: Connecting Historical Digitised Newspapers and Radio. A Challenge at the Crossroads of History, User Interfaces and Natural Language Processing.

Session 3A
Authors
Affiliations

Maud Ehrmann

Ecole Polytechnique Fédérale de Lausanne

Raphaëlle Ruppen Coutaz

University of Lausanne

Simon Clematide

University of Zurich

Marten Düring

University of Luxembourg

Published

September 12, 2024

Doi

10.5281/zenodo.13907298

Abstract

The Impresso project pioneers the exploration of historical media content across temporal, linguistic, and geographical boundaries. In its initial phase (2017-2020), the project developed a scalable infrastructure for Swiss and Luxembourgish newspapers, featuring a powerful search interface. The second phase, beginning in 2023, expands the focus to connect media archives across languages and modalities, creating a Western European corpus of newspaper and broadcast collections for transnational research on historical media. In this presentation, we introduce Impresso 2 and discuss some of the specific challenges to connecting newspaper and radio.

Keywords

digitised historical newspaper and radio, natural language processing, historical document processing, media history, digital history, historical scholarship

Introduction

The Impresso project pioneers the exploration of historical media content across temporal, linguistic, and geographical boundaries. In its initial phase (2017-2020), the project developed a scalable infrastructure for Swiss and Luxembourgish newspapers, featuring a powerful search interface. The second phase, beginning in 2023, expands the focus to connect media archives across languages and modalities, creating a Western European corpus of newspaper and broadcast collections for transnational research on historical media.

We introduce Impresso 2 and review the evolution from the first to the second project. We also discuss the specific challenges to connecting newspaper and radio, our efforts to adapt text mining and exploration tools to the affordances of historical material derived from different modalities, and our approach to conducting comparative and data-driven historical research using semantically enriched sources, accessible through both graphical and API-based interfaces.

Media Monitoring of the Past: The Impresso Projects

At the intersection of natural language processing, history and design, the Impresso projects focus on the processing, enrichment and exploration of large-scale historical media sources, with the aim of transforming the accessibility and usability of media archives for historical research.

The objectives of the first project (2017-2020) were to improve text mining tools for historical texts, to enrich historical newspapers with new information layers in the form of semantic annotations, and to integrate such data into historical research workflows by means of a newly developed user interface. Impresso 1 compiled and semantically enriched a corpus of digitised Swiss and Luxembourg newspapers, and designed a scalable infrastructure and user interface, that together form the Impresso app (Düring, Bunout, and Guido 2024; Ehrmann 2021; Ehrmann et al. 2020; Romanello et al. 2020). Beyond simple browsing and often unsatisfactory keyword searches, this interface exploits new opportunities offered by semantic enrichments such as word embeddings, named entities, topic modelling and text reuse for content search and discovery, as well as comparative and critical perspectives on the available data (Düring et al. 2021, 2023; Düring, Bunout, and Guido 2024). Its creation has been guided by the principles of co-design (where all team members actively contribute through continuous push-pull interaction), generosity (offering multiple complementary access points to the collection) and transparency (providing information on e.g. tools, document processing, and data quality). Impresso 1 aimed to harness machine reading to support historical scholarship, and its interface has helped popularise the use of text mining-based enrichment for the retrieval and exploration of newspaper articles - now a de facto standard.

Two key observations form the basis of the second Impresso project (2023-2027). First, although newspaper digitisation and online accessibility has outpaced that of broadcast archives over the past two decades, the digitisation of radio collections has recently caught up, making it easier for scholars to select them as sources and opening up new processing opportunities. Second, despite considerable progress in facilitating exploration of digitised sources, particularly for newspapers, existing digital portals still fall short of meeting the needs of historical research. This deficit arises for two main reasons. First, current frameworks for exploring digitised media archives remain fragmented, confined to digital archive silos with country-based institutional portals, and digital media silos, where enrichments and exploration capabilities are typically limited to one language, one modality and one media type. Even with enhanced search capabilities, the search horizon remains narrow. Second, portals offer only passive exploration of static collections, whereas historical research proceeds through iterative processes of association and comparison of multiple objects of study. Furthermore, as the digital transformation increasingly permeates all phases of historical research, historians are calling for new methods to critically analyse data, tools and interfaces.

Building on the achievements of Impresso 1, Impresso 2 envisions a comprehensive connection between media archives, aiming to enable the first-ever joint exploration of historical newspaper and radio content across temporal, linguistic, and national boundaries1. The aim is not merely to juxtapose collections and provide full-text search across them, but to enrich and connect these sources through multiple layers of semantic enrichment represented in a shared multilingual vector space, and to design appropriate, meaningful and transparent exploration capabilities for historical research from a transmedia and transnational perspective. These efforts are guided by original research on historical media ecosystems making use of this newly available data and of data-driven research methods.

To achieve this goal, Impresso 2 collaborates with 21 European partners, including national libraries, archives, newspapers, cultural heritage institutions, and international research networks.

Challenges in Enriching and Integrating Historical Digitised Newspaper and Radio Archives

As the mass media of their time, historical newspapers and radio broadcasts offer a daily account of the past and are key to the study of historical media ecosystems. Since the 19th century in print and the 1920s on air, these media have disseminated news, opinion, and entertainment, reported on events, and offered insights into the daily lives of past societies.

Figure 1

They have both shaped and been shaped by their social, cultural, and political environments. Until now, these sources have mostly been studied in isolation, resulting in a plethora of parallel national histories (Fickers 2018). Although there has been a ‘transnational turn’ toward broader geographical and temporal perspectives (Badenoch, Fickers, and Henrich-Franke 2013; Fickers and Johnson 2012; Cronqvist and Hilgert 2017), in media history, transnational and transmedia perspectives are rare, particularly when focusing on the distribution of content rather than institutional histories.

How can we accurately connect large collections of newspapers and radio, provide effective means for their exploration in historical research, and put content-based transmedia history into practice? Impresso 2 undertakes a multi-dimensional approach to integrating historical newspaper and broadcasts, focusing on four interconnected areas (see also Figure 1):

  1. Laying the foundation stone by compiling and making available an unprecedented corpus of Western European digitised print and broadcast media.
  2. Enriching and connecting historical sources by transforming noisy and heterogeneous sources into semantically enriched and structured data, ultimately connected in a common vector space.
  3. Conducting original historical research, which advances understanding of historical media ecosystems through various case studies, while also defining methods for data-driven analysis and identifying diverse user needs for data and interface design.
  4. Designing and implementing new interfaces by creating different entry points to the data and its enrichments.

In pursuing these objectives, Impresso 2 faces a set of unique challenges arising from the central issue of aligning and connecting newspaper and radio archives.

Figure 2

In many ways familiar sources, digitised radio and newspapers have however never been aligned and connected before. How to proceed? Their integration requires a careful examination of the nature of these media objects, their preservation by cultural heritage institutions, specific characteristics as archival records and legal boundaries which determine access to them. While many questions remain unanswered at this stage, we provide a brief overview of each project focus area, primarily focusing on the key issues raised by integrating radio and newspapers: how to document and contextualise the corpus, how to align radio and newspaper historical documents, how to integrate their content cross-lingually, how to analyse and compare, and how to explore and interact with the data2.

The Impresso Corpus: A Silo-Breaking, Transmedia and Transnational Corpus

The foundation of Impresso 2 is the creation of a large corpus of print and broadcast media collections across several countries. Building on the newspapers collected in the first Impresso project, the corpus expands to a geographically and historically coherent set of neighbouring countries, encompassing both newspaper and radio archives from each. To begin, let us review the core characteristics of the two media sources. Newspapers were disseminated and are preserved in print, while radio broadcasts, originally transmitted as audio signals, are preserved through audio recordings or typescripts (the text read by the speaker). Newspapers were typically published daily, with one issue per day, whereas radio broadcasts followed a highly variable rhythm, documented in radio schedules available in dedicated magazines, unpublished internal listings, and newspapers (see Figure 2). From a digitised archive perspective, newspaper materials consist of facsimiles and their transcriptions obtained via optical character recognition (OCR), whereas radio materials include facsimiles and OCR for typescripts and radio schedules, and audio recordings and transcripts generated through automatic speech recognition ASR for broadcasts. These materials encompass different modalities, such as text and image for newspapers, and text and sound for radio.

Data Acquisition and Sharing Framework

Acquiring such a large and diverse corpus is a lengthy and complex endeavour with several obstacles: collections include various data elements (metadata, facsimiles, audio records, transcripts) with differing copyright statuses and are held by institutions across multiple countries, each with its own jurisdiction and legal constraints based on data owners’ preferences and institutional policies. We aim to make these sources accessible to researchers for operations such viewing, searching, and exporting. To address these issues, we have developed a dual approach: operational, by implementing a rigorous organisation to conduct dialogue and facilitate data exchange with our partners; and legal, by designing a modular data sharing and access framework that respects copyright and institutional constraints while maximising research opportunities through differentiated access. We believe that this operational and legal basis will help to break down institutional barriers.

Rich Contextual Information for Historical Research

The practical acquisition of the corpus also provides an opportunity to deepen our understanding of the sources, which is essential for their use. Although radio and newspaper archival records come with standard metadata, this information is often heterogeneous and varies significantly in content, quantity and quality across collections and institutions. Additionally, there are other sources of contextual knowledge, including unspoken or unwritten information. Two ongoing initiatives aim to further document the production and preservation of these archives to provide rich contextual information for historical research, all the more important in this multi-source context.

The first seeks to leverage the information contained in newspaper directories, following the approach outlined in (Beelen et al. 2022). As a starting point for the Swiss context, we are focusing on extracting semi-structured information from the Swiss Press Bibliography published by Fritz Blaser in 1956 (Blaser 1956). This bibliography documents in great detail about 480 newspapers and around 1,000 periodicals published in Switzerland between 1803 and 1958. It offers rich insights into the origins and history of Swiss newspapers, which can be used to enhance the documentation of newspapers in the Impresso corpus and interface, as well as support the study of the newspaper ecosystem in Switzerland. A similar approach will be used to trace the development of radio programmes through radio schedules and magazines.

The second initiative focuses on radio, adopting an oral history perspective to gain a better understanding not only of radio archives, but also of each archive custodian. What important aspects about these archives might we be unaware of? The ’Oral History of Radio Archivists” (OHRA) addresses this by conducting semi-structured interviews of Impresso audio partners.These interviews explore topics such as archival preservation and documentation policies, digitisation priorities, and the evolution of data quality over time, with the aim of providing complementary narrative descriptions of radio archives.

Enriching and Connecting Historical Media Collections

A critical step towards Impresso’s vision is the application of text and image processing techniques to the corpus. We aim to enrich and connect media sources through multiple layers of semantic enrichment, ultimately represented and connected in a common vector space. This endeavour involves three main steps; we discuss here only the first and the last.

Source consolidation. Sources come in various digital formats and exhibit varying levels of refinement, both in terms of OCR and ASR quality and in the granularity of document layout analysis and article segmentation. We harmonise historical document data by establishing a consistent data representation and ensuring uniform characterisation of the material across all collections.

First, we aim to define an appropriate and consistent data representation framework for both radio and newspaper digital materials. By ‘representation’, we refer to how data or information is effectively encoded and structured in a machine-readable format. The world of digitised newspapers is well understood, with a clear and consistent structure: a title consists of issues, which include pages containing articles and images. Each issue’s content is organised into sections, a framework that intersects with journalistic genres, exhibits distinct characteristics in both layout and content, and evolves over time. In contrast, digitised radio broadcasts present a more complex and heterogeneous landscape, lacking a shared definition of what constitutes a ‘standard’ broadcasting unit. There are varying levels of content organisation and an uneven distribution of material over time. Drawing from concrete evidence from files on disk, we carefully inventory all source elements for both media types, refining the terminology for newspapers and radio components – an often challenging process. With this groundwork, we then explore how to align collections’ structure and compositional units of both media sources. This alignment design, developed in parallel with data acquisition and in collaboration with archive partners, also informs and influences the rendering of sources in the interface.

Second, we elevate the corpus to a consistent and higher-quality level, through the assessment and optimisation of OCR and ASR quality, along with the homogenisation of content item segmentation and classes. The objective in this regard is to create “bridges” between radio broadcast and newspaper content types, enabling the retrieval of specific categories–such as editorials, sports-related (sub)sections, or radio schedules–across a set of titles, languages and a specific period in all digitised material.

Cross-lingual connection of sources. After enriching historical sources with semantic information to facilitate content-related search facets and support the exploration and comparison of people, locations, keyphrases, topics, and semantic classes across time periods, languages, and media channels (the second step not discussed here), a crucial task is establishing meaningful connections at the content level across both media, both mono- and cross-lingually. This process shifts focus away from structure, concentrating instead on content and its enrichments, with the goal of computing effective vector representations. This is relatively easy to achieve on a monolingual basis, but becomes more challenging cross-lingually, where the goal is to compute meaningful similarities of multilingual representations across languages.

Unlocking the Dynamics of Historical Media Ecosystems

The focus of historical research activities within Impresso 2 is on examining influence—encompassing actors, themes, and formats—within a transnational media ecosystem. In our context, influence refers to the ability of individuals, groups, organisations or even texts (e.g. books, articles) to shape, direct, or alter narratives, imagery, content, opinions, behaviours, or outcomes related to a particular subject, issue, or topic represented by and through the media. Several case studies, which are both conceptually and methodologically complementary, investigate influences from various perspectives: external to the media, within the media ecosystem, within individual institutions, and across different content formats. This transnational and transmedia research aims to revisit and compare media autonomy, examining interactions between different media forms, such as radio and newspapers, beyond their competitive aspects. A central challenge is extend the historian’s traditional skill of making meaningful comparisons from incomplete and diverse sources to a large-scale, multilingual and transmedia corpus interconnected and enriched with semantic annotations. To this end, we are developing a comparative framework that leverages our data, tools, and interfaces.

Interfaces

To facilitate research on our data, we are developing two complementary research-oriented user interfaces: the Impresso web app, a powerful graphical user interface, and the Impresso data lab, built around a user-oriented API. While the search interface offers a more traditional entry point to explore sources, enable close reading and the compilation of research datasets, the data lab is designed for automating access and enabling computational analysis. Our efforts focus on providing an easy programmatic access to the corpus and its enrichments through a public API and the Impresso Python library, allowing users to annotate external documents with Impresso tools and import external annotations into the web app (annotation and import services), and ensuring transparency with comprehensive documentation, including datasets, models notebook galleries, and tutorials. To further inform the design of the Impresso Datalab, we are surveying current approaches and realisation around data labs for computational humanities research (Beelen, During, and Guido 2024).

References

Badenoch, Alexander, Andreas Fickers, and Christian Henrich-Franke, eds. 2013. Airy Curtains in the European Ether. Broadcasting and the Cold War. Baden-Baden: Nomos.
Beelen, Kaspar, Marten During, and Daniele Guido. 2024. “Surveying Cultural Heritage Data Labs.” Reykjavík, Iceland. https://www.conftool.org/dhnb2024/index.php?page=browseSessions&presentations=show&search=during.
Beelen, Kaspar, Jon Lawrence, Daniel C S Wilson, and David Beavan. 2022. “Bias and Representativeness in Digitized Newspaper Collections: Introducing the Environmental Scan.” Digital Scholarship in the Humanities, July, fqac037. https://doi.org/10.1093/llc/fqac037.
Blaser, Fritz. 1956. “Bibliographie der Schweizer Presse : mit Einschluss des Fürstentums Liechtenstein = Bibliographie de la presse suisse = Bibliografia della stampa svizzera / bearb. von Fritz Blaser.” In Bibliographie der Schweizer Presse mit Einschluss des Fürstentums Liechtenstein = Bibliographie de la presse suisse = Bibliografia della stampa svizzera. Bern: [Schweizerische Nationalbibliothek].
Cronqvist, Marie, and Christoph Hilgert. 2017. “Entangled Media Histories.” Media History 23 (1): 130–41. https://doi.org/10.1080/13688804.2016.1270745.
Düring, Marten, Estelle Bunout, and Daniele Guido. 2024. “Transparent Generosity. Introducing the Impresso Interface for the Exploration of Semantically Enriched Historical Newspapers.” Historical Methods: A Journal of Quantitative and Interdisciplinary History, June, 35–55. https://www.tandfonline.com/doi/full/10.1080/01615440.2024.2344004.
Düring, Marten, Roman Kalyakin, Estelle Bunout, and Daniele Guido. 2021. “Impresso Inspect and Compare. Visual Comparison of Semantically Enriched Historical Newspaper Articles.” Information 12 (9): 348. https://www.mdpi.com/2078-2489/12/9/348.
Düring, Marten, Matteo Romanello, Maud Ehrmann, Kaspar Beelen, Daniele Guido, Brecht Deseure, Estelle Bunout, Jana Keck, and Petros Apostolopoulos. 2023. “Impresso Text Reuse at Scale. An Interface for the Exploration of Text Reuse Data in Semantically Enriched Historical Newspapers.” Frontiers in Big Data 6. https://www.frontiersin.org/articles/10.3389/fdata.2023.1249469.
Ehrmann, Maud. 2021. “Explorer La Presse Numérisée: Le Projet Impresso.” Revue Historique Vaudoise, 159–73.
Ehrmann, Maud, Matteo Romanello, Simon Clematide, Phillip Benjamin Ströbel, and Raphaël Barman. 2020. “Language Resources for Historical Newspapers: The Impresso Collection.” In Proceedings of the Twelfth Language Resources and Evaluation Conference, edited by Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, et al., 958–68. Marseille, France: European Language Resources Association. https://aclanthology.org/2020.lrec-1.121.
Fickers, Andreas. 2018. “<<Hybrid Histories>>: Versuch Einer Kritischen Standortbestimmung Der Mediengeschichte.” Annali Dell’Istituto Storico Italo-Germanico in Trento/Jahrbuch Des Italienisch-Deutschen Historischen Instituts in Trient 1 (44): 117–31.
Fickers, Andreas, and Catherine Johnson. 2012. Transnational Television History: A Comparative Approach. Transnational Television History a Comparative Approach. London: Routledge. https://doi.org/10.4324/9780203723579.
Romanello, Matteo, Maud Ehrmann, Simon Clematide, and Daniele Guido. 2020. “The Impresso System Architecture in a Nutshell.” EuropeanaTech Insights. EuropeanaTech Insights. https://pro.europeana.eu/page/issue-16-newspapers#the-impresso-system-architecture-in-a-nutshell.
Back to top

Footnotes

  1. Impresso is the result of the collaboration of an interdisciplinary team composed of computational linguists, designers and historians. In addition to the presenters, the team includes: Marten Düring (co-PI, UniLu), Simon Clematide (co-PI, UZH), Ferdaous Affan (UniLu), Emanuela Boros (EPFL), Kaspar Beelen (London University), Estelle Bunout (UniLu), Pauline Conti (EPFL), Daniele Guido (Luxembourg University), Andrianos Michail (UZH), Arthur Michelet (UNIL), Juri Opitz (UZH). More on the team here.↩︎

  2. It should be noted that the project has only recently started and all areas of work are in progress.↩︎

Reuse

CC BY-SA 4.0

Citation

BibTeX citation:
@misc{ehrmann2024,
  author = {Ehrmann, Maud and Ruppen Coutaz, Raphaëlle and Clematide,
    Simon and Düring, Marten},
  editor = {Baudry, Jérôme and Burkart, Lucas and Joyeux-Prunel,
    Béatrice and Kurmann, Eliane and Mähr, Moritz and Natale, Enrico and
    Sibille, Christiane and Twente, Moritz},
  title = {Impresso 2: {Connecting} {Historical} {Digitised}
    {Newspapers} and {Radio.} {A} {Challenge} at the {Crossroads} of
    {History,} {User} {Interfaces} and {Natural} {Language}
    {Processing.}},
  date = {2024-09-12},
  url = {https://digihistch24.github.io/submissions/443/},
  doi = {10.5281/zenodo.13907298},
  langid = {en},
  abstract = {The Impresso project pioneers the exploration of
    historical media content across temporal, linguistic, and
    geographical boundaries. In its initial phase (2017-2020), the
    project developed a scalable infrastructure for Swiss and
    Luxembourgish newspapers, featuring a powerful search interface. The
    second phase, beginning in 2023, expands the focus to connect media
    archives across languages and modalities, creating a Western
    European corpus of newspaper and broadcast collections for
    transnational research on historical media. In this presentation, we
    introduce Impresso 2 and discuss some of the specific challenges to
    connecting newspaper and radio.}
}
For attribution, please cite this work as:
Ehrmann, Maud, Raphaëlle Ruppen Coutaz, Simon Clematide, and Marten Düring. 2024. “Impresso 2: Connecting Historical Digitised Newspapers and Radio. A Challenge at the Crossroads of History, User Interfaces and Natural Language Processing.” Edited by Jérôme Baudry, Lucas Burkart, Béatrice Joyeux-Prunel, Eliane Kurmann, Moritz Mähr, Enrico Natale, Christiane Sibille, and Moritz Twente. Digital History Switzerland 2024: Book of Abstracts. https://doi.org/10.5281/zenodo.13907298.
A handful of pixels of blood
Learning to Read Digital? Constellations of Correspondence Project and Humanist Perspectives on the Aggregated 19th-century Finnish Letter Metadata
Source Code
---
submission_id: 443
categories: 'Session 3A'
title: "Impresso 2: Connecting Historical Digitised Newspapers and Radio. A Challenge at the Crossroads of History, User Interfaces and Natural Language Processing."
author:
  - name: Maud Ehrmann
    orcid: 0000-0001-9900-2193
    email: maud.ehrmann@epfl.ch
    affiliations:
      - Ecole Polytechnique Fédérale de Lausanne
  - name: Raphaëlle Ruppen Coutaz
    orcid: 0000-0002-3503-4978 
    email: raphaelle.ruppencoutaz@unil.ch
    affiliations:
      - University of Lausanne 
  - name: Simon Clematide
    orcid: 0000-0003-1365-0662
    email: simon.clematide@cl.uzh.ch
    affiliations:
      - University of Zurich 
  - name: Marten Düring
    orcid: 0000-0001-7411-771X
    email: marten.during@uni.lu
    affiliations:
      - University of Luxembourg 
keywords:
  - digitised historical newspaper and radio 
  - natural language processing
  - historical document processing
  - media history
  - digital history
  - historical scholarship 
abstract: |
 The Impresso project pioneers the exploration of historical media content across temporal, linguistic, and geographical boundaries. In its initial phase (2017-2020), the project developed a scalable infrastructure for Swiss and Luxembourgish newspapers, featuring a powerful search interface. The second phase, beginning in 2023, expands the focus to connect media archives across languages and modalities, creating a Western European corpus of newspaper and broadcast collections for transnational research on historical media. In this presentation, we introduce Impresso 2 and discuss some of the specific challenges to connecting newspaper and radio. 
date: 09-12-2024
doi: 10.5281/zenodo.13907298
bibliography: references.bib
---

## Introduction

The Impresso project pioneers the exploration of historical media content across temporal, linguistic, and geographical boundaries. In its initial phase (2017-2020), the project developed a scalable infrastructure for Swiss and Luxembourgish newspapers, featuring a powerful search interface. The second phase, beginning in 2023, expands the focus to connect media archives across languages and modalities, creating a Western European corpus of newspaper and broadcast collections for transnational research on historical media.

We introduce Impresso 2 and review the evolution from the first to the second project. We also discuss the specific challenges to connecting newspaper and radio, our efforts to adapt text mining and exploration tools to the affordances of historical material derived from different modalities, and our approach to conducting comparative and data-driven historical research using semantically enriched sources, accessible through both graphical and API-based interfaces.

## Media Monitoring of the Past: The Impresso Projects

At the intersection of natural language processing, history and design, the [Impresso projects](https://impresso-project.ch) focus on the processing, enrichment and exploration of large-scale historical media sources, with the aim of transforming the accessibility and usability of media archives for historical research.

The objectives of the first project (2017-2020) were to improve text mining tools for historical texts, to enrich historical newspapers with new information layers in the form of semantic annotations, and to integrate such data into historical research workflows by means of a newly developed user interface. Impresso 1 compiled and semantically enriched a corpus of digitised Swiss and Luxembourg newspapers, and designed a scalable infrastructure and user interface, that together form the [Impresso app](https://impresso-project.ch/app/) [@during_transparent_2024;@ehrmann_explorer_2021;@ehrmann_language_2020;@romanello_impresso_2020]. Beyond simple browsing and often unsatisfactory keyword searches, this interface exploits new opportunities offered by semantic enrichments such as word embeddings, named entities, topic modelling and text reuse for content search and discovery, as well as comparative and critical perspectives on the available data [@during_impresso_2021;@during_impresso_2023;@during_transparent_2024]. Its creation has been guided by the principles of co-design (where all team members actively contribute through continuous push-pull interaction), generosity (offering multiple complementary access points to the collection) and transparency (providing information on e.g. tools, document processing, and data quality). Impresso 1 aimed to harness machine reading to support historical scholarship, and its interface has helped popularise the use of text mining-based enrichment for the retrieval and exploration of newspaper articles - now a de facto standard.

Two key observations form the basis of the second Impresso project (2023-2027). First, although newspaper digitisation and online accessibility has outpaced that of broadcast archives over the past two decades, the digitisation of radio collections has recently caught up, making it easier for scholars to select them as sources and opening up new processing opportunities. Second, despite considerable progress in facilitating exploration of digitised sources, particularly for newspapers, existing digital portals still fall short of meeting the needs of historical research. This deficit arises for two main reasons. First, current frameworks for exploring digitised media archives remain fragmented, confined to digital archive silos with country-based institutional portals, and digital media silos, where enrichments and exploration capabilities are typically limited to one language, one modality and one media type. Even with enhanced search capabilities, the search horizon remains narrow. Second, portals offer only passive exploration of static collections, whereas historical research proceeds through iterative processes of association and comparison of multiple objects of study. Furthermore, as the digital transformation increasingly permeates all phases of historical research, historians are calling for new methods to critically analyse data, tools and interfaces.

Building on the achievements of Impresso 1, Impresso 2 envisions a comprehensive connection between media archives, aiming to enable the first-ever joint exploration of historical newspaper and radio content across temporal, linguistic, and national boundaries[^1]. The aim is not merely to juxtapose collections and provide full-text search across them, but to enrich and connect these sources through multiple layers of semantic enrichment represented in a shared multilingual vector space, and to design appropriate, meaningful and transparent exploration capabilities for historical research from a transmedia and transnational perspective. These efforts are guided by original research on historical media ecosystems making use of this newly available data and of data-driven research methods.

[^1]: Impresso is the result of the collaboration of an interdisciplinary team composed of computational linguists, designers and historians. In addition to the presenters, the team includes: Marten Düring (co-PI, UniLu), Simon Clematide (co-PI, UZH), Ferdaous Affan (UniLu), Emanuela Boros (EPFL), Kaspar Beelen (London University), Estelle Bunout (UniLu), Pauline Conti (EPFL), Daniele Guido (Luxembourg University), Andrianos Michail (UZH), Arthur Michelet (UNIL), Juri Opitz (UZH). More on the team [here](https://impresso-project.ch/consortium/team/).

To achieve this goal, Impresso 2 collaborates with [21 European partners](https://impresso-project.ch/consortium/associated-partners/), including national libraries, archives, newspapers, cultural heritage institutions, and international research networks.

## Challenges in Enriching and Integrating Historical Digitised Newspaper and Radio Archives

As the mass media of their time, historical newspapers and radio broadcasts offer a daily account of the past and are key to the study of historical media ecosystems. Since the 19th century in print and the 1920s on air, these media have disseminated news, opinion, and entertainment, reported on events, and offered insights into the daily lives of past societies.

::: {#fig-1}

![](images/Figure1.png)

:::

They have both shaped and been shaped by their social, cultural, and political environments.  Until now, these sources have mostly been studied in isolation, resulting in a plethora of parallel national histories [@fickers_hybrid_2018]. Although there has been a 'transnational turn' toward broader geographical and temporal perspectives [@badenoch_airy_2013;@fickers_transnational_2012;@cronqvist_entangled_2017], in media history, transnational and transmedia perspectives  are rare, particularly when focusing on the distribution of content rather than institutional histories.

How can we accurately connect large collections of newspapers and radio, provide effective means for their exploration in historical research, and put content-based transmedia history into practice? Impresso 2 undertakes a multi-dimensional approach to integrating historical newspaper and broadcasts, focusing on four interconnected areas (see also @fig-1):

1. **Laying the foundation stone** by compiling and making available an unprecedented corpus of Western European digitised print and broadcast media.
2. **Enriching and connecting historical sources** by transforming noisy and heterogeneous sources into semantically enriched and structured data, ultimately connected in a common vector space.
3. **Conducting original historical research**, which advances understanding of historical media ecosystems through various case studies, while also defining methods for data-driven analysis and identifying diverse user needs for data and interface design.
4. **Designing and implementing new interfaces** by creating different entry points to the data and its enrichments.

In pursuing these objectives, Impresso 2 faces a set of unique challenges arising from the central issue of aligning and connecting newspaper and radio archives.

::: {#fig-2}

![](images/Figure2.png)

:::

In many ways familiar sources, digitised radio and newspapers have however never been aligned and connected before. How to proceed? Their integration requires a careful examination of the nature of these media objects, their preservation by cultural heritage institutions, specific characteristics as archival records and legal boundaries which determine access to them. While many questions remain unanswered at this stage, we provide a brief overview of each project focus area, primarily focusing on the key issues raised by integrating radio and newspapers: how to document and contextualise the corpus, how to align radio and newspaper historical documents, how to integrate their content cross-lingually, how to analyse and compare, and how to explore and interact with the data[^2].

[^2]: It should be noted that the project has only recently started and all areas of work are in progress.

### The Impresso Corpus: A Silo-Breaking, Transmedia and Transnational Corpus

The foundation of Impresso 2 is the creation of a large corpus of print and broadcast media collections across several countries. Building on the newspapers collected in the first Impresso project, the corpus expands to a geographically and historically coherent set of neighbouring countries, encompassing both newspaper and radio archives from each.
To begin, let us review the core characteristics of the two media sources. Newspapers were disseminated and are preserved in print, while radio broadcasts, originally transmitted as audio signals, are preserved through audio recordings or typescripts (the text read by the speaker). Newspapers were typically published daily, with one issue per day, whereas radio broadcasts followed a highly variable rhythm, documented in radio schedules available in dedicated magazines, unpublished internal listings, and newspapers (see @fig-2). From a digitised archive perspective, newspaper materials consist of facsimiles and their transcriptions obtained via optical character recognition (OCR), whereas radio materials include facsimiles and OCR for typescripts and radio schedules, and audio recordings and transcripts generated through automatic speech recognition ASR for broadcasts. These materials encompass different modalities, such as text and image for newspapers, and text and sound for radio.

**Data Acquisition and Sharing Framework**

Acquiring such a large and diverse corpus is a lengthy and complex endeavour with several obstacles: collections include various data elements (metadata, facsimiles, audio records, transcripts) with differing copyright statuses and are held by institutions across multiple countries, each with its own jurisdiction and legal constraints based on data owners’ preferences and institutional policies. We aim to make these sources accessible to researchers for operations such viewing, searching, and exporting. To address these issues, we have developed a dual approach: operational, by implementing a rigorous organisation to conduct dialogue and facilitate data exchange with our partners; and legal, by designing a modular data sharing and access framework that respects copyright and institutional constraints while maximising research opportunities through differentiated access. We believe that this operational and legal basis will help to break down institutional barriers.

**Rich Contextual Information for Historical Research**

The practical acquisition of the corpus also provides an opportunity to deepen our understanding of the sources, which is essential for their use. Although radio and newspaper archival records come with standard metadata, this information is often heterogeneous and varies significantly in content, quantity and quality across collections and institutions. Additionally, there are other sources of contextual knowledge, including unspoken or unwritten information. Two ongoing initiatives aim to further document the production and preservation of these archives to provide rich contextual information for historical research, all the more important in this multi-source context.

The first seeks to leverage the information contained in newspaper directories, following the approach outlined in [@beelen_bias_2022]. As a starting point for the Swiss context, we are focusing on extracting semi-structured information from the Swiss Press Bibliography published by Fritz Blaser in 1956 [@alma991017981185503976]. This bibliography documents in great detail about 480 newspapers and around 1,000 periodicals published in Switzerland between 1803 and 1958. It offers rich insights into the origins and history of Swiss newspapers, which can be used to enhance the documentation of newspapers in the Impresso corpus and interface, as well as support the study of the newspaper ecosystem in Switzerland. A similar approach will be used to trace the development of radio programmes through radio schedules and magazines.

The second initiative focuses on radio, adopting an oral history perspective to gain a better understanding not only of radio archives, but also of each archive custodian. What important aspects about these archives might we be unaware of? The ‘Oral History of Radio Archivists” (OHRA) addresses this by conducting semi-structured interviews of Impresso audio partners.These interviews explore topics such as archival preservation and documentation policies, digitisation priorities, and the evolution of data quality over time, with the aim of providing complementary narrative descriptions of radio archives.

### Enriching and Connecting Historical Media Collections

A critical step towards Impresso’s vision is the application of text and image processing techniques to the corpus. We aim to enrich and connect media sources through multiple layers of semantic enrichment, ultimately represented and connected in a common vector space. This endeavour involves three main steps; we discuss here only the first and the last.

**Source consolidation.** Sources come in various digital formats and exhibit varying levels of refinement, both in terms of OCR and ASR quality and in the granularity of document layout analysis and article segmentation. We harmonise historical document data by establishing a consistent data representation and ensuring uniform characterisation of the material across all collections.

First, we aim to define an appropriate and consistent data representation framework for both radio and newspaper digital materials. By ‘representation’, we refer to how data or information is effectively encoded and structured in a machine-readable format. The world of digitised newspapers is well understood, with a clear and consistent structure: a title consists of issues, which include pages containing articles and images. Each issue's content is organised into sections, a framework that intersects with journalistic genres, exhibits distinct characteristics in both layout and content, and evolves over time. In contrast, digitised radio broadcasts present a more complex and heterogeneous landscape, lacking a shared definition of what constitutes a ‘standard’ broadcasting unit. There are varying levels of content organisation and an uneven distribution of material over time. Drawing from concrete evidence from files on disk, we carefully inventory all source elements for both media types, refining the terminology for newspapers and radio components – an often challenging process. With this groundwork, we then explore how to align collections’ structure and compositional units of both media sources. This alignment design, developed in parallel with data acquisition and in collaboration with archive partners, also informs and influences the rendering of sources in the interface.

Second, we elevate the corpus to a consistent and higher-quality level, through the assessment and optimisation of OCR and ASR quality, along with the homogenisation of content item segmentation and classes. The objective in this regard is to create “bridges” between radio broadcast and newspaper content types, enabling the retrieval of specific categories–such as editorials, sports-related (sub)sections, or radio schedules–across a set of titles, languages and a specific period in all digitised material.

**Cross-lingual connection of sources.** After enriching historical sources with semantic information to facilitate content-related search facets and support the exploration and comparison of people, locations, keyphrases, topics, and semantic classes across time periods, languages, and media channels (the second step not discussed here), a crucial task is establishing meaningful connections at the content level across both media, both mono- and cross-lingually. This process shifts focus away from structure, concentrating instead on content and its enrichments, with the goal of computing effective vector representations. This is relatively easy to achieve on a monolingual basis, but becomes more challenging cross-lingually, where the goal is to compute meaningful similarities of multilingual representations across languages.

### Unlocking the Dynamics of Historical Media Ecosystems

The focus of historical research activities within Impresso 2 is on examining influence—encompassing actors, themes, and formats—within a transnational media ecosystem. In our context, influence refers to the ability of individuals, groups, organisations or even texts (e.g. books, articles) to shape, direct, or alter narratives, imagery, content, opinions, behaviours, or outcomes related to a particular subject, issue, or topic represented by and through the media. Several case studies, which are both conceptually and methodologically complementary, investigate influences from various perspectives: external to the media, within the media ecosystem, within individual institutions, and across different content formats. This transnational and transmedia research aims to revisit and compare media autonomy, examining interactions between different media forms, such as radio and newspapers, beyond their competitive aspects. A central challenge is extend the historian’s traditional skill of making meaningful comparisons from incomplete and diverse sources to a large-scale, multilingual and transmedia corpus interconnected and enriched with semantic annotations. To this end, we are developing a comparative framework that leverages our data, tools, and interfaces.

### Interfaces

To facilitate research on our data, we are developing two complementary research-oriented user interfaces: the Impresso web app, a powerful graphical user interface, and the Impresso data lab, built around a user-oriented API. While the search interface offers a more traditional entry point to explore sources, enable close reading and the compilation of research datasets, the data lab is designed for automating access and enabling computational analysis. Our efforts focus on providing an easy programmatic access to the corpus and its enrichments through a public API and the Impresso Python library, allowing users to annotate external documents with Impresso tools and import external annotations into the web app (annotation and import services), and ensuring transparency with comprehensive documentation, including datasets, models notebook galleries, and tutorials. To further inform the design of the Impresso Datalab, we are surveying current approaches and realisation around data labs for computational humanities research [@beelen_surveying_2024].

## References

::: {#refs}
:::
  • Edit this page
  • Report an issue