Tables are tricky. Testing Text Encoding Initiative (TEI) Guidelines for FAIR upcycling of digitised historical statistics.

Session 1B

Author: Gabi Wuethrich
Affiliation: University of Zurich, University Library
Published: September 12, 2024
Modified: February 11, 2025
DOI: 10.5281/zenodo.13903990

Abstract

This project on digital data management explores the use of XML structures, specifically the Text Encoding Initiative (TEI), to digitize historical statistical tables from Zurich’s 1918 pandemic data. The goal was to make these health statistics tables reusable, interoperable, and machine-readable. Following the retro-digitization of statistical publications by Zurich’s Central Library, the content was semi-automatically captured with OCR in Excel and converted to XML using TEI guidelines.

However, OCR software struggled to accurately capture table content, requiring manual data entry, which introduced potential errors. Ideally, OCR tools would allow for direct XML export from PDFs. The implementation of TEI for tables remains a challenge, as TEI is primarily focused on running text rather than tabular data, as noted by TEI pioneer Lou Burnard.

Despite these challenges, TEI data processing offers opportunities for conceptualizing tabular data structures and ensuring traceability of changes, especially in serial statistics. An example is a project using early-modern Basle account books, which were “upcycled” following TEI principles. Additionally, TEI’s structured approach could help improve the accuracy of table text recognition in future projects.

Keywords

Text recognition, Table structure, TEI

For this paper, slides are available on Zenodo (PDF).

Introduction

In 2121, nothing is as it once was: a nasty virus is keeping the world on tenterhooks – and people trapped in their own four walls. In the depths of the metaverse, contemporaries are searching for data to compare the frightening death toll of the current killer virus with its predecessors during the Covid-19 pandemic and the «Spanish flu». There is an incredible amount of statistical material on the Covid-19 pandemic in particular, but annoyingly, it is only available in obscure data formats such as .xlsx in the internet archives. The files can still be opened with the usual text editors, but their structure is terribly confusing and unreadable with the latest statistical tools. If only those digital hillbillies in the 2020s had used a structured format that not only long-outdated machines but also people in the year 2121 could read…

Admittedly, very few epidemiologists, statisticians and federal officials are likely to have considered such future scenarios during the pandemic years. Yet the quantitative social sciences and humanities, including medical and economic history, but also memory institutions such as archives and libraries, should consciously consider how they can sustainably preserve the flood of digital data for future generations. At the same time, the sustainable processing and storage of printed statistical data from the time of the First World War makes it possible to gain new insights into the so-called “Spanish flu”, e.g. in the city of Zurich, even today. The publications by the Statistical Office of the City of Zurich, which were previously only available in “analog” paper format, have been digitized by the Zentralbibliothek (Central Library, ZB) Zurich as part of Joël Floris’ Willy Bretscher Fellowship 2022/2023 (Floris 2023). This project paper was written in the context of that digitisation project, since issues of the digital recording, processing, and storage of historical statistics have long occupied quantitative economic historians “for professional reasons”.

The basic idea of this paper is to prepare tables with historical health statistics in a sustainable way so that they can easily be analysed by digital means. The aim was to capture the statistical publications retro-digitized by the ZB semi-automatically with OCR in Excel tables and to prepare them as XML documents according to the guidelines of the Text Encoding Initiative (TEI), a standardized vocabulary for text structures. To do this, it was first necessary to become familiar with TEI and its relevant modules, and to apply them to a sample table in Excel. To be able to validate the Excel table that I had manually transferred to XML, I then developed a schema based on the vocabularies of XML and TEI. This schema could then serve as the basis for an automated conversion of the Excel tables into TEI-compliant XML documents. Such clearly structured XML documents should ultimately be relatively easy to convert into formats that can be read into a wide variety of visualisation and statistical tools.
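To make the target more concrete, the following is a rough sketch of what such a TEI-XML document might look like. It is not the project’s actual file: the header content is abridged and purely illustrative, and the table body is left empty here.

<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Die Gestorbenen (in der Wohnbev.) nach Todesursachen und Alter, Januar 1914</title>
      </titleStmt>
      <publicationStmt>
        <p>Unpublished working file from the digital upcycling of a printed statistical table.</p>
      </publicationStmt>
      <sourceDesc>
        <p>Monats-Berichte des Statistischen Amtes der Stadt Zürich, Januar 1914, Tabelle 12, S. 7.</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <table>
        <!-- transcribed rows and cells of the table go here -->
      </table>
    </body>
  </text>
</TEI>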

Data description

A table from the monthly reports of the Zurich Statistical Office serves as an example data set. The monthly reports were digitised as high-resolution PDFs with underlying Optical Character Recognition (OCR) based on Tesseract by the Central Library’s Digitisation Centre (DigiZ) as part of the Willy Bretscher Fellowship project. They are available on the ZB’s Zurich Open Platform (ZOP; Statistisches Amt der Stadt Zürich 1919), including detailed metadata. The reports were published by the Statistical Office of the City of Zurich as a journal volume under this title between 1908 and 1919, and then as «Quarterly Reports» until 1923. Each monthly report consists of a 27-page table section with individual footnotes and concludes with a two-page explanatory section in continuous text.

For this study, the data selection is limited to a table for January 1914 (Statistisches Amt der Stadt Zürich 1919). In connection with Joël Floris’ project, which aims at obtaining quantitative information on Zurich’s demographic development during the «Spanish flu» from the retro-digitisation project, the obvious choice was to focus on tables of causes of death. The corresponding table number 12, entitled «Die Gestorbenen (in der Wohnbev.) nach Todesursachen und Alter» («The Deceased (in the Resident Pop.) by Cause of Death and Age»), can be found on page seven of the monthly report. It contains monthly data on causes of death, broken down by age group and gender, as well as comparative figures for the same month of the previous year. The content of this table is prepared below in the form of a standardized XML document with an associated schema that complies with the TEI guidelines.

Methods for capturing historical tables in XML

The source of inspiration for this project paper was a pioneering research project originally based at the University of Basle, in which the annual accounts of the city of Basle from 1535 to 1610 were digitally edited (Calvi 2015). Technical implementation was carried out by the Center for Information Modeling at the University of Graz. Based on a digital text edition prepared in accordance with the TEI standard, the project manages to combine facsimile, the web edition in HTML, and table editing via RDF (Resource Description Framework) and XSLT (eXtensible Stylesheet Language Transformations) in an exemplary manner. The edition thus allows users to compile their own selection of booking data in a “data basket” for subsequent machine-readable analysis. In an accompanying article, project team member Georg Vogeler describes the first-time implementation of a numerical evaluation and how “even extensive holdings can be efficiently edited digitally” (Vogeler 2015). As mentioned, however, the central basis for this is the XML processing of the corresponding tabular information based on the TEI standard.

This project is based on the April 2022 version (4.4.0) of the TEI guidelines (Burnard 2022). They include a short chapter on the preparation of tables, formulae, graphics, and notated music. Even the introduction to that chapter (Chapter 14) is rather cautious about applying TEI to table formats, warning that layout and presentation details are more important in tables than in running text, that they are already covered more comprehensively by other standards, and that such material should accordingly be prepared in those notations. When I asked on the TEI-L mailing list whether it made sense to prepare historical tables with the TEI table module, the answers were rather reserved (https://listserv.brown.edu/cgi-bin/wa?A1=ind2206&L=TEI-L#24). Only the Graz team remained optimistic that TEI could be used to process historical tables, albeit in combination with RDF and a corresponding ontology. Christopher Pollin also provided, via the TEI list, GitHub links to the DEPCHA project, which is developing an ontology for annotating transactions in historical account books.

Table structure in TEI-XML

Basically, the TEI schema treats a table as a special text element consisting of row elements, which in turn contain cell elements. This basic structure was used to encode Table 12 from 1914, which I transcribed manually as an Excel file. Because exact formatting, including precise reproduction of the frame lines, is very time-consuming, the frame lines in this project only served as structural information and are not encoded as the topographical line elements TEI provides for them. Long dashes, which correspond to zero values in the source, are interpreted as empty values in the TEI-XML. I used the resulting worksheet as the basis for the TEI-XML annotation, to which I also added some metadata. I then had to create an adapted local schema as well as a TEI header before structuring the table’s text body. Suitable heading (“head”) elements are the title of the table, the table number as a note, and the «date» of the table. The first table row contains the column headings and is accordingly assigned the role attribute “label”. The third-to-last cell of each row contains the row total, which I have given the attribute “ana” (for analysis) with the value “#sum” (for total), following the example of the Basle edition.

The first cell of each row again names the cause of death and must therefore also be labelled with the role attribute “label”. The second-to-last row shows the totals of the current monthly table, which is why the “#sum” attribute is given to all of its cells. Finally, the last row shows the totals for the same month of the previous year. It is therefore not only marked with the sum attribute but also with a date in its label cell. A potential confounding factor for later calculations is the row “including diarrhea”, which further specifies diseases of the digestive organs but must not be included in the column total. Accordingly, it is provided with another analytical attribute called “#exsum”. As each cell in the code represents a separate element, the digitally upcycled table 12 in XML format ultimately extends over a good 550 lines of code, which I am happy to share on request.
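In heavily abridged form, the resulting encoding pattern looks roughly as follows. Apart from the table title, the table number and the two dates, every label and every figure shown here is an invented placeholder rather than a value from the source:

<table>
  <head>Die Gestorbenen (in der Wohnbev.) nach Todesursachen und Alter</head>
  <note>Tabelle 12</note>
  <head type="date"><date when="1914-01">Januar 1914</date></head>
  <row role="label">
    <cell role="label">Todesursachen</cell>
    <cell role="label">0-1 J.</cell>
    <cell role="label">1-4 J.</cell>
    <!-- further age-group and sex columns omitted -->
    <cell role="label">Total</cell>
  </row>
  <row>
    <cell role="label">Grippe</cell>
    <cell>2</cell>
    <cell/>                        <!-- long dash in the source: empty value -->
    <cell ana="#sum">9</cell>      <!-- row total -->
  </row>
  <row>
    <cell role="label">davon Brechdurchfall</cell>
    <cell ana="#exsum">1</cell>    <!-- excluded from the column totals -->
    <cell/>
    <cell ana="#exsum">1</cell>
  </row>
  <row>
    <cell role="label">Total</cell>
    <cell ana="#sum">14</cell>
    <cell ana="#sum">11</cell>
    <cell ana="#sum">63</cell>
  </row>
  <row>
    <cell role="label"><date when="1913-01">Januar 1913</date></cell>
    <cell ana="#sum">12</cell>
    <cell ana="#sum">16</cell>
    <cell ana="#sum">71</cell>
  </row>
</table>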

Challenges and problems

An initial problem already arose during the OCR-based digitisation. The Central Library’s Tesseract-based OCR software, which specializes in continuous text, simply failed to capture the text in the tables. I therefore first had to transcribe the table by hand, which is error-prone. In principle, however, it is irrelevant in TEI in which format the original text was created. The potential for errors when transferring Excel data into the “original” XML is also high, especially if the table is complex and/or detailed. Ideally, i.e. given a cleanly OCR-recognised table, it ought to be possible to export the OCR content of the PDFs directly to XML. In conversation, the ZB’s DigiZ confirmed that they are no longer satisfied with the OCR quality and are considering improvements with regard to precision.

Due to the extremely short instructions for table preparation in TEI, I underestimated the variety of different text components that TEI offers. The complexity of TEI is not apparent from the rough overview of the individual chapters and their introductory descriptions; it only became clear while adjusting table 12 to the TEI standard. As I became more accustomed to TEI, its limitations regarding table preparation also became more evident: it is fundamentally geared towards structuring continuous text rather than text forms in which the structure or layout itself carries meaning, as is the case with tables.

The conversion of the sample table into XML and the preparation of an associated TEI schema, reduced to the elements present in the sample document yet still valid against the TEI standard, proved to be time-consuming code work. Both the sample XML and the local schema comprise over 500 lines of code each – and this essentially for a single, though complex, table with a little metadata. In addition, the extremely comprehensive and complex TEI schema on which my XML document is based is not suitable for implementation in Excel. As a result, I had to prepare an XML table schema that was as general as possible, which may be used to convert the Excel tables into XML in the future and thus reduce the error potential of the conversion.
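Such a reduced local schema can be expressed as a TEI ODD customisation. The following sketch only illustrates the principle of selecting modules and elements; it is not the schema actually used in this project, and the selection shown would need to be adapted to the elements the tables really require:

<schemaSpec xmlns="http://www.tei-c.org/ns/1.0" ident="tei_tables_minimal" start="TEI">
  <!-- infrastructure module: attribute classes and datatypes -->
  <moduleRef key="tei"/>
  <!-- only the header, structural and core elements that the sample document uses -->
  <moduleRef key="header" include="teiHeader fileDesc titleStmt publicationStmt sourceDesc"/>
  <moduleRef key="core" include="title p head note date"/>
  <moduleRef key="textstructure" include="TEI text body"/>
  <!-- the table module proper -->
  <moduleRef key="figures" include="table row cell"/>
</schemaSpec>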

Ideas for Project Expansion

Because, as mentioned, the OCR output of the tables is not usable in this case, it should now be crucial for any digitisation project to achieve high-quality OCR of the retro-digitised tables. Table recognition is definitely an issue in economic history research, and several open-source tools are being developed in public Git repositories, although none of them has yet established itself as a standard.

Ideally, the tables recognized in this way would then provide better text structures in the facsimile. With the module for the transcription of original sources, TEI offers extensive possibilities for linking text passages in the transcription with the corresponding passages in the facsimiles. Such links could ideally be used as training data for text recognition programs to improve their performance in the area of table recognition. Other TEI elements that lend structure to the table, such as the dividing lines and the long dashes for the empty values, could also serve as such structural recognition features.
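Very roughly, such a link between a table cell and the corresponding region of the page image could look like this; the image file name, coordinates and identifiers are invented for illustration:

<facsimile>
  <surface xml:id="p7">
    <graphic url="monatsbericht_1914_01_p7.jpg"/>
    <!-- bounding box of one table cell on the page image -->
    <zone xml:id="z_r5_c3" ulx="412" uly="880" lrx="470" lry="910"/>
  </surface>
</facsimile>
<!-- ... later, inside the transcribed table ... -->
<cell facs="#z_r5_c3">3</cell>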

Additional important TEI elements such as locations and gender would further enrich the content of the TEI-XML. Detailed metadata, as provided for example with the retro-digitised version on the ZOP, can easily be integrated into the TEI header element “xenoData”. Finally, in view of the complex structure of the tables, it is essential to understand and implement XSLT (eXtensible Stylesheet Language Transformations) for automated structuring, and as a basis for RDF as used, for example, by the Graz team.
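To indicate the kind of XSLT involved, here is a minimal sketch, assuming a document in the TEI namespace as above, that flattens every table row into a semicolon-separated line which statistical software can read directly; it is a starting point rather than a finished transformation:

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0">
  <xsl:output method="text" encoding="UTF-8"/>
  <!-- one output line per table row, cell values separated by semicolons -->
  <xsl:template match="/">
    <xsl:for-each select="//tei:table/tei:row">
      <xsl:for-each select="tei:cell">
        <xsl:value-of select="normalize-space(.)"/>
        <xsl:if test="position() != last()">
          <xsl:text>;</xsl:text>
        </xsl:if>
      </xsl:for-each>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>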

Conclusion

So far, tables seem to have led a shadowy existence within the Text Encoding Initiative (TEI) – or, as TEI pioneer Lou Burnard remarked on the TEI mailing list in response to my question whether TEI processing of tables made sense: “Tables are tricky”. The main reason for this probably lies in the orientation of existing tools and users towards continuous text, and in their lesser interest in numerical formats.

In principle, however, preparation according to the TEI standard offers the opportunity to think conceptually about the function of tabularly structured data and to make changes traceable, e.g. in serial sources such as statistical tables. The clearly structured text processing of TEI could provide a basis for improving the still rather poor quality of text recognition programs when capturing tables. And a platform-independent, non-proprietary data structure such as XML would be almost indispensable for the sustainable long-term archiving of “digitally born” statistics, which have experienced a boom in recent years, especially during the pandemic. After all, our descendants should also be able to access historical statistics during the next one.

References

Burnard, Lou, and Syd Bauman. 2022. P5: Guidelines for Electronic Text Encoding and Interchange, Version 4.4.0. https://tei-c.org/release/doc/tei-p5-doc/en/html/index.html.
Calvi, Meili, Sonja. 2015. “Jahrrechnungen der Stadt Basel 1535 bis 1610 – digital.” Edited by Susanna Burghartz. http://gams.uni-graz.at/context:srbas.
Floris, Joël. 2023. “Die Spanische Grippe in Zürich erzählen.” Willy-Bretscher-Fellow 2022/23, June 2023. https://www.zb.uzh.ch/de/zuerich/die-spanische-grippe-zuerich-erzaehlen.
Statistisches Amt der Stadt Zürich. 1919. Monats-Berichte des Statistischen Amtes der Stadt Zürich 1918. Zurich. https://doi.org/10.20384/zop-871.
Vogeler, Georg. 2015. “Warum werden mittelalterliche und frühneuzeitliche Rechnungsbücher eigentlich nicht digital ediert?” In Grenzen und Möglichkeiten der Digital Humanities, edited by Constanze Baum and Thomas Stäcker. Sonderband der Zeitschrift für digitale Geisteswissenschaften, 1. https://doi.org/10.17175/sb001_007.

Reuse

CC BY-SA 4.0

Citation

BibTeX citation:
@misc{wuethrich2024,
  author = {Wuethrich, Gabi},
  editor = {Baudry, Jérôme and Burkart, Lucas and Joyeux-Prunel,
    Béatrice and Kurmann, Eliane and Mähr, Moritz and Natale, Enrico and
    Sibille, Christiane and Twente, Moritz},
  title = {Tables Are Tricky. {Testing} {Text} {Encoding} {Initiative}
    {(TEI)} {Guidelines} for {FAIR} Upcycling of Digitised Historical
    Statistics.},
  date = {2024-09-12},
  url = {https://digihistch24.github.io/submissions/428/},
  doi = {10.5281/zenodo.13903990},
  langid = {en},
  abstract = {This project on digital data management explores the use
    of XML structures, specifically the Text Encoding Initiative (TEI),
    to digitize historical statistical tables from Zurich’s 1918
    pandemic data. The goal was to make these health statistics tables
    reusable, interoperable, and machine-readable. Following the
    retro-digitization of statistical publications by Zurich’s Central
    Library, the content was semi-automatically captured with OCR in
    Excel and converted to XML using TEI guidelines. However, OCR
    software struggled to accurately capture table content, requiring
    manual data entry, which introduced potential errors. Ideally, OCR
    tools would allow for direct XML export from PDFs. The
    implementation of TEI for tables remains a challenge, as TEI is
    primarily focused on running text rather than tabular data, as noted
    by TEI pioneer Lou Burnard. Despite these challenges, TEI data
    processing offers opportunities for conceptualizing tabular data
    structures and ensuring traceability of changes, especially in
    serial statistics. An example is a project using early-modern Basle
    account books, which were “upcycled” following TEI principles.
    Additionally, TEI’s structured approach could help improve the
    accuracy of table text recognition in future projects.}
}
For attribution, please cite this work as:
Wuethrich, Gabi. 2024. “Tables Are Tricky. Testing Text Encoding Initiative (TEI) Guidelines for FAIR Upcycling of Digitised Historical Statistics.” Edited by Jérôme Baudry, Lucas Burkart, Béatrice Joyeux-Prunel, Eliane Kurmann, Moritz Mähr, Enrico Natale, Christiane Sibille, and Moritz Twente. Digital History Switzerland 2024: Book of Abstracts. https://doi.org/10.5281/zenodo.13903990.