Tables are tricky. Testing Text Encoding Initiative (TEI) Guidelines for FAIR upcycling of digitised historical statistics.
This project on digital data management explores the use of XML structures, specifically the Text Encoding Initiative (TEI), to digitize historical statistical tables from Zurich’s 1918 pandemic data. The goal was to make these health statistics tables reusable, interoperable, and machine-readable. Following the retro-digitization of statistical publications by Zurich’s Central Library, the content was semi-automatically captured with OCR in Excel and converted to XML using TEI guidelines.
However, OCR software struggled to accurately capture table content, requiring manual data entry, which introduced potential errors. Ideally, OCR tools would allow for direct XML export from PDFs. The implementation of TEI for tables remains a challenge, as TEI is primarily focused on running text rather than tabular data, as noted by TEI pioneer Lou Burnard.
Despite these challenges, TEI data processing offers opportunities for conceptualizing tabular data structures and ensuring traceability of changes, especially in serial statistics. An example is a project using early-modern Basle account books, which were “upcycled” following TEI principles. Additionally, TEI’s structured approach could help improve the accuracy of table text recognition in future projects.
Keywords: Text recognition, Table structure, TEI
Introduction
In 2121, nothing is as it once was: a nasty virus is keeping the world on tenterhooks – and people trapped in their own four walls. In the depths of the metaverse, contemporaries are searching for data to compare the frightening death toll of the current killer virus with its predecessors during the Covid-19 pandemic and the «Spanish flu». There is an incredible amount of statistical material on the Covid-19 pandemic in particular, but annoyingly, it is only available in obscure data formats such as .xlsx in the internet archives. The files can still be opened with the usual text editors, but their structure is terribly confusing and unreadable with the latest statistical tools. If only those digital hillbillies in the 2020s had used a structured format that not only long-outdated machines but also people in the year 2121 could read…
Admittedly, very few epidemiologists, statisticians and federal officials are likely to have considered such future scenarios during the pandemic years. Yet the quantitative social sciences and the humanities, including medical and economic history, but also memory institutions such as archives and libraries, should consciously consider how they can sustainably preserve the flood of digital data for future generations. Sustainable processing and storage of printed statistical data from the time of the First World War, for instance, still makes it possible today to gain new insights into the so-called «Spanish flu» in the city of Zurich. The publications by the Statistical Office of the City of Zurich, which were previously only available in “analog” paper format, have been digitized by the Zentralbibliothek (Central Library, ZB) Zurich as part of Joël Floris’ Willy Bretscher Fellowship 2022/2023 (Floris (2023)). This project paper has been written in the context of that digitisation project, as issues regarding the digital recording, processing, and storage of historical statistics have always occupied quantitative economic historians “for professional reasons”.
The basic idea of this paper is to prepare tables with historical health statistics in a sustainable way so that they can be easily analysed using digital means. The aim was to capture the statistical publications retro-digitized by the ZB semi-automatically with OCR in Excel tables and to prepare them as XML documents according to the guidelines of the Text Encoding Initiative (TEI), a standardized vocabulary for text structures. To do this, it was first necessary to familiarise myself with TEI and its appropriate modules and to apply them to a sample table in Excel. To be able to validate the Excel table manually transferred to XML, I then developed a schema based on the vocabularies of XML and TEI. This could then serve as the basis for an automated conversion of the Excel tables into TEI-compliant XML documents. Such clearly structured XML documents should ultimately be relatively easy to convert into formats that can be read into a wide variety of visualisation and statistical tools.
Data description
A table from the monthly reports of the Zurich Statistical Office serves as the example data set. The monthly reports were digitised as high-resolution PDFs with underlying Optical Character Recognition (OCR) based on Tesseract by the Central Library’s Digitisation Centre (DigiZ) as part of the Willy Bretscher Fellowship project. They are available on the ZB’s Zurich Open Platform (ZOP, Statistisches Amt der Stadt Zürich (1919)), including detailed metadata. The reports were published by the Statistical Office of the City of Zurich as a journal volume under this title between 1908 and 1919, and then as «Quarterly Reports» until 1923. Each monthly report consists of a 27-page table section with individual footnotes and concludes with a two-page explanatory section in continuous text. For this study, the data selection is limited to a table for the year 1914 and the month of January (Statistisches Amt der Stadt Zürich (1919)). In connection with Joël Floris’ project, which aims at obtaining quantitative information on Zurich’s demographic development during the «Spanish flu» from the retro-digitisation project, the obvious choice was to focus on tables listing causes of death. The corresponding table number 12, entitled «Die Gestorbenen (in der Wohnbev.) nach Todesursachen und Alter» («The Deceased (in the Resident Pop.) by Cause of Death and Age»), can be found on page seven of the monthly report. It contains monthly data on causes of death, broken down by age group and gender, as well as comparative figures for the same month of the previous year. The content of this table is prepared below as a standardized XML document with an associated schema that complies with the TEI guidelines.
Methods for capturing historical tables in XML
The source of inspiration for this project paper was a pioneering research project originally based at the University of Basle, in which the annual accounts of the city of Basle from 1535 to 1610 were digitally edited (Calvi (2015)). The technical implementation was carried out by the Centre for Information Modelling at the University of Graz. Based on a digital text edition prepared in accordance with the TEI standard, the project manages to combine the facsimile, a web edition in HTML, and tabular processing via RDF (Resource Description Framework) and XSLT (eXtensible Stylesheet Language Transformations) in an exemplary manner. The edition thus allows users to compile their own selection of booking data in a “data basket” for subsequent machine-readable analysis. In an accompanying article, project team member Georg Vogeler describes the first implementation of a numerical evaluation and how “even extensive holdings can be efficiently edited digitally” (Vogeler (2015)). The central basis for this, however, is an XML encoding of the corresponding tabular information based on the TEI standard. This project relies on the April 2022 version (4.4.0) of the TEI guidelines (Burnard (2022)). They include a short chapter on the preparation of tables, formulae, graphics, and music. Even the introduction to chapter 14 is rather cautious about applying TEI to tables, warning that layout and presentation details carry more weight in tables than in running text, that they are already covered more comprehensively by other standards, and that they should preferably be prepared in those notations. When I asked on the TEI-L mailing list whether it made sense to prepare historical tables with the TEI table module, the answers were rather reserved (https://listserv.brown.edu/cgi-bin/wa?A1=ind2206&L=TEI-L#24). Only the Graz team remained optimistic that TEI could be used to process historical tables, albeit in combination with RDF and a corresponding ontology. Christopher Pollin also provided GitHub links to the DEPCHA project, which is developing an ontology for annotating transactions in historical account books.
Table structure in TEI-XML
Basically, the TEI schema treats a table as a special text element consisting of row elements, which in turn contain cell elements. This basic structure was used to encode table 12 from 1914, which I transcribed manually into an Excel file. Because exact formatting, including precise reproduction of the frame lines, is very time-consuming, the frame lines in this project only served as structural information and are not reproduced as topographical line elements, as an exact TEI rendering would require. Long dashes, which correspond to zero values in the source, are interpreted as empty values in the TEI-XML. I used the resulting worksheet as the basis for the TEI-XML annotation, in which I also added some metadata. I then had to create an adapted local schema as well as a TEI header before structuring the table’s text body. Suitable heading (“head”) elements are the title of the table, the table number as a note, and the «date» of the table. The first table row contains the column headings and is accordingly assigned the role attribute “label”. The third-to-last cell of each row contains the row total, which I have given the attribute “ana” (for analysis) with the value “#sum” (for total), following the example of the Basle edition. The first cell of each row names the cause of death and must therefore also be labelled with the role attribute “label”. The second-to-last row shows the totals of the current monthly table, which is why all of its value cells carry the “#sum” attribute. Finally, the last row shows the totals for the same month of the previous year; it is therefore not only marked with the sum attribute but also carries a date in its label cell. A potential confounding factor for later calculations is the row “including diarrhea”, which further specifies diseases of the digestive organs but must not be included in the column total. Accordingly, it is given a separate analytical value, “#exsum”. As each cell in the code represents a separate element, the digitally upcycled table 12 in XML format ultimately extends over a good 550 lines of code, which I am happy to share on request.
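To illustrate the encoding pattern just described, the following abridged sketch shows what a few rows of such a table might look like in TEI-XML. The element and attribute names follow the TEI table module as used in the project, but the snippet is heavily shortened: the row labels are given in the English renderings used above, the figures are placeholders rather than values from the source, and the column structure is collapsed, so the row totals appear here as the last cell rather than in the third-to-last position described above.

<table xmlns="http://www.tei-c.org/ns/1.0">
  <head>The Deceased (in the Resident Pop.) by Cause of Death and Age</head>
  <head type="note">Table 12</head>
  <head type="date">January 1914</head>
  <row role="label">
    <cell>Cause of death</cell>
    <cell>0–1 years</cell>
    <cell>...</cell>
    <cell>Total</cell>
  </row>
  <row>
    <cell role="label">Diseases of the digestive organs</cell>
    <cell>4</cell>
    <cell/> <!-- long dash in the source, i.e. a zero value, kept as an empty cell -->
    <cell ana="#sum">11</cell>
  </row>
  <row>
    <cell role="label">including diarrhea</cell>
    <cell>2</cell>
    <cell/>
    <cell ana="#exsum">3</cell> <!-- excluded from column totals -->
  </row>
  <row>
    <cell role="label">Total</cell>
    <cell ana="#sum">57</cell>
    <cell ana="#sum">0</cell>
    <cell ana="#sum">203</cell>
  </row>
  <row>
    <cell role="label"><date>January 1913</date> Total</cell> <!-- previous year's month -->
    <cell ana="#sum">61</cell>
    <cell ana="#sum">2</cell>
    <cell ana="#sum">198</cell>
  </row>
</table>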
Challenges and problems
An initial problem already arose during the OCR-based digitisation. The Central Library’s Tesseract-based OCR software, which is geared towards continuous text, simply failed to capture the text in the tables. I therefore first had to transcribe the table by hand, which is error-prone. In principle, however, it is irrelevant for TEI in which format the original text was captured. The potential for errors when transferring Excel data into the “original” XML is also high, especially if the table is complex and/or detailed. Ideally, i.e. with clean OCR of the tables, it ought to be possible to export the OCR content of the PDFs directly to XML. In conversation, the ZB’s DigiZ confirmed that they are no longer happy with the OCR quality and are considering improvements in precision. Due to the extremely short instructions for table preparation in TEI, I underestimated the variety of different text components that TEI offers. The complexity of TEI is not clear from the rough overview of the individual chapters and their introductory descriptions; it only became clear while adjusting table 12 to TEI standards. As I became more familiar with TEI, its limitations regarding table preparation also became more evident: it is fundamentally geared towards structuring continuous text rather than text forms in which structure or layout itself carries meaning, as is the case with tables. The conversion of the sample table into XML and the preparation of an associated TEI schema, which is reduced to the elements present in the sample document yet remains valid against the TEI standard, proved to be time-consuming coding work. Thus, the sample XML and the local schema each comprise over 500 lines of code, and this basically for only a single, though complex, table with a few items of metadata. In addition, the extremely comprehensive and complex TEI schema on which my XML document is based is not suitable for implementation in Excel. As a result, I had to prepare an XML table schema that was as general as possible, which may be used to convert the Excel tables into XML in the future, thus reducing the error potential of the XML conversion.
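The project’s local schema itself runs to over 500 lines and is not reproduced here. Purely to illustrate the idea of such a reduced, table-centred schema, a minimal RELAX NG sketch restricted to the elements and attributes used above could look roughly as follows. This is a simplified assumption of mine, not the project schema: it omits the TEI header, the date element used in the last row’s label cell, and everything else a complete TEI document requires.

<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         ns="http://www.tei-c.org/ns/1.0">
  <!-- Document root: a single table (the real schema embeds it in a full TEI document) -->
  <start>
    <ref name="table"/>
  </start>
  <define name="table">
    <element name="table">
      <!-- Title, note and date headings -->
      <zeroOrMore>
        <element name="head">
          <optional><attribute name="type"/></optional>
          <text/>
        </element>
      </zeroOrMore>
      <oneOrMore>
        <ref name="row"/>
      </oneOrMore>
    </element>
  </define>
  <define name="row">
    <element name="row">
      <optional><attribute name="role"/></optional>
      <oneOrMore>
        <element name="cell">
          <optional><attribute name="role"/></optional>
          <!-- analysis pointers such as "#sum" and "#exsum" -->
          <optional><attribute name="ana"/></optional>
          <text/>
        </element>
      </oneOrMore>
    </element>
  </define>
</grammar>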
Ideas for Project Expansion
Since, as mentioned, the OCR output of the tables was not usable in this case, achieving high-quality OCR of retro-digitised tables should be a priority for any digitisation project. Table recognition is definitely an issue in economic history research, and several open-source tools are being developed in Git repositories, although none of them has set a standard yet. Ideally, tables recognised in this way would then provide better text structures for the facsimile. With the module for the transcription of original sources, TEI offers extensive possibilities for linking text passages in the transcription with the corresponding passages in the facsimiles. Such links could ideally be used as training data for text recognition programs to improve their performance in the area of table recognition. Other TEI elements that lend structure to the table, such as the dividing lines and the long dashes for the empty values, could also serve as such structural recognition features. Additional important TEI elements, such as those for places and gender, would further enrich the content of the TEI-XML. Detailed metadata, such as that provided with the retro-digitised version on the ZOP, can easily be integrated into the TEI header’s “xenoData” area. Finally, in view of the complex structure of the tables, it is essential to understand and implement XSLT (eXtensible Stylesheet Language Transformations) for automated structuring and as a basis for RDF, as used e.g. by the Graz team.
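As a first impression of what such an XSLT step could look like, the following minimal sketch (my own assumption, not part of the project or of the Graz pipeline) flattens a TEI table of the kind shown above into comma-separated values that common statistical tools can read. Rows containing “#exsum” cells would still have to be filtered out before totals are recomputed.

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0">
  <!-- Output plain text: one CSV line per table row -->
  <xsl:output method="text" encoding="UTF-8"/>
  <xsl:template match="/">
    <xsl:for-each select="//tei:table/tei:row">
      <xsl:for-each select="tei:cell">
        <!-- Quote each cell so commas in labels do not break the CSV -->
        <xsl:text>"</xsl:text>
        <xsl:value-of select="normalize-space(.)"/>
        <xsl:text>"</xsl:text>
        <xsl:if test="position() != last()">
          <xsl:text>,</xsl:text>
        </xsl:if>
      </xsl:for-each>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>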
Conclusion
So far, tables seem to have led a shadowy existence within the Text Encoding Initiative (TEI) – or, as TEI pioneer Lou Burnard remarked on the TEI mailing list in response to my question as to whether TEI processing of tables made sense: “Tables are tricky”. The main reason for this probably lies in the continuous-text orientation of existing tools and users, who also tend to be less interested in numerical formats. In principle, however, preparation according to the TEI standard offers the opportunity to think conceptually about the function of tabularly structured data and to make changes, e.g. in serial sources such as statistical tables, traceable. The clearly structured text processing of TEI could also provide a basis for improving the still rather poor quality of text recognition programs when capturing tables. And a platform-independent, non-proprietary data structure such as XML would be almost indispensable for the sustainable long-term archiving of “digitally born” statistics, which have boomed in recent years, especially during the pandemic. After all, our descendants should also be able to access historical statistics during the next one.
References
Citation
@misc{wuethrich2024,
author = {Wuethrich, Gabi},
editor = {Baudry, Jérôme and Burkart, Lucas and Joyeux-Prunel,
Béatrice and Kurmann, Eliane and Mähr, Moritz and Natale, Enrico and
Sibille, Christiane and Twente, Moritz},
title = {Tables Are Tricky. {Testing} {Text} {Encoding} {Initiative}
{(TEI)} {Guidelines} for {FAIR} Upcycling of Digitised Historical
Statistics.},
date = {2024-08-15},
url = {https://digihistch24.github.io/submissions/428/},
langid = {en},
}