Source data and text normalization — Cullen Project Explorer

Source data

The Cullen Project Explorer derives its data from the MariaDB database produced by the Cullen Project at the University of Glasgow. The raw data is publicly available at researchdata.gla.ac.uk/1225. The ETL (Extract-Transform-Load) pipeline reads the database, parses the TEI XML body of each letter, computes register features, and produces a compact read-only SQLite file (~50 MB) that the web application queries at runtime.
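As an illustration of the runtime side of this design, the sketch below opens an SQLite file in read-only mode using Python's standard sqlite3 module. It is not the Explorer's actual code: the helper name open_readonly is hypothetical, and the real application may use a different language or driver.

```python
import sqlite3
from pathlib import Path


def open_readonly(path):
    """Open an SQLite file so that any write attempt raises an error.

    Hypothetical helper: the Explorer's own access layer is not documented here.
    """
    # as_uri() yields a properly escaped file:// URI; mode=ro asks SQLite
    # to refuse writes on this connection.
    uri = Path(path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)
```

Opening the file this way means a stray INSERT or UPDATE fails with sqlite3.OperationalError instead of silently changing the derived data.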

Corpus scope

The derived corpus contains 5,617 letters with transcribed content. Glasgow's public file release contains 3,799 standalone XML files — a subset of the database representing letters that had been exported at the time of the data release. The Explorer can filter to the file-release subset via the "Corpus scope" control in the subcorpus builder.

One record (doc 3194) is excluded as a test record with placeholder metadata. See the voice classification methodology page for details on the ~4% of letters excluded from voice analysis.

Text normalization

The prose text used for all analyses is extracted from the TEI XML with these transformations:

Editorial apparatus removed: notes, annotations, page-break markers, illegible-section markers.
Recipe blocks excluded: Latin formulaic prescription text is removed from prose but retained as metadata (recipe count, ingredient concepts).
Editorial choices resolved: abbreviations expanded, spellings regularized, scribal substitutions resolved to their final form.
Line breaks and whitespace normalized: words split by soft hyphens rejoined, whitespace collapsed.
Folio/page-number headers removed.
Mojibake repaired: ~70% of letters contained double-encoded artifacts (for example Â¬, the result of UTF-8 text being misread as Latin-1); these were corrected during ETL.
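The transformations above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the TEI element names in DROP, the function names, and the treatment of U+00AD or ¬ before a line break as a soft hyphen are all guesses at how the markup might look. Recipe-block handling is left out.

```python
import re
import xml.etree.ElementTree as ET

# Assumed TEI element names; the project's actual markup may differ.
# note/pb/gap/fw: editorial notes, page breaks, illegible spans, folio headers.
# abbr/orig/sic/del: the rejected half of a <choice> or <subst> pair.
DROP = {"note", "pb", "gap", "fw", "abbr", "orig", "sic", "del"}


def repair_mojibake(text):
    """Undo one round of UTF-8 misread as Latin-1 (e.g. 'Â¬' back to '¬')."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not mojibake of this kind, so leave it untouched


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://www.tei-c.org/ns/1.0}'."""
    return tag.rsplit("}", 1)[-1]


def prose(elem):
    """Concatenate text, skipping DROP elements but keeping the text after them."""
    parts = [elem.text or ""]
    for child in elem:
        if local_name(child.tag) not in DROP:
            parts.append(prose(child))
        parts.append(child.tail or "")
    return "".join(parts)


def normalize(tei_xml):
    """Return normalized prose for one letter body given as an XML string."""
    text = prose(ET.fromstring(repair_mojibake(tei_xml)))
    # Rejoin words split at a line end by a soft hyphen (assumed: U+00AD or '¬').
    text = re.sub(r"[\u00ad¬]\s*\n\s*", "", text)
    return re.sub(r"\s+", " ", text).strip()
```

Note that repair_mojibake works on the whole string at once, so a letter mixing damaged and correctly encoded non-ASCII text would be left unrepaired; a production pipeline would likely need a finer-grained approach.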

What is excluded from voice analysis

Approximately 4% of letters are classified as "excluded" — primarily non-medical correspondence (letters from merchants, lawyers, students) and letters where the author's relationship to the patient could not be determined.