Source data

The Cullen Project Explorer derives its data from the MariaDB database produced by the Cullen Project at the University of Glasgow. The raw data is publicly available at researchdata.gla.ac.uk/1225. The ETL (Extract-Transform-Load) pipeline reads the database, parses the TEI XML body of each letter, computes register features, and produces a compact read-only SQLite file (~50 MB) that the web application queries at runtime.
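As a rough illustration of the transform-and-load steps described above, the following is a minimal sketch rather than the project's actual pipeline. It assumes each letter arrives as a (doc_id, TEI XML string) pair already read from MariaDB. The table name `letters`, its columns, and the helper names `extract_body_text` and `load_letters` are hypothetical, and a simple word count stands in for the register features, which are not specified here.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Standard TEI P5 namespace; whether the Cullen XML declares it is an assumption.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def extract_body_text(tei_xml: str) -> str:
    """Return the whitespace-collapsed text of the TEI <body> element."""
    root = ET.fromstring(tei_xml)
    body = root.find(".//tei:body", TEI_NS)
    if body is None:
        return ""
    # itertext() walks all nested text nodes; split/join collapses whitespace.
    return " ".join("".join(body.itertext()).split())


def load_letters(rows, sqlite_path=":memory:"):
    """Write (doc_id, tei_xml) rows into a hypothetical 'letters' table."""
    conn = sqlite3.connect(sqlite_path)
    conn.execute(
        "CREATE TABLE letters ("
        "doc_id INTEGER PRIMARY KEY, text TEXT, word_count INTEGER)"
    )
    for doc_id, tei_xml in rows:
        text = extract_body_text(tei_xml)
        # Word count is a placeholder for the real register features.
        conn.execute(
            "INSERT INTO letters VALUES (?, ?, ?)",
            (doc_id, text, len(text.split())),
        )
    conn.commit()
    return conn


# Invented sample document, for illustration only.
sample = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
    "<p>Dear Sir,  I have  been unwell.</p>"
    "</body></text></TEI>"
)
conn = load_letters([(1, sample)])
print(conn.execute("SELECT text, word_count FROM letters").fetchone())
# prints ('Dear Sir, I have been unwell.', 6)
```

Writing to a single SQLite file keeps the runtime side read-only and free of a database server, which matches the compact-file design described above.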

Corpus scope

The derived corpus contains 5,617 letters with transcribed content. Glasgow's public file release contains 3,799 standalone XML files, a subset of the database comprising the letters that had been exported at the time of the data release. The Explorer can restrict analysis to this file-release subset via the "Corpus scope" control in the subcorpus builder.

One record (doc 3194) is excluded as a test record with placeholder metadata. See the voice classification methodology page for details on the ~4% of letters excluded from voice analysis.
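The scope rules in this section can be sketched as a query helper. This is an assumption-laden illustration, not the Explorer's code: the `letters` table, the `in_file_release` flag column, and the function `select_doc_ids` are hypothetical, and the text does not state whether the test record is dropped at load time or at query time.

```python
import sqlite3

# Doc 3194 is the test record with placeholder metadata named in the text.
EXCLUDED_DOC_IDS = {3194}


def select_doc_ids(conn, file_release_only=False):
    """Return in-scope doc ids, optionally limited to the file-release subset."""
    sql = "SELECT doc_id FROM letters"
    if file_release_only:
        # Hypothetical flag marking letters present in the public XML release.
        sql += " WHERE in_file_release = 1"
    sql += " ORDER BY doc_id"
    return [
        doc_id
        for (doc_id,) in conn.execute(sql)
        if doc_id not in EXCLUDED_DOC_IDS
    ]


# Tiny invented dataset to exercise both scopes.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE letters (doc_id INTEGER PRIMARY KEY, in_file_release INTEGER)"
)
conn.executemany(
    "INSERT INTO letters VALUES (?, ?)", [(1, 1), (2, 0), (3194, 1)]
)
print(select_doc_ids(conn))                          # prints [1, 2]
print(select_doc_ids(conn, file_release_only=True))  # prints [1]
```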

Text normalization

The prose text used for all analyses is extracted from the TEI XML with these transformations:

What is excluded from voice analysis

Approximately 4% of letters are classified as "excluded" — primarily non-medical correspondence (letters from merchants, lawyers, students) and letters where the author's relationship to the patient could not be determined.