Methodology
How keyness and register profiles work
Keyness analysis and register profiling compare two subcorpora to reveal how their vocabulary and linguistic style differ. This page explains the statistics and the feature definitions.
What is keyness?
Keyness measures which words are statistically over-represented in one corpus compared to another. A word is "key" to a corpus when it appears significantly more often than you would expect by chance, given the sizes of the two corpora being compared. Keyness analysis was developed by Mike Scott for the WordSmith Tools software and has been widely used in corpus linguistics since the 1990s.
The log-likelihood statistic (G²)
The Explorer uses Dunning's log-likelihood ratio (G²) to measure keyness. For each word, a 2×2 contingency table is constructed: the word's frequency in corpus A, its frequency in corpus B, and the total tokens in each corpus. The expected frequency is what the word's count would be if it were distributed proportionally across both corpora. G² measures how far the observed frequencies deviate from this expectation. Higher G² values indicate more statistically distinctive words. The formula is:
G² = 2 × (a × ln(a/E₁) + b × ln(b/E₂))
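where a = frequency in corpus A, b = frequency in corpus B, E₁ = expected frequency in A, E₂ = expected frequency in B. Expected values are calculated as: E₁ = (total tokens in A) × (a + b) / (total tokens in A + B), and likewise E₂ = (total tokens in B) × (a + b) / (total tokens in A + B). By convention, a term whose observed frequency is 0 is taken as 0, since ln(0) is undefined.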
Each word is assigned to the corpus in which it is over-represented, that is, the corpus where its observed frequency exceeds its expected frequency.
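The calculation above can be sketched in a few lines of Python. This is a minimal illustration of the formula as stated on this page, not the Explorer's actual code: the function name log_likelihood and its return shape are hypothetical, and treating a zero count as contributing nothing is the usual convention rather than something this page specifies.

```python
import math

def log_likelihood(a, b, total_a, total_b):
    """Return (G2, key_to) for one word.

    a, b             -- the word's frequency in corpus A and corpus B
    total_a, total_b -- total tokens in corpus A and corpus B
    key_to           -- "A", "B", or None if the word is exactly proportional
    """
    combined = total_a + total_b
    # Expected frequencies if the word were spread proportionally.
    e1 = total_a * (a + b) / combined
    e2 = total_b * (a + b) / combined
    # Assumption: a zero count contributes 0 (x * ln x tends to 0 as x -> 0).
    term_a = a * math.log(a / e1) if a > 0 else 0.0
    term_b = b * math.log(b / e2) if b > 0 else 0.0
    g2 = 2 * (term_a + term_b)
    # Assign the word to the corpus where observed exceeds expected.
    if a > e1:
        key_to = "A"
    elif b > e2:
        key_to = "B"
    else:
        key_to = None
    return g2, key_to
```

For example, a word occurring 30 times in A and 10 times in B, with 1,000 tokens in each corpus, has an expected frequency of 20 in both; log_likelihood(30, 10, 1000, 1000) then gives a G² of about 10.5 and assigns the word to A.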
Minimum frequency filter
By default, words appearing fewer than 10 times across both corpora are excluded, because rare words can produce high G² values that are statistically unreliable. This minimum frequency can be adjusted.
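As a rough sketch, the filter might look like the following. It assumes that "across both corpora" means the combined count in A and B, and that word counts arrive as plain dictionaries; the function name filter_by_min_frequency and the parameter min_freq are hypothetical.

```python
def filter_by_min_frequency(counts_a, counts_b, min_freq=10):
    """Return the words whose combined count in A and B is at least min_freq.

    counts_a, counts_b -- dicts mapping word -> frequency in each corpus
    """
    vocabulary = set(counts_a) | set(counts_b)
    return sorted(
        word
        for word in vocabulary
        # Assumption: the threshold applies to the sum over both corpora.
        if counts_a.get(word, 0) + counts_b.get(word, 0) >= min_freq
    )
```

With counts_a = {"fever": 7, "pain": 2} and counts_b = {"fever": 3, "cough": 12}, "fever" (10 in total) and "cough" (12) are kept, while "pain" (2) is dropped before any G² value is computed.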
Register profiles
Below the keyness tables, a register profile compares surface-level linguistic features between the two corpora. All features are computed at ETL time on each letter and aggregated at query time per subcorpus. Count-based features are reported per 10,000 tokens; mean sentence length is reported in words per sentence:
- 1st person singular (I, me, my, mine, myself): Self-reference. High rates indicate personal narrative; low rates indicate third-person clinical description.
- 1st person plural (we, us, our, ours, ourselves): Institutional or collaborative voice.
- 2nd person (you, your, yours, thou, thee, thy, thine): Direct address. Includes 18th-century second-person forms.
- 3rd person singular (he, him, his, she, her, hers, it, its): Third-person reference. High rates indicate writing about someone rather than as someone.
- Modal hedging (may, might, perhaps, seems, seemed, possibly, probably): Epistemic caution. High rates indicate uncertainty or professional qualification.
- Nominalization (-tion, -sion, -ment, -ness endings): Abstract nominal style.
- Passive construction proxy (was/were/been/being/be + past participle): Agent-deleted constructions. A surface-form approximation, not a full syntactic parse.
- Mean sentence length (words per sentence): Sentence boundaries detected by terminal punctuation (. ! ?).
These are surface-form counts; no lemmatization is applied. This is a deliberate methodological choice: 18th-century English lemmatizers are imperfect, and surface-form counts are fully transparent and reproducible.
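To make the surface-form approach concrete, here is a minimal sketch covering four of the features plus mean sentence length. Several details are assumptions made for illustration and are not stated on this page: the tokenizer, the passive proxy rule (a be-form immediately followed by a token ending in -ed or -en), and aggregation by summing per-letter counts before normalizing. All function names are hypothetical.

```python
import re

FIRST_SINGULAR = {"i", "me", "my", "mine", "myself"}
HEDGES = {"may", "might", "perhaps", "seems", "seemed", "possibly", "probably"}
NOMINAL_SUFFIXES = ("tion", "sion", "ment", "ness")
BE_FORMS = {"was", "were", "been", "being", "be"}

def tokenize(text):
    # Assumed tokenizer: lowercase alphabetic runs, keeping internal apostrophes.
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

def letter_counts(text):
    """Raw surface-form counts for one letter (the per-letter step)."""
    tokens = tokenize(text)
    # Sentence boundaries: runs of terminal punctuation (. ! ?).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "tokens": len(tokens),
        "sentences": len(sentences),
        "first_singular": sum(t in FIRST_SINGULAR for t in tokens),
        "hedging": sum(t in HEDGES for t in tokens),
        # Suffix matching is deliberately crude and will over-match some words.
        "nominalization": sum(t.endswith(NOMINAL_SUFFIXES) for t in tokens),
        # Assumed proxy: be-form directly followed by an -ed/-en token.
        "passive_proxy": sum(
            prev in BE_FORMS and cur.endswith(("ed", "en"))
            for prev, cur in zip(tokens, tokens[1:])
        ),
    }

def subcorpus_profile(letters):
    """Sum per-letter counts, then normalize (the per-subcorpus step)."""
    totals = {}
    for text in letters:
        for name, value in letter_counts(text).items():
            totals[name] = totals.get(name, 0) + value
    n_tokens = totals.get("tokens", 0)
    n_sentences = totals.get("sentences", 0)
    profile = {
        name: (10_000 * value / n_tokens if n_tokens else 0.0)
        for name, value in totals.items()
        if name not in ("tokens", "sentences")
    }
    profile["mean_sentence_length"] = n_tokens / n_sentences if n_sentences else 0.0
    return profile
```

Because only raw counts are stored per letter, the same letters can be regrouped into any pair of subcorpora without recomputing the features, and every reported rate can be traced back to a simple word-list or suffix match.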