What is keyness?

Keyness measures which words are statistically over-represented in one corpus compared to another. A word is "key" to a corpus when it appears significantly more often than you would expect by chance, given the sizes of the two corpora being compared. Keyness analysis was developed by Mike Scott for the WordSmith Tools software and has been widely used in corpus linguistics since the 1990s.

The log-likelihood statistic (G²)

The Explorer uses Dunning's log-likelihood ratio (G²) to measure keyness. For each word, a 2×2 contingency table is built from four numbers: the word's frequency in corpus A, its frequency in corpus B, and the total number of tokens in each corpus. A word's expected frequency in a corpus is the count it would have if its occurrences were distributed across the two corpora in proportion to their sizes. G² measures how far the observed frequencies deviate from this expectation; higher G² values indicate more statistically distinctive words. The formula is:

G² = 2 × (a × ln(a/E₁) + b × ln(b/E₂))

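where a = frequency in corpus A, b = frequency in corpus B, E₁ = expected frequency in A, and E₂ = expected frequency in B. Expected values are calculated as E₁ = (total tokens in A) × (a + b) / (total tokens in A + total tokens in B) and, correspondingly, E₂ = (total tokens in B) × (a + b) / (total tokens in A + total tokens in B).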

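G² is never negative, so it does not show by itself which corpus a word favors. Each word is therefore assigned to the corpus in which it is over-represented, that is, the corpus where its observed frequency is higher than its expected frequency (equivalently, where its normalized frequency is higher).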
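
The calculation above can be sketched in a few lines of Python. This is a minimal illustration of the formula as written here, not the Explorer's actual code: the function names are hypothetical, E₂ is computed the same way as E₁ using corpus B's token total, and a zero count is assumed to contribute nothing to the sum (the usual convention, since x × ln(x) tends to 0 as x tends to 0).

```python
import math


def log_likelihood(a, b, total_a, total_b):
    """Two-term G² for one word, following the formula in the text.

    a, b             -- the word's frequency in corpus A and corpus B
    total_a, total_b -- total tokens in each corpus (assumed positive)
    """
    combined = a + b
    e1 = total_a * combined / (total_a + total_b)  # expected frequency in A
    e2 = total_b * combined / (total_a + total_b)  # expected frequency in B
    # Assumption: a zero count contributes nothing to the sum.
    term_a = a * math.log(a / e1) if a > 0 else 0.0
    term_b = b * math.log(b / e2) if b > 0 else 0.0
    return 2 * (term_a + term_b)


def key_corpus(a, b, total_a, total_b):
    """Return "A" or "B" for the over-represented corpus, or None if neither."""
    e1 = total_a * (a + b) / (total_a + total_b)
    if a > e1:
        return "A"
    if a < e1:
        return "B"
    return None
```

For example, with two corpora of 1,000 tokens each and a word that occurs 30 times in A and 10 times in B, both expected frequencies are 20, G² is about 10.46, and the word is assigned to corpus A.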

Minimum frequency filter

By default, words appearing fewer than 10 times across both corpora are excluded, because rare words can produce high G² values that are statistically unreliable. The minimum frequency can be adjusted.
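
A frequency filter of this kind might look like the sketch below. It assumes that "across both corpora" means the combined count a + b, and that word counts are available as plain dictionaries; the function name and the min_freq parameter are hypothetical and not taken from the Explorer.

```python
def filter_by_min_frequency(counts_a, counts_b, min_freq=10):
    """Keep words whose combined count in A and B is at least min_freq.

    counts_a, counts_b -- dicts mapping word -> frequency in each corpus
    Returns a dict mapping word -> (frequency in A, frequency in B).
    """
    vocabulary = set(counts_a) | set(counts_b)
    return {
        word: (counts_a.get(word, 0), counts_b.get(word, 0))
        for word in vocabulary
        # Assumption: the threshold applies to the combined count a + b.
        if counts_a.get(word, 0) + counts_b.get(word, 0) >= min_freq
    }
```

Only the words that pass this filter would then be scored with G².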

Register profiles

Below the keyness tables, a register profile compares surface-level linguistic features between the two corpora. All features are computed for each letter at ETL (extract, transform, load) time and aggregated per subcorpus at query time. Feature rates are reported per 10,000 tokens.

These are surface-form counts: no lemmatization is applied. This is a deliberate methodological choice, because lemmatizers perform imperfectly on 18th-century English, whereas surface-form counts are fully transparent and reproducible.
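
The per-10,000-token normalization can be illustrated as follows. This sketch assumes one possible aggregation, pooling raw counts over all letters in a subcorpus before dividing by the pooled token total; whether the Explorer pools counts or averages per-letter rates is not stated here. The record layout and the field names ("tokens" and the feature key) are hypothetical.

```python
def rate_per_10k(letters, feature):
    """Pooled rate of one feature per 10,000 tokens across a subcorpus.

    letters -- list of dicts, one per letter, each holding a "tokens"
               count and a raw count stored under the feature's name
    feature -- the (hypothetical) key of the feature count
    """
    total_tokens = sum(letter["tokens"] for letter in letters)
    if total_tokens == 0:
        return 0.0  # an empty subcorpus has no meaningful rate
    total_count = sum(letter[feature] for letter in letters)
    return 10_000 * total_count / total_tokens
```

For instance, two letters of 2,000 and 3,000 tokens containing 30 and 20 occurrences of a feature give a pooled rate of 100 per 10,000 tokens.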