6.4. Chi-Squared (\(\chi^2\)) Scoring#

As we’ve seen previously, frequency analysis can be helpful with some ciphers, especially the Caesar cipher. However, trying to arrange characters in frequency order to make a correspondence between plaintext and ciphertext letters is not easy, and we’re no closer to finding a method better that’s than brute force and visual inspection to crack the Affine and Keyword ciphers.

The barchart was helpful when cracking Caesar cipher messages because it allowed us to quickly observe that the ciphertext was in fact enciphered using Caesar, since we could see the A/E, H/I, etc., spikes, and quickly determine what the best possible key would be based on how closely the ciphertext distribution matched the English language distribution. If the ciphertext distribution was shifted by 12, then the key was 12. We’ll build on this idea, comparing the ciphertext to what we’d expect English to look like, but instead of visually inspecting the two distributions, we’ll instead create a scoring method to rate each key used to decipher a ciphertext message. We can then use the score to rank the possible keys and choose the key with the best score.

Developing the Score#

To create a score for each possible key we’ll use a common method in mathematics and statistics, summing squared errors. In statistics, we call the difference between the expected result and the actual result an error. Errors are positive when the actual value is larger than the predicted value, and negative when the actual value is less than the predicted value. In the case of our use, we’ll be using the count of each character in the message as our value. The actual value will be how many times the character appears in our deciphered ciphertext, and the expected value would be how many times we would have expected the character to appear in the deciphered ciphertext based on known letter frequencies.

For example, using the same ciphertext from the Frequency Analysis section:

KRETI JUKRP TUCHI GRDPT UHUJK XUDET IVVKP RIPER EYPWD KHPWO UPTIJ ULKJJ UDOKP TPTUU RDYIL OIHAY ERDER IINWY AUJJR IHWUP EDHWV EHUYE RDWTI JUOKP TRIPT KRCKR KPPIY KPDIO RIRIH PIUEP KPOEY ETIVV KPTIJ UERDP TEPAU ERYMI ALIHP KPTED EZUHL UMPJW HIGRD DIIHJ KSUEZ IHPTI JUZEK RPUDC HUURO KPTEY TKRWW UJJIO VHEYY SRIVK RPTUU FEMPA KDDJU PTUDI IHIZU RUDIR PIEPG VUYTE ZUDTE JJJKS UEPGR RUJEX UHWMI ALIHP EVJUP GRRUJ OKPTI GPYAI SUOKP TZERU JJUDO EJJYE RDLJI IHYPK JUDER DMEHZ UPUDZ HIXKD UDOKP TZIJK YTUDM TEKHY ERDJI PYERD JIPYI LZUCY LIHTE PYERD MIEPY PTUTI VVKPO EYLIR DILXK YKPIH YPTUP GRRUJ OIGRD IRERD IRCIK RCLEK HJWVG PRIPQ GKPUY PHEKC TPKRP IPTUY KDUIL PTUTK JJPTU TKJJE YEJJP TUZUI ZJULI HAERW AKJUY HIGRD MEJJU DKPER DAERW JKPPJ UHIGR DDIIH YIZUR UDIGP ILKPL KHYPI RIRUY KDUER DPTUR IRERI PTUHR ICIKR CGZYP EKHYL IHPTU TIVVK PVUDH IIAYV EPTHI IAYMU JJEHY ZERPH KUYJI PYILP TUYUO EHDHI VUYTU TEDOT IJUHI IAYDU XIPUD PIMJI PTUYS KPMTU RYDKR KRCHI IAYEJ JOUHU IRPTU YEAUL JIIHE RDKRD UUDIR PTUYE AUZEY YECUP TUVUY PHIIA YOUHU EJJIR PTUJU LPTER DYKDU CIKRC KRLIH PTUYU OUHUP TUIRJ WIRUY PITEX UOKRD IOYDU UZYUP HIGRD OKRDI OYJII SKRCI XUHTK YCEHD URERD AUEDI OYVUW IRDYJ IZKRC DIORP IPTUH KXUH

This message is \(994\) characters long. Since we know that \(8.167%\) of English characters are A’s, we would expect around \(994 \cdot 0.08167 = 81.17998\) A’s in the plaintext. Likewise we would expect, \(994 \cdot 0.01492 = 14.83048\) B’s in the plaintext. Continuing these calculations we can generate a table:

Letter

Expected

A

81.17998

B

14.83048

C

27.11631

D

42.27482

Y

19.62156

Z

0.73556

If we attempted to decipher this message using akey = 10 and mkey = 7 and counted the number of each character, we obtain the following actual values:

Letter

Expected

Actual

A

81.17998

18

B

14.83048

0

C

27.11631

16

D

42.27482

62

Y

19.62156

60

Z

0.73556

18

We can see that there’s a difference between what we expected and what actually happened when we deciphered the message with our guessed key values. Our job is to quantify how different these two distributions are. We can start by calculating the error, or difference between actual and expected values. We will subtract expected from actual.

Letter

Expected

Actual

Error

A

81.17998

18

-63.17998

B

14.83048

0

-14.83048

C

27.11631

16

-11.11631

D

42.27482

62

19.72518

Y

19.62156

60

41.37844

Z

0.73556

18

17.25444

That’s still a lot to keep track of, so it may be tempting to sum all the individual error values to get a total error value. However, the sum of the individual errors will always be 0. This is because if there are \(994\) letters in the message, and there are 11 less C’s than expected, those 11 characters are going to appear as “extra” in the count for other letters. An error in one character will always get balanced out by errors in other letters. In the end the total positive errors will completely balance out the total negative errors. One way we can avoid this is by ensuring all errors are positive, so when they are summed the total is guaranteed to be positive. A common way to make values positive is to square them. So for A, the squared error would be \(\left(-63.17998\right)^2 = 3991.7098728\).

Letter

Expected

Actual

Error

Squared Error

A

81.17998

18

-63.17998

3991.7098728

B

14.83048

0

-14.83048

219.94313703

C

27.11631

16

-11.11631

123.572348016

D

42.27482

62

19.72518

389.082726032

Y

19.62156

60

41.37844

1712.17529683

Z

0.73556

18

17.25444

297.715699714

You now have error values for each character that are positive, but comparing between different letters wouldn’t make much sense since the size of the squared error should be considered in context to the expected number of characters.

alt text

After all, having a squared error of 4 for the letter A (actual was 2 off from the expected) means very different things if you were expecting 5 A’s (pretty bad) or 5,000 As (pretty good) in your plaintext. One way to account for this is to divide the squared error by the expected count. This will help normalize the squared error so you can compare the value between different letters even though the letters have different expected counts. So for A, the normalized squared error would be \(3991.7098728 / 81.17998 = 49.1711118037\)

Letter

Expected

Actual

Error

Squared Error

Normalized Squared Error

A

81.17998

18

-63.17998

3991.7098728

49.1711118037

B

14.83048

0

-14.83048

219.94313703

14.83048

C

27.11631

16

-11.11631

123.572348016

4.55712255893

D

42.27482

62

19.72518

389.082726032

9.20365186728

Y

19.62156

60

41.37844

1712.17529683

87.2598966051

Z

0.73556

18

17.25444

297.715699714

404.746995098

Using the normalized squared error scores you can see that the error for A is actually less significant than the error we had for Y, even though A had a higher squared error than Y. We were expecting there to be a lot more A’s than Y’s, so the error of \(\approx 63\) is less significant than the error of \(\approx 41\). The normalized squared error gives an easy way to compare the error for each letter.

Summing all the normalized squared errors yields a single number known as a chi-squared statistic. For this passage of text the chi-squared score would be \(3894.0128\).

More generally written, the formula to calculate this score would be:

\[\chi^2 = \sum_{i=A}^Z \frac{\left(A_i - E_i\right)^2}{E_i} \]

Where \(A_i\) represents the actual number of character \(i\) was in the message and \(E_i\) represents the expected number of character \(i\) in the message.

When this number is very low, it means that there was very little difference between the actual count of each letter and the expected count of each letter. When this number is very high, it means there were many differences. We’ll know that we used a good key value if our deciphered message has a small chi-squared value, since that means there was little difference between the actual and expected values.

Using the Chi-Squared Statistic#

You just went through calculating the chi-squared score for the ciphertext. However, in practice you won’t be concerned with the chi-squared score for the ciphertext, since we already know the ciphertext isn’t the plaintext. Instead, you’ll want to create plaintext candidates by attempting to decipher the ciphertext with all possible keys and score the candidate texts. Only one of the candidates will be the plaintext message, and that candidate will have a relatively low chi-squared score compared to the others. All others should have a relatively high chi-squared score. By scoring every possible candidate you can rank them by their corresponding chi-squared scores and select the candidate that produced the lowest score. The key (or keys) that generate the candidate with the lowest chi-squared score are most likely the same keys that generated the ciphertext message.

Scoring all 312 possible plaintext candidates for this ciphertext generated using the affine cipher yields the following top 25 choices, sorted from lowest chi-squared score to highest:

a-key:  4   m-key: 17   candidate: inaholeinthegroundtherelivedah   chi-squared score:   66.868   
a-key: 20   m-key:  7   candidate: ghulcraghdlaqncyhfdlanargtaful   chi-squared score: 1312.935   
a-key: 20   m-key: 15   candidate: ifstubaifrtaenugflrtanabivalst   chi-squared score: 1423.628   
a-key:  7   m-key: 15   candidate: vsfghonvsegnrahtsyegnanovinyfg   chi-squared score: 1447.688   
a-key: 22   m-key: 25   candidate: mfsdoncmfhdcupoqfthdcpcnmzctsd   chi-squared score: 1449.188   
a-key:  7   m-key: 11   candidate: fivutmnfiwunjathicwunanmfsncvu   chi-squared score: 1737.258   
a-key: 21   m-key:  5   candidate: duhknifduekfrsnxumekfsfidqfmhk   chi-squared score: 1749.117   
a-key:  2   m-key:  1   candidate: ipcrghsipnrsafgepbnrsfshivsbcr   chi-squared score: 1790.433   
a-key: 11   m-key: 25   candidate: buhsdcrbuwsrjedfuiwsrercborihs   chi-squared score: 1950.675   
a-key: 22   m-key: 23   candidate: etgbwnsetlbsyfwotplbsfsnerspgb   chi-squared score: 2062.046   
a-key: 15   m-key:  3   candidate: hsfkpythsaktngpxswaktgtyhutwfk   chi-squared score: 2112.114   
a-key:  7   m-key: 21   candidate: pylifknpyoinbafvygoinankpcngli   chi-squared score: 2121.818   
a-key: 16   m-key: 15   candidate: khuvwdckhtvcgpwihntvcpcdkxcnuv   chi-squared score: 2142.082   
a-key: 25   m-key:  3   candidate: vgtydmhvgoyhbudlgkoyhuhmvihkty   chi-squared score: 2357.962   
a-key: 15   m-key: 25   candidate: fylwhgvfyawvnihjymawvivgfsvmlw   chi-squared score: 2421.259   
a-key:  5   m-key:  3   candidate: terwbkftemwfzsbjeimwfsfktgfirw   chi-squared score: 2521.048   
a-key:  2   m-key: 23   candidate: gvidypugvnduahyqvrnduhupgturid   chi-squared score: 2605.389   
a-key: 21   m-key: 15   candidate: bylmnutbykmtxgnzyekmtgtubotelm   chi-squared score: 2684.794   
a-key: 23   m-key:  9   candidate: nivohkrnicorpehbiscorerknarsvo   chi-squared score: 2724.996   
a-key: 23   m-key: 23   candidate: ncpkfwbncukbhofxcyukbobwnabypk   chi-squared score: 2936.705   
a-key: 19   m-key: 23   candidate: dsfavmrdskarxevnsokarermdqrofa   chi-squared score: 2952.532   
a-key: 16   m-key: 21   candidate: wfspmruwfvpuihmcfnvpuhurwjunsp   chi-squared score: 3038.804   
a-key:  6   m-key: 11   candidate: ybonmfgybpngctmabvpngtgfylgvon   chi-squared score: 3083.875   
a-key:  7   m-key: 19   candidate: hgtclwnhgkcnxalpgikcnanwhunitc   chi-squared score: 3136.088   
a-key:  4   m-key:  3   candidate: cnafktocnvfoibksnrvfobotcporaf   chi-squared score: 3247.261   

You can see that the additive key of 4 and multiplicative key of 17 produced the lowest chi-squared score by far, and the corresponding candidate text appears to contain English. This technique works for any mono-alphabetic cipher (Caesar, Multiplicative, Affine, Keyword, etc.) as long as generating all possible candidates is feasible. Given their relatively small number of keys, this methods works very well for Caesar and Affine ciphers. Using this for a keyword cipher is particularly challenging since, as previously mentioned, there are \(26!\) possible keys for a Keyword cipher, which would take even a modern computer far too long to create and score all possible candidates.

Exercise for the Reader#

Can you write functions to:

  • Count how many letters there are in a string and return a list of the count in order from A to Z?

  • Compute how many letters were expected in a string and return a list of the expected values in order from A to Z?

  • Use the first two functions to compute the chi-squared score for a passage of text?