2007-02-09

Cryptanalysis and histograms

I've been interested in cryptanalysis and encryption for a long time. I can remember using PGP in the early 90's, my little Zenith laptop taking literally hours to create puny 256-bit keys, and being utterly fascinated by the whole process.

One way that cryptography is analyzed is letter frequency counting. There are certain characters that show up more frequently than others depending on what natural language is used. In letter substitution and other simple encryption schemes, creating a histogram of some random samples of that language (for my sake, we'll assume this is English), one can take that and lay it over the histogram of ciphertext, finding letters that are more common in the ciphertext and in the language sample. Analysts begin substituting letters that peak the charts, and eventually may come up with a partial decryption that's complete enough to go trial-and-error.

Histograms are just neat, though. Histograms of color data in photos are cool, too. Loosely speaking, histograms are bar graphs of frequency data.

Here is a histogram of the first chapter of Genesis (KJV):


On the far right are control characters such as carriage returns. The big vertical bar is the frequency count of spaces. To the left of that are numbers, then capital letters, punctuation, and finally lowercase letters toward the middle of the histogram. It's easy to tell that this is a text file from the histogram.

Next I'll show you a binary executable file, next to a GIF image. You can see these are binary files with a lot of "nulls" on the far right, and lots of high-bit characters on the left. They look nothing like a text file.



The beauty of a really good random one-time pad as I'd discussed in Wednesday's post becomes evident. On the left is a histogram of a really good random pad file. On the right is the first chapter of genesis (used in our first example) XORed against the random pad.



The entropy is evident!

I have an implementation of the newLISP one-time pad tool that I'll write about this weekend.

I used PHP/GD to create the histograms used in this article. Here's the quick-and-dirty histogram generator code I made:

<?php
####
# 256x256 byte frequency histogram maker
# by ax0n
###
header("Content-type: image/png"); // Tell browser to expect a PNG
if(! isset($_GET['file'])){
$filename="data";
}else{
$filename=$_GET['file'];
}
$im=imagecreate(256,256); // allocate image resource
$background_color = imagecolorallocate($im, 0, 0, 0);
$text_color = imagecolorallocate($im, 233, 14, 91);
$handle = fopen($filename, "r");
$contents = fread($handle, filesize($filename));
fclose($handle);
$histarray=(count_chars($contents, 1)); // built the frequency array
foreach ($histarray as $i => $val) {
$graph=($val/max($histarray))*256; // Normalize the values
imageline($im,$i,1,$i,$graph,$text_color); // Draw chart
}
$im=imagerotate($im, 180, 0); // make bars go from down to up (cheap hack)
imagepng($im); // dump image to browser
imagedestroy($im); // release image resource
?>

blog comments powered by Disqus