Hidden data: Steganography
November 19, 2005
In last night's show, Charlie plays a key role in capturing the bad guy by detecting hidden data. He does this in two ways: first by finding a hidden pornographic picture inside another, innocent-looking one, and again by detecting a hidden "partition" on a computer hard drive. Today's blog will deal with the pictures. I'll talk about computer drives later in the week.
A digital photo file has two parts. One is a header containing background information about the nature of the image data and how it is to be interpreted by monitors, printers, etc. This information includes what are called "embedded profiles", which are present in most common types of picture files (e.g. TIFF, JPEG, GIF etc.) and are created by the newer cameras and by image-processing software such as Photoshop. They can be used to hide information, but are not the main object of interest here.
What is of most concern is the second part of the file, where the color of each pixel, or dot, is described. This is the vast bulk of the file, since a picture will typically contain an array of at least 1024 x 1024 pixels, or just over 1 million (a "megapixel"). The color of each pixel can be described by saying how much of each of three primary colors (RGB = Red, Green, Blue, for example) the pixel contains. In the simplest case, each pixel uses one byte per color (one byte = a sequence of 8 zeroes or ones, e.g. 01101001). This byte measures the intensity (roughly, the amount) of that color; it translates into a number from 0 to 255. We will not work in binary here, but use decimal notation, since that is what most people are used to. If you know binary, you can easily translate what I'm about to say into that more accurate notation.
As an example, a certain pixel may contain RGB bytes of 81, 153, 14. This means that there's a modest amount of Red, quite a lot of Green, and not much Blue.
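For readers who do want to see the binary side, here is a small Python sketch that translates between the two notations. The variable names are mine, chosen only for illustration.

```python
# The example byte from the text, read as a binary number
print(int("01101001", 2))  # 105

# The example pixel, one byte (8 binary digits) per color
pixel = {"Red": 81, "Green": 153, "Blue": 14}
for color, value in pixel.items():
    print(color, value, format(value, "08b"))
# Red 81 01010001
# Green 153 10011001
# Blue 14 00001110
```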
Now suppose we round "down" all these numbers by making the last digit zero. We then get the new pixel RGB reading of 80, 150 and 10. How much difference would this make in the appearance of the image? The answer is: it depends on the image, its complexity, the accuracy of the device displaying the picture, and the viewer's sensitivity to color. You'd undoubtedly see the difference if you could view the "before" and "after" images in quick succession. However, in a large and complex image, changing each pixel's RGB components by, say, 5 or 6 out of a possible 255 would probably yield a picture that still looked quite realistic.
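The rounding step is easy to try for yourself. This is only a sketch of the arithmetic, with a made-up function name (round_down_last_digit), not anything taken from the show.

```python
def round_down_last_digit(pixel):
    """Set the last decimal digit of each color value to zero."""
    return tuple(value - value % 10 for value in pixel)

original = (81, 153, 14)
rounded = round_down_last_digit(original)
print(rounded)  # (80, 150, 10)

# Largest change made to any one color value of this pixel
print(max(o - r for o, r in zip(original, rounded)))  # 4
```

Rounding down this way can never move a color value by more than 9 out of 255.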
OK, so what's the point of this? The point is that you can play with the last digit (or last binary digit = 1 part out of 255) and not affect the appearance of the image very much. (If you were clever, you might even be able to hide this further by changing some of the parameters in the embedded "header" part of the file, so that the display device would cover up some of what you'd done.)
So here's how to hide a message inside a picture. Say you want to encode a text message made up of combinations of 100 different characters (the alphabet together with a bunch of other symbols like spaces, parentheses, punctuation etc.), each associated with one of the numbers 001, 002, ..., 099, 100. Say the 348th symbol of your message is "D", the symbol associated with 068. Then you go to the 348th pixel of the picture. Say its RGB values are 81, 153 and 14. You replace the last digits by the digits of your symbol, so the new RGB values are 80, 156, and 18. Since there are millions of pixels in the image, you can actually encode quite a long message. Or, you can even encode a whole other image (which is what is done with the pornographic image in last night's show). To reconstruct the hidden message or image, all you need is software to pull out the "last digit" of each of the RGB values in each pixel, and use those digits to reconstruct the symbols they are associated with.
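Here is a minimal Python sketch of that decimal scheme for a single pixel. The function names are mine, and the way it handles a value that would go above 255 (dropping down by 10) is my own assumption; the description above doesn't say what to do in that case.

```python
def embed_symbol(pixel, code):
    """Hide a three-digit code (0 to 999) in the last decimal digits of an RGB pixel."""
    digits = (code // 100, (code // 10) % 10, code % 10)
    hidden = []
    for value, digit in zip(pixel, digits):
        new_value = value - value % 10 + digit
        if new_value > 255:
            # Assumption: 255 with digit 8 would give 258, which does not fit
            # in a byte, so drop down by 10 and keep the same last digit.
            new_value -= 10
        hidden.append(new_value)
    return tuple(hidden)

def extract_symbol(pixel):
    """Read the three last digits back as one code number."""
    r, g, b = pixel
    return (r % 10) * 100 + (g % 10) * 10 + b % 10

stego_pixel = embed_symbol((81, 153, 14), 68)  # 68 stands for "D" in the example
print(stego_pixel)                  # (80, 156, 18)
print(extract_symbol(stego_pixel))  # 68
```

A full message would simply repeat this, one symbol per pixel, and the receiver would run extract_symbol over the pixels in the same order.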
(As I said before, this is all done in the binary system, but the idea is exactly the same, and the apparent change in the image is even less obvious when done in binary than in the example just given.)
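For the curious, the binary version can be sketched like this: each color byte gives up only its lowest binary digit, so no value changes by more than 1. This is the commonly described "least significant bit" approach, not necessarily the one used in the episode, and the names hide_bits and recover_bits are my own.

```python
def hide_bits(pixel_bytes, message):
    """Write each bit of `message` into the lowest bit of successive color bytes."""
    bits = [(byte >> shift) & 1 for byte in message for shift in range(7, -1, -1)]
    if len(bits) > len(pixel_bytes):
        raise ValueError("message too long for this image")
    out = bytearray(pixel_bytes)
    for i, bit in enumerate(bits):
        out[i] = (out[i] & 0xFE) | bit  # clear the lowest bit, then set it
    return bytes(out)

def recover_bits(pixel_bytes, length):
    """Read `length` message bytes back out of the lowest bits."""
    result = bytearray()
    for i in range(length):
        value = 0
        for byte in pixel_bytes[i * 8:(i + 1) * 8]:
            value = (value << 1) | (byte & 1)
        result.append(value)
    return bytes(result)

cover = bytes([81, 153, 14] * 8)  # 8 copies of the example pixel = 24 color bytes
stego = hide_bits(cover, b"Hi!")  # 3 characters need 24 bits
print(recover_bits(stego, 3))                         # b'Hi!'
print(max(abs(a - b) for a, b in zip(cover, stego)))  # 1
```

The price of the smaller change is capacity: one character now takes 8 color bytes instead of a single pixel.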
This idea of hiding a message in an image is very old. The word for it, "steganography", is from the Greek for "hidden writing". The famous caricaturist Al Hirschfeld, who worked for the NY Times, always embedded the name of his daughter, Nina, into each of his sketches -- a benign form of steganography. The technique has also been used by spies and possibly terrorists.
How can you tell if a message has been hidden in a picture? One method is to use some sort of statistical analysis of the patterns formed by the last, or "lower order", digits in the picture file. The patterns formed by the letters of a message, or by the components of a hidden picture, are often different from those formed by a single, simple image. Unfortunately, a clever steganographer will first translate the message into some complicated code (not just the numbers from 1 to 100, say), and then hide this encoded message. These things are very difficult to detect, and people with detection algorithms don't make them known, to avoid tipping off would-be coders about what to avoid.
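To give a feel for what "statistical analysis of the last digits" might mean, here is a toy Python sketch that just counts how often each last digit occurs. It is an illustration under very generous assumptions (a flat patch of a single color, and the decimal scheme described earlier); real detection methods are far more sophisticated, and the pixel values below are made up for the example.

```python
from collections import Counter

def last_digit_histogram(values):
    """Count how often each last decimal digit 0..9 occurs."""
    counts = Counter(value % 10 for value in values)
    return [counts.get(digit, 0) for digit in range(10)]

# Four identical pixels from a flat patch of color
plain = [81, 153, 14] * 4

# The same four pixels after hiding the codes 068, 072, 076 and 065
stego = [80, 156, 18, 80, 157, 12, 80, 157, 16, 80, 156, 15]

print(last_digit_histogram(plain))  # [0, 4, 0, 4, 4, 0, 0, 0, 0, 0]
print(last_digit_histogram(stego))  # [4, 0, 1, 0, 0, 1, 3, 2, 1, 0]
```

In the untouched patch only the digits 1, 3 and 4 ever appear; after hiding, the digits are spread around in a way the original image would not produce.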
What Charlie (and Larry) did last night was to write a computer program that simply extracted the lower order digits of the pixels, played around with them mathematically, and then displayed them as an image. By trial-and-error and educated guesses, they hit upon the method used by the bad guy. As they got closer, the innocent image began to transform itself into the pornographic one.
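I don't know what Charlie's program actually did, but one simple way to "display the lower order digits as an image" is to stretch each byte's lowest binary digit to full black or full white. A sketch, with a function name of my own choosing:

```python
def low_bit_plane(pixel_bytes):
    """Turn each color byte into 255 if its lowest bit is 1, else 0."""
    return bytes(255 if byte & 1 else 0 for byte in pixel_bytes)

print(list(low_bit_plane(bytes([81, 153, 14]))))  # [255, 255, 0]
```

Written back out in the same pixel layout, the result shows whatever structure is sitting in the lowest bits, which is normally invisible to the eye.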
A most interesting episode. Creating and/or detecting clever coding methods is an interesting occupation for creative people with strong mathematical backgrounds. If you're good at it there is no shortage of excellent employment possibilities. It's another example of how people use math every day...
Your blogmeister