Sep 26, 2012

Character recognition, the simple way

Recently I solved a small problem and found it amusing enough to write a post about.

I had a PDF document with numbers in it (ones and zeros). I needed the numbers for a program, but they were embedded as a scanned picture instead of text. Copying them by hand would have been boring and error-prone, and I didn't want any typos, seeing as the numbers themselves were supposed to be part of an error-correcting code.

Then I thought: perhaps Perl could do this for me! I came up with this:

Running the script produces:

    2206 windy@pentti~/koodi/redsea/test ) perl bitit.pl
    1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 1 1 1 0x2000077
    0 1                           0 1 0 1 1 1 0 0 1 1 1 0x10002e7
    0   1                         0 1 1 1 0 1 0 1 1 1 1 0x08003af
    0     1                       0 1 1 0 0 0 0 1 0 1 1 0x040030b
    0       1                     0 1 1 0 1 0 1 1 0 0 1 0x0200359
    0         1                   0 1 1 0 1 1 1 0 0 0 0 0x0100370
    0           1                 0 0 1 1 0 1 1 1 0 0 0 0x00801b8
    0             1               0 0 0 1 1 0 1 1 1 0 0 0x00400dc
    0               1             0 0 0 0 1 1 0 1 1 1 0 0x002006e
    0                 1           0 0 0 0 0 1 1 0 1 1 1 0x0010037
    0                   1         0 1 0 1 1 0 0 0 1 1 1 0x00082c7
    0                     1       0 1 1 1 0 1 1 1 1 1 1 0x00043bf
    0                       1     0 1 1 0 0 0 0 0 0 1 1 0x0002303
    0                         1   0 1 1 0 1 0 1 1 1 0 1 0x000135d
    0                           1 0 1 1 0 1 1 1 0 0 1 0 0x0000b72
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 1 1 1 0 0 1 0x00005b9

How does it do that? The script reads the image pixel by pixel via ImageMagick, divides it into squares, and counts the black pixels in every square. A "0" obviously has more black pixels than a "1"; I found the threshold by experimenting. An empty square has nearly no black pixels at all and stands for a zero in this example. Computing a hex value for every row is then simple: the bits of the row are read as one binary number.
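For the curious, here is roughly what that idea looks like in code. This is a minimal Python sketch of the same approach, not the Perl script itself. It assumes the image has already been loaded as a list of rows of gray values (0 = black, 255 = white), that the digits sit on a regular grid of square cells, and that two cut-offs are enough to tell empty, "1" and "0" apart. The names `read_bits`, `row_to_hex`, `cell`, `empty_max`, `one_max` and `dark` are made up for illustration.

```python
def read_bits(pixels, cell, empty_max, one_max, dark=128):
    """Return one list of bits per row of cells.

    pixels    -- list of rows of gray values, 0 = black, 255 = white (assumed)
    cell      -- side length of one square, in pixels (assumed regular grid)
    empty_max -- at most this many dark pixels: the square is empty
    one_max   -- at most this many dark pixels: the square holds a "1"
    dark      -- gray values below this count as black (assumed cut-off)
    """
    rows = []
    for top in range(0, len(pixels) - cell + 1, cell):
        bits = []
        for left in range(0, len(pixels[0]) - cell + 1, cell):
            # Count the dark pixels inside this square.
            black = sum(
                1
                for y in range(top, top + cell)
                for x in range(left, left + cell)
                if pixels[y][x] < dark
            )
            if black <= empty_max:
                bits.append(0)  # empty square, treated as a zero
            elif black <= one_max:
                bits.append(1)  # little ink: a "1"
            else:
                bits.append(0)  # more ink: a "0"
        rows.append(bits)
    return rows


def row_to_hex(bits):
    """Read a row of bits as one binary number, most significant bit first."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return f"0x{value:07x}"  # 7 hex digits hold the 26 bits of a row


# The first row of the output above: a one, 18 zeros, then 1 1 1 0 1 1 1.
print(row_to_hex([1] + [0] * 18 + [1, 1, 1, 0, 1, 1, 1]))  # 0x2000077
```

The two cut-offs depend entirely on the scan and have to be found by trial and error, for example by printing the raw counts for a few squares first.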

I ended up having to write only slightly more characters than the image contained! :)