Friday, March 29, 2019

Character Sets

It has occurred to me that some confusions would never mislead a couple of classes of programmer: the one who started about 1970 and retired about 1995; the one who will start in 2025. The first will have spent a career in a world where every character may be represented in an 8-bit byte, most of them in only 7 bits, in so-called US ASCII. (Yes, I'm ignoring IBM and the world of EBCDIC.) The second will have a career in which Unicode is everywhere, and nobody confuses bytes and characters. But perhaps I should date the latter's career from 2025, or later.

For the first programmer the bit pattern 01000001 (decimal 65) meant 'A'. It was the sole representation for 'A'. For the second programmer, code point 65 is 'A', but code point 65 may be represented in many ways: identical to US ASCII if the Unicode representation is UTF-8; with an all-zero byte preceding or following in the UTF-16 representations; for all I know, whistled on a bosun's pipe or carved on a rock. The second programmer knows that always and everywhere there is a distinction between bytes and characters, even if a particular encoding maps them one-to-one.
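
To make the distinction concrete, here is a small Python sketch (Python is my choice for illustration, and the variable names are made up for it). It shows the single code point 65 turning into different byte sequences depending on the encoding chosen.

```python
# One character, one code point, several byte representations.
text = "A"
code_point = ord(text)  # the code point is 65 regardless of encoding

# Encode the same one-character string four different ways.
as_ascii = text.encode("ascii")         # one byte, 0x41
as_utf8 = text.encode("utf-8")          # identical to US ASCII here
as_utf16_be = text.encode("utf-16-be")  # all-zero byte first
as_utf16_le = text.encode("utf-16-le")  # all-zero byte last

for name, data in [
    ("ascii", as_ascii),
    ("utf-8", as_utf8),
    ("utf-16-be", as_utf16_be),
    ("utf-16-le", as_utf16_le),
]:
    # bytes.hex() prints the raw bytes without interpreting them as text
    print(f"{name:10} {data.hex()}")
```

The string is the same in every case; only the bytes differ.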

I live in the uneasy world betwixt and between. I know that a string is made up of characters, and a byte sequence of bytes, but the distinction is not generally in my thoughts. Most of the time, I can pretend strings and byte sequences are the same. Now and then, I can't.

Recently I have been helping a co-worker by pulling data from the internet and using scripts to transform it. Today one of the scripts failed with an error: it encountered the byte 0x95, which cannot stand on its own in UTF-8 (it is valid there only as a continuation byte within a multi-byte sequence). As it happens, 0x95 represents a bullet in the code page CP-1252. I had a look at the headers of the web page, and was not greatly surprised to discover that they included "charset=UTF-8". Changing the script to read the page as CP-1252 resolved the difficulty.
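
The failure and the fix can be reproduced in a few lines of Python. This is only a sketch, not the script in question: the sample bytes and the helper name decode_page are invented for illustration, and the fallback to CP-1252 is an assumption that happened to suit this page rather than a general rule.

```python
def decode_page(raw: bytes, declared: str = "utf-8", fallback: str = "cp1252") -> str:
    """Decode raw bytes with the declared charset; on failure, try a fallback.

    Hypothetical helper: the fallback encoding is a guess about what the
    page really contains, not something the page itself tells us.
    """
    try:
        return raw.decode(declared)
    except UnicodeDecodeError:
        return raw.decode(fallback)


# Invented sample: a CP-1252 bullet (0x95) followed by ASCII text.
raw = b"\x95 first item"

try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    # 0x95 is a continuation byte, so it cannot begin a UTF-8 sequence.
    utf8_failed = True

text = decode_page(raw)
print(utf8_failed, text)
```

Decoded as CP-1252, the byte 0x95 becomes the code point U+2022, the bullet. A fallback like this papers over the mismatch between the announced and the actual character set; it does not detect which encoding the data is really in.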

This was not the first time that I have encountered such a difficulty. The word has gone out among developers--of web frameworks, at least--that one should always announce the character set. The users of the frameworks, though, haven't always heard that one should ensure that the data provided is in the character set announced.
