Character sets and encodings

ASCII, ISO or Unicode

The computer age started mainly in the USA in the fifties. For the first decade or so computers were hand-built, and only slowly did the need arise to exchange data. You can’t do that unless you have an agreement on how letters are encoded into the bits used by computers. Therefore, in 1963, the ASCII coding standard was published. ASCII encodes 95 printable characters in seven bits; the only letters among them are the 26 used in English, in lower and upper case. Very little accommodation was made for international use. The bracket and brace characters of ASCII were assigned to “national use” code points that held accented letters in the other national variants of ASCII, so a German, French or Swedish programmer had to get used to reading and writing ä aÄiÜ='Ön'; ü instead of { a[i]='\n'; }.

In retrospect one can see that ASCII had a harmful effect on the English language. Certain words, such as “naïve”, couldn’t be spelt correctly because the ‘ï’ wasn’t available, and as a result millions of people now think the word is spelt “naive”. Likewise, the em dash ‘―’ has become two hyphens ‘--’, and the dagger ‘†’ and double dagger ‘‡’ for footnotes have fallen out of use.

As computers spread to Europe, this state of affairs became unworkable. Therefore, around 1985, a set of ISO standards was published that uses the upper half of the now common 8-bit byte to hold additional characters. This was not enough to contain all the accented letters used in Europe, so the ISO standards grouped several countries together. Western Europe (all the early adopters) was given ISO-8859-1, while Central Europe was given ISO-8859-2. Since the second group was on the other side of the Iron Curtain at the time, it wasn’t seen as important to ensure interoperability.

Now that the former Eastern bloc is in the European Union, it is negligent to continue with ISO-8859-1. We need a coding standard that can embrace all diacritical marks and all three alphabets in use in the EU (Latin, Greek and Cyrillic). We need Unicode.

Unicode in detail

The Unicode standard is designed as a superset of all other character set standards. It contains the characters required to represent practically all known languages. Since it contains so many characters, storing each one at a fixed width takes 32 bits, which is a waste of space if you only use the European alphabets. This is where encodings come in.

Unicode comes in many encodings: UTF-7, UTF-8, UTF-16LE, UTF-16BE, etc. UTF-8 is the one most widely supported in operating systems and web browsers at the moment.
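To make the space argument concrete, here is a small Python sketch that encodes the same word in several Unicode encodings and counts the bytes. The helper name encoded_sizes and the sample word are chosen for illustration only.

```python
def encoded_sizes(text: str) -> dict:
    """Return the number of bytes the text needs in several Unicode encodings."""
    codecs = ("utf-7", "utf-8", "utf-16-le", "utf-16-be", "utf-32-le")
    return {codec: len(text.encode(codec)) for codec in codecs}


# "naïve" has five characters, one of which lies outside ASCII.
sizes = encoded_sizes("naïve")
for codec, size in sizes.items():
    print(f"{codec:10} {size:3d} bytes")
```

For mostly-Latin text the fixed-width 32-bit form is the largest, while UTF-8 stays close to one byte per character.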

UTF-8 has the following properties: Unicode characters U+0000 to U+007F (ASCII) are encoded simply as the bytes 0x00 to 0x7F (ASCII compatibility). This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8. All Unicode characters above U+007F are encoded as a sequence of several bytes. The letters of the European alphabets occupy the lower end of the code space and can nearly all be written with at most two bytes, although some symbols, such as the euro sign ‘€’ (U+20AC), need three. But since the old axiom that every character occupies exactly one byte is broken, you have to ensure your software understands UTF-8.
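These properties are easy to check in Python. The sketch below is only an illustration; the helper utf8_length and the sample characters are assumptions of this example.

```python
def utf8_length(ch: str) -> int:
    """Return how many bytes one character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


# Pure ASCII text is byte-for-byte identical in ASCII and UTF-8.
ascii_text = "plain ASCII"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print("ASCII compatible:", same_bytes)

# Latin, Greek and Cyrillic letters, followed by the euro sign.
for ch in ("A", "ï", "ž", "Ĕ", "Ω", "Я", "€"):
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} {ch} -> {encoded.hex(' ')} ({utf8_length(ch)} bytes)")

# The character count and the byte count are no longer the same thing.
word = "lyžovanie"
print(len(word), "characters,", len(word.encode("utf-8")), "bytes")
```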

Workarounds

All web browsers and office suites work with Unicode today. But old file formats (Shapefile), old database systems (MySQL 4) and old tools still exist that don’t understand multibyte characters. If your system doesn’t understand Unicode yet, there are workarounds.

In some cases you can just ignore the multibyte issue and check that field size limits don’t cut a multibyte character in half. That is probably the preferred approach, but you might have some issues with sorting.
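A minimal sketch of such a check in Python, assuming the field limit is counted in bytes and the data is UTF-8. The function name truncate_utf8 is hypothetical.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Shorten text so its UTF-8 form fits in max_bytes without splitting a character."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # Cutting the byte string may leave an incomplete sequence at the end;
    # errors="ignore" drops those trailing bytes instead of raising an error.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


# 'ž' takes two bytes, so a three-byte limit must stop before it.
print(truncate_utf8("lyžovanie", 3))
print(truncate_utf8("lyžovanie", 4))
```

The result can be shorter than the limit by up to three bytes, since a whole character is dropped rather than a part of one.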

Another approach is to use character entities. That is when you convert all characters that don’t exist in ASCII into &-escaped entities. It can be done in various ways.
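One possible way, sketched in Python, uses the standard xmlcharrefreplace error handler to turn every non-ASCII character into a numeric entity. The function name to_ascii_entities is an assumption of this example, and note that it leaves ‘<’ and ‘&’ untouched.

```python
import html


def to_ascii_entities(text: str) -> str:
    """Replace every non-ASCII character with a numeric character entity."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


escaped = to_ascii_entities("lyžovanie")
print(escaped)

# A browser does the reverse; html.unescape shows the same round trip.
print(html.unescape(escaped))
```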

One benefit is that it doesn’t matter what encoding the web browser thinks you’re using. Since you now send everything in ASCII, and ASCII is a subset of most relevant encodings, it just works. The web browser works internally in Unicode and therefore unescapes the entities for its internal representation.

There are two ways you can do the escaping.

  1. The first option is to escape between getting the data from the database (in ISO-8859-1) and sending it to the user’s web browser. Sorting in the database will work correctly, but you won't be able to store characters like ‘Ĕ’ in the database.

  2. The second option is to store the escaped text in the database. Now you can store everything at the expense of some space, but sorting might be a problem. ‘Ĕ’ will be stored as &#276; and this sorts before ‘A’. Certain word-oriented search mechanisms might also break; e.g. “lyžovanie” (skiing) will be seen as “ly&#382;ovanie” and therefore not as one word.

    But remember: once you have started to put escaped text in the database, you can’t escape again when displaying. The characters < and & must therefore also be stored in the database in escaped form. Otherwise it is possible to enter text like <img src="http:crack.cc..."/>, and your application is now a conduit for cross-site scripting.
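The second option could look roughly like the Python sketch below. It is an illustration under stated assumptions, not a complete defence against cross-site scripting: the function name escape_for_storage and the sample strings are hypothetical, and the markup characters are escaped before the non-ASCII ones so that the ‘&’ of the generated entities is not escaped a second time.

```python
import html


def escape_for_storage(text: str) -> str:
    """Escape markup characters first, then turn non-ASCII characters into entities."""
    safe = html.escape(text, quote=True)  # handles &, <, > and quote characters
    return safe.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


stored = escape_for_storage('<img src="x"/> & Ĕ')
print(stored)

# The sorting problem: an escaped entry starts with '&', which sorts before 'A'.
names = sorted(["Zebra", "Apple", escape_for_storage("Ĕcho")])
print(names)
```

Text stored this way must be written to the page as it is, without a second round of escaping.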

References