Appendix E

Character Encodings

Appendix D, “Color Names and Values,” discusses how computers store information and explains that a character-encoding scheme is a table that translates between characters and the way they are stored in the computer.

The most common character set (or character encoding) in use on computers is probably the American Standard Code for Information Interchange (ASCII). You can expect all computers browsing the web to understand ASCII.

The problem with ASCII is that it supports only the uppercase and lowercase Latin alphabet, the digits 0–9, and some punctuation and other symbols: a total of 128 characters, of which 95 are printable. Table E-1 lists the printable characters of ASCII. (The remaining characters are control characters such as line feeds and carriage returns.)

Table E-1: Printable Characters of ASCII

[Table E-1 is an image in the original and is not reproduced here.]
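
The printable ASCII characters can also be generated rather than looked up. The following is a minimal Python sketch, assuming the usual convention that the printable range runs from code 32 (space) through code 126 (tilde); the variable names are illustrative only.

```python
# Printable ASCII runs from code 32 (space) through code 126 (tilde).
printable = [chr(code) for code in range(32, 127)]

# Codes 0-31 and 127 are control characters such as line feed (10)
# and carriage return (13).
control_codes = [code for code in range(128) if code < 32 or code == 127]

print("".join(printable))
print(len(printable), "printable,", len(control_codes), "control")
```

Together the two groups account for all 128 ASCII codes.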

However, many languages use either accented Latin characters or completely different alphabets. ASCII does not address these characters, so you need to learn about character encodings if you want to use any non-ASCII characters.

Character encodings are also important if you want to use symbols (from some dashes to some quotation mark characters) because these cannot be guaranteed to transfer properly between different encodings. If you do not indicate the character encoding the document is written in, some of these special characters might not display correctly.
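
To see what "not transferring properly" can look like, here is a small Python sketch. It assumes text that was saved as UTF-8 but is then read as if it were ISO-8859-1; both encodings are chosen only for illustration, and any mismatched pair can produce similar damage.

```python
# The word "cafe" with an acute accent on the e (U+00E9),
# written with an escape so this source stays pure ASCII.
original = "caf\u00e9"

utf8_bytes = original.encode("utf-8")  # b'caf\xc3\xa9'

# Misreading the UTF-8 bytes as ISO-8859-1 turns one character into two.
garbled = utf8_bytes.decode("iso-8859-1")

# Decoding with the encoding that was actually used recovers the text.
restored = utf8_bytes.decode("utf-8")

print(repr(garbled), repr(restored))
```

The bytes are identical in both cases; only the declared encoding differs, which is why stating the encoding matters.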

The International Organization for Standardization (ISO) created a range of character sets to deal with different national characters, as shown in Table E-2. ISO-8859-1 is commonly used in Western versions of authoring tools such as Adobe Dreamweaver, as well as applications such as Windows Notepad.

Table E-2: ISO Character Sets

| Character Set | Description |
| --- | --- |
| ISO-8859-1 | Latin alphabet part 1. Covering North America, Western Europe, Latin America, the Caribbean, Canada, and Africa |
| ISO-8859-2 | Latin alphabet part 2. Covering Eastern Europe including Bosnian, Croatian, Czech, Hungarian, Polish, Romanian, Serbian (in Latin transcription), Serbo-Croatian, Slovak, Slovenian, Upper Sorbian, and Lower Sorbian |
| ISO-8859-3 | Latin alphabet part 3. Covering SE Europe, Esperanto, Maltese, Turkish, and miscellaneous others |
| ISO-8859-4 | Latin alphabet part 4. Covering Scandinavia/Baltics (and others not in ISO-8859-1) |
| ISO-8859-5 | Latin/Cyrillic alphabet part 5 |
| ISO-8859-6 | Latin/Arabic alphabet part 6 |
| ISO-8859-7 | Latin/Greek alphabet part 7 |
| ISO-8859-8 | Latin/Hebrew alphabet part 8 |
| ISO-8859-9 | Latin 5 alphabet part 9 (same as ISO-8859-1 except Turkish characters replace Icelandic ones) |
| ISO-8859-10 | Latin 6. Lappish, Nordic, and Eskimo |
| ISO-8859-15 | Latin 9. The same as ISO-8859-1 except that a few rarely used symbols are replaced by the euro sign and some additional letters |
| ISO-8859-16 | Latin 10. Covering SE Europe, Albanian, Croatian, Hungarian, Polish, Romanian, and Slovenian; can also be used for French, German, Italian, and Irish Gaelic |
| ISO-2022-JP | Latin/Japanese alphabet part 1 |
| ISO-2022-JP-2 | Latin/Japanese alphabet part 2 |
| ISO-2022-KR | Latin/Korean alphabet part 1 |

It is helpful to note that the first 128 characters of ISO-8859-1 match those of ASCII, so you can safely use those characters as you would in ASCII.
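
This overlap is easy to confirm. The sketch below is an illustration in Python, using the codec names `ascii` and `iso-8859-1` from its standard library; it decodes every ASCII byte value both ways and compares the results.

```python
ascii_bytes = bytes(range(128))  # every ASCII code, 0 through 127

as_ascii = ascii_bytes.decode("ascii")
as_latin1 = ascii_bytes.decode("iso-8859-1")

# The first 128 characters are the same in both character sets.
same = as_ascii == as_latin1
print(same)

# Byte values 128-255 are valid ISO-8859-1 but are not ASCII at all.
try:
    bytes([0xE9]).decode("ascii")
    high_byte_is_ascii = True
except UnicodeDecodeError:
    high_byte_is_ascii = False
print(high_byte_is_ascii)
```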

The Unicode Consortium was then set up to devise a single way to represent the characters of all languages, rather than having these different, incompatible character encodings for different languages.

Therefore, if you want to create documents that use characters from multiple character sets, you can do so using a single Unicode character encoding. Furthermore, users can view documents written in different character sets, provided their processor (and fonts) support the Unicode standard, no matter what platform they are on or which country they are in. Having a single character encoding can also reduce software development costs because programs do not need to be designed to support multiple character encodings.

One problem with Unicode is that a lot of older programs were written to support only 8-bit character sets (limiting them to 256 characters), which is nowhere near the number required for all languages.

Unicode therefore specifies encodings that represent each character as one or more fixed-size units, which makes enough space for the huge character set it encompasses. These are known as UTF-8, UTF-16, and UTF-32, as shown in Table E-3.

Table E-3: Unicode Character Sets

| Character Set | Description |
| --- | --- |
| UTF-8 | A Unicode Transformation Format that comes in 8-bit units. That is, it comes in bytes. A character in UTF-8 can be from 1 to 4 bytes long, making UTF-8 a variable-width encoding. |
| UTF-16 | A Unicode Transformation Format that comes in 16-bit units. That is, it comes in shorts. A character can be 1 or 2 shorts long, making UTF-16 a variable-width encoding. |
| UTF-32 | A Unicode Transformation Format that comes in 32-bit units. That is, it comes in longs. It is a fixed-width encoding in which every character is exactly 1 “long” in length. |
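
The widths described in Table E-3 can be measured directly. The following Python sketch is only an illustration: the sample characters and the helper name `widths` are assumptions chosen for this example, and the little-endian codec variants are used so that no byte order mark is counted.

```python
samples = {
    "A": "A",                # U+0041, in ASCII
    "e-acute": "\u00e9",     # U+00E9, in ISO-8859-1
    "euro": "\u20ac",        # U+20AC
    "emoji": "\U0001F600",   # U+1F600, beyond the 16-bit range
}

def widths(ch):
    """Return the (UTF-8, UTF-16, UTF-32) sizes in bytes of one character."""
    # The -le variants are used so Python does not prepend a byte order mark.
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )

for name, ch in samples.items():
    print(name, widths(ch))
```

UTF-8 grows from 1 to 4 bytes across these samples, UTF-16 uses one 16-bit unit (2 bytes) until the last sample needs two units (4 bytes), and UTF-32 is always 4 bytes.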

The first 256 characters of Unicode correspond to the 256 characters of ISO-8859-1.
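
Note that this correspondence is about character numbering, not about the bytes that end up in a file. A short Python sketch, offered as an illustration rather than a definitive test, shows both halves of that statement:

```python
# Each ISO-8859-1 byte value decodes to the Unicode character with the same number.
matches = all(
    ord(bytes([value]).decode("iso-8859-1")) == value
    for value in range(256)
)
print(matches)

# The numbering matches, but the stored bytes need not:
# UTF-8 uses two bytes for characters 128 through 255.
e_acute = "\u00e9"
latin1_bytes = e_acute.encode("iso-8859-1")  # b'\xe9'
utf8_bytes = e_acute.encode("utf-8")         # b'\xc3\xa9'
print(latin1_bytes, utf8_bytes)
```

So an ISO-8859-1 file that contains accented characters is not automatically a valid UTF-8 file, even though the characters share the same numbers.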

By default, HTML 4 processors should support UTF-8, and XML processors are supposed to support UTF-8 and UTF-16; therefore, all XHTML-compliant processors should also support UTF-16 (because XHTML is an application of XML). The HTML5 specification is strongly biased toward UTF-8.

In practice you almost always want to use UTF-8.
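
One way to follow that advice is to declare UTF-8 in the page and to save the page in the same encoding. The sketch below uses Python to build and encode a small page; the page content is hypothetical, and `<meta charset="utf-8">` is the HTML5 form of the declaration.

```python
# A tiny HTML5 page that declares its own encoding (content is made up).
page = (
    "<!DOCTYPE html>\n"
    "<html>\n"
    "<head>\n"
    '  <meta charset="utf-8">\n'
    "  <title>Caf\u00e9 menu</title>\n"
    "</head>\n"
    "<body><p>Price: 3 \u20ac</p></body>\n"
    "</html>\n"
)

# Encode with the same encoding the page declares, so the two agree.
page_bytes = page.encode("utf-8")

# A reader that honors the declaration gets the original text back.
round_trip = page_bytes.decode("utf-8")
print(round_trip == page)
```

The declaration and the actual encoding of the saved bytes must match; declaring UTF-8 while saving in another encoding reintroduces the display problems described earlier.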

For more information on internationalization and different character sets and encodings, see www.i18nguy.com and the article “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know about Unicode and Character Sets (No Excuses!)” at www.joelonsoftware.com/articles/Unicode.html.
