Character Encoding - ASCII, ISO-8859-1, UTF-8, UTF-16

Character encoding is a way of mapping the characters in a set to numbers, called code points, and those numbers to bytes, so that text can be stored and transmitted.

ASCII is one of the oldest encoding schemes and is still found in legacy systems. Because ASCII is a 7-bit encoding (128 code points), it supports only the unaccented English alphabet, digits, punctuation marks, and control characters.

As computers became more widely used, 8-bit encodings such as the ISO/IEC 8859 family (including ISO-8859-1, also called Latin-1) and Windows-1252 extended ASCII to 256 code points and added support for accented European characters. All of these encodings assign the first 128 code points exactly as ASCII does. They differ in how they assign the remaining 128 code points.

Many encodings were later introduced to cover more of the world's writing systems. Among them, Unicode, which is kept in sync with ISO/IEC 10646 (UCS), has gained the widest adoption. Unicode now covers nearly all of the world's writing systems in current use as well as many other symbols. It is backwards compatible with ISO-8859-1 and ASCII: the first 256 code points match ISO-8859-1, and the first 128 match ASCII. Unicode was originally designed as a 16-bit scheme (65,536 code points), and its code space has since grown to U+0000 through U+10FFFF. For English documents, storing every character in 16 bits is wasteful: it takes twice the space needed for ISO-8859-1.

To mitigate this issue, a UCS transformation format called UTF-8 was created. UTF-8 is a variable-length encoding. ASCII characters are encoded as the same single bytes, so a UTF-8 encoded English document is byte-for-byte identical to the same document encoded in ASCII. All other Unicode characters take 2 to 4 bytes each. (The original design allowed sequences of up to 6 bytes, but RFC 3629 limits UTF-8 to 4 bytes, which is enough to reach U+10FFFF.)
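
As a small illustration of the variable-length behavior, the sketch below prints the UTF-8 bytes of a few sample characters. It assumes a runtime that provides the standard TextEncoder (current browsers and Node.js); the helper name utf8Hex is made up for this example.

```javascript
// TextEncoder always produces UTF-8.
const utf8Encoder = new TextEncoder();

// Hypothetical helper: return the UTF-8 bytes of `text` as hex strings.
function utf8Hex(text) {
  return Array.from(utf8Encoder.encode(text), (b) =>
    b.toString(16).toUpperCase().padStart(2, "0")
  );
}

// "A", a-umlaut, the euro sign, and an emoji outside the 16-bit range.
const samples = ["A", "\u00e4", "\u20ac", "\u{1F600}"];
for (const ch of samples) {
  console.log(ch, utf8Hex(ch).join(" "));
}
// A 41
// ä C3 A4
// € E2 82 AC
// 😀 F0 9F 98 80
```

The ASCII letter keeps its single-byte ASCII value, while the other samples take two, three, and four bytes.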

Surrogate Characters

To accommodate the ever-growing demand for code points, especially for Chinese characters, Unicode was extended beyond 16 bits and the UTF-16 transformation was created. In UTF-16, a supplementary character (U+10000 through U+10FFFF) is represented by a surrogate pair: two 16-bit code units called the high surrogate (D800–DBFF) and the low surrogate (DC00–DFFF).

The JavaScript formulas for converting between a supplementary code point S and its high (H) and low (L) surrogates are:
S = 0x10000 + (H - 0xD800) * 0x400 + (L - 0xDC00);
H = Math.floor((S - 0x10000) / 0x400) + 0xD800;
L = ((S - 0x10000) % 0x400) + 0xDC00;
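
The formulas can be wrapped into two small functions and checked against a known character. This is only a sketch: the function names toSurrogates and fromSurrogates are invented for the example, and U+1F600 (an emoji) is used as the test value.

```javascript
// Split a supplementary code point (U+10000..U+10FFFF) into a surrogate pair.
function toSurrogates(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("not a supplementary code point");
  }
  const offset = codePoint - 0x10000;
  return {
    high: Math.floor(offset / 0x400) + 0xD800,
    low: (offset % 0x400) + 0xDC00,
  };
}

// Combine a high and a low surrogate back into the code point.
function fromSurrogates(high, low) {
  return 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00);
}

const pair = toSurrogates(0x1F600);
console.log(pair.high.toString(16), pair.low.toString(16));   // d83d de00
console.log(fromSurrogates(pair.high, pair.low).toString(16)); // 1f600
```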

Character Encoding in HTML

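Browsers determine the encoding of an HTML document in a fixed order. A byte order mark at the start of the file takes precedence, then the charset parameter of the HTTP Content-Type header, then a meta charset declaration in the document itself. If none of these is present, the browser falls back to a default, which for Western locales is typically Windows-1252 (browsers treat the label ISO-8859-1 as Windows-1252). For more details, see http://www.w3.org/QA/2008/03/html-charset.html.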

To declare the character encoding of an HTML document, add one of the following to the head section:
<meta charset="UTF-8"> for HTML5 and
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8"> for HTML4 and XHTML. For other character encodings, replace UTF-8 with the desired encoding. No meta tag is necessary for UTF-16 documents, because the byte order mark at the start of the file identifies the encoding.

A common encoding error on websites is a right single quotation mark ’ (U+2019) that displays as â€™. This happens when a page stored as UTF-8 is decoded as ISO-8859-1 (in practice Windows-1252): U+2019 is not part of the ISO-8859-1 character set, and each of its three UTF-8 bytes is shown as a separate character. There are two ways of resolving this issue. One is to use an apostrophe ' (0x27) in place of the right single quotation mark. The other is to declare and serve the page as UTF-8.
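
The garbling can be reproduced in a few lines. The sketch below encodes U+2019 as UTF-8 and then decodes the bytes one at a time the way a Windows-1252 reader would. The decoder is a hypothetical stand-in that knows only the two Windows-1252 bytes this demo needs (0x80 and 0x99); it is not a complete Windows-1252 implementation.

```javascript
// Partial Windows-1252 table (assumption: only the bytes used in this demo).
const CP1252_EXTRAS = {
  0x80: "\u20ac", // euro sign
  0x99: "\u2122", // trade mark sign
};

// Hypothetical helper: decode each byte as one Windows-1252 character.
function decodeAsWindows1252(bytes) {
  return Array.from(bytes, (b) =>
    CP1252_EXTRAS[b] !== undefined ? CP1252_EXTRAS[b] : String.fromCharCode(b)
  ).join("");
}

const quoteBytes = new TextEncoder().encode("\u2019"); // bytes E2 80 99
const garbled = decodeAsWindows1252(quoteBytes);
const correct = new TextDecoder("utf-8").decode(quoteBytes);

console.log(garbled); // â€™
console.log(correct); // ’
```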

Byte Order Mark

When saving UTF-8 documents, some editors add a byte order mark (BOM). The BOM is a marker (the bytes EF BB BF in UTF-8) placed at the very beginning of a file, and most editors do not display it. A BOM is known to cause issues in PHP pages, because it is sent as output before the script runs, so later attempts to send HTTP headers fail. To avoid these issues, create PHP pages in an editor such as Notepad++ and set the encoding to UTF-8 without BOM.
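
When a file cannot be re-saved, the BOM can also be detected and dropped in code. The following is a minimal sketch that works on raw bytes; the function name stripUtf8Bom is invented for this example.

```javascript
// Return the bytes without a leading UTF-8 BOM (EF BB BF), if one is present.
function stripUtf8Bom(bytes) {
  const hasBom =
    bytes.length >= 3 &&
    bytes[0] === 0xEF &&
    bytes[1] === 0xBB &&
    bytes[2] === 0xBF;
  return hasBom ? bytes.subarray(3) : bytes;
}

const withBom = new Uint8Array([0xEF, 0xBB, 0xBF, 0x3C, 0x3F]); // BOM + "<?"
console.log(stripUtf8Bom(withBom).length); // 2
```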

Character Entities

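The reserved HTML characters <, >, &, ', and " must be written as the character entities &lt;, &gt;, &amp;, &apos;, and &quot; respectively. Note that &apos; is defined in XML, XHTML, and HTML5 but not in HTML4, where the numeric form &#39; is the safe choice. Entity names (&lt;) are generally preferred over entity numbers (&#60;) because they are easier to remember.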
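
A typical way to apply these replacements in script is a small escaping function. The sketch below is one possible version, not a complete sanitizer. The name escapeHtml is made up for the example, and it emits the numeric &#39; for the apostrophe on the assumption that the output may also be read by older HTML4 parsers.

```javascript
// Map each reserved character to its entity.
const HTML_ENTITIES = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#39;", // numeric form, assumed safer than &apos; for HTML4
};

// Hypothetical helper: replace every reserved character in `text`.
function escapeHtml(text) {
  return String(text).replace(/[&<>"']/g, (ch) => HTML_ENTITIES[ch]);
}

console.log(escapeHtml('<a href="x">Tom & Jerry</a>'));
// &lt;a href=&quot;x&quot;&gt;Tom &amp; Jerry&lt;/a&gt;
```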

Character Encoding in URLs

URL paths start with a forward slash /, and the standard does not restrict their length. The standard also does not mandate a character encoding for them, but most servers handle UTF-8. Letters, digits, and the characters - . _ ~ never need to be encoded; other characters should be encoded unless they act as delimiters, such as the / between path segments. To encode a character, write each byte of its encoded form as a percent sign % followed by the byte's two-digit hexadecimal value. Encoding characters that do not need it is also allowed, so the URL http://en.wikipedia.org/wiki/Main_Page is equivalent to http://en.wikipedia.org/%77%69%6b%69/%4d%61%69%6e%5f%50%61%67%65.
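
The Wikipedia example can be reproduced with a short sketch. The helper percentEncodeAll is invented for this illustration and encodes every byte, including characters that do not need it; the built-in encodeURIComponent and decodeURIComponent are what one would normally use.

```javascript
// Hypothetical helper: percent-encode every UTF-8 byte of `text`.
function percentEncodeAll(text) {
  return Array.from(new TextEncoder().encode(text), (b) =>
    "%" + b.toString(16).padStart(2, "0")
  ).join("");
}

console.log(percentEncodeAll("wiki"));                            // %77%69%6b%69
console.log(decodeURIComponent("%4d%61%69%6e%5f%50%61%67%65"));   // Main_Page

// The built-in encoder leaves unreserved characters alone and
// encodes a non-ASCII character as its UTF-8 bytes.
console.log(encodeURIComponent("Main_Page")); // Main_Page
console.log(encodeURIComponent("\u00e4"));    // %C3%A4
```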

A URL query string is the portion of the URL after the question mark. It is used to pass key/value pairs to a page, in the format http://domain/page?key1=value1&key2=value2&key3=value3... The question mark, equals signs, and ampersands that act as delimiters must not be encoded, whereas the keys and values should be encoded.
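
A minimal sketch of building such a query string follows. The function name buildQuery and the sample keys and values are made up for the example. Each key and value is encoded separately, so an ampersand or equals sign inside a value cannot be mistaken for a delimiter.

```javascript
// Hypothetical helper: encode keys and values, then join with the delimiters.
function buildQuery(params) {
  return Object.entries(params)
    .map(([key, value]) =>
      encodeURIComponent(key) + "=" + encodeURIComponent(value)
    )
    .join("&");
}

const query = buildQuery({ q: "a&b=c", name: "J\u00fcrgen" });
console.log("http://domain/page?" + query);
// http://domain/page?q=a%26b%3Dc&name=J%C3%BCrgen

// Parsing the string back recovers the original values.
const parsed = new URLSearchParams(query);
console.log(parsed.get("q")); // a&b=c
```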

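The domain portion of a URL is never percent-encoded, even if it is written in a script such as Cyrillic or Arabic, for example http://президент.рф. Internationalized domain names are instead converted to an ASCII form (Punycode, with labels prefixed by xn--) before the DNS lookup. At the time of this writing, Firefox (v19.0.2) properly displays domain names in Cyrillic.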

Character Encoding in XML

The character encoding of an XML document should be specified at the beginning of the document with the syntax <?xml version="1.0" encoding="utf-8"?>. If the declaration is omitted, parsers assume UTF-8, or UTF-16 when a byte order mark is present. If the document contains bytes that are not valid in the declared encoding, you will get an error such as "An invalid character was found..." when opening the XML document.

Just as in HTML, the reserved characters in XML must be written as character entities.

Character Encoding in JavaScript

A JavaScript file cannot declare its own encoding, so make sure the file is saved in the same encoding as the page that loads it or is served with a charset in its HTTP Content-Type header. A file saved as ASCII should not contain non-ASCII characters. The charset attribute of the script tag is not widely supported and is deprecated in HTML5.

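JavaScript allows Unicode escape sequences in strings, as in var c = "\u00e4", which is equivalent to var c = "ä". Because JavaScript strings are sequences of UTF-16 code units, the length property and functions such as charCodeAt and fromCharCode work on code units rather than characters, so a character stored as a surrogate pair counts as two. ECMAScript 2015 added codePointAt and String.fromCodePoint, which work on whole code points.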
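
The difference between code units and code points is easy to see with a character outside the 16-bit range. The sketch below uses U+1F600 as the sample and assumes an ECMAScript 2015 or later runtime; the helper name countCodePoints is invented for the example.

```javascript
const face = "\u{1F600}"; // one character, stored as a surrogate pair

console.log(face.length);                     // 2 (UTF-16 code units)
console.log(face.charCodeAt(0).toString(16)); // d83d (the high surrogate only)
console.log(face.codePointAt(0).toString(16)); // 1f600 (the whole code point)

// Hypothetical helper: count code points instead of code units.
// String iteration, which Array.from uses, walks code point by code point.
function countCodePoints(text) {
  return Array.from(text).length;
}

console.log(countCodePoints(face));                   // 1
console.log(String.fromCodePoint(0x1F600) === face);  // true
```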

Character Encoding in MySQL

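To allow a column to store UTF-8 characters, set its collation to utf8_general_ci. Note that MySQL's utf8 character set stores at most 3 bytes per character; characters outside the Basic Multilingual Plane require utf8mb4 (for example, the collation utf8mb4_general_ci). Before viewing records from the command line, run the query SET NAMES utf8;. When connecting to MySQL from PHP, run mysql_query("SET NAMES utf8"); immediately after connecting. The mysql_* functions were removed in PHP 7, so with mysqli call $mysqli->set_charset("utf8") instead.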

Character Encoding in SQL Server

In SQL Server, use nchar, nvarchar, and ntext instead of char, varchar, and text. (ntext and text are deprecated in favor of nvarchar(max) and varchar(max).) When querying, prefix string literals with N, as in SELECT column FROM table WHERE column = N'data';.
