+<h2><a name="Unicode"></a>Unicode code points</h2>
+
+
+Enter the <a href="www.unicode.org">Unicode</a> standard...<br>
+
+<br>
+
+Unicode solves the problem of encoding by assigning unique numbers to <b>every</b>
+ character that is used anywhere in the world. Since it is not possible
+to do this in 8 bits (with a maximum of 256 code positions), a Unicode character
+is usually represented by 16 bits, denoted by U+0000 to U+FFFF in hexadecimal
+style. A number such as U+0123 is named a "code point".<br>
+
+<br>
+ Recently (Unicode 3.1), some extensions have even been defined so that in
+fact the defined range is now U+000000 to U+10FFFF (21 bits), and formally,
+the character set is defined as 31-bits to allow for future expansion.<br>
+
+<br>
+
+The Unicode character set is backward compatible with the ISO-8859-1 or Latin-1
+character set (and thus automatically also with the ASCII character set),
+because for every ISO-8859-1 character with hexadecimal value 0xXY, the corresponding
+Unicode code point is U+00XY.<br>
+
+<br>
+ Some examples of Unicode code points (some of the characters here may not
+be displayed correctly in all browsers; current Mozilla works perfectly for
+this, but it also depends on the installed fonts of course):<br>
+<br>
+
+<table cellpadding="2" cellspacing="2" border="1" width="50%" align="center">
+
+ <tbody>
+ <tr>
+ <td valign="top" align="center"><b>Unicode code point</b><br>
+ </td>
+ <td valign="top" align="center"><b>Character</b><br>
+ </td>
+ </tr>
+ <tr>
+ <td valign="top">U+0041<br>
+ </td>
+ <td valign="top">A<br>
+ </td>
+ </tr>
+ <tr>
+ <td valign="top">U+00E9<br>
+ </td>
+ <td valign="top">é<br>
+ </td>
+ </tr>
+ <tr>
+ <td valign="top">U+03B8<br>
+ </td>
+ <td valign="top">θ (the Greek theta)<br>
+ </td>
+ </tr>
+ <tr>
+ <td valign="top">U+20AC<br>
+ </td>
+ <td valign="top">€ (the euro)<br>
+ </td>
+ </tr>
+ </tbody>
+</table>
+
+<br>
+Using the Unicode code points there is no confusion anymore which character
+is meant, because they uniquely define the character. The full Unicode
+code charts can be found <a href="http://www.unicode.org/charts">here</a>
+ (as a set of PDF documents). A nice application to see all Unicode
+characters is the Unicode Character Map (ucm), which can be found <a href="ftp://ftp.dcs.ed.ac.uk/pub/jec/programs/">here</a>, and which allows to select and paste any Unicode character.<br>
+<br>
+Some additional terminology (more terminology follows in the next section):<br>
+<ul>
+ <li><b>UCS</b> (Universal Character Set): the official name for the Unicode character set<br>
+ <br>
+ </li>
+ <li><b>ISO 10646</b>: the international standard that defines the Unicode character set<br>
+ <br>
+ </li>
+ <li><b>BMP</b>
+ (Basic Multilingual Plane) or Plane 0 is the 16-bit subset of UCS, i.e.
+the characters U+0000 to U+FFFF, which is supposed to cover all characters
+is all currently used languages. Code points outside that range are used
+for historical character sets (e.g. hieroglyphs) and special symbols.</li>
+</ul>
+<hr width="100%" size="2">
+<h2><a name="Unicode_encodings_UTF-8"></a>Unicode encodings, UTF-8</h2>
+Since Unicode characters are generally represented by a number that is 16
+bits wide, as seen above (for the basic plane), it would seem that all text
+files would double in size, since the usual ASCII characters are 8 bits wide.
+ However, the Unicode code points are not necessarily the values that
+are written to files... <br>
+<br>
+Indeed, the simplest solution is to take the code point that defines a character,
+split it up into two bytes, and write the two bytes to the file. This
+is called the UCS-2 encoding scheme:<br>
+<br>
+<table cellpadding="2" cellspacing="2" border="1" width="75%" align="center">
+ <tbody>
+ <tr>
+ <td valign="top" align="center"><b>Character</b><br>
+ </td>
+ <td valign="top" align="center"><b>Unicode code point</b><br>
+ </td>
+ <td valign="top" align="center"><b>Byte values in file (UCS-2)</b><br>
+ </td>
+ </tr>
+ <tr>
+ <td valign="top">A<br>
+ </td>
+ <td valign="top">U+0041<br>
+ </td>
+ <td valign="top">0x00, 0x41<br>
+ </td>
+ </tr>
+ <tr>
+ <td valign="top">é<br>
+ </td>
+ <td valign="top">U+00E9<br>
+ </td>
+ <td valign="top">0x00, 0xE9<br>
+ </td>
+ </tr>
+ <tr>
+ <td valign="top">θ (theta)<br>
+ </td>
+ <td valign="top">U+03B8<br>
+ </td>
+ <td valign="top">0x03, 0xB8<br>
+ </td>
+ </tr>
+ <tr>
+ <td valign="top">€ (euro)<br>
+ </td>
+ <td valign="top">U+20AC<br>
+ </td>
+ <td valign="top">0x20, 0xAC<br>
+ </td>
+ </tr>
+ </tbody>
+</table>
+<br>
+This table assumes a big-endian encoding of UCS-2: the endianness is in principle
+not defined, so there are two versions of UCS-2. The little-endian
+encoding results in the same values as in the table above, but in the inverse
+order.<br>
+<br>
+So, we see that the UCS-2 encoding results in a doubling of file sizes for
+files that contain only English text. This is a disadvantage for this
+encoding. Another disadvantage is that null bytes can occur in normal
+text, breaking all conventions for null-terminated C strings if you use the
+normal <code>char</code> type. This is why C also defines the <code>wchar_t</code>
+ type, which can hold a 32-bit character (at least in GNU systems). To
+avoid both of these disadvantages, UTF-8 was introduced.<br>
+<br>
+In UTF-8, the number of bytes used to write a character to a file depends
+on the Unicode code point. The corresponding table to the table above
+is:<br>
+<br>
+<table cellpadding="2" cellspacing="2" border="1" width="75%" align="center">
+<tbody><tr><td valign="top" align="center"><b>Character</b><br></td><td valign="top" align="center"><b>Unicode code point</b><br></td><td valign="top" align="center"><b>Byte values in file (UTF-8)</b><br></td></tr><tr><td valign="top">A<br></td><td valign="top">U+0041<br></td><td valign="top">0x41<br></td></tr><tr><td valign="top">é<br></td><td valign="top">U+00E9<br></td><td valign="top">0xC3, 0xA9<br></td></tr><tr><td valign="top">θ (theta)<br></td><td valign="top">U+03B8<br></td><td valign="top">0xCE, 0xB8<br></td></tr><tr><td valign="top">€ (euro)<br></td><td valign="top">U+20AC<br></td><td valign="top">0xE2, 0x82, 0xAC<br></td></tr></tbody>
+</table>
+<br>
+Some immediate observations:<br>
+<ul>
+ <li>Null characters don't occur, so there is no problem to put such characters
+in a null-terminated C string (without having to use the <code>wchar_t</code> type), although it is a little more difficult to get the length of the string.<br>
+ <br>
+ </li>
+ <li>Strict ASCII characters are encoded into 1 byte, which makes UTF-8
+completely backward compatible with ASCII. It doesn't change the size
+of normal ASCII text strings or files.<br>
+ <br>
+ </li>
+ <li>Some characters need 3 bytes in UTF-8. Indeed, all basic plane
+characters (U+0000 to U+FFFF) can be encoded in 1, 2 or 3 bytes.</li>
+</ul>
+An excellent explanation of how to characters are encoded in UTF-8 can be found <a href="http://www.cl.cam.ac.uk/%7Emgk25/unicode.html#utf-8">on this page</a>.<br>
+<br>
+Some additional terminology regarding encoding schemes (less important here):<br>
+<ul>
+ <li><b>UCS-2</b>: the encoding of Unicode characters in 2 bytes (16 bits) per character, can only encode U+0000 to U+FFFF (the basic plane)<br>
+ <br>
+ </li>
+ <li><b>UCS-4</b>: the encoding of Unicode characters in 4 bytes (32 bits) per character, can encode the entire UCS range<br>
+ <br>
+ </li>
+ <li><b>UTF-8</b>: the encoding scheme described above (UTF = UCS Transformation Format)<br>
+ <br>
+ </li>
+ <li><b>UTF-16</b>: an extension to UCS-2 to be able to encode characters
+outside the basic plane (the 21-bit range of Unicode 3.1), by using a sequence
+of two 16-bit characters<br>
+ <br>
+ </li>
+ <li><b>UTF-32</b>: the 4-byte encoding of the 21-bit range of Unicode 3.1, in fact the same as UCS-4<br>
+ <br>
+ </li>
+ <li><b>UTF-7</b>: can be safely forgotten... (a kind of 7-bit clean version of UTF-8)</li>
+</ul>
+Note that the byte order of UCS-2, UCS-4, UTF-16 and UTF-32 is not defined, so it can be big endian or little endian !<br>
+