doc/encoding.html

   1 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head>
   2   <title>Character encoding</title>
   3   <meta http-equiv="content-type" content="text/html;
   4 charset=UTF-8">
   5 </head><body>
   6 <h1 align="center">Character encoding<br>
   7 </h1>
   8
   9 <h2>Index</h2>
  10
  11 <ul>
  12
  13   <li><a href="#The_character_encoding_problem">The character encoding problem</a></li>
  14   <li><a href="#Unicode">Unicode code points</a></li><li><a href="#Unicode_encodings_UTF-8">Unicode encodings, UTF-8</a><br>
  15   </li>
  16
  17 </ul>
  18
  19 <br>
  20
  21 <hr width="100%" size="2">
  22 <h2><a name="The_character_encoding_problem"></a>The character encoding problem</h2>
  23
  24
  25 Developers are usually familiar with the ASCII character set. &nbsp;This
  26 is a character set that assigns a unique number to some characters, e.g.
  27 an "A" has ASCII code 65 (or 0x41 in hex), and an "a" has ASCII code 97 (or
  28 0x61 in hex). &nbsp;Some people may also have used ASCII codes for several
  29 drawing characters (such as a horizontal bar, a vertical bar, or a top-right
  30 corner) in the old DOS days, to be able to draw nice windows in text mode.<br>
  31
  32 <br>
  33
  34
However, strictly speaking these last characters are not part of the ASCII
set. &nbsp;The standard ASCII set contains only the character positions from
0 to 127 (i.e. anything that fits into an integer that is 7 bits wide). &nbsp;An
example of this table can be found <a href="http://web.cs.mun.ca/%7Emichael/c/ascii-table.html">here</a>. &nbsp;Anything with a code between 128 and 255 is not defined by ASCII.<br>
  39
  40 <br>
  41
  42
  43 Now, several systems (including the old DOS) have defined those character
  44 positions anyway, but usually in totally different ways. &nbsp;Some well
  45 known extensions are:<br>
  46
  47 <ul>
  48
  49 <li>the <a href="http://czyborra.com/charsets/cp437.gif">DOS</a>
  50  character set, nowadays usually known as Code Page 437, but sometimes also
  51 named LatinUS, ECS (Extended Character Set) or PC-8; note that the table
  52 displayed in the link also contains the standard ASCII part</li><li>the <a href="http://czyborra.com/charsets/cp1252.gif">ANSI</a> character set, also known as Code Page 1252, and usually the default on Windows</li><li>the <a href="http://czyborra.com/charsets/iso8859-1.gif">ISO-8859-1</a> character set (also called Latin-1), which is an ISO standard for Western European languages, mostly used on various Unices</li><li>the <a href="http://czyborra.com/charsets/adobe-stdenc.gif">Adobe Standard Encoding</a>, which is by default used in Postscript, unless overridden</li>
  53 </ul>
  54
  55
And these are only examples of character sets used for Western European languages.
&nbsp;For Japanese, Chinese, Korean, Vietnamese, ... there are separate character
sets in which the meaning of a byte can even depend on what the previous
byte was, i.e. these are multi-byte character sets. &nbsp;This is because
even 256 characters (the maximum for 8 bits) are totally inadequate to represent all characters in
such languages.<br>
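<br>
As an illustration of a multi-byte character set, the following Python sketch uses Shift_JIS, a Japanese encoding that is chosen here only as an example and is not discussed elsewhere in this document. &nbsp;It shows that the same byte value means something different depending on the byte before it.<br>
<br>
```python
# Shift_JIS is used purely as an example of a multi-byte character set.
text = "\u8868"                      # a single Japanese character (U+8868)
encoded = text.encode("shift_jis")   # two bytes for this one character

lead, trail = encoded[0], encoded[1]
print(len(encoded), hex(lead), hex(trail))

# On its own, the trail byte value lies in the ASCII range and would be
# read as an ordinary ASCII character ...
trail_alone = bytes([trail]).decode("ascii")

# ... but directly after the lead byte it is only the second half of
# one character.
assert encoded.decode("shift_jis") == text
```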
  62
  63 <br>
  64
  65
So, to summarize: if a text file contains a byte with the value 65, it is
pretty safe to assume that this byte represents an "A", if we ignore the
multi-byte character sets mentioned above. &nbsp;However, a value of 233 cannot
be interpreted without knowing in which character set the text file is written.
&nbsp;In Latin-1, it happens to be the character "&eacute;", but in another
character set it can be something totally different (e.g. in the DOS character
set it is the Greek letter theta).<br>
  73
  74 <br>
  75
  76 <div align="center"><img src="images/fromhex.png" alt="Conversion from byte value" width="386" height="140">
  77 <br>
  78 </div>
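The conversion from a byte value can be tried out with a minimal Python sketch. &nbsp;It assumes that Python's codec names <code>latin-1</code>, <code>cp1252</code> and <code>cp437</code> stand for the ISO-8859-1, ANSI and DOS character sets listed above, and it prints code points rather than the characters themselves so that the output does not depend on the terminal.<br>
<br>
```python
raw = bytes([233])  # a single byte with value 233 (0xE9)

as_latin1 = raw.decode("latin-1")  # ISO-8859-1
as_cp1252 = raw.decode("cp1252")   # the Windows "ANSI" code page
as_cp437 = raw.decode("cp437")     # the DOS code page

# Same byte, but not the same character in every character set.
for name, ch in [("latin-1", as_latin1),
                 ("cp1252", as_cp1252),
                 ("cp437", as_cp437)]:
    print(f"{name:8} U+{ord(ch):04X}")

# A byte in the ASCII range, such as 65, is an "A" in all three.
a_everywhere = {bytes([65]).decode(c) for c in ("latin-1", "cp1252", "cp437")}
```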
  79
  80
  81 <br>
  82
  83
Vice versa, if you need to write the character "&eacute;" to a file, the
numerical value that ends up in the file depends on the character set you use:
in Latin-1 it will be 233, but in the DOS character set it
will be 130. &nbsp;So again you need to know the encoding when you want to re-read the file.<br>
  88
  89 <br>
  90
  91 <div align="center"><img src="images/tohex.png" alt="Conversion to byte value" width="382" height="140">
  92 <br>
  93 <br>
  94 <div align="left">This is a source of great confusion as soon as you go outside
  95 the normal English character set, especially when you are using files on
  96 different systems...<br>
  97 </div>
  98 </div>
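<br>
The write direction described above can be sketched in the same way. &nbsp;This is only an illustration, again assuming Python's <code>latin-1</code> and <code>cp437</code> codecs for the Latin-1 and DOS character sets.<br>
<br>
```python
e_acute = "\u00e9"  # the character "e with acute accent"

latin1_bytes = e_acute.encode("latin-1")
cp437_bytes = e_acute.encode("cp437")

print(latin1_bytes[0], cp437_bytes[0])  # 233 and 130, as in the text

# Re-reading the DOS file as if it were Latin-1 gives a different character.
misread = cp437_bytes.decode("latin-1")
print(f"U+{ord(misread):04X}")
```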
  99
 100
 101 <hr width="100%" size="2">
 102 <h2><a name="Unicode"></a>Unicode code points</h2>
 103
 104
Enter the&nbsp;<a href="http://www.unicode.org">Unicode</a> standard...<br>
 106
 107 <br>
 108
Unicode solves the problem of encoding by assigning a unique number to <b>every</b>
 character that is used anywhere in the world. &nbsp;Since it is not possible
to do this in 8 bits (with a maximum of 256 code positions), a Unicode character
is usually represented by 16 bits, denoted by U+0000 to U+FFFF in hexadecimal
notation. &nbsp;A number such as U+0123 is called a "code point".<br>
 114
 115 <br>
 Since Unicode 3.1, characters have also been defined beyond that 16-bit range, so that in
fact the defined range is now U+000000 to U+10FFFF (21 bits). &nbsp;Formally,
the character set was originally defined as 31 bits wide to allow for future expansion.<br>
 119
 120 <br>
 121
 122 The Unicode character set is backward compatible with the ISO-8859-1 or Latin-1
 123 character set (and thus automatically also with the ASCII character set),
 124 because for every ISO-8859-1 character with hexadecimal value 0xXY, the corresponding
 125 Unicode code point is U+00XY.<br>
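<br>
This compatibility can be checked with a short Python sketch, assuming that Python's <code>latin-1</code> codec implements ISO-8859-1 for all 256 byte values.<br>
<br>
```python
# Every byte value 0xXY, decoded as ISO-8859-1, yields code point U+00XY.
for value in range(256):
    ch = bytes([value]).decode("latin-1")
    assert ord(ch) == value

# The ASCII range (0 to 127) is the common subset of ASCII, Latin-1
# and Unicode.
ascii_ok = all(bytes([v]).decode("ascii") == chr(v) for v in range(128))
print(ascii_ok)
```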
 126
 127 <br>
 128  Some examples of Unicode code points (some of the characters here may not
 129 be displayed correctly in all browsers; current Mozilla works perfectly for
 130 this, but it also depends on the installed fonts of course):<br>
 131 <br>
 132
 133 <table cellpadding="2" cellspacing="2" border="1" width="50%" align="center">
 134
 135   <tbody>
 136     <tr>
 137       <td valign="top" align="center"><b>Unicode code point</b><br>
 138       </td>
 139       <td valign="top" align="center"><b>Character</b><br>
 140       </td>
 141     </tr>
 142     <tr>
 143       <td valign="top">U+0041<br>
 144       </td>
 145       <td valign="top">A<br>
 146       </td>
 147     </tr>
 148     <tr>
 149       <td valign="top">U+00E9<br>
 150       </td>
 151       <td valign="top">&eacute;<br>
 152       </td>
 153     </tr>
 154     <tr>
 155       <td valign="top">U+03B8<br>
 156       </td>
 157       <td valign="top">&#952; (the Greek theta)<br>
 158       </td>
 159     </tr>
 160     <tr>
 161       <td valign="top">U+20AC<br>
 162       </td>
 163       <td valign="top">&#8364; (the euro)<br>
 164       </td>
 165     </tr>
 166   </tbody>
 167 </table>
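<br>
The code points in this table can be reproduced with a few lines of Python. &nbsp;The helper name <code>code_point</code> is made up for this sketch; <code>ord</code>, <code>chr</code> and the <code>unicodedata</code> module are part of the standard library.<br>
<br>
```python
import unicodedata

def code_point(ch):
    """Format a one-character string in the U+XXXX notation."""
    return f"U+{ord(ch):04X}"

# The four characters of the table, written as escapes to keep this
# example itself pure ASCII.
for ch in ["A", "\u00e9", "\u03b8", "\u20ac"]:
    print(code_point(ch), unicodedata.name(ch))

# chr() goes the other way, from code point to character.
theta = chr(0x03B8)
```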
 168
 169 <br>
With Unicode code points there is no longer any confusion about which character
is meant, because each code point uniquely defines a character. &nbsp;The full Unicode
code charts can be found <a href="http://www.unicode.org/charts">here</a>
 (as a set of PDF documents). &nbsp;A nice application to see all Unicode
characters is the Unicode Character Map (ucm), which can be found <a href="ftp://ftp.dcs.ed.ac.uk/pub/jec/programs/">here</a>, and which lets you select and paste any Unicode character.<br>
 175 <br>
 176 Some additional terminology (more terminology follows in the next section):<br>
 177 <ul>
 178   <li><b>UCS</b> (Universal Character Set): the official name for the Unicode character set<br>
 179     <br>
 180   </li>
 181   <li><b>ISO 10646</b>: the international standard that defines the Unicode character set<br>
 182     <br>
 183   </li>
  <li><b>BMP</b> (Basic Multilingual Plane) or Plane 0 is the 16-bit subset
of UCS, i.e. the characters U+0000 to U+FFFF, which is supposed to cover
all characters in all currently used languages. &nbsp;Code points outside that range are used for historical scripts (e.g. hieroglyphs) and special symbols.</li>
 187 </ul>
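Whether a code point lies in the BMP follows directly from its value. &nbsp;A minimal sketch, with the hypothetical helper names <code>plane</code> and <code>in_bmp</code>, and with U+13000 (an Egyptian hieroglyph) as an example of a code point outside the basic plane:<br>
<br>
```python
def plane(cp):
    """Return the plane number of a code point (plane 0 is the BMP)."""
    return cp // 0x10000

def in_bmp(cp):
    """True if the code point fits in 16 bits (U+0000 to U+FFFF)."""
    return plane(cp) == 0

print(in_bmp(0x20AC))    # the euro sign
print(in_bmp(0x13000))   # an Egyptian hieroglyph
print(plane(0x10FFFF))   # the last defined code point
```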
 188 <hr width="100%" size="2">
 189 <h2><a name="Unicode_encodings_UTF-8"></a>Unicode encodings, UTF-8</h2>
 190 Since Unicode characters are generally represented by a number that is 16
 191 bits wide, as seen above (for the basic plane), it would seem that all text
 192 files would double in size, since the usual ASCII characters are 8 bits wide.
 193 &nbsp;However, the Unicode code points are not necessarily the values that
 194 are written to files... &nbsp;<br>
 195 <br>
 196 Indeed, the simplest solution is to take the code point that defines a character,
 197 split it up into two bytes, and write the two bytes to the file. &nbsp;This
 198 is called the UCS-2 encoding scheme:<br>
 199 <br>
 200 <table cellpadding="2" cellspacing="2" border="1" width="75%" align="center">
 201   <tbody>
 202     <tr>
 203       <td valign="top" align="center"><b>Character</b><br>
 204       </td>
 205       <td valign="top" align="center"><b>Unicode code point</b><br>
 206       </td>
 207       <td valign="top" align="center"><b>Byte values in file (UCS-2)</b><br>
 208       </td>
 209     </tr>
 210     <tr>
 211       <td valign="top">A<br>
 212       </td>
 213       <td valign="top">U+0041<br>
 214       </td>
 215       <td valign="top">0x00, 0x41<br>
 216       </td>
 217     </tr>
 218     <tr>
 219       <td valign="top">&eacute;<br>
 220       </td>
 221       <td valign="top">U+00E9<br>
 222       </td>
 223       <td valign="top">0x00, 0xE9<br>
 224       </td>
 225     </tr>
 226     <tr>
 227       <td valign="top">&#952; (theta)<br>
 228       </td>
 229       <td valign="top">U+03B8<br>
 230       </td>
 231       <td valign="top">0x03, 0xB8<br>
 232       </td>
 233     </tr>
 234     <tr>
 235       <td valign="top">&#8364; (euro)<br>
 236       </td>
 237       <td valign="top">U+20AC<br>
 238       </td>
 239       <td valign="top">0x20, 0xAC<br>
 240       </td>
 241     </tr>
 242   </tbody>
 243 </table>
 244 <br>
This table assumes a big-endian encoding of UCS-2: the endianness is in principle
not defined, so there are two versions of UCS-2. &nbsp;The little-endian
encoding results in the same byte values as in the table above, but in reverse
order.<br>
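<br>
Splitting a code point into two bytes can be sketched in Python as follows. &nbsp;The function <code>ucs2_bytes</code> is a hypothetical helper written for this illustration; Python has no codec named UCS-2, so the last line borrows the <code>utf-16-le</code> codec, which gives the same result for basic plane characters.<br>
<br>
```python
def ucs2_bytes(ch, byteorder="big"):
    """Split a basic-plane code point into two bytes (a UCS-2 sketch)."""
    cp = ord(ch)
    if cp > 0xFFFF:
        raise ValueError("UCS-2 cannot encode code points above U+FFFF")
    return cp.to_bytes(2, byteorder)

# Big-endian and little-endian byte values for the table's characters.
for ch in ["A", "\u00e9", "\u03b8", "\u20ac"]:
    print(ucs2_bytes(ch, "big").hex(), ucs2_bytes(ch, "little").hex())

# Reading big-endian data as little-endian silently gives another
# character: the bytes 0x00, 0x41 become code point U+4100.
misread = ucs2_bytes("A", "big").decode("utf-16-le")
```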
 249 <br>
 250 So, we see that the UCS-2 encoding results in a doubling of file sizes for
 251 files that contain only English text. &nbsp;This is a disadvantage for this
 252 encoding. &nbsp;Another disadvantage is that null bytes can occur in normal
 253 text, breaking all conventions for null-terminated C strings if you use the
 254 normal <code>char</code> type. &nbsp;This is why C also defines the <code>wchar_t</code>
 255  type, which can hold a 32-bit character (at least in GNU systems). &nbsp;To
 256 avoid both of these disadvantages, UTF-8 was introduced.<br>
 257 <br>
In UTF-8, the number of bytes used to write a character to a file depends
on the Unicode code point. &nbsp;The table corresponding to the one above
is:<br>
 261 <br>
 262 <table cellpadding="2" cellspacing="2" border="1" width="75%" align="center">
 263 <tbody><tr><td valign="top" align="center"><b>Character</b><br></td><td valign="top" align="center"><b>Unicode code point</b><br></td><td valign="top" align="center"><b>Byte values in file (UTF-8)</b><br></td></tr><tr><td valign="top">A<br></td><td valign="top">U+0041<br></td><td valign="top">0x41<br></td></tr><tr><td valign="top">&eacute;<br></td><td valign="top">U+00E9<br></td><td valign="top">0xC3, 0xA9<br></td></tr><tr><td valign="top">&#952; (theta)<br></td><td valign="top">U+03B8<br></td><td valign="top">0xCE, 0xB8<br></td></tr><tr><td valign="top">&#8364; (euro)<br></td><td valign="top">U+20AC<br></td><td valign="top">0xE2, 0x82, 0xAC<br></td></tr></tbody>
 264 </table>
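<br>
The byte values in this table can be derived by hand. &nbsp;The following Python sketch uses the standard UTF-8 bit layout, which is not spelled out in this document; the function name <code>utf8_bytes</code> is made up, and the sketch handles only the basic plane and does not reject the surrogate range.<br>
<br>
```python
def utf8_bytes(cp):
    """Encode a basic-plane code point (U+0000 to U+FFFF) as UTF-8."""
    if cp < 0x80:        # up to 7 bits: one byte, identical to ASCII
        return bytes([cp])
    if cp < 0x800:       # up to 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:     # up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code points above U+FFFF are not handled here")

for cp in [0x0041, 0x00E9, 0x03B8, 0x20AC]:
    print(f"U+{cp:04X}", utf8_bytes(cp).hex(" "))
```

In practice Python's built-in <code>str.encode("utf-8")</code> does the same job; the manual version is only meant to show where the bytes come from.<br>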
 265 <br>
 266 Some immediate observations:<br>
 267 <ul>
  <li>Null bytes don't occur (except for the character U+0000 itself), so such characters can be stored
in a null-terminated C string (without having to use the <code>wchar_t</code> type), although it is a little more difficult to get the length of the string in characters.<br>
 270     <br>
 271   </li>
 272   <li>Strict ASCII characters are encoded into 1 byte, which makes UTF-8
 273 completely backward compatible with ASCII. &nbsp;It doesn't change the size
 274 of normal ASCII text files.<br>
 275     <br>
 276   </li>
 277   <li>Some characters need 3 bytes in UTF-8. &nbsp;Indeed, all basic plane
 278 characters (U+0000 to U+FFFF) can be encoded in 1, 2 or 3 bytes.</li>
 279 </ul>
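These observations can be checked with Python's built-in UTF-8 codec. &nbsp;The sketch below skips U+0000 itself and the range U+D800 to U+DFFF, which is reserved for UTF-16 and does not contain characters.<br>
<br>
```python
# All basic-plane code points except U+0000 and the surrogate range.
code_points = [cp for cp in range(1, 0x10000)
               if not 0xD800 <= cp <= 0xDFFF]

longest = 0
has_null = False
for cp in code_points:
    encoded = chr(cp).encode("utf-8")
    longest = max(longest, len(encoded))
    has_null = has_null or 0 in encoded

# Pure ASCII text has exactly the same bytes in UTF-8.
ascii_text = "plain English text"
same_as_ascii = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# The byte length and the character count differ once a non-ASCII
# character appears.
word = "caf\u00e9"
print(longest, has_null, same_as_ascii, len(word), len(word.encode("utf-8")))
```
<br>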
 280 An excellent explanation of how to encode characters in UTF-8 can be found <a href="http://www.cl.cam.ac.uk/%7Emgk25/unicode.html#utf-8">on this page</a>.<br>
 281 <br>
 282 Some additional terminology regarding encoding schemes (less important here):<br>
 283 <ul>
 284   <li><b>UCS-2</b>: the encoding of Unicode characters in 2 bytes (16 bits) per character, can only encode U+0000 to U+FFFF (the basic plane)<br>
 285     <br>
 286   </li>
 287   <li><b>UCS-4</b>: the encoding of Unicode characters in 4 bytes (32 bits) per character, can encode the entire UCS range<br>
 288     <br>
 289   </li>
 290   <li><b>UTF-8</b>: the encoding scheme described above (UTF = UCS Transformation Format)<br>
 291     <br>
 292   </li>
  <li><b>UTF-16</b>: an extension to UCS-2 to be able to encode characters
outside the basic plane (the 21-bit range of Unicode 3.1), by using a sequence
of two 16-bit values (a surrogate pair)<br>
 296     <br>
 297   </li>
 298   <li><b>UTF-32</b>: the 4-byte encoding of the 21-bit range of Unicode 3.1, in fact the same as UCS-4<br>
 299     <br>
 300   </li>
  <li><b>UTF-7</b>: can be safely forgotten... (a 7-bit clean encoding of Unicode)</li>
 302 </ul>
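How UTF-16 reaches beyond the basic plane can be sketched as follows. &nbsp;The surrogate calculation is the standard one and is not described in this document; the function name <code>utf16_units</code> is hypothetical, and U+13000 (an Egyptian hieroglyph) serves as the example outside the basic plane.<br>
<br>
```python
def utf16_units(cp):
    """Return the 16-bit values UTF-16 uses for one code point."""
    if cp <= 0xFFFF:
        return [cp]                   # basic plane: same value as UCS-2
    offset = cp - 0x10000             # at most 20 bits remain
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return [high, low]

print([hex(u) for u in utf16_units(0x20AC)])   # one 16-bit value
print([hex(u) for u in utf16_units(0x13000)])  # two 16-bit values

# UTF-32 always needs exactly four bytes per character.
utf32_len = len(chr(0x13000).encode("utf-32-be"))
```
<br>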
Note that the byte order of UCS-2, UCS-4, UTF-16 and UTF-32 is not defined, so it can be big-endian or little-endian!<br>
 304
 305 <hr width="100%" size="2">
 306 <pre><font size="-1">$Id$<br>$Name$</font><br></pre>
 307
 308 <br>
 309
 310 </body></html>