doc/encoding.html

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head><title>Character encoding</title><meta http-equiv="content-type" content="text/html; charset=ISO-8859-1"></head><body>
<h1 align="center">Character encoding</h1>
<br>
<h2>Index</h2>
<ul>
  <li><a href="#The_character_encoding_problem">The character encoding problem</a></li>
  <li><a href="#Unicode">Unicode</a><br>
  </li>
</ul>
<br>
<hr width="100%" size="2">
<h2><a name="The_character_encoding_problem"></a>The character encoding problem</h2>

Developers are usually familiar with the ASCII character set. &nbsp;It
assigns a unique number to each of its characters: for example,
"A" has ASCII code 65 (0x41 in hex) and "a" has ASCII code 97
(0x61 in hex). &nbsp;Some people may also have used what they took to be ASCII codes for
box-drawing characters (such as a horizontal bar, a vertical bar, or a top-right
corner) in the old DOS days, to draw nice windows in text mode.<br>
<br>

Strictly speaking, however, those drawing characters are not part of the ASCII
set. &nbsp;Standard ASCII covers only the character positions from
0 to 127 (i.e. anything that fits into an integer that is 7 bits wide). &nbsp;An
example of this table can be found <a href="http://web.cs.mun.ca/%7Emichael/c/ascii-table.html">here</a>. &nbsp;ASCII leaves every value between 128 and 255 undefined.<br>
<br>
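The 7-bit limit can be checked with a few lines of Python. &nbsp;This is only a minimal sketch: the helper name is_ascii_byte is made up for illustration and does not come from any library.<br>
<pre>
```python
def is_ascii_byte(value):
    """Return True if value lies in the 7-bit ASCII range 0..127."""
    return value in range(128)

# The codes quoted in the text above.
print(ord("A"), hex(ord("A")))   # 65 0x41
print(ord("a"), hex(ord("a")))   # 97 0x61
print(is_ascii_byte(65), is_ascii_byte(233))   # True False

# Decoding a byte above 127 as ASCII fails, because ASCII does not define it.
try:
    bytes([233]).decode("ascii")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print(decoded_ok)   # False
```
</pre>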

Several systems (including the old DOS) have defined those character
positions anyway, but usually in totally different ways. &nbsp;Some
well-known extensions are:<br>
<ul>
<li>the <a href="http://czyborra.com/charsets/cp437.gif">DOS</a>
 character set, nowadays usually known as Code Page 437, but sometimes also
named LatinUS, ECS (Extended Character Set) or PC-8; note that the table
displayed in the link also contains the standard ASCII part</li>
<li>the <a href="http://czyborra.com/charsets/cp1252.gif">ANSI</a> character set, also known as Code Page 1252, and usually the default on Windows</li>
<li>the <a href="http://czyborra.com/charsets/iso8859-1.gif">ISO-8859-1</a> character set (also called Latin-1), which is an ISO standard for Western European languages, mostly used on various Unices</li>
<li>the <a href="http://czyborra.com/charsets/adobe-stdenc.gif">Adobe Standard Encoding</a>, which PostScript uses by default unless overridden</li>
</ul>

And these are only examples of character sets used for Western European languages.
&nbsp;For Japanese, Chinese, Korean, Vietnamese, ... there are separate character
sets in which the meaning of one byte can even depend on what the previous
byte was, i.e. these are multi-byte character sets. &nbsp;This is because
even 256 characters (the maximum for 8 bits) are totally inadequate to represent all characters in
such languages.<br>
<br>
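To make the multi-byte effect concrete, here is a small Python sketch. &nbsp;Shift_JIS is picked purely as an example (the text above names no specific encoding): in it, the byte 0x41 on its own is an "A", but the same byte following 0x83 is the second half of a single katakana character.<br>
<pre>
```python
# Byte 0x41 on its own decodes to the letter "A".
lone = bytes([0x41]).decode("shift_jis")

# The same byte preceded by 0x83 is the trail byte of one two-byte character.
pair = bytes([0x83, 0x41]).decode("shift_jis")

print(ascii(lone), len(lone))   # 'A' 1
print(ascii(pair), len(pair))   # '\u30a2' 1  (KATAKANA LETTER A)
```
</pre>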

To summarize: if a text file contains a byte with the value 65, it is
pretty safe to assume that this byte represents an "A", if we ignore the
multi-byte character sets mentioned above. &nbsp;A value of 233, however, cannot
be interpreted without knowing in which character set the text file is written.
&nbsp;In Latin-1, it happens to be the character "&eacute;", but in another
character set it can be something totally different (e.g. in the DOS character
set it is the Greek capital letter theta).<br>
<br>
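The ambiguity of the value 233 can be reproduced with Python's built-in codecs. &nbsp;This is an illustrative sketch: decode_byte is a hypothetical helper, and "latin-1", "cp1252" and "cp437" are the Python codec names for three of the character sets listed earlier (Adobe Standard Encoding has no built-in Python codec, so it is left out).<br>
<pre>
```python
import unicodedata

def decode_byte(value, codec):
    """Interpret a single byte value according to the given character set."""
    return bytes([value]).decode(codec)

for codec in ("latin-1", "cp1252", "cp437"):
    char = decode_byte(233, codec)
    print(codec, unicodedata.name(char))
# latin-1 LATIN SMALL LETTER E WITH ACUTE
# cp1252 LATIN SMALL LETTER E WITH ACUTE
# cp437 GREEK CAPITAL LETTER THETA

# A value in the ASCII range is the same character in all three.
print([decode_byte(65, c) for c in ("latin-1", "cp1252", "cp437")])   # ['A', 'A', 'A']
```
</pre>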

Conversely, if you need to write the character "&eacute;" to a file, its
numerical value in the file depends on the character set you use:
in Latin-1 it will be 233, but in the DOS character set it
will be 130.<br>
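<br>
The writing direction can be sketched the same way. &nbsp;The helper byte_values below is hypothetical, and the character is written as an escape so that the example does not depend on the encoding of this page.<br>
<pre>
```python
def byte_values(text, codec):
    """Return the numerical byte values that text gets in the given character set."""
    return list(text.encode(codec))

e_acute = "\u00e9"   # the character e with acute accent

print(byte_values(e_acute, "latin-1"))   # [233]
print(byte_values(e_acute, "cp437"))     # [130]
```
</pre>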
<hr width="100%" size="2">
<h2><a name="Unicode"></a>Unicode</h2>

Enter the Unicode standard...<br>
<hr width="100%" size="2">
<pre><font size="-1">$Id$<br>$Name$</font><br></pre>
<br>
</body></html>