Further explanation of Unicode.
authorPeter Verthez <Peter.Verthez@advalvas.be>
Sat, 12 Jan 2002 16:02:12 +0000 (16:02 +0000)
committerPeter Verthez <Peter.Verthez@advalvas.be>
Sat, 12 Jan 2002 16:02:12 +0000 (16:02 +0000)
doc/encoding.html
doc/images/fromhex.obj [new file with mode: 0644]
doc/images/fromhex.png [new file with mode: 0644]
doc/images/tohex.obj [new file with mode: 0644]
doc/images/tohex.png [new file with mode: 0644]

index a8f0a1e614ba799cbcfc629fea3d835c51dd3ef1..08d953e8c8ed6f34d70f573905c8a875629f9a40 100644 (file)
@@ -1,48 +1,68 @@
-<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head><title>Character encoding</title><meta http-equiv="content-type" content="text/html; charset=ISO-8859-1"></head><body>
-<h1 align="center">Character encoding</h1>
-<br>
+<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head>
+  <title>Character encoding</title>
+  <meta http-equiv="content-type" content="text/html; 
+charset=UTF-8">
+</head><body>
+<h1 align="center">Character encoding<br>
+</h1>
+
 <h2>Index</h2>
+
 <ul>
+
   <li><a href="#The_character_encoding_problem">The character encoding problem</a></li>
-  <li><a href="#Unicode">Unicode</a><br>
+  <li><a href="#Unicode">Unicode code points</a></li><li><a href="#Unicode_encodings_UTF-8">Unicode encodings, UTF-8</a><br>
   </li>
+
 </ul>
+
 <br>
+
 <hr width="100%" size="2">
 <h2><a name="The_character_encoding_problem"></a>The character encoding problem</h2>
 
+
 Developers are usually familiar with the ASCII character set. &nbsp;This
 is a character set that assigns a unique number to some characters, e.g.
 an "A" has ASCII code 65 (or 0x41 in hex), and an "a" has ASCII code 97 (or
 0x61 in hex). &nbsp;Some people may also have used ASCII codes for several
 drawing characters (such as a horizontal bar, a vertical bar, or a top-right
 corner) in the old DOS days, to be able to draw nice windows in text mode.<br>
+
 <br>
 
+
 However, these last characters are strictly spoken not part of the ASCII
 set. &nbsp;The standard ASCII set contains only the character positions from
 0 to 127 (i.e. anything that fits into an integer that is 7 bits wide). &nbsp;An
 example of this table can be found <a href="http://web.cs.mun.ca/%7Emichael/c/ascii-table.html">here</a>. &nbsp;Anything that has an ASCII code between 128 and 255 is in principle undefined.<br>
+
 <br>
 
+
 Now, several systems (including the old DOS) have defined those character
 positions anyway, but usually in totally different ways. &nbsp;Some well
 known extensions are:<br>
+
 <ul>
+
 <li>the <a href="http://czyborra.com/charsets/cp437.gif">DOS</a>
  character set, nowadays usually known as Code Page 437, but sometimes also
 named LatinUS, ECS (Extended Character Set) or PC-8; note that the table
 displayed in the link also contains the standard ASCII part</li><li>the <a href="http://czyborra.com/charsets/cp1252.gif">ANSI</a> character set, also known as Code Page 1252, and usually the default on Windows</li><li>the <a href="http://czyborra.com/charsets/iso8859-1.gif">ISO-8859-1</a> character set (also called Latin-1), which is an ISO standard for Western European languages, mostly used on various Unices</li><li>the <a href="http://czyborra.com/charsets/adobe-stdenc.gif">Adobe Standard Encoding</a>, which is by default used in Postscript, unless overridden</li>
 </ul>
 
+
 And these are only examples of character sets used in West-European languages.
 &nbsp;For Japanese, Chinese, Korean, Vietnamese, ... there are separate character
 sets in which one byte's meaning can even be influenced by what the previous
 byte was, i.e. these are multi-byte character sets. &nbsp;This is because
 even 256 characters (the maximum for 8 bits) is totally inadequate to represent all characters in
 such languages.<br>
+
 <br>
 
+
 So, summarizing, if a text file contains a byte that has a value 65, it is
 pretty safe to assume that this byte represents an "A", if we ignore the
 multi-byte character sets spoken of before. &nbsp;However, a value 233 cannot
@@ -50,17 +70,241 @@ be interpreted without knowing in which character set the text file is written.
 &nbsp;In Latin-1, it happens to be the character "&eacute;", but in another
 character set it can be something totally different (e.g. in the DOS character
 set it is the Greek letter theta).<br>
+
+<br>
+
+<div align="center"><img src="images/fromhex.png" alt="Conversion from byte value" width="386" height="140">
 <br>
+</div>
+
+
+<br>
+
 
 Vice versa, if you need to write a character "&eacute;" to a file, it depends
 on the character set you will use what the numerical value will be in the
 file: in Latin-1 it will be 233, but if you use the DOS character set it
-will be 130.<br>
+will be 130, making it necessary again to know the encoding when you want to re-read the file.<br>
+
+<br>
+
+<div align="center"><img src="images/tohex.png" alt="Conversion to byte value" width="382" height="140">
+<br>
+<br>
+<div align="left">This is a source of great confusion as soon as you go outside
+the normal English character set, especially when you are using files on
+different systems...<br>
+</div>
+</div>
+
+
 <hr width="100%" size="2">
-<h2><a name="Unicode"></a>Unicode</h2>
+<h2><a name="Unicode"></a>Unicode code points</h2>
+
+
+Enter the&nbsp;<a href="www.unicode.org">Unicode</a> standard...<br>
+
+<br>
+
+Unicode solves the problem of encoding by assigning unique numbers to <b>every</b>
+ character that is used anywhere in the world. &nbsp;Since it is not possible
+to do this in 8 bits (with a maximum of 256 code positions), a Unicode character
+is usually represented by 16 bits, denoted by U+0000 to U+FFFF in hexadecimal
+style. &nbsp;A number such as U+0123 is named a "code point".<br>
+
+<br>
+ Recently (Unicode 3.1), some extensions have even been defined so that in 
+fact the defined range is now U+000000 to U+10FFFF (21 bits), and formally,
+the character set is defined as 31-bits to allow for future expansion.<br>
+
+<br>
+
+The Unicode character set is backward compatible with the ISO-8859-1 or Latin-1
+character set (and thus automatically also with the ASCII character set),
+because for every ISO-8859-1 character with hexadecimal value 0xXY, the corresponding
+Unicode code point is U+00XY.<br>
+
+<br>
+ Some examples of Unicode code points (some of the characters here may not
+be displayed correctly in all browsers; current Mozilla works perfectly for
+this, but it also depends on the installed fonts of course):<br>
+<br>
+
+<table cellpadding="2" cellspacing="2" border="1" width="50%" align="center">
+
+  <tbody>
+    <tr>
+      <td valign="top" align="center"><b>Unicode code point</b><br>
+      </td>
+      <td valign="top" align="center"><b>Character</b><br>
+      </td>
+    </tr>
+    <tr>
+      <td valign="top">U+0041<br>
+      </td>
+      <td valign="top">A<br>
+      </td>
+    </tr>
+    <tr>
+      <td valign="top">U+00E9<br>
+      </td>
+      <td valign="top">&eacute;<br>
+      </td>
+    </tr>
+    <tr>
+      <td valign="top">U+03B8<br>
+      </td>
+      <td valign="top">&#952; (the Greek theta)<br>
+      </td>
+    </tr>
+    <tr>
+      <td valign="top">U+20AC<br>
+      </td>
+      <td valign="top">&#8364; (the euro)<br>
+      </td>
+    </tr>
+  </tbody>
+</table>
+
+<br>
+Using the Unicode code points there is no confusion anymore which character
+is meant, because they uniquely define the character. &nbsp;The full Unicode
+code charts can be found <a href="http://www.unicode.org/charts">here</a>
+ (as a set of PDF documents). &nbsp;A nice application to see all Unicode
+characters is the Unicode Character Map (ucm), which can be found <a href="ftp://ftp.dcs.ed.ac.uk/pub/jec/programs/">here</a>, and which allows to select and paste any Unicode character.<br>
+<br>
+Some additional terminology (more terminology follows in the next section):<br>
+<ul>
+  <li><b>UCS</b> (Universal Character Set): the official name for the Unicode character set<br>
+    <br>
+  </li>
+  <li><b>ISO 10646</b>: the international standard that defines the Unicode character set<br>
+    <br>
+  </li>
+  <li><b>BMP</b> (Basic Multilingual Plane) or Plane 0 is the 16-bit subset
+of UCS, i.e. the characters U+0000 to U+FFFF, which is supposed to cover
+all characters is all currently used languages. &nbsp;Code points outside that range are used for historical character sets (e.g. hieroglyphs) and special symbols.</li>
+</ul>
+<hr width="100%" size="2">
+<h2><a name="Unicode_encodings_UTF-8"></a>Unicode encodings, UTF-8</h2>
+Since Unicode characters are generally represented by a number that is 16
+bits wide, as seen above (for the basic plane), it would seem that all text
+files would double in size, since the usual ASCII characters are 8 bits wide.
+&nbsp;However, the Unicode code points are not necessarily the values that
+are written to files... &nbsp;<br>
+<br>
+Indeed, the simplest solution is to take the code point that defines a character,
+split it up into two bytes, and write the two bytes to the file. &nbsp;This
+is called the UCS-2 encoding scheme:<br>
+<br>
+<table cellpadding="2" cellspacing="2" border="1" width="75%" align="center">
+  <tbody>
+    <tr>
+      <td valign="top" align="center"><b>Character</b><br>
+      </td>
+      <td valign="top" align="center"><b>Unicode code point</b><br>
+      </td>
+      <td valign="top" align="center"><b>Byte values in file (UCS-2)</b><br>
+      </td>
+    </tr>
+    <tr>
+      <td valign="top">A<br>
+      </td>
+      <td valign="top">U+0041<br>
+      </td>
+      <td valign="top">0x00, 0x41<br>
+      </td>
+    </tr>
+    <tr>
+      <td valign="top">&eacute;<br>
+      </td>
+      <td valign="top">U+00E9<br>
+      </td>
+      <td valign="top">0x00, 0xE9<br>
+      </td>
+    </tr>
+    <tr>
+      <td valign="top">&#952; (theta)<br>
+      </td>
+      <td valign="top">U+03B8<br>
+      </td>
+      <td valign="top">0x03, 0xB8<br>
+      </td>
+    </tr>
+    <tr>
+      <td valign="top">&#8364; (euro)<br>
+      </td>
+      <td valign="top">U+20AC<br>
+      </td>
+      <td valign="top">0x20, 0xAC<br>
+      </td>
+    </tr>
+  </tbody>
+</table>
+<br>
+This table assumes a big-endian encoding of UCS-2: the endianness is in principle
+not defined, so there are two versions of UCS-2. &nbsp;The little-endian
+encoding results in the same values as in the table above, but in the inverse
+order.<br>
+<br>
+So, we see that the UCS-2 encoding results in a doubling of file sizes for
+files that contain only English text. &nbsp;This is a disadvantage for this
+encoding. &nbsp;Another disadvantage is that null bytes can occur in normal
+text, breaking all conventions for null-terminated C strings if you use the
+normal <code>char</code> type. &nbsp;This is why C also defines the <code>wchar_t</code>
+ type, which can hold a 32-bit character (at least in GNU systems). &nbsp;To
+avoid both of these disadvantages, UTF-8 was introduced.<br>
+<br>
+In UTF-8, the number of bytes used to write a character to a file depends
+on the Unicode code point. &nbsp;The corresponding table to the table above
+is:<br>
+<br>
+<table cellpadding="2" cellspacing="2" border="1" width="75%" align="center">
+<tbody><tr><td valign="top" align="center"><b>Character</b><br></td><td valign="top" align="center"><b>Unicode code point</b><br></td><td valign="top" align="center"><b>Byte values in file (UTF-8)</b><br></td></tr><tr><td valign="top">A<br></td><td valign="top">U+0041<br></td><td valign="top">0x41<br></td></tr><tr><td valign="top">&eacute;<br></td><td valign="top">U+00E9<br></td><td valign="top">0xC3, 0xA9<br></td></tr><tr><td valign="top">&#952; (theta)<br></td><td valign="top">U+03B8<br></td><td valign="top">0xCE, 0xB8<br></td></tr><tr><td valign="top">&#8364; (euro)<br></td><td valign="top">U+20AC<br></td><td valign="top">0xE2, 0x82, 0xAC<br></td></tr></tbody>
+</table>
+<br>
+Some immediate observations:<br>
+<ul>
+  <li>Null characters don't occur, so there is no problem to put such characters
+in a null-terminated C string (without having to use the <code>wchar_t</code> type), although it is a little more difficult to get the length of the string.<br>
+    <br>
+  </li>
+  <li>Strict ASCII characters are encoded into 1 byte, which makes UTF-8
+completely backward compatible with ASCII. &nbsp;It doesn't change the size
+of normal ASCII text files.<br>
+    <br>
+  </li>
+  <li>Some characters need 3 bytes in UTF-8. &nbsp;Indeed, all basic plane
+characters (U+0000 to U+FFFF) can be encoded in 1, 2 or 3 bytes.</li>
+</ul>
+An excellent explanation of how to encode characters in UTF-8 can be found <a href="http://www.cl.cam.ac.uk/%7Emgk25/unicode.html#utf-8">on this page</a>.<br>
+<br>
+Some additional terminology regarding encoding schemes (less important here):<br>
+<ul>
+  <li><b>UCS-2</b>: the encoding of Unicode characters in 2 bytes (16 bits) per character, can only encode U+0000 to U+FFFF (the basic plane)<br>
+    <br>
+  </li>
+  <li><b>UCS-4</b>: the encoding of Unicode characters in 4 bytes (32 bits) per character, can encode the entire UCS range<br>
+    <br>
+  </li>
+  <li><b>UTF-8</b>: the encoding scheme described above (UTF = UCS Transformation Format)<br>
+    <br>
+  </li>
+  <li><b>UTF-16</b>: an extension to UCS-2 to be able to encode characters
+outside the basic plane (the 21-bit range of Unicode 3.1), by using a sequence
+of two 16-bit characters<br>
+    <br>
+  </li>
+  <li><b>UTF-32</b>: the 4-byte encoding of the 21-bit range of Unicode 3.1, in fact the same as UCS-4<br>
+    <br>
+  </li>
+  <li><b>UTF-7</b>: can be safely forgotten... (a kind of 7-bit clean version of UTF-8)</li>
+</ul>
+Note that the byte order of UCS-2, UCS-4, UTF-16 and UTF-32 is not defined, so it can be big endian or little endian !<br>
 
-Enter the Unicode standard...<br>
 <hr width="100%" size="2">
 <pre><font size="-1">$Id$<br>$Name$</font><br></pre>
+
 <br>
+
 </body></html>
\ No newline at end of file
diff --git a/doc/images/fromhex.obj b/doc/images/fromhex.obj
new file mode 100644 (file)
index 0000000..ba18018
--- /dev/null
@@ -0,0 +1,110 @@
+%TGIF 4.1.39
+state(0,37,100.000,0,0,0,16,0,9,1,1,1,0,1,0,1,0,'Symbol',0,144000,0,0,0,10,0,0,1,1,0,16,0,0,1,1,1,1,1088,1408,1,0,2880,0).
+%
+% @(#)$Header$
+% %W%
+%
+unit("1 pixel/pixel").
+color_info(11,65535,0,[
+       "magenta", 65535, 0, 65535, 65535, 0, 65535, 1,
+       "red", 65535, 0, 0, 65535, 0, 0, 1,
+       "green", 0, 65535, 0, 0, 65535, 0, 1,
+       "blue", 0, 0, 65535, 0, 0, 65535, 1,
+       "yellow", 65535, 65535, 0, 65535, 65535, 0, 1,
+       "pink", 65535, 49344, 52171, 65535, 49344, 52171, 1,
+       "cyan", 0, 65535, 65535, 0, 65535, 65535, 1,
+       "CadetBlue", 24415, 40606, 41120, 24415, 40606, 41120, 1,
+       "white", 65535, 65535, 65535, 65535, 65535, 65535, 1,
+       "black", 0, 0, 0, 0, 0, 0, 1,
+       "DarkSlateGray", 12079, 20303, 20303, 12079, 20303, 20303, 1
+]).
+script_frac("0.6").
+fg_bg_colors('black','white').
+page(1,"",1,'').
+text('black',64,18,1,1,1,92,17,0,14,3,0,0,0,0,2,92,17,0,0,"",0,0,0,0,32,'',[
+minilines(92,17,0,0,1,0,0,[
+mini_line(92,14,3,0,0,0,[
+str_block(0,92,14,3,0,0,0,0,0,[
+str_seg('black','Times-Roman',0,80640,92,14,3,0,0,0,0,0,0,0,
+       "byte value in file")])
+])
+])]).
+text('black',334,18,1,0,1,56,17,4,14,3,0,0,0,0,2,56,17,0,0,"",0,0,0,0,32,'',[
+minilines(56,17,0,0,0,0,0,[
+mini_line(56,14,3,0,0,0,[
+str_block(0,56,14,3,0,0,0,0,0,[
+str_seg('black','Times-Roman',0,80640,56,14,3,0,0,0,0,0,0,0,
+       "character")])
+])
+])]).
+text('black',8,45,1,0,1,120,24,6,19,5,0,0,0,0,2,120,24,0,0,"",0,0,0,0,64,'',[
+minilines(120,24,0,0,0,0,0,[
+mini_line(120,19,5,0,0,0,[
+str_block(0,120,19,5,0,-8,0,0,0,[
+str_seg('black','Courier',0,144000,120,19,5,0,-8,0,0,0,0,0,
+       "65(0x41)")])
+])
+])]).
+text('black',354,45,1,0,1,15,24,8,19,5,0,0,0,0,2,15,24,0,0,"",0,0,0,0,64,'',[
+minilines(15,24,0,0,0,0,0,[
+mini_line(15,19,5,0,0,0,[
+str_block(0,15,19,5,0,-1,0,0,0,[
+str_seg('black','Courier',0,144000,15,19,5,0,-1,0,0,0,0,0,
+       "A")])
+])
+])]).
+poly('black','',2,[
+       146,56,338,56],1,2,1,10,0,0,0,0,0,0,0,'2',0,0,
+    "0","",[
+    0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
+]).
+text('black',7,102,1,0,1,135,24,44,19,5,0,0,0,0,2,135,24,0,0,"",0,0,0,0,121,'',[
+minilines(135,24,0,0,0,0,0,[
+mini_line(135,19,5,0,0,0,[
+str_block(0,135,19,5,0,-8,0,0,0,[
+str_seg('black','Courier',0,144000,135,19,5,0,-8,0,0,0,0,0,
+       "233(0xE9)")])
+])
+])]).
+poly('black','',2,[
+       146,112,338,112],1,2,1,46,0,0,0,0,0,0,0,'2',0,0,
+    "0","",[
+    0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
+]).
+text('black',349,94,1,0,1,17,29,51,23,6,0,0,0,0,2,17,29,0,0,"",0,0,0,0,117,'',[
+minilines(17,29,0,0,0,0,0,[
+mini_line(17,23,6,0,0,0,[
+str_block(0,17,23,6,0,-1,0,0,0,[
+str_seg('black','Times-Roman',0,144000,17,23,6,0,-1,0,0,0,0,0,
+       " \351")])
+])
+])]).
+text('black',281,90,1,1,1,110,17,53,14,3,0,0,0,0,2,110,17,0,0,"",0,0,0,0,104,'',[
+minilines(110,17,0,0,1,0,0,[
+mini_line(110,14,3,0,0,0,[
+str_block(0,110,14,3,0,-1,0,0,0,[
+str_seg('black','Times-Roman',0,80640,110,14,3,0,-1,0,0,0,0,0,
+       "iso-8859-1, ANSI")])
+])
+])]).
+poly('black','',3,[
+       178,112,178,144,339,144],1,2,1,55,0,0,0,0,0,0,0,'2',0,0,
+    "0","",[
+    0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
+]).
+text('black',355,131,1,0,1,12,24,64,19,5,0,0,0,0,2,12,24,0,0,"",0,0,0,0,150,'',[
+minilines(12,24,0,0,0,0,0,[
+mini_line(12,19,5,0,0,0,[
+str_block(0,12,19,5,0,-1,0,0,0,[
+str_seg('black','Symbol',0,144000,12,19,5,0,-1,0,0,0,0,0,
+       "q")])
+])
+])]).
+text('black',319,122,1,1,1,28,17,68,14,3,0,0,0,0,2,28,17,0,0,"",0,0,0,0,136,'',[
+minilines(28,17,0,0,1,0,0,[
+mini_line(28,14,3,0,0,0,[
+str_block(0,28,14,3,0,-1,0,0,0,[
+str_seg('black','Times-Roman',0,80640,28,14,3,0,-1,0,0,0,0,0,
+       "DOS")])
+])
+])]).
diff --git a/doc/images/fromhex.png b/doc/images/fromhex.png
new file mode 100644 (file)
index 0000000..ee2a485
Binary files /dev/null and b/doc/images/fromhex.png differ
diff --git a/doc/images/tohex.obj b/doc/images/tohex.obj
new file mode 100644 (file)
index 0000000..4fff9a7
--- /dev/null
@@ -0,0 +1,110 @@
+%TGIF 4.1.39
+state(0,37,100.000,0,0,0,16,0,9,1,1,1,0,2,0,1,0,'Symbol',0,144000,0,0,0,10,0,0,1,1,0,16,0,0,1,1,1,1,1088,1408,1,0,2880,0).
+%
+% @(#)$Header$
+% %W%
+%
+unit("1 pixel/pixel").
+color_info(11,65535,0,[
+       "magenta", 65535, 0, 65535, 65535, 0, 65535, 1,
+       "red", 65535, 0, 0, 65535, 0, 0, 1,
+       "green", 0, 65535, 0, 0, 65535, 0, 1,
+       "blue", 0, 0, 65535, 0, 0, 65535, 1,
+       "yellow", 65535, 65535, 0, 65535, 65535, 0, 1,
+       "pink", 65535, 49344, 52171, 65535, 49344, 52171, 1,
+       "cyan", 0, 65535, 65535, 0, 65535, 65535, 1,
+       "CadetBlue", 24415, 40606, 41120, 24415, 40606, 41120, 1,
+       "white", 65535, 65535, 65535, 65535, 65535, 65535, 1,
+       "black", 0, 0, 0, 0, 0, 0, 1,
+       "DarkSlateGray", 12079, 20303, 20303, 12079, 20303, 20303, 1
+]).
+script_frac("0.6").
+fg_bg_colors('black','white').
+page(1,"",1,'').
+text('black',64,18,1,1,1,92,17,0,14,3,0,0,0,0,2,92,17,0,0,"",0,0,0,0,32,'',[
+minilines(92,17,0,0,1,0,0,[
+mini_line(92,14,3,0,0,0,[
+str_block(0,92,14,3,0,0,0,0,0,[
+str_seg('black','Times-Roman',0,80640,92,14,3,0,0,0,0,0,0,0,
+       "byte value in file")])
+])
+])]).
+text('black',329,18,1,0,1,56,17,4,14,3,0,0,0,0,2,56,17,0,0,"",0,0,0,0,32,'',[
+minilines(56,17,0,0,0,0,0,[
+mini_line(56,14,3,0,0,0,[
+str_block(0,56,14,3,0,0,0,0,0,[
+str_seg('black','Times-Roman',0,80640,56,14,3,0,0,0,0,0,0,0,
+       "character")])
+])
+])]).
+text('black',7,44,1,0,1,120,24,6,19,5,0,0,0,0,2,120,24,0,0,"",0,0,0,0,63,'',[
+minilines(120,24,0,0,0,0,0,[
+mini_line(120,19,5,0,0,0,[
+str_block(0,120,19,5,0,-8,0,0,0,[
+str_seg('black','Courier',0,144000,120,19,5,0,-8,0,0,0,0,0,
+       "65(0x41)")])
+])
+])]).
+text('black',349,45,1,0,1,15,24,8,19,5,0,0,0,0,2,15,24,0,0,"",0,0,0,0,64,'',[
+minilines(15,24,0,0,0,0,0,[
+mini_line(15,19,5,0,0,0,[
+str_block(0,15,19,5,0,-1,0,0,0,[
+str_seg('black','Courier',0,144000,15,19,5,0,-1,0,0,0,0,0,
+       "A")])
+])
+])]).
+poly('black','',2,[
+       141,56,333,56],2,2,1,10,0,0,0,0,0,0,0,'2',0,0,
+    "0","",[
+    0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
+]).
+text('black',6,100,1,0,1,135,24,44,19,5,0,0,0,0,2,135,24,0,0,"",0,0,0,0,119,'',[
+minilines(135,24,0,0,0,0,0,[
+mini_line(135,19,5,0,0,0,[
+str_block(0,135,19,5,0,-8,0,0,0,[
+str_seg('black','Courier',0,144000,135,19,5,0,-8,0,0,0,0,0,
+       "233(0xE9)")])
+])
+])]).
+poly('black','',2,[
+       141,112,333,112],2,2,1,46,0,0,0,0,0,0,0,'2',0,0,
+    "0","",[
+    0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
+]).
+text('black',344,94,1,0,1,17,29,51,23,6,0,0,0,0,2,17,29,0,0,"",0,0,0,0,117,'',[
+minilines(17,29,0,0,0,0,0,[
+mini_line(17,23,6,0,0,0,[
+str_block(0,17,23,6,0,-1,0,0,0,[
+str_seg('black','Times-Roman',0,144000,17,23,6,0,-1,0,0,0,0,0,
+       " \351")])
+])
+])]).
+text('black',201,89,1,1,1,110,17,53,14,3,0,0,0,0,2,110,17,0,0,"",0,0,0,0,103,'',[
+minilines(110,17,0,0,1,0,0,[
+mini_line(110,14,3,0,0,0,[
+str_block(0,110,14,3,0,-1,0,0,0,[
+str_seg('black','Times-Roman',0,80640,110,14,3,0,-1,0,0,0,0,0,
+       "iso-8859-1, ANSI")])
+])
+])]).
+poly('black','',3,[
+       302,112,302,144,141,144],1,2,1,55,0,0,0,0,0,0,0,'2',0,0,
+    "0","",[
+    0,10,4,0,'10','4','0'],[0,10,4,0,'10','4','0'],[
+]).
+text('black',159,122,1,1,1,28,17,68,14,3,0,0,0,0,2,28,17,0,0,"",0,0,0,0,136,'',[
+minilines(28,17,0,0,1,0,0,[
+mini_line(28,14,3,0,0,0,[
+str_block(0,28,14,3,0,-1,0,0,0,[
+str_seg('black','Times-Roman',0,80640,28,14,3,0,-1,0,0,0,0,0,
+       "DOS")])
+])
+])]).
+text('black',6,131,1,0,1,135,24,78,19,5,0,0,0,0,2,135,24,0,0,"",0,0,0,0,150,'',[
+minilines(135,24,0,0,0,0,0,[
+mini_line(135,19,5,0,0,0,[
+str_block(0,135,19,5,0,-8,0,0,0,[
+str_seg('black','Courier',0,144000,135,19,5,0,-8,0,0,0,0,0,
+       "130(0x82)")])
+])
+])]).
diff --git a/doc/images/tohex.png b/doc/images/tohex.png
new file mode 100644 (file)
index 0000000..b79391b
Binary files /dev/null and b/doc/images/tohex.png differ