From 37246a4b0ab22fb948d6fb5b9b91917441db26cf Mon Sep 17 00:00:00 2001
From: Peter Verthez <Peter.Verthez@advalvas.be>
Date: Sat, 12 Jan 2002 13:13:14 +0000
Subject: [PATCH] Moved encoding introduction to separate html page.

---
 doc/Makefile.am   |  3 ++-
 doc/encoding.html | 66 +++++++++++++++++++++++++++++++++++++++++++++++
 doc/parser.html   | 57 ++++------------------------------------
 3 files changed, 73 insertions(+), 53 deletions(-)
 create mode 100644 doc/encoding.html

diff --git a/doc/Makefile.am b/doc/Makefile.am
index 7ce3c3f..5bf8e72 100644
--- a/doc/Makefile.am
+++ b/doc/Makefile.am
@@ -3,7 +3,8 @@
 # $Name$
 SUBDIRS = images .
-DOC_FILES = index.html usage.html parser.html interface.html links.html
+DOC_FILES = index.html usage.html parser.html interface.html links.html \
+ encoding.html
 VERSIONED_FILES = index.html
 EXTRA_DIST = $(DOC_FILES)
 docdir = $(datadir)/doc/@PACKAGE@-@VERSION@
diff --git a/doc/encoding.html b/doc/encoding.html
new file mode 100644
index 0000000..a8f0a1e
--- /dev/null
+++ b/doc/encoding.html
@@ -0,0 +1,66 @@
+<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head><title>Character encoding</title><meta http-equiv="content-type" content="text/html; charset=ISO-8859-1"></head><body>
+<h1 align="center">Character encoding</h1>
+<br>
+<h2>Index</h2>
+<ul>
+ <li><a href="#The_character_encoding_problem">The character encoding problem</a></li>
+ <li><a href="#Unicode">Unicode</a><br>
+ </li>
+</ul>
+<br>
+<hr width="100%" size="2">
+<h2><a name="The_character_encoding_problem"></a>The character encoding problem</h2>
+
+Developers are usually familiar with the ASCII character set. This
+is a character set that assigns a unique number to some characters, e.g.
+an "A" has ASCII code 65 (or 0x41 in hex), and an "a" has ASCII code 97 (or
+0x61 in hex). Some people may also have used ASCII codes for several
+drawing characters (such as a horizontal bar, a vertical bar, or a top-right
+corner) in the old DOS days, to be able to draw nice windows in text mode.<br>
+<br>
+
+However, these last characters are strictly spoken not part of the ASCII
+set. The standard ASCII set contains only the character positions from
+0 to 127 (i.e. anything that fits into an integer that is 7 bits wide). An
+example of this table can be found <a href="http://web.cs.mun.ca/%7Emichael/c/ascii-table.html">here</a>. Anything that has an ASCII code between 128 and 255 is in principle undefined.<br>
+<br>
+
+Now, several systems (including the old DOS) have defined those character
+positions anyway, but usually in totally different ways. Some well
+known extensions are:<br>
+<ul>
+<li>the <a href="http://czyborra.com/charsets/cp437.gif">DOS</a>
+ character set, nowadays usually known as Code Page 437, but sometimes also
+named LatinUS, ECS (Extended Character Set) or PC-8; note that the table
+displayed in the link also contains the standard ASCII part</li><li>the <a href="http://czyborra.com/charsets/cp1252.gif">ANSI</a> character set, also known as Code Page 1252, and usually the default on Windows</li><li>the <a href="http://czyborra.com/charsets/iso8859-1.gif">ISO-8859-1</a> character set (also called Latin-1), which is an ISO standard for Western European languages, mostly used on various Unices</li><li>the <a href="http://czyborra.com/charsets/adobe-stdenc.gif">Adobe Standard Encoding</a>, which is by default used in Postscript, unless overridden</li>
+</ul>
+
+And these are only examples of character sets used in West-European languages.
+ For Japanese, Chinese, Korean, Vietnamese, ... there are separate character
+sets in which one byte's meaning can even be influenced by what the previous
+byte was, i.e. these are multi-byte character sets. This is because
+even 256 characters (the maximum for 8 bits) is totally inadequate to represent all characters in
+such languages.<br>
+<br>
+
+So, summarizing, if a text file contains a byte that has a value 65, it is
+pretty safe to assume that this byte represents an "A", if we ignore the
+multi-byte character sets spoken of before. However, a value 233 cannot
+be interpreted without knowing in which character set the text file is written.
+ In Latin-1, it happens to be the character "é", but in another
+character set it can be something totally different (e.g. in the DOS character
+set it is the Greek letter theta).<br>
+<br>
+
+Vice versa, if you need to write a character "é" to a file, it depends
+on the character set you will use what the numerical value will be in the
+file: in Latin-1 it will be 233, but if you use the DOS character set it
+will be 130.<br>
+<hr width="100%" size="2">
+<h2><a name="Unicode"></a>Unicode</h2>
+
+Enter the Unicode standard...<br>
+<hr width="100%" size="2">
+<pre><font size="-1">$Id$<br>$Name$</font><br></pre>
+<br>
+</body></html>
\ No newline at end of file
diff --git a/doc/parser.html b/doc/parser.html
index fbaefdf..893670c 100644
--- a/doc/parser.html
+++ b/doc/parser.html
@@ -1,9 +1,9 @@
 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head>
- <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1"><title>The Gedcom parser library</title></head><body text="#000000" bgcolor="#ffffff" link="#000099" vlink="#990099" alink="#000099">
+ <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1"><title>The Gedcom parser library internals</title></head><body text="#000000" bgcolor="#ffffff" link="#000099" vlink="#990099" alink="#000099">
 <div align="center">
-<h1>The Gedcom parser library</h1>
+<h1>The Gedcom parser library internals</h1>
 <div align="left">The intention of this page is to provide some explanation of the gedcom parser, to aid development on and with it. First,
@@ -151,56 +151,9 @@ the file.<br>
 <br>
 This basic description ignores the problem of character encoding. The next section describes what this problem exactly is.<br>
 <br>
- <h3><a name="Character_encoding"></a>Character encoding</h3>
- <h4><i>The character encoding problem</i><br>
- </h4>
-Developers are usually familiar with the ASCII character set. This
-is a character set that assigns a unique number to some characters, e.g.
-an "A" has ASCII code 65 (or 0x41 in hex), and an "a" has ASCII code 97 (or
-0x61 in hex). Some people may also have used ASCII codes for several
-drawing characters (such as a horizontal bar, a vertical bar, or a top-right
-corner) in the old DOS days, to be able to draw nice windows in text mode.<br>
- <br>
-However, these last characters are strictly spoken not part of the ASCII
-set. The standard ASCII set contains only the character positions from
-0 to 127 (i.e. anything that fits into an integer that is 7 bits wide). An
-example of this table can be found <a href="http://web.cs.mun.ca/%7Emichael/c/ascii-table.html">here</a>. Anything that has an ASCII code between 128 and 255 is in principle undefined.<br>
- <br>
-Now, several systems (including the old DOS) have defined those character
-positions anyway, but usually in totally different ways. Some well
-known extensions are:<br>
- <ul>
- <li>the <a href="http://czyborra.com/charsets/cp437.gif">DOS</a>
- character set, nowadays usually known as Code Page 437, but sometimes also
-named LatinUS, ECS (Extended Character Set) or PC-8; note that the table
-displayed in the link also contains the standard ASCII part</li>
- <li>the <a href="http://czyborra.com/charsets/cp1252.gif">ANSI</a> character set, also known as Code Page 1252, and usually the default on Windows</li>
- <li>the <a href="http://czyborra.com/charsets/iso8859-1.gif">ISO-8859-1</a> character set (also called Latin-1), which is an ISO standard for Western European languages, mostly used on various Unices</li>
- <li>the <a href="http://czyborra.com/charsets/adobe-stdenc.gif">Adobe Standard Encoding</a>, which is by default used in Postscript, unless overridden</li>
- </ul>
-And these are only examples of character sets used in West-European languages.
- For Japanese, Chinese, Korean, Vietnamese, ... there are separate character
-sets in which one byte's meaning can even be influenced by what the previous
-byte was, i.e. these are multi-byte character sets. This is because
-even 256 characters is totally inadequate to represent all characters in
-such languages.<br>
- <br>
-So, summarizing, if a text file contains a byte that has a value 65, it is
-pretty safe to assume that this byte represents an "A", if we ignore the
-multi-byte character sets spoken of before. However, a value 233 cannot
-be interpreted without knowing in which character set the text file is written.
- In Latin-1, it happens to be the character "é", but in another
-character set it can be something totally different (e.g. in the DOS character
-set it is the Greek letter theta).<br>
- <br>
-Vice versa, if you need to write a character "é" to a file, it depends
-on the character set you will use what the numerical value will be in the
-file: in Latin-1 it will be 233, but if you use the DOS character set it
-will be 130.<br>
- <br>
- <h4><i>Unicode</i></h4>
-Enter the Unicode standard...<br>
- <br>
+ <h3><a name="Character_encoding"></a>Character encoding</h3>Refer to <a href="encoding.html">this page</a> for some introduction on character encoding...<br>
+
+ <h4></h4><br>
 TO BE COMPLETED<br>
-- 
2.30.2
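
Note for readers of this patch: the byte values quoted in doc/encoding.html (233 and 130 for "é", theta for byte 233 in the DOS character set) can be checked with a short script. This is only a sketch and not part of the patch. It assumes Python 3, whose built-in `latin-1` and `cp437` codecs stand in here for ISO-8859-1 and Code Page 437; the helper names `decode_byte` and `encode_char` are made up for illustration.

```python
# Sanity check of the byte values quoted in doc/encoding.html.
# Assumption: Python's built-in "latin-1" and "cp437" codecs are used as
# stand-ins for ISO-8859-1 and the DOS character set (Code Page 437).
# decode_byte and encode_char are hypothetical helper names.


def decode_byte(value: int, charset: str) -> str:
    """Interpret a single byte value as a character in the given character set."""
    return bytes([value]).decode(charset)


def encode_char(char: str, charset: str) -> int:
    """Return the byte value that represents char in a single-byte character set."""
    encoded = char.encode(charset)
    if len(encoded) != 1:
        raise ValueError(f"{char!r} is not a single byte in {charset}")
    return encoded[0]


E_ACUTE = "\u00e9"  # the character "é"

if __name__ == "__main__":
    # Byte 65 lies in the ASCII range, so both character sets agree on it.
    print(decode_byte(65, "latin-1"), decode_byte(65, "cp437"))

    # Byte 233 lies outside ASCII, so its meaning depends on the character set.
    print(decode_byte(233, "latin-1"), decode_byte(233, "cp437"))

    # Writing "é" yields a different byte value in each character set.
    print(encode_char(E_ACUTE, "latin-1"), encode_char(E_ACUTE, "cp437"))
```

Going through `bytes.decode` and `str.encode` keeps the check tied to the codec tables themselves rather than to hand-copied numbers.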