From 37246a4b0ab22fb948d6fb5b9b91917441db26cf Mon Sep 17 00:00:00 2001 From: Peter Verthez Date: Sat, 12 Jan 2002 13:13:14 +0000 Subject: [PATCH] Moved encoding introduction to separate html page. --- doc/Makefile.am | 3 ++- doc/encoding.html | 66 +++++++++++++++++++++++++++++++++++++++++++++++ doc/parser.html | 57 ++++------------------------------------ 3 files changed, 73 insertions(+), 53 deletions(-) create mode 100644 doc/encoding.html diff --git a/doc/Makefile.am b/doc/Makefile.am index 7ce3c3f..5bf8e72 100644 --- a/doc/Makefile.am +++ b/doc/Makefile.am @@ -3,7 +3,8 @@ # $Name$ SUBDIRS = images . -DOC_FILES = index.html usage.html parser.html interface.html links.html +DOC_FILES = index.html usage.html parser.html interface.html links.html \ + encoding.html VERSIONED_FILES = index.html EXTRA_DIST = $(DOC_FILES) docdir = $(datadir)/doc/@PACKAGE@-@VERSION@ diff --git a/doc/encoding.html b/doc/encoding.html new file mode 100644 index 0000000..a8f0a1e --- /dev/null +++ b/doc/encoding.html @@ -0,0 +1,66 @@ +Character encoding +

Character encoding

+
+

Index

+ +
+
+

The character encoding problem

+ +Developers are usually familiar with the ASCII character set.  This +is a character set that assigns a unique number to some characters, e.g. +an "A" has ASCII code 65 (or 0x41 in hex), and an "a" has ASCII code 97 (or +0x61 in hex).  Some people may also have used ASCII codes for several +drawing characters (such as a horizontal bar, a vertical bar, or a top-right +corner) in the old DOS days, to be able to draw nice windows in text mode.
+
+ +However, these last characters are, strictly speaking, not part of the ASCII +set.  The standard ASCII set contains only the character positions from +0 to 127 (i.e. anything that fits into an integer that is 7 bits wide).  An +example of this table can be found here.  Anything that has a code between 128 and 255 is in principle undefined in ASCII.
+
+ +Now, several systems (including the old DOS) have defined those character +positions anyway, but usually in totally different ways.  Some +well-known extensions are:
+ + +And these are only examples of character sets used in Western European languages. + For Japanese, Chinese, Korean, Vietnamese, ... there are separate character +sets in which one byte's meaning can even be influenced by what the previous +byte was, i.e. these are multi-byte character sets.  This is because +even 256 characters (the maximum for 8 bits) are totally inadequate to represent all characters in +such languages.
+
+ +To summarize, if a text file contains a byte with the value 65, it is +pretty safe to assume that this byte represents an "A", if we ignore the +multi-byte character sets mentioned above.  However, a value 233 cannot +be interpreted without knowing in which character set the text file is written. + In Latin-1, it happens to be the character "é", but in another +character set it can be something totally different (e.g. in the DOS character +set it is the Greek capital letter theta).
+
+ +Conversely, if you need to write the character "é" to a file, the +numerical value stored in the file depends on the character set you use: +in Latin-1 it will be 233, but if you use the DOS character set it +will be 130.
+
+

Unicode

+ +Enter the Unicode standard...
+
+
$Id$
$Name$

+
+ \ No newline at end of file diff --git a/doc/parser.html b/doc/parser.html index fbaefdf..893670c 100644 --- a/doc/parser.html +++ b/doc/parser.html @@ -1,9 +1,9 @@ - The Gedcom parser library + The Gedcom parser library internals
-

The Gedcom parser library

+

The Gedcom parser library internals

The intention of this page is to provide some explanation of the gedcom parser, to aid development on and with it.  First, @@ -151,56 +151,9 @@ the file.

This basic description ignores the problem of character encoding.  The next section describes what this problem exactly is.

-

Character encoding

-

The character encoding problem
-

-Developers are usually familiar with the ASCII character set.  This -is a character set that assigns a unique number to some characters, e.g. -an "A" has ASCII code 65 (or 0x41 in hex), and an "a" has ASCII code 97 (or -0x61 in hex).  Some people may also have used ASCII codes for several -drawing characters (such as a horizontal bar, a vertical bar, or a top-right -corner) in the old DOS days, to be able to draw nice windows in text mode.
-
-However, these last characters are strictly spoken not part of the ASCII -set.  The standard ASCII set contains only the character positions from -0 to 127 (i.e. anything that fits into an integer that is 7 bits wide).  An -example of this table can be found here.  Anything that has an ASCII code between 128 and 255 is in principle undefined.
-
-Now, several systems (including the old DOS) have defined those character -positions anyway, but usually in totally different ways.  Some well -known extensions are:
-
    -
  • the DOS - character set, nowadays usually known as Code Page 437, but sometimes also -named LatinUS, ECS (Extended Character Set) or PC-8; note that the table -displayed in the link also contains the standard ASCII part
  • -
  • the ANSI character set, also known as Code Page 1252, and usually the default on Windows
  • -
  • the ISO-8859-1 character set (also called Latin-1), which is an ISO standard for Western European languages, mostly used on various Unices
  • -
  • the Adobe Standard Encoding, which is by default used in Postscript, unless overridden
  • -
-And these are only examples of character sets used in West-European languages. - For Japanese, Chinese, Korean, Vietnamese, ... there are separate character -sets in which one byte's meaning can even be influenced by what the previous -byte was, i.e. these are multi-byte character sets.  This is because -even 256 characters is totally inadequate to represent all characters in -such languages.
-
-So, summarizing, if a text file contains a byte that has a value 65, it is -pretty safe to assume that this byte represents an "A", if we ignore the -multi-byte character sets spoken of before.  However, a value 233 cannot -be interpreted without knowing in which character set the text file is written. - In Latin-1, it happens to be the character "é", but in another -character set it can be something totally different (e.g. in the DOS character -set it is the Greek letter theta).
-
-Vice versa, if you need to write a character "é" to a file, it depends -on the character set you will use what the numerical value will be in the -file: in Latin-1 it will be 233, but if you use the DOS character set it -will be 130.
-
-

Unicode

-Enter the Unicode standard...
-
+

Character encoding

Refer to this page for an introduction to character encoding...
+ +


TO BE COMPLETED
-- 2.30.2