From: Peter Verthez
Date: Sun, 13 Jan 2002 12:20:00 +0000 (+0000)
Subject: A lot more explanation about encodings.
X-Git-Url: https://git.dlugolecki.net.pl/?a=commitdiff_plain;h=dda8b23849c8a7a62e3fd86a24199fbebd50be04;p=gedcom-parse.git

diff --git a/doc/encoding.html b/doc/encoding.html

Character encoding

  • The character encoding problem
  • Unicode code points
  • Unicode encodings, UTF-8

The character encoding problem

Developers are usually familiar with the ASCII character set. This is a character set that assigns a unique number to some characters, e.g. an "A" has ASCII code 65 (or 0x41 in hex), and an "a" has ASCII code 97 (or 0x61 in hex). Some people may also have used ASCII codes for several drawing characters (such as a horizontal bar, a vertical bar, or a top-right corner) in the old DOS days, to be able to draw nice windows in text mode.
However, these last characters are, strictly speaking, not part of the ASCII set. The standard ASCII set contains only the character positions from 0 to 127 (i.e. anything that fits into an integer that is 7 bits wide). An example of this table can be found here. Anything that has an ASCII code between 128 and 255 is in principle undefined.

Now, several systems (including the old DOS) have defined those character positions anyway, but usually in totally different ways. Some well known extensions are:

    Unicode encodings, UTF-8

Since Unicode characters are generally represented by a number that is 16 bits wide, as seen above (for the basic plane), it would seem that all text files would double in size, since the usual ASCII characters are 8 bits wide. However, the Unicode code points are not necessarily the values that are written to files...
    + However, the Unicode code points are not necessarily the values that +are written to files...  

Indeed, the simplest solution is to take the code point that defines a character, split it up into two bytes, and write the two bytes to the file. This is called the UCS-2 encoding scheme:

    Character     Unicode code point     Byte values in file (UCS-2)
    A             U+0041                 0x00, 0x41
    é             U+00E9                 0x00, 0xE9
    θ (theta)     U+03B8                 0x03, 0xB8
    € (euro)      U+20AC                 0x20, 0xAC

This table assumes a big-endian encoding of UCS-2: the endianness is in principle not defined, so there are two versions of UCS-2. The little-endian encoding results in the same values as in the table above, but with the two bytes of each character in reverse order.

So, we see that the UCS-2 encoding results in a doubling of file sizes for files that contain only English text. This is a disadvantage of this encoding. Another disadvantage is that null bytes can occur in normal text, breaking all conventions for null-terminated C strings if you use the normal char type. This is why C also defines the wchar_t type, which can hold a 32-bit character (at least on GNU systems). To avoid both of these disadvantages, UTF-8 was introduced.

In UTF-8, the number of bytes used to write a character to a file depends on the Unicode code point. The table corresponding to the one above is:

    Character     Unicode code point     Byte values in file (UTF-8)
    A             U+0041                 0x41
    é             U+00E9                 0xC3, 0xA9
    θ (theta)     U+03B8                 0xCE, 0xB8
    € (euro)      U+20AC                 0xE2, 0x82, 0xAC

    Some immediate observations:
  • A null byte can only occur as the encoding of U+0000 itself, so UTF-8 text can be stored in a null-terminated C string (without having to use the wchar_t type).
  • Strict ASCII characters are encoded into 1 byte, which makes UTF-8 completely backward compatible with ASCII. It doesn't change the size of normal ASCII text strings or files.
  • Some characters need 3 bytes in UTF-8. Indeed, all basic plane characters (U+0000 to U+FFFF) can be encoded in 1, 2 or 3 bytes.
  • An excellent explanation of how characters are encoded in UTF-8 can be found on this page.

Some additional terminology regarding encoding schemes (less important here):

  • UCS-2: each character of the basic plane is written as 2 bytes.
  • UCS-4: each character is written as 4 bytes.
  • UTF-16: like UCS-2, but characters outside the basic plane are written as pairs of 16-bit values (surrogate pairs).
  • UTF-32: each character is written as 4 bytes, equivalent to UCS-4 in practice.

Note that the byte order of UCS-2, UCS-4, UTF-16 and UTF-32 is not defined, so it can be big-endian or little-endian!
    $Id$
    $Name$