X-Git-Url: https://git.dlugolecki.net.pl/?a=blobdiff_plain;f=doc%2Fencoding.html;h=4c86c0de9e3d2d22d209f0a94b51810cfade7b30;hb=90d522d64b1bc52135d15e26f88f55cf0d700ec6;hp=a8f0a1e614ba799cbcfc629fea3d835c51dd3ef1;hpb=37246a4b0ab22fb948d6fb5b9b91917441db26cf;p=gedcom-parse.git diff --git a/doc/encoding.html b/doc/encoding.html index a8f0a1e..4c86c0d 100644 --- a/doc/encoding.html +++ b/doc/encoding.html @@ -1,66 +1,307 @@ -Character encoding -

Character encoding

-
+Character encoding + + +

Character encoding
+

+

Index

+ -
+
  • Unicode code points
  • Unicode encodings, UTF-8

  • +

    The character encoding problem

    -Developers are usually familiar with the ASCII character set.  This + +Developers are usually familiar with the ASCII character set.  This is a character set that assigns a unique number to some characters, e.g. an "A" has ASCII code 65 (or 0x41 in hex), and an "a" has ASCII code 97 (or -0x61 in hex).  Some people may also have used ASCII codes for several +0x61 in hex).  Some people may also have used ASCII codes for several drawing characters (such as a horizontal bar, a vertical bar, or a top-right corner) in the old DOS days, to be able to draw nice windows in text mode.
    +
    + However, these last characters are, strictly speaking, not part of the ASCII -set.  The standard ASCII set contains only the character positions from -0 to 127 (i.e. anything that fits into an integer that is 7 bits wide).  An -example of this table can be found here.  Anything that has an ASCII code between 128 and 255 is in principle undefined.
    +set.  The standard ASCII set contains only the character positions from +0 to 127 (i.e. anything that fits into an integer that is 7 bits wide).  An +example of this table can be found here.  Anything that has an ASCII code between 128 and 255 is in principle undefined.
    +
    + Now, several systems (including the old DOS) have defined those character -positions anyway, but usually in totally different ways.  Some well +positions anyway, but usually in totally different ways.  Some well known extensions are:
    + + And these are only examples of character sets used in West-European languages. - For Japanese, Chinese, Korean, Vietnamese, ... there are separate character + For Japanese, Chinese, Korean, Vietnamese, ... there are separate character sets in which one byte's meaning can even be influenced by what the previous -byte was, i.e. these are multi-byte character sets.  This is because +byte was, i.e. these are multi-byte character sets.  This is because even 256 characters (the maximum for 8 bits) is totally inadequate to represent all characters in such languages.
    +
    + So, summarizing, if a text file contains a byte that has a value 65, it is pretty safe to assume that this byte represents an "A", if we ignore the -multi-byte character sets spoken of before.  However, a value 233 cannot +multi-byte character sets spoken of before.  However, a value 233 cannot be interpreted without knowing in which character set the text file is written. - In Latin-1, it happens to be the character "é", but in another + In Latin-1, it happens to be the character "é", but in another character set it can be something totally different (e.g. in the DOS character set it is the Greek letter theta).
    +
    -Vice versa, if you need to write a character "é" to a file, it depends +
    Conversion from byte value +
    +
    + + +
    + + +Vice versa, if you need to write a character "é" to a file, it depends on the character set you will use what the numerical value will be in the file: in Latin-1 it will be 233, but if you use the DOS character set it -will be 130.
    +will be 130, making it necessary again to know the encoding when you want to re-read the file.
    + +
    + +
    Conversion to byte value +
    +
    +
    This is a source of great confusion as soon as you go outside +the normal English character set, especially when you are using files on +different systems...
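
    The two conversions described above can be sketched in C. This is only an illustration: the enum and function names are hypothetical, "the DOS character set" is assumed here to mean code page 437, and only the handful of byte values mentioned in the text are covered.

```c
#include <stddef.h>

/* Hypothetical charset identifiers, used only for this illustration.
   CHARSET_CP437 stands in for "the DOS character set" (an assumption). */
enum charset { CHARSET_LATIN1, CHARSET_CP437 };

/* Describe what a byte value means in the given character set.
   Only the values discussed in the text are handled; anything else
   yields NULL. */
const char *describe_byte(enum charset cs, unsigned char byte)
{
    if (byte == 65)
        return "A";                     /* plain ASCII: same in both sets */
    if (cs == CHARSET_LATIN1 && byte == 233)
        return "e with acute accent";
    if (cs == CHARSET_CP437 && byte == 233)
        return "Greek theta";
    if (cs == CHARSET_CP437 && byte == 130)
        return "e with acute accent";
    return NULL;                        /* not covered by this sketch */
}

/* The byte value to write for the character "e with acute accent". */
unsigned char e_acute_byte(enum charset cs)
{
    return cs == CHARSET_LATIN1 ? 233 : 130;
}
```

    The same byte 233 yields two different descriptions depending on the first argument, which is exactly why the character set has to be known before a file can be read or written.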
    +
    +
    + +
    -

    Unicode

    +

    Unicode code points

    + + +Enter the Unicode standard...
    + +
    + +Unicode solves the problem of encoding by assigning unique numbers to every + character that is used anywhere in the world.  Since it is not possible +to do this in 8 bits (with a maximum of 256 code positions), a Unicode character +is usually represented by 16 bits, denoted by U+0000 to U+FFFF in hexadecimal +style.  A number such as U+0123 is named a "code point".
    + +
    + Recently (Unicode 3.1), some extensions have even been defined so that in +fact the defined range is now U+000000 to U+10FFFF (21 bits), and formally, +the character set is defined as 31-bits to allow for future expansion.
    + +
    + +The Unicode character set is backward compatible with the ISO-8859-1 or Latin-1 +character set (and thus automatically also with the ASCII character set), +because for every ISO-8859-1 character with hexadecimal value 0xXY, the corresponding +Unicode code point is U+00XY.
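
    Because of this compatibility, converting a Latin-1 byte to its Unicode code point needs no lookup table. A minimal sketch (the function name is hypothetical):

```c
#include <stdint.h>

/* Map an ISO-8859-1 (Latin-1) byte 0xXY to the Unicode code point U+00XY.
   The numeric value is simply carried over unchanged. */
uint32_t latin1_to_code_point(unsigned char byte)
{
    return (uint32_t)byte;
}
```

    For example, the Latin-1 byte 0xE9 maps to U+00E9. Note that this shortcut holds only for Latin-1; other 8-bit character sets need a real conversion table.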
    + +
    + Some examples of Unicode code points (some of the characters here may not +be displayed correctly in all browsers; current Mozilla works perfectly for +this, but it also depends on the installed fonts of course):
    +
    + + + + + + + + + + + + + + + + + + + + + + + + + +
    Unicode code point    Character
    U+0041                A
    U+00E9                é
    U+03B8                θ (the Greek theta)
    U+20AC                € (the euro)
    + +
    +Using Unicode code points, there is no longer any confusion about which character +is meant, because each code point uniquely identifies a character.  The full Unicode +code charts can be found here + (as a set of PDF documents).  A nice application to see all Unicode +characters is the Unicode Character Map (ucm), which can be found here, and which lets you select and paste any Unicode character.
    +
    +Some additional terminology (more terminology follows in the next section):
    + +
    +

    Unicode encodings, UTF-8

    +Since Unicode characters are generally represented by a number that is 16 +bits wide, as seen above (for the basic plane), it would seem that all text +files would double in size, since the usual ASCII characters are 8 bits wide. + However, the Unicode code points are not necessarily the values that +are written to files...  
    +
    +Indeed, the simplest solution is to take the code point that defines a character, +split it up into two bytes, and write the two bytes to the file.  This +is called the UCS-2 encoding scheme:
    +
    + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    Character    Unicode code point    Byte values in file (UCS-2)
    A            U+0041                0x00, 0x41
    é            U+00E9                0x00, 0xE9
    θ (theta)    U+03B8                0x03, 0xB8
    € (euro)     U+20AC                0x20, 0xAC
    +
    +This table assumes a big-endian encoding of UCS-2: the endianness is in principle +not defined, so there are two versions of UCS-2.  The little-endian +encoding results in the same values as in the table above, but in the inverse +order.
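
    The UCS-2 scheme and its two byte orders can be sketched as follows. The function name and the big_endian flag are hypothetical, chosen only for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Write one code point from the basic plane (U+0000..U+FFFF) as two bytes.
   big_endian != 0 puts the high byte first, as in the table above;
   big_endian == 0 swaps the two bytes.
   Returns the number of bytes written (2), or 0 if the code point does not
   fit in 16 bits.  This sketch does not check for the surrogate range. */
size_t ucs2_encode(uint32_t code_point, int big_endian, unsigned char out[2])
{
    unsigned char hi, lo;

    if (code_point > 0xFFFF)
        return 0;

    hi = (unsigned char)(code_point >> 8);
    lo = (unsigned char)(code_point & 0xFF);

    if (big_endian) {
        out[0] = hi;
        out[1] = lo;
    } else {
        out[0] = lo;
        out[1] = hi;
    }
    return 2;
}
```

    Encoding U+20AC this way gives 0x20, 0xAC in big-endian order and 0xAC, 0x20 in little-endian order. Encoding U+0041 gives a 0x00 byte in either order.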
    +
    +So, we see that the UCS-2 encoding results in a doubling of file sizes for +files that contain only English text.  This is a disadvantage for this +encoding.  Another disadvantage is that null bytes can occur in normal +text, breaking all conventions for null-terminated C strings if you use the +normal char type.  This is why C also defines the wchar_t + type, which can hold a 32-bit character (at least in GNU systems).  To +avoid both of these disadvantages, UTF-8 was introduced.
    +
    +In UTF-8, the number of bytes used to write a character to a file depends +on the Unicode code point.  The table corresponding to the one above +is:
    +
    + + +
    Character    Unicode code point    Byte values in file (UTF-8)
    A            U+0041                0x41
    é            U+00E9                0xC3, 0xA9
    θ (theta)    U+03B8                0xCE, 0xB8
    € (euro)     U+20AC                0xE2, 0x82, 0xAC
    +
    +Some immediate observations:
    + +An excellent explanation of how characters are encoded in UTF-8 can be found on this page.
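
    As a rough illustration of how the byte values in the UTF-8 table come about, here is a minimal encoder sketch. The function name is hypothetical, and it covers the range U+0000 to U+10FFFF only.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8.
   Returns the number of bytes written to out (1 to 4), or 0 if the value
   is not a valid code point (above U+10FFFF, or in the surrogate range
   U+D800..U+DFFF). */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        /* 7 bits: a single byte, identical to ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        /* 11 bits: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;
        /* 16 bits: 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        /* 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

    With this sketch, U+0041 yields the single byte 0x41, and U+20AC yields 0xE2, 0x82, 0xAC. Every byte of a multi-byte sequence has its high bit set, so a null byte appears only for U+0000 itself.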
    +
    +Some additional terminology regarding encoding schemes (less important here):
    + +Note that the byte order of UCS-2, UCS-4, UTF-16 and UTF-32 is not defined, so it can be big-endian or little-endian!
    + -Enter the Unicode standard...

    $Id$
    $Name$

    +
    + \ No newline at end of file