X-Git-Url: https://git.dlugolecki.net.pl/?a=blobdiff_plain;f=doc%2Fencoding.html;h=4c86c0de9e3d2d22d209f0a94b51810cfade7b30;hb=560a739640bb8ab266e1335afb9a8cef9eb4f1aa;hp=a8f0a1e614ba799cbcfc629fea3d835c51dd3ef1;hpb=37246a4b0ab22fb948d6fb5b9b91917441db26cf;p=gedcom-parse.git
diff --git a/doc/encoding.html b/doc/encoding.html
index a8f0a1e..4c86c0d 100644
--- a/doc/encoding.html
+++ b/doc/encoding.html
@@ -1,66 +1,307 @@
-
Character encoding
-Character encoding
-
+Character encoding
+
+
+Character encoding
+
+
Index
+
-
+ Unicode code points
+ Unicode encodings, UTF-8
+
The character encoding problem
-Developers are usually familiar with the ASCII character set. This
+
+Developers are usually familiar with the ASCII character set. This
is a character set that assigns a unique number to some characters, e.g.
an "A" has ASCII code 65 (or 0x41 in hex), and an "a" has ASCII code 97 (or
-0x61 in hex). Some people may also have used ASCII codes for several
+0x61 in hex). Some people may also have used ASCII codes for several
drawing characters (such as a horizontal bar, a vertical bar, or a top-right
corner) in the old DOS days, to be able to draw nice windows in text mode.
+
+
However, these last characters are, strictly speaking, not part of the ASCII
-set. The standard ASCII set contains only the character positions from
-0 to 127 (i.e. anything that fits into an integer that is 7 bits wide). An
-example of this table can be found here. Anything that has an ASCII code between 128 and 255 is in principle undefined.
+set. The standard ASCII set contains only the character positions from
+0 to 127 (i.e. anything that fits into an integer that is 7 bits wide). An
+example of this table can be found here. Any value between 128 and 255 is in principle undefined in ASCII.
+
+
Now, several systems (including the old DOS) have defined those character
-positions anyway, but usually in totally different ways. Some well
+positions anyway, but usually in totally different ways. Some well
known extensions are:
+
+
- the DOS
character set, nowadays usually known as Code Page 437, but sometimes also
named LatinUS, ECS (Extended Character Set) or PC-8; note that the table
displayed in the link also contains the standard ASCII part
- the ANSI character set, also known as Code Page 1252, and usually the default on Windows
- the ISO-8859-1 character set (also called Latin-1), which is an ISO standard for Western European languages, mostly used on various Unices
- the Adobe Standard Encoding, which is by default used in Postscript, unless overridden
+
And these are only examples of character sets used in West-European languages.
- For Japanese, Chinese, Korean, Vietnamese, ... there are separate character
+ For Japanese, Chinese, Korean, Vietnamese, ... there are separate character
sets in which one byte's meaning can even be influenced by what the previous
-byte was, i.e. these are multi-byte character sets. This is because
+byte was, i.e. these are multi-byte character sets. This is because
even 256 characters (the maximum for 8 bits) is totally inadequate to represent all characters in
such languages.
+
+
So, summarizing, if a text file contains a byte that has a value 65, it is
pretty safe to assume that this byte represents an "A", if we ignore the
-multi-byte character sets spoken of before. However, a value 233 cannot
+multi-byte character sets mentioned above. However, a value 233 cannot
be interpreted without knowing in which character set the text file is written.
- In Latin-1, it happens to be the character "é", but in another
+ In Latin-1, it happens to be the character "é", but in another
character set it can be something totally different (e.g. in the DOS character
set it is the Greek capital letter theta).
+
-Vice versa, if you need to write a character "é" to a file, it depends
+
+
+
+
+
+
+
+
+Vice versa, if you need to write a character "é" to a file, it depends
on the character set you will use what the numerical value will be in the
file: in Latin-1 it will be 233, but if you use the DOS character set it
-will be 130.
+will be 130, making it necessary again to know the encoding when you want to re-read the file.
+
+
+
+
+
+
+
This is a source of great confusion as soon as you go outside
+the normal English character set, especially when you are using files on
+different systems...
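The ambiguity described above can be reproduced with a short Python sketch. This is only an illustration: it assumes that Python's built-in codec names "latin-1" and "cp437" stand in for the ISO-8859-1 and DOS character sets, and the helper name decode_byte is made up for this example. Note that the DOS table maps 233 to the capital theta (U+0398).

```python
# Illustration only: one byte value, two character sets, two meanings.
# "latin-1" and "cp437" are Python's codec names for ISO-8859-1 and
# Code Page 437 (the DOS character set); decode_byte is a hypothetical helper.

def decode_byte(value: int, charset: str) -> str:
    """Interpret a single byte value in the given character set."""
    return bytes([value]).decode(charset)

# Reading: byte 65 is "A" in both sets, byte 233 is not the same character.
a_latin1 = decode_byte(65, "latin-1")
a_dos = decode_byte(65, "cp437")
e9_latin1 = decode_byte(233, "latin-1")   # e with acute accent
e9_dos = decode_byte(233, "cp437")        # Greek capital theta

# Writing: the same character gets a different byte value per set.
eacute_latin1 = "\u00e9".encode("latin-1")[0]
eacute_dos = "\u00e9".encode("cp437")[0]

# Print code points and byte values only, to stay terminal-independent.
print(hex(ord(e9_latin1)), hex(ord(e9_dos)), eacute_latin1, eacute_dos)
```

Without knowing which of the two codecs was used to write the file, a reader of byte 233 has no way to choose between the two results.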
+
+
+
+
-Unicode
+Unicode code points
+
+
+Enter the Unicode standard...
+
+
+
+Unicode solves the problem of encoding by assigning a unique number to every
+ character that is used anywhere in the world. Since it is not possible
+to do this in 8 bits (with a maximum of 256 code positions), a Unicode character
+is usually represented by 16 bits, denoted by U+0000 to U+FFFF in hexadecimal
+style. A number such as U+0123 is called a "code point".
+
+
+ Recently (Unicode 3.1), some extensions have even been defined so that in
+fact the defined range is now U+000000 to U+10FFFF (21 bits), and formally,
+the character set is defined as 31 bits wide to allow for future expansion.
+
+
+
+The Unicode character set is backward compatible with the ISO-8859-1 or Latin-1
+character set (and thus automatically also with the ASCII character set),
+because for every ISO-8859-1 character with hexadecimal value 0xXY, the corresponding
+Unicode code point is U+00XY.
+
+
+ Some examples of Unicode code points (some of the characters here may not
+be displayed correctly in all browsers; current Mozilla works perfectly for
+this, but it also depends on the installed fonts of course):
+
+
+
+
+
+
+ Unicode code point | Character
+ U+0041 | A
+ U+00E9 | é
+ U+03B8 | θ (the Greek theta)
+ U+20AC | € (the euro)
+
+
+
+
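As a small check of the code points listed above and of the Latin-1 compatibility rule, the following Python sketch may help. It relies on the built-in ord() function, which returns the code point of a one-character string; the helper name code_point_label is an assumption made for this example.

```python
# Illustration only: ord() gives the Unicode code point of a character,
# chr() goes the other way. code_point_label is a hypothetical helper.

def code_point_label(ch: str) -> str:
    """Format a character's code point in the U+XXXX style."""
    return "U+%04X" % ord(ch)

# A, e-acute, Greek small theta, euro sign (written as escapes).
labels = [code_point_label(ch) for ch in ("A", "\u00e9", "\u03b8", "\u20ac")]

# Every ISO-8859-1 byte value 0xXY corresponds to code point U+00XY.
latin1_matches = all(
    ord(bytes([value]).decode("latin-1")) == value for value in range(256)
)

print(labels, latin1_matches)
```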
+
+Using the Unicode code points, there is no longer any confusion about which character
+is meant, because they uniquely define the character. The full Unicode
+code charts can be found here
+ (as a set of PDF documents). A nice application to see all Unicode
+characters is the Unicode Character Map (ucm), which can be found here, and which lets you select and paste any Unicode character.
+
+Some additional terminology (more terminology follows in the next section):
+
+ - UCS (Universal Character Set): the official name for the Unicode character set
+
+
+ - ISO 10646: the international standard that defines the Unicode character set
+
+
+ - BMP (Basic Multilingual Plane) or Plane 0 is the 16-bit subset of UCS, i.e.
+the characters U+0000 to U+FFFF, which is supposed to cover all characters
+in all currently used languages. Code points outside that range are used
+for historical character sets (e.g. hieroglyphs) and special symbols.
+
+
+Unicode encodings, UTF-8
+Since Unicode characters are generally represented by a number that is 16
+bits wide, as seen above (for the basic plane), it would seem that all text
+files would double in size, since the usual ASCII characters are 8 bits wide.
+ However, the Unicode code points are not necessarily the values that
+are written to files...
+
+Indeed, the simplest solution is to take the code point that defines a character,
+split it up into two bytes, and write the two bytes to the file. This
+is called the UCS-2 encoding scheme:
+
+
+
+
+ Character | Unicode code point | Byte values in file (UCS-2)
+ A | U+0041 | 0x00, 0x41
+ é | U+00E9 | 0x00, 0xE9
+ θ (theta) | U+03B8 | 0x03, 0xB8
+ € (euro) | U+20AC | 0x20, 0xAC
+
+
+
+
+This table assumes a big-endian encoding of UCS-2: the endianness is in principle
+not defined, so there are two versions of UCS-2. The little-endian
+encoding results in the same values as in the table above, but in the inverse
+order.
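The two byte orders can be sketched in Python as follows. This is a minimal illustration, not a full codec: the function name ucs2_bytes is an assumption for this example, and it uses the standard struct module to split a 16-bit code point into two bytes.

```python
import struct

def ucs2_bytes(ch: str, big_endian: bool = True) -> bytes:
    """Write one basic-plane character as two bytes (UCS-2).

    Hypothetical helper for illustration only.
    """
    code_point = ord(ch)
    if code_point > 0xFFFF:
        raise ValueError("UCS-2 can only encode U+0000 to U+FFFF")
    # ">H" packs an unsigned 16-bit integer big-endian, "<H" little-endian.
    return struct.pack(">H" if big_endian else "<H", code_point)

euro_be = ucs2_bytes("\u20ac")                    # euro sign, big-endian
euro_le = ucs2_bytes("\u20ac", big_endian=False)  # same character, little-endian

print(euro_be.hex(), euro_le.hex())
```

The little-endian result holds the same two byte values as the big-endian one, in the opposite order.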
+
+So, we see that the UCS-2 encoding results in a doubling of file sizes for
+files that contain only English text. This is a disadvantage for this
+encoding. Another disadvantage is that null bytes can occur in normal
+text, breaking all conventions for null-terminated C strings if you use the
+normal char type. This is why C also defines the wchar_t type, which can hold a 32-bit character (at least on GNU systems). To
+avoid both of these disadvantages, UTF-8 was introduced.
+
+In UTF-8, the number of bytes used to write a character to a file depends
+on the Unicode code point. The table corresponding to the one above
+is:
+
+
+ Character | Unicode code point | Byte values in file (UTF-8)
+ A | U+0041 | 0x41
+ é | U+00E9 | 0xC3, 0xA9
+ θ (theta) | U+03B8 | 0xCE, 0xB8
+ € (euro) | U+20AC | 0xE2, 0x82, 0xAC
+
+
+Some immediate observations:
+
+ - Null bytes don't occur (except for the null character U+0000 itself), so there is no problem putting such characters
+in a null-terminated C string (without having to use the wchar_t type), although it is a little more difficult to get the length of the string.
+
+
+ - Strict ASCII characters are encoded into 1 byte, which makes UTF-8
+completely backward compatible with ASCII. It doesn't change the size
+of normal ASCII text strings or files.
+
+
+ - Some characters need 3 bytes in UTF-8. Indeed, all basic plane
+characters (U+0000 to U+FFFF) can be encoded in 1, 2 or 3 bytes.
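These observations can be checked with a few lines of Python. The sample string is an arbitrary choice for this illustration, and Python's "utf-16-be" codec is used as a stand-in for big-endian UCS-2, which is a safe assumption only because the sample stays within the basic plane.

```python
# Illustration only: compare UTF-8 with two-bytes-per-character output.
text = "caf\u00e9 \u20ac5"            # 7 characters: c a f e-acute space euro 5

utf8 = text.encode("utf-8")
ucs2_be = text.encode("utf-16-be")    # equals big-endian UCS-2 for basic-plane text

has_null_utf8 = 0 in utf8             # no null bytes in the UTF-8 form
has_null_ucs2 = 0 in ucs2_be          # null bytes appear for the ASCII characters
char_count = len(text)                # number of characters
byte_count = len(utf8)                # number of bytes: differs from char_count

# Plain ASCII text is byte-for-byte identical in UTF-8.
ascii_unchanged = "plain ASCII".encode("utf-8") == "plain ASCII".encode("ascii")

print(char_count, byte_count, has_null_utf8, has_null_ucs2, ascii_unchanged)
```

The difference between char_count and byte_count is why getting the length of a UTF-8 string takes a little more work: counting bytes no longer counts characters.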
+
+An excellent explanation of how characters are encoded in UTF-8 can be found on this page.
+
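For readers who want the rule at a glance, the bit layout for basic-plane characters can be sketched in Python. This is a simplified illustration rather than a complete encoder: the function name utf8_encode_bmp is an assumption, and the sketch deliberately ignores code points above U+FFFF (which need 4 bytes) and rejects the surrogate range.

```python
def utf8_encode_bmp(code_point: int) -> bytes:
    """Encode one basic-plane code point (U+0000 to U+FFFF) as UTF-8.

    Hypothetical helper for illustration only.
    """
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("this sketch only handles the basic plane")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not characters")
    if code_point < 0x80:
        # 1 byte:  0xxxxxxx (plain ASCII)
        return bytes([code_point])
    if code_point < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6), 0x80 | (code_point & 0x3F)])
    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xE0 | (code_point >> 12),
        0x80 | ((code_point >> 6) & 0x3F),
        0x80 | (code_point & 0x3F),
    ])

# The four code points used in the tables: A, e-acute, theta, euro.
encoded = [utf8_encode_bmp(cp).hex() for cp in (0x41, 0xE9, 0x3B8, 0x20AC)]
print(encoded)
```

The leading bits of the first byte tell a reader how many bytes belong to the character, and every continuation byte starts with the bits 10, so no byte of a multi-byte sequence can be mistaken for an ASCII character.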
+Some additional terminology regarding encoding schemes (less important here):
+
+ - UCS-2: the encoding of Unicode characters in 2 bytes (16 bits) per character, can only encode U+0000 to U+FFFF (the basic plane)
+
+
+ - UCS-4: the encoding of Unicode characters in 4 bytes (32 bits) per character, can encode the entire UCS range
+
+
+ - UTF-8: the encoding scheme described above (UTF = UCS Transformation Format)
+
+
+ - UTF-16: an extension to UCS-2 to be able to encode characters
+outside the basic plane (the 21-bit range of Unicode 3.1), by using a sequence
+of two 16-bit characters
+
+
+ - UTF-32: the 4-byte encoding of the 21-bit range of Unicode 3.1, in fact the same as UCS-4
+
+
+ - UTF-7: can be safely forgotten... (a kind of 7-bit clean version of UTF-8)
+
+Note that the byte order of UCS-2, UCS-4, UTF-16 and UTF-32 is not defined, so it can be big endian or little endian!
+
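To make the UTF-16 and byte-order remarks concrete, here is a hedged Python sketch of how a code point outside the basic plane becomes a sequence of two 16-bit values. The function name utf16_code_units is an assumption for this example, and U+1D11E (a musical symbol) is merely a convenient sample character.

```python
def utf16_code_units(code_point: int) -> list:
    """Return the 16-bit values UTF-16 uses for one code point.

    Hypothetical helper for illustration only.
    """
    if code_point <= 0xFFFF:
        return [code_point]            # basic plane: same value as UCS-2
    offset = code_point - 0x10000      # 20 bits remain
    high = 0xD800 | (offset >> 10)     # upper 10 bits
    low = 0xDC00 | (offset & 0x3FF)    # lower 10 bits
    return [high, low]

clef = 0x1D11E                         # sample code point outside the basic plane
units = utf16_code_units(clef)

# Python's codecs with an explicit byte order, for comparison.
utf16_be = chr(clef).encode("utf-16-be")
utf16_le = chr(clef).encode("utf-16-le")
utf32_be = chr(clef).encode("utf-32-be")

print([hex(u) for u in units], utf16_be.hex(), utf16_le.hex(), utf32_be.hex())
```

In the little-endian form each 16-bit value has its two bytes swapped, while the order of the two values themselves stays the same.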
-Enter the Unicode standard...
$Id$
$Name$
+
+
\ No newline at end of file