From: Peter Verthez
Date: Sat, 19 Apr 2003 14:55:10 +0000 (+0000)
Subject: Copied to html generated documentation.
X-Git-Url: https://git.dlugolecki.net.pl/?a=commitdiff_plain;h=5cd08934a429c7958567dc0bf6524d484b2d4ec2;p=gedcom-parse.git

Copied to html generated documentation.
---
diff --git a/doc/html/encoding.html b/doc/html/encoding.html
new file mode 100644
index 0000000..4c86c0d
--- /dev/null
+++ b/doc/html/encoding.html
@@ -0,0 +1,307 @@
+Character encoding

Character encoding
+

+ +

Index

+ +

The character encoding problem
Unicode code points
Unicode encodings, UTF-8

+ +

The character encoding problem

Developers are usually familiar with the ASCII character set. This is a character set that assigns a unique number to each of its characters, e.g. an "A" has ASCII code 65 (or 0x41 in hex), and an "a" has ASCII code 97 (or 0x61 in hex). Some people may also have used ASCII codes for several drawing characters (such as a horizontal bar, a vertical bar, or a top-right corner) in the old DOS days, to be able to draw nice windows in text mode.
+ +
However, these last characters are, strictly speaking, not part of the ASCII set. The standard ASCII set contains only the character positions from 0 to 127 (i.e. anything that fits into an integer that is 7 bits wide). An example of this table can be found here. Any value between 128 and 255 is in principle undefined in ASCII.
+ +
Now, several systems (including the old DOS) have defined those character positions anyway, but usually in totally different ways. Some well-known extensions are:
+ +

the DOS character set, nowadays usually known as Code Page 437, but sometimes also named LatinUS, ECS (Extended Character Set) or PC-8; note that the table displayed in the link also contains the standard ASCII part
the ANSI character set, also known as Code Page 1252, and usually the default on Windows
the ISO-8859-1 character set (also called Latin-1), which is an ISO standard for Western European languages, mostly used on various Unices
the Adobe Standard Encoding, which is by default used in Postscript, unless overridden

And these are only examples of character sets used for Western European languages. For Japanese, Chinese, Korean, Vietnamese, ... there are separate character sets in which one byte's meaning can even be influenced by what the previous byte was, i.e. these are multi-byte character sets. This is because even 256 characters (the maximum for 8 bits) are totally inadequate to represent all characters in such languages.
+ +
So, summarizing, if a text file contains a byte that has a value 65, it is pretty safe to assume that this byte represents an "A", if we ignore the multi-byte character sets mentioned before. However, a value 233 cannot be interpreted without knowing in which character set the text file is written. In Latin-1, it happens to be the character "é", but in another character set it can be something totally different (e.g. in the DOS character set it is the Greek letter theta).
+ +
+ +

+
+

+ + +
Vice versa, if you need to write a character "é" to a file, the numerical value that ends up in the file depends on the character set you use: in Latin-1 it will be 233, but in the DOS character set it will be 130. So again you need to know the encoding when you want to re-read the file.
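The two directions described above can be tried out with a short sketch. It assumes Python and its standard codec names "latin-1" and "cp437" (Code Page 437, the DOS character set); neither is part of the original text, and the variable names are only illustrative.

```python
# The byte value 233 read under two different 8-bit character sets.
raw = bytes([233])
as_latin1 = raw.decode("latin-1")   # U+00E9, e with acute accent
as_cp437 = raw.decode("cp437")      # U+0398, Greek capital theta

# The same character written under the two character sets.
e_acute = "\u00e9"
latin1_value = e_acute.encode("latin-1")[0]   # 233
cp437_value = e_acute.encode("cp437")[0]      # 130

print(hex(ord(as_latin1)), hex(ord(as_cp437)))
print(latin1_value, cp437_value)
```

The same byte yields two unrelated characters, and the same character yields two different bytes, which is exactly why the encoding has to be known.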
+ +
+ +

+
+
+

This is a source of great confusion as soon as you go outside the normal English character set, especially when you are using files on different systems...
+

+ + +

Unicode code points

Enter the Unicode standard...
+ +
Unicode solves the problem of encoding by assigning a unique number to every character that is used anywhere in the world. Since it is not possible to do this in 8 bits (with a maximum of 256 code positions), a Unicode character is usually represented by 16 bits, denoted by U+0000 to U+FFFF in hexadecimal style. A number such as U+0123 is named a "code point".
+ +
Recently (Unicode 3.1), some extensions have even been defined so that in fact the defined range is now U+000000 to U+10FFFF (21 bits), and formally, the character set is defined as 31 bits wide to allow for future expansion.
+ +
The Unicode character set is backward compatible with the ISO-8859-1 or Latin-1 character set (and thus automatically also with the ASCII character set), because for every ISO-8859-1 character with hexadecimal value 0xXY, the corresponding Unicode code point is U+00XY.
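This compatibility claim is easy to check mechanically. The following is a minimal sketch, assuming Python and its standard "latin-1" codec as a stand-in for ISO-8859-1; the name `mismatches` is only illustrative.

```python
# Every Latin-1 byte value 0xXY should map to the Unicode code point U+00XY.
mismatches = [
    value for value in range(256)
    if ord(bytes([value]).decode("latin-1")) != value
]
print(mismatches)  # an empty list means the mapping is the identity
```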
+ +
Some examples of Unicode code points (some of the characters here may not be displayed correctly in all browsers; current Mozilla works perfectly for this, but it also depends on the installed fonts of course):
+
+ + + + + + + + + + + + + + + + + + + + + + + + + +

Unicode code point	Character
U+0041	A
U+00E9	é
U+03B8	θ (the Greek theta)
U+20AC	€ (the euro)

+ +
Using the Unicode code points there is no longer any confusion about which character is meant, because they uniquely define the character. The full Unicode code charts can be found here (as a set of PDF documents). A nice application to see all Unicode characters is the Unicode Character Map (ucm), which can be found here, and which allows you to select and paste any Unicode character.
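The code points from the table above can also be looked up programmatically. This is a small sketch assuming Python and its standard `unicodedata` module, which is not mentioned in the original text; it prints the official Unicode name for each code point.

```python
import unicodedata

# Look up the official character name for each code point in the table.
for code_point in (0x0041, 0x00E9, 0x03B8, 0x20AC):
    character = chr(code_point)
    print(f"U+{code_point:04X}", unicodedata.name(character))
```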
+
Some additional terminology (more terminology follows in the next section):
+

UCS (Universal Character Set): the official name for the Unicode character set
+
+
ISO 10646: the international standard that defines the Unicode character set
+
+
BMP (Basic Multilingual Plane) or Plane 0 is the 16-bit subset of UCS, i.e. the characters U+0000 to U+FFFF, which is supposed to cover all characters in all currently used languages. Code points outside that range are used for historical character sets (e.g. hieroglyphs) and special symbols.

Unicode encodings, UTF-8

Since Unicode characters are generally represented by a number that is 16 bits wide, as seen above (for the basic plane), it would seem that all text files would double in size, since the usual ASCII characters are 8 bits wide. However, the Unicode code points are not necessarily the values that are written to files...
+
The simplest solution is indeed to take the code point that defines a character, split it up into two bytes, and write the two bytes to the file. This is called the UCS-2 encoding scheme:
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + +

Character	Unicode code point	Byte values in file (UCS-2)
A	U+0041	0x00, 0x41
é	U+00E9	0x00, 0xE9
θ (theta)	U+03B8	0x03, 0xB8
€ (euro)	U+20AC	0x20, 0xAC

+
This table assumes a big-endian encoding of UCS-2: the endianness is in principle not defined, so there are two versions of UCS-2. The little-endian encoding results in the same byte values as in the table above, but in the inverse order.
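The UCS-2 scheme and its two byte orders can be sketched in a few lines. This assumes Python; the helper `ucs2_bytes` is a hypothetical name introduced only for illustration and simply splits a basic-plane code point into two bytes.

```python
def ucs2_bytes(character, byteorder="big"):
    """Split a basic-plane code point into two bytes (hypothetical helper)."""
    code_point = ord(character)
    if code_point > 0xFFFF:
        raise ValueError("UCS-2 can only encode U+0000 to U+FFFF")
    return code_point.to_bytes(2, byteorder)

euro = "\u20ac"
big = ucs2_bytes(euro, "big")        # bytes 0x20, 0xAC as in the table
little = ucs2_bytes(euro, "little")  # same bytes in the inverse order
print(big.hex(), little.hex())
```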
+
So, we see that the UCS-2 encoding results in a doubling of file sizes for files that contain only English text. This is a disadvantage of this encoding. Another disadvantage is that null bytes can occur in normal text, breaking all conventions for null-terminated C strings if you use the normal char type. This is why C also defines the wchar_t type, which can hold a 32-bit character (at least on GNU systems). To avoid both of these disadvantages, UTF-8 was introduced.
+
In UTF-8, the number of bytes used to write a character to a file depends on the Unicode code point. The table corresponding to the one above is:
+
+ + +

Character	Unicode code point	Byte values in file (UTF-8)
A	U+0041	0x41
é	U+00E9	0xC3, 0xA9
θ (theta)	U+03B8	0xCE, 0xB8
€ (euro)	U+20AC	0xE2, 0x82, 0xAC

+
Some immediate observations:
+

Null bytes don't occur in the encoding of any character other than U+0000 itself, so there is no problem in putting UTF-8 text in a null-terminated C string (without having to use the wchar_t type), although it is a little more difficult to get the length of the string in characters.
+
+
Strict ASCII characters are encoded into 1 byte, which makes UTF-8 completely backward compatible with ASCII. It doesn't change the size of normal ASCII text strings or files.
+
+
Some characters need 3 bytes in UTF-8. Indeed, all basic plane characters (U+0000 to U+FFFF) can be encoded in 1, 2 or 3 bytes.

An excellent explanation of how characters are encoded in UTF-8 can be found on this page.
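As a rough illustration of the bit layout, here is a minimal sketch of a UTF-8 encoder restricted to the basic plane. It assumes Python; `utf8_encode_bmp` is a hypothetical helper written for this example, not a library function, and it does not reject surrogate code points the way a strict encoder would.

```python
def utf8_encode_bmp(code_point):
    """Encode one basic-plane code point as UTF-8 (hypothetical helper)."""
    if code_point < 0x80:        # up to 7 bits: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # up to 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:     # up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("outside the basic plane")

for code_point in (0x0041, 0x00E9, 0x03B8, 0x20AC):
    print(f"U+{code_point:04X}", utf8_encode_bmp(code_point).hex())
```

For the four characters used throughout this page, the result agrees with the UTF-8 table above and with Python's built-in codec.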
+
Some additional terminology regarding encoding schemes (less important here):
+

UCS-2: the encoding of Unicode characters in 2 bytes (16 bits) per character, can only encode U+0000 to U+FFFF (the basic plane)
+
+
UCS-4: the encoding of Unicode characters in 4 bytes (32 bits) per character, can encode the entire UCS range
+
+
UTF-8: the encoding scheme described above (UTF = UCS Transformation Format)
+
+
UTF-16: an extension to UCS-2 to be able to encode characters outside the basic plane (the 21-bit range of Unicode 3.1), by using a sequence of two 16-bit values
+
+
UTF-32: the 4-byte encoding of the 21-bit range of Unicode 3.1, in fact the same as UCS-4
+
+
UTF-7: can be safely forgotten... (a kind of 7-bit clean version of UTF-8)

Note that the byte order of UCS-2, UCS-4, UTF-16 and UTF-32 is not defined, so it can be big-endian or little-endian!
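The UTF-16 entry in the list above and the byte order remark can be illustrated together. The sketch below assumes Python and uses the surrogate-pair arithmetic from the UTF-16 definition, which the original text does not spell out; `utf16_units` is a hypothetical helper and U+10400 is an arbitrarily chosen code point outside the basic plane.

```python
def utf16_units(code_point):
    """Return the 16-bit values for one code point (hypothetical helper)."""
    if code_point <= 0xFFFF:
        return [code_point]
    offset = code_point - 0x10000           # 20 bits remain
    return [0xD800 | (offset >> 10),        # high surrogate: top 10 bits
            0xDC00 | (offset & 0x3FF)]      # low surrogate: bottom 10 bits

example = 0x10400  # arbitrary code point outside the basic plane
units = utf16_units(example)
big = b"".join(unit.to_bytes(2, "big") for unit in units)
little = b"".join(unit.to_bytes(2, "little") for unit in units)
print([hex(unit) for unit in units], big.hex(), little.hex())
```

The two byte strings contain the same 16-bit values but differ in the order of the bytes within each value, so a reader has to know which byte order was used.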
+ + +

$Id$
$Name$

+ +
+ + \ No newline at end of file