From 5cd08934a429c7958567dc0bf6524d484b2d4ec2 Mon Sep 17 00:00:00 2001 From: Peter Verthez Date: Sat, 19 Apr 2003 14:55:10 +0000 Subject: [PATCH] Copied to html generated documentation. --- doc/html/encoding.html | 307 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 307 insertions(+) create mode 100644 doc/html/encoding.html diff --git a/doc/html/encoding.html b/doc/html/encoding.html new file mode 100644 index 0000000..4c86c0d --- /dev/null +++ b/doc/html/encoding.html @@ -0,0 +1,307 @@ +Character encoding + + +

Character encoding
+

+ +

Index

  - The character encoding problem
  - Unicode code points
  - Unicode encodings, UTF-8
+

The character encoding problem

Developers are usually familiar with the ASCII character set.  This is a character set that assigns a unique number to some characters, e.g. an "A" has ASCII code 65 (or 0x41 in hex), and an "a" has ASCII code 97 (or 0x61 in hex).  Some people may also have used ASCII codes for several drawing characters (such as a horizontal bar, a vertical bar, or a top-right corner) in the old DOS days, to be able to draw nice windows in text mode.
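The codes mentioned above are easy to check for yourself; a small Python sketch (ord gives the numeric code of a character, chr the character for a code):

```python
# The ASCII codes from the text, in decimal and hex.
print(ord("A"))   # 65 (0x41)
print(ord("a"))   # 97 (0x61)
print(chr(65))    # "A"
```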
+ +
However, these last characters are, strictly speaking, not part of the ASCII set.  The standard ASCII set contains only the character positions from 0 to 127 (i.e. anything that fits into a 7-bit integer).  An example of this table can be found here.  Anything with an ASCII code between 128 and 255 is in principle undefined.
+ +
Now, several systems (including the old DOS) have defined those character positions anyway, but usually in totally different ways.  Some well known extensions are:

  - ISO-8859-1 ("Latin-1"), widely used for West-European languages;
  - the old DOS character set (code page 437), which also contains line-drawing and Greek characters.

And these are only examples of character sets used in West-European languages.  For Japanese, Chinese, Korean, Vietnamese, ... there are separate character sets in which one byte's meaning can even be influenced by what the previous byte was, i.e. these are multi-byte character sets.  This is because even 256 characters (the maximum for 8 bits) is totally inadequate to represent all characters in such languages.
+ +
So, summarizing: if a text file contains a byte with value 65, it is pretty safe to assume that this byte represents an "A", if we ignore the multi-byte character sets mentioned above.  However, a value of 233 cannot be interpreted without knowing the character set in which the text file is written.  In Latin-1 it happens to be the character "é", but in another character set it can be something totally different (e.g. in the DOS character set it is the Greek letter theta).
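This ambiguity is easy to demonstrate.  The sketch below uses Python's standard codec names as an illustration (the old DOS character set is available there as "cp437"), decoding the same two bytes under two different assumptions:

```python
data = bytes([65, 233])

# Byte 65 is "A" in both character sets, but byte 233 differs:
print(data.decode("latin-1"))  # "Aé" -> 233 is "é" in Latin-1
print(data.decode("cp437"))    # "AΘ" -> 233 is the Greek theta in the DOS set
```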
+ +
[Figure: Conversion from byte value]
Vice versa, if you need to write the character "é" to a file, the numerical value in the file depends on the character set you use: in Latin-1 it will be 233, but in the DOS character set it will be 130, again making it necessary to know the encoding when you want to re-read the file.
+ +
[Figure: Conversion to byte value]
This is a source of great confusion as soon as you go outside the normal English character set, especially when you are using files on different systems...
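The writing direction can be sketched the same way (again using Python's codec names purely as an illustration):

```python
# Writing "é" produces different byte values depending on the target set:
print("é".encode("latin-1")[0])  # 233
print("é".encode("cp437")[0])    # 130
```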
+
+
+ + +
+

Unicode code points

+ + +Enter the Unicode standard...
+ +
Unicode solves the encoding problem by assigning a unique number to every character that is used anywhere in the world.  Since it is not possible to do this in 8 bits (with a maximum of 256 code positions), a Unicode character is usually represented by 16 bits, denoted U+0000 to U+FFFF in hexadecimal style.  A number such as U+0123 is called a "code point".
+ +
Recently (in Unicode 3.1), extensions have even been defined so that the defined range is now U+000000 to U+10FFFF (21 bits); formally, the character set is defined as 31 bits wide to allow for future expansion.
+ +
The Unicode character set is backward compatible with the ISO-8859-1 (Latin-1) character set, and thus automatically also with the ASCII character set: for every ISO-8859-1 character with hexadecimal value 0xXY, the corresponding Unicode code point is U+00XY.
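This backward compatibility can be verified mechanically; a minimal sketch:

```python
# Every ISO-8859-1 byte value 0xXY maps to Unicode code point U+00XY,
# so decoding a Latin-1 byte never changes its numeric value.
for value in range(256):
    character = bytes([value]).decode("latin-1")
    assert ord(character) == value
print("Latin-1 maps identically onto U+0000..U+00FF")
```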
+ +
Some examples of Unicode code points follow (some of the characters here may not be displayed correctly in all browsers; current Mozilla handles this well, but it also depends on the installed fonts, of course):
+
  Unicode code point | Character
  -------------------+---------------------
  U+0041             | A
  U+00E9             | é
  U+03B8             | θ (the Greek theta)
  U+20AC             | € (the euro)
+ +
Using the Unicode code points there is no confusion anymore about which character is meant, because they uniquely define the character.  The full Unicode code charts can be found here (as a set of PDF documents).  A nice application to see all Unicode characters is the Unicode Character Map (ucm), which can be found here, and which allows you to select and paste any Unicode character.
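In a program, the code point of a character and its U+... notation are easy to obtain; a Python sketch (the helper name code_point is ours, for illustration only):

```python
# ord() yields the Unicode code point; format it in the usual U+ notation.
def code_point(ch):
    return f"U+{ord(ch):04X}"

print(code_point("A"))  # U+0041
print(code_point("€"))  # U+20AC
```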
+
+Some additional terminology (more terminology follows in the next section):
+ +
+

Unicode encodings, UTF-8

Since Unicode characters are generally represented by a number that is 16 bits wide, as seen above (for the basic plane), it would seem that all text files would double in size, since the usual ASCII characters are 8 bits wide.  However, the Unicode code points are not necessarily the values that are written to files...
+
Indeed, the simplest solution is to take the code point that defines a character, split it up into two bytes, and write the two bytes to the file.  This is called the UCS-2 encoding scheme:
+
  Character   | Unicode code point | Byte values in file (UCS-2)
  ------------+--------------------+----------------------------
  A           | U+0041             | 0x00, 0x41
  é           | U+00E9             | 0x00, 0xE9
  θ (theta)   | U+03B8             | 0x03, 0xB8
  € (euro)    | U+20AC             | 0x20, 0xAC
+
This table assumes a big-endian encoding of UCS-2: the endianness is in principle not defined, so there are two versions of UCS-2.  The little-endian encoding results in the same byte values as in the table above, but with the two bytes of each character swapped.
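A minimal sketch of the two byte orders (the function names are ours, for illustration only):

```python
def ucs2_big_endian(cp):
    # High byte first.
    return bytes([cp >> 8, cp & 0xFF])

def ucs2_little_endian(cp):
    # Low byte first: the same bytes, swapped within the character.
    return bytes([cp & 0xFF, cp >> 8])

print(ucs2_big_endian(0x20AC).hex())     # 20ac
print(ucs2_little_endian(0x20AC).hex())  # ac20
```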
+
So, we see that the UCS-2 encoding doubles the file size for files that contain only English text.  This is a disadvantage of this encoding.  Another disadvantage is that null bytes can occur in normal text, breaking all conventions for null-terminated C strings if you use the normal char type.  This is why C also defines the wchar_t type, which can hold a 32-bit character (at least on GNU systems).  To avoid both of these disadvantages, UTF-8 was introduced.
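Both disadvantages are easy to observe; a Python sketch (the "utf-16-be" codec produces the same bytes as big-endian UCS-2 for basic-plane text):

```python
text = "Hello"
encoded = text.encode("utf-16-be")

print(len(encoded))  # 10: twice the length of the ASCII text
print(0 in encoded)  # True: null bytes appear inside the encoded text
```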
+
In UTF-8, the number of bytes used to write a character to a file depends on the Unicode code point.  The table corresponding to the one above is:
+
  Character   | Unicode code point | Byte values in file (UTF-8)
  ------------+--------------------+----------------------------
  A           | U+0041             | 0x41
  é           | U+00E9             | 0xC3, 0xA9
  θ (theta)   | U+03B8             | 0xCE, 0xB8
  € (euro)    | U+20AC             | 0xE2, 0x82, 0xAC
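The byte values in this table can be reproduced with any UTF-8 encoder; for example, in Python:

```python
# Each character from the table, with its expected UTF-8 byte sequence.
samples = [("A", "41"), ("é", "c3a9"), ("θ", "ceb8"), ("€", "e282ac")]
for character, expected_hex in samples:
    assert character.encode("utf-8").hex() == expected_hex
print("all UTF-8 byte values match the table")
```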
Some immediate observations:

  - Characters in the ASCII range (U+0000 to U+007F) are written as a single byte, unchanged, so a pure ASCII file is at the same time a valid UTF-8 file.
  - Characters outside the ASCII range are written as two or more bytes, all with the high bit set, so no null bytes occur in the text.

An excellent explanation of how characters are encoded in UTF-8 can be found on this page.
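In outline, UTF-8 stores a code point in one to four bytes, using the high bits of the first byte to announce the length, followed by continuation bytes in the range 0x80-0xBF.  A hand-rolled sketch of these rules (for illustration only; real programs should use a library encoder):

```python
def utf8_encode(cp):
    # Encode one Unicode code point (up to U+10FFFF) by hand.
    if cp < 0x80:                      # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                     # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | cp >> 6,
                      0x80 | cp & 0x3F])
    if cp < 0x10000:                   # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | cp >> 12,
                      0x80 | (cp >> 6) & 0x3F,
                      0x80 | cp & 0x3F])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | cp >> 18,
                  0x80 | (cp >> 12) & 0x3F,
                  0x80 | (cp >> 6) & 0x3F,
                  0x80 | cp & 0x3F])

# Check against the built-in encoder:
for ch in "Aéθ€":
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
print("hand-rolled encoder matches")
```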
+
Some additional terminology regarding encoding schemes (less important here):

  - UCS-4 and UTF-32: every code point is simply written as four bytes;
  - UTF-16: like UCS-2, but code points above U+FFFF are written as a pair of 16-bit values (a "surrogate pair").

Note that the byte order of UCS-2, UCS-4, UTF-16 and UTF-32 is not defined, so it can be big endian or little endian!
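The byte-order difference is directly visible when encoding; a sketch using Python's UTF-16 codecs:

```python
print("€".encode("utf-16-be").hex())  # 20ac (big endian)
print("€".encode("utf-16-le").hex())  # ac20 (little endian)
```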
+ + +
+
$Id$
$Name$

+ +
+ + \ No newline at end of file -- 2.30.2