diff --git a/doc/parser.html b/doc/parser.html
index e6581b7..a344644 100644
--- a/doc/parser.html
+++ b/doc/parser.html
@@ -1,11 +1,9 @@
 xterm coming with XFree86 4.0.x).
 Given the UTF-8 capable terminal, you can now let the testgedcom
 program print the values that it parses. An example of a command
-line is (in the gedcom directory):
-./testgedcom -dg t/ulhc.ged
+line is (in the top directory):
+./testgedcom -dg t/input/ulhc.ged
 The -dg option instructs the parser to show its own debug messages
 (see ./testgedcom -h for the full set of options).
@@ -102,10 +100,10 @@ containing a lot of special characters.
 For the ANSEL test file (t/ansel.ged), you have to set the
 environment variable GCONV_PATH to the ansel
-subdirectory of the gedcom directory:
+subdirectory of the top directory:
 export GCONV_PATH=./ansel
-./testgedcom -dg t/ansel.ged
+./testgedcom -dg t/input/ansel.ged
 This is because for the ANSEL character set an extra module is needed
 for the iconv library (more on this later). But again, this should
@@ -121,11 +119,13 @@ containing a lot of special characters.
 make lexer_1byte
-This will generate a lexer program that can process e.g. the t/allged.ged
+This will generate a lexer program that can process e.g. the t/input/allged.ged
 test file. Simply cat the file through the lexer on standard input
 and you should get all the tokens in the file. Similar tests can be
 done using make lexer_hilo and
-make lexer_lohi (for the unicode lexers). In each of the cases
-you need to know yourself which of the test files are appropriate
-to pass through the lexer.
+make lexer_lohi
+(for the unicode lexers). In each of the cases you need to know
+yourself which of the test files are appropriate to pass through the lexer.
 This concludes the testing setup. Now for some explanations...
@@ -149,59 +149,20 @@
 For each recognized statement in the GEDCOM file, the parser calls
 some callback which can be registered by the application to get the
 information out of the file.
-This basic description ignores the problem of character encoding. The
-next section describes what this problem exactly is.
-
-Character encoding
-The character encoding problem
-
-Developers are usually familiar with the ASCII character set. This is
-a character set that assigns a unique number to some characters, e.g.
-an "A" has ASCII code 65 (or 0x41 in hex), and an "a" has ASCII code
-97 (or 0x61 in hex). Some people may also have used ASCII codes for
-several drawing characters (such as a horizontal bar, a vertical bar,
-or a top-right corner) in the old DOS days, to be able to draw nice
-windows in text mode.
-
-However, these last characters are strictly spoken not part of the
-ASCII set. The standard ASCII set contains only the character
-positions from 0 to 127 (i.e. anything that fits into an integer that
-is 7 bits wide). An example of this table can be found here. Anything
-that has an ASCII code between 128 and 255 is in principle undefined.
-
-Now, several systems (including the old DOS) have defined those
-character positions anyway, but usually in totally different ways.
-Some well known extensions are:
-  - the DOS character set, nowadays usually known as Code Page 437,
-    but sometimes also named LatinUS, ECS (Extended Character Set) or
-    PC-8; note that the table displayed in the link also contains the
-    standard ASCII part
-  - the ANSI character set, also known as Code Page 1252, and usually
-    the default on Windows
-  - the ISO-8859-1 character set (also called Latin-1), which is an
-    ISO standard for Western European languages, mostly used on
-    various Unices
-  - the Adobe Standard Encoding, which is by default used in
-    Postscript, unless overridden
-
-And these are only examples of character sets used in West-European
-languages. For Japanese, Chinese, Korean, Vietnamese, ... there are
-separate character sets in which one byte's meaning can even be
-influenced by what the previous byte was, i.e. these are multi-byte
-character sets. This is because even 256 characters is totally
-inadequate to represent all characters in such languages.
-
-So, summarizing, if a text file contains a byte that has a value 65,
-it is pretty safe to assume that this byte represents an "A", if we
-ignore the multi-byte character sets spoken of before. However, a
-value 233 cannot be interpreted without knowing in which character set
-the text file is written. In Latin-1, it happens to be the character
-"é", but in another character set it can be something totally
-different (e.g. in the DOS character set it is the Greek letter
-theta).
-
-Vice versa, if you need to write a character "é" to a file, it depends
-on the character set you will use what the numerical value will be in
-the file: in Latin-1 it will be 233, but if you use the DOS character
-set it will be 130.
-
-Unicode
-Enter the Unicode standard...
-
-TO BE COMPLETED
+This basic description ignores the problem of character encoding.
+
+Character encoding
+Refer to this page for some introduction on character encoding...
+
+GEDCOM defines three standard encodings:
+  - ASCII
+  - ANSEL
+  - UNICODE (assumed to be UCS-2, either big-endian or little-endian:
+    the GEDCOM spec doesn't specify this)
+These are all supported by the parser, and converted into UTF-8 format.
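The new text states that all input encodings are converted to UTF-8, with UNICODE assumed to be UCS-2 in either byte order. As a minimal editor's illustration (Python's standard codecs, not gedcom-parse code — the parser itself does this via iconv), the same character arrives as different bytes depending on the byte order, but normalizes to one UTF-8 representation:

```python
# Illustration only (not from gedcom-parse): one character, three byte forms.
s = "é"                            # U+00E9

ucs2_be = s.encode("utf-16-be")    # big-endian UCS-2:    b'\x00\xe9'
ucs2_le = s.encode("utf-16-le")    # little-endian UCS-2: b'\xe9\x00'
utf8    = s.encode("utf-8")        # UTF-8:               b'\xc3\xa9'

# Whatever byte order the GEDCOM file uses, decoding and re-encoding
# yields a single canonical internal representation:
assert ucs2_be.decode("utf-16-be").encode("utf-8") == utf8
assert ucs2_le.decode("utf-16-le").encode("utf-8") == utf8
```

Plain ASCII needs no conversion at all, since every ASCII byte is already valid UTF-8; only ANSEL and UCS-2 input actually change representation.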
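The character-set ambiguity that the old section described — byte value 233 is "é" in Latin-1 but the Greek letter theta in the DOS code page, while "é" itself is byte 130 in the DOS set — can be reproduced directly with Python's standard codecs (an editor's sketch, not gedcom-parse code):

```python
# Illustration only: the same byte means different characters in
# different 8-bit character sets.
raw = bytes([233])                  # one byte, value 233 (0xE9)
print(raw.decode("latin-1"))        # 'é' in ISO-8859-1 / Latin-1
print(raw.decode("cp437"))          # 'Θ' (Greek theta) in the DOS code page

# And the reverse: the same character maps to different byte values.
print("é".encode("latin-1"))        # b'\xe9' -> 233
print("é".encode("cp437"))          # b'\x82' -> 130
```

This is exactly why a GEDCOM file cannot be interpreted without knowing its declared character set, and why the parser normalizes everything to UTF-8 internally.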