X-Git-Url: https://git.dlugolecki.net.pl/?a=blobdiff_plain;f=doc%2Fparser.html;h=a344644c334e745a4a1067c045f74226240dc373;hb=0767205183bd541a80b83ce4578e22c9b5fe9fa4;hp=893670cf3fb7361de6e2183bee074c6a3115ca90;hpb=37246a4b0ab22fb948d6fb5b9b91917441db26cf;p=gedcom-parse.git
diff --git a/doc/parser.html b/doc/parser.html
index 893670c..a344644 100644
--- a/doc/parser.html
+++ b/doc/parser.html
@@ -89,9 +89,9 @@ the xterm
coming with XFree86 4.0.x). UTF-8 capabilities
Given the UTF-8 capable terminal, you can now let the testgedcom
program print the values that it parses. An example of a command
- line is (in the gedcom directory):
+ line is (in the top directory):
-./testgedcom -dg t/ulhc.ged
+./testgedcom -dg t/input/ulhc.ged
The -dg option instructs the parser to show its own debug messages (see ./testgedcom -h for the full set of options).
@@ -100,10 +100,10 @@ containing a lot of special characters.
For the ANSEL test file (t/ansel.ged), you have to set the environment variable GCONV_PATH to the ansel
- subdirectory of the gedcom directory:
+ subdirectory of the top directory:
export GCONV_PATH=./ansel
- ./testgedcom -dg t/ansel.ged
+ ./testgedcom -dg t/input/ansel.ged
This is because for the ANSEL character set an extra module is needed for the iconv library (more on this later). But again, this should
@@ -119,7 +119,7 @@ containing a lot of special characters.
make lexer_1byte
- This will generate a lexer program that can process e.g. the t/allged.ged
+ This will generate a lexer program that can process e.g. the t/input/allged.ged
test file. Simply cat the file through the lexer on standard input and you should get all the tokens in the file. Similar tests can be done using make lexer_hilo and
@@ -149,12 +149,20 @@ For each recognized statement in the GEDCOM file, the parser calls some callback which can be registered by the application to get the information out of the file.
-This basic description ignores the problem of character encoding. The next section describes what this problem exactly is.
+This basic description ignores the problem of character encoding.
Character encoding
Refer to this page for some introduction on character encoding...
-
-TO BE COMPLETED
+
+GEDCOM defines three standard encodings:
+
+- ASCII
+- ANSEL
+- UNICODE (assumed to be UCS-2, either big-endian or little-endian: the GEDCOM spec doesn't specify this)
+
+These are all supported by the parser, and converted into UTF-8 format.
+ + +
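The UNICODE case above leaves the byte order open. The following is only a minimal sketch of one way to handle it, not the parser's actual code: it uses a leading byte order mark (U+FEFF) to choose between big-endian and little-endian UCS-2, and re-encodes a single UCS-2 code unit as UTF-8. The names `ucs2_order`, `detect_ucs2_order` and `ucs2_to_utf8` are hypothetical and do not come from the gedcom-parse sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for illustration; not part of the gedcom-parse API. */
typedef enum { UCS2_UNKNOWN, UCS2_BIG_ENDIAN, UCS2_LITTLE_ENDIAN } ucs2_order;

/* Guess the byte order from a leading byte order mark (U+FEFF), if any. */
ucs2_order detect_ucs2_order(const unsigned char *buf, size_t len)
{
    if (buf != NULL && len >= 2) {
        if (buf[0] == 0xFE && buf[1] == 0xFF) return UCS2_BIG_ENDIAN;
        if (buf[0] == 0xFF && buf[1] == 0xFE) return UCS2_LITTLE_ENDIAN;
    }
    return UCS2_UNKNOWN; /* no BOM: the caller has to fall back to a heuristic */
}

/* Encode one UCS-2 code unit as UTF-8.
 * `out` must have room for 3 bytes; returns the number of bytes written. */
size_t ucs2_to_utf8(uint16_t cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp < 0x80) {                      /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}
```

When no byte order mark is present the sketch reports `UCS2_UNKNOWN` rather than guessing; how the real parser resolves that case is not stated in this text.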