X-Git-Url: https://git.dlugolecki.net.pl/?a=blobdiff_plain;ds=inline;f=doc%2Fparser.html;h=a344644c334e745a4a1067c045f74226240dc373;hb=b254663a41a4da7bc4c9039f89411f4b3bd0b8a7;hp=a70769370698c294c327d0abbc95e8d576a5514c;hpb=1316fee80103b38bf682f5c626c0abaa525db41c;p=gedcom-parse.git diff --git a/doc/parser.html b/doc/parser.html index a707693..a344644 100644 --- a/doc/parser.html +++ b/doc/parser.html @@ -1,22 +1,40 @@ - - -
+ - -If everything goes OK, you'll see that some gedcom files are parsed, - and that each parse is successful. Note that the used gedcom files + and that each parse is successful. Note that some of the used gedcom files are made by Heiner Eichmann and are an excellent way to test gedcom parsers thoroughly../configure
@@ -24,16 +42,16 @@ some practical issues of testing with the parser will be explained.
make check
testgedcom
program
-that is generated by make test
. make
. testgedcom
generates
is in UTF-8 format (more on this later), some preparation is necessary
@@ -66,12 +84,14 @@ the xterm
coming with XFree86 4.0.x). UTF-8 capabilities
and see the result.testgedcom
program print the values that it parses. An example of a command
- line is (in the gedcom
directory):
directory):./testgedcom -dg t/ulhc.ged
+The./testgedcom -dg t/input/ulhc.ged
-dg
option instructs the parser to show its own debug messages (see./testgedcom -h
for the full set of options). @@ -80,47 +100,76 @@ containing a lot of special characters.
For the ANSEL test file (t/ansel.ged
), you have to set the environment variableGCONV_PATH
to theansel
- subdirectory of the gedcom directory:
+ subdirectory of the top directory:
This is because for the ANSEL character set an extra module is needed for the iconv library (more on this later). But again, this should show a lot of special characters.export GCONV_PATH=./ansel
- ./testgedcom -dg t/ansel.ged
+ ./testgedcom -dg t/input/ansel.ged
-Testing the lexers separately
+ +Testing the lexers separately
+ The lexers themselves can be tested separately. For the 1-byte lexer (i.e. supporting the encodings with 1 byte per character, such as - ASCII, ANSI and ANSEL), the sequence of commands would be:
+ ASCII, ANSI and ANSEL), the command would be (in thegedcom
subdirectory):
-make clean
- make test_1byte
+- This will show all tokens in themake lexer_1byte
t/allged.ged
test file. - Similar tests can be done usingmake test_hilo
and-make test_lohi
(for the unicode lexers).
+ This will generate a lexer program that can process e.g. thet/input/allged.ged
+ test file. Simply cat the file through the lexer on standard input +and you should get all the tokens in the file. Similar tests can be +done usingmake lexer_hilo
and+make lexer_lohi
+ (for the Unicode lexers). In each case, you need to determine for yourself +which of the test files are appropriate to pass through the lexer.
This concludes the testing setup. Now for some explanations...
-
+
-Structure of the parser
+Structure of the parser
I see the structure of a program using the gedcom parser as follows:
- TO BE COMPLETED...
+ The parser is based onlex/yacc
, which means that a module generated bylex
+ takes the input file and determines the tokens in that file (i.e. the smallest +units, such as numbers, line terminators, GEDCOM tags, characters in GEDCOM +values...). These tokens are passed to the parser module, which is +generated by yacc and checks the syntax of the file, i.e. whether the tokens +appear in a valid sequence.
+
+For each recognized statement in the GEDCOM file, the parser invokes callbacks +that the application can register to get the information out of +the file.
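The flow described above (split a line into tokens, then hand each recognized statement to registered callbacks) can be illustrated with a minimal sketch. It does not use lex/yacc and is not the library's actual API: the names `register_callback`, `parse_line`, `line_callback` and `on_name` are hypothetical, and the "LEVEL TAG [VALUE]" line layout is a simplification that ignores cross-reference identifiers.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical callback type: gets the level, tag and value of one line. */
typedef void (*line_callback)(int level, const char *tag, const char *value);

#define MAX_CALLBACKS 16

struct callback_entry {
    const char *tag;      /* tag this callback is interested in */
    line_callback func;   /* function to call when the tag is seen */
};

static struct callback_entry callbacks[MAX_CALLBACKS];
static int callback_count = 0;

/* Register a callback for a tag; returns 0 on success, -1 if the table is full. */
int register_callback(const char *tag, line_callback func)
{
    if (callback_count >= MAX_CALLBACKS)
        return -1;
    callbacks[callback_count].tag = tag;
    callbacks[callback_count].func = func;
    callback_count++;
    return 0;
}

/* Split a simplified "LEVEL TAG [VALUE]" line and call every callback
   registered for its tag. Returns the number of callbacks called,
   or -1 if the line does not start with a level and a tag. */
int parse_line(const char *line)
{
    int level, consumed = 0, called = 0, i;
    char tag[32];
    const char *value;

    if (sscanf(line, "%d %31s%n", &level, tag, &consumed) != 2)
        return -1;
    value = line + consumed;
    while (*value == ' ')          /* skip the delimiter before the value */
        value++;
    for (i = 0; i < callback_count; i++) {
        if (strcmp(callbacks[i].tag, tag) == 0) {
            callbacks[i].func(level, tag, value);
            called++;
        }
    }
    return called;
}

/* Example application callback: counts and prints NAME values. */
int names_seen = 0;

void on_name(int level, const char *tag, const char *value)
{
    names_seen++;
    printf("level %d, %s: %s\n", level, tag, value);
}
```

An application would call `register_callback("NAME", on_name)` once and then feed lines to `parse_line`. Lines with tags that nobody registered for are recognized but trigger no callback.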
+
+This basic description ignores the problem of character encoding.
+
+Character encoding
Refer to this page for some introduction on character encoding...
+ +
+GEDCOM defines three standard encodings:
++
These are all supported by the parser, and converted into UTF-8 format.- ASCII
+- ANSEL
+- UNICODE (assumed to be UCS-2, either big-endian or little-endian: the GEDCOM spec doesn't specify this)
+
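As an illustration of the UNICODE case above, the following sketch converts a UCS-2 buffer to UTF-8 and takes the byte order from a leading byte order mark. It is hand-written for clarity and is not how the parser does it (the text above indicates the iconv library is involved). The function names are hypothetical, and falling back to big-endian when no byte order mark is present is an assumption of this sketch only.

```c
#include <stddef.h>

/* Encode one UCS-2 code unit as UTF-8; returns the number of bytes written (1 to 3). */
size_t ucs2_unit_to_utf8(unsigned int cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Convert a UCS-2 buffer to UTF-8.
   Byte order comes from a leading byte order mark (FE FF = big-endian,
   FF FE = little-endian); without one, big-endian is assumed here.
   Returns the number of bytes written, or (size_t)-1 if the input
   length is odd or the output buffer is too small. */
size_t ucs2_to_utf8(const unsigned char *in, size_t in_len,
                    unsigned char *out, size_t out_cap)
{
    int big_endian = 1;
    size_t i = 0, n = 0;

    if (in_len % 2 != 0)
        return (size_t)-1;
    if (in_len >= 2) {
        if (in[0] == 0xFE && in[1] == 0xFF) {
            big_endian = 1;
            i = 2;                  /* skip the byte order mark */
        } else if (in[0] == 0xFF && in[1] == 0xFE) {
            big_endian = 0;
            i = 2;
        }
    }
    for (; i < in_len; i += 2) {
        unsigned int cp = big_endian
            ? ((unsigned int)in[i] << 8) | in[i + 1]
            : ((unsigned int)in[i + 1] << 8) | in[i];
        if (out_cap - n < 3)        /* worst case: 3 bytes per code unit */
            return (size_t)-1;
        n += ucs2_unit_to_utf8(cp, out + n);
    }
    return n;
}
```

Since UCS-2 code units never exceed 16 bits, three UTF-8 bytes per unit is the worst case, which is why the sketch reserves three bytes before each write.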
+ + + +
-$Id$+
$Name$$Id$
$Name$