X-Git-Url: https://git.dlugolecki.net.pl/?a=blobdiff_plain;f=doc%2Fparser.html;h=893670cf3fb7361de6e2183bee074c6a3115ca90;hb=37246a4b0ab22fb948d6fb5b9b91917441db26cf;hp=a70769370698c294c327d0abbc95e8d576a5514c;hpb=1316fee80103b38bf682f5c626c0abaa525db41c;p=gedcom-parse.git diff --git a/doc/parser.html b/doc/parser.html index a707693..893670c 100644 --- a/doc/parser.html +++ b/doc/parser.html @@ -1,22 +1,40 @@ - - -
+ - -If everything goes OK, you'll see that some gedcom files are parsed, - and that each parse is successful. Note that the used gedcom files + and that each parse is successful. Note that some of the used gedcom files are made by Heiner Eichmann and are an excellent way to test gedcom parsers thoroughly../configure
@@ -24,16 +42,16 @@ some practical issues of testing with the parser will be explained.
make check
testgedcom
program
-that is generated by make test
. make
. testgedcom
generates
is in UTF-8 format (more on this later), some preparation is necessary
@@ -66,7 +84,9 @@ the xterm
coming with XFree86 4.0.x). UTF-8 capabilities
and see the result.testgedcom
program print the values that it parses. An example of a command
line is (in the gedcom
directory):gedcom
subdirectory):make clean
- make test_1byte
+ make lexer_1byte
- This will show all tokens in the t/allged.ged
test file.
- Similar tests can be done using make test_hilo
and
-make test_lohi
(for the unicode lexers).
+ This will generate a lexer program that can process e.g. the t/allged.ged
+ test file. Simply pipe the file to the lexer's standard input
+and it should print all the tokens in the file. Similar tests can be
+done using make lexer_hilo
and
+make lexer_lohi
+ (for the Unicode lexers). In each case, you need to determine
+yourself which test files are appropriate for that lexer.
This concludes the testing setup. Now for some explanations...
-
+
- Structure of the parser
+ Structure of the parser
I see the structure of a program using the gedcom parser as follows:
- TO BE COMPLETED...
+ The parser is based on lex/yacc
, which means that a module generated by lex
+ takes the input file and determines the tokens in that file (i.e. the smallest
+units, such as numbers, line terminators, GEDCOM tags, characters in GEDCOM
+values...). These tokens are passed to the parser module, which is
+generated by yacc and checks the syntax of the file, i.e. whether the tokens
+appear in a valid sequence.
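To make the idea of tokens concrete, here is a minimal hand-written sketch in C that splits a single GEDCOM line into its level number, optional cross-reference, tag and value. It is not the lex-generated lexer of this package: the struct `gedcom_line` and the functions `copy_span` and `split_gedcom_line` are hypothetical names chosen for illustration, and the sketch ignores character encoding entirely.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for the pieces of one GEDCOM line;
   not a type taken from the gedcom parser itself. */
struct gedcom_line {
    int  level;
    char xref[32];    /* empty string if the line has no cross-reference */
    char tag[32];
    char value[256];  /* empty string if the line has no value */
};

/* Copy the bytes in [start, end) into dst, truncating to fit dstsize. */
static void copy_span(char *dst, size_t dstsize,
                      const char *start, const char *end)
{
    size_t n = (size_t)(end - start);
    if (n >= dstsize)
        n = dstsize - 1;
    memcpy(dst, start, n);
    dst[n] = '\0';
}

/* Split one line such as "0 @I1@ INDI" or "1 NAME John /Doe/".
   Returns 0 on success and -1 if the line does not have that shape. */
int split_gedcom_line(const char *line, struct gedcom_line *out)
{
    const char *p = line;
    const char *start;
    char *end;

    assert(line != NULL && out != NULL);
    out->xref[0] = out->tag[0] = out->value[0] = '\0';

    /* Level number, followed by exactly one space. */
    if (!isdigit((unsigned char)*p))
        return -1;
    out->level = (int)strtol(p, &end, 10);
    if (*end != ' ')
        return -1;
    p = end + 1;

    /* Optional cross-reference of the form @...@ */
    if (*p == '@') {
        start = p;
        p = strchr(p + 1, '@');
        if (p == NULL)
            return -1;
        p++;                      /* include the closing '@' */
        copy_span(out->xref, sizeof out->xref, start, p);
        if (*p != ' ')
            return -1;
        p++;
    }

    /* Tag: letters, digits and underscores. */
    start = p;
    while (isalnum((unsigned char)*p) || *p == '_')
        p++;
    if (p == start)
        return -1;
    copy_span(out->tag, sizeof out->tag, start, p);

    /* Optional value: everything up to the line terminator. */
    if (*p == ' ') {
        p++;
        start = p;
        p += strcspn(p, "\r\n");
        copy_span(out->value, sizeof out->value, start, p);
    }
    return 0;
}
```

A lex-generated lexer hands these pieces to the yacc parser one token at a time instead of filling a struct, but the units it recognizes are of the same kind.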
+
+For each recognized statement in the GEDCOM file, the parser calls callbacks
+that the application can register to get the information out of
+the file.
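The callback mechanism could look roughly like the following C sketch. All names here (`record_cb`, `register_callback`, `dispatch_statement`, `count_statement`) are hypothetical and only illustrate the pattern; they are not the interface of the gedcom parser, which this document does not describe yet.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical callback signature: the parser passes the pieces of a
   recognized statement plus a pointer supplied by the application. */
typedef void (*record_cb)(int level, const char *tag,
                          const char *value, void *user_data);

#define MAX_CALLBACKS 16

struct cb_entry {
    char      tag[8];
    record_cb cb;
    void     *user_data;
};

static struct cb_entry cb_table[MAX_CALLBACKS];
static size_t cb_count = 0;

/* Application side: ask to be called for every statement with this tag.
   Returns 0 on success, -1 if the table is full or the tag is too long. */
int register_callback(const char *tag, record_cb cb, void *user_data)
{
    if (cb_count == MAX_CALLBACKS || strlen(tag) >= sizeof cb_table[0].tag)
        return -1;
    strcpy(cb_table[cb_count].tag, tag);
    cb_table[cb_count].cb = cb;
    cb_table[cb_count].user_data = user_data;
    cb_count++;
    return 0;
}

/* Parser side: called once per recognized statement.
   Returns the number of callbacks that were invoked. */
int dispatch_statement(int level, const char *tag, const char *value)
{
    int called = 0;
    size_t i;

    for (i = 0; i < cb_count; i++) {
        if (strcmp(cb_table[i].tag, tag) == 0) {
            cb_table[i].cb(level, tag, value, cb_table[i].user_data);
            called++;
        }
    }
    return called;
}

/* Example application callback: count the statements it is called for. */
void count_statement(int level, const char *tag,
                     const char *value, void *user_data)
{
    (void)level;
    (void)tag;
    (void)value;
    ++*(int *)user_data;
}
```

An application would register `count_statement` for a tag such as `NAME` before starting the parse and read its counter afterwards. The point of the pattern is that the parser itself never needs to know what the application does with the data.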
+
+This basic description ignores the problem of character encoding. The next section describes what exactly this problem is.
+
+ Character encoding
 Refer to this page for an introduction to character encoding...
+
+
+TO BE COMPLETED
+
- $Id$
$Name$
+ $Id$
$Name$