X-Git-Url: https://git.dlugolecki.net.pl/?a=blobdiff_plain;ds=sidebyside;f=doc%2Fparser.html;h=a344644c334e745a4a1067c045f74226240dc373;hb=e4ccfb6a6f26881db3874841a5aa2c2201a9c4e0;hp=26586a36e6ec7ec4b6d70a0dca5a9064bc883a0d;hpb=8b7ff0dff0815a94ff08a9825d22a0c44490317a;p=gedcom-parse.git diff --git a/doc/parser.html b/doc/parser.html index 26586a3..a344644 100644 --- a/doc/parser.html +++ b/doc/parser.html @@ -1,132 +1,175 @@ - - -
- - -- If everything goes OK, you'll see that some gedcom files are parsed, -and that each parse is successful. Note that the used gedcom files -are made by Heiner -Eichmann - and are an excellent way to test gedcom parsers thoroughly.make clean
- make
- make test
-
gedcom-parse
program that is generated
- by make test
. gedcom-parse
generates is
-in UTF-8 format (more on this later), some preparation is necessary to have
- a full view on it. Basically, you need a terminal that understands
-and can display UTF-8 encoded characters, and you need to proper fonts installed
- to display them. I'll give some advice on this here, based on the
-Red Hat 7.1 distribution that I use, with glibc 2.2 and XFree86 4.0.x. Any
- other distribution that has the same or newer versions for these components
- should give the same results.xterm
in its unicode mode (which is supported by the
- xterm
coming with XFree86 4.0.x). UTF-8 capabilities
-have only recently been added to gnome-terminal
, so probably
-that is not in your distribution yet (it certainly isn't in Red Hat 7.1).xterm
in unicode mode is then e.g. (put
-everything on 1 line !):- This first sets theLANG=en_GB.UTF-8 xterm -bg 'black' -fg 'DarkGrey' -cm - -fn '-Misc-Fixed-Medium-R-SemiCondensed--13-120-75-75-C-60-ISO10646-1'
-
LANG
variable to a locale that
-uses UTF-8, and then starts xterm
with a proper Unicode font.
- Some sample UTF-8 plain text files can be found
- here
- . Just cat
them on the command line and see the result.gedcom-parse
- program print the values that it parses. An example of a command
- line is (in the gedcom
directory):./gedcom_parse -dg t/ulhc.ged
+ + +The Gedcom parser library internals + ++- - - +The Gedcom parser library internals
+ +The intention of this page is to provide some explanation + of the gedcom parser, to aid development on and with it. First, +some practical issues of testing with the parser will be explained.-
+
+Index
++
+- Testing
++
+- Basic testing
+- Preparing for further testing
+- Testing the parser with debugging
+- Testing the lexers separately
+
+- Structure of the parser
++
+- Character encoding
+
+
+
+Testing
+ + +
+Basic testing
+ + You should be able to perform a basic test using the commands:
+ +
+ +- The./configure
+ make
+ make check
-dg
option instructs the parser to show its own debug -messages (see./gedcom_parse -h
for the full set of options). - If everything is OK, you'll see the values from the gedcom file, containing -a lot of special characters.
+ If everything goes OK, you'll see that some gedcom files are parsed, + and that each parse is successful. Note that some of the gedcom files used + were made by Heiner + Eichmann and are an excellent way to test gedcom parsers thoroughly.
- For the ANSEL test file (t/ansel.ged
), you have to set the - environment variableGCONV_PATH
to theansel
subdirectory - of the gedcom directory:
- -- This is because for the ANSEL character set an extra module is needed -for the iconv library (more on this later). But again, this should -show a lot of special characters.export GCONV_PATH=./ansel
- ./gedcom_parse -dg t/ansel.ged
-
+ + +Preparing for further testing
+Some +more detailed tests are possible, via the testgedcom program +that is generated by make.
+
+ However, since the output that testgedcom generates +is in UTF-8 format (more on this later), some preparation is necessary +to get a full view of it. Basically, you need a terminal that understands + and can display UTF-8 encoded characters, and you need the proper fonts +installed to display them. I'll give some advice on this here, +based on the Red Hat 7.1 distribution that I use, with glibc 2.2 and XFree86 +4.0.x. Any other distribution that has the same or newer versions +of these components should give the same results.
+
+ For the first issue, the UTF-8 capable terminal, the safest bet is + to use xterm in its Unicode mode (which is supported by +the xterm that comes with XFree86 4.0.x). UTF-8 capabilities + have only recently been added to gnome-terminal, so + that is probably not in your distribution yet (it certainly isn't in Red Hat 7.1).
+
+ For the second issue, you'll need the ISO 10646-1 fonts. These + also come with XFree86 4.0.x.
+
+ The way to start xterm in unicode mode is then e.g. +(put everything on 1 line!):
+ +LANG=en_GB.UTF-8 xterm -bg 'black' -fg 'DarkGrey' -cm + -fn '-Misc-Fixed-Medium-R-SemiCondensed--13-120-75-75-C-60-ISO10646-1'
+ This first sets the LANG variable to a locale that + uses UTF-8, and then starts xterm with a proper Unicode font. + Some sample UTF-8 plain text files can be found + here. Just cat them on the command line + and see the result.
- -Testing the lexers separately
- The lexers themselves can be tested separately. For the 1-byte lexer - (i.e. supporting the encodings with 1 byte per characters, such as ASCII, - ANSI and ANSEL), the sequence of commands would be:
- -make clean
- make test_1byte
- cat t/allged.ged | ./test_1byte
+ + +Testing the parser with debugging
+ Given the UTF-8 capable terminal, you can now let the testgedcom + program print the values that it parses. An example of a command + line is (in the top directory):
+ +- This will show all tokens in the./testgedcom -dg t/input/ulhc.ged
t/allged.ged
test file. - With the lexers you have to make sure that you use the proper lexer -for each test file. Thetest_1byte
test program is OK -forallged.ged
andansel.ged
(the last one again -with the environment variable set); for theuhl*.ged
files -you need thetest_hilo
test program; for theulh*.ged
- files you need thetest_lohi
program.
+ The -dg option instructs the parser to show its own debug + messages (see ./testgedcom -h for the full set of options). + If everything is OK, you'll see the values from the gedcom file, +containing a lot of special characters.
- This concludes the testing setup. Now for some explanations...
-
- -Structure of the parser
- I see the structure of a program using the gedcom parser as follows:
-
- -
-
-
- TO BE COMPLETED...
-
$Id$
-$Name$
+ For the ANSEL test file (t/input/ansel.ged), you have to set + the environment variable GCONV_PATH to the ansel + subdirectory of the top directory:
+ +export GCONV_PATH=./ansel
+ ./testgedcom -dg t/input/ansel.ged
+ This is because the ANSEL character set needs an extra module + for the iconv library (more on this later). But again, this should + show a lot of special characters.
+
+
+ + +Testing the lexers separately
+ The lexers themselves can be tested separately. For the 1-byte + lexer (i.e. supporting the encodings with 1 byte per character, such as + ASCII, ANSI and ANSEL), the command would be (in the gedcom subdirectory):
+ +make lexer_1byte
+ This will generate a lexer program that can process e.g. the t/input/allged.ged + test file. Simply cat the file through the lexer on standard input +and you should get all the tokens in the file. Similar tests can be +done using make lexer_hilo and make lexer_lohi + (for the Unicode lexers). In each case you need to know +yourself which of the test files are appropriate to pass through the lexer.
+ This concludes the testing setup. Now for some explanations...
+
+ + +Structure of the parser
+ I see the structure of a program using the gedcom parser as follows:
+
+ +
+
+
+ The parser is based on lex/yacc, which means that a module generated by lex + takes the input file and determines the tokens in that file (i.e. the smallest +units, such as numbers, line terminators, GEDCOM tags, characters in GEDCOM +values...). These tokens are passed to the parser module, which is +generated by yacc, to check the syntax of the file, i.e. whether the tokens +appear in a sequence that is valid.
+
+For each recognized statement in the GEDCOM file, the parser calls some callbacks, +which can be registered by the application to get the information out of +the file.
+
+This basic description ignores the problem of character encoding.
+
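The lexer, parser and callback flow described in this section can be illustrated with a small sketch. This is not the library's code or API: the library itself is generated by lex/yacc, and every name here (`Token`, `lex`, `Parser`, `register`) is hypothetical. For brevity the sketch tokenizes per line rather than into the finer tokens that the real lexer produces, and the only syntax rule it checks is that a level number never jumps by more than one.

```python
import re
from typing import Callable, Dict, Iterator, List, NamedTuple, Optional


class Token(NamedTuple):
    """One GEDCOM line, split into its parts (hypothetical structure)."""
    level: int
    xref: Optional[str]
    tag: str
    value: str


# Assumed line shape for this sketch: LEVEL [@XREF@] TAG [VALUE]
LINE_RE = re.compile(r"^(\d+)(?: (@[^@]+@))? ([A-Za-z0-9_]+)(?: (.*))?$")

Callback = Callable[[Token], None]


def lex(text: str) -> Iterator[Token]:
    """Lexer stage: turn the raw input into a stream of tokens."""
    for lineno, line in enumerate(text.splitlines(), start=1):
        match = LINE_RE.match(line)
        if match is None:
            raise ValueError(f"line {lineno}: cannot tokenize {line!r}")
        level, xref, tag, value = match.groups()
        yield Token(int(level), xref, tag, value or "")


class Parser:
    """Parser stage: check the token sequence and fire callbacks."""

    def __init__(self) -> None:
        self._callbacks: Dict[str, List[Callback]] = {}

    def register(self, tag: str, callback: Callback) -> None:
        """Let the application subscribe to statements with this tag."""
        self._callbacks.setdefault(tag, []).append(callback)

    def parse(self, text: str) -> None:
        previous_level = -1
        for token in lex(text):
            # Syntax check: a line may go at most one level deeper
            # than the line before it.
            if token.level > previous_level + 1:
                raise ValueError(f"invalid level jump at tag {token.tag}")
            previous_level = token.level
            for callback in self._callbacks.get(token.tag, []):
                callback(token)


if __name__ == "__main__":
    parser = Parser()
    parser.register("NAME", lambda tok: print("name:", tok.value))
    parser.parse("0 @I1@ INDI\n1 NAME John /Doe/\n0 TRLR")
```

The application only sees the statements it registered for. This is the point of the callback design: the syntax checking stays inside the parser.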
+Character encoding
Refer to this page for an introduction to character encoding...
+ +
+GEDCOM defines three standard encodings:
+
+- ASCII
+- ANSEL
+- UNICODE (assumed to be UCS-2, either big-endian or little-endian: the GEDCOM spec doesn't specify this)
+
+These are all supported by the parser, and converted into UTF-8 format.
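Because the byte order of the UNICODE encoding is not specified, a reader has to work it out from the data. The following is a minimal sketch of one way to do that, assuming the file starts with a byte order mark and falling back to a caller-supplied default otherwise. The function name and the fallback are assumptions made for illustration. How the parser itself detects the byte order is not described here.

```python
import codecs


def gedcom_unicode_to_utf8(raw: bytes, default_order: str = "big") -> bytes:
    """Convert UCS-2 bytes of either byte order to UTF-8 (illustrative only)."""
    # A byte order mark, if present, tells us which byte comes first.
    if raw.startswith(codecs.BOM_UTF16_BE):
        order, body = "big", raw[2:]
    elif raw.startswith(codecs.BOM_UTF16_LE):
        order, body = "little", raw[2:]
    else:
        # No mark: this sketch simply trusts the caller's default.
        order, body = default_order, raw

    if len(body) % 2:
        raise ValueError("UCS-2 input must have an even number of bytes")

    chars = []
    for i in range(0, len(body), 2):
        code = int.from_bytes(body[i:i + 2], order)
        # UCS-2 has no surrogate pairs, so these code units are rejected.
        if 0xD800 <= code <= 0xDFFF:
            raise ValueError("surrogate code unit is not valid UCS-2")
        chars.append(chr(code))
    return "".join(chars).encode("utf-8")


if __name__ == "__main__":
    big_endian = b"\xfe\xff\x00A\x00\xe9"
    little_endian = b"\xff\xfeA\x00\xe9\x00"
    print(gedcom_unicode_to_utf8(big_endian))
    print(gedcom_unicode_to_utf8(little_endian))
```

Both inputs in the usage example hold the same two characters, so both calls should give the same UTF-8 bytes.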
+ + + + + +
+$Id$
$Name$
-