Gedcom parser in Genes

- -

The intention of this page is to provide some explanation - of the gedcom parser, to aid development on and with it. Currently, - the parser is in a state that it works, but some parts are still missing, - notably the interface towards applications. First, some practical -issues of testing with the parser will be explained.
-
- -

Basic testing
-

- The parser is located in the "gedcom" subdirectory of the Genes source - code. You should be able to perform a basic test using the commands:
- -

make clean - make - make test
-

- If everything goes OK, you'll see that some gedcom files are parsed, -and that each parse is successful. Note that the used gedcom files -are made by Heiner -Eichmann - and are an excellent way to test gedcom parsers thoroughly.
-
- -

Preparing for further testing

- The basic testing described above doesn't show anything else than "Parse - succeeded", which is nice, but not very interesting. Some more detailed - tests are possible, via the gedcom-parse program that is generated - by make test.
-
- However, since the output that gedcom-parse generates is -in UTF-8 format (more on this later), some preparation is necessary to have - a full view on it. Basically, you need a terminal that understands -and can display UTF-8 encoded characters, and you need to proper fonts installed - to display them. I'll give some advice on this here, based on the -Red Hat 7.1 distribution that I use, with glibc 2.2 and XFree86 4.0.x. Any - other distribution that has the same or newer versions for these components - should give the same results.
-
- For the first issue, the UTF-8 capable terminal, the safest bet is to -use xterm in its unicode mode (which is supported by the - xterm coming with XFree86 4.0.x). UTF-8 capabilities -have only recently been added to gnome-terminal, so probably -that is not in your distribution yet (it certainly isn't in Red Hat 7.1).
-
- For the second issue, you'll need the ISO 10646-1 fonts. These -come also with XFree86 4.0.x.
-
- The way to start xterm in unicode mode is then e.g. (put -everything on 1 line !):
- -

LANG=en_GB.UTF-8 xterm -bg 'black' -fg 'DarkGrey' -cm - -fn '-Misc-Fixed-Medium-R-SemiCondensed--13-120-75-75-C-60-ISO10646-1'
-

- This first sets the LANG variable to a locale that -uses UTF-8, and then starts xterm with a proper Unicode font. - Some sample UTF-8 plain text files can be found - here - . Just cat them on the command line and see the result.
-
- -

Testing the parser with debugging

- Given the UTF-8 capable terminal, you can now let the gedcom-parse - program print the values that it parses. An example of a command - line is (in the gedcom directory):
- -

./gedcom_parse -dg t/ulhc.ged
+ + + The Gedcom parser library internals + +
+
The Gedcom parser library internals
+ +
The intention of this page is to provide some explanation + of the gedcom parser, to aid development on and with it. First, +some practical issues of testing with the parser will be explained.
+
+
Index
+
+
Testing
+
+
Basic testing
+
Preparing for further testing
+
Testing the parser with debugging
+
Testing the lexers separately
+
+
+
Structure of the parser
+
+
Character encoding
+
+
+
+
+
+
Testing
+
+ + +
Basic testing
+ +
+ + You should be able to perform a basic test using the commands:
+ +
./configure + make + make check

- The -dg option instructs the parser to show its own debug -messages (see ./gedcom_parse -h for the full set of options). - If everything is OK, you'll see the values from the gedcom file, containing -a lot of special characters.
+ If everything goes OK, you'll see that some gedcom files are parsed, + and that each parse is successful. Note that some of the used gedcom files + are made by Heiner + Eichmann and are an excellent way to test gedcom parsers thoroughly.

- For the ANSEL test file (t/ansel.ged), you have to set the - environment variable GCONV_PATH to the ansel subdirectory - of the gedcom directory:
- -
export GCONV_PATH=./ansel - ./gedcom_parse -dg t/ansel.ged -
- This is because for the ANSEL character set an extra module is needed -for the iconv library (more on this later). But again, this should -show a lot of special characters.
+ + +
Preparing for further testing
+Some +more detailed tests are possible, via the testgedcom program +that is generated by make.
+
+ However, since the output that testgedcom generates +is in UTF-8 format (more on this later), some preparation is necessary +to have a full view on it. Basically, you need a terminal that understands + and can display UTF-8 encoded characters, and you need to proper fonts +installed to display them. I'll give some advice on this here, +based on the Red Hat 7.1 distribution that I use, with glibc 2.2 and XFree86 +4.0.x. Any other distribution that has the same or newer versions +for these components should give the same results.
+
+ For the first issue, the UTF-8 capable terminal, the safest bet is + to use xterm in its unicode mode (which is supported by +the xterm coming with XFree86 4.0.x). UTF-8 capabilities + have only recently been added to gnome-terminal, so probably + that is not in your distribution yet (it certainly isn't in Red Hat 7.1).
+
+ For the second issue, you'll need the ISO 10646-1 fonts. These + come also with XFree86 4.0.x.
+
+ The way to start xterm in unicode mode is then e.g. +(put everything on 1 line !):
+ +
LANG=en_GB.UTF-8 xterm -bg 'black' -fg 'DarkGrey' -cm + -fn '-Misc-Fixed-Medium-R-SemiCondensed--13-120-75-75-C-60-ISO10646-1'
+
+ This first sets the LANG variable to a locale that + uses UTF-8, and then starts xterm with a proper Unicode font. + Some sample UTF-8 plain text files can be found + here . Just cat them on the command line + and see the result.

- -
Testing the lexers separately
- The lexers themselves can be tested separately. For the 1-byte lexer - (i.e. supporting the encodings with 1 byte per characters, such as ASCII, - ANSI and ANSEL), the sequence of commands would be:
- -
make clean - make test_1byte - cat t/allged.ged | ./test_1byte
+ + +
Testing the parser with debugging
+ + Given the UTF-8 capable terminal, you can now let the testgedcom + program print the values that it parses. An example of a command + line is (in the top directory):
+ +
./testgedcom -dg t/input/ulhc.ged

- This will show all tokens in the t/allged.ged test file. - With the lexers you have to make sure that you use the proper lexer -for each test file. The test_1byte test program is OK -for allged.ged and ansel.ged (the last one again -with the environment variable set); for the uhl*.ged files -you need the test_hilo test program; for the ulh*.ged - files you need the test_lohi program.
+ The -dg option instructs the parser to show its own debug + messages (see ./testgedcom -h for the full set of options). + If everything is OK, you'll see the values from the gedcom file, +containing a lot of special characters.

- This concludes the testing setup. Now for some explanations...
-
- -
Structure of the parser
- I see the structure of a program using the gedcom parser as follows:
-
- -
-
-
- TO BE COMPLETED...
-
$Id$
-$Name$
+ For the ANSEL test file (t/ansel.ged), you have to set + the environment variable GCONV_PATH to the ansel + subdirectory of the top directory:
+ +
export GCONV_PATH=./ansel + ./testgedcom -dg t/input/ansel.ged +
+ This is because for the ANSEL character set an extra module is needed + for the iconv library (more on this later). But again, this should + show a lot of special characters.
+
+ + +
Testing the lexers separately
+ + The lexers themselves can be tested separately. For the 1-byte + lexer (i.e. supporting the encodings with 1 byte per characters, such as + ASCII, ANSI and ANSEL), the command would be (in the gedcom subdirectory):
+ +
make lexer_1byte +
+ This will generate a lexer program that can process e.g. the t/input/allged.ged + test file. Simply cat the file through the lexer on standard input +and you should get all the tokens in the file. Similar tests can be +done using make lexer_hilo and +make lexer_lohi + (for the unicode lexers). In each of the cases you need to know +yourself which of the test files are appropriate to pass through the lexer.
+
+ This concludes the testing setup. Now for some explanations...
+

+ + +
Structure of the parser
+ I see the structure of a program using the gedcom parser as follows:
+
+ +
+
+
+ The parser is based on lex/yacc, which means that a module generated by lex + takes the inputfile and determines the tokens in that file (i.e. the smallest +units, such as numbers, line terminators, GEDCOM tags, characters in GEDCOM +values...). These tokens are passed to the parser module, which is +generated by yacc, to parse the syntax of the file, i.e. whether the tokens +appear in a sequence that is valid.
+
+For each recognized statement in the GEDCOM file, the parser calls some callbacks, +which can be registered by the application to get the information out of +the file.
+
+This basic description ignores the problem of character encoding.
+
+
Character encoding
Refer to this page for some introduction on character encoding...
+ +
+GEDCOM defines three standard encodings:
+
+
ASCII
+
ANSEL
+
UNICODE (assumed to be UCS-2, either big-endian or little-endian: the GEDCOM spec doesn't specify this)
+
These are all supported by the parser, and converted into UTF-8 format.
+ + + + + +
+
$Id$
$Name$
-
-
- - - +

Gedcom parser in Genes

Basic testing
-

Preparing for further testing

Testing the parser with debugging

The Gedcom parser library internals

Index

Testing
+

Basic testing
+ +

Preparing for further testing

Testing the lexers separately

Testing the parser with debugging

Structure of the parser

Testing the lexers separately

Structure of the parser

Character encoding

Gedcom parser in Genes

Basic testing -

Preparing for further testing

Testing the parser with debugging

The Gedcom parser library internals

Index

Testing +

Basic testing + +

Preparing for further testing

Testing the lexers separately

Testing the parser with debugging

Structure of the parser

Testing the lexers separately

Structure of the parser

Character encoding

Basic testing
-

Testing
+

Basic testing
+ +