Gedcom parser in Genes

- -

The intention of this page is to provide some explanation - of the gedcom parser, to aid development on and with it. Currently, - the parser is in a state that it works, but some parts are still missing, - notably the interface towards applications. First, some practical issues - of testing with the parser will be explained.
-
- -

Basic testing
-

- The parser is located in the "gedcom" subdirectory of the Genes source -code. You should be able to perform a basic test using the commands:
- -

make clean - make - make test
-

- If everything goes OK, you'll see that some gedcom files are parsed, and - that each parse is successful. Note that the used gedcom files are -made by Heiner Eichmann - and are an excellent way to test gedcom parsers thoroughly.
-
- -

Preparing for further testing

- The basic testing described above doesn't show anything else than "Parse - succeeded", which is nice, but not very interesting. Some more detailed - tests are possible, via the gedcom-parse program that is generated - by make test.
-
- However, since the output that gedcom-parse generates is -in UTF-8 format (more on this later), some preparation is necessary to have -a full view on it. Basically, you need a terminal that understands and -can display UTF-8 encoded characters, and you need to proper fonts installed - to display them. I'll give some advice on this here, based on the Red - Hat 7.1 distribution that I use, with glibc 2.2 and XFree86 4.0.x. Any - other distribution that has the same or newer versions for these components - should give the same results.
-
- For the first issue, the UTF-8 capable terminal, the safest bet is to -use xterm in its unicode mode (which is supported by the - xterm coming with XFree86 4.0.x). UTF-8 capabilities -have only recently been added to gnome-terminal, so probably -that is not in your distribution yet (it certainly isn't in Red Hat 7.1).
-
- For the second issue, you'll need the ISO 10646-1 fonts. These come - also with XFree86 4.0.x.
-
- The way to start xterm in unicode mode is then e.g. (put -everything on 1 line !):
- -

LANG=en_GB.UTF-8 xterm -bg 'black' -fg 'DarkGrey' -cm - -fn '-Misc-Fixed-Medium-R-SemiCondensed--13-120-75-75-C-60-ISO10646-1'
-

- This first sets the LANG variable to a locale that uses - UTF-8, and then starts xterm with a proper Unicode font. Some - sample UTF-8 plain text files can be found - here - . Just cat them on the command line and see the result.
-
- -

Testing the parser with debugging

- Given the UTF-8 capable terminal, you can now let the gedcom-parse - program print the values that it parses. An example of a command -line is (in the gedcom directory):
- -

./gedcom_parse -dg t/ulhc.ged
-

- The -dg option instructs the parser to show its own debug -messages (see ./gedcom_parse -h for the full set of options). - If everything is OK, you'll see the values from the gedcom file, containing -a lot of special characters.
-
- For the ANSEL test file (t/ansel.ged), you have to set the -environment variable GCONV_PATH to the ansel subdirectory -of the gedcom directory:
- + + + The Gedcom parser library internals + +

The Gedcom parser library internals

+ +

The intention of this page is to provide some explanation + of the gedcom parser, to aid development on and with it. First, +some practical issues of testing with the parser will be explained.
+
+

Index

Testing

Basic testing
Preparing for further testing
Testing the parser with debugging
Testing the lexers separately
+

Structure of the parser

Character encoding
+

+
+

Testing
+

+ + +

Basic testing
+ +

+ + You should be able to perform a basic test using the commands:
+ +

./configure + make + make check
+

+ If everything goes OK, you'll see that some gedcom files are parsed, + and that each parse is successful. Note that some of the used gedcom files + are made by Heiner + Eichmann and are an excellent way to test gedcom parsers thoroughly.
+
+ + +

Preparing for further testing

+Some +more detailed tests are possible, via the testgedcom program +that is generated by make.
+
+ However, since the output that testgedcom generates +is in UTF-8 format (more on this later), some preparation is necessary +to have a full view on it. Basically, you need a terminal that understands + and can display UTF-8 encoded characters, and you need to proper fonts +installed to display them. I'll give some advice on this here, +based on the Red Hat 7.1 distribution that I use, with glibc 2.2 and XFree86 +4.0.x. Any other distribution that has the same or newer versions +for these components should give the same results.
+
+ For the first issue, the UTF-8 capable terminal, the safest bet is + to use xterm in its unicode mode (which is supported by +the xterm coming with XFree86 4.0.x). UTF-8 capabilities + have only recently been added to gnome-terminal, so probably + that is not in your distribution yet (it certainly isn't in Red Hat 7.1).
+
+ For the second issue, you'll need the ISO 10646-1 fonts. These + come also with XFree86 4.0.x.
+
+ The way to start xterm in unicode mode is then e.g. +(put everything on 1 line !):
+ +

LANG=en_GB.UTF-8 xterm -bg 'black' -fg 'DarkGrey' -cm + -fn '-Misc-Fixed-Medium-R-SemiCondensed--13-120-75-75-C-60-ISO10646-1'
+

+ This first sets the LANG variable to a locale that + uses UTF-8, and then starts xterm with a proper Unicode font. + Some sample UTF-8 plain text files can be found + here . Just cat them on the command line + and see the result.
+
+ + +

Testing the parser with debugging

+ + Given the UTF-8 capable terminal, you can now let the testgedcom + program print the values that it parses. An example of a command + line is (in the top directory):
+ +

./testgedcom -dg t/input/ulhc.ged
+

+ The -dg option instructs the parser to show its own debug + messages (see ./testgedcom -h for the full set of options). + If everything is OK, you'll see the values from the gedcom file, +containing a lot of special characters.
+
+ For the ANSEL test file (t/ansel.ged), you have to set + the environment variable GCONV_PATH to the ansel + subdirectory of the top directory:
+

export GCONV_PATH=./ansel - ./gedcom_parse -dg t/ansel.ged -

- This is because for the ANSEL character set an extra module is needed for - the iconv library (more on this later). But again, this should show - a lot of special characters.
+ ./testgedcom -dg t/input/ansel.ged
+ + This is because for the ANSEL character set an extra module is needed + for the iconv library (more on this later). But again, this should + show a lot of special characters.
+
+ + +

Testing the lexers separately

+ + The lexers themselves can be tested separately. For the 1-byte + lexer (i.e. supporting the encodings with 1 byte per characters, such as + ASCII, ANSI and ANSEL), the command would be (in the gedcom subdirectory):
+ +

make lexer_1byte +

+ This will generate a lexer program that can process e.g. the t/input/allged.ged + test file. Simply cat the file through the lexer on standard input +and you should get all the tokens in the file. Similar tests can be +done using make lexer_hilo and


+make lexer_lohi

+ (for the unicode lexers). In each of the cases you need to know +yourself which of the test files are appropriate to pass through the lexer.
+
+ This concludes the testing setup. Now for some explanations...
+

+ + +

Structure of the parser

+ I see the structure of a program using the gedcom parser as follows:
+
+ Gedcom parsing scheme

+
+
+
+ The parser is based on lex/yacc, which means that a module generated by lex + takes the inputfile and determines the tokens in that file (i.e. the smallest +units, such as numbers, line terminators, GEDCOM tags, characters in GEDCOM +values...). These tokens are passed to the parser module, which is +generated by yacc, to parse the syntax of the file, i.e. whether the tokens +appear in a sequence that is valid.

- -

Testing the lexers separately

- The lexers themselves can be tested separately. For the 1-byte lexer - (i.e. supporting the encodings with 1 byte per characters, such as ASCII, - ANSI and ANSEL), the sequence of commands would be:
- -

make clean - make test_1byte - cat t/allged.ged | ./test_1byte
-

- This will show all tokens in the t/allged.ged test file. With - the lexers you have to make sure that you use the proper lexer for each -test file. The test_1byte test program is OK for


-allged.ged

and ansel.ged (the last one again with the -environment variable set); for the uhl*.ged files you need -the test_hilo test program; for the ulh*.ged -files you need the test_lohi program.
-
- This concludes the testing setup. Now for some explanations...
-
- -

Structure of the parser

- I see the structure of a program using the gedcom parser as follows:
-
- Gedcom parsing scheme

-
-
-
- TO BE COMPLETED...
+For each recognized statement in the GEDCOM file, the parser calls some callbacks, +which can be registered by the application to get the information out of +the file.

-

- - - +This basic description ignores the problem of character encoding.
+
+

Character encoding

Refer to this page for some introduction on character encoding...
+ +
+GEDCOM defines three standard encodings:
+

ASCII
ANSEL
UNICODE (assumed to be UCS-2, either big-endian or little-endian: the GEDCOM spec doesn't specify this)

These are all supported by the parser, and converted into UTF-8 format.
+ + + + + +

$Id$
$Name$

+
+

Gedcom parser in Genes

Basic testing -

Preparing for further testing

Testing the parser with debugging

The Gedcom parser library internals

Index

Testing +

Basic testing + +

Preparing for further testing

Testing the parser with debugging

Testing the lexers separately

Structure of the parser

Testing the lexers separately

Structure of the parser

Character encoding

Basic testing
-

Testing
+

Basic testing
+ +