<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
<title>Gedcom parser in Genes</title>
<h1>Gedcom parser in Genes</h1>
<div align="Left">This page explains the gedcom parser, to aid development
on and with it. The parser currently works, but some parts are still missing,
notably the interface towards applications. First, some practical
issues of testing with the parser are explained.<br>
The parser is located in the "gedcom" subdirectory of the Genes source
code. You should be able to perform a basic test using the commands:<br>
<blockquote><code>make clean<br>
If everything goes OK, you'll see that some gedcom files are parsed,
and that each parse is successful. Note that the gedcom files used
are made by <a href="http://heiner-eichmann.de/gedcom/gedcom.htm">Heiner
and are an excellent way to test gedcom parsers thoroughly.<br>
<h2>Preparing for further testing</h2>
The basic testing described above shows nothing more than "Parse
succeeded", which is nice, but not very interesting. More detailed
tests are possible via the <code>gedcom_parse</code> program that is generated
by <code>make test</code>.<br>
However, since the output that <code>gedcom_parse</code> generates is
in UTF-8 format (more on this later), some preparation is necessary to
view it properly. You need a terminal that understands
and can display UTF-8 encoded characters, and you need the proper fonts installed
to display them. I'll give some advice on this here, based on the
Red Hat 7.1 distribution that I use, with glibc 2.2 and XFree86 4.0.x. Any
other distribution that has the same or newer versions of these components
should give the same results.<br>
For the first issue, the UTF-8 capable terminal, the safest bet is to
use <code>xterm</code> in its Unicode mode (which is supported by the
<code>xterm</code> that comes with XFree86 4.0.x). UTF-8 capabilities
have only recently been added to <code>gnome-terminal</code>, so that
is probably not in your distribution yet (it certainly isn't in Red Hat 7.1).<br>
For the second issue, you'll need the ISO 10646-1 fonts. These
also come with XFree86 4.0.x.<br>
The way to start <code>xterm</code> in Unicode mode is then, for example (put
everything on one line!):<br>
<blockquote><code>LANG=en_GB.UTF-8 xterm -bg 'black' -fg 'DarkGrey' -cm
-fn '-Misc-Fixed-Medium-R-SemiCondensed--13-120-75-75-C-60-ISO10646-1'</code><br>
This first sets the <code>LANG</code> variable to a locale that
uses UTF-8, and then starts <code>xterm</code> with a proper Unicode font.
Some sample UTF-8 plain text files can be found <a href="http://www.cl.cam.ac.uk/%7Emgk25/ucs/examples">
. Just <code>cat</code> them on the command line and see the result.<br>
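As a quick check of the terminal setup described above, the following sketch tests whether the current locale reports a UTF-8 charmap. The helper name <code>is_utf8_charmap</code> is hypothetical and not part of the Genes sources; the sketch only assumes that the standard <code>locale</code> utility is available.<br>

```shell
# Return success (0) if the given charmap name denotes UTF-8.
# Hypothetical helper for illustration, not part of the Genes sources.
is_utf8_charmap() {
    case "$1" in
        UTF-8|utf-8|UTF8|utf8) return 0 ;;
        *) return 1 ;;
    esac
}

# Ask locale(1) for the charmap of the current locale and report the result.
if is_utf8_charmap "$(locale charmap 2>/dev/null)"; then
    echo "UTF-8 locale active"
else
    echo "not a UTF-8 locale: start xterm with LANG set to a UTF-8 locale"
fi
```

Run inside the <code>xterm</code> started as shown above, this should report an active UTF-8 locale, provided the <code>en_GB.UTF-8</code> locale is installed.<br>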
<h2>Testing the parser with debugging</h2>
Given the UTF-8 capable terminal, you can now let the <code>gedcom_parse</code>
program print the values that it parses. An example of a command
line is (in the <code>gedcom</code> directory):<br>
<blockquote><code>./gedcom_parse -dg t/ulhc.ged</code><br>
The <code>-dg</code> option instructs the parser to show its own debug
messages (see <code>./gedcom_parse -h</code> for the full set of
options). If everything is OK, you'll see the values from the gedcom
file, containing a lot of special characters.<br>
For the ANSEL test file (<code>t/ansel.ged</code>), you have to set the
environment variable <code>GCONV_PATH</code> to the <code>ansel</code> subdirectory
of the gedcom directory:<br>
<blockquote><code>export GCONV_PATH=./ansel<br>
./gedcom_parse -dg t/ansel.ged<br>
This is because the ANSEL character set needs an extra module
for the iconv library (more on this later). Again, this should
show a lot of special characters.<br>
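To give a rough idea of the kind of conversion that iconv performs, the following sketch converts a single byte to UTF-8 with the <code>iconv</code> command line tool. It uses ISO-8859-1 as a stand-in for ANSEL, because that character set needs no extra module; the function name is hypothetical, and this is not how the parser itself calls the iconv library.<br>

```shell
# Convert one ISO-8859-1 byte (octal 351, the letter e with acute accent)
# to UTF-8 and print the resulting bytes as hex digits.
# ISO-8859-1 is only a stand-in here: ANSEL would need the extra module
# made available through GCONV_PATH.
latin1_e_acute_as_utf8_hex() {
    printf '\351' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n'
}

latin1_e_acute_as_utf8_hex; echo    # prints c3a9
```

The single input byte becomes the two-byte UTF-8 sequence C3 A9, which is why a UTF-8 capable terminal is needed to display the converted output.<br>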
<h2>Testing the lexers separately</h2>
The lexers themselves can be tested separately. For the 1-byte
lexer (i.e. supporting the encodings with 1 byte per character, such as
ASCII, ANSI and ANSEL), the sequence of commands would be:<br>
<blockquote><code>make clean<br>
This will show all tokens in the <code>t/allged.ged</code> test file. Similar
tests can be done using <code>make test_hilo</code> and <code>make test_lohi</code>
(for the Unicode lexers).<br>
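The names of these two targets suggest the two possible byte orders of 2-byte Unicode text: high byte first or low byte first. That reading is an assumption on my part, as is the hypothetical helper <code>utf16_hex</code> below, which merely shows the difference between the two byte orders using the <code>iconv</code> command line tool.<br>

```shell
# Print the UTF-16 encoding of a string as hex digits.
#   $1: byte order, BE (high byte first) or LE (low byte first)
#   $2: the text to encode, given as UTF-8
# Hypothetical helper for illustration, not part of the Genes sources.
utf16_hex() {
    printf '%s' "$2" | iconv -f UTF-8 -t "UTF-16$1" | od -An -tx1 | tr -d ' \n'
}

utf16_hex BE A; echo    # prints 0041
utf16_hex LE A; echo    # prints 4100
```

The same character thus yields different byte sequences depending on the byte order, which would explain why a separate lexer is tested for each.<br>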
This concludes the testing setup. Now for some explanations...<br>
<h2>Structure of the parser</h2>
I see the structure of a program using the gedcom parser as follows:<br>
<img src="images/schema.png" alt="Gedcom parsing scheme">
TO BE COMPLETED...<br>
<hr width="100%" size="2">$Id: parser.html,v 1.2 2001/12/01 15:29:00