-<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
-<html>
-<head>
+<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head>
- <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
- <title>The Gedcom parser library</title>
-</head>
- <body text="#000000" bgcolor="#ffffff" link="#000099" vlink="#990099" alink="#000099">
+ <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1"><title>The Gedcom parser library internals</title></head><body text="#000000" bgcolor="#ffffff" link="#000099" vlink="#990099" alink="#000099">
<div align="center">
-<h1>The Gedcom parser library</h1>
+<h1>The Gedcom parser library internals</h1>
<div align="left">The intention of this page is to provide some explanation
of the gedcom parser, to aid development on and with it. First,
some practical issues of testing with the parser will be explained.<br>
- <br>
+<br>
+<h2>Index</h2>
+<ul>
+ <li><a href="#Testing">Testing</a></li>
+ <ul>
+ <li><a href="#Basic_testing">Basic testing</a></li>
+ <li><a href="#Preparing_for_further_testing">Preparing for further testing</a></li>
+ <li><a href="#Testing_the_parser_with_debugging">Testing the parser with debugging</a></li>
+ <li><a href="#Testing_the_lexers_separately">Testing the lexers separately</a><br>
+ </li>
+ </ul>
+ <li><a href="#Structure_of_the_parser">Structure of the parser</a></li>
+ <ul>
+ <li><a href="#Character_encoding">Character encoding</a><br>
+ </li>
+ </ul>
+</ul>
+<br>
+<hr width="100%" size="2">
+<h2><a name="Testing"></a>Testing<br>
+</h2>
+
-<h2>Basic testing<br>
- </h2>
+<h3><a name="Basic_testing"></a>Basic testing<br>
+
+ </h3>
+
You should be able to perform a basic test using the commands:<br>
<blockquote><code>./configure<br>
make check</code><br>
</blockquote>
If everything goes OK, you'll see that some gedcom files are parsed,
- and that each parse is successful. Note that the used gedcom files
+ and that each parse is successful. Note that some of the gedcom files used
are made by <a href="http://heiner-eichmann.de/gedcom/gedcom.htm">Heiner
Eichmann</a> and are an excellent way to test gedcom parsers thoroughly.<br>
<br>
- <h2>Preparing for further testing</h2>
- The basic testing described above doesn't show anything else than
-"Parse succeeded", which is nice, but not very interesting. Some
+
+ <h3><a name="Preparing_for_further_testing"></a>Preparing for further testing</h3>
+Some
more detailed tests are possible, via the <code>testgedcom</code> program
-that is generated by <code>make test</code>. <br>
+that is generated by <code>make</code>. <br>
<br>
However, since the output that <code>testgedcom</code> generates
is in UTF-8 format (more on this later), some preparation is necessary
and see the result.<br>
<br>
- <h2>Testing the parser with debugging</h2>
+
+ <h3><a name="Testing_the_parser_with_debugging"></a>Testing the parser with debugging</h3>
+
Given the UTF-8 capable terminal, you can now let the <code>testgedcom</code>
program print the values that it parses. An example of a command
- line is (in the <code>gedcom</code> directory):<br>
+ line is (in the top directory):<br>
- <blockquote><code>./testgedcom -dg t/ulhc.ged</code><br>
+ <blockquote><code>./testgedcom -dg t/input/ulhc.ged</code><br>
</blockquote>
The <code>-dg</code> option instructs the parser to show its own debug
messages (see <code>./testgedcom -h</code> for the full set of options).
<br>
For the ANSEL test file (<code>t/ansel.ged</code>), you have to set
the environment variable <code>GCONV_PATH</code> to the <code>ansel</code>
- subdirectory of the gedcom directory:<br>
+ subdirectory of the top directory:<br>
<blockquote><code>export GCONV_PATH=./ansel<br>
- ./testgedcom -dg t/ansel.ged<br>
+ ./testgedcom -dg t/input/ansel.ged<br>
</code></blockquote>
This is because for the ANSEL character set an extra module is needed
for the iconv library (more on this later). But again, this should
show a lot of special characters.<br>
<br>
- <h2>Testing the lexers separately</h2>
+
+ <h3><a name="Testing_the_lexers_separately"></a>Testing the lexers separately</h3>
+
The lexers themselves can be tested separately. For the 1-byte
lexer (i.e. supporting the encodings with 1 byte per characters, such as
- ASCII, ANSI and ANSEL), the sequence of commands would be:<br>
+ ASCII, ANSI and ANSEL), the command would be (in the <code>gedcom</code> subdirectory):<br>
- <blockquote><code>make clean<br>
- make test_1byte<br>
+ <blockquote><code>make lexer_1byte<br>
</code></blockquote>
- This will show all tokens in the <code>t/allged.ged</code> test file.
- Similar tests can be done using <code>make test_hilo</code> and <code>
-make test_lohi</code> (for the unicode lexers).<br>
+ This will generate a lexer program that can process e.g. the <code>t/input/allged.ged</code>
+ test file. Feed the file to the lexer on standard input
+and it should print all the tokens in the file. Similar tests can be
+done using <code>make lexer_hilo</code> and <code>
+make lexer_lohi</code>
+ (for the unicode lexers). In each case, it is up to you to determine
+which of the test files are appropriate to pass through that lexer.<br>
<br>
This concludes the testing setup. Now for some explanations...<br>
- <br>
+ <hr width="100%" size="2"><br>
- <h2>Structure of the parser</h2>
+ <h2><a name="Structure_of_the_parser"></a>Structure of the parser</h2>
I see the structure of a program using the gedcom parser as follows:<br>
<br>
<img src="images/schema.png" alt="Gedcom parsing scheme">
<br>
<br>
<br>
- TO BE COMPLETED...<br>
+ The parser is based on <code>lex/yacc</code>: a module generated by <code>lex</code>
+ reads the input file and splits it into tokens (i.e. the smallest
+units, such as numbers, line terminators, GEDCOM tags, characters in GEDCOM
+values...). These tokens are passed to the parser module, which is
+generated by <code>yacc</code> and checks the syntax of the file, i.e. whether the tokens
+appear in a valid sequence. <br>
+ <br>
+For each recognized statement in the GEDCOM file, the parser calls some callbacks,
+which can be registered by the application to get the information out of
+the file.<br>
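+ <br>
+The callback mechanism can be pictured with the minimal sketch below. All
+names in it (<code>subscribe</code>, <code>dispatch</code>, <code>record_cb</code>,
+<code>count_names</code>) are hypothetical and chosen for illustration only; they are
+not the library's actual API, and the real parser may register and invoke callbacks differently.<br>

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical callback type, NOT the library's real API. */
typedef void (*record_cb)(const char *tag, const char *value, void *user_data);

#define MAX_CALLBACKS 8

struct subscription {
    const char *tag;       /* GEDCOM tag the application is interested in */
    record_cb   cb;        /* function to call when that tag is recognized */
    void       *user_data; /* opaque pointer handed back to the callback */
};

static struct subscription subscriptions[MAX_CALLBACKS];
static int n_subscriptions = 0;

/* Called by the application before parsing. Returns 0 on success, -1 if full. */
int subscribe(const char *tag, record_cb cb, void *user_data)
{
    assert(tag != NULL && cb != NULL);
    if (n_subscriptions >= MAX_CALLBACKS)
        return -1;
    subscriptions[n_subscriptions].tag = tag;
    subscriptions[n_subscriptions].cb = cb;
    subscriptions[n_subscriptions].user_data = user_data;
    n_subscriptions++;
    return 0;
}

/* Called by the (assumed) parser for each recognized statement.
   Returns the number of callbacks that were invoked. */
int dispatch(const char *tag, const char *value)
{
    int called = 0;
    for (int i = 0; i < n_subscriptions; i++) {
        if (strcmp(subscriptions[i].tag, tag) == 0) {
            subscriptions[i].cb(tag, value, subscriptions[i].user_data);
            called++;
        }
    }
    return called;
}

/* Example application callback: prints the value and counts the calls. */
void count_names(const char *tag, const char *value, void *user_data)
{
    int *count = user_data;
    printf("%s: %s\n", tag, value);
    (*count)++;
}
```

+In this sketch the application would call <code>subscribe("NAME", count_names, &amp;count)</code>
+before parsing, and the parser would call <code>dispatch</code> whenever it has recognized a statement.<br>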
+ <br>
+This basic description ignores the problem of character encoding.<br>
+ <br>
+ <h3><a name="Character_encoding"></a>Character encoding</h3>Refer to <a href="encoding.html">this page</a> for an introduction to character encoding.<br>
+
+ <br>
+GEDCOM defines three standard encodings:<br>
+ <ul>
+ <li>ASCII</li>
+ <li>ANSEL</li>
+ <li>UNICODE (assumed to be UCS-2, either big-endian or little-endian: the GEDCOM spec doesn't specify this)</li>
+ </ul>The parser supports all of these and converts them to UTF-8.<br>
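+ <br>
+To make the UNICODE case more concrete, here is a minimal sketch of how a UCS-2
+code unit in either byte order could be turned into UTF-8. It is an illustration
+under assumptions, not the parser's actual conversion code: the function names
+and the use of a byte-order mark to pick the byte order are hypothetical, and
+surrogate values are not handled.<br>

```c
#include <stddef.h>
#include <stdint.h>

/* Byte orders, named after the hilo/lohi lexers mentioned above. */
enum byte_order { ORDER_HILO, ORDER_LOHI };

/* Look for a byte-order mark at the start of the buffer.
   Returns 1 and sets *order if one is found, 0 otherwise. */
int detect_bom(const unsigned char *buf, size_t len, enum byte_order *order)
{
    if (len < 2)
        return 0;
    if (buf[0] == 0xFE && buf[1] == 0xFF) { *order = ORDER_HILO; return 1; }
    if (buf[0] == 0xFF && buf[1] == 0xFE) { *order = ORDER_LOHI; return 1; }
    return 0;
}

/* Read one 16-bit code unit from two bytes in the given byte order. */
uint16_t read_ucs2(const unsigned char *p, enum byte_order order)
{
    if (order == ORDER_HILO)
        return (uint16_t)((p[0] << 8) | p[1]);
    return (uint16_t)((p[1] << 8) | p[0]);
}

/* Encode one UCS-2 code unit as UTF-8 into out (room for 3 bytes needed).
   Returns the number of bytes written (1 to 3).
   Surrogates (0xD800-0xDFFF) are not treated specially in this sketch. */
size_t ucs2_to_utf8(uint16_t c, unsigned char *out)
{
    if (c < 0x80) {
        out[0] = (unsigned char)c;
        return 1;
    }
    if (c < 0x800) {
        out[0] = (unsigned char)(0xC0 | (c >> 6));
        out[1] = (unsigned char)(0x80 | (c & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (c >> 12));
    out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (c & 0x3F));
    return 3;
}
```

+A caller would first try <code>detect_bom</code>, then loop over the remaining byte pairs
+with <code>read_ucs2</code> and <code>ucs2_to_utf8</code>.<br>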
+
+
+
+
<hr width="100%" size="2">
- <pre>$Id$<br>$Name$<br></pre>
+ <pre><font size="-1">$Id$<br>$Name$</font><br></pre>
<br>
</div>
</div>
- </body>
- </html>
+ </body></html>
\ No newline at end of file