+
+
+ <h3><a name="Testing_the_lexers_separately"></a>Testing the lexers separately</h3>
+
+ The lexers themselves can be tested separately. For the 1-byte
+ lexer (i.e. the one supporting encodings with 1 byte per character, such as
+ ASCII, ANSI and ANSEL), the command is (in the <code>gedcom</code> subdirectory):<br>
+
+ <blockquote><code>make lexer_1byte<br>
+ </code></blockquote>
+ This generates a lexer program that can process, for example, the <code>t/input/allged.ged</code>
+ test file. Feed the file to the lexer on standard input
+and it should print all the tokens in the file. Similar tests can be
+done using <code>make lexer_hilo</code> and <code>make lexer_lohi</code>
+ (for the Unicode lexers). In each case you need to determine
+yourself which of the test files are appropriate to pass through the lexer.<br>
+ <br>
+ This concludes the testing setup. Now for some explanations...<br>
+ <hr width="100%" size="2"><br>
+
+
+ <h2><a name="Structure_of_the_parser"></a>Structure of the parser</h2>
+ I see the structure of a program using the gedcom parser as follows:<br>
+ <br>
+ <img src="images/schema.png" alt="Gedcom parsing scheme">
+ <br>
+ <br>
+ <br>
+ The parser is based on <code>lex/yacc</code>, which means that a module generated by <code>lex</code>
+ takes the input file and determines the tokens in that file (i.e. the smallest
+units, such as numbers, line terminators, GEDCOM tags, characters in GEDCOM
+values...). These tokens are passed to the parser module, which is
+generated by <code>yacc</code>, to check the syntax of the file, i.e. whether the tokens
+appear in a valid sequence. <br>
+ <br>
+For each recognized statement in the GEDCOM file, the parser calls callbacks
+that the application has registered, so that it can get the information out of
+the file.<br>
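+ <br>
+The callback mechanism described above can be illustrated with a minimal sketch in C.
+This is not the API of the gedcom parser: the names <code>stmt_callback</code>,
+<code>subscribe_to_tag</code>, <code>dispatch_statement</code> and <code>count_statement</code>
+are hypothetical and only show the general idea of an application registering a function
+that the parser side then calls for each matching statement.<br>

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical callback type (an assumption, not the library's API):
 * receives the tag and value of one recognized statement. */
typedef void (*stmt_callback)(const char *tag, const char *value,
                              void *user_data);

#define MAX_SUBSCRIPTIONS 16

struct subscription {
    const char   *tag;        /* tag the application is interested in */
    stmt_callback cb;         /* function to call for that tag */
    void         *user_data;  /* passed back unchanged to the callback */
};

static struct subscription subscriptions[MAX_SUBSCRIPTIONS];
static size_t subscription_count = 0;

/* Application side: register a callback for a tag.
 * Returns 0 on success, -1 if the table is full. */
int subscribe_to_tag(const char *tag, stmt_callback cb, void *user_data)
{
    if (subscription_count == MAX_SUBSCRIPTIONS)
        return -1;
    subscriptions[subscription_count].tag = tag;
    subscriptions[subscription_count].cb = cb;
    subscriptions[subscription_count].user_data = user_data;
    subscription_count++;
    return 0;
}

/* Parser side: called once per recognized statement.
 * Returns the number of callbacks that were invoked. */
int dispatch_statement(const char *tag, const char *value)
{
    int calls = 0;
    for (size_t i = 0; i < subscription_count; i++) {
        if (strcmp(subscriptions[i].tag, tag) == 0) {
            subscriptions[i].cb(tag, value, subscriptions[i].user_data);
            calls++;
        }
    }
    return calls;
}

/* Example callback: counts the statements it has been called for. */
void count_statement(const char *tag, const char *value, void *user_data)
{
    (void)tag;
    (void)value;
    ++*(int *)user_data;
}
```

+An application would call <code>subscribe_to_tag("NAME", count_statement, &amp;counter)</code>
+before parsing. In this sketch the parser side would then call <code>dispatch_statement</code>
+each time the grammar recognizes a statement.<br>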
+ <br>
+This basic description ignores the problem of character encoding.<br>
+ <br>
+ <h3><a name="Character_encoding"></a>Character encoding</h3>Refer to <a href="encoding.html">this page</a> for an introduction to character encoding.<br>
+
+ <br>
+GEDCOM defines three standard encodings:<br>
+ <ul>
+ <li>ASCII</li>
+ <li>ANSEL</li>
+ <li>UNICODE (assumed to be UCS-2, either big-endian or little-endian: the GEDCOM spec doesn't specify this)</li>
+ </ul>These are all supported by the parser, and converted into UTF-8 format.<br>
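+ <br>
+To make the conversion step concrete, here is a minimal sketch in C of how UCS-2 input
+in either byte order could be turned into UTF-8. It is not the parser's actual conversion
+code: the names <code>ucs2_order</code> and <code>ucs2_to_utf8</code> are hypothetical, and
+the sketch assumes that every 16-bit unit is a complete code point (no surrogate pairs).<br>

```c
#include <stddef.h>
#include <stdint.h>

/* Byte order of the 16-bit input units (hypothetical helper type). */
typedef enum { UCS2_BIG_ENDIAN, UCS2_LITTLE_ENDIAN } ucs2_order;

/* Convert in_len bytes of UCS-2 text to UTF-8.
 * Returns the number of bytes written to out, or (size_t)-1 if in_len
 * is odd or the output buffer is too small.
 * Assumption: each 16-bit unit is one code point; surrogates are not
 * treated specially. */
size_t ucs2_to_utf8(const unsigned char *in, size_t in_len,
                    ucs2_order order,
                    unsigned char *out, size_t out_cap)
{
    size_t o = 0;

    if (in_len % 2 != 0)
        return (size_t)-1;

    for (size_t i = 0; i < in_len; i += 2) {
        /* Assemble one 16-bit unit according to the byte order. */
        uint16_t c = (order == UCS2_BIG_ENDIAN)
            ? (uint16_t)((in[i] << 8) | in[i + 1])
            : (uint16_t)((in[i + 1] << 8) | in[i]);

        if (c < 0x80) {                     /* 1 byte: 0xxxxxxx */
            if (o + 1 > out_cap) return (size_t)-1;
            out[o++] = (unsigned char)c;
        } else if (c < 0x800) {             /* 2 bytes: 110xxxxx 10xxxxxx */
            if (o + 2 > out_cap) return (size_t)-1;
            out[o++] = (unsigned char)(0xC0 | (c >> 6));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        } else {                            /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
            if (o + 3 > out_cap) return (size_t)-1;
            out[o++] = (unsigned char)(0xE0 | (c >> 12));
            out[o++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    return o;
}
```

+Because the GEDCOM spec does not fix the byte order, the caller has to pass it in.
+How the byte order of a given file is determined is outside the scope of this sketch.<br>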
+
+
+
+
+
+ <hr width="100%" size="2">
+ <pre><font size="-1">$Id$<br>$Name$</font><br></pre>
+ <br>
+ </div>
+ </div>
+
+
+ </body></html>
\ No newline at end of file