<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head>

  <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1"><title>The Gedcom parser library internals</title></head><body text="#000000" bgcolor="#ffffff" link="#000099" vlink="#990099" alink="#000099">

<div align="center">
<h1>The Gedcom parser library internals</h1>

<div align="left">This page explains the internals of the gedcom parser, to aid development on and with it. &nbsp;It starts with some practical notes on testing the parser, and then describes its structure.<br>
<br>
<h2>Index</h2>
<ul>
  <li><a href="#Testing">Testing</a></li>
  <ul>
    <li><a href="#Basic_testing">Basic testing</a></li>
    <li><a href="#Preparing_for_further_testing">Preparing for further testing</a></li>
    <li><a href="#Testing_the_parser_with_debugging">Testing the parser with debugging</a></li>
    <li><a href="#Testing_the_lexers_separately">Testing the lexers separately</a><br>
    </li>
  </ul>
  <li><a href="#Structure_of_the_parser">Structure of the parser</a></li>
  <ul>
    <li><a href="#Character_encoding">Character encoding</a><br>
    </li>
  </ul>
</ul>
<br>
<hr width="100%" size="2">
<h2><a name="Testing"></a>Testing<br>
</h2>


<h3><a name="Basic_testing"></a>Basic testing<br>

</h3>

You should be able to perform a basic test using the commands:<br>

<blockquote><code>./configure<br>
make<br>
make check</code><br>
</blockquote>
If everything goes OK, you'll see that some gedcom files are parsed, and that each parse is successful. &nbsp;Note that some of the gedcom files used were made by <a href="http://heiner-eichmann.de/gedcom/gedcom.htm">Heiner Eichmann</a> and are an excellent way to test gedcom parsers thoroughly.<br>
<br>


<h3><a name="Preparing_for_further_testing"></a>Preparing for further testing</h3>
More detailed tests are possible via the <code>testgedcom</code> program that is generated by <code>make</code>.<br>
<br>
However, since the output that <code>testgedcom</code> generates is in UTF-8 format (more on this later), some preparation is necessary to view it properly. &nbsp;Basically, you need a terminal that understands and can display UTF-8 encoded characters, and you need the proper fonts installed to display them. &nbsp;I'll give some advice on this here, based on the Red Hat 7.1 distribution that I use, with glibc 2.2 and XFree86 4.0.x. &nbsp;Any other distribution that has the same or newer versions of these components should give the same results.<br>
<br>
For the first issue, the UTF-8 capable terminal, the safest bet is to use <code>xterm</code> in its Unicode mode (which is supported by the <code>xterm</code> that comes with XFree86 4.0.x). &nbsp;UTF-8 capabilities have only recently been added to <code>gnome-terminal</code>, so that is probably not in your distribution yet (it certainly isn't in Red Hat 7.1).<br>
<br>
For the second issue, you'll need the ISO 10646-1 fonts. &nbsp;These also come with XFree86 4.0.x.<br>
<br>
The way to start <code>xterm</code> in Unicode mode is then e.g. (put everything on one line!):<br>

<blockquote><code>LANG=en_GB.UTF-8 xterm -bg 'black' -fg 'DarkGrey' -cm
-fn '-Misc-Fixed-Medium-R-SemiCondensed--13-120-75-75-C-60-ISO10646-1'</code><br>
</blockquote>
This first sets the <code>LANG</code> variable to a locale that uses UTF-8, and then starts <code>xterm</code> with a proper Unicode font. &nbsp;Some sample UTF-8 plain text files can be found <a href="http://www.cl.cam.ac.uk/%7Emgk25/ucs/examples">here</a>. &nbsp;Just <code>cat</code> them on the command line and see the result.<br>
<br>
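To check the locale part of this setup from a script, the usual POSIX precedence of the locale variables (<code>LC_ALL</code>, then <code>LC_CTYPE</code>, then <code>LANG</code>) can be sketched in Python. &nbsp;This is only an illustration and not part of the gedcom package; the function name <code>locale_is_utf8</code> is made up for this example.<br>

```python
import os


def locale_is_utf8(env=None):
    """Return True if the effective character-type locale names UTF-8.

    Hypothetical helper for illustration; it only inspects the variables,
    it does not check that the locale is actually installed.
    """
    env = os.environ if env is None else env
    # POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
    # A variable that is set but empty counts as unset.
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            # "en_GB.UTF-8" or "en_GB.utf8@euro" -> codeset "UTF-8" / "utf8"
            codeset = value.partition(".")[2].partition("@")[0]
            return codeset.lower().replace("-", "") == "utf8"
    return False


if __name__ == "__main__":
    print("UTF-8 locale:", locale_is_utf8())
```

Run inside the <code>xterm</code> started as shown above, this should report a UTF-8 locale, unless <code>LC_ALL</code> or <code>LC_CTYPE</code> is set to something else and overrides <code>LANG</code>.<br>
<br>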


<h3><a name="Testing_the_parser_with_debugging"></a>Testing the parser with debugging</h3>

Given the UTF-8 capable terminal, you can now let the <code>testgedcom</code> program print the values that it parses. &nbsp;An example of a command line is (run from the top directory):<br>

<blockquote><code>./testgedcom -dg t/input/ulhc.ged</code><br>
</blockquote>
The <code>-dg</code> option instructs the parser to show its own debug messages (see <code>./testgedcom -h</code> for the full set of options). &nbsp;If everything is OK, you'll see the values from the gedcom file, containing a lot of special characters.<br>
<br>
For the ANSEL test file (<code>t/input/ansel.ged</code>), you have to set the environment variable <code>GCONV_PATH</code> to the <code>ansel</code> subdirectory of the top directory:<br>

<blockquote><code>export GCONV_PATH=./ansel<br>
./testgedcom -dg t/input/ansel.ged<br>
</code></blockquote>
This is because the ANSEL character set needs an extra module for the iconv library (more on this later). &nbsp;Again, this should show a lot of special characters.<br>
<br>


<h3><a name="Testing_the_lexers_separately"></a>Testing the lexers separately</h3>

The lexers themselves can be tested separately. &nbsp;For the 1-byte lexer (i.e. the one supporting the encodings with 1 byte per character, such as ASCII, ANSI and ANSEL), the command would be (in the <code>gedcom</code> subdirectory):<br>

<blockquote><code>make lexer_1byte<br>
</code></blockquote>
This will generate a lexer program that can process e.g. the <code>t/input/allged.ged</code> test file. &nbsp;Simply <code>cat</code> the file through the lexer on standard input and you should get all the tokens in the file. &nbsp;Similar tests can be done using <code>make lexer_hilo</code> and <code>make lexer_lohi</code> (for the Unicode lexers). &nbsp;In each case you need to work out for yourself which of the test files are appropriate to pass through the lexer.<br>
<br>
This concludes the testing setup. &nbsp;Now for some explanations...<br>
<hr width="100%" size="2"><br>


<h2><a name="Structure_of_the_parser"></a>Structure of the parser</h2>
I see the structure of a program using the gedcom parser as follows:<br>
<br>
<img src="images/schema.png" alt="Gedcom parsing scheme">
<br>
<br>
<br>
The parser is based on <code>lex/yacc</code>, which means that a module generated by <code>lex</code> takes the input file and determines the tokens in that file (i.e. the smallest units, such as numbers, line terminators, GEDCOM tags, characters in GEDCOM values...). &nbsp;These tokens are passed to the parser module, which is generated by <code>yacc</code> and checks the syntax of the file, i.e. whether the tokens appear in a valid sequence.<br>
<br>
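As a rough illustration of this split between finding tokens and checking their order, here is a small Python sketch. &nbsp;It is not the <code>lex</code>/<code>yacc</code> code of the library and is much coarser than the real lexer; it assumes the usual GEDCOM line layout of a level number, an optional cross-reference id, a tag and an optional value. &nbsp;The names <code>tokenize</code> and <code>levels_are_valid</code> are hypothetical.<br>

```python
import re

# One GEDCOM line: level, optional @xref@, tag, optional value.
LINE_RE = re.compile(
    r"^\s*(?P<level>\d{1,2})"
    r"(?: (?P<xref>@[^@]+@))?"
    r" (?P<tag>[A-Za-z0-9_]+)"
    r"(?: (?P<value>.*))?$"
)


def tokenize(line):
    """'Lexer' step: split one line into (token_type, token_value) pairs."""
    match = LINE_RE.match(line.rstrip("\r\n"))
    if match is None:
        raise ValueError("not a valid GEDCOM line: %r" % line)
    tokens = [("LEVEL", int(match.group("level")))]
    if match.group("xref") is not None:
        tokens.append(("XREF", match.group("xref")))
    tokens.append(("TAG", match.group("tag")))
    if match.group("value") is not None:
        tokens.append(("VALUE", match.group("value")))
    return tokens


def levels_are_valid(lines):
    """'Parser' step (one rule only): the first line must be at level 0,
    and a level may exceed the previous line's level by at most 1."""
    previous = -1
    for line in lines:
        level = tokenize(line)[0][1]
        if level > previous + 1:
            return False
        previous = level
    return True


if __name__ == "__main__":
    print(tokenize("0 @I1@ INDI"))
    print(levels_are_valid(["0 HEAD", "1 SOUR demo", "2 VERS 1", "0 TRLR"]))
```

The real parser checks far more than the nesting of levels (for example which tags are allowed below which other tags), but the division of work is the same.<br>
<br>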
For each recognized statement in the GEDCOM file, the parser calls callbacks that the application can register to get the information out of the file.<br>
<br>
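The callback idea can be sketched as follows in Python. &nbsp;The class <code>CallbackParser</code> and its methods <code>subscribe</code> and <code>parse</code> are invented for this example and are not the library's actual C interface; the sketch only shows how an application hands functions to a parser and gets called back per recognized line.<br>

```python
class CallbackParser:
    """Toy stand-in for the parser: dispatches each line to a callback."""

    def __init__(self):
        self._callbacks = {}  # tag -> function registered by the application

    def subscribe(self, tag, callback):
        """Register callback(level, xref, tag, value) for one GEDCOM tag."""
        self._callbacks[tag] = callback

    def parse(self, lines):
        for line in lines:
            fields = line.strip().split(" ")
            if len(fields) < 2:
                continue  # too short to hold a level and a tag; skipped here
            level = int(fields[0])
            rest = fields[1:]
            # A cross-reference id such as @I1@ may sit between level and tag.
            xref = None
            if len(rest) > 1 and rest[0].startswith("@"):
                xref = rest.pop(0)
            tag = rest.pop(0)
            value = " ".join(rest)
            callback = self._callbacks.get(tag)
            if callback is not None:
                callback(level, xref, tag, value)


if __name__ == "__main__":
    names = []
    parser = CallbackParser()
    parser.subscribe("NAME", lambda level, xref, tag, value: names.append(value))
    parser.parse(["0 @I1@ INDI", "1 NAME John /Smith/", "1 SEX M"])
    print(names)
```

Lines whose tag has no registered callback are simply ignored in this sketch, so the application only sees what it asked for.<br>
<br>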
This basic description ignores the problem of character encoding.<br>
<br>
<h3><a name="Character_encoding"></a>Character encoding</h3>Refer to <a href="encoding.html">this page</a> for an introduction to character encoding...<br>

<br>
GEDCOM defines three standard encodings:<br>
<ul>
  <li>ASCII</li>
  <li>ANSEL</li>
  <li>UNICODE (assumed to be UCS-2, either big-endian or little-endian: the GEDCOM spec doesn't specify this)</li>
</ul>These are all supported by the parser, which converts the input to UTF-8.<br>
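<br>
Since the byte order of a UNICODE file is not specified, it has to be detected from the data. &nbsp;The Python sketch below shows one possible way and is not necessarily how the library does it. &nbsp;It assumes that a byte order mark, if present, decides the matter, and that otherwise the file starts with the digit <code>0</code> (as in <code>0 HEAD</code>). &nbsp;The labels <code>hilo</code> and <code>lohi</code> are borrowed from the lexer names above, on the assumption that <code>hilo</code> means high byte first (big-endian). &nbsp;Python's UTF-16 codecs stand in for UCS-2, and ANSEL is left out because Python's standard codecs do not include it.<br>

```python
def detect_byte_order(data):
    """Return (order, bom_length); order is 'hilo', 'lohi' or None."""
    if data.startswith(b"\xfe\xff"):
        return "hilo", 2  # byte order mark, high byte first
    if data.startswith(b"\xff\xfe"):
        return "lohi", 2  # byte order mark, low byte first
    # No mark: assume the file starts with the character '0' (0x30).
    if data.startswith(b"\x00\x30"):
        return "hilo", 0
    if data.startswith(b"\x30\x00"):
        return "lohi", 0
    return None, 0  # not recognized as 2-byte Unicode


def to_utf8(data):
    """Convert ASCII or 2-byte Unicode input bytes to UTF-8 bytes."""
    order, bom_length = detect_byte_order(data)
    if order is None:
        # Treated as ASCII here; raises UnicodeDecodeError on other input.
        return data.decode("ascii").encode("utf-8")
    codec = "utf-16-be" if order == "hilo" else "utf-16-le"
    return data[bom_length:].decode(codec).encode("utf-8")


if __name__ == "__main__":
    sample = "0 HEAD\n1 NOTE caf\u00e9\n"
    print(detect_byte_order(sample.encode("utf-16-le")))
    print(to_utf8(sample.encode("utf-16-be")) == sample.encode("utf-8"))
```

Whatever the input encoding, the application then only has to deal with UTF-8.<br>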



<hr width="100%" size="2">
<pre><font size="-1">$Id$<br>$Name$</font><br></pre>
<br>
</div>
</div>


</body></html>