doc/parser.html

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>

  <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
  <title>Gedcom parser in Genes</title>
</head>
  <body>

<div align="Center">
<h1>Gedcom parser in Genes</h1>

<div align="Left">This page explains the gedcom parser, to aid development
  on and with it. &nbsp;Currently the parser works, but some parts are still
  missing, notably the interface towards applications. &nbsp;First, some
  practical issues of testing with the parser are explained.<br>
   <br>

<h2>Basic testing<br>
   </h2>
    The parser is located in the "gedcom" subdirectory of the Genes source
  code. &nbsp;You should be able to perform a basic test using the commands:<br>

<blockquote><code>make clean<br>
   make<br>
    make test</code><br>
     </blockquote>
    If everything goes OK, you'll see that some gedcom files are parsed,
  and that each parse is successful. &nbsp;Note that the gedcom test files
  used were made by <a href="http://heiner-eichmann.de/gedcom/gedcom.htm">Heiner
  Eichmann</a>
     and are an excellent way to test gedcom parsers thoroughly.<br>
     <br>

  <h2>Preparing for further testing</h2>
    The basic testing described above shows nothing other than "Parse
  succeeded", which is nice, but not very interesting. &nbsp;More detailed
  tests are possible via the <code>gedcom_parse</code> program that is generated
  by <code>make test</code>.<br>
     <br>
    However, since the output that <code>gedcom_parse</code> generates is
  in UTF-8 format (more on this later), some preparation is necessary to get
  a full view of it. &nbsp;Basically, you need a terminal that understands
  and can display UTF-8 encoded characters, and you need the proper fonts installed
  to display them. &nbsp;I'll give some advice on this here, based on the
  Red Hat 7.1 distribution that I use, with glibc 2.2 and XFree86 4.0.x. &nbsp;Any
  other distribution that has the same or newer versions of these components
  should give the same results.<br>
     <br>
    For the first issue, the UTF-8 capable terminal, the safest bet is to
  use <code>xterm</code> in its Unicode mode (which is supported by the
  <code>xterm</code> coming with XFree86 4.0.x). &nbsp;UTF-8 capabilities
  have only recently been added to <code>gnome-terminal</code>, so a version
  with them is probably not in your distribution yet (it certainly isn't in
  Red Hat 7.1).<br>
     <br>
    For the second issue, you'll need the ISO 10646-1 fonts. &nbsp;These
  also come with XFree86 4.0.x.<br>
     <br>
    The way to start <code>xterm</code> in Unicode mode is then e.g. (put
  everything on one line!):<br>

  <blockquote><code>LANG=en_GB.UTF-8 xterm -bg 'black' -fg 'DarkGrey' -cm
  -fn '-Misc-Fixed-Medium-R-SemiCondensed--13-120-75-75-C-60-ISO10646-1'</code><br>
       </blockquote>
    This first sets the <code>LANG</code> variable to a locale that
  uses UTF-8, and then starts <code>xterm</code> with a proper Unicode font.
  &nbsp;Some sample UTF-8 plain text files can be found <a href="http://www.cl.cam.ac.uk/%7Emgk25/ucs/examples">here</a>.
  &nbsp;Just <code>cat</code> them on the command line and see the result.<br>
       <br>
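    To check from inside the new <code>xterm</code> that the locale really
  uses UTF-8, the standard <code>locale</code> utility can be asked for the
  active character map. &nbsp;This is only a minimal check, and it assumes that
  the <code>en_GB.UTF-8</code> locale is installed on your system:<br>

  <blockquote><code>locale charmap</code><br>
       </blockquote>
    This should report <code>UTF-8</code>. &nbsp;Any other value suggests that
  the locale was not picked up, for example because it is not installed or
  because another locale variable overrides <code>LANG</code>.<br>
       <br>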

    <h2>Testing the parser with debugging</h2>
   Given the UTF-8 capable terminal, you can now let the <code>gedcom_parse</code>
   program print the values that it parses. &nbsp;An example of a command
   line is (in the <code>gedcom</code> directory):<br>

    <blockquote><code>./gedcom_parse -dg t/ulhc.ged</code><br>
         </blockquote>
   The <code>-dg</code> option instructs the parser to show its own debug
   messages (see <code>./gedcom_parse -h</code> for the full set of options).
   &nbsp;If everything is OK, you'll see the values from the gedcom file, containing
   a lot of special characters.<br>
         <br>
   For the ANSEL test file (<code>t/ansel.ged</code>), you have to set the
   environment variable <code>GCONV_PATH</code> to the <code>ansel</code> subdirectory
   of the gedcom directory:<br>

      <blockquote><code>export GCONV_PATH=./ansel<br>
   ./gedcom_parse -dg t/ansel.ged<br>
           </code></blockquote>
   This is because the ANSEL character set requires an extra module
   for the iconv library (more on this later). &nbsp;But again, this should
   show a lot of special characters.<br>
           <br>

        <h2>Testing the lexers separately</h2>
   The lexers themselves can be tested separately. &nbsp;For the 1-byte lexer
   (i.e. supporting the encodings with 1 byte per character, such as ASCII,
   ANSI and ANSEL), the sequence of commands would be:<br>

        <blockquote><code>make clean<br>
   make test_1byte<br>
   cat t/allged.ged | ./test_1byte</code><br>
             </blockquote>
   This will show all tokens in the <code>t/allged.ged</code> test file.
   &nbsp;With the lexers you have to make sure that you use the proper lexer
   for each test file. &nbsp;The <code>test_1byte</code> test program is OK
   for <code>allged.ged</code> and <code>ansel.ged</code> (the latter again
   with the <code>GCONV_PATH</code> environment variable set); for the
   <code>uhl*.ged</code> files you need the <code>test_hilo</code> test program;
   for the <code>ulh*.ged</code> files you need the <code>test_lohi</code> program.<br>
             <br>
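   The file-name rule above can be written down as a small shell helper.
   &nbsp;This is only a sketch: the function name <code>pick_lexer</code> is
   hypothetical and not part of the Genes sources, and it assumes that every
   test file not matching <code>uhl*.ged</code> or <code>ulh*.ged</code> is
   a 1-byte file:<br>

        <blockquote><pre>
# Hypothetical helper: print the lexer test program for a gedcom test file,
# based only on the file name patterns described in the text.
pick_lexer() {
    case "$(basename "$1")" in
        uhl*.ged) echo test_hilo ;;
        ulh*.ged) echo test_lohi ;;
        *)        echo test_1byte ;;   # assumption: everything else is 1-byte
    esac
}

# Show which test program belongs to each of the files mentioned above.
for f in t/allged.ged t/ansel.ged t/ulhc.ged; do
    echo "$f: $(pick_lexer "$f")"
done
</pre></blockquote>
   The printed name tells you which program to pipe the file into, e.g.
   <code>cat t/ulhc.ged | ./test_lohi</code>, once that test program has been
   built.<br>
             <br>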
   This concludes the testing setup. &nbsp;Now for some explanations...<br>
             <br>

          <h2>Structure of the parser</h2>
   I see the structure of a program using the gedcom parser as follows:<br>
             <br>
             <img src="images/schema.png" alt="Gedcom parsing scheme">
             <br>
             <br>
             <br>
   TO BE COMPLETED...<br>
          <hr width="100%" size="2">$Id$<br>
$Name$<br>
           <br>
             </div>
             </div>

          </body>
          </html>