doc/parser.html

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head>

  <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1"><title>The Gedcom parser library</title></head>

<body text="#000000" bgcolor="#ffffff" link="#000099" vlink="#990099" alink="#000099">

<div align="center">
<h1>The Gedcom parser library</h1>

<div align="left">This page explains the gedcom parser, to aid development
on and with it. &nbsp;It first covers some practical aspects of testing with
the parser, and then describes the structure of the parser.<br>
<br>
<h2>Index</h2>
<ul>
  <li><a href="#Testing">Testing</a></li>
  <ul>
    <li><a href="#Basic_testing">Basic testing</a></li>
    <li><a href="#Preparing_for_further_testing">Preparing for further testing</a></li>
    <li><a href="#Testing_the_parser_with_debugging">Testing the parser with debugging</a></li>
    <li><a href="#Testing_the_lexers_separately">Testing the lexers separately</a><br>
    </li>
  </ul>
  <li><a href="#Structure_of_the_parser">Structure of the parser</a></li>
  <ul>
    <li><a href="#Character_encoding">Character encoding</a><br>
    </li>
  </ul>
</ul>
<br>
<hr width="100%" size="2">
<h2><a name="Testing"></a>Testing<br>
</h2>


<h3><a name="Basic_testing"></a>Basic testing<br>

</h3>

You should be able to perform a basic test using the commands:<br>

<blockquote><code>./configure<br>
make<br>
make check</code><br>
</blockquote>
If everything goes OK, you'll see that some gedcom files are parsed,
and that each parse is successful. &nbsp;Note that some of the gedcom files
used were made by <a href="http://heiner-eichmann.de/gedcom/gedcom.htm">Heiner
Eichmann</a> and are an excellent way to test gedcom parsers thoroughly.<br>
<br>


<h3><a name="Preparing_for_further_testing"></a>Preparing for further testing</h3>
Some more detailed tests are possible, via the <code>testgedcom</code> program
that is generated by <code>make</code>. &nbsp;<br>
<br>
However, since the output that <code>testgedcom</code> generates
is in UTF-8 format (more on this later), some preparation is necessary
to view it properly. &nbsp;Basically, you need a terminal that understands
and can display UTF-8 encoded characters, and you need the proper fonts
installed to display them. &nbsp;I'll give some advice on this here,
based on the Red Hat 7.1 distribution that I use, with glibc 2.2 and XFree86
4.0.x. &nbsp;Any other distribution that has the same or newer versions
of these components should give the same results.<br>
<br>
For the first issue, the UTF-8 capable terminal, the safest bet is
to use <code>xterm</code> in its Unicode mode (which is supported by
the <code>xterm</code> that comes with XFree86 4.0.x). &nbsp;UTF-8 capabilities
have only recently been added to <code>gnome-terminal</code>, so that is
probably not in your distribution yet (it certainly isn't in Red Hat 7.1).<br>
<br>
For the second issue, you'll need the ISO 10646-1 fonts. &nbsp;These
also come with XFree86 4.0.x.<br>
<br>
The way to start <code>xterm</code> in Unicode mode is then, for example
(put everything on one line!):<br>

<blockquote><code>LANG=en_GB.UTF-8 xterm -bg 'black' -fg 'DarkGrey' -cm
  -fn '-Misc-Fixed-Medium-R-SemiCondensed--13-120-75-75-C-60-ISO10646-1'</code><br>
</blockquote>
This first sets the <code>LANG</code> variable to a locale that
uses UTF-8, and then starts <code>xterm</code> with a proper Unicode font.
&nbsp;Some sample UTF-8 plain text files can be found
<a href="http://www.cl.cam.ac.uk/%7Emgk25/ucs/examples">here</a>.
&nbsp;Just <code>cat</code> them on the command line and see the result.<br>
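<br>
If you are not sure whether a shell is already running in a UTF-8 locale, a small check can help. &nbsp;The following is only a sketch in Python (it is not part of the gedcom distribution, and the function names <code>ctype_locale</code> and <code>uses_utf8</code> are made up for this example). &nbsp;It assumes the usual POSIX rule that <code>LC_ALL</code> overrides <code>LC_CTYPE</code>, which in turn overrides <code>LANG</code>:<br>

```python
import os


def ctype_locale(env):
    """Return the locale name that governs character handling.

    Assumed POSIX precedence: LC_ALL, then LC_CTYPE, then LANG.
    Empty values count as unset.
    """
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name, "")
        if value:
            return value
    return "C"  # the default locale when nothing is set


def uses_utf8(env):
    """True if the effective locale name carries a UTF-8 codeset."""
    codeset = ctype_locale(env).partition(".")[2]
    codeset = codeset.partition("@")[0]  # drop modifiers such as @euro
    return codeset.lower().replace("-", "") == "utf8"


print("UTF-8 locale:", uses_utf8(os.environ))
```

With <code>LANG=en_GB.UTF-8</code> set as in the <code>xterm</code> command above, and no <code>LC_ALL</code> or <code>LC_CTYPE</code> overriding it, this should report a UTF-8 locale.<br>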
<br>


<h3><a name="Testing_the_parser_with_debugging"></a>Testing the parser with debugging</h3>

Given the UTF-8 capable terminal, you can now let the <code>testgedcom</code>
program print the values that it parses. &nbsp;An example of a command
line is (in the <code>gedcom</code> directory):<br>

<blockquote><code>./testgedcom -dg t/ulhc.ged</code><br>
</blockquote>
The <code>-dg</code> option instructs the parser to show its own debug
messages (see <code>./testgedcom -h</code> for the full set of options).
&nbsp;If everything is OK, you'll see the values from the gedcom file,
containing a lot of special characters.<br>
<br>
For the ANSEL test file (<code>t/ansel.ged</code>), you have to set
the environment variable <code>GCONV_PATH</code> to the <code>ansel</code>
subdirectory of the gedcom directory:<br>

<blockquote><code>export GCONV_PATH=./ansel<br>
./testgedcom -dg t/ansel.ged<br>
</code></blockquote>
This is because the iconv library needs an extra module for the ANSEL
character set (more on this later). &nbsp;But again, this should
show a lot of special characters.<br>
<br>


<h3><a name="Testing_the_lexers_separately"></a>Testing the lexers separately</h3>

The lexers themselves can be tested separately. &nbsp;For the 1-byte
lexer (i.e. supporting the encodings with 1 byte per character, such as
ASCII, ANSI and ANSEL), the command would be (in the <code>gedcom</code> subdirectory):<br>

<blockquote><code>make lexer_1byte<br>
</code></blockquote>
This will generate a lexer program that can process e.g. the <code>t/allged.ged</code>
test file. &nbsp;Simply pipe the file into the lexer on standard input
and you should get all the tokens in the file. &nbsp;Similar tests can be
done using <code>make lexer_hilo</code> and <code>make lexer_lohi</code>
(for the Unicode lexers). &nbsp;In each case, you need to work out yourself which of the test files are appropriate to pass through the lexer.<br>
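<br>
To help pick a suitable test file for the two Unicode lexers, it may be useful to look at the byte order of the file. &nbsp;The sketch below is in Python and is not part of the gedcom distribution. &nbsp;It assumes that "hilo" and "lohi" refer to the order of the high and low byte of each 2-byte Unicode character (this is my reading of the names, it is not stated above), and the helper <code>guess_byte_order</code> is made up for this example:<br>

```python
import codecs

text = "A\u00e9"  # "A" followed by e with acute accent (U+00E9)

high_low = text.encode("utf-16-be")  # high-order byte first
low_high = text.encode("utf-16-le")  # low-order byte first

print(high_low.hex(" "))  # 00 41 00 e9
print(low_high.hex(" "))  # 41 00 e9 00


def guess_byte_order(data):
    """Guess the byte order from a leading byte order mark, if present."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return "high-low"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "low-high"
    return None  # no mark: the order has to be known some other way


print(guess_byte_order(codecs.BOM_UTF16_LE + low_high))  # low-high
```

The same two characters give different byte sequences in the two orders, which is why a file in one order should not be passed through the lexer for the other.<br>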
<br>
This concludes the testing setup. &nbsp;Now for some explanations...<br>
<hr width="100%" size="2"><br>


<h2><a name="Structure_of_the_parser"></a>Structure of the parser</h2>
I see the structure of a program using the gedcom parser as follows:<br>
<br>
<img src="images/schema.png" alt="Gedcom parsing scheme">
<br>
<br>
<br>
The parser is based on <code>lex/yacc</code>, which means that a module generated by <code>lex</code>
takes the input file and determines the tokens in that file (i.e. the smallest
units, such as numbers, line terminators, GEDCOM tags, characters in GEDCOM
values...). &nbsp;These tokens are passed to the parser module, which is
generated by <code>yacc</code> and checks the syntax of the file, i.e. whether
the tokens appear in a valid sequence. &nbsp;<br>
<br>
 148 For each recognized statement in the GEDCOM file, the parser calls some callbacks,
 149 which can be registered by the application to get the information out of
 150 the file.<br>
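<br>
To make the division of work concrete, here is a toy version of the three stages (lexer, parser, callbacks) in Python. &nbsp;It is only an illustration of the idea: the real parser is generated by <code>lex</code> and <code>yacc</code>, and the names <code>tokenize</code>, <code>parse</code> and the callback signature used here are invented for this sketch, not the library's interface. &nbsp;The only syntax rule it checks is my simplification that a line may be at most one level deeper than the previous one:<br>

```python
import re

# One GEDCOM line: level, optional cross-reference id, tag, optional value.
LINE = re.compile(r"^\s*(\d+)(?: (@[^@]+@))? (\w+)(?: (.*))?$")


def tokenize(text):
    """Lexer stage: turn each non-empty line into a tuple of tokens."""
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        match = LINE.match(line)
        if match is None:
            raise ValueError(f"line {number}: cannot tokenize {line!r}")
        level, xref, tag, value = match.groups()
        yield int(level), xref, tag, value or ""


def parse(text, callbacks):
    """Parser stage: check the level sequence, then call back per tag."""
    previous_level = -1
    for level, xref, tag, value in tokenize(text):
        # A line may go at most one level deeper than the line before it.
        if level > previous_level + 1:
            raise ValueError(f"level jumps from {previous_level} to {level}")
        previous_level = level
        handler = callbacks.get(tag)
        if handler is not None:
            handler(level, xref, value)


# Application stage: register a callback for the tags of interest.
names = []
sample = "0 @I1@ INDI\n1 NAME John /Smith/\n1 SEX M\n"
parse(sample, {"NAME": lambda level, xref, value: names.append(value)})
print(names)  # ['John /Smith/']
```

The application never sees tokens or syntax errors in the normal case; it only receives the values for the tags it registered a callback for.<br>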
<br>
This basic description ignores the problem of character encoding. &nbsp;The next section describes exactly what this problem is.<br>
<br>
<h3><a name="Character_encoding"></a>Character encoding</h3>
<h4><i>The character encoding problem</i><br>
</h4>
Developers are usually familiar with the ASCII character set. &nbsp;This
is a character set that assigns a unique number to some characters, e.g.
an "A" has ASCII code 65 (or 0x41 in hex), and an "a" has ASCII code 97 (or
0x61 in hex). &nbsp;Some people may also have used ASCII codes for several
drawing characters (such as a horizontal bar, a vertical bar, or a top-right
corner) in the old DOS days, to be able to draw nice windows in text mode.<br>
<br>
However, strictly speaking, these last characters are not part of the ASCII
set. &nbsp;The standard ASCII set contains only the character positions from
0 to 127 (i.e. anything that fits into an integer that is 7 bits wide). &nbsp;An
example of this table can be found <a href="http://web.cs.mun.ca/%7Emichael/ascii-table.html">here</a>. &nbsp;Anything with a code between 128 and 255 is in principle undefined in ASCII.<br>
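<br>
The 7-bit limit is easy to see in practice. &nbsp;The following Python lines (just an illustration; the helper <code>is_ascii</code> is made up for this example) show the codes mentioned above and what happens with a byte outside the range 0 to 127:<br>

```python
print(ord("A"), hex(ord("A")))  # 65 0x41
print(ord("a"), hex(ord("a")))  # 97 0x61


def is_ascii(data):
    """True if every byte in data is a valid ASCII code (0 to 127)."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError:
        return False
    return True


print(is_ascii(bytes([65, 97])))  # True
print(is_ascii(bytes([233])))     # False
```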
<br>
Now, several systems (including the old DOS) have defined those character
positions anyway, but usually in totally different ways. &nbsp;Some well-known
extensions are:<br>
<ul>
  <li>the <a href="http://czyborra.com/charsets/cp437.gif">DOS</a>
character set, nowadays usually known as Code Page 437, but sometimes also
named LatinUS, ECS (Extended Character Set) or PC-8; note that the table
displayed in the link also contains the standard ASCII part</li>
  <li>the <a href="http://czyborra.com/charsets/cp1252.gif">ANSI</a> character set, also known as Code Page 1252, and usually the default on Windows</li>
  <li>the <a href="http://czyborra.com/charsets/iso8859-1.gif">ISO-8859-1</a> character set (also called Latin-1), which is an ISO standard for Western European languages, mostly used on various Unices</li>
  <li>the <a href="http://czyborra.com/charsets/adobe-stdenc.gif">Adobe Standard Encoding</a>, which is used by default in PostScript, unless overridden</li>
</ul>
And these are only examples of character sets used for Western European languages.
&nbsp;For Japanese, Chinese, Korean, Vietnamese, ... there are separate character
sets in which the meaning of one byte can even be influenced by what the previous
byte was, i.e. these are multi-byte character sets. &nbsp;This is because
even 256 characters are totally inadequate to represent all characters in
such languages.<br>
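<br>
As an illustration of a byte whose meaning depends on the previous byte, the Python lines below use Shift_JIS, one of the Japanese multi-byte character sets. &nbsp;The choice of Shift_JIS and of this particular character is mine, purely as an example:<br>

```python
ko = "\u3053"  # HIRAGANA LETTER KO

encoded = ko.encode("shift_jis")
print(encoded.hex(" "))  # 82 b1

# The second byte on its own decodes to a different character:
# inside the pair, its meaning depended on the lead byte 0x82 before it.
lone = encoded[1:].decode("shift_jis")
print(f"U+{ord(lone):04X}")  # U+FF71
```

So the byte 0xB1 is half of a hiragana character after 0x82, but a halfwidth katakana character when it stands alone.<br>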
<br>
So, to summarize, if a text file contains a byte with the value 65, it is
pretty safe to assume that this byte represents an "A", if we ignore the
multi-byte character sets mentioned above. &nbsp;However, a value of 233 cannot
be interpreted without knowing in which character set the text file is written.
&nbsp;In Latin-1, it happens to be the character "&eacute;", but in another
character set it can be something totally different (e.g. in the DOS character
set it is the Greek letter theta).<br>
<br>
Conversely, if you need to write the character "&eacute;" to a file, its
numerical value in the file depends on the character set you use:
in Latin-1 it will be 233, but in the DOS character set it
will be 130.<br>
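<br>
Both directions can be checked with a few lines of Python, using its <code>latin-1</code> and <code>cp437</code> codecs for Latin-1 and the DOS character set (again only as an illustration):<br>

```python
# Reading: the same byte value gives a different character per character set.
byte = bytes([233])
for charset in ("latin-1", "cp437"):
    char = byte.decode(charset)
    print(charset, f"U+{ord(char):04X}")
# latin-1 U+00E9
# cp437 U+0398

# Writing: the same character gives a different byte value per character set.
e_acute = "\u00e9"
for charset in ("latin-1", "cp437"):
    print(charset, e_acute.encode(charset)[0])
# latin-1 233
# cp437 130
```

U+00E9 is the letter e with an acute accent, and U+0398 is the Greek capital letter theta.<br>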
<br>
<h4><i>Unicode</i></h4>
Enter the Unicode standard...<br>
<br>
TO BE COMPLETED<br>


<hr width="100%" size="2">
<pre><font size="-1">$Id$<br>$Name$</font><br></pre>
<br>
</div>
</div>


</body></html>
 213
 214           </body></html>