Use of libcharset (for portability).

[gedcom-parse.git] / doc / parser.html
diff --git a/doc/parser.html b/doc/parser.html

index 4ff35cde9321cb3ee8bea78d852a1efa4c9742f2..a344644c334e745a4a1067c045f74226240dc373 100644 (file)
--- a/doc/parser.html
+++ b/doc/parser.html
@@ -1,124 +1,175 @@
-<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
-<html>
-<head>
-                           
-  <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
-  <title>The Gedcom parser library</title>
-</head>
-  <body>
-           
-<div align="center">      
-<h1>The Gedcom parser library</h1>
-           
-<div align="left">The intention of this page is to provide some explanation 
-   of the gedcom parser, to aid development on and with it. &nbsp;First, some
-practical  issues  of testing with the parser will be explained.<br>
-      <br>
-           
-<h2>Basic testing<br>
-      </h2>
- You should be able to perform a basic test using the commands:<br>
-           
+<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head>
+                               
+  <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1"><title>The Gedcom parser library internals</title></head><body text="#000000" bgcolor="#ffffff" link="#000099" vlink="#990099" alink="#000099">
+             
+<div align="center">       
+<h1>The Gedcom parser library internals</h1>
+             
+<div align="left">The intention of this page is to provide some explanation
+    of the gedcom parser, to aid development on and with it. &nbsp;First,
+some practical  issues  of testing with the parser will be explained.<br>
+<br>
+<h2>Index</h2>
+<ul>
+  <li><a href="#Testing">Testing</a></li>
+  <ul>
+    <li><a href="#Basic_testing">Basic testing</a></li>
+    <li><a href="#Preparing_for_further_testing">Preparing for further testing</a></li>
+    <li><a href="#Testing_the_parser_with_debugging">Testing the parser with debugging</a></li>
+    <li><a href="#Testing_the_lexers_separately">Testing the lexers separately</a><br>
+    </li>
+  </ul>
+  <li><a href="#Structure_of_the_parser">Structure of the parser</a></li>
+  <ul>
+    <li><a href="#Character_encoding">Character encoding</a><br>
+    </li>
+  </ul>
+</ul>
+<br>
+<hr width="100%" size="2">
+<h2><a name="Testing"></a>Testing<br>
+</h2>
+
+             
+<h3><a name="Basic_testing"></a>Basic testing<br>
+
+       </h3>
+
+  You should be able to perform a basic test using the commands:<br>
+             
  <blockquote><code>./configure<br>
-      make<br>
-       make check</code><br>
-        </blockquote>
-       If everything goes OK, you'll see that some gedcom files are parsed, 
- and   that each parse is successful. &nbsp;Note that the used gedcom files 
- are  made by <a href="http://heiner-eichmann.de/gedcom/gedcom.htm">Heiner 
- Eichmann</a>       and are an excellent way to test gedcom parsers thoroughly.<br>
-        <br>
-                       
-  <h2>Preparing for further testing</h2>
-       The basic testing described above doesn't show anything else than
-"Parse    succeeded", which is nice, but not very interesting. &nbsp;Some
-more detailed    tests are possible, via the <code>testgedcom</code> program
-that is generated    by <code>make test</code>. &nbsp;<br>
-        <br>
-       However, since the output that <code>testgedcom</code> generates is
-  in  UTF-8 format (more on this later), some preparation is necessary to
-have  a full view on it. &nbsp;Basically, you need a terminal that understands 
- and can display UTF-8 encoded characters, and you need to proper fonts installed 
-   to display them. &nbsp;I'll give some advice on this here, based on the 
- Red  Hat 7.1 distribution that I use, with glibc 2.2 and XFree86 4.0.x. &nbsp;Any
-   other distribution that has the same or newer versions for these components
-   should give the same results.<br>
-        <br>
-       For the first issue, the UTF-8 capable terminal, the safest bet is 
-to  use    <code>xterm</code> in its unicode mode (which is supported by the
-  <code>   xterm</code> coming with XFree86 4.0.x). &nbsp;UTF-8 capabilities
-  have only  recently been added to <code>gnome-terminal</code>, so probably
+       make<br>
+        make check</code><br>
+         </blockquote>
+        If everything goes OK, you'll see that some gedcom files are parsed,
+  and   that each parse is successful. &nbsp;Note that some of the used gedcom files
+  are  made by <a href="http://heiner-eichmann.de/gedcom/gedcom.htm">Heiner
+  Eichmann</a>       and are an excellent way to test gedcom parsers thoroughly.<br>
+         <br>
+                           
+  
+  <h3><a name="Preparing_for_further_testing"></a>Preparing for further testing</h3>
+Some 
+more detailed    tests are possible, via the <code>testgedcom</code> program 
+that is generated    by <code>make</code>. &nbsp;<br>
+         <br>
+        However, since the output that <code>testgedcom</code> generates
+is   in  UTF-8 format (more on this later), some preparation is necessary
+to have  a full view on it. &nbsp;Basically, you need a terminal that understands
+  and can display UTF-8 encoded characters, and you need to proper fonts
+installed     to display them. &nbsp;I'll give some advice on this here,
+based on the   Red  Hat 7.1 distribution that I use, with glibc 2.2 and XFree86
+4.0.x. &nbsp;Any    other distribution that has the same or newer versions
+for these components    should give the same results.<br>
+         <br>
+        For the first issue, the UTF-8 capable terminal, the safest bet is
+ to  use    <code>xterm</code> in its unicode mode (which is supported by
+the   <code>   xterm</code> coming with XFree86 4.0.x). &nbsp;UTF-8 capabilities 
+  have only  recently been added to <code>gnome-terminal</code>, so probably 
    that is not  in your distribution yet (it certainly isn't in Red Hat 7.1).<br>
-        <br>
-       For the second issue, you'll need the ISO 10646-1 fonts. &nbsp;These 
- come   also with XFree86 4.0.x.<br>
-        <br>
-       The way to start <code>xterm</code> in unicode mode is then e.g. (put
-  everything  on 1 line !):<br>
-                       
-  <blockquote><code>LANG=en_GB.UTF-8 xterm -bg 'black' -fg 'DarkGrey' -cm 
-   -fn '-Misc-Fixed-Medium-R-SemiCondensed--13-120-75-75-C-60-ISO10646-1'</code><br>
-          </blockquote>
-            This first sets the <code>LANG</code> variable to a locale that 
- uses  UTF-8, and then starts <code>xterm</code> with a proper Unicode font. 
- &nbsp;Some  sample UTF-8 plain text files can be found <a href="http://www.cl.cam.ac.uk/%7Emgk25/ucs/examples">
-      here</a>     . &nbsp;Just <code>cat</code> them on the command line 
-and see the result.<br>
-          <br>
-                                   
-    <h2>Testing the parser with debugging</h2>
-      Given the UTF-8 capable terminal, you can now let the <code>testgedcom</code>
-       program print the values that it parses. &nbsp;An example of a command 
-  line is (in the <code>gedcom</code> directory):<br>
-                                   
-    <blockquote><code>./testgedcom -dg t/ulhc.ged</code><br>
-            </blockquote>
-      The <code>-dg</code> option instructs the parser to show its own debug
-  messages  &nbsp;(see <code>./testgedcom -h</code> for the full set of options). 
- &nbsp;If  everything is OK, you'll see the values from the gedcom file, containing
- a lot of special characters.<br>
-            <br>
-      For the ANSEL test file (<code>t/ansel.ged</code>), you have to set 
-the   environment variable <code>GCONV_PATH</code> to the <code>ansel</code>
-  subdirectory   of the gedcom directory:<br>
-                                               
+         <br>
+        For the second issue, you'll need the ISO 10646-1 fonts. &nbsp;These
+  come   also with XFree86 4.0.x.<br>
+         <br>
+        The way to start <code>xterm</code> in unicode mode is then e.g.
+(put   everything  on 1 line !):<br>
+                           
+  <blockquote><code>LANG=en_GB.UTF-8 xterm -bg 'black' -fg 'DarkGrey' -cm
+    -fn '-Misc-Fixed-Medium-R-SemiCondensed--13-120-75-75-C-60-ISO10646-1'</code><br>
+           </blockquote>
+             This first sets the <code>LANG</code> variable to a locale that
+  uses  UTF-8, and then starts <code>xterm</code> with a proper Unicode font.
+  &nbsp;Some  sample UTF-8 plain text files can be found <a href="http://www.cl.cam.ac.uk/%7Emgk25/ucs/examples">
+       here</a>     . &nbsp;Just <code>cat</code> them on the command line
+ and see the result.<br>
+           <br>
+                                         
+    
+    <h3><a name="Testing_the_parser_with_debugging"></a>Testing the parser with debugging</h3>
+
+       Given the UTF-8 capable terminal, you can now let the <code>testgedcom</code>
+        program print the values that it parses. &nbsp;An example of a command
+   line is (in the top <code></code>directory):<br>
+                                         
+    <blockquote><code>./testgedcom -dg t/input/ulhc.ged</code><br>
+             </blockquote>
+       The <code>-dg</code> option instructs the parser to show its own debug 
+  messages  &nbsp;(see <code>./testgedcom -h</code> for the full set of options).
+  &nbsp;If  everything is OK, you'll see the values from the gedcom file,
+containing  a lot of special characters.<br>
+             <br>
+       For the ANSEL test file (<code>t/ansel.ged</code>), you have to set
+ the   environment variable <code>GCONV_PATH</code> to the <code>ansel</code>
+   subdirectory   of the top directory:<br>
+                                                       
        <blockquote><code>export GCONV_PATH=./ansel<br>
-      ./testgedcom -dg t/ansel.ged<br>
-              </code></blockquote>
-      This is because for the ANSEL character set an extra module is needed 
- for  the iconv library (more on this later). &nbsp;But again, this should 
- show  a lot of special characters.<br>
-              <br>
-                                                           
-        <h2>Testing the lexers separately</h2>
-      The lexers themselves can be tested separately. &nbsp;For the 1-byte
- lexer   (i.e. supporting the encodings with 1 byte per characters, such
-as  ASCII,   ANSI and ANSEL), the sequence of commands would be:<br>
-                                                           
-        <blockquote><code>make clean<br>
-      make test_1byte<br>
-            </code></blockquote>
-  This will show all tokens in the <code>t/allged.ged</code> test file. &nbsp;Similar
- tests can be done using <code>make test_hilo</code> and <code>make test_lohi</code>
-   (for the unicode lexers).<br>
-                <br>
-      This concludes the testing setup. &nbsp;Now for some explanations...<br>
-                <br>
-                                                                       
-          <h2>Structure of the parser</h2>
-      I see the structure of a program using the gedcom parser as follows:<br>
-                <br>
-                <img src="images/schema.png" alt="Gedcom parsing scheme">
-                <br>
-                <br>
-                <br>
-      TO BE COMPLETED...<br>
-                                   
-          <hr width="100%" size="2">
-          <pre>$Id$<br>$Name$<br></pre>
+       ./testgedcom -dg t/input/ansel.ged<br>
+               </code></blockquote>
+       This is because for the ANSEL character set an extra module is needed
+  for  the iconv library (more on this later). &nbsp;But again, this should
+  show  a lot of special characters.<br>
+               <br>
+                                                                     
+        
+        <h3><a name="Testing_the_lexers_separately"></a>Testing the lexers separately</h3>
+
+       The lexers themselves can be tested separately. &nbsp;For the 1-byte 
+ lexer   (i.e. supporting the encodings with 1 byte per characters, such as
+ ASCII,   ANSI and ANSEL), the command would be (in the <code>gedcom</code> subdirectory):<br>
+                                                                     
+        <blockquote><code>make lexer_1byte<br>
+             </code></blockquote>
+   This will generate a lexer program that can process e.g. the <code>t/input/allged.ged</code>
+ test file. &nbsp;Simply cat the file through the lexer on standard input
+and you should get all the tokens in the file. &nbsp;Similar  tests can be
+done using <code>make lexer_hilo</code> and <code>
+make lexer_lohi</code>
+    (for the unicode lexers). &nbsp;In each of the cases you need to know
+yourself which of the test files are appropriate to pass through the lexer.<br>
+                 <br>
+       This concludes the testing setup. &nbsp;Now for some explanations...<br>
+          <hr width="100%" size="2"><br>
+                                                                         
+         
+          <h2><a name="Structure_of_the_parser"></a>Structure of the parser</h2>
+       I see the structure of a program using the gedcom parser as follows:<br>
+                 <br>
+                 <img src="images/schema.png" alt="Gedcom parsing scheme">
+                 <br>
+                 <br>
+                 <br>
+       The parser is based on <code>lex/yacc</code>, which means that a module generated by <code>lex</code>
+ takes the inputfile and determines the tokens in that file (i.e. the smallest
+units, such as numbers, line terminators, GEDCOM tags, characters in GEDCOM
+values...). &nbsp;These tokens are passed to the parser module, which is
+generated by yacc, to parse the syntax of the file, i.e. whether the tokens
+appear in a sequence that is valid. &nbsp;<br>
            <br>
-                </div>
-                </div>
-                                                                       
-          </body>
-          </html>
+For each recognized statement in the GEDCOM file, the parser calls some callbacks,
+which can be registered by the application to get the information out of
+the file.<br>
+          <br>
+This basic description ignores the problem of character encoding.<br>
+          <br>
+          <h3><a name="Character_encoding"></a>Character encoding</h3>Refer to <a href="encoding.html">this page</a> for some introduction on character encoding...<br>
+
+          <br>
+GEDCOM defines three standard encodings:<br>
+          <ul>
+            <li>ASCII</li>
+            <li>ANSEL</li>
+            <li>UNICODE (assumed to be UCS-2, either big-endian or little-endian: the GEDCOM spec doesn't specify this)</li>
+          </ul>These are all supported by the parser, and converted into UTF-8 format.<br>
+
+            
+
+
+                                               
+          <hr width="100%" size="2">           
+          <pre><font size="-1">$Id$<br>$Name$</font><br></pre>
+           <br>
+                 </div>
+                 </div>
+                                                                         
+         
+          </body></html>
+\ No newline at end of file