Some more documentation regarding development.

diff --git a/doc/parser.html b/doc/parser.html

index a70769370698c294c327d0abbc95e8d576a5514c..e6581b7cbacf210f18370ab48ed554fcf0079b37 100644 (file)
--- a/doc/parser.html
+++ b/doc/parser.html
@@ -1,11 +1,8 @@
-<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
-<html>
-<head>
+<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head>
                                 
                                 
-  <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
-  <title>The Gedcom parser library</title>
-</head>
-  <body text="#000000" bgcolor="#ffffff" link="#000099" vlink="#990099" alink="#000099">
+  <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1"><title>The Gedcom parser library</title></head>
+
+<body text="#000000" bgcolor="#ffffff" link="#000099" vlink="#990099" alink="#000099">
               
  <div align="center">       
  <h1>The Gedcom parser library</h1>
               
@@ -13,10 +10,33 @@
  <div align="left">The intention of this page is to provide some explanation
      of the gedcom parser, to aid development on and with it. &nbsp;First,
  some practical  issues  of testing with the parser will be explained.<br>
-       <br>
+<br>
+<h2>Index</h2>
+<ul>
+  <li><a href="#Testing">Testing</a></li>
+  <ul>
+    <li><a href="#Basic_testing">Basic testing</a></li>
+    <li><a href="#Preparing_for_further_testing">Preparing for further testing</a></li>
+    <li><a href="#Testing_the_parser_with_debugging">Testing the parser with debugging</a></li>
+    <li><a href="#Testing_the_lexers_separately">Testing the lexers separately</a><br>
+    </li>
+  </ul>
+  <li><a href="#Structure_of_the_parser">Structure of the parser</a></li>
+  <ul>
+    <li><a href="#Character_encoding">Character encoding</a><br>
+    </li>
+  </ul>
+</ul>
+<br>
+<hr width="100%" size="2">
+<h2><a name="Testing"></a>Testing<br>
+</h2>
+
               
               
-<h2>Basic testing<br>
-       </h2>
+<h3><a name="Basic_testing"></a>Basic testing<br>
+
+       </h3>
+
    You should be able to perform a basic test using the commands:<br>
               
  <blockquote><code>./configure<br>
@@ -24,16 +44,16 @@ some practical  issues  of testing with the parser will be explained.<br>
          make check</code><br>
           </blockquote>
          If everything goes OK, you'll see that some gedcom files are parsed,
-  and   that each parse is successful. &nbsp;Note that the used gedcom files
+  and   that each parse is successful. &nbsp;Note that some of the gedcom files used
    are  made by <a href="http://heiner-eichmann.de/gedcom/gedcom.htm">Heiner
    Eichmann</a>       and are an excellent way to test gedcom parsers thoroughly.<br>
           <br>
                             
-  <h2>Preparing for further testing</h2>
-        The basic testing described above doesn't show anything else than 
-"Parse    succeeded", which is nice, but not very interesting. &nbsp;Some 
+  
+  <h3><a name="Preparing_for_further_testing"></a>Preparing for further testing</h3>
+Some 
  more detailed    tests are possible, via the <code>testgedcom</code> program 
-that is generated    by <code>make test</code>. &nbsp;<br>
+that is generated    by <code>make</code>. &nbsp;<br>
           <br>
          However, since the output that <code>testgedcom</code> generates
  is   in  UTF-8 format (more on this later), some preparation is necessary
@@ -66,7 +86,9 @@ the   <code>   xterm</code> coming with XFree86 4.0.x). &nbsp;UTF-8 capabilities
   and see the result.<br>
             <br>
                                           
-    <h2>Testing the parser with debugging</h2>
+    
+    <h3><a name="Testing_the_parser_with_debugging"></a>Testing the parser with debugging</h3>
+
         Given the UTF-8 capable terminal, you can now let the <code>testgedcom</code>
          program print the values that it parses. &nbsp;An example of a command
     line is (in the <code>gedcom</code> directory):<br>
@@ -90,37 +112,103 @@ containing  a lot of special characters.<br>
    show  a lot of special characters.<br>
                 <br>
                                                                       
-        <h2>Testing the lexers separately</h2>
+        
+        <h3><a name="Testing_the_lexers_separately"></a>Testing the lexers separately</h3>
+
         The lexers themselves can be tested separately. &nbsp;For the 1-byte 
   lexer   (i.e. supporting the encodings with 1 byte per character, such as
- ASCII,   ANSI and ANSEL), the sequence of commands would be:<br>
+ ASCII,   ANSI and ANSEL), the command would be (in the <code>gedcom</code> subdirectory):<br>
                                                                       
                                                                       
-        <blockquote><code>make clean<br>
-       make test_1byte<br>
+        <blockquote><code>make lexer_1byte<br>
               </code></blockquote>
-   This will show all tokens in the <code>t/allged.ged</code> test file.
-&nbsp;Similar  tests can be done using <code>make test_hilo</code> and <code>
-make test_lohi</code>    (for the unicode lexers).<br>
+   This will generate a lexer program that can process e.g. the <code>t/allged.ged</code>
+ test file. &nbsp;Simply pipe the file into the lexer on standard input
+and you should get all the tokens in the file. &nbsp;Similar  tests can be
+done using <code>make lexer_hilo</code> and <code>
+make lexer_lohi</code>    (for the Unicode lexers). &nbsp;In each case you need to work out for yourself which of the test files are appropriate to pass through the lexer.<br>
                   <br>
         This concludes the testing setup. &nbsp;Now for some explanations...<br>
-                 <br>
+          <hr width="100%" size="2"><br>
                                                                           
           
                                                                           
           
-          <h2>Structure of the parser</h2>
+          <h2><a name="Structure_of_the_parser"></a>Structure of the parser</h2>
         I see the structure of a program using the gedcom parser as follows:<br>
                   <br>
                   <img src="images/schema.png" alt="Gedcom parsing scheme">
                   <br>
                   <br>
                   <br>
-       TO BE COMPLETED...<br>
+       The parser is based on <code>lex/yacc</code>, which means that a module generated by <code>lex</code>
+ takes the input file and determines the tokens in that file (i.e. the smallest
+units, such as numbers, line terminators, GEDCOM tags, characters in GEDCOM
+values...). &nbsp;These tokens are passed to the parser module, which is
+generated by <code>yacc</code>, to check the syntax of the file, i.e. whether the tokens
+appear in a valid sequence. &nbsp;<br>
+          <br>
+For each recognized statement in the GEDCOM file, the parser calls some callbacks,
+which can be registered by the application to get the information out of
+the file.<br>
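The lexer, parser and callback flow described above can be sketched in a few lines. This is only a toy illustration in Python, not the library's code or API: the regular expression, the single syntax rule and the names <code>tokenize</code>, <code>parse</code> and <code>callbacks</code> are all assumptions made for this example, and the real grammar lives in the lex/yacc sources.

```python
import re

# Hypothetical token pattern for one GEDCOM line: level, optional
# cross-reference, tag, optional value. Illustrative only; this is not the
# grammar used by the library's lex/yacc sources.
LINE = re.compile(r"^(\d+) (?:(@[^@]+@) )?([A-Za-z0-9_]+)(?: (.*))?$")


def tokenize(line):
    """'Lexer' step: split one line into its smallest units."""
    match = LINE.match(line)
    if match is None:
        raise ValueError(f"not a valid GEDCOM line: {line!r}")
    level, xref, tag, value = match.groups()
    return int(level), xref, tag, value


def parse(lines, callbacks):
    """'Parser' step: check the token sequence and call registered callbacks."""
    previous_level = -1
    for line in lines:
        level, xref, tag, value = tokenize(line)
        # Toy syntax rule: a line may be at most one level deeper than the
        # line before it.
        if level > previous_level + 1:
            raise ValueError(f"level jumps from {previous_level} to {level}")
        previous_level = level
        handler = callbacks.get(tag)
        if handler is not None:
            handler(level, xref, value)


# The "application" registers a callback for NAME lines only.
names = []
parse(
    ["0 @I1@ INDI", "1 NAME John /Doe/", "1 SEX M"],
    {"NAME": lambda level, xref, value: names.append(value)},
)
print(names)
```

The application never sees the raw lines here: it only receives the values handed to the callbacks it registered, which is the division of labour described above.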
+          <br>
+This basic description ignores the problem of character encoding. &nbsp;The next section describes exactly what this problem is.<br>
+          <br>
+          <h3><a name="Character_encoding"></a>Character encoding</h3>
+          <h4><i>The character encoding problem</i><br>
+          </h4>
+Developers are usually familiar with the ASCII character set. &nbsp;This
+is a character set that assigns a unique number to some characters, e.g.
+an "A" has ASCII code 65 (or 0x41 in hex), and an "a" has ASCII code 97 (or
+0x61 in hex). &nbsp;Some people may also have used ASCII codes for several
+drawing characters (such as a horizontal bar, a vertical bar, or a top-right
+corner) in the old DOS days, to be able to draw nice windows in text mode.<br>
+          <br>
+However, strictly speaking, these last characters are not part of the ASCII
+set. &nbsp;The standard ASCII set contains only the character positions from
+0 to 127 (i.e. anything that fits into an integer that is 7 bits wide). &nbsp;An
+example of this table can be found <a href="http://web.cs.mun.ca/%7Emichael/ascii-table.html">here</a>. &nbsp;Any byte value between 128 and 255 is in principle undefined in ASCII.<br>
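As a small illustration of the 7-bit limit, the following Python sketch prints the codes of "A" and "a" and checks whether a sequence of bytes stays within the ASCII range. The helper name <code>is_ascii</code> is made up for this example.

```python
def is_ascii(data: bytes) -> bool:
    """Return True if every byte fits in 7 bits, i.e. lies in 0..127."""
    return all(byte < 0x80 for byte in data)


# 7 bits give 2**7 = 128 character positions, numbered 0 to 127.
print(2 ** 7)                    # 128

print(ord("A"), hex(ord("A")))   # 65 0x41
print(ord("a"), hex(ord("a")))   # 97 0x61

print(is_ascii(b"GEDCOM"))       # True
print(is_ascii(bytes([233])))    # False: 233 needs the eighth bit
```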
+          <br>
+Now, several systems (including the old DOS) have defined those character
+positions anyway, but usually in totally different ways. &nbsp;Some well-known
+extensions are:<br>
+          <ul>
+            <li>the <a href="http://czyborra.com/charsets/cp437.gif">DOS</a>
+ character set, nowadays usually known as Code Page 437, but sometimes also
+named LatinUS, ECS (Extended Character Set) or PC-8; note that the table
+displayed in the link also contains the standard ASCII part</li>
+            <li>the <a href="http://czyborra.com/charsets/cp1252.gif">ANSI</a> character set, also known as Code Page 1252, and usually the default on Windows</li>
+            <li>the <a href="http://czyborra.com/charsets/iso8859-1.gif">ISO-8859-1</a> character set (also called Latin-1), which is an ISO standard for Western European languages, mostly used on various Unices</li>
+            <li>the <a href="http://czyborra.com/charsets/adobe-stdenc.gif">Adobe Standard Encoding</a>, which is used by default in PostScript, unless overridden</li>
+          </ul>
+And these are only examples of character sets used for Western European languages.
+&nbsp;For Japanese, Chinese, Korean, Vietnamese, ... there are separate character
+sets in which one byte's meaning can even be influenced by what the previous
+byte was, i.e. these are multi-byte character sets. &nbsp;This is because
+even 256 characters are totally inadequate to represent all characters in
+such languages.<br>
+          <br>
+So, summarizing, if a text file contains a byte that has a value 65, it is
+pretty safe to assume that this byte represents an "A", if we ignore the
+multi-byte character sets mentioned above. &nbsp;However, a value 233 cannot
+be interpreted without knowing in which character set the text file is written.
+&nbsp;In Latin-1, it happens to be the character "&eacute;", but in another
+character set it can be something totally different (e.g. in the DOS character
+set it is the Greek letter theta).<br>
+          <br>
+Vice versa, if you need to write a character "&eacute;" to a file, the numerical
+value that ends up in the file depends on the character set you use:
+in Latin-1 it will be 233, but if you use the DOS character set it
+will be 130.<br>
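The two directions described above can be checked with Python's standard codecs. This is only an illustration and has nothing to do with how the parser itself converts encodings; <code>latin-1</code> and <code>cp437</code> are Python's names for ISO-8859-1 and the DOS code page 437.

```python
import unicodedata

# Reading: the same byte value means different characters in different
# character sets.
raw = bytes([233])
latin1_char = raw.decode("latin-1")
dos_char = raw.decode("cp437")
print(unicodedata.name(latin1_char))  # LATIN SMALL LETTER E WITH ACUTE
print(unicodedata.name(dos_char))     # GREEK CAPITAL LETTER THETA

# Writing: the same character becomes different byte values.
e_acute = "\u00e9"
latin1_value = e_acute.encode("latin-1")[0]
dos_value = e_acute.encode("cp437")[0]
print(latin1_value, dos_value)        # 233 130
```

The character names are printed instead of the characters themselves so that the output does not depend on the encoding of the terminal.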
+          <br>
+          <h4><i>Unicode</i></h4>
+Enter the Unicode standard...<br>
+          <br>
+TO BE COMPLETED<br>
+
                                                 
            <hr width="100%" size="2">           
                                                 
-          <pre>$Id$<br>$Name$<br></pre>
+          <pre><font size="-1">$Id$<br>$Name$</font><br></pre>
             <br>
                   </div>
                   </div>
                                                                           
           
                                                                           
           
-          </body>
-          </html>
+          </body></html>
+\ No newline at end of file