- ./gedcom_parse -dg t/ansel.ged<br>
- </code></blockquote>
- This is because for the ANSEL character set an extra module is needed
-for the iconv library (more on this later). But again, this should
-show a lot of special characters.<br>
- <br>
-
- <h2>Testing the lexers separately</h2>
- The lexers themselves can be tested separately. For the 1-byte
-lexer (i.e. supporting the encodings with 1 byte per characters, such as
-ASCII, ANSI and ANSEL), the sequence of commands would be:<br>
-
- <blockquote><code>make clean<br>
- make test_1byte<br>
- </code></blockquote>
-This will show all tokens in the <code>t/allged.ged</code> test file. Similar
-tests can be done using <code>make test_hilo</code> and <code>make test_lohi</code>
- (for the unicode lexers).<br>
- <br>
- This concludes the testing setup. Now for some explanations...<br>
- <br>
-
- <h2>Structure of the parser</h2>
- I see the structure of a program using the gedcom parser as follows:<br>
- <br>
- <img src="images/schema.png" alt="Gedcom parsing scheme">
- <br>
- <br>
- <br>
- TO BE COMPLETED...<br>
-
- <hr width="100%" size="2">$Id: parser.html,v 1.2 2001/12/01 15:29:00
-verthezp Exp $<br>
- $Name$<br>
- <br>
- </div>
- </div>
+ ./testgedcom -dg t/ansel.ged<br>
+ </code></blockquote>
+ This is because for the ANSEL character set an extra module is needed
+ for the iconv library (more on this later). But again, this should
+ show a lot of special characters.<br>
+ <br>
+
+
+ <h3><a name="Testing_the_lexers_separately"></a>Testing the lexers separately</h3>
+
+ The lexers themselves can be tested separately. For the 1-byte
+ lexer (i.e. supporting the encodings with 1 byte per character, such as
+ ASCII, ANSI and ANSEL), the command would be (in the <code>gedcom</code> subdirectory):<br>
+
+ <blockquote><code>make lexer_1byte<br>
+ </code></blockquote>
+ This will generate a lexer program that can process e.g. the <code>t/allged.ged</code>
+ test file. Simply pipe the file into the lexer on standard input
+and it should print all the tokens in the file. Similar tests can be
+done using <code>make lexer_hilo</code> and <code>make lexer_lohi</code>
+(for the Unicode lexers). In each case, you need to work out for yourself which of the test files are appropriate to pass through the lexer.<br>
+ <br>
+ This concludes the testing setup. Now for some explanations...<br>
+ <hr width="100%" size="2"><br>
+
+
+ <h2><a name="Structure_of_the_parser"></a>Structure of the parser</h2>
+ I see the structure of a program using the gedcom parser as follows:<br>
+ <br>
+ <img src="images/schema.png" alt="Gedcom parsing scheme">
+ <br>
+ <br>
+ <br>
+ The parser is based on <code>lex/yacc</code>. A module generated by <code>lex</code>
+ takes the input file and determines the tokens in that file (i.e. the smallest
+units, such as numbers, line terminators, GEDCOM tags, characters in GEDCOM
+values...). These tokens are passed to the parser module, which is
+generated by <code>yacc</code> and checks the syntax of the file, i.e. whether the tokens
+appear in a valid sequence. <br>
+ <br>
+For each recognized statement in the GEDCOM file, the parser invokes callbacks
+that the application can register to get the information out of
+the file.<br>
+ <br>
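The lexer, parser and callback stages described above can be illustrated with a minimal sketch. This is not the gedcom library's API: the names <code>tokenize</code>, <code>ToyParser</code> and <code>register</code>, as well as the simplified line pattern, are assumptions made purely for illustration, and the real parser is generated by lex/yacc rather than written by hand.

```python
import re

# Hypothetical, simplified pattern for one GEDCOM line:
# level number, optional @XREF@, tag, optional value.
LINE_RE = re.compile(r"^(\d+)(?: (@[^@]+@))?(?: (\w+))(?: (.*))?$")


def tokenize(line):
    """Lexer stage: split one line into (level, xref, tag, value)."""
    match = LINE_RE.match(line.rstrip("\r\n"))
    if match is None:
        raise ValueError(f"not a valid GEDCOM line: {line!r}")
    level, xref, tag, value = match.groups()
    return int(level), xref, tag, value


class ToyParser:
    """Parser stage: checks level sequencing and invokes callbacks."""

    def __init__(self):
        self.callbacks = {}

    def register(self, tag, func):
        # The application registers one callback per tag of interest.
        self.callbacks[tag] = func

    def parse(self, lines):
        previous_level = -1
        for line in lines:
            level, xref, tag, value = tokenize(line)
            # A line may be at most one level deeper than its predecessor.
            if level > previous_level + 1:
                raise ValueError(f"level jumps too far: {line!r}")
            previous_level = level
            callback = self.callbacks.get(tag)
            if callback is not None:
                callback(level, xref, value)


# Usage: collect all NAME values from a tiny in-memory file.
names = []
parser = ToyParser()
parser.register("NAME", lambda level, xref, value: names.append(value))
parser.parse(["0 HEAD", "0 @I1@ INDI", "1 NAME John /Smith/", "0 TRLR"])
```

The application only sees the callbacks; tokenizing and syntax checking stay inside the parser, which is the separation the scheme above aims for.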
+This basic description ignores the problem of character encoding. The next section describes exactly what this problem is.<br>
+ <br>
+ <h3><a name="Character_encoding"></a>Character encoding</h3>
+ <h4><i>The character encoding problem</i><br>
+ </h4>
+Developers are usually familiar with the ASCII character set. This
+is a character set that assigns a unique number to some characters, e.g.
+an "A" has ASCII code 65 (or 0x41 in hex), and an "a" has ASCII code 97 (or
+0x61 in hex). Some people may also have used ASCII codes for several
+drawing characters (such as a horizontal bar, a vertical bar, or a top-right
+corner) in the old DOS days, to be able to draw nice windows in text mode.<br>
+ <br>
+However, these last characters are, strictly speaking, not part of the ASCII
+set. The standard ASCII set contains only the character positions from
+0 to 127 (i.e. anything that fits into an integer that is 7 bits wide). An
+example of this table can be found <a href="http://web.cs.mun.ca/%7Emichael/ascii-table.html">here</a>. Any code between 128 and 255 is in principle undefined in ASCII.<br>
+ <br>
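As a small illustration of the 7-bit range, the following sketch checks the codes mentioned above. The helper name <code>is_ascii_byte</code> is made up for this example.

```python
# Every ASCII character maps to a number that fits in 7 bits (0..127).
code_upper = ord("A")  # 65, i.e. 0x41
code_lower = ord("a")  # 97, i.e. 0x61


def is_ascii_byte(value):
    """Return True if the byte value lies in the 7-bit ASCII range."""
    return 0 <= value <= 0x7F


# 65, 97 and 127 are ASCII; 128 and 233 are outside the standard set.
ascii_checks = [is_ascii_byte(v) for v in (65, 97, 127, 128, 233)]
```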
+Now, several systems (including the old DOS) have defined those character
+positions anyway, but usually in totally different ways. Some
+well-known extensions are:<br>
+ <ul>
+ <li>the <a href="http://czyborra.com/charsets/cp437.gif">DOS</a>
+ character set, nowadays usually known as Code Page 437, but sometimes also
+named LatinUS, ECS (Extended Character Set) or PC-8; note that the table
+displayed in the link also contains the standard ASCII part</li>
+ <li>the <a href="http://czyborra.com/charsets/cp1252.gif">ANSI</a> character set, also known as Code Page 1252, and usually the default on Windows</li>
+ <li>the <a href="http://czyborra.com/charsets/iso8859-1.gif">ISO-8859-1</a> character set (also called Latin-1), which is an ISO standard for Western European languages, mostly used on various Unices</li>
+ <li>the <a href="http://czyborra.com/charsets/adobe-stdenc.gif">Adobe Standard Encoding</a>, which is used by default in PostScript, unless overridden</li>
+ </ul>
+And these are only examples of character sets used for Western European languages.
+ For Japanese, Chinese, Korean, Vietnamese, ... there are separate character
+sets in which one byte's meaning can even be influenced by what the previous
+byte was, i.e. these are multi-byte character sets. This is because
+even 256 characters are totally inadequate to represent all characters in
+such languages.<br>
+ <br>
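To make the multi-byte case concrete, here is a short sketch using Shift_JIS, chosen only as one example of such a character set (the text above does not name a specific one). The same byte value 0x41 means something different depending on the byte before it.

```python
# On its own, 0x41 is the letter "A" in Shift_JIS, just as in ASCII.
single = bytes([0x41]).decode("shift_jis")

# Preceded by the lead byte 0x83, the same 0x41 is the second half of a
# two-byte sequence that encodes one katakana character (U+30A2).
double = bytes([0x83, 0x41]).decode("shift_jis")
```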
+So, summarizing, if a text file contains a byte that has a value 65, it is
+pretty safe to assume that this byte represents an "A", if we ignore the
+multi-byte character sets mentioned above. However, a value 233 cannot
+be interpreted without knowing in which character set the text file is written.
+ In Latin-1, it happens to be the character "é", but in another
+character set it can be something totally different (e.g. in the DOS character
+set it is the Greek letter theta).<br>
+ <br>
+Vice versa, if you need to write a character "é" to a file, the numerical
+value that ends up in the file depends on the character set you use:
+in Latin-1 it will be 233, but if you use the DOS character set it
+will be 130.<br>
+ <br>
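Both directions can be checked with a few lines of Python, using its built-in <code>latin-1</code> and <code>cp437</code> codecs as stand-ins for the Latin-1 and DOS character sets. This is only a sketch to illustrate the point; the gedcom parser itself relies on the iconv library for conversions.

```python
# Reading: the same byte value 233 decodes to different characters.
raw = bytes([233])
as_latin1 = raw.decode("latin-1")  # e with acute accent (U+00E9)
as_cp437 = raw.decode("cp437")     # Greek capital letter theta (U+0398)

# Writing: the same character encodes to different byte values.
e_acute = "\u00e9"
in_latin1 = e_acute.encode("latin-1")[0]  # 233
in_cp437 = e_acute.encode("cp437")[0]     # 130
```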
+ <h4><i>Unicode</i></h4>
+Enter the Unicode standard...<br>
+ <br>
+TO BE COMPLETED<br>
+