From 7a161f98fa3efba595c96577e3ae7eda15b3dec3 Mon Sep 17 00:00:00 2001 From: Peter Verthez Date: Sun, 30 Dec 2001 22:45:43 +0000 Subject: [PATCH] Added some initial documentation. --- AUTHORS | 15 +++ ChangeLog | 11 ++ Makefile.am | 9 ++ NEWS | 17 +++ README | 77 ++++++++++++ doc/Makefile.am | 14 ++- doc/index.html | 36 ++++++ doc/parser.html | 214 +++++++++++++++++---------------- doc/usage.html | 307 ++++++++++++++++++++++++++++++++++++++++++++++++ 9 files changed, 587 insertions(+), 113 deletions(-) create mode 100644 doc/index.html create mode 100644 doc/usage.html diff --git a/AUTHORS b/AUTHORS index 44ca7de..1f55b05 100644 --- a/AUTHORS +++ b/AUTHORS @@ -1,2 +1,17 @@ +Authors of gedcom-parse: + +Design and implementation: + Peter Verthez + +Thanks for contributing ideas: + Geert Vantienen + Perry Rapp + +Integrated external code and data: + - Date calculation code: Scott E. Lee (http://www.genealogy.org/~scottlee/) + - Skeleton for gconv : Ulrich Drepper + - Gedcom test files : Heiner Eichmann + (http://heiner-eichmann.de/gedcom/gedcom.htm) + # $Id$ # $Name$ diff --git a/ChangeLog b/ChangeLog index 23b439f..6ef55a6 100644 --- a/ChangeLog +++ b/ChangeLog @@ -1,3 +1,14 @@ +2001-12-30 Peter Verthez + + * all: Added some initial documentation. + + * gedcom/gedcom.y: Completed the calling of callbacks. + +2001-12-29 Peter Verthez + + * gedcom/gedcom_date.y: Added graceful fallback for date parse errors: + put everything as a 'date phrase'. + 2001-12-28 Peter Verthez * gedcom_date.*, date.*: Parsing dates via a separate yacc parser. 
diff --git a/Makefile.am b/Makefile.am index c6a2ee6..6b28425 100644 --- a/Makefile.am +++ b/Makefile.am @@ -12,6 +12,15 @@ testgedcom_LDFLAGS = -L gedcom/.libs -lgedcom testgedcom_LDADD = @INTLLIBS@ EXTRA_DIST = $(pkgdata_DATA) +VERSIONED_FILES = README + +dist-hook: + @cd $(distdir); \ + for file in $(VERSIONED_FILES); do \ + sed 's/\@VERSION\@/${VERSION}/' $$file > $$file.new; \ + rm $$file; \ + mv $$file.new $$file; \ + done clean-local: rm -f testgedcom.out diff --git a/NEWS b/NEWS index 44ca7de..2e8f0ad 100644 --- a/NEWS +++ b/NEWS @@ -1,2 +1,19 @@ +NOTE: NO BACKWARD COMPATIBILITY IS GUARANTEED FOR 0.x RELEASES !! + +release 0.12 (): + + - The calling of callbacks is now completed. + + - The parsed value that is returned in callbacks can now be: + - a null value + - a string + - a date (struct date_value) + See the documentation for more info. Parsing and checking of cross- + references will be added next. + +release 0.11 (15 December 2001): + + - Initial release from Sourceforge.net (developers only !) + # $Id$ # $Name$ diff --git a/README b/README index 44ca7de..aa4e4a2 100644 --- a/README +++ b/README @@ -1,2 +1,79 @@ +The Gedcom parser library (release @VERSION@) +------------------------- +The Gedcom parser library is a C library that provides an API to applications +to parse and process arbitrary genealogy files in the standard gedcom format. + +Its main features are: + + - strict callback-based parser written in C (using lex/yacc) + + - supports the Gedcom 5.5 standard fully + + - supports the standard encoding formats (ASCII, ANSEL, UNICODE), but + extensible (via a configuration file) to other encoding formats; by + default ANSI is also supported. 
+ + - all strings passed from callbacks to the using program are in UTF-8 format + + - internationalization of the error and warning messages + + - specific parsing of date values to a calendar-neutral date system (Julian + days aka serial day numbers); the date parser can be called separately + + - provisions for "compatibility-mode" parsing, to allow for not-exactly- + standard syntaxes used by other genealogy programs (only the hooks are + in at the moment, not the actual compatibility) + +NOTE: + - NO BACKWARD COMPATIBILITY is guaranteed for 0.x releases ! + +To do list: + - specific parsing and checking of cross-references + - specific parsing of other special values + - C++ interface + - compatibility with other genealogy programs + - older/newer Gedcom standards ? + - ... + +For more information, refer to the documentation in the doc subdirectory, +or to the SourceForge project web site and summary page: + http://gedcom-parse.sourceforge.net + http://sourceforge.net/projects/gedcom-parse + +Also, have a look at the 'Genes' program, from which this library is a +spin-off, and which intends to use this library: + http://genes.sourceforge.net + http://sourceforge.net/projects/genes + + +Requirements: +------------ + - glibc 2.2 or higher + +To build from sources, you'll also need: + - gcc + - autoconf + - automake + - flex + - bison (won't work with plain yacc) + +It is possible that it also runs on other platforms than Linux (and that the +glibc version requirement can be loosened), however, I can only support Linux +because that is the only platform I have... 
+ + +Installation: +------------ +This is simply: + + ./configure + make + make install + +You can also run some tests via: + make check + + +############################################################################### # $Id$ # $Name$ diff --git a/doc/Makefile.am b/doc/Makefile.am index 4da9c18..b328f8f 100644 --- a/doc/Makefile.am +++ b/doc/Makefile.am @@ -2,9 +2,15 @@ # $Id$ # $Name$ -EXTRA_DIST=parser.html +EXTRA_DIST = index.html usage.html parser.html +VERSIONED_FILES = index.html dist-hook: - mkdir $(distdir)/images - cp -p $(srcdir)/images/schema.obj $(srcdir)/images/schema.png \ - $(distdir)/images + @cd $(distdir); \ + mkdir images + cp -p $(srcdir)/images/schema.obj $(srcdir)/images/schema.png images \ + for file in $(VERSIONED_FILES); do \ + sed 's/\@VERSION\@/${VERSION}/' $$file > $$file.new; \ + rm $$file; \ + mv $$file.new $$file; \ + done diff --git a/doc/index.html b/doc/index.html new file mode 100644 index 0000000..09858f3 --- /dev/null +++ b/doc/index.html @@ -0,0 +1,36 @@ + + + + The GEDCOM parser library + + + + + +

The GEDCOM parser library

+ This is the documentation for the GEDCOM parser library, release @VERSION@.
+
+ The GEDCOM parser library is a C library that provides an API to applications + to parse and process arbitrary genealogy files in the standard gedcom format. + It supports release 5.5 + of the GEDCOM standard.
+
+ The rest of the documentation is divided into three parts:
+ +
    +
  • Usage: This is the main entry point for +application developers, using the library
  • +
  • Development: This describes some internals +of the library
  • +
  • Links: A collection of useful links, also +referenced in the rest of the documentation
    +
  • + +
+ +
$Id$
+ $Name$
+
+ + + diff --git a/doc/parser.html b/doc/parser.html index 15d253b..5637cd9 100644 --- a/doc/parser.html +++ b/doc/parser.html @@ -1,129 +1,125 @@ - + - Gedcom parser in Genes + The Gedcom parser library - -
-

Gedcom parser in Genes

- -
The intention of this page is to provide some explanation - of the gedcom parser, to aid development on and with it.  Currently, - the parser is in a state that it works, but some parts are still missing, - notably the interface towards applications.  First, some practical -issues of testing with the parser will be explained.
-
- + +
+

The Gedcom parser library

+ +
The intention of this page is to provide some explanation + of the gedcom parser, to aid development on and with it.  First, +some practical issues of testing with the parser will be explained.
+
+

Basic testing
-

- The parser is located in the "gedcom" subdirectory of the Genes source - code.  You should be able to perform a basic test using the commands:
- -
make clean
- make
- make test

-
- If everything goes OK, you'll see that some gedcom files are parsed, -and that each parse is successful.  Note that the used gedcom files -are made by Heiner -Eichmann - and are an excellent way to test gedcom parsers thoroughly.
-
- + +You should be able to perform a basic test using the commands:
+ +
./configure
+ make
+ make check

+
+ If everything goes OK, you'll see that some gedcom files are parsed, + and that each parse is successful.  Note that the used gedcom files + are made by Heiner + Eichmann and are an excellent way to test gedcom parsers thoroughly.
+
+

Preparing for further testing

- The basic testing described above doesn't show anything else than "Parse - succeeded", which is nice, but not very interesting.  Some more detailed - tests are possible, via the gedcom-parse program that is generated - by make test.  
-
- However, since the output that gedcom-parse generates is - in UTF-8 format (more on this later), some preparation is necessary to -have a full view on it.  Basically, you need a terminal that understands -and can display UTF-8 encoded characters, and you need to proper fonts installed - to display them.  I'll give some advice on this here, based on the -Red Hat 7.1 distribution that I use, with glibc 2.2 and XFree86 4.0.x.  Any - other distribution that has the same or newer versions for these components - should give the same results.
-
- For the first issue, the UTF-8 capable terminal, the safest bet is to - use xterm in its unicode mode (which is supported by the - xterm coming with XFree86 4.0.x).  UTF-8 capabilities - have only recently been added to gnome-terminal, so probably + The basic testing described above doesn't show anything else than "Parse + succeeded", which is nice, but not very interesting.  Some more detailed + tests are possible, via the testgedcom program that is generated + by make test.  
+
+ However, since the output that testgedcom generates is + in UTF-8 format (more on this later), some preparation is necessary to have + a full view of it.  Basically, you need a terminal that understands + and can display UTF-8 encoded characters, and you need the proper fonts installed + to display them.  I'll give some advice on this here, based on the + Red Hat 7.1 distribution that I use, with glibc 2.2 and XFree86 4.0.x. + Any other distribution that has the same or newer versions for these +components should give the same results.
+
+ For the first issue, the UTF-8 capable terminal, the safest bet is +to use xterm in its unicode mode (which is supported by +the xterm coming with XFree86 4.0.x).  UTF-8 capabilities + have only recently been added to gnome-terminal, so probably that is not in your distribution yet (it certainly isn't in Red Hat 7.1).
-
- For the second issue, you'll need the ISO 10646-1 fonts.  These -come also with XFree86 4.0.x.
-
- The way to start xterm in unicode mode is then e.g. (put +
+ For the second issue, you'll need the ISO 10646-1 fonts.  These + come also with XFree86 4.0.x.
+
+ The way to start xterm in unicode mode is then e.g. (put everything on 1 line !):
- -
LANG=en_GB.UTF-8 xterm -bg 'black' -fg 'DarkGrey' -cm - -fn '-Misc-Fixed-Medium-R-SemiCondensed--13-120-75-75-C-60-ISO10646-1'
-
- This first sets the LANG variable to a locale that -uses UTF-8, and then starts xterm with a proper Unicode font. - Some sample UTF-8 plain text files can be found - here - .  Just cat them on the command line and see the result.
-
- + +
LANG=en_GB.UTF-8 xterm -bg 'black' -fg 'DarkGrey' -cm + -fn '-Misc-Fixed-Medium-R-SemiCondensed--13-120-75-75-C-60-ISO10646-1'
+
+ This first sets the LANG variable to a locale that + uses UTF-8, and then starts xterm with a proper Unicode font. +  Some sample UTF-8 plain text files can be found + here .  Just cat them on the command line +and see the result.
+
+

Testing the parser with debugging

- Given the UTF-8 capable terminal, you can now let the gedcom-parse - program print the values that it parses.  An example of a command - line is (in the gedcom directory):
- -
./gedcom_parse -dg t/ulhc.ged
-
- The -dg option instructs the parser to show its own debug - messages  (see ./gedcom_parse -h for the full set of -options).  If everything is OK, you'll see the values from the gedcom -file, containing a lot of special characters.
-
- For the ANSEL test file (t/ansel.ged), you have to set the - environment variable GCONV_PATH to the ansel subdirectory - of the gedcom directory:
- -
export GCONV_PATH=./ansel
- ./gedcom_parse -dg t/ansel.ged
-
- This is because for the ANSEL character set an extra module is needed -for the iconv library (more on this later).  But again, this should -show a lot of special characters.
-
+ Given the UTF-8 capable terminal, you can now let the testgedcom + program print the values that it parses.  An example of a command + line is (in the gedcom directory):
+ +
./testgedcom -dg t/ulhc.ged
+
+ The -dg option instructs the parser to show its own debug + messages  (see ./testgedcom -h for the full set of options). +  If everything is OK, you'll see the values from the gedcom file, +containing a lot of special characters.
+
+ For the ANSEL test file (t/ansel.ged), you have to set +the environment variable GCONV_PATH to the ansel + subdirectory of the gedcom directory:
+
export GCONV_PATH=./ansel
+ ./testgedcom -dg t/ansel.ged
+
+ This is because for the ANSEL character set an extra module is needed + for the iconv library (more on this later).  But again, this should + show a lot of special characters.
+
+

Testing the lexers separately

- The lexers themselves can be tested separately.  For the 1-byte -lexer (i.e. supporting the encodings with 1 byte per characters, such as + The lexers themselves can be tested separately.  For the 1-byte +lexer (i.e. supporting the encodings with 1 byte per characters, such as ASCII, ANSI and ANSEL), the sequence of commands would be:
- +
make clean
- make test_1byte
-
-This will show all tokens in the t/allged.ged test file.  Similar + make test_1byte
+ + This will show all tokens in the t/allged.ged test file.  Similar tests can be done using make test_hilo and make test_lohi - (for the unicode lexers).
-
- This concludes the testing setup.  Now for some explanations...
-
- + (for the unicode lexers).
+
+ This concludes the testing setup.  Now for some explanations...
+
+

Structure of the parser

- I see the structure of a program using the gedcom parser as follows:
-
- Gedcom parsing scheme -
-
-
- TO BE COMPLETED...
- -
$Id: parser.html,v 1.2 2001/12/01 15:29:00 + I see the structure of a program using the gedcom parser as follows:
+
+ Gedcom parsing scheme +
+
+
+ TO BE COMPLETED...
+ +
$Id: parser.html,v 1.2 2001/12/01 15:29:00 verthezp Exp $
- $Name$
-
-
-
- + $Name$
+
+
+
+ diff --git a/doc/usage.html b/doc/usage.html new file mode 100644 index 0000000..f47684f --- /dev/null +++ b/doc/usage.html @@ -0,0 +1,307 @@ + + + + Using the GEDCOM parser library + + + + + +

Using the GEDCOM parser library

+
+ +

Index

+ +
+

Overview
+

+ The GEDCOM parser library is built as a callback-based parser (comparable +to the SAX interface of XML).  It comes with:
+ +
    +
  • a library (libgedcom.so), to be linked in the application +program
  • +
  • a header file (gedcom.h), to be used in the sources of +the application program
  • + +
+ Next to these, there is also a data directory in $PREFIX/share/gedcom-parse + that contains some additional files, which are not important +at first.  I'll leave the description of the data directory for later.
+
+ The simplest call of the gedcom parser is the following piece +of code (include of the gedcom header is assumed, as everywhere in this manual):
+ +
int result;
+ ...
+ result = gedcom_parse_file("myfamily.ged");
+
+ Although this will not provide much information, one thing it does is parse +the entire file and return the result.  The function returns 0 on success +and 1 on failure.  No other information is available using this function +only.
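As a minimal sketch of how an application might act on that result: the helper name report_parse_result below is my own invention, not part of the library, and the result is passed in as a plain integer so that the fragment compiles without gedcom.h.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (not part of the library): turns the 0/1 result
   described above into a process exit code and a one-line report. */
int report_parse_result(const char *filename, int result)
{
  if (result != 0) {
    fprintf(stderr, "Parsing %s failed\n", filename);
    return EXIT_FAILURE;
  }
  printf("Parsing %s succeeded\n", filename);
  return EXIT_SUCCESS;
}
```

In a real program, the second argument would simply be the return value of gedcom_parse_file("myfamily.ged").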
+
+The next sections will refine this to be able to have meaningful errors and +the actual data that is in the file.
+
+

Error handling

+Since this is a relatively simple topic, it is discussed before the actual +callback mechanism, although it also uses a callback...
+
+The library can be used in several different circumstances, both terminal-based +and GUI-based.  Therefore, it leaves the actual display of the error +message up to the application.  For this, the application needs to register +a callback before parsing the GEDCOM file, which will be called by the library +on errors, warnings and messages.
+
+A typical piece of code would be:
+
void my_message_handler (Gedcom_msg_type type, +char *msg)
+{
+  ...
+}
+...
+ gedcom_set_message_handler(my_message_handler);
+...
+result = gedcom_parse_file("myfamily.ged");

+
+In the above piece of code, my_message_handler is the callback +that will be called for errors (type=ERROR), warnings ( +type=WARNING) and messages (type=MESSAGE).  The +callback must have the signature as in the example.  For errors, the + msg passed to the callback will have the format:
+
Error on line <lineno>: <actual_message>
+
+Note that the entire string will be properly internationalized, and encoded +in UTF-8 (see "Why UTF-8?"  LINK TBD).  Also, no newline +is appended, so that the application program can use it in any way it wants. + Warnings are similar, but use "Warning" instead of "Error".  Messages +are plain text, without any prefix.
+
+With this in place, the resulting code will already show errors and warnings +produced by the parser, e.g. on the terminal if a simple printf + is used in the message handler.
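A possible body for such a handler is sketched below. The typedef is only a stand-in so that the fragment compiles on its own (with the library you would include gedcom.h instead, and the numeric values of the enum are an assumption); the error_count variable is hypothetical application bookkeeping.

```c
#include <stdio.h>

/* Stand-in for the declaration in gedcom.h, so this sketch compiles on its
   own.  The names come from the text above; the numeric values are assumed. */
typedef enum { ERROR, WARNING, MESSAGE } Gedcom_msg_type;

/* Hypothetical application bookkeeping, not part of the library. */
static int error_count = 0;

void my_message_handler (Gedcom_msg_type type, char *msg)
{
  if (type == ERROR)
    error_count++;
  /* The library appends no newline, so the handler adds one itself.
     Errors and warnings go to stderr, plain messages to stdout. */
  fprintf(type == MESSAGE ? stdout : stderr, "%s\n", msg);
}
```

After parsing, the application could check error_count to decide whether the file was read cleanly; a GUI program would collect the strings for a dialog instead of printing them.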
+
+

Data callback mechanism

+The most important use of the parser is of course to get the data out of +the GEDCOM file.  As already mentioned, the parser uses a callback mechanism +for that.  In fact, the mechanism involves two levels.
+
+The primary level is that each of the sections in a GEDCOM file is notified +to the application code via a "start element" callback and an "end element" +callback (much like in a SAX interface for XML), i.e. when a line containing +a certain tag is parsed, the "start element" callback is called for that +tag, and when all its subordinate lines with their tags have been processed, +the "end element" callback is called for the original tag.  Since GEDCOM +is hierarchical, this results in properly nested calls to appropriate "start +element" and "end element" callbacks.
+
+However, it would be typical for a genealogy program to support only a subset +of the GEDCOM standard, certainly a program that is still under development. + Moreover, under GEDCOM it is allowed for an application to define its +own tags, which will typically not  be supported by another application. + Still, in that case, data preservation is important; it would hardly +be accepted that information that is not understood by a certain program +is just removed.
+
+Therefore, the second level of callbacks involves a "default callback".  An +application needs to subscribe to callbacks for tags it does support, and +needs to provide a "default callback" which will be called for tags it doesn't +support.  The application can then choose to just store the information +that comes via the default callback in plain textual format.
+
+After this introduction, let's see what the API looks like...
+
+

Start and end callbacks

+

Callbacks for records
+

+As a simple example, we will get some information from the header of a GEDCOM +file.  First, have a look at the following piece of code:
+
Gedcom_ctxt my_header_start_cb (int level, +Gedcom_val xref, char *tag)
+{
+  printf("The header starts\n");
+  return (Gedcom_ctxt)1;
+}
+
+void my_header_end_cb (Gedcom_ctxt self)
+{
+  printf("The header ends, context is %d\n", (int)(long)self);   /* context +will print as "1" */
+}
+
+...
+ gedcom_subscribe_to_record(REC_HEAD, my_header_start_cb, my_header_end_cb);
+...
+result = gedcom_parse_file("myfamily.ged");

+
+ Using the gedcom_subscribe_to_record function, the application +requests to use the specified callbacks as start and end callback. The end +callback is optional: you can pass NULL if you are not interested +in the end callback.  The identifiers to use as first argument to the +function (here REC_HEAD) are described in TBD (use the header +file for now...).
+
+From the name of the function it becomes clear that this function is specific +to complete records.  For the separate elements in records there is +another function, which we'll see shortly.  Again, the callbacks need +to have the signatures as shown in the example.
+
+The Gedcom_ctxt type that is used as a result of the start callback +and as an argument to the end callback is vital for passing context necessary +for the application.  This type is meant to be opaque; in fact, it's +a void pointer, so you can pass anything via it.  The important thing +to know is that the context that the application returns in the start callback +will be passed in the end callback as an argument, and as we will see shortly, +also to all the directly subordinate elements of the record.
+
+The example passes a simple integer as context, but an application could +e.g. pass a struct that will contain the information for the +header.  In the end callback, the application could then e.g. do some +finalizing operations on the struct to put it in its database.
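A sketch of that variant follows. The two typedefs are stand-ins so that the fragment compiles without gedcom.h (the text above only says that Gedcom_ctxt is a void pointer; the definition of Gedcom_val here is an assumption), and struct header_info is a hypothetical application type.

```c
#include <stdlib.h>

/* Stand-ins so this sketch compiles without gedcom.h. */
typedef void *Gedcom_ctxt;
typedef void *Gedcom_val;   /* assumed; the real type is opaque */

/* Hypothetical application struct collecting the header data. */
struct header_info {
  int complete;             /* set once the whole header has been seen */
};

Gedcom_ctxt my_header_start_cb (int level, Gedcom_val xref, char *tag)
{
  struct header_info *hdr = calloc(1, sizeof *hdr);
  (void)level; (void)xref; (void)tag;   /* unused in this sketch */
  return (Gedcom_ctxt)hdr;              /* NULL if the allocation failed */
}

void my_header_end_cb (Gedcom_ctxt self)
{
  struct header_info *hdr = (struct header_info *)self;
  if (hdr != NULL)
    hdr->complete = 1;      /* finalizing step; the application owns hdr */
}
```

The struct stays owned by the application, so it is up to the application to store it in its database or free it once the end callback has run.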
+
+(Note that the Gedcom_val type for the xref argument +was not discussed, see further for this)
+
+

Callbacks for elements

+We will now retrieve the SOUR field (the name of the program that wrote the +file) from the header:
+
Gedcom_ctxt my_header_source_start_cb(Gedcom_ctxt +parent,
+                      +                int     +    level,
+                      +                char*     +  tag,
+                      +                char*     +  raw_value,
+                      +                Gedcom_val  parsed_value)
+{
+  char *source = GEDCOM_STRING(parsed_value);
+  printf("This file was written by %s\n", source);
+  return parent;
+}
+
+void my_header_source_end_cb(Gedcom_ctxt parent,
+                      +       Gedcom_ctxt self,
+                      +       Gedcom_val  parsed_value)
+{
+  printf("End of the source description\n");
+}
+
+...
+ gedcom_subscribe_to_element(ELT_HEAD_SOUR,
+                      +      my_header_source_start_cb,
+                      +      my_header_source_end_cb);
+...
+result = gedcom_parse_file("myfamily.ged");

+
+The subscription mechanism for elements is similar, only the signatures of +the callbacks differ.  The signature for the start callback shows that +the context of the parent line (e.g. the struct that describes +the header) is passed to this start callback.  Here the callback itself returns +the same context, but it could of course return its own context object. + The end callback is called with both the context of the parent and +its own context, which will be the same in the example.
+
+If we look at the other arguments of the start callback, we see the level +number (the initial number of the line in the GEDCOM file), the tag (e.g. +"SOUR"), and then a raw value and a parsed value.  The raw value is +just the raw string that occurs as value on the line next to the tag (in +UTF-8 encoding).  The parsed value is the meaningful value that is parsed +from that raw string.
+
+The Gedcom_val type is meant to be an opaque type.  The +only thing that needs to be known about it is that it can contain specific +data types, which have to be retrieved from it using pre-defined macros. + Currently, the specific types are (with val of type +Gedcom_val):
+
+ + + + + + + + + + + + + + + + + + + + + + + +

+
type checker
+
cast operator
+
null value
+
GEDCOM_IS_NULL(val)
+
N/A
+
string
+
GEDCOM_IS_STRING(val)
+
char* str = GEDCOM_STRING(val);
+
date
+
GEDCOM_IS_DATE(val)
+
struct date_value dv = GEDCOM_DATE(val) +;
+
+
+The null value is used when the GEDCOM spec doesn't allow a value, or +when an optional value is allowed but none is given.

+The string value is currently the most generally used value, for all those +values that don't have a more specific meaning.  In essence, the value +that is returned by GEDCOM_STRING is always the same as the raw_value passed +to the start callback, and is thus in fact redundant.
+
+The date value is used for all elements that return a date.  (Description +of struct date_value TBD: look in the header file for the moment).
+
+The type checker returns a true or a false value according to the type of +the value, but this is in principle only necessary in the rare circumstances +that two types are possible, or where an optional value can be provided. + In most cases, the type is fixed for a specific tag (types per tag +to be described).
+
+Some extra notes:
+
    +
  • The Gedcom_val argument of the end callback +is currently not used.  It is there for future enhancements.
  • +
  • There is also a Gedcom_val argument in the start +callback for records.  This argument is currently a string value giving +the pointer in string form.
  • +
+

Default callbacks
+

+TO BE COMPLETED
+
$Id$
+ $Name$
+
+ + + -- 2.30.2