X-Git-Url: https://git.dlugolecki.net.pl/?a=blobdiff_plain;f=doc%2Fusage.html;h=0a8c7e85c5ece5d1352d099348de28d8366d81f0;hb=3cea0be54e7466a949a91904e5e8e8858fc1ee50;hp=2843d7c528b8c8fa5765eab45601b3fc8407b5c7;hpb=8c92a223c34fbd674f26520fb990c64a7b2f9147;p=gedcom-parse.git diff --git a/doc/usage.html b/doc/usage.html index 2843d7c..0a8c7e8 100644 --- a/doc/usage.html +++ b/doc/usage.html @@ -1,435 +1,746 @@ - - - - Using the GEDCOM parser library - - - - - +Using the GEDCOM parser library + + + +

Using the GEDCOM parser library

-
- +
+

Index

- +
  • Other API functions
  • + -
  • Interface details
    -
  • - +
  • Converting character sets
  • +
  • Support for configure.in
    +
    +
  • +
  • Interface details of the callback parser
  • C object model
    +
  • + + - -
    + +

    Overview
    -

    - The GEDCOM parser library is built as a callback-based parser (comparable - to the SAX interface of XML).  It comes with:
    - + The GEDCOM +parser library provides two interfaces.  On the one hand, it can be +used as a callback-based parser (comparable to the SAX interface of +XML); on the other hand, the parser can be used to convert the GEDCOM file +into an object model (comparable to the DOM interface of XML).  It comes +with:
    + Additionally, if you want to use the GEDCOM C object model, the following should be used (note that libgedcom.so is also needed in this case, because the object model uses the callback parser internally):
    + - Next to these, there is also a data directory in $PREFIX/share/gedcom-parse - that contains some additional stuff, but which is not immediately important - at first.  I'll leave the description of the data directory for later.
    -
    - The very simplest call of the gedcom parser is simply the following piece - of code (include of the gedcom header is assumed, as everywhere in this -manual):
    - -
    int result;
    - ...
    - result = gedcom_parse_file("myfamily.ged");
    -
    - Although this will not provide much information, one thing it does is -parse the entire file and return the result.  The function returns 0 -on success and 1 on failure.  No other information is available using -this function only.
    -
    - The next sections will refine this to be able to have meaningful errors - and the actual data that is in the file.
    - -
    -

    Error handling

    - Since this is a relatively simple topic, it is discussed before the actual - callback mechanism, although it also uses a callback...
    -
    - The library can be used in several different circumstances, both terminal-based - as GUI-based.  Therefore, it leaves the actual display of the error -message up to the application.  For this, the application needs to register -a callback before parsing the GEDCOM file, which will be called by the library - on errors, warnings and messages.
    -
    - A typical piece of code would be:
    - -
    void my_message_handler (Gedcom_msg_type type, - char *msg)
    - {
    -   ...
    - }
    - ...
    - gedcom_set_message_handler(my_message_handler);
    - ...
    - result = gedcom_parse_file("myfamily.ged");

    -
    - In the above piece of code, my_message_handler is the callback - that will be called for errors (type=ERROR), warnings ( - type=WARNING) and messages (type=MESSAGE).  The -callback must have the signature as in the example.  For errors, the - msg passed to the callback will have the format:
    + -
    Error on line <lineno>: <actual_message>
    -
    - Note that the entire string will be properly internationalized, and encoded - in UTF-8 (see "Why UTF-8?"  LINK TBD).  Also, no newline - is appended, so that the application program can use it in any way it wants. -  Warnings are similar, but use "Warning" instead of "Error".  Messages - are plain text, without any prefix.
    -
    - With this in place, the resulting code will already show errors and warnings - produced by the parser, e.g. on the terminal if a simple printf - is used in the message handler.
    - -
    -

    Data callback mechanism

    - The most important use of the parser is of course to get the data out -of the GEDCOM file.  As already mentioned, the parser uses a callback -mechanism for that.  In fact, the mechanism involves two levels.
    -
    - The primary level is that each of the sections in a GEDCOM file is notified - to the application code via a "start element" callback and an "end element" - callback (much like in a SAX interface for XML), i.e. when a line containing - a certain tag is parsed, the "start element" callback is called for that -tag, and when all its subordinate lines with their tags have been processed, -the "end element" callback is called for the original tag.  Since GEDCOM - is hierarchical, this results in properly nested calls to appropriate "start - element" and "end element" callbacks.
    -
    - However, it would be typical for a genealogy program to support only a -subset of the GEDCOM standard, certainly a program that is still under development. -  Moreover, under GEDCOM it is allowed for an application to define its - own tags, which will typically not  be supported by another application. -  Still, in that case, data preservation is important; it would hardly - be accepted that information that is not understood by a certain program -is just removed.
    -
    - Therefore, the second level of callbacks involves a "default callback". -  An application needs to subscribe to callbacks for tags it does support, - and need to provide a "default callback" which will be called for tags it - doesn't support.  The application can then choose to just store the -information that comes via the default callback in plain textual format.
    -
    - After this introduction, let's see what the API looks like...
    + + + Next to these, there is also a data directory in $PREFIX/share/gedcom-parse + that contains some additional files, which are not + important at first.  I'll leave the description of the data directory + for later.

    - -

    Start and end callbacks

    - -

    Callbacks for records
    -

    - As a simple example, we will get some information from the header of a -GEDCOM file.  First, have a look at the following piece of code:
    - -
    Gedcom_ctxt my_header_start_cb (int level, - Gedcom_val xref, char *tag, int parsed_tag)
    - {
    -   printf("The header starts\n");
    -   return (Gedcom_ctxt)1;
    - }
    -
    - void my_header_end_cb (Gedcom_ctxt self)
    - {
    -   printf("The header ends, context is %d\n", self);   /* context - will print as "1" */
    - }
    -
    - ...
    - gedcom_subscribe_to_record(REC_HEAD, my_header_start_cb, - my_header_end_cb);
    - ...
    - result = gedcom_parse_file("myfamily.ged");

    -
    - Using the gedcom_subscribe_to_record function, the application - requests to use the specified callbacks as start and end callback. The end - callback is optional: you can pass NULL if you are not interested - in the end callback.  The identifiers to use as first argument to the - function (here REC_HEAD) are described in the - interface details.
    -
    - From the name of the function it becomes clear that this function is specific - to complete records.  For the separate elements in records there is -another function, which we'll see shortly.  Again, the callbacks need -to have the signatures as shown in the example.
    -
    - The Gedcom_ctxt type that is used as a result of the start - callback and as an argument to the end callback is vital for passing context - necessary for the application.  This type is meant to be opaque; in -fact, it's a void pointer, so you can pass anything via it.  The important - thing to know is that the context that the application returns in the start - callback will be passed in the end callback as an argument, and as we will - see shortly, also to all the directly subordinate elements of the record.
    -
    -The tag is the GEDCOM tag in string format, the parsed_tag - is an integer, for which symbolic values are defined as TAG_HEAD, - TAG_SOUR, TAG_DATA, ... and USERTAG -for the application-specific tags.  These values are defined in the -header gedcom-tags.h that is installed, and included via -gedcom.h (so no need to include gedcom-tags.h yourself).
    -
    - The example passes a simple integer as context, but an application could - e.g. pass a struct that will contain the information for the - header.  In the end callback, the application could then e.g. do some - finalizing operations on the struct to put it in its database.
    -
    - (Note that the Gedcom_val type for the xref -argument was not discussed, see further for this)
    -
    - -

    Callbacks for elements

    - We will now retrieve the SOUR field (the name of the program that wrote - the file) from the header:
    - -
    Gedcom_ctxt my_header_source_start_cb(Gedcom_ctxt - parent,
    -                     -                  int   -      level,
    -                     -                  char*   -    tag,
    -                     -                  char*   -    raw_value,
    -                      -                int     -    parsed_tag,
    -                     -                  Gedcom_val - parsed_value)
    - {
    -   char *source = GEDCOM_STRING(parsed_value);
    -   printf("This file was written by %s\n", source);
    -   return parent;
    - }
    -
    - void my_header_source_end_cb(Gedcom_ctxt parent,
    -                     -         Gedcom_ctxt self,
    -                     -         Gedcom_val  parsed_value)
    - {
    -   printf("End of the source description\n");
    - }
    -
    - ...
    - gedcom_subscribe_to_element(ELT_HEAD_SOUR,
    -                     -        my_header_source_start_cb,
    -                     -        my_header_source_end_cb);
    - ...
    - result = gedcom_parse_file("myfamily.ged");

    -
    - The subscription mechanism for elements is similar, only the signatures - of the callbacks differ.  The signature for the start callback shows - that the context of the parent line (e.g. the struct that describes - the header) is passed to this start callback.  The callback itself returns - here the same context, but this can be its own context object of course. - The end callback is called with both the context of the parent and -the context of itself, which will be the same in the example.  Again, -the list of identifiers to use as a first argument for the subscription function -are detailed in the interface -details .
    -
    - If we look at the other arguments of the start callback, we see the level - number (the initial number of the line in the GEDCOM file), the tag (e.g. - "SOUR"), and then a raw value, a parsed tag and a parsed value.  The -raw value is just the raw string that occurs as value on the line next to -the tag (in UTF-8 encoding).  The parsed value is the meaningful value -that is parsed from that raw string.  The parsed tag is described in -the section for record callbacks.
    -
    - The Gedcom_val type is meant to be an opaque type.  The - only thing that needs to be known about it is that it can contain specific - data types, which have to be retrieved from it using pre-defined macros. - These data types are described in the - interface details.
    -
    - Some extra notes:
    - - - -

    Default callbacks
    -

    - As described above, an application doesn't always implement the entire -GEDCOM spec, and application-specific tags may have been added by other applications. -  To preserve this extra data anyway, a default callback can be registered - by the application, as in the following example:
    - -
    void my_default_cb (Gedcom_ctxt parent, - int level, char* tag, char* raw_value, int parsed_tag)
    - {
    -   ...
    - }
    -
    + The simplest call of the gedcom callback parser is the following + piece of code (inclusion of the gedcom.h header is assumed, as everywhere +in this manual):
    + +
    int result;
    ...
    - gedcom_set_default_callback(my_default_cb);
    + gedcom_init();
    + ...
    + result = gedcom_parse_file("myfamily.ged");
    +
    + Although this will not provide much information, one thing it does + is parse the entire file and return the result.  The function returns + 0 on success and 1 on failure.  No other information is available +using this function only.
    +
    +Alternatively, programs using the C object model should use the following (in this case, the inclusion of both gedcom.h and gom.h is required):
    + +
    int result;
    ...
    - result = gedcom_parse_file("myfamily.ged");

    + gedcom_init();
    + ...
    + result = gom_parse_file("myfamily.ged");
    +
    +The call to gom_parse_file will build the C object model, which is then a complete representation of the GEDCOM file.
    +
    +No matter which of the interfaces you use, the call to gedcom_init() should be one of the first calls +in your program.  The requirement is that it should come before the first +call to iconv_open (part of the generic character set conversion +feature) in the program, either by your program itself, or indirectly by +the library calls it makes.  Practically, it should e.g. come before + any calls to any GTK functions, because GTK uses iconv_open + in its initialization.

    +For the same reason it is also advised to put +the -lgedcom option +on the linking of the program as the last option, so that its initialization +code is run first.  In the case of using the C object model, the linking +options should be: -lgedcom_gom -lgedcom
    +
    The function gedcom_init() also initializes locale handling by calling setlocale(LC_ALL, ""), in case the application would not do this (it doesn't hurt for the application to do the same).

    +The next sections will refine this piece of code to be able to have + meaningful errors and the actual data that is in the file.
    + +
    +

    Error handling

    The library can be used in several different circumstances, both +terminal-based and GUI-based.  Therefore, it leaves the actual display +of the error message up to the application.  For this, the application +needs to register a callback before parsing the GEDCOM file, which will +be called by the library on errors, warnings and messages.
    +
    + A typical piece of code would be (gom_parse_file would be called in case the C object model is used):
    + +
    void my_message_handler (Gedcom_msg_type type, + char *msg)
    + {
    +   ...
    + }
    + ...
    + gedcom_set_message_handler(my_message_handler);
    + ...
    + result = gedcom_parse_file("myfamily.ged");

    +
    + In the above piece of code, my_message_handler is the + callback that will be called for errors (type=ERROR), warnings + (type=WARNING) and messages (type=MESSAGE).  The + callback must have the signature as in the example.  For errors, +the msg passed to the callback will have the format:
    + +
    Error on line <lineno>: <actual_message>
    - This callback has a similar signature as the previous ones, -but it doesn't contain a parsed value.  However, it does contain the -parent context, that was returned by the application for the most specific -containing tag that the application supported.
    + Note that the entire string will be properly internationalized, and + encoded in UTF-8 (Why UTF-8?).  Also, +no newline is appended, so that the application program can use it in any +way it wants.  Warnings are similar, but use "Warning" instead of "Error". + Messages are plain text, without any prefix.

    - Suppose e.g. that this callback is called for some tags in the header that - are specific to some other application, then our application could make -sure that the parent context contains the struct or object that represents -the header, and use the default callback here to add the level, tag and -raw_value as plain text in a member of that struct or object, thus preserving -the information.  The application can then write this out when the -data is saved again in a GEDCOM file.  To make it more specific, consider -the following example:
    - -
    struct header {
    -   char* source;
    -   ...
    -   char* extra_text;
    - };
    -
    - Gedcom_ctxt my_header_start_cb(int level, Gedcom_val xref, char* tag, int -parsed_tag)
    - {
    -   struct header head = my_make_header_struct();
    -   return (Gedcom_ctxt)head;
    - }
    + With this in place, the resulting code will already show errors and + warnings produced by the parser, e.g. on the terminal if a simple + printf is used in the message handler.
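    A message handler along these lines could look as follows. This is only a minimal sketch: the enum is a local stand-in for the Gedcom_msg_type declared in gedcom.h (so that the fragment compiles without the library), and the two counters are hypothetical additions that are not part of the library.

```c
#include <assert.h>
#include <stdio.h>

/* Stand-in for the type from gedcom.h (assumption: same three values). */
typedef enum { ERROR, WARNING, MESSAGE } Gedcom_msg_type;

/* Hypothetical bookkeeping kept by the application. */
int error_count = 0;
int warning_count = 0;

void my_message_handler(Gedcom_msg_type type, char *msg)
{
  assert(msg != NULL);

  if (type == ERROR)
    error_count++;
  else if (type == WARNING)
    warning_count++;

  /* The library appends no newline, so the application adds one.
     Errors and warnings go to stderr, plain messages to stdout. */
  fprintf(type == MESSAGE ? stdout : stderr, "%s\n", msg);
}
```

    In a real program the stand-in enum is dropped, gedcom.h is included instead, and the handler is registered with gedcom_set_message_handler before parsing.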
    + +
    +

    Data callback mechanism

    + The most important use of the parser is of course to get the data +out of the GEDCOM file.  This section focuses on the callback mechanism (see here for the C object model).  In fact, the mechanism involves two levels.
    +
    + The primary level is that each of the sections in a GEDCOM file is + notified to the application code via a "start element" callback and an + "end element" callback (much like in a SAX interface for XML), i.e. when + a line containing a certain tag is parsed, the "start element" callback + is called for that tag, and when all its subordinate lines with their +tags have been processed, the "end element" callback is called for the +original tag.  Since GEDCOM is hierarchical, this results in properly +nested calls to appropriate "start element" and "end element" callbacks.
    +
    + However, it would be typical for a genealogy program to support only + a subset of the GEDCOM standard, certainly a program that is still under + development.  Moreover, under GEDCOM it is allowed for an application + to define its own tags, which will typically not  be supported by +another application.  Still, in that case, data preservation is important; + it would hardly be accepted that information that is not understood by + a certain program is just removed.
    +
    + Therefore, the second level of callbacks involves a "default callback". +  An application needs to subscribe to callbacks for tags it does support, + and needs to provide a "default callback" which will be called for tags +it doesn't support.  The application can then choose to just store +the information that comes via the default callback in plain textual format.
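    To make the nesting of "start element" and "end element" notifications concrete, the following sketch derives the event order from a list of (level, tag) pairs. It is a stand-alone model and not the library's parser: trace_nesting, append_event and MAX_DEPTH are hypothetical names, and the input is assumed to be well-formed (each level is at most one deeper than the previous line).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_DEPTH 100  /* hypothetical limit for this sketch */

/* Append "<kind> <tag>;" to out, staying within outsz bytes. */
static void append_event(char *out, size_t outsz,
                         const char *kind, const char *tag)
{
  size_t used = strlen(out);
  if (used + 1 < outsz)
    snprintf(out + used, outsz - used, "%s %s;", kind, tag);
}

/* Write the start/end events for n lines, given as levels[] and tags[]. */
void trace_nesting(const int *levels, const char *const *tags, int n,
                   char *out, size_t outsz)
{
  const char *open_tags[MAX_DEPTH];
  int depth = 0;
  int i;

  assert(outsz > 0);
  out[0] = '\0';

  for (i = 0; i < n; i++) {
    assert(levels[i] >= 0 && levels[i] <= depth && levels[i] < MAX_DEPTH);

    /* A line at this level ends every open tag at the same or deeper level. */
    while (depth > levels[i])
      append_event(out, outsz, "end", open_tags[--depth]);

    open_tags[depth++] = tags[i];
    append_event(out, outsz, "start", tags[i]);
  }

  /* The end of the input ends whatever is still open. */
  while (depth > 0)
    append_event(out, outsz, "end", open_tags[--depth]);
}
```

    For the lines 0 HEAD, 1 SOUR, 2 VERS, 1 DEST this yields the sequence start HEAD, start SOUR, start VERS, end VERS, end SOUR, start DEST, end DEST, end HEAD.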
    +
    + After this introduction, let's see what the API looks like...
    +
    + +

    Start and end callbacks

    + +

    Callbacks for records
    +

    + As a simple example, we will get some information from the header +of a GEDCOM file.  First, have a look at the following piece of code:
    + +
    Gedcom_ctxt my_header_start_cb (Gedcom_rec rec,
    +                                int level,
    +                     +            Gedcom_val xref,
    +                     +            char *tag,
    +                     +            char *raw_value,
    +                     +            int parsed_tag,
    +                     +            Gedcom_val parsed_value)
    + {
    +   printf("The header starts\n");
    +   return (Gedcom_ctxt)1;
    + }

    - void my_default_cb(Gedcom_ctxt parent, int level, char* tag, char* raw_value, -int parsed_tag)
    - {
    -   struct header head = (struct header)parent;
    -   my_header_add_to_extra_text(head, level, tag, raw_value);
    - }
    + void my_header_end_cb (Gedcom_rec rec, Gedcom_ctxt self)
    + {
    +   printf("The header ends, context is %d\n", (int)self);   + /* context will print as "1" */
    + }

    - gedcom_set_default_callback(my_default_cb);
    - gedcom_subscribe_to_record(REC_HEAD, my_header_start, NULL);
    - ...
    - result = gedcom_parse_file(filename);

    + ...
    + gedcom_subscribe_to_record(REC_HEAD, my_header_start_cb, + my_header_end_cb);
    + ...
    + result = gedcom_parse_file("myfamily.ged");

    - Note that the default callback will be called for any tag that isn't specifically - subscribed upon by the application, and can thus be called in various contexts. -  For simplicity, the example above doesn't take this into account (the - parent could be of different types, depending -on the context).
    - -
    -

    Other API functions
    -

    - Although the above describes the basic interface of libgedcom, there are - some other functions that allow to customize the behaviour of the library. -  These will be explained in the current section.
    - -

    Debugging

    - The library can generate various debugging output, not only from itself, - but also the debugging output generated by the yacc parser.  By default, - no debugging output is generated, but this can be customized using the following - function:
    - -
    void gedcom_set_debug_level (int level, - FILE* trace_output)
    + Using the gedcom_subscribe_to_record function, the + application requests to use the specified callbacks as start and end +callback. The end callback is optional: you can pass NULL + if you are not interested in the end callback.  The identifiers +to use as first argument to the function (here REC_HEAD) +are described in the interface +details .  These are also passed as first argument in the callbacks (the Gedcom_rec argument).
    +
    + From the name of the function it becomes clear that this function +is specific to complete records.  For the separate elements in records + there is another function, which we'll see shortly.  Again, the callbacks + need to have the signatures as shown in the example.
    +
    + The Gedcom_ctxt type that is used as a result of the +start callback and as an argument to the end callback is vital for passing +context necessary for the application.  This type is meant to be opaque; +in fact, it's a void pointer, so you can pass anything via it.  The +important thing to know is that the context that the application returns +in the start callback will be passed in the end callback as an argument, +and as we will see shortly, also to all the directly subordinate elements +of the record.
    +
    + The tag is the GEDCOM tag in string format, the parsed_tag + is an integer, for which symbolic values are defined as TAG_HEAD, + TAG_SOUR, TAG_DATA, ... and USERTAG + for the application-specific tags.  These values +are defined in the header gedcom-tags.h that is installed, +and included via gedcom.h (so no need to include gedcom-tags.h + yourself).
    +
    + The example passes a simple integer as context, but an application + could e.g. pass a struct (or an object in a C++ application) + that will contain the information for the header.  In the end callback, + the application could then e.g. do some finalizing operations on the + struct to put it in its database.
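    As an illustration of passing a struct instead of an integer, the sketch below shows only the context handling that the callbacks would perform. The typedef is a stand-in for the real Gedcom_ctxt (which, as said above, is a void pointer); struct header_info and the three helper functions are hypothetical names and not part of the library.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for the type from gedcom.h, so the sketch compiles alone. */
typedef void *Gedcom_ctxt;

/* Hypothetical application data collected for the header. */
struct header_info {
  int elements_seen;
};

/* What a record start callback could return as its context. */
Gedcom_ctxt header_ctxt_new(void)
{
  struct header_info *info = calloc(1, sizeof *info);
  return (Gedcom_ctxt)info;  /* NULL if the allocation failed */
}

/* What an element start callback could do with the parent context. */
void header_ctxt_note_element(Gedcom_ctxt parent)
{
  struct header_info *info = parent;
  assert(info != NULL);
  info->elements_seen++;
}

/* What the record end callback could do with its own context:
   finalize, release the memory, and report what was collected. */
int header_ctxt_finish(Gedcom_ctxt self)
{
  struct header_info *info = self;
  int seen;

  assert(info != NULL);
  seen = info->elements_seen;
  free(info);
  return seen;
}
```

    Because the context is an opaque pointer for the library, ownership stays with the application: whatever the start callback allocates has to be released by the application, for instance in the end callback as done here.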
    +
    + (Note that the Gedcom_val type for the xref + and parsed_value arguments was not discussed, see further + for this)
    +
    + +

    Callbacks for elements

    + We will now retrieve the SOUR field (the name of the program that +wrote the file) from the header:
    + +
    Gedcom_ctxt my_header_source_start_cb(Gedcom_elt  elt,
    +                      +                Gedcom_ctxt + parent,
    +                     +                   int   +       level,
    +                     +                   char*   +     tag,
    +                     +                   char*   +     raw_value,
    +                     +                   int   +       parsed_tag,
    +                     +                   Gedcom_val +  parsed_value)
    + {
    +   char *source = GEDCOM_STRING(parsed_value);
    +   printf("This file was written by %s\n", source);
    +   return parent;
    + }
    +
    + void my_header_source_end_cb(Gedcom_elt  elt,
    +                             Gedcom_ctxt parent,
    +                     +          Gedcom_ctxt self,
    +                     +          Gedcom_val  parsed_value)
    + {
    +   printf("End of the source description\n");
    + }
    +
    + ...
    + gedcom_subscribe_to_element(ELT_HEAD_SOUR,
    +                     +         my_header_source_start_cb,
    +                     +         my_header_source_end_cb);
    + ...
    + result = gedcom_parse_file("myfamily.ged");

    - The level can be one of the following values:
    - -
      -
    • 0:  no debugging information (this is the default)
    • -
    • 1:  only debugging information from libgedcom - itself
    • -
    • 2:  debugging information from libgedcom and -yacc
    • - -
    - If the trace_output is NULL, debugging information - will be written to stderr, otherwise the given file handle -is used (which must be open).
    + The subscription mechanism for elements is similar, only the signatures + of the callbacks differ.  The signature for the start callback shows + that the context of the parent line (here e.g. the struct + that describes the header) is passed to this start callback.  In + this example the callback itself returns the same context, but it could of course +return its own context object.  The end callback is called +with both the parent's context and its own context, which in this +example are the same.  Again, the list of identifiers to use as +a first argument for the subscription function is detailed in the interface details.  These are also passed as first argument in the callback (the Gedcom_elt argument).
    +
    + If we look at the other arguments of the start callback, we see the + level number (the initial number of the line in the GEDCOM file), the tag + (e.g. "SOUR"), and then a raw value, a parsed tag and a parsed value.  The + raw value is just the raw string that occurs as value on the line next +to the tag (in UTF-8 encoding).  The parsed value is the meaningful +value that is parsed from that raw string.  The parsed tag is described +in the section for record callbacks above.

    - -

    Error treatment

    - One of the previous sections already described the callback to be registered - to get error messages.  The library also allows to customize what happens - on an error, using the following function:
    - -
    void gedcom_set_error_handling (Gedcom_err_mech + The Gedcom_val type is meant to be an opaque type.  The + only thing that needs to be known about it is that it can contain specific + data types, which have to be retrieved from it using pre-defined macros. +  These data types are described in the interface details. +
    +
    + Some extra notes:
    + + +
      +
    • The Gedcom_val argument of the end callback + is currently not used.  It is there for future enhancements.
    • +
    • There are also two Gedcom_val arguments + in the start callback for records.  The first one (xref + ) contains the xref_value corresponding to the cross-reference + (or NULL if there isn't one), the second one (parsed_value + ) contains the value that is parsed from the raw_value.  See + the interface details + .
    • + + +
    + + +

    Default callbacks
    +

    + As described above, an application doesn't always implement the entire + GEDCOM spec, and application-specific tags may have been added by other + applications.  To preserve this extra data anyway, a default callback + can be registered by the application, as in the following example:
    + +
    void my_default_cb (Gedcom_elt elt, Gedcom_ctxt parent, int level, + char* tag, char* raw_value, int parsed_tag)
    + {
    +   ...
    + }
    +
    + ...
    + gedcom_set_default_callback(my_default_cb);
    + ...
    + result = gedcom_parse_file("myfamily.ged");

    +
    + This callback has a similar signature as the previous ones, + but it doesn't contain a parsed value.  However, it does contain the + parent context, that was returned by the application for the most specific + containing tag that the application supported.
    +
    + Suppose e.g. that this callback is called for some tags in the header + that are specific to some other application, then our application could + make sure that the parent context contains the struct or object that represents + the header, and use the default callback here to add the level, tag and + raw_value as plain text in a member of that struct or object, thus preserving + the information.  The application can then write this out when the +data is saved again in a GEDCOM file.  To make it more specific, consider + the following example:
    + +
    struct header {
    +   char* source;
    +   ...
    +   char* extra_text;
    + };
    +
    + Gedcom_ctxt my_header_start_cb(Gedcom_rec rec, int level, Gedcom_val xref, char* tag, + char *raw_value,
    +                     +           int parsed_tag, Gedcom_val parsed_value)
    + {
    +   struct header* head = my_make_header_struct();
    +   return (Gedcom_ctxt)head;
    + }
    +
    + void my_default_cb(Gedcom_elt elt, Gedcom_ctxt parent, int level, char* tag, char* +raw_value, int parsed_tag)
    + {
    +   struct header* head = (struct header*)parent;
    +   my_header_add_to_extra_text(head, level, tag, raw_value);
    + }
    +
    + gedcom_set_default_callback(my_default_cb);
    + gedcom_subscribe_to_record(REC_HEAD, my_header_start_cb, NULL);
    + ...
    + result = gedcom_parse_file(filename);

    +
    + Note that the default callback will be called for any tag that isn't + specifically subscribed to by the application, and can thus be called + in various contexts.  For simplicity, the example above doesn't take + this into account (the parent could be of different + types, depending on the context).
    +
    + Note also that the default callback is not called when the parent context + is NULL.  This is e.g. the case if none + of the "upper" tags has been subscribed to.
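    The helper my_header_add_to_extra_text used in the example above is left to the application. One possible implementation is sketched here, under the assumption that the extra text is kept as a single heap-allocated string with one "level tag value" line per call; the return value convention (0 on success, -1 on failure) is likewise an assumption of this sketch.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct header {
  char *source;
  char *extra_text;  /* NULL until the first unsupported tag arrives */
};

/* Append one GEDCOM-like line to head->extra_text.
   Returns 0 on success, -1 if formatting or allocation fails. */
int my_header_add_to_extra_text(struct header *head, int level,
                                const char *tag, const char *raw_value)
{
  int has_value;
  size_t old_len;
  int add_len;
  char *grown;

  assert(head != NULL && tag != NULL);

  has_value = (raw_value != NULL && raw_value[0] != '\0');
  old_len = head->extra_text ? strlen(head->extra_text) : 0;

  /* First pass: compute how many bytes the new line needs. */
  add_len = has_value
    ? snprintf(NULL, 0, "%d %s %s\n", level, tag, raw_value)
    : snprintf(NULL, 0, "%d %s\n", level, tag);
  if (add_len < 0)
    return -1;

  grown = realloc(head->extra_text, old_len + (size_t)add_len + 1);
  if (grown == NULL)
    return -1;  /* the old text is still valid and untouched */

  /* Second pass: write the line after the existing text. */
  if (has_value)
    snprintf(grown + old_len, (size_t)add_len + 1, "%d %s %s\n",
             level, tag, raw_value);
  else
    snprintf(grown + old_len, (size_t)add_len + 1, "%d %s\n", level, tag);

  head->extra_text = grown;
  return 0;
}
```

    Storing the lines with their level numbers keeps enough information to write the unsupported tags back in place when the data is saved to a GEDCOM file again.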
    + + +
    + +

    Other API functions
    +

    + + Although the above describes the basic interface of the gedcom parser, there + are some other functions that allow customizing the behaviour of the library. +  These will be explained in the current section.
    + + +

    Debugging

    + The library can generate various debugging output, not only from itself, + but also the debugging output generated by the yacc parser.  By default, + no debugging output is generated, but this can be customized using the +following function:
    + + +
    void gedcom_set_debug_level (int level, FILE* +trace_output)
    +
    + The level can be one of the following values:
    + + +
      +
    • 0:  no debugging information (this is the + default)
    • +
    • 1:  only debugging information from libgedcom + itself
    • +
    • 2:  debugging information from libgedcom + and yacc
    • + + +
    + If the trace_output is NULL, debugging information + will be written to stderr, otherwise the given file handle + is used (which must be open).
    +
    + + +

    Error treatment

    + One of the previous sections already described the callback to be +registered to get error messages.  The library also allows customizing +what happens on an error, using the following function:
    + + +
    void gedcom_set_error_handling (Gedcom_err_mech mechanism)
    -
    - The mechanism can be one of:
    - -
      -
    • IMMED_FAIL: immediately fail the parsing - on an error (this is the default)
    • -
    • DEFER_FAIL: continue parsing after -an error, but return a failure code eventually
    • -
    • IGNORE_ERRORS: continue parsing after - an error, return success always
    • - -
    - This doesn't influence the generation of error or warning messages, only - the behaviour of the parser and its return code.
    -
    - -

    Compatibility mode
    -

    - Applications are not necessarily true to the GEDCOM spec (or use a different - version than 5.5).  The intention is that the library is resilient -to this, and goes in compatibility mode for files written by specific programs - (detected via the HEAD.SOUR tag).  This compatibility mode can be enabled - and disabled via the following function:
    - -
    void gedcom_set_compat_handling - (int enable_compat)
    -
    - The argument can be:
    +
    + The mechanism can be one of:
    + + +
      +
    • IMMED_FAIL: immediately fail +the parsing on an error (this is the default)
    • +
    • DEFER_FAIL: continue parsing +after an error, but return a failure code eventually
    • +
    • IGNORE_ERRORS: continue parsing + after an error, return success always
    • + + +
    + This doesn't influence the generation of error or warning messages, + only the behaviour of the parser and its return code.
    +
    + + +

    Compatibility mode
    +

    + Applications are not necessarily true to the GEDCOM spec (or use a different version than 5.5).  The intention is that the library is resilient to this, and goes into compatibility mode for files written by specific programs (detected via the HEAD.SOUR tag).  This compatibility mode can be enabled and disabled via the following function:
    + + +
    void gedcom_set_compat_handling (int enable_compat)
    +
    + The argument can be:
    + + +
      +
    • 0: disable compatibility mode
    • +
    • 1: allow compatibility mode (this is the +default)
      +
    • + + +
    + Currently, there is initial compatibility support for ftree and Lifelines (3.0.2).
    + +
    +

    Converting character sets

    + All strings passed by the GEDCOM parser to the application are in UTF-8 + encoding.  Typically, an application needs to convert this to something + else to be able to display it.
    +
    + The most common case is that the output character set is controlled by +the locale mechanism (i.e. via the LANG, + LC_ALL or LC_CTYPE environment variables), which also +controls the gettext mechanism in the application.  
    +
    +
    + + The source distribution of gedcom-parse contains a library implementing helper functions for UTF-8 encoding (see the "utf8" subdirectory of the top directory).   Feel free to use it in your source code.  It isn't installed anywhere, so you need to copy the source and header files into your application. Note that on some systems it uses libcharset, which is also included in this subdirectory.
    +
    + Its interface contains first of all the following two helper functions:
    + +
    +
    int   is_utf8_string (char *input);
    int utf8_strlen (char *input);
    The first one returns 1 if the given input is a valid UTF-8 string and 0 otherwise; the second gives the number of UTF-8 characters in the given input.  Note that the second function assumes that the input is valid UTF-8, and gives unpredictable results if it isn't.
    +
    +For conversion, the following functions are available:
    +
    +
    char *convert_utf8_to_locale (char *input, int *conv_failures);
    char *convert_locale_to_utf8 (char *input);
    +
    +
    +
    + + Both functions return a pointer to a static buffer that is overwritten on each call.  To function properly, the application must first set the locale using the setlocale function (the second step detailed below).  All other steps given below, including setting up and closing down the conversion handles, are transparently handled by the two functions.
    +
    + If you pass a pointer to an integer to the first function, it will be +set to the number of conversion failures, i.e. characters that couldn't +be converted; you can also just pass NULL if you are not interested +(note that usually, the interesting information is just whether there +were conversion failures or not, which is then given by the integer +being bigger than zero or not).  The second function doesn't need this, +because any locale can be converted to UTF-8.
    +
    + You can change the "?" that is output for characters that can't be converted + to any string you want, using the following function before the conversion + calls:
    + +
    +
    void convert_set_unknown (const char *unknown);
    +
    +
    + If you prefer to write your own conversion functions instead of using this example implementation, the following steps need to be taken by the application (more detailed info can be found in the info file of the GNU libc library in the "Generic Charset Conversion" section under "Character Set Handling" or online here):
    + +
      +
    • inclusion of some headers:
    • + +
    + +
    +
    +
    #include <locale.h>    /* for setlocale */
    #include <langinfo.h> /* for nl_langinfo */
    #include <iconv.h> /* for iconv_* functions */
    +
    +
    + +
      +
    • set the program's current locale to what +the user configured in the environment:
    • + +
    + +
    +
    +
    setlocale(LC_ALL, "");
    +
    +
    + +
      +
    • open a conversion handle for conversion + from UTF-8 to the character set of the current locale (once for the entire + program):
    • + +
    + +
    +
    +
    iconv_t iconv_handle;
    ...
    iconv_handle = iconv_open(nl_langinfo(CODESET), "UTF-8");

    if (iconv_handle == (iconv_t) -1)
    /* signal an error */
    +
    +
    + +
      +
    • then, every string can be converted + using the following:
    • + +
    + +
    +
    +
    /* char* in_buf is the input buffer,    size_t in_len is its length */
    /* char* out_buf is the output buffer, size_t out_len is its length */

    size_t nconv;
    char *in_ptr = in_buf;
    char *out_ptr = out_buf;
    nconv = iconv(iconv_handle, &in_ptr, &in_len, &out_ptr, &out_len);
    +
    +
    + +
    If the output buffer is not big enough, iconv will + return -1 and set errno to E2BIG.  Also, +the in_ptr and out_ptr will point just after +the last successfully converted character in the respective buffers, and +the in_len and out_len will be updated to show +the remaining lengths.  There can be two strategies here:
    + +
      +
    • Make sure from the beginning + that the output buffer is big enough.  However, it's difficult to find + an absolute maximum length in advance, even given the length of the input + string.
      +
      +
    • +
    • Do the conversion in several + steps, growing the output buffer each time to make more space, and calling + iconv consecutively until the conversion is complete. +  This is the preferred way (a function could be written to encapsulate + all this).
    • + +
    + Another error case is when the conversion was unsuccessful (if one of +the characters can't be represented in the target character set).  The + iconv function will then also return -1 and set errno + to EILSEQ; the in_ptr will point to the character + that couldn't be converted.  In that case, again two strategies are +possible:
    + +
      +
    • Just fail the conversion, +and show an error.  This is not very user friendly, of course.
      +
      +
    • +
    • Skip over the character that + can't be converted and append a "?" to the output buffer, then call + iconv again.  Skipping over a UTF-8 character is fairly simple, + as follows from the encoding rules + :
    • + +
    + +
      + +
        +
      1. if the first byte is in +binary 0xxxxxxx, then the character is only one byte long, just skip over +that byte
        +
        +
      2. +
      3. if the first byte is in +binary 11xxxxxx, then skip over that byte and all bytes 10xxxxxx that follow.
        +
      4. + +
      + +
    +
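    + The two skipping rules above can be sketched as a small function.  The name utf8_skip_char is hypothetical, and the handling of an empty string and of a stray continuation byte at the start is an assumption that goes beyond the two rules.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of gedcom-parse): returns a pointer to
   the byte just after the UTF-8 character that starts at p. */
static const char *utf8_skip_char(const char *p)
{
    const unsigned char *u = (const unsigned char *) p;

    assert(p != NULL);
    if (*u == '\0')
        return p;                   /* assumption: nothing to skip at the end */
    if ((*u & 0x80) == 0)
        return p + 1;               /* rule 1: 0xxxxxxx is a one-byte character */

    /* Rule 2: skip the lead byte, then all 10xxxxxx bytes that follow.
       A stray continuation byte at p is skipped the same way (assumption). */
    u++;
    while ((*u & 0xC0) == 0x80)
        u++;
    return (const char *) u;
}
```

    + When this is used after an EILSEQ error, the remaining input length given to iconv has to be reduced by the number of bytes skipped before iconv is called again.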
    -
      -
    • 0: disable compatibility mode
    • -
    • 1: allow compatibility mode (this is the default)
      -
    • +
        +
    • finally, the conversion handle needs to be closed (when the program exits):
        +
      • -
      - Note that, currently, no actual compatibility code is present, but this -is on the to-do list.
      +
    -
    -
    $Id$
    $Name$
    - -
                        
    - - - +
    +
    +
    iconv_close(iconv_handle);
    +
    +
    + The example implementation +mentioned above grows the output buffer dynamically and outputs "?" for characters + that can't be converted.
    + + +
    + +

    Support for configure.in

    There +is a macro available for use in configure.in for applications that are using +autoconf to configure their sources.  The following macro checks whether +the Gedcom parser library is available and whether its version is high enough:
    +
    AM_LIB_GEDCOM_PARSER([major,[minor,[patch]]])
    +
    +All the arguments are optional and default to 0.  E.g. to check for +version 1.34, you would put in configure.in the following statement:
    +
    AM_LIB_GEDCOM_PARSER(1,34)
    +
    +To be able to use this macro in the sources of your application, you have three options:
    +
      +
    • Put the file m4/gedcom.m4 in your autoconf data directory (i.e. the path given by 'aclocal --print-ac-dir', usually /usr/share/aclocal).  You can do this automatically by going into the m4 subdirectory and typing 'make install-m4'.
      +
      +
    • +
    • If you're using autoconf, but not automake, copy the contents of m4/gedcom.m4 into the aclocal.m4 file in your sources.
      +
      +
    • +
    • If you're using automake, copy the contents of m4/gedcom.m4 into the acinclude.m4 file in your sources.
      +
    • +
    +
    +There are three preprocessor symbols defined for version checks in the + header (but their direct use is deprecated: please use the macro above):
    + +
      +
    • GEDCOM_PARSE_VERSION_MAJOR
    • +
    • GEDCOM_PARSE_VERSION_MINOR
    • +
    • GEDCOM_PARSE_VERSION
      +
    • + +
    + The last one is equal to (GEDCOM_PARSE_VERSION_MAJOR * 1000) + GEDCOM_PARSE_VERSION_MINOR.
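    + As a worked example of this formula, version 1.34 gives (1 * 1000) + 34 = 1034.  The sketch below only illustrates the arithmetic: the values defined under #ifndef are stand-ins for what the gedcom header would provide, and version_code is a hypothetical helper.

```c
#include <assert.h>

/* Stand-in values for illustration only (assumption): in a real program
   these symbols come from the gedcom header and are not defined by hand. */
#ifndef GEDCOM_PARSE_VERSION_MAJOR
#define GEDCOM_PARSE_VERSION_MAJOR 1
#define GEDCOM_PARSE_VERSION_MINOR 34
#define GEDCOM_PARSE_VERSION \
    ((GEDCOM_PARSE_VERSION_MAJOR * 1000) + GEDCOM_PARSE_VERSION_MINOR)
#endif

/* Hypothetical helper: combines major and minor the same way as
   GEDCOM_PARSE_VERSION, so that versions can be compared as integers. */
static int version_code(int major, int minor)
{
    assert(minor >= 0 && minor < 1000);
    return (major * 1000) + minor;
}

/* Compile-time check against version 1.34, i.e. 1034. */
#if GEDCOM_PARSE_VERSION < 1034
#error "this example expects version 1.34 or higher"
#endif
```

    + Because the minor number occupies the last three digits, comparing the combined values orders versions first by major and then by minor number.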
    + + +
    + +
    $Id$
    $Name$

    + + +
                        
    + + +
    +
    +
    +
    +
    +
    + \ No newline at end of file