1 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
4 <title>Using the GEDCOM parser library</title>
6 <meta http-equiv="content-type"
7 content="text/html; charset=ISO-8859-1">
9 <body text="#000000" bgcolor="#ffffff" link="#000099" vlink="#990099"
12 <h1 align="center">Using the GEDCOM parser library</h1>
<li><a href="#Overview">Overview</a></li>
19 <li><a href="#Error_handling">Error handling</a></li>
20 <li><a href="#Data_callback_mechanism">Data callback mechanism</a></li>
23 <li><a href="#Start_and_end_callbacks">Start and end callbacks</a></li>
24 <li><a href="#Default_callbacks">Default callbacks</a></li>
27 <li><a href="#Other_API_functions">Other API functions</a></li>
30 <li><a href="#Debugging">Debugging</a></li>
31 <li><a href="#Error_treatment">Error treatment</a></li>
32 <li><a href="#Compatibility_mode">Compatibility mode</a></li>
35 <li><a href="#Converting_character_sets">Converting character sets</a></li>
36 <li><a href="#Support_for_configure.in">Support for configure.in</a><br>
38 <li><a href="interface.html">Interface details</a><br>
43 <hr width="100%" size="2">
44 <h2><a name="Overview"></a>Overview<br>
46 The GEDCOM parser library is built as a callback-based parser (comparable
47 to the SAX interface of XML). It comes with:<br>
50 <li>a library (<code>libgedcom.so</code>), to be linked in the
51 application program</li>
52 <li>a header file (<code>gedcom.h</code>), to be used in the sources
53 of the application program</li>
54 <li>a header file (<code>gedcom-tags.h</code>) that is also installed,
55 but that is automatically included via <code>gedcom.h</code><br>
Next to these, there is also a data directory in <code>$PREFIX/share/gedcom-parse</code>
that contains some additional files, which are not important
at first. I'll leave the description of the data directory
The simplest call of the GEDCOM parser is the following
65 piece of code (include of the gedcom header is assumed, as everywhere
68 <blockquote><code>int result;<br>
70 <b>gedcom_init</b>();<br>
72 result = <b>gedcom_parse_file</b>("myfamily.ged");<br>
This call parses the entire file and returns the result: 0 on success
and 1 on failure. On its own, it provides no other information: no
error messages and none of the data in the file.<br>
79 The call to <code>gedcom_init</code>() should be one of the first calls
80 in your program. The requirement is that it should come before the first
81 call to <code>iconv_open</code> (part of the generic character set conversion
82 feature) in the program, either by your program itself, or indirectly by
83 the library calls it makes. Practically, it should e.g. come before
84 any calls to any GTK functions, because GTK uses <code>iconv_open</code>
85 in its initialization. For the same reason it is also advised to put
86 the <code>-lgedcom</code> option on the linking of the program as the last
option, so that its initialization code is run first.<br>
89 The next sections will refine this piece of code to be able to have
90 meaningful errors and the actual data that is in the file.<br>
92 <hr width="100%" size="2">
93 <h2><a name="Error_handling"></a>Error handling</h2>
94 Since this is a relatively simple topic, it is discussed before the
95 actual callback mechanism, although it also uses a callback...<br>
The library can be used in several different circumstances, both
terminal-based and GUI-based. Therefore, it leaves the actual display
99 of the error message up to the application. For this, the application
100 needs to register a callback before parsing the GEDCOM file, which will
101 be called by the library on errors, warnings and messages.<br>
103 A typical piece of code would be:<br>
105 <blockquote><code>void <b>my_message_handler</b> (Gedcom_msg_type type,
111 <b>gedcom_set_message_handler</b>(my_message_handler);<br>
113 result = <b>gedcom_parse_file</b>("myfamily.ged");</code><br>
115 In the above piece of code, <code>my_message_handler</code> is the
116 callback that will be called for errors (<code>type=ERROR</code>), warnings
117 (<code>type=WARNING</code>) and messages (<code>type=MESSAGE</code>). The
118 callback must have the signature as in the example. For errors,
119 the <code> msg</code> passed to the callback will have the format:<br>
<blockquote><code>Error on line</code> <i>&lt;lineno&gt;</i>: <i>&lt;actual_message&gt;</i><br>
123 Note that the entire string will be properly internationalized, and
124 encoded in UTF-8 (<a href="encoding.html">Why UTF-8?</a>). Also,
125 no newline is appended, so that the application program can use it in any
126 way it wants. Warnings are similar, but use "Warning" instead of "Error".
127 Messages are plain text, without any prefix.<br>
129 With this in place, the resulting code will already show errors and
130 warnings produced by the parser, e.g. on the terminal if a simple <code>
131 printf</code> is used in the message handler.<br>
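As a minimal sketch of such a handler, the following keeps a count of errors and warnings and prints each message followed by a newline. The <code>Gedcom_msg_type</code> definition here is only a stand-in so that the sketch compiles without the library (a real program gets it from <code>gedcom.h</code>), the <code>char *msg</code> parameter is assumed from the description above, and the two counters are hypothetical names.<br>

```c
#include <assert.h>
#include <stdio.h>

/* Stand-in for the type that gedcom.h provides; defined here only so the
   sketch is self-contained. Remove it when including gedcom.h. */
typedef enum { ERROR, WARNING, MESSAGE } Gedcom_msg_type;

/* Hypothetical application state: how many errors and warnings were seen. */
static int error_count = 0;
static int warning_count = 0;

/* Message handler: count errors and warnings, then print the message.
   The library appends no newline, so the handler adds one itself. */
void my_message_handler(Gedcom_msg_type type, char *msg)
{
  FILE *out = (type == MESSAGE) ? stdout : stderr;

  assert(msg != NULL);
  if (type == ERROR)
    error_count++;
  else if (type == WARNING)
    warning_count++;
  fprintf(out, "%s\n", msg);
}
```

After parsing, the application can inspect the counters to decide, for example, whether to show a summary dialog.<br>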
133 <hr width="100%" size="2">
134 <h2><a name="Data_callback_mechanism"></a>Data callback mechanism</h2>
135 The most important use of the parser is of course to get the data
136 out of the GEDCOM file. As already mentioned, the parser uses a callback
137 mechanism for that. In fact, the mechanism involves two levels.<br>
139 The primary level is that each of the sections in a GEDCOM file is
140 notified to the application code via a "start element" callback and an
141 "end element" callback (much like in a SAX interface for XML), i.e. when
142 a line containing a certain tag is parsed, the "start element" callback
143 is called for that tag, and when all its subordinate lines with their
144 tags have been processed, the "end element" callback is called for the
145 original tag. Since GEDCOM is hierarchical, this results in properly
146 nested calls to appropriate "start element" and "end element" callbacks.<br>
148 However, it would be typical for a genealogy program to support only
149 a subset of the GEDCOM standard, certainly a program that is still under
150 development. Moreover, under GEDCOM it is allowed for an application
151 to define its own tags, which will typically not be supported by
152 another application. Still, in that case, data preservation is important;
153 it would hardly be accepted that information that is not understood by
154 a certain program is just removed.<br>
156 Therefore, the second level of callbacks involves a "default callback".
157 An application needs to subscribe to callbacks for tags it does support,
and needs to provide a "default callback" which will be called for tags
159 it doesn't support. The application can then choose to just store
160 the information that comes via the default callback in plain textual format.<br>
162 After this introduction, let's see what the API looks like...<br>
165 <h3><a name="Start_and_end_callbacks"></a>Start and end callbacks</h3>
167 <h4><i>Callbacks for records</i> <br>
169 As a simple example, we will get some information from the header
170 of a GEDCOM file. First, have a look at the following piece of code:<br>
172 <blockquote><code>Gedcom_ctxt <b>my_header_start_cb</b> (int level, <br>
173
174 Gedcom_val xref, <br>
175
176 char *tag, <br>
177
178 char *raw_value,<br>
179
180 int parsed_tag, <br>
181
182 Gedcom_val parsed_value)<br>
184 printf("The header starts\n");<br>
185 return (Gedcom_ctxt)1;<br>
188 void <b>my_header_end_cb</b> (Gedcom_ctxt self)<br>
printf("The header ends, context is %ld\n", (long)self);
191 /* context will print as "1" */<br>
195 <b>gedcom_subscribe_to_record</b>(REC_HEAD, my_header_start_cb,
196 my_header_end_cb);<br>
198 result = <b>gedcom_parse_file</b>("myfamily.ged");</code><br>
200 Using the <code>gedcom_subscribe_to_record</code> function, the
201 application requests to use the specified callbacks as start and end
202 callback. The end callback is optional: you can pass <code>NULL</code>
203 if you are not interested in the end callback. The identifiers
204 to use as first argument to the function (here <code>REC_HEAD</code>)
205 are described in the <a href="interface.html#Record_identifiers"> interface
208 From the name of the function it becomes clear that this function
209 is specific to complete records. For the separate elements in records
210 there is another function, which we'll see shortly. Again, the callbacks
211 need to have the signatures as shown in the example.<br>
213 The <code>Gedcom_ctxt</code> type that is used as a result of the
214 start callback and as an argument to the end callback is vital for passing
215 context necessary for the application. This type is meant to be opaque;
216 in fact, it's a void pointer, so you can pass anything via it. The
217 important thing to know is that the context that the application returns
218 in the start callback will be passed in the end callback as an argument,
219 and as we will see shortly, also to all the directly subordinate elements
222 The <code>tag</code> is the GEDCOM tag in string format, the <code>parsed_tag</code>
223 is an integer, for which symbolic values are defined as <code>TAG_HEAD,</code>
<code>TAG_SOUR,</code> <code>TAG_DATA,</code> ... and <code>USERTAG</code>
for the application-specific tags. These values
226 are defined in the header <code>gedcom-tags.h</code> that is installed,
227 and included via <code> gedcom.h</code> (so no need to include <code>gedcom-tags.h</code>
230 The example passes a simple integer as context, but an application
231 could e.g. pass a <code>struct</code> (or an object in a C++ application)
232 that will contain the information for the header. In the end callback,
233 the application could then e.g. do some finalizing operations on the
234 <code> struct</code> to put it in its database.<br>
236 (Note that the <code>Gedcom_val</code> type for the <code>xref</code>
237 and <code>parsed_value</code> arguments was not discussed, see further
241 <h4><i>Callbacks for elements</i></h4>
242 We will now retrieve the SOUR field (the name of the program that
243 wrote the file) from the header:<br>
245 <blockquote><code>Gedcom_ctxt <b>my_header_source_start_cb</b>(Gedcom_ctxt
247
248 int
249 level,<br>
250
251 char*
252 tag,<br>
253
254 char*
255 raw_value,<br>
256
257 int
258 parsed_tag,<br>
259
260 Gedcom_val
261 parsed_value)<br>
263 char *source = GEDCOM_STRING(parsed_value);<br>
264 printf("This file was written by %s\n", source);<br>
265 return parent;<br>
268 void <b>my_header_source_end_cb</b>(Gedcom_ctxt parent,<br>
269
270 Gedcom_ctxt self,<br>
271
272 Gedcom_val parsed_value)<br>
274 printf("End of the source description\n");<br>
278 <b>gedcom_subscribe_to_element</b>(ELT_HEAD_SOUR,<br>
279
280 my_header_source_start_cb,<br>
281
282 my_header_source_end_cb);<br>
284 result = <b>gedcom_parse_file</b>("myfamily.ged");</code><br>
286 The subscription mechanism for elements is similar, only the signatures
287 of the callbacks differ. The signature for the start callback shows
288 that the context of the parent line (here e.g. the <code>struct</code>
that describes the header) is passed to this start callback. In this
example the callback returns that same context, but it could of course
return its own context object. The end callback is called
292 with both the context of the parent and the context of itself, which in this
293 example will be the same. Again, the list of identifiers to use as
a first argument for the subscription function is detailed in the <a
295 href="interface.html#Element_identifiers"> interface details</a> .<br>
297 If we look at the other arguments of the start callback, we see the
298 level number (the initial number of the line in the GEDCOM file), the tag
299 (e.g. "SOUR"), and then a raw value, a parsed tag and a parsed value. The
300 raw value is just the raw string that occurs as value on the line next
301 to the tag (in UTF-8 encoding). The parsed value is the meaningful
302 value that is parsed from that raw string. The parsed tag is described
303 in the section for record callbacks above.<br>
305 The <code>Gedcom_val</code> type is meant to be an opaque type. The
306 only thing that needs to be known about it is that it can contain specific
307 data types, which have to be retrieved from it using pre-defined macros.
308 These data types are described in the <a
309 href="interface.html#Gedcom_val_types"> interface details</a>.
312 Some extra notes:<br>
316 <li>The <code>Gedcom_val</code> argument of the end callback
317 is currently not used. It is there for future enhancements.</li>
318 <li>There are also two <code>Gedcom_val</code> arguments
319 in the start callback for records. The first one (<code>xref</code>
320 ) contains the <code>xref_value</code> corresponding to the cross-reference
321 (or <code>NULL</code> if there isn't one), the second one (<code>parsed_value</code>
322 ) contains the value that is parsed from the <code>raw_value</code>. See
323 the <a href="interface.html#Record_identifiers">interface details</a>
330 <h3><a name="Default_callbacks"></a>Default callbacks<br>
332 As described above, an application doesn't always implement the entire
333 GEDCOM spec, and application-specific tags may have been added by other
334 applications. To preserve this extra data anyway, a default callback
335 can be registered by the application, as in the following example:<br>
337 <blockquote><code>void <b>my_default_cb</b> (Gedcom_ctxt parent, int level,
338 char* tag, char* raw_value, int parsed_tag)<br>
344 <b>gedcom_set_default_callback</b>(my_default_cb);<br>
346 result = <b>gedcom_parse_file</b>("myfamily.ged");</code><br>
This callback has a signature similar to the previous ones,
349 but it doesn't contain a parsed value. However, it does contain the
350 parent context, that was returned by the application for the most specific
351 containing tag that the application supported.<br>
353 Suppose e.g. that this callback is called for some tags in the header
354 that are specific to some other application, then our application could
355 make sure that the parent context contains the struct or object that represents
356 the header, and use the default callback here to add the level, tag and
357 raw_value as plain text in a member of that struct or object, thus preserving
358 the information. The application can then write this out when the
359 data is saved again in a GEDCOM file. To make it more specific, consider
360 the following example:<br>
362 <blockquote><code>struct header {<br>
363 char* source;<br>
365 char* extra_text;<br>
368 Gedcom_ctxt my_header_start_cb(int level, Gedcom_val xref, char* tag,
370
371 int parsed_tag, Gedcom_val parsed_value)<br>
struct header* head = my_make_header_struct();<br>
return (Gedcom_ctxt)head;<br>
void my_default_cb(Gedcom_ctxt parent, int level, char* tag, char*
raw_value, int parsed_tag)<br>
struct header* head = (struct header*)parent;<br>
my_header_add_to_extra_text(head, level, tag, raw_value);<br>
384 gedcom_set_default_callback(my_default_cb);<br>
gedcom_subscribe_to_record(REC_HEAD, my_header_start_cb, NULL);<br>
387 result = gedcom_parse_file(filename);</code><br>
389 Note that the default callback will be called for any tag that isn't
390 specifically subscribed upon by the application, and can thus be called
391 in various contexts. For simplicity, the example above doesn't take
392 this into account (the <code>parent</code> could be of different
393 types, depending on the context).<br>
395 Note also that the default callback is not called when the parent context
is <code>NULL</code>. This is e.g. the case if none
397 of the "upper" tags has been subscribed upon.<br>
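The two helper functions used in the example above are left to the application. One possible way to write them is sketched below, assuming the header is passed around as a pointer (which is what fits in a <code>Gedcom_ctxt</code>) and that unknown lines are stored as "level tag value" text, one per line. The return values and the exact text layout are assumptions of this sketch, not something the library prescribes.<br>

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct header {
  char *source;      /* filled in by the SOUR callback */
  char *extra_text;  /* unsupported lines, kept as plain text */
};

/* Allocate an empty header; returns NULL if allocation fails. */
struct header *my_make_header_struct(void)
{
  return calloc(1, sizeof(struct header));
}

/* Append one line "<level> <tag> <raw_value>\n" to head->extra_text.
   Returns 0 on success and -1 on failure (extra_text is then unchanged). */
int my_header_add_to_extra_text(struct header *head, int level,
                                const char *tag, const char *raw_value)
{
  const char *value = raw_value ? raw_value : "";
  const char *sep = *value ? " " : "";   /* no trailing space if no value */
  size_t old_len;
  char *bigger;
  int add_len;

  assert(head != NULL && tag != NULL);
  old_len = head->extra_text ? strlen(head->extra_text) : 0;

  /* First call only measures the length of the formatted line. */
  add_len = snprintf(NULL, 0, "%d %s%s%s\n", level, tag, sep, value);
  if (add_len < 0)
    return -1;

  bigger = realloc(head->extra_text, old_len + (size_t)add_len + 1);
  if (bigger == NULL)
    return -1;

  snprintf(bigger + old_len, (size_t)add_len + 1, "%d %s%s%s\n",
           level, tag, sep, value);
  head->extra_text = bigger;
  return 0;
}
```

When the data is saved again, the application can write <code>extra_text</code> back into the header unchanged, so that tags it does not understand survive a load and save cycle.<br>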
400 <hr width="100%" size="2">
402 <h2><a name="Other_API_functions"></a>Other API functions<br>
404 Although the above describes the basic interface of libgedcom, there
are some other functions that let you customize the behaviour of the library.
406 These will be explained in the current section.<br>
409 <h3><a name="Debugging"></a>Debugging</h3>
410 The library can generate various debugging output, not only from itself,
411 but also the debugging output generated by the yacc parser. By default,
412 no debugging output is generated, but this can be customized using the
413 following function:<br>
416 <blockquote><code>void <b>gedcom_set_debug_level</b> (int level, FILE*
417 trace_output)</code><br>
419 The <code>level</code> can be one of the following values:<br>
423 <li>0: no debugging information (this is the
425 <li>1: only debugging information from libgedcom
427 <li>2: debugging information from libgedcom
432 If the <code>trace_output</code> is <code>NULL</code>, debugging information
433 will be written to <code>stderr</code>, otherwise the given file handle
434 is used (which must be open).<br>
438 <h3><a name="Error_treatment"></a>Error treatment</h3>
439 One of the previous sections already described the callback to be
registered to get error messages. The library also lets you customize
441 what happens on an error, using the following function:<br>
444 <blockquote><code>void <b>gedcom_set_error_handling</b> (Gedcom_err_mech
445 mechanism)</code><br>
447 The <code>mechanism</code> can be one of:<br>
451 <li><code>IMMED_FAIL</code>: immediately fail
452 the parsing on an error (this is the default)</li>
453 <li><code>DEFER_FAIL</code>: continue parsing
454 after an error, but return a failure code eventually</li>
455 <li><code>IGNORE_ERRORS</code>: continue parsing
456 after an error, return success always</li>
460 This doesn't influence the generation of error or warning messages,
461 only the behaviour of the parser and its return code.<br>
465 <h3><a name="Compatibility_mode"></a>Compatibility mode<br>
467 Applications are not necessarily true to the GEDCOM spec (or use a
468 different version than 5.5). The intention is that the library is
resilient to this, and switches to compatibility mode for files written by specific
470 programs (detected via the HEAD.SOUR tag). This compatibility mode
471 can be enabled and disabled via the following function:<br>
474 <blockquote><code>void <b>gedcom_set_compat_handling</b> (int enable_compat)</code><br>
476 The argument can be:<br>
480 <li>0: disable compatibility mode</li>
481 <li>1: allow compatibility mode (this is the
Currently, there is initial compatibility support for ftree and Lifelines (3.0.2).<br>
489 <hr width="100%" size="2">
490 <h2><a name="Converting_character_sets"></a>Converting character sets</h2>
491 All strings passed by the GEDCOM parser to the application are in UTF-8
492 encoding. Typically, an application needs to convert this to something
493 else to be able to display it.<br>
495 The most common case is that the output character set is controlled by
496 the <code>locale</code> mechanism (i.e. via the <code>LANG</code>, <code>
497 LC_ALL</code> or <code>LC_CTYPE</code> environment variables), which also
498 controls the <code>gettext</code> mechanism in the application. <br>
502 The source distribution of <code>
503 gedcom-parse</code> contains an example implementation (<code>utf8-locale.c</code>
504 and <code> utf8-locale.h</code> in the "t" subdirectory of the top directory).
505 Feel free to use it in your source code (it is not part of the library,
506 and it isn't installed anywhere, so you need to take over the source and
507 header file in your application). <br>
509 Its interface is:<br>
512 <pre><code>char *<b>convert_utf8_to_locale</b> (char *input, int *conv_failures);<br>char *<b>convert_locale_to_utf8</b> (char *input);<br></code></pre>
514 Both functions return a pointer to a static buffer that is overwritten
515 on each call. To function properly, the application must first set
516 the locale using the <code>setlocale</code> function (the second step detailed
517 below). All other steps given below, including setting up and closing
down the conversion handles, are transparently handled by the two functions.
521 If you pass a pointer to an integer to the first function, it will be
522 set to the number of conversion failures, i.e. characters that couldn't
523 be converted; you can also just pass <code>NULL</code> if you are not interested
524 (note that usually, the interesting information is just whether there <i>
525 were</i> conversion failures or not, which is then given by the integer
526 being bigger than zero or not). The second function doesn't need this,
527 because any locale can be converted to UTF-8.<br>
529 You can change the "?" that is output for characters that can't be converted
530 to any string you want, using the following function before the conversion
534 <pre><code>void <b>convert_set_unknown</b> (const char *unknown);</code></pre>
537 If you want to have your own functions for it instead of this example
538 implementation, the following steps need to be taken by the application
539 (more detailed info can be found in the info file of the GNU libc library
540 in the "Generic Charset Conversion" section under "Character Set Handling"
542 href="http://www.gnu.org/manual/glibc-2.2.3/html_chapter/libc_6.html#SEC99">
546 <li>inclusion of some headers:</li>
<pre><code>#include &lt;locale.h&gt;    /* for setlocale */<br>#include &lt;langinfo.h&gt;  /* for nl_langinfo */<br>#include &lt;iconv.h&gt;     /* for iconv_* functions */<br></code></pre>
557 <li>set the program's current locale to what
558 the user configured in the environment:</li>
564 <pre><code>setlocale(LC_ALL, "");</code><br></pre>
569 <li>open a conversion handle for conversion
570 from UTF-8 to the character set of the current locale (once for the entire
<pre><code>iconv_t iconv_handle;<br>...<br>iconv_handle = iconv_open(nl_langinfo(CODESET), "UTF-8");<br>if (iconv_handle == (iconv_t) -1)<br>  /* signal an error */<br></code></pre>
582 <li>then, every string can be converted
583 using the following:</li>
<pre><code>/* char* in_buf is the input buffer, size_t in_len is its length */<br>/* char* out_buf is the output buffer, size_t out_len is its length */<br><br>size_t nconv;<br>char *in_ptr = in_buf;<br>char *out_ptr = out_buf;<br>nconv = iconv(iconv_handle, &amp;in_ptr, &amp;in_len, &amp;out_ptr, &amp;out_len);</code></pre>
593 <blockquote>If the output buffer is not big enough, <code>iconv</code> will
594 return -1 and set <code>errno</code> to <code>E2BIG</code>. Also,
595 the <code>in_ptr</code> and <code>out_ptr</code> will point just after
596 the last successfully converted character in the respective buffers, and
597 the <code> in_len</code> and <code>out_len</code> will be updated to show
598 the remaining lengths. There can be two strategies here:<br>
601 <li>Make sure from the beginning
602 that the output buffer is big enough. However, it's difficult to find
603 an absolute maximum length in advance, even given the length of the input
607 <li>Do the conversion in several
608 steps, growing the output buffer each time to make more space, and calling
609 <code>iconv</code> consecutively until the conversion is complete.
610 This is the preferred way (a function could be written to encapsulate
614 Another error case is when the conversion was unsuccessful (if one of
615 the characters can't be represented in the target character set). The
616 <code> iconv</code> function will then also return -1 and set <code>errno</code>
617 to <code>EILSEQ</code>; the <code>in_ptr</code> will point to the character
618 that couldn't be converted. In that case, again two strategies are
622 <li>Just fail the conversion,
623 and show an error. This is not very user friendly, of course.<br>
626 <li>Skip over the character that
627 can't be converted and append a "?" to the output buffer, then call <code>
628 iconv</code> again. Skipping over a UTF-8 character is fairly simple,
629 as follows from the <a
630 href="http://www.cl.cam.ac.uk/%7Emgk25/unicode.html#utf-8">encoding rules</a>
638 <li>if the first byte is in
639 binary 0xxxxxxx, then the character is only one byte long, just skip over
643 <li>if the first byte is in
644 binary 11xxxxxx, then skip over that byte and all bytes 10xxxxxx that follow.<br>
653 <li>eventually, the conversion
654 handle needs to be closed (when the program exits):<br>
661 <pre><code>iconv_close(iconv_handle);<br></code></pre>
664 The example implementation
665 mentioned above grows the output buffer dynamically and outputs "?" for characters
666 that can't be converted.<br>
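For applications that write their own conversion, the preferred strategies described above (growing the output buffer on <code>E2BIG</code>, and skipping the character and emitting "?" on <code>EILSEQ</code>) can be sketched as follows. This is not the code from <code>utf8-locale.c</code>: the names <code>utf8_char_len</code>, <code>grow_output</code> and <code>convert_with_handle</code> are hypothetical, the handle is assumed to be opened by the caller as shown earlier, and the target character set is assumed to be stateless. The sketch also relies on <code>iconv</code> reporting unconvertible characters with <code>EILSEQ</code>, as the GNU libc does.<br>

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Length in bytes of the UTF-8 character starting at p (p must point at the
   first byte of a character inside a NUL-terminated string). */
static size_t utf8_char_len(const char *p)
{
  size_t len = 1;                              /* 0xxxxxxx: one byte */
  if (((unsigned char)p[0] & 0xC0) == 0xC0)    /* 11xxxxxx: lead byte */
    while (((unsigned char)p[len] & 0xC0) == 0x80)
      len++;                                   /* 10xxxxxx: continuation */
  return len;
}

/* Double the output buffer, preserving the write position.
   One byte is always kept free for the terminating NUL. */
static int grow_output(char **buf, size_t *size, char **out_ptr, size_t *out_len)
{
  size_t used = (size_t)(*out_ptr - *buf);
  char *bigger = realloc(*buf, *size * 2);
  if (bigger == NULL)
    return -1;
  *buf = bigger;
  *size *= 2;
  *out_ptr = bigger + used;
  *out_len = *size - 1 - used;
  return 0;
}

/* Convert a NUL-terminated UTF-8 string with an already opened handle.
   Returns a malloc'ed string (the caller frees it) or NULL on failure.
   If failures is not NULL, it receives the number of replaced characters. */
char *convert_with_handle(iconv_t handle, const char *input, int *failures)
{
  size_t in_len = strlen(input);
  size_t size = 16;
  char *buf = malloc(size);
  char *in_ptr = (char *)input;   /* iconv does not modify the input data */
  char *out_ptr = buf;
  size_t out_len = size - 1;
  int nfail = 0;

  assert(handle != (iconv_t)-1);
  if (buf == NULL)
    return NULL;

  while (in_len > 0) {
    if (iconv(handle, &in_ptr, &in_len, &out_ptr, &out_len) != (size_t)-1)
      continue;                   /* all input consumed, the loop ends */
    if (errno == E2BIG) {         /* output full: make room and retry */
      if (grow_output(&buf, &size, &out_ptr, &out_len) != 0)
        goto fail;
    } else if (errno == EILSEQ) { /* unconvertible: skip it, emit "?" */
      size_t skip = utf8_char_len(in_ptr);
      if (skip > in_len)
        skip = in_len;
      in_ptr += skip;
      in_len -= skip;
      if (out_len == 0 && grow_output(&buf, &size, &out_ptr, &out_len) != 0)
        goto fail;
      *out_ptr++ = '?';
      out_len--;
      nfail++;
    } else {
      goto fail;                  /* e.g. EINVAL: truncated input */
    }
  }
  *out_ptr = '\0';
  if (failures != NULL)
    *failures = nfail;
  return buf;

fail:
  free(buf);
  return NULL;
}
```

Returning a freshly allocated string avoids the static buffer of the example implementation, at the cost of making the caller responsible for calling <code>free</code>.<br>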
669 <hr width="100%" size="2">
671 <h2><a name="Support_for_configure.in"></a>Support for configure.in</h2>
672 Programs using the GEDCOM parser library and using autoconf to configure
673 their sources can use the following statements in configure.in (the example
674 is checking for gedcom-parse, version 1.34):<br>
676 <blockquote><code>AC_CHECK_LIB(gedcom, gedcom_parse_file,,<br>
677 AC_MSG_ERROR(Cannot
678 find libgedcom: Please install gedcom-parse))<br>
679 AC_MSG_CHECKING(for libgedcom version)<br>
#include &lt;stdio.h&gt;<br>
#include &lt;stdlib.h&gt;<br>
#include &lt;gedcom.h&gt;<br>
if (GEDCOM_PARSE_VERSION &gt;= 1034) exit(0);<br>
690 ac_gedcom_version_ok='yes',<br>
691 ac_gedcom_version_ok='no',<br>
692 ac_gedcom_version_ok='no')<br>
693 if test "$ac_gedcom_version_ok" = 'yes' ; then<br>
694 AC_MSG_RESULT(ok)<br>
696 AC_MSG_RESULT(not ok)<br>
697 AC_MSG_ERROR(You need at least version 1.34 of gedcom-parse)<br>
700 There are three preprocessor symbols defined for version checks in the
704 <li><code>GEDCOM_PARSE_VERSION_MAJOR</code></li>
705 <li><code>GEDCOM_PARSE_VERSION_MINOR</code></li>
706 <li><code>GEDCOM_PARSE_VERSION</code><br>
The last one is equal to <code>(GEDCOM_PARSE_VERSION_MAJOR * 1000) + GEDCOM_PARSE_VERSION_MINOR</code>.<br>
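To make the arithmetic concrete, the sketch below shows how version 1.34 maps to 1034 and how the same check can be made at compile time or at run time. The three <code>#define</code> lines are stand-ins so that the sketch compiles without the library installed (a real program gets the values from <code>gedcom.h</code>), and <code>my_version_at_least</code> is a hypothetical helper.<br>

```c
#include <assert.h>

/* Stand-in values for version 1.34; a real program includes gedcom.h
   instead, which defines these three symbols itself. */
#ifndef GEDCOM_PARSE_VERSION
#define GEDCOM_PARSE_VERSION_MAJOR 1
#define GEDCOM_PARSE_VERSION_MINOR 34
#define GEDCOM_PARSE_VERSION \
  ((GEDCOM_PARSE_VERSION_MAJOR * 1000) + GEDCOM_PARSE_VERSION_MINOR)
#endif

/* Compile-time check: refuse to build against a library that is too old. */
#if GEDCOM_PARSE_VERSION < 1034
#error "gedcom-parse 1.34 or later is required"
#endif

/* Hypothetical helper: non-zero if the headers are at least major.minor. */
static int my_version_at_least(int major, int minor)
{
  return GEDCOM_PARSE_VERSION >= (major * 1000) + minor;
}
```

Because the minor number is added to a multiple of 1000, comparing the combined value orders versions correctly as long as the minor number stays below 1000.<br>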
712 <hr width="100%" size="2">