1 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head><title>Using the GEDCOM parser library</title>
4 <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1"></head>
6 <body text="#000000" bgcolor="#ffffff" link="#000099" vlink="#990099" alink="#000099">
8 <h1 align="center">Using the GEDCOM parser library</h1>
<li><a href="#Overview">Overview</a></li>
15 <li><a href="#Error_handling">Error handling</a></li>
16 <li><a href="#Data_callback_mechanism">Data callback mechanism</a></li>
19 <li><a href="#Start_and_end_callbacks">Start and end callbacks</a></li>
20 <li><a href="#Default_callbacks">Default callbacks</a></li>
23 <li><a href="#Other_API_functions">Other API functions</a></li>
26 <li><a href="#Debugging">Debugging</a></li>
27 <li><a href="#Error_treatment">Error treatment</a></li>
28 <li><a href="#Compatibility_mode">Compatibility mode</a></li>
31 <li><a href="#Converting_character_sets">Converting character sets</a></li>
32 <li><a href="#Support_for_configure.in">Support for configure.in</a><br>
34 <li><a href="interface.html">Interface details</a><br>
39 <hr width="100%" size="2">
40 <h2><a name="Overview"></a>Overview<br>
42 The GEDCOM parser library is built as a callback-based parser (comparable
43 to the SAX interface of XML). It comes with:<br>
46 <li>a library (<code>libgedcom.so</code>), to be linked in the
47 application program</li>
48 <li>a header file (<code>gedcom.h</code>), to be used in the sources
49 of the application program</li>
50 <li>a header file (<code>gedcom-tags.h</code>) that is also installed,
51 but that is automatically included via <code>gedcom.h</code><br>
Next to these, there is also a data directory in <code>$PREFIX/share/gedcom-parse</code>
that contains some additional files, which are not
important at first. I'll leave the description of the data directory
The simplest use of the GEDCOM parser is the following
piece of code (inclusion of the gedcom header is assumed, as everywhere
64 <blockquote><code>int result;<br>
66 <b>gedcom_init</b>();<br>
68 result = <b>gedcom_parse_file</b>("myfamily.ged");<br>
This does not provide much information: it parses the entire file and
returns the result, which is 0 on success and 1 on failure. No other
information is available using this function only.<br>
75 The call to <code>gedcom_init</code>() should be one of the first calls
76 in your program. The requirement is that it should come before the first
77 call to <code>iconv_open</code> (part of the generic character set conversion
78 feature) in the program, either by your program itself, or indirectly by
the library calls it makes. In practice, it should for example come before
any call to a GTK function, because GTK uses <code>iconv_open</code>
in its initialization. For the same reason it is also advised to put
the <code>-lgedcom</code> option last on the link command line of the
program, so that its initialization code is run first.<br>
85 The next sections will refine this piece of code to be able to have
86 meaningful errors and the actual data that is in the file.<br>
88 <hr width="100%" size="2">
89 <h2><a name="Error_handling"></a>Error handling</h2>
90 Since this is a relatively simple topic, it is discussed before the
91 actual callback mechanism, although it also uses a callback...<br>
The library can be used in several different circumstances, both
terminal-based and GUI-based. Therefore, it leaves the actual display
95 of the error message up to the application. For this, the application
96 needs to register a callback before parsing the GEDCOM file, which will
97 be called by the library on errors, warnings and messages.<br>
99 A typical piece of code would be:<br>
101 <blockquote><code>void <b>my_message_handler</b> (Gedcom_msg_type type,
107 <b>gedcom_set_message_handler</b>(my_message_handler);<br>
109 result = <b>gedcom_parse_file</b>("myfamily.ged");</code><br>
111 In the above piece of code, <code>my_message_handler</code> is the
112 callback that will be called for errors (<code>type=ERROR</code>), warnings
113 (<code>type=WARNING</code>) and messages (<code>type=MESSAGE</code>). The
114 callback must have the signature as in the example. For errors,
115 the <code> msg</code> passed to the callback will have the format:<br>
<blockquote><code>Error on line</code> <i>&lt;lineno&gt;</i>: <i>&lt;actual_message&gt;</i><br>
119 Note that the entire string will be properly internationalized, and
120 encoded in UTF-8 (<a href="encoding.html">Why UTF-8?</a>). Also,
121 no newline is appended, so that the application program can use it in any
122 way it wants. Warnings are similar, but use "Warning" instead of "Error".
123 Messages are plain text, without any prefix.<br>
125 With this in place, the resulting code will already show errors and
126 warnings produced by the parser, e.g. on the terminal if a simple <code>
127 printf</code> is used in the message handler.<br>
129 <hr width="100%" size="2">
130 <h2><a name="Data_callback_mechanism"></a>Data callback mechanism</h2>
131 The most important use of the parser is of course to get the data
132 out of the GEDCOM file. As already mentioned, the parser uses a callback
133 mechanism for that. In fact, the mechanism involves two levels.<br>
135 The primary level is that each of the sections in a GEDCOM file is
136 notified to the application code via a "start element" callback and an
137 "end element" callback (much like in a SAX interface for XML), i.e. when
138 a line containing a certain tag is parsed, the "start element" callback
139 is called for that tag, and when all its subordinate lines with their
140 tags have been processed, the "end element" callback is called for the
141 original tag. Since GEDCOM is hierarchical, this results in properly
142 nested calls to appropriate "start element" and "end element" callbacks.<br>
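To see how GEDCOM level numbers turn into nested start and end notifications, here is a small stand-alone sketch. It does not use libgedcom and is not how the library is implemented; <code>struct line</code>, <code>log_event</code> and <code>walk</code> are names made up for this illustration, and the input is assumed to be well-formed (a level never increases by more than one).<br>

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_DEPTH 100

/* One GEDCOM line reduced to its level number and tag (illustration only). */
struct line { int level; const char *tag; };

/* Append one event such as "start HEAD;" to the log buffer. */
static void log_event(char *log, size_t size, const char *kind, const char *tag)
{
    size_t used = strlen(log);
    snprintf(log + used, size - used, "%s %s;", kind, tag);
}

/* Walk the lines in order, keeping a stack of the tags that are still open.
   Before a line starts, every open tag at the same or a deeper level ends. */
static void walk(const struct line *lines, int count, char *log, size_t size)
{
    const char *open_tags[MAX_DEPTH];
    int depth = 0;
    int i;

    assert(lines != NULL && log != NULL && size > 0);
    log[0] = '\0';
    for (i = 0; i < count; i++) {
        while (depth > lines[i].level)
            log_event(log, size, "end", open_tags[--depth]);
        if (depth >= MAX_DEPTH)
            break;                      /* too deeply nested for this sketch */
        log_event(log, size, "start", lines[i].tag);
        open_tags[depth++] = lines[i].tag;
    }
    while (depth > 0)                   /* end of file closes what is left */
        log_event(log, size, "end", open_tags[--depth]);
}
```

For the lines <code>0 HEAD</code>, <code>1 SOUR</code>, <code>2 VERS</code>, <code>1 DEST</code>, <code>0 INDI</code>, the log shows VERS and SOUR ending before DEST starts, and DEST and HEAD ending before INDI starts.<br>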
144 However, it would be typical for a genealogy program to support only
145 a subset of the GEDCOM standard, certainly a program that is still under
146 development. Moreover, under GEDCOM it is allowed for an application
147 to define its own tags, which will typically not be supported by
148 another application. Still, in that case, data preservation is important;
149 it would hardly be accepted that information that is not understood by
150 a certain program is just removed.<br>
Therefore, the second level of callbacks involves a "default callback".
An application needs to subscribe to callbacks for the tags it supports,
and needs to provide a "default callback" which will be called for the tags
it doesn't support. The application can then choose to just store
the information that comes via the default callback in plain textual format.<br>
158 After this introduction, let's see what the API looks like...<br>
161 <h3><a name="Start_and_end_callbacks"></a>Start and end callbacks</h3>
163 <h4><i>Callbacks for records</i> <br>
165 As a simple example, we will get some information from the header
166 of a GEDCOM file. First, have a look at the following piece of code:<br>
<blockquote><code>Gedcom_ctxt <b>my_header_start_cb</b> (Gedcom_rec rec,<br>
&nbsp;&nbsp;&nbsp;&nbsp;int level,<br>
&nbsp;&nbsp;&nbsp;&nbsp;Gedcom_val xref,<br>
&nbsp;&nbsp;&nbsp;&nbsp;char *tag,<br>
&nbsp;&nbsp;&nbsp;&nbsp;char *raw_value,<br>
&nbsp;&nbsp;&nbsp;&nbsp;int parsed_tag,<br>
&nbsp;&nbsp;&nbsp;&nbsp;Gedcom_val parsed_value)<br>
{<br>
&nbsp;&nbsp;printf("The header starts\n");<br>
&nbsp;&nbsp;return (Gedcom_ctxt)1;<br>
}<br>
<br>
void <b>my_header_end_cb</b> (Gedcom_rec rec, Gedcom_ctxt self)<br>
{<br>
&nbsp;&nbsp;/* the context is a pointer, so it is cast to long to print the value stored in it */<br>
&nbsp;&nbsp;printf("The header ends, context is %ld\n", (long)self);
/* context will print as "1" */<br>
}<br>
<br>
<b>gedcom_subscribe_to_record</b>(REC_HEAD, my_header_start_cb,
my_header_end_cb);<br>
<br>
result = <b>gedcom_parse_file</b>("myfamily.ged");</code><br>
197 Using the <code>gedcom_subscribe_to_record</code> function, the
198 application requests to use the specified callbacks as start and end
199 callback. The end callback is optional: you can pass <code>NULL</code>
200 if you are not interested in the end callback. The identifiers
201 to use as first argument to the function (here <code>REC_HEAD</code>)
202 are described in the <a href="interface.html#Record_identifiers"> interface
203 details</a> . These are also passed as first argument in the callbacks (the <code>Gedcom_rec</code> argument).<br>
205 From the name of the function it becomes clear that this function
206 is specific to complete records. For the separate elements in records
207 there is another function, which we'll see shortly. Again, the callbacks
208 need to have the signatures as shown in the example.<br>
210 The <code>Gedcom_ctxt</code> type that is used as a result of the
211 start callback and as an argument to the end callback is vital for passing
212 context necessary for the application. This type is meant to be opaque;
213 in fact, it's a void pointer, so you can pass anything via it. The
214 important thing to know is that the context that the application returns
215 in the start callback will be passed in the end callback as an argument,
216 and as we will see shortly, also to all the directly subordinate elements
219 The <code>tag</code> is the GEDCOM tag in string format, the <code>parsed_tag</code>
is an integer, for which symbolic values are defined as <code>TAG_HEAD</code>,
<code>TAG_SOUR</code>, <code>TAG_DATA</code>, ... and <code>USERTAG</code>
for the application-specific tags. These values
223 are defined in the header <code>gedcom-tags.h</code> that is installed,
224 and included via <code> gedcom.h</code> (so no need to include <code>gedcom-tags.h</code>
227 The example passes a simple integer as context, but an application
228 could e.g. pass a <code>struct</code> (or an object in a C++ application)
229 that will contain the information for the header. In the end callback,
230 the application could then e.g. do some finalizing operations on the
231 <code> struct</code> to put it in its database.<br>
233 (Note that the <code>Gedcom_val</code> type for the <code>xref</code>
234 and <code>parsed_value</code> arguments was not discussed, see further
238 <h4><i>Callbacks for elements</i></h4>
239 We will now retrieve the SOUR field (the name of the program that
240 wrote the file) from the header:<br>
<blockquote><code>Gedcom_ctxt <b>my_header_source_start_cb</b>(Gedcom_elt elt,<br>
&nbsp;&nbsp;&nbsp;&nbsp;Gedcom_ctxt parent,<br>
&nbsp;&nbsp;&nbsp;&nbsp;int level,<br>
&nbsp;&nbsp;&nbsp;&nbsp;char* tag,<br>
&nbsp;&nbsp;&nbsp;&nbsp;char* raw_value,<br>
&nbsp;&nbsp;&nbsp;&nbsp;int parsed_tag,<br>
&nbsp;&nbsp;&nbsp;&nbsp;Gedcom_val parsed_value)<br>
{<br>
&nbsp;&nbsp;char *source = GEDCOM_STRING(parsed_value);<br>
&nbsp;&nbsp;printf("This file was written by %s\n", source);<br>
&nbsp;&nbsp;return parent;<br>
}<br>
<br>
void <b>my_header_source_end_cb</b>(Gedcom_elt elt,<br>
&nbsp;&nbsp;&nbsp;&nbsp;Gedcom_ctxt parent,<br>
&nbsp;&nbsp;&nbsp;&nbsp;Gedcom_ctxt self,<br>
&nbsp;&nbsp;&nbsp;&nbsp;Gedcom_val parsed_value)<br>
{<br>
&nbsp;&nbsp;printf("End of the source description\n");<br>
}<br>
<br>
<b>gedcom_subscribe_to_element</b>(ELT_HEAD_SOUR,<br>
&nbsp;&nbsp;&nbsp;&nbsp;my_header_source_start_cb,<br>
&nbsp;&nbsp;&nbsp;&nbsp;my_header_source_end_cb);<br>
<br>
result = <b>gedcom_parse_file</b>("myfamily.ged");</code><br>
286 The subscription mechanism for elements is similar, only the signatures
287 of the callbacks differ. The signature for the start callback shows
288 that the context of the parent line (here e.g. the <code>struct</code>
that describes the header) is passed to this start callback. In this
example the callback returns that same context, but it could of course
return its own context object. The end callback is called
with both the context of the parent and its own context, which in this
example are the same. Again, the list of identifiers to use as
a first argument for the subscription function is detailed in the <a href="interface.html#Element_identifiers"> interface details</a>. These are also passed as first argument in the callback (the <code>Gedcom_elt</code> argument).<br>
If we look at the other arguments of the start callback, we see the
level number (the number that starts the line in the GEDCOM file), the tag
298 (e.g. "SOUR"), and then a raw value, a parsed tag and a parsed value. The
299 raw value is just the raw string that occurs as value on the line next
300 to the tag (in UTF-8 encoding). The parsed value is the meaningful
301 value that is parsed from that raw string. The parsed tag is described
302 in the section for record callbacks above.<br>
304 The <code>Gedcom_val</code> type is meant to be an opaque type. The
305 only thing that needs to be known about it is that it can contain specific
306 data types, which have to be retrieved from it using pre-defined macros.
307 These data types are described in the <a href="interface.html#Gedcom_val_types"> interface details</a>.
310 Some extra notes:<br>
314 <li>The <code>Gedcom_val</code> argument of the end callback
315 is currently not used. It is there for future enhancements.</li>
316 <li>There are also two <code>Gedcom_val</code> arguments
317 in the start callback for records. The first one (<code>xref</code>
318 ) contains the <code>xref_value</code> corresponding to the cross-reference
319 (or <code>NULL</code> if there isn't one), the second one (<code>parsed_value</code>
320 ) contains the value that is parsed from the <code>raw_value</code>. See
321 the <a href="interface.html#Record_identifiers">interface details</a>
328 <h3><a name="Default_callbacks"></a>Default callbacks<br>
330 As described above, an application doesn't always implement the entire
331 GEDCOM spec, and application-specific tags may have been added by other
332 applications. To preserve this extra data anyway, a default callback
333 can be registered by the application, as in the following example:<br>
335 <blockquote><code>void <b>my_default_cb</b> (Gedcom_elt elt, Gedcom_ctxt parent, int level,
336 char* tag, char* raw_value, int parsed_tag)<br>
342 <b>gedcom_set_default_callback</b>(my_default_cb);<br>
344 result = <b>gedcom_parse_file</b>("myfamily.ged");</code><br>
This callback has a signature similar to the previous ones,
but it doesn't contain a parsed value. However, it does contain the
348 parent context, that was returned by the application for the most specific
349 containing tag that the application supported.<br>
351 Suppose e.g. that this callback is called for some tags in the header
352 that are specific to some other application, then our application could
353 make sure that the parent context contains the struct or object that represents
354 the header, and use the default callback here to add the level, tag and
355 raw_value as plain text in a member of that struct or object, thus preserving
356 the information. The application can then write this out when the
357 data is saved again in a GEDCOM file. To make it more specific, consider
358 the following example:<br>
<blockquote><code>struct header {<br>
&nbsp;&nbsp;char* source;<br>
&nbsp;&nbsp;char* extra_text;<br>
};<br>
<br>
Gedcom_ctxt my_header_start_cb(Gedcom_rec rec, int level, Gedcom_val xref, char* tag,<br>
&nbsp;&nbsp;&nbsp;&nbsp;char* raw_value, int parsed_tag, Gedcom_val parsed_value)<br>
{<br>
&nbsp;&nbsp;/* the context is a pointer, so the helper is assumed to return a pointer to the struct */<br>
&nbsp;&nbsp;struct header* head = my_make_header_struct();<br>
&nbsp;&nbsp;return (Gedcom_ctxt)head;<br>
}<br>
<br>
void my_default_cb(Gedcom_elt elt, Gedcom_ctxt parent, int level, char* tag, char*
raw_value, int parsed_tag)<br>
{<br>
&nbsp;&nbsp;struct header* head = (struct header*)parent;<br>
&nbsp;&nbsp;my_header_add_to_extra_text(head, level, tag, raw_value);<br>
}<br>
<br>
gedcom_set_default_callback(my_default_cb);<br>
gedcom_subscribe_to_record(REC_HEAD, my_header_start_cb, NULL);<br>
<br>
result = gedcom_parse_file(filename);</code><br>
Note that the default callback will be called for any tag that the application
has not specifically subscribed to, and can thus be called
389 in various contexts. For simplicity, the example above doesn't take
390 this into account (the <code>parent</code> could be of different
391 types, depending on the context).<br>
393 Note also that the default callback is not called when the parent context
is <code>NULL</code>. This is e.g. the case if none
of the "upper" tags has been subscribed to.<br>
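The helper <code>my_header_add_to_extra_text</code> used in the example above is left to the application. One possible sketch is given here, assuming that the helper receives a pointer to the header struct and that each unsupported line is stored as "level tag value" followed by a newline. The storage format, the return value and the handling of a <code>NULL</code> raw value are assumptions of this sketch, not something prescribed by the library.<br>

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct header {
    char *source;
    char *extra_text;   /* unsupported lines, kept as plain GEDCOM-like text */
};

/* Append "<level> <tag> <raw_value>\n" to head->extra_text.
   Returns 0 on success, 1 if memory could not be allocated. */
int my_header_add_to_extra_text(struct header *head, int level,
                                const char *tag, const char *raw_value)
{
    size_t old_len, add_len;
    char *grown;

    assert(head != NULL && tag != NULL);
    old_len = head->extra_text ? strlen(head->extra_text) : 0;

    /* snprintf with a NULL buffer only computes the needed length (C99) */
    if (raw_value != NULL)
        add_len = (size_t)snprintf(NULL, 0, "%d %s %s\n", level, tag, raw_value);
    else
        add_len = (size_t)snprintf(NULL, 0, "%d %s\n", level, tag);

    grown = realloc(head->extra_text, old_len + add_len + 1);
    if (grown == NULL)
        return 1;                       /* the old text is left untouched */
    head->extra_text = grown;

    if (raw_value != NULL)
        snprintf(grown + old_len, add_len + 1, "%d %s %s\n", level, tag, raw_value);
    else
        snprintf(grown + old_len, add_len + 1, "%d %s\n", level, tag);
    return 0;
}
```

Keeping the level together with the tag and the value means the stored text can later be written back to a GEDCOM file in the same nesting as it was read.<br>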
398 <hr width="100%" size="2">
400 <h2><a name="Other_API_functions"></a>Other API functions<br>
Although the above describes the basic interface of libgedcom, there
are some other functions that allow customizing the behaviour of the library.
These are explained in this section.<br>
407 <h3><a name="Debugging"></a>Debugging</h3>
The library can generate debugging output, both from the library itself
and from the yacc parser. By default,
no debugging output is generated, but this can be customized using the
following function:<br>
414 <blockquote><code>void <b>gedcom_set_debug_level</b> (int level, FILE*
415 trace_output)</code><br>
417 The <code>level</code> can be one of the following values:<br>
421 <li>0: no debugging information (this is the
423 <li>1: only debugging information from libgedcom
425 <li>2: debugging information from libgedcom
430 If the <code>trace_output</code> is <code>NULL</code>, debugging information
431 will be written to <code>stderr</code>, otherwise the given file handle
432 is used (which must be open).<br>
436 <h3><a name="Error_treatment"></a>Error treatment</h3>
One of the previous sections already described the callback to be
registered to get error messages. The library also allows customizing
what happens on an error, using the following function:<br>
442 <blockquote><code>void <b>gedcom_set_error_handling</b> (Gedcom_err_mech
443 mechanism)</code><br>
445 The <code>mechanism</code> can be one of:<br>
449 <li><code>IMMED_FAIL</code>: immediately fail
450 the parsing on an error (this is the default)</li>
451 <li><code>DEFER_FAIL</code>: continue parsing
452 after an error, but return a failure code eventually</li>
453 <li><code>IGNORE_ERRORS</code>: continue parsing
454 after an error, return success always</li>
458 This doesn't influence the generation of error or warning messages,
459 only the behaviour of the parser and its return code.<br>
463 <h3><a name="Compatibility_mode"></a>Compatibility mode<br>
Applications are not necessarily true to the GEDCOM spec (or use a
different version than 5.5). The intention is that the library is
resilient to this, and switches to compatibility mode for files written by specific
programs (detected via the HEAD.SOUR tag). This compatibility mode
can be enabled and disabled via the following function:<br>
472 <blockquote><code>void <b>gedcom_set_compat_handling</b> (int enable_compat)</code><br>
474 The argument can be:<br>
478 <li>0: disable compatibility mode</li>
479 <li>1: allow compatibility mode (this is the
Currently, there is initial compatibility support for files written by ftree and Lifelines (3.0.2).<br>
487 <hr width="100%" size="2">
488 <h2><a name="Converting_character_sets"></a>Converting character sets</h2>
489 All strings passed by the GEDCOM parser to the application are in UTF-8
490 encoding. Typically, an application needs to convert this to something
491 else to be able to display it.<br>
493 The most common case is that the output character set is controlled by
494 the <code>locale</code> mechanism (i.e. via the <code>LANG</code>, <code>
495 LC_ALL</code> or <code>LC_CTYPE</code> environment variables), which also
496 controls the <code>gettext</code> mechanism in the application. <br>
500 The source distribution of <code>
501 gedcom-parse</code> contains an example implementation (<code>utf8-locale.c</code>
502 and <code> utf8-locale.h</code> in the "t" subdirectory of the top directory).
503 Feel free to use it in your source code (it is not part of the library,
504 and it isn't installed anywhere, so you need to take over the source and
505 header file in your application). <br>
507 Its interface is:<br>
510 <pre><code>char *<b>convert_utf8_to_locale</b> (char *input, int *conv_failures);<br>char *<b>convert_locale_to_utf8</b> (char *input);<br></code></pre>
512 Both functions return a pointer to a static buffer that is overwritten
513 on each call. To function properly, the application must first set
514 the locale using the <code>setlocale</code> function (the second step detailed
below). All other steps given below, including setting up and closing
down the conversion handles, are transparently handled by the two functions.
519 If you pass a pointer to an integer to the first function, it will be
520 set to the number of conversion failures, i.e. characters that couldn't
521 be converted; you can also just pass <code>NULL</code> if you are not interested
522 (note that usually, the interesting information is just whether there <i>
523 were</i> conversion failures or not, which is then given by the integer
524 being bigger than zero or not). The second function doesn't need this,
525 because any locale can be converted to UTF-8.<br>
527 You can change the "?" that is output for characters that can't be converted
528 to any string you want, using the following function before the conversion
532 <pre><code>void <b>convert_set_unknown</b> (const char *unknown);</code></pre>
535 If you want to have your own functions for it instead of this example
536 implementation, the following steps need to be taken by the application
537 (more detailed info can be found in the info file of the GNU libc library
538 in the "Generic Charset Conversion" section under "Character Set Handling"
539 or online <a href="http://www.gnu.org/manual/glibc-2.2.3/html_chapter/libc_6.html#SEC99">
543 <li>inclusion of some headers:</li>
<pre><code>#include &lt;locale.h&gt; /* for setlocale */<br>#include &lt;langinfo.h&gt; /* for nl_langinfo */<br>#include &lt;iconv.h&gt; /* for iconv_* functions */<br></code></pre>
554 <li>set the program's current locale to what
555 the user configured in the environment:</li>
561 <pre><code>setlocale(LC_ALL, "");</code><br></pre>
566 <li>open a conversion handle for conversion
567 from UTF-8 to the character set of the current locale (once for the entire
<pre><code>iconv_t iconv_handle;<br>...<br>iconv_handle = iconv_open(nl_langinfo(CODESET), "UTF-8");<br>if (iconv_handle == (iconv_t) -1)<br>  /* signal an error */<br></code></pre>
579 <li>then, every string can be converted
580 using the following:</li>
<pre><code>/* char* in_buf is the input buffer, size_t in_len is its length */<br>/* char* out_buf is the output buffer, size_t out_len is its length */<br><br>size_t nconv;<br>char *in_ptr = in_buf;<br>char *out_ptr = out_buf;<br>nconv = iconv(iconv_handle, &amp;in_ptr, &amp;in_len, &amp;out_ptr, &amp;out_len);</code></pre>
590 <blockquote>If the output buffer is not big enough, <code>iconv</code> will
591 return -1 and set <code>errno</code> to <code>E2BIG</code>. Also,
592 the <code>in_ptr</code> and <code>out_ptr</code> will point just after
593 the last successfully converted character in the respective buffers, and
594 the <code> in_len</code> and <code>out_len</code> will be updated to show
595 the remaining lengths. There can be two strategies here:<br>
598 <li>Make sure from the beginning
599 that the output buffer is big enough. However, it's difficult to find
600 an absolute maximum length in advance, even given the length of the input
604 <li>Do the conversion in several
605 steps, growing the output buffer each time to make more space, and calling
606 <code>iconv</code> consecutively until the conversion is complete.
607 This is the preferred way (a function could be written to encapsulate
611 Another error case is when the conversion was unsuccessful (if one of
612 the characters can't be represented in the target character set). The
613 <code> iconv</code> function will then also return -1 and set <code>errno</code>
614 to <code>EILSEQ</code>; the <code>in_ptr</code> will point to the character
615 that couldn't be converted. In that case, again two strategies are
619 <li>Just fail the conversion,
620 and show an error. This is not very user friendly, of course.<br>
623 <li>Skip over the character that
624 can't be converted and append a "?" to the output buffer, then call <code>
625 iconv</code> again. Skipping over a UTF-8 character is fairly simple,
626 as follows from the <a href="http://www.cl.cam.ac.uk/%7Emgk25/unicode.html#utf-8">encoding rules</a>
634 <li>if the first byte is in
635 binary 0xxxxxxx, then the character is only one byte long, just skip over
639 <li>if the first byte is in
640 binary 11xxxxxx, then skip over that byte and all bytes 10xxxxxx that follow.<br>
<li>finally, the conversion
handle needs to be closed (when the program exits):<br>
657 <pre><code>iconv_close(iconv_handle);<br></code></pre>
660 The example implementation
661 mentioned above grows the output buffer dynamically and outputs "?" for characters
662 that can't be converted.<br>
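The conversion steps described above can be sketched in one function, as follows. This is not the code of <code>utf8-locale.c</code>: the names <code>utf8_skip</code>, <code>grow_buffer</code> and <code>convert_from_utf8</code> are made up for this illustration, the result is returned in allocated memory instead of a static buffer, and flushing the shift state of stateful target encodings is left out. It assumes the handle was opened with <code>iconv_open</code> as shown above.<br>

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Return a pointer just past the UTF-8 character that starts at p. */
static const char *utf8_skip(const char *p)
{
    unsigned char first;

    assert(p != NULL);
    first = (unsigned char)*p;
    if (first == 0)
        return p;                       /* end of string: nothing to skip */
    p++;                                /* skip the first byte */
    if (first >= 0x80)                  /* not 0xxxxxxx: multi-byte character */
        while (((unsigned char)*p & 0xC0) == 0x80)
            p++;                        /* skip the 10xxxxxx bytes */
    return p;
}

/* Double the size of the output buffer; returns 0 if out of memory. */
static int grow_buffer(char **buf, size_t *size)
{
    char *bigger = realloc(*buf, *size * 2);
    if (bigger == NULL)
        return 0;
    *buf = bigger;
    *size *= 2;
    return 1;
}

/* Convert the NUL-terminated UTF-8 string `in` using an open handle.
   Returns a malloc'ed string (the caller frees it), or NULL on failure.
   If failures is not NULL, it receives the number of "?" substitutions. */
static char *convert_from_utf8(iconv_t cd, const char *in, int *failures)
{
    size_t in_len = strlen(in);
    size_t size = in_len + 1;           /* first guess, grown when needed */
    size_t used = 0;
    char *in_ptr = (char *)in;          /* iconv takes a non-const pointer */
    char *out = malloc(size);
    int failed = 0;

    if (out == NULL)
        return NULL;
    while (in_len > 0) {
        char *out_ptr = out + used;
        size_t out_len = size - used - 1;   /* keep room for the final NUL */
        size_t rc = iconv(cd, &in_ptr, &in_len, &out_ptr, &out_len);
        int err = errno;

        used = (size_t)(out_ptr - out);
        if (rc != (size_t)-1)
            break;                      /* all input was converted */
        if (err == EILSEQ || err == EINVAL) {
            /* skip the offending character and output "?" instead */
            const char *next = utf8_skip(in_ptr);
            in_len -= (size_t)(next - in_ptr);
            in_ptr = (char *)next;
            if (used + 2 > size && !grow_buffer(&out, &size)) {
                free(out);
                return NULL;
            }
            out[used++] = '?';
            failed++;
        } else if (err != E2BIG || !grow_buffer(&out, &size)) {
            free(out);
            return NULL;
        }
    }
    out[used] = '\0';
    if (failures != NULL)
        *failures = failed;
    return out;
}
```

The sketch treats an incomplete character at the end of the input (<code>EINVAL</code>) in the same way as a character that can't be converted, which is a choice of this example.<br>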
665 <hr width="100%" size="2">
667 <h2><a name="Support_for_configure.in"></a>Support for configure.in</h2>
668 Programs using the GEDCOM parser library and using autoconf to configure
669 their sources can use the following statements in configure.in (the example
670 is checking for gedcom-parse, version 1.34):<br>
672 <blockquote><code>AC_CHECK_LIB(gedcom, gedcom_parse_file,,<br>
673 AC_MSG_ERROR(Cannot
674 find libgedcom: Please install gedcom-parse))<br>
675 AC_MSG_CHECKING(for libgedcom version)<br>
#include &lt;stdio.h&gt;<br>
#include &lt;stdlib.h&gt;<br>
#include &lt;gedcom.h&gt;<br>
683 if (GEDCOM_PARSE_VERSION >= 1034) exit(0);<br>
686 ac_gedcom_version_ok='yes',<br>
687 ac_gedcom_version_ok='no',<br>
688 ac_gedcom_version_ok='no')<br>
689 if test "$ac_gedcom_version_ok" = 'yes' ; then<br>
690 AC_MSG_RESULT(ok)<br>
692 AC_MSG_RESULT(not ok)<br>
693 AC_MSG_ERROR(You need at least version 1.34 of gedcom-parse)<br>
696 There are three preprocessor symbols defined for version checks in the
700 <li><code>GEDCOM_PARSE_VERSION_MAJOR</code></li>
701 <li><code>GEDCOM_PARSE_VERSION_MINOR</code></li>
702 <li><code>GEDCOM_PARSE_VERSION</code><br>
The last one is equal to <code>(GEDCOM_PARSE_VERSION_MAJOR * 1000) + GEDCOM_PARSE_VERSION_MINOR</code>.<br>
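As a small illustration of this encoding, the following sketch computes the combined number for a given major and minor version and compares it against a minimum. <code>MY_ENCODE_VERSION</code> and <code>version_at_least</code> are hypothetical names that are not part of <code>gedcom.h</code>; only the formula itself comes from the description above.<br>

```c
#include <assert.h>

/* Hypothetical helper macro, not part of gedcom.h: encodes a version in the
   same way as described for GEDCOM_PARSE_VERSION. */
#define MY_ENCODE_VERSION(major, minor) (((major) * 1000) + (minor))

/* Return 1 if the encoded version `have` is at least major.minor, else 0. */
static int version_at_least(int have, int major, int minor)
{
    assert(major >= 0 && minor >= 0 && minor < 1000);
    return have >= MY_ENCODE_VERSION(major, minor);
}
```

With this encoding, version 1.34 becomes 1034, which is the value the configure check above compares against. In application code the same comparison can be made at compile time, for example with <code>#if GEDCOM_PARSE_VERSION &gt;= 1034</code>.<br>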
708 <hr width="100%" size="2">