1 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head><title>Using the GEDCOM parser library</title>
4 <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1"></head><body text="#000000" bgcolor="#ffffff" link="#000099" vlink="#990099" alink="#000099">
6 <h1 align="center">Using the GEDCOM parser library</h1>
<li><a href="#Overview">Overview</a></li>
13 <li><a href="#Error_handling">Error handling</a></li>
14 <li><a href="#Data_callback_mechanism">Data callback mechanism</a></li>
17 <li><a href="#Start_and_end_callbacks">Start and end callbacks</a></li>
18 <li><a href="#Default_callbacks">Default callbacks</a></li>
21 <li><a href="#Other_API_functions">Other API functions</a></li>
24 <li><a href="#Debugging">Debugging</a></li>
25 <li><a href="#Error_treatment">Error treatment</a></li>
26 <li><a href="#Compatibility_mode">Compatibility mode</a></li>
28 </ul><li><a href="#Converting_character_sets">Converting character sets</a><br>
31 <li><a href="interface.html">Interface details</a><br>
36 <hr width="100%" size="2">
37 <h2><a name="Overview"></a>Overview<br>
39 The GEDCOM parser library is built as a callback-based parser (comparable
40 to the SAX interface of XML). It comes with:<br>
43 <li>a library (<code>libgedcom.so</code>), to be linked in the application
45 <li>a header file (<code>gedcom.h</code>), to be used in the sources
46 of the application program</li>
47 <li>a header file (<code>gedcom-tags.h</code>) that is also installed,
48 but that is automatically included via <code>gedcom.h</code><br>
Next to these, there is also a data directory in <code>$PREFIX/share/gedcom-parse</code>
that contains some additional files, which are not needed
at first. I'll leave the description of the data directory
The simplest use of the GEDCOM parser is the following
piece of code (inclusion of the gedcom header is assumed, as everywhere in
61 <blockquote><code>int result;<br>
63 result = <b>gedcom_parse_file</b>("myfamily.ged");<br>
65 Although this will not provide much information, one thing it does
66 is parse the entire file and return the result. The function returns
67 0 on success and 1 on failure. No other information is available using
68 this function only.<br>
The next sections refine this to obtain meaningful error messages
and the actual data that is in the file.<br>
73 <hr width="100%" size="2">
74 <h2><a name="Error_handling"></a>Error handling</h2>
75 Since this is a relatively simple topic, it is discussed before the
76 actual callback mechanism, although it also uses a callback...<br>
The library can be used in several different circumstances, both terminal-based
and GUI-based. Therefore, it leaves the actual display of messages
up to the application. For this, the application needs to register
a callback before parsing the GEDCOM file, which the library will call
for errors, warnings and messages.<br>
84 A typical piece of code would be:<br>
<blockquote><code>void <b>my_message_handler</b> (Gedcom_msg_type type, char *msg)<br>
{<br>
&nbsp;&nbsp;/* display msg, e.g. with printf */<br>
}<br>
<br>
<b>gedcom_set_message_handler</b>(my_message_handler);<br>
<br>
result = <b>gedcom_parse_file</b>("myfamily.ged");</code><br>
96 In the above piece of code, <code>my_message_handler</code> is the callback
97 that will be called for errors (<code>type=ERROR</code>), warnings (<code>type=WARNING</code>) and messages (<code>type=MESSAGE</code>). The
callback must have the signature shown in the example. For errors, the
<code>msg</code> passed to the callback will have the format:<br>
<blockquote><code>Error on line</code> <i>&lt;lineno&gt;</i>: <i>&lt;actual_message&gt;</i><br>
103 Note that the entire string will be properly internationalized, and
104 encoded in UTF-8 (<a href="encoding.html">Why UTF-8?</a>). Also,
105 no newline is appended, so that the application program can use it in any
106 way it wants. Warnings are similar, but use "Warning" instead of
107 "Error". Messages are plain text, without any prefix.<br>
109 With this in place, the resulting code will already show errors and
110 warnings produced by the parser, e.g. on the terminal if a simple <code>
111 printf</code> is used in the message handler.<br>
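As an illustration, a message handler could send errors and warnings to <code>stderr</code> and plain messages to <code>stdout</code>, adding the newline that the library leaves out. This is a minimal sketch under stated assumptions: the <code>Gedcom_msg_type</code> enumeration below is a local stand-in for the declaration in <code>gedcom.h</code> (so that the block compiles without the library), the <code>char *msg</code> parameter type is assumed, and the two counters are hypothetical bookkeeping, not part of the library.<br>

```c
#include <stdio.h>

/* Stand-in for the type declared in gedcom.h, so that this sketch compiles
   on its own; a real application would include gedcom.h instead. */
typedef enum { ERROR, WARNING, MESSAGE } Gedcom_msg_type;

/* Hypothetical bookkeeping, e.g. for a summary after parsing. */
int error_count = 0;
int warning_count = 0;

/* Errors and warnings go to stderr, plain messages to stdout.  The library
   appends no newline to msg, so the handler adds one itself. */
void my_message_handler(Gedcom_msg_type type, char *msg)
{
  switch (type) {
  case ERROR:
    error_count++;
    fprintf(stderr, "%s\n", msg);
    break;
  case WARNING:
    warning_count++;
    fprintf(stderr, "%s\n", msg);
    break;
  default:
    printf("%s\n", msg);
    break;
  }
}
```

A GUI application could use the same structure, but show the text in a dialog or a log window instead of writing it to a stream.<br>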
113 <hr width="100%" size="2">
114 <h2><a name="Data_callback_mechanism"></a>Data callback mechanism</h2>
115 The most important use of the parser is of course to get the data out
116 of the GEDCOM file. As already mentioned, the parser uses a callback
117 mechanism for that. In fact, the mechanism involves two levels.<br>
At the primary level, each section of a GEDCOM file is reported
to the application code via a "start element" callback and an "end element"
callback (much like in a SAX interface for XML), i.e. when a line containing
122 a certain tag is parsed, the "start element" callback is called for that
123 tag, and when all its subordinate lines with their tags have been processed,
124 the "end element" callback is called for the original tag. Since GEDCOM
125 is hierarchical, this results in properly nested calls to appropriate "start
126 element" and "end element" callbacks.<br>
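The nesting can be pictured with a small stand-alone model. The sketch below is not the library's implementation: it only shows how properly nested "start" and "end" events follow from the level numbers of the lines. All names in it (<code>walk_lines</code>, <code>events</code>, <code>log_event</code>) are hypothetical.<br>

```c
#include <stdio.h>
#include <string.h>

#define MAX_DEPTH 32
#define MAX_EVENTS 64

/* Event log, so that the order of the events can be inspected afterwards. */
char events[MAX_EVENTS][16];
int n_events = 0;

static void log_event(const char *kind, const char *tag)
{
  if (n_events < MAX_EVENTS)
    snprintf(events[n_events++], sizeof events[0], "%s %s", kind, tag);
}

/* Walk over (level, tag) pairs and emit nested start/end events: before a
   line at level n is started, every open element at level >= n is ended. */
void walk_lines(const int *levels, const char **tags, int count)
{
  const char *open_tags[MAX_DEPTH];
  int depth = 0;                 /* number of currently open elements */
  int i;

  for (i = 0; i < count; i++) {
    if (levels[i] < 0 || levels[i] >= MAX_DEPTH || levels[i] > depth)
      return;                    /* invalid level; a real parser reports an error */
    while (depth > levels[i])
      log_event("end", open_tags[--depth]);
    open_tags[depth++] = tags[i];
    log_event("start", tags[i]);
  }
  while (depth > 0)              /* end of input closes whatever is still open */
    log_event("end", open_tags[--depth]);
}
```

For the lines <code>0 HEAD</code>, <code>1 SOUR</code>, <code>2 VERS</code>, <code>1 CHAR</code>, <code>0 TRLR</code>, the model logs the start of HEAD, SOUR and VERS, then the end of VERS and SOUR, the start and end of CHAR, the end of HEAD, and finally the start and end of TRLR.<br>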
However, a genealogy program typically supports only
a subset of the GEDCOM standard, especially a program that is still under
development. Moreover, GEDCOM allows an application
to define its own tags, which will typically not be supported by other
applications. Even then, data preservation is important:
it would hardly be acceptable for information that a certain program does not
understand to be simply discarded.<br>
Therefore, the second level of callbacks involves a "default callback".
An application needs to subscribe to callbacks for the tags it supports,
and needs to provide a "default callback" which will be called for tags
it doesn't support. The application can then choose to just store
the information that comes via the default callback in plain textual format.<br>
142 After this introduction, let's see what the API looks like...<br>
145 <h3><a name="Start_and_end_callbacks"></a>Start and end callbacks</h3>
147 <h4><i>Callbacks for records</i> <br>
149 As a simple example, we will get some information from the header of
150 a GEDCOM file. First, have a look at the following piece of code:<br>
<blockquote><code>Gedcom_ctxt <b>my_header_start_cb</b> (int level,<br>
&nbsp;&nbsp;&nbsp;&nbsp;Gedcom_val xref,<br>
&nbsp;&nbsp;&nbsp;&nbsp;char *tag,<br>
&nbsp;&nbsp;&nbsp;&nbsp;char *raw_value,<br>
&nbsp;&nbsp;&nbsp;&nbsp;int parsed_tag,<br>
&nbsp;&nbsp;&nbsp;&nbsp;Gedcom_val parsed_value)<br>
{<br>
&nbsp;&nbsp;printf("The header starts\n");<br>
&nbsp;&nbsp;return (Gedcom_ctxt)1;<br>
}<br>
<br>
void <b>my_header_end_cb</b> (Gedcom_ctxt self)<br>
{<br>
&nbsp;&nbsp;printf("The header ends, context is %ld\n", (long)self); /* context
will print as "1" */<br>
}<br>
<br>
/* in the application's setup code, before parsing: */<br>
<b>gedcom_subscribe_to_record</b>(REC_HEAD, my_header_start_cb,
my_header_end_cb);<br>
<br>
result = <b>gedcom_parse_file</b>("myfamily.ged");</code><br>
181 Using the <code>gedcom_subscribe_to_record</code> function, the application
182 requests to use the specified callbacks as start and end callback. The end
183 callback is optional: you can pass <code>NULL</code> if you are not interested
184 in the end callback. The identifiers to use as first argument to
185 the function (here <code>REC_HEAD</code>) are described in the <a href="interface.html#Record_identifiers">
186 interface details</a>.<br>
188 From the name of the function it becomes clear that this function is
189 specific to complete records. For the separate elements in records
190 there is another function, which we'll see shortly. Again, the callbacks
191 need to have the signatures as shown in the example.<br>
193 The <code>Gedcom_ctxt</code> type that is used as a result of the start
194 callback and as an argument to the end callback is vital for passing context
195 necessary for the application. This type is meant to be opaque; in
196 fact, it's a void pointer, so you can pass anything via it. The important
197 thing to know is that the context that the application returns in the start
198 callback will be passed in the end callback as an argument, and as we will
199 see shortly, also to all the directly subordinate elements of the record.<br>
The <code>tag</code> is the GEDCOM tag in string format, the <code>parsed_tag</code>
is an integer, for which symbolic values are defined as <code>TAG_HEAD</code>,
<code>TAG_SOUR</code>, <code>TAG_DATA</code>, ... and <code>USERTAG</code>
for the application-specific tags. These values are defined in the
205 header <code>gedcom-tags.h</code> that is installed, and included via <code>
206 gedcom.h</code> (so no need to include <code>gedcom-tags.h</code> yourself).<br>
208 The example passes a simple integer as context, but an application could
209 e.g. pass a <code>struct</code> (or an object in a C++ application) that will contain the information for the
210 header. In the end callback, the application could then e.g. do some
211 finalizing operations on the <code>struct</code> to put it in its database.<br>
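A sketch of that idea follows: the start callback allocates a <code>struct</code> and returns it as context, and the end callback receives it again and finalizes it. The two <code>typedef</code>s are local stand-ins for the declarations in <code>gedcom.h</code> (the text above says <code>Gedcom_ctxt</code> is a void pointer; the definition used here for <code>Gedcom_val</code> is purely an assumption so that the block compiles on its own). The contents of <code>struct header</code> and the <code>headers_stored</code> counter are hypothetical.<br>

```c
#include <stdlib.h>

/* Stand-ins for the types declared in gedcom.h (assumptions, see above). */
typedef void *Gedcom_ctxt;
typedef void *Gedcom_val;

/* Hypothetical application-side representation of the header. */
struct header {
  int level;      /* level of the record line, as passed by the parser */
  int complete;   /* set once the end callback has run */
};

int headers_stored = 0;   /* hypothetical stand-in for "put it in the database" */

Gedcom_ctxt my_header_start_cb(int level, Gedcom_val xref, char *tag,
                               char *raw_value, int parsed_tag,
                               Gedcom_val parsed_value)
{
  struct header *head = calloc(1, sizeof *head);

  (void)xref; (void)tag; (void)raw_value;       /* unused in this sketch */
  (void)parsed_tag; (void)parsed_value;
  if (head != NULL)
    head->level = level;
  return (Gedcom_ctxt)head;   /* handed back to the end callback later on */
}

void my_header_end_cb(Gedcom_ctxt self)
{
  struct header *head = (struct header *)self;

  if (head == NULL)           /* allocation failed in the start callback */
    return;
  head->complete = 1;         /* finalizing operations would go here */
  headers_stored++;
  free(head);                 /* a real application would keep or store it */
}
```

The same pattern works for any record: whatever the start callback returns is what the end callback gets back.<br>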
213 (Note that the <code>Gedcom_val</code> type for the <code>xref</code>
214 and <code>parsed_value</code> arguments was not discussed, see further
218 <h4><i>Callbacks for elements</i></h4>
219 We will now retrieve the SOUR field (the name of the program that wrote
220 the file) from the header:<br>
<blockquote><code>Gedcom_ctxt <b>my_header_source_start_cb</b>(Gedcom_ctxt parent,<br>
&nbsp;&nbsp;&nbsp;&nbsp;int level,<br>
&nbsp;&nbsp;&nbsp;&nbsp;char* tag,<br>
&nbsp;&nbsp;&nbsp;&nbsp;char* raw_value,<br>
&nbsp;&nbsp;&nbsp;&nbsp;int parsed_tag,<br>
&nbsp;&nbsp;&nbsp;&nbsp;Gedcom_val parsed_value)<br>
{<br>
&nbsp;&nbsp;char *source = GEDCOM_STRING(parsed_value);<br>
&nbsp;&nbsp;printf("This file was written by %s\n", source);<br>
&nbsp;&nbsp;return parent;<br>
}<br>
<br>
void <b>my_header_source_end_cb</b>(Gedcom_ctxt parent,<br>
&nbsp;&nbsp;&nbsp;&nbsp;Gedcom_ctxt self,<br>
&nbsp;&nbsp;&nbsp;&nbsp;Gedcom_val parsed_value)<br>
{<br>
&nbsp;&nbsp;printf("End of the source description\n");<br>
}<br>
<br>
/* in the application's setup code, before parsing: */<br>
<b>gedcom_subscribe_to_element</b>(ELT_HEAD_SOUR,<br>
&nbsp;&nbsp;&nbsp;&nbsp;my_header_source_start_cb,<br>
&nbsp;&nbsp;&nbsp;&nbsp;my_header_source_end_cb);<br>
<br>
result = <b>gedcom_parse_file</b>("myfamily.ged");</code><br>
The subscription mechanism for elements is similar; only the signatures
of the callbacks differ. The signature of the start callback shows
that the context of the parent line (here e.g. the <code>struct</code> that
describes the header) is passed to this start callback. In this example the callback
returns the same context, but it could of course return its own context object.
The end callback is called with both the context of the
parent and the context of itself, which in this example will be the same.
Again, the list of identifiers to use as a first argument for the
subscription function is detailed in the <a href="interface.html#Element_identifiers">
interface details</a>.<br>
274 If we look at the other arguments of the start callback, we see the
275 level number (the initial number of the line in the GEDCOM file), the tag
276 (e.g. "SOUR"), and then a raw value, a parsed tag and a parsed value. The
277 raw value is just the raw string that occurs as value on the line next to
278 the tag (in UTF-8 encoding). The parsed value is the meaningful value
279 that is parsed from that raw string. The parsed tag is described in
280 the section for record callbacks above.<br>
282 The <code>Gedcom_val</code> type is meant to be an opaque type. The
283 only thing that needs to be known about it is that it can contain specific
284 data types, which have to be retrieved from it using pre-defined macros.
285 These data types are described in the <a href="interface.html#Gedcom_val_types">
286 interface details</a>. <br>
288 Some extra notes:<br>
291 <li>The <code>Gedcom_val</code> argument of the end callback
292 is currently not used. It is there for future enhancements.</li>
293 <li>There are also two <code>Gedcom_val</code> arguments in
294 the start callback for records. The first one (<code>xref</code>) contains the <code>xref_value</code> corresponding to the cross-reference (or <code>NULL</code> if there isn't one), the second one (<code>parsed_value</code>) contains the value that is parsed from the <code>raw_value</code>. See the <a href="interface.html#Record_identifiers">interface details</a>.</li>
298 <h3><a name="Default_callbacks"></a>Default callbacks<br>
300 As described above, an application doesn't always implement the entire
301 GEDCOM spec, and application-specific tags may have been added by other applications.
302 To preserve this extra data anyway, a default callback can be registered
303 by the application, as in the following example:<br>
305 <blockquote><code>void <b>my_default_cb</b> (Gedcom_ctxt parent,
306 int level, char* tag, char* raw_value, int parsed_tag)<br>
312 <b>gedcom_set_default_callback</b>(my_default_cb);<br>
314 result = <b>gedcom_parse_file</b>("myfamily.ged");</code><br>
This callback has a signature similar to the previous ones,
but it doesn't contain a parsed value. However, it does contain the
parent context, which was returned by the application for the most specific
containing tag that the application supported.<br>
321 Suppose e.g. that this callback is called for some tags in the header
322 that are specific to some other application, then our application could make
323 sure that the parent context contains the struct or object that represents
324 the header, and use the default callback here to add the level, tag and
325 raw_value as plain text in a member of that struct or object, thus preserving
326 the information. The application can then write this out when the
327 data is saved again in a GEDCOM file. To make it more specific, consider
328 the following example:<br>
<blockquote><code>struct header {<br>
&nbsp;&nbsp;char* source;<br>
&nbsp;&nbsp;char* extra_text;<br>
};<br>
<br>
Gedcom_ctxt my_header_start_cb(int level, Gedcom_val xref, char* tag,<br>
&nbsp;&nbsp;&nbsp;&nbsp;char* raw_value,<br>
&nbsp;&nbsp;&nbsp;&nbsp;int parsed_tag, Gedcom_val parsed_value)<br>
{<br>
&nbsp;&nbsp;struct header* head = my_make_header_struct();<br>
&nbsp;&nbsp;return (Gedcom_ctxt)head;<br>
}<br>
<br>
void my_default_cb(Gedcom_ctxt parent, int level, char* tag, char* raw_value,<br>
&nbsp;&nbsp;&nbsp;&nbsp;int parsed_tag)<br>
{<br>
&nbsp;&nbsp;struct header* head = (struct header*)parent;<br>
&nbsp;&nbsp;my_header_add_to_extra_text(head, level, tag, raw_value);<br>
}<br>
<br>
gedcom_set_default_callback(my_default_cb);<br>
gedcom_subscribe_to_record(REC_HEAD, my_header_start_cb, NULL);<br>
<br>
result = gedcom_parse_file(filename);</code><br>
Note that the default callback will be called for any tag that the
application has not specifically subscribed to, and can thus be called
in various contexts. For simplicity, the example above doesn't take
this into account (the <code>parent</code> could be of different
types, depending on the context).<br>
Note also that the default callback is not called when the parent context is <code>NULL</code>. This is e.g. the case if none of the "upper" tags has been subscribed to.<br>
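The helper <code>my_header_add_to_extra_text</code> used in the example is left to the application. One possible shape, purely as a sketch, is given below: it takes a pointer to the header struct and appends the line as "level tag value" text to a dynamically growing string. The return value convention and the handling of an empty value are assumptions of this sketch.<br>

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct header {
  char *source;
  char *extra_text;   /* lines the application does not understand */
};

/* Append "<level> <tag> <raw_value>\n" to head->extra_text, growing the
   buffer as needed.  Returns 0 on success and -1 on failure, in which case
   the text collected so far is left untouched. */
int my_header_add_to_extra_text(struct header *head, int level,
                                const char *tag, const char *raw_value)
{
  const char *value = raw_value ? raw_value : "";
  const char *sep = value[0] ? " " : "";   /* no trailing blank without value */
  size_t old_len = head->extra_text ? strlen(head->extra_text) : 0;
  int add_len = snprintf(NULL, 0, "%d %s%s%s\n", level, tag, sep, value);
  char *grown;

  if (add_len < 0)
    return -1;
  grown = realloc(head->extra_text, old_len + (size_t)add_len + 1);
  if (grown == NULL)
    return -1;
  snprintf(grown + old_len, (size_t)add_len + 1, "%d %s%s%s\n",
           level, tag, sep, value);
  head->extra_text = grown;
  return 0;
}
```

Because the stored text keeps the level numbers and tags, the application can write it back unchanged when the data is saved to a GEDCOM file again.<br>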
365 <hr width="100%" size="2">
367 <h2><a name="Other_API_functions"></a>Other API functions<br>
Although the above describes the basic interface of libgedcom, there
are some other functions that allow customizing the behaviour of the library.
These are explained in this section.<br>
373 <h3><a name="Debugging"></a>Debugging</h3>
The library can generate debugging output, not only from itself,
but also from the yacc parser. By default,
no debugging output is generated, but this can be changed using the
following function:<br>
379 <blockquote><code>void <b>gedcom_set_debug_level</b> (int level,
380 FILE* trace_output)</code><br>
382 The <code>level</code> can be one of the following values:<br>
385 <li>0: no debugging information (this is the
387 <li>1: only debugging information from libgedcom
389 <li>2: debugging information from libgedcom and
393 If the <code>trace_output</code> is <code>NULL</code>, debugging information
394 will be written to <code>stderr</code>, otherwise the given file handle
395 is used (which must be open).<br>
398 <h3><a name="Error_treatment"></a>Error treatment</h3>
One of the previous sections already described the callback to be registered
to get error messages. The library also allows customizing what
happens on an error, using the following function:<br>
403 <blockquote><code>void <b>gedcom_set_error_handling</b> (Gedcom_err_mech
404 mechanism)</code><br>
406 The <code>mechanism</code> can be one of:<br>
410 <li><code>IMMED_FAIL</code>: immediately fail the
411 parsing on an error (this is the default)</li>
<li><code>DEFER_FAIL</code>: continue parsing after
an error, but return a failure code in the end</li>
414 <li><code>IGNORE_ERRORS</code>: continue parsing
415 after an error, return success always</li>
419 This doesn't influence the generation of error or warning messages, only
420 the behaviour of the parser and its return code.<br>
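The difference between the three mechanisms can be made concrete with a small model. The sketch below is not library code: the enumeration is a local stand-in that only mirrors the names listed above, and <code>model_parse</code> is a hypothetical function that pretends each input line is either fine or erroneous. It uses the return convention described in the overview (0 on success, 1 on failure).<br>

```c
/* Stand-in mirroring the names documented above; a real application would
   use the declaration from gedcom.h. */
typedef enum { IMMED_FAIL, DEFER_FAIL, IGNORE_ERRORS } Gedcom_err_mech;

/* Model of the parser's behaviour: line_has_error[i] is nonzero if line i
   is erroneous.  Stores the number of lines looked at in *lines_seen and
   returns 0 for success or 1 for failure. */
int model_parse(const int *line_has_error, int count,
                Gedcom_err_mech mechanism, int *lines_seen)
{
  int failed = 0;
  int i;

  for (i = 0; i < count; i++) {
    if (!line_has_error[i])
      continue;
    if (mechanism == IMMED_FAIL) {
      *lines_seen = i + 1;      /* stop right at the first error */
      return 1;
    }
    if (mechanism == DEFER_FAIL)
      failed = 1;               /* remember the error, but carry on */
    /* IGNORE_ERRORS: carry on and forget about it */
  }
  *lines_seen = count;
  return failed;
}
```

With an error on the second of four lines, the model stops after two lines under <code>IMMED_FAIL</code>, and processes all four lines under the other two mechanisms, returning failure for <code>DEFER_FAIL</code> and success for <code>IGNORE_ERRORS</code>.<br>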
424 <h3><a name="Compatibility_mode"></a>Compatibility mode<br>
Applications do not necessarily adhere to the GEDCOM spec (or they use a different
version than 5.5). The intention is that the library is resilient
to this, and goes into compatibility mode for files written by specific programs
(detected via the HEAD.SOUR tag). This compatibility mode can be
enabled and disabled via the following function:<br>
433 <blockquote><code>void <b>gedcom_set_compat_handling</b>
434 (int enable_compat)</code><br>
436 The argument can be:<br>
440 <li>0: disable compatibility mode</li>
441 <li>1: allow compatibility mode (this is the default)<br>
446 Note that, currently, no actual compatibility code is present, but this
447 is on the to-do list.<br>
448 <hr width="100%" size="2">
449 <h2><a name="Converting_character_sets"></a>Converting character sets</h2>
450 All strings passed by the GEDCOM parser to the application are in UTF-8 encoding.
451 Typically, an application needs to convert this to something else to
452 be able to display it.<br>
454 The most common case is that the output character set is controlled by the <code>locale</code> mechanism (i.e. via the <code>LANG</code>, <code>LC_ALL</code> or <code>LC_CTYPE</code> environment variables), which also controls the <code>gettext</code>
455 mechanism in the application. <br>
The source distribution of <code>gedcom-parse</code> contains an example implementation (<code>utf8-locale.c</code> and <code>utf8-locale.h</code>
in the top directory). Feel free to use it in
your source code (it is not part of the library, and it isn't installed anywhere,
so you need to copy the source and header file into your application).
469 Its interface is:<br>
470 <blockquote><pre><code>char *<b>convert_utf8_to_locale</b> (char *input, int *conv_failures);<br>char *<b>convert_locale_to_utf8</b> (char *input);<br></code></pre></blockquote>
Both functions return a pointer to a static buffer that is overwritten on
each call. To function properly, the application must first set the
locale using the <code>setlocale</code> function (the second step detailed below).
All other steps given below, including setting up and closing down the conversion
handles, are transparently handled by the two functions. <br>
478 If you pass a pointer to an integer to the first function, it will be set
479 to the number of conversion failures, i.e. characters that couldn't be converted;
480 you can also just pass <code>NULL</code> if you are not interested (note that usually, the interesting information is just whether there <i>were</i>
481 conversion failures or not, which is then given by the integer being bigger
482 than zero or not). The second function doesn't need this, because any
483 locale can be converted to UTF-8.<br>
486 You can change the "?" that is output for characters that can't be converted
487 to any string you want, using the following function before the conversion
489 <blockquote><pre><code>void <b>convert_set_unknown</b> (const char *unknown);</code></pre></blockquote>
If you want to write your own functions instead of using this example implementation, the following steps need to
492 be taken by the application (more detailed info can be found in the info
493 file of the GNU libc library in the "Generic Charset Conversion" section
494 under "Character Set Handling" or online <a href="http://www.gnu.org/manual/glibc-2.2.3/html_chapter/libc_6.html#SEC99">here</a>):<br>
496 <li>inclusion of some headers:</li>
<pre><code>#include &lt;locale.h&gt;    /* for setlocale */<br>#include &lt;langinfo.h&gt;  /* for nl_langinfo */<br>#include &lt;iconv.h&gt;     /* for iconv_* functions */<br></code></pre>
504 <li>set the program's current locale to what the user configured in the environment:</li>
508 <pre><code>setlocale(LC_ALL, "");</code><br></pre>
512 <li>open a conversion handle for conversion from UTF-8 to the character set of the current locale (once for the entire program):</li>
<pre><code>iconv_t iconv_handle;<br>...<br>iconv_handle = iconv_open(nl_langinfo(CODESET), "UTF-8");<br>if (iconv_handle == (iconv_t) -1)<br>  /* signal an error */<br></code></pre>
520 <li>then, every string can be converted using the following:</li>
<pre><code>/* char* in_buf is the input buffer, size_t in_len is its length */<br>/* char* out_buf is the output buffer, size_t out_len is its length */<br><br>size_t nconv;<br>char *in_ptr = in_buf;<br>char *out_ptr = out_buf;<br>nconv = iconv(iconv_handle, &amp;in_ptr, &amp;in_len, &amp;out_ptr, &amp;out_len);</code></pre>
527 <blockquote>If the output buffer is not big enough, <code>iconv</code> will return -1 and set <code>errno</code> to <code>E2BIG</code>. Also, the <code>in_ptr</code> and <code>out_ptr</code> will point just after the last successfully converted character in the respective buffers, and the <code>in_len</code> and <code>out_len</code> will be updated to show the remaining lengths. There can be two strategies here:<br>
529 <li>Make sure from the beginning
530 that the output buffer is big enough. However, it's difficult to find
531 an absolute maximum length in advance, even given the length of the input
535 <li>Do the conversion in several steps, growing the output buffer each time to make more space, and calling <code>iconv</code>
536 consecutively until the conversion is complete. This is the preferred
537 way (a function could be written to encapsulate all this).</li>
539 Another error case is when the conversion was unsuccessful (if one of the
540 characters can't be represented in the target character set). The <code>iconv</code> function will then also return -1 and set <code>errno</code> to <code>EILSEQ</code>; the <code>in_ptr</code> will point to the character that couldn't be converted. In that case, again two strategies are possible:<br>
542 <li>Just fail the conversion, and show an error. This is not very user friendly, of course.<br>
545 <li>Skip over the character that can't be converted and append a "?" to the output buffer, then call <code>iconv</code> again. Skipping over a UTF-8 character is fairly simple, as follows from the <a href="http://www.cl.cam.ac.uk/%7Emgk25/unicode.html#utf-8">encoding rules</a>:</li>
549 <li>if the first byte is in binary 0xxxxxxx, then the character is only one byte long, just skip over that byte<br>
552 <li>if the first byte is in binary 11xxxxxx, then skip over that byte and all bytes 10xxxxxx that follow.<br>
<li>finally, the conversion handle needs to be closed (when the program exits):<br>
563 <pre><code>iconv_close(iconv_handle);<br></code></pre>
566 The example implementation mentioned above grows the output buffer dynamically and outputs
567 "?" for characters that can't be converted.<br>
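The steps above can be combined into one function. The following is a minimal sketch, not the code from <code>utf8-locale.c</code>: the names <code>skip_utf8_char</code> and <code>utf8_to_codeset</code> are hypothetical, the function opens and closes its own conversion handle on every call for simplicity, and it returns a freshly allocated string instead of a static buffer. It grows the output buffer on <code>E2BIG</code> and writes "?" for a character that triggers <code>EILSEQ</code>. Some <code>iconv</code> implementations substitute unconvertible characters themselves instead of reporting <code>EILSEQ</code>, in which case the "?" branch is simply never taken.<br>

```c
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Return a pointer just past the UTF-8 character that starts at p. */
const char *skip_utf8_char(const char *p)
{
  if (((unsigned char)*p & 0x80) == 0)
    return *p ? p + 1 : p;               /* 0xxxxxxx: one byte (stay on '\0') */
  p++;                                   /* lead byte */
  while (((unsigned char)*p & 0xC0) == 0x80)
    p++;                                 /* trailing bytes 10xxxxxx */
  return p;
}

/* Convert the UTF-8 string input to the character set to_codeset.
   Returns a malloc'ed, '\0'-terminated string that the caller must free,
   or NULL on failure.  If failures is not NULL, it receives the number of
   characters replaced by "?".  Stateful target encodings would also need
   a final flush call, which this sketch omits. */
char *utf8_to_codeset(const char *to_codeset, const char *input, int *failures)
{
  char *in_ptr = (char *)input;          /* iconv wants char **; input is not modified */
  size_t in_len = strlen(input);
  size_t size = in_len + 1;              /* first guess, grown on demand */
  size_t used = 0;
  int nfail = 0;
  int ok = 1;
  char *out;
  iconv_t cd = iconv_open(to_codeset, "UTF-8");

  if (cd == (iconv_t)-1)
    return NULL;
  out = malloc(size);
  if (out == NULL) {
    iconv_close(cd);
    return NULL;
  }

  while (ok && in_len > 0) {
    char *out_ptr = out + used;
    size_t out_len = size - used - 1;    /* keep room for the final '\0' */
    size_t rc = iconv(cd, &in_ptr, &in_len, &out_ptr, &out_len);
    int err = errno;

    used = (size_t)(out_ptr - out);
    if (rc != (size_t)-1)
      break;                             /* all input converted */
    if (err == EILSEQ && used + 1 < size) {
      const char *next = skip_utf8_char(in_ptr);
      out[used++] = '?';                 /* placeholder for the skipped character */
      in_len -= (size_t)(next - in_ptr);
      in_ptr = (char *)next;
      nfail++;
    } else if (err == E2BIG || err == EILSEQ) {
      char *bigger = realloc(out, size * 2);   /* make more room, then retry */
      if (bigger == NULL) {
        ok = 0;
      } else {
        out = bigger;
        size *= 2;
      }
    } else {
      ok = 0;                            /* e.g. EINVAL: input ends mid-character */
    }
  }
  iconv_close(cd);
  if (!ok) {
    free(out);
    return NULL;
  }
  out[used] = '\0';
  if (failures != NULL)
    *failures = nfail;
  return out;
}
```

To convert to the character set of the current locale, an application would call <code>setlocale(LC_ALL, "")</code> once at startup and then pass <code>nl_langinfo(CODESET)</code> as the first argument.<br>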
569 <hr width="100%" size="2">