1 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
4 <title>Using the GEDCOM parser library</title>
6 <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
10 <h1 align="center">Using the GEDCOM parser library</h1>
<li><a href="#Overview">Overview</a></li>
17 <li><a href="#Error_handling">Error handling</a></li>
18 <li><a href="#Data_callback_mechanism">Data callback mechanism</a></li>
21 <li><a href="#Start_and_end_callbacks">Start and end callbacks</a></li>
22 <li><a href="#Default_callbacks">Default callbacks</a></li>
25 <li><a href="#Other_API_functions">Other API functions</a></li>
28 <li><a href="#Debugging">Debugging</a></li>
29 <li><a href="#Error_treatment">Error treatment</a></li>
30 <li><a href="#Compatibility_mode">Compatibility mode</a></li>
33 <li><a href="interface.html">Interface details</a><br>
38 <hr width="100%" size="2">
<h2><a name="Overview"></a>Overview</h2>
41 The GEDCOM parser library is built as a callback-based parser (comparable
42 to the SAX interface of XML). It comes with:<br>
45 <li>a library (<code>libgedcom.so</code>), to be linked in the application
47 <li>a header file (<code>gedcom.h</code>), to be used in the sources
48 of the application program</li>
49 <li>a header file (<code>gedcom-tags.h</code>) that is also installed,
50 but that is automatically included via <code>gedcom.h</code><br>
Next to these, there is also a data directory in <code>$PREFIX/share/gedcom-parse</code>
that contains some additional files. It is not needed to get started, so
its description is left for later.<br>
The simplest call of the GEDCOM parser is the following
59 piece of code (include of the gedcom header is assumed, as everywhere in
62 <blockquote><code>int result;<br>
64 result = <b>gedcom_parse_file</b>("myfamily.ged");<br>
This call parses the entire file and returns the result: 0 on success
and 1 on failure. On its own it provides nothing more: no error messages
and none of the data in the file are available through this function
alone.<br>
71 The next sections will refine this to be able to have meaningful errors
72 and the actual data that is in the file.<br>
74 <hr width="100%" size="2">
75 <h2><a name="Error_handling"></a>Error handling</h2>
76 Since this is a relatively simple topic, it is discussed before the actual
77 callback mechanism, although it also uses a callback...<br>
The library can be used in several different circumstances, both terminal-based
and GUI-based. Therefore, it leaves the actual display of the error
message up to the application. For this, the application needs to
register a callback before parsing the GEDCOM file, which will be called
by the library on errors, warnings and messages.<br>
85 A typical piece of code would be:<br>
87 <blockquote><code>void <b>my_message_handler</b> (Gedcom_msg_type type,
93 <b>gedcom_set_message_handler</b>(my_message_handler);<br>
95 result = <b>gedcom_parse_file</b>("myfamily.ged");</code><br>
97 In the above piece of code, <code>my_message_handler</code> is the callback
98 that will be called for errors (<code>type=ERROR</code>), warnings (<code>
99 type=WARNING</code>) and messages (<code>type=MESSAGE</code>). The
100 callback must have the signature as in the example. For errors, the
101 <code> msg</code> passed to the callback will have the format:<br>
<blockquote><code>Error on line</code> <i>&lt;lineno&gt;</i>: <i>&lt;actual_message&gt;</i><br>
105 Note that the entire string will be properly internationalized, and encoded
106 in UTF-8 (see "Why UTF-8?" <i>LINK TBD</i>). Also, no newline
107 is appended, so that the application program can use it in any way it wants.
108 Warnings are similar, but use "Warning" instead of "Error". Messages
109 are plain text, without any prefix.<br>
111 With this in place, the resulting code will already show errors and warnings
112 produced by the parser, e.g. on the terminal if a simple <code>printf</code>
113 is used in the message handler.<br>
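As a minimal sketch of such a handler: the one below sends errors and warnings to stderr and plain messages to stdout, adds the newline that the library leaves out, and keeps simple counters. The <code>Gedcom_msg_type</code> enum is declared locally as a stand-in so that the fragment compiles without the library; in a real program it comes from <code>gedcom.h</code>. The two-argument signature and the counters are assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Stand-in for the message type that gedcom.h provides; declared here only
   so that the sketch compiles without the library (an assumption). */
typedef enum { ERROR, WARNING, MESSAGE } Gedcom_msg_type;

/* Hypothetical counters the application can inspect after parsing. */
int error_count = 0;
int warning_count = 0;

/* Errors and warnings go to stderr, plain messages to stdout.  The library
   does not append a newline, so the handler adds one itself. */
void my_message_handler(Gedcom_msg_type type, char *msg)
{
    FILE *out;

    assert(msg != NULL);
    out = (type == MESSAGE) ? stdout : stderr;

    if (type == ERROR)
        error_count++;
    else if (type == WARNING)
        warning_count++;

    fprintf(out, "%s\n", msg);
}
```

The handler would be registered with <code>gedcom_set_message_handler(my_message_handler)</code> before the call to <code>gedcom_parse_file</code>; a GUI program could collect the strings in a list widget instead of printing them.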
115 <hr width="100%" size="2">
116 <h2><a name="Data_callback_mechanism"></a>Data callback mechanism</h2>
117 The most important use of the parser is of course to get the data out
118 of the GEDCOM file. As already mentioned, the parser uses a callback
119 mechanism for that. In fact, the mechanism involves two levels.<br>
121 The primary level is that each of the sections in a GEDCOM file is notified
122 to the application code via a "start element" callback and an "end element"
123 callback (much like in a SAX interface for XML), i.e. when a line containing
124 a certain tag is parsed, the "start element" callback is called for that
125 tag, and when all its subordinate lines with their tags have been processed,
126 the "end element" callback is called for the original tag. Since GEDCOM
127 is hierarchical, this results in properly nested calls to appropriate "start
128 element" and "end element" callbacks.<br>
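The nesting described above can be illustrated without the library. The sketch below is not the parser's implementation; it is a small model, with hypothetical names (<code>trace_events</code>, <code>emit</code>), that takes the level number and tag of each line and records the order in which "start" and "end" events would occur.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_DEPTH 32

/* Append one event such as "start(HEAD) " to the trace buffer. */
static void emit(char *out, size_t size, const char *kind, const char *tag)
{
    size_t used = strlen(out);
    snprintf(out + used, size - used, "%s(%s) ", kind, tag);
}

/* Model of the event order for lines given as (level, tag) pairs.
   Returns 0 on success, 1 if a line skips a level or nests too deeply. */
int trace_events(const int *levels, const char **tags, int n,
                 char *out, size_t size)
{
    const char *open_tags[MAX_DEPTH];
    int depth = 0;
    int i;

    assert(out != NULL && size > 0);
    out[0] = '\0';

    for (i = 0; i < n; i++) {
        /* A line at level L ends every open element at level L or deeper. */
        while (depth > 0 && depth > levels[i])
            emit(out, size, "end", open_tags[--depth]);

        if (levels[i] != depth || depth >= MAX_DEPTH)
            return 1;

        open_tags[depth++] = tags[i];
        emit(out, size, "start", tags[i]);
    }

    /* End of input closes whatever is still open. */
    while (depth > 0)
        emit(out, size, "end", open_tags[--depth]);
    return 0;
}
```

For the levels 0, 1, 2, 1, 0 with the tags HEAD, SOUR, VERS, DEST, TRLR, the trace reads: start(HEAD) start(SOUR) start(VERS) end(VERS) end(SOUR) start(DEST) end(DEST) end(HEAD) start(TRLR) end(TRLR).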
However, a genealogy program will typically support only a subset of
the GEDCOM standard, especially a program that is still under
development. Moreover, GEDCOM allows an application
to define its own tags, which will typically not be supported by another
application. Still, in that case, data preservation is important:
users will hardly accept that information is silently dropped just
because a certain program does not understand it.<br>
Therefore, the second level of callbacks involves a "default callback".
An application needs to subscribe to callbacks for the tags it does support,
and needs to provide a "default callback" which will be called for tags it
doesn't support. The application can then choose to just store the
information that comes via the default callback in plain textual format.<br>
144 After this introduction, let's see what the API looks like...<br>
147 <h3><a name="Start_and_end_callbacks"></a>Start and end callbacks</h3>
<h4><i>Callbacks for records</i></h4>
151 As a simple example, we will get some information from the header of
152 a GEDCOM file. First, have a look at the following piece of code:<br>
<blockquote><code>Gedcom_ctxt <b>my_header_start_cb</b> (int level,<br>
&nbsp;&nbsp;&nbsp;&nbsp;Gedcom_val xref,<br>
&nbsp;&nbsp;&nbsp;&nbsp;char *tag,<br>
&nbsp;&nbsp;&nbsp;&nbsp;char *raw_value,<br>
&nbsp;&nbsp;&nbsp;&nbsp;int parsed_tag,<br>
&nbsp;&nbsp;&nbsp;&nbsp;Gedcom_val parsed_value)<br>
167 printf("The header starts\n");<br>
168 return (Gedcom_ctxt)1;<br>
171 void <b>my_header_end_cb</b> (Gedcom_ctxt self)<br>
printf("The header ends, context is %ld\n", (long)self); /* context
will print as "1" */<br>
178 <b>gedcom_subscribe_to_record</b>(REC_HEAD, my_header_start_cb,
179 my_header_end_cb);<br>
181 result = <b>gedcom_parse_file</b>("myfamily.ged");</code><br>
183 Using the <code>gedcom_subscribe_to_record</code> function, the application
184 requests to use the specified callbacks as start and end callback. The
185 end callback is optional: you can pass <code>NULL</code> if you are not
186 interested in the end callback. The identifiers to use as first argument
187 to the function (here <code>REC_HEAD</code>) are described in the <a href="interface.html#Record_identifiers">
188 interface details</a>.<br>
190 From the name of the function it becomes clear that this function is
191 specific to complete records. For the separate elements in records
192 there is another function, which we'll see shortly. Again, the callbacks
193 need to have the signatures as shown in the example.<br>
195 The <code>Gedcom_ctxt</code> type that is used as a result of the start
196 callback and as an argument to the end callback is vital for passing context
197 necessary for the application. This type is meant to be opaque; in
198 fact, it's a void pointer, so you can pass anything via it. The important
199 thing to know is that the context that the application returns in the start
200 callback will be passed in the end callback as an argument, and as we will
201 see shortly, also to all the directly subordinate elements of the record.<br>
The <code>tag</code> is the GEDCOM tag in string format; <code>parsed_tag</code>
is an integer, for which symbolic values are defined as <code>TAG_HEAD</code>,
<code>TAG_SOUR</code>, <code>TAG_DATA</code>, ... and <code>USERTAG</code>
for the application-specific tags. These values are defined in the
header <code>gedcom-tags.h</code>, which is installed and included via
<code>gedcom.h</code> (so there is no need to include <code>gedcom-tags.h</code> yourself).<br>
210 The example passes a simple integer as context, but an application could
211 e.g. pass a <code>struct</code> that will contain the information for the
212 header. In the end callback, the application could then e.g. do some
213 finalizing operations on the <code>struct</code> to put it in its database.<br>
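A sketch of that idea follows: the start callback allocates a <code>struct</code> and returns it as context, and the end callback receives the same pointer and finalizes it. The two <code>typedef</code>s are stand-ins so that the fragment compiles without <code>gedcom.h</code> (the text above states that <code>Gedcom_ctxt</code> is a void pointer; treating <code>Gedcom_val</code> as a void pointer here is purely an assumption). The <code>struct header</code> and its <code>complete</code> member are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-ins so the sketch compiles without gedcom.h (assumptions). */
typedef void *Gedcom_ctxt;
typedef void *Gedcom_val;

/* Hypothetical application-side record for the header. */
struct header {
    int complete;   /* set to 1 by the end callback */
};

/* Start callback: create the record and hand it back as context. */
Gedcom_ctxt my_header_start_cb(int level, Gedcom_val xref, char *tag,
                               char *raw_value, int parsed_tag,
                               Gedcom_val parsed_value)
{
    struct header *head = calloc(1, sizeof *head);

    /* The arguments are not needed in this sketch. */
    (void)level; (void)xref; (void)tag;
    (void)raw_value; (void)parsed_tag; (void)parsed_value;

    return (Gedcom_ctxt)head;
}

/* End callback: receives the pointer returned by the start callback. */
void my_header_end_cb(Gedcom_ctxt self)
{
    struct header *head = (struct header *)self;

    /* The sketch treats a failed allocation as a programming error. */
    assert(head != NULL);
    head->complete = 1;   /* e.g. store the finished record in a database */
}
```

Ownership of the allocated record stays with the application, which has to free it once it is no longer needed.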
215 (Note that the <code>Gedcom_val</code> type for the <code>xref</code>
216 and <code>parsed_value</code> arguments was not discussed, see further
220 <h4><i>Callbacks for elements</i></h4>
221 We will now retrieve the SOUR field (the name of the program that wrote
222 the file) from the header:<br>
<blockquote><code>Gedcom_ctxt <b>my_header_source_start_cb</b>(Gedcom_ctxt parent,<br>
&nbsp;&nbsp;&nbsp;&nbsp;int level,<br>
&nbsp;&nbsp;&nbsp;&nbsp;char* tag,<br>
&nbsp;&nbsp;&nbsp;&nbsp;char* raw_value,<br>
&nbsp;&nbsp;&nbsp;&nbsp;int parsed_tag,<br>
&nbsp;&nbsp;&nbsp;&nbsp;Gedcom_val parsed_value)<br>
242 char *source = GEDCOM_STRING(parsed_value);<br>
243 printf("This file was written by %s\n", source);<br>
244 return parent;<br>
void <b>my_header_source_end_cb</b>(Gedcom_ctxt parent,<br>
&nbsp;&nbsp;&nbsp;&nbsp;Gedcom_ctxt self,<br>
&nbsp;&nbsp;&nbsp;&nbsp;Gedcom_val parsed_value)<br>
253 printf("End of the source description\n");<br>
<b>gedcom_subscribe_to_element</b>(ELT_HEAD_SOUR,<br>
&nbsp;&nbsp;&nbsp;&nbsp;my_header_source_start_cb,<br>
&nbsp;&nbsp;&nbsp;&nbsp;my_header_source_end_cb);<br>
263 result = <b>gedcom_parse_file</b>("myfamily.ged");</code><br>
The subscription mechanism for elements is similar; only the signatures
of the callbacks differ. The signature of the start callback shows
that the context of the parent line (e.g. the <code>struct</code> that describes
the header) is passed to this start callback. Here the callback simply
returns that same context, but it could equally return a context object of
its own. The end callback is called with both the context of the parent
and the element's own context, which are the same in the example. Again,
the list of identifiers to use as first argument of the subscription
function is detailed in the <a href="interface.html#Element_identifiers">
interface details</a>.<br>
If we look at the other arguments of the start callback, we see the level
number (the number at the start of the line in the GEDCOM file), the tag (e.g.
"SOUR"), and then a raw value, a parsed tag and a parsed value. The
279 raw value is just the raw string that occurs as value on the line next to
280 the tag (in UTF-8 encoding). The parsed value is the meaningful value
281 that is parsed from that raw string. The parsed tag is described in
282 the section for record callbacks.<br>
284 The <code>Gedcom_val</code> type is meant to be an opaque type. The
285 only thing that needs to be known about it is that it can contain specific
286 data types, which have to be retrieved from it using pre-defined macros.
287 These data types are described in the <a href="interface.html#Gedcom_val_types">
288 interface details</a>. <br>
290 Some extra notes:<br>
293 <li>The <code>Gedcom_val</code> argument of the end callback
294 is currently not used. It is there for future enhancements.</li>
<li>There is also a <code>Gedcom_val</code> argument in the
start callback for records (<code>xref</code>). This argument is currently a
string value giving the cross-reference pointer in string form.</li>
<h3><a name="Default_callbacks"></a>Default callbacks</h3>
303 As described above, an application doesn't always implement the entire
304 GEDCOM spec, and application-specific tags may have been added by other
305 applications. To preserve this extra data anyway, a default callback
306 can be registered by the application, as in the following example:<br>
308 <blockquote><code>void <b>my_default_cb</b> (Gedcom_ctxt parent,
309 int level, char* tag, char* raw_value, int parsed_tag)<br>
315 <b>gedcom_set_default_callback</b>(my_default_cb);<br>
317 result = <b>gedcom_parse_file</b>("myfamily.ged");</code><br>
This callback has a signature similar to the previous ones,
320 but it doesn't contain a parsed value. However, it does contain the
321 parent context, that was returned by the application for the most specific
322 containing tag that the application supported.<br>
324 Suppose e.g. that this callback is called for some tags in the header
325 that are specific to some other application, then our application could
326 make sure that the parent context contains the struct or object that represents
327 the header, and use the default callback here to add the level, tag and raw_value
328 as plain text in a member of that struct or object, thus preserving the
329 information. The application can then write this out when the data
330 is saved again in a GEDCOM file. To make it more specific, consider
331 the following example:<br>
333 <blockquote><code>struct header {<br>
334 char* source;<br>
336 char* extra_text;<br>
339 Gedcom_ctxt my_header_start_cb(int level, Gedcom_val xref, char* tag,
341
342 int parsed_tag, Gedcom_val parsed_value)<br>
struct header* head = my_make_header_struct();<br>
345 return (Gedcom_ctxt)head;<br>
348 void my_default_cb(Gedcom_ctxt parent, int level, char* tag, char* raw_value,
struct header* head = (struct header*)parent;<br>
352 my_header_add_to_extra_text(head, level, tag, raw_value);<br>
355 gedcom_set_default_callback(my_default_cb);<br>
gedcom_subscribe_to_record(REC_HEAD, my_header_start_cb, NULL);<br>
358 result = gedcom_parse_file(filename);</code><br>
Note that the default callback will be called for any tag that the application
has not specifically subscribed to, and can thus be called in various contexts.
362 For simplicity, the example above doesn't take this into account (the
363 <code>parent</code> could be of different types, depending
366 <hr width="100%" size="2">
<h2><a name="Other_API_functions"></a>Other API functions</h2>
The above describes the basic interface of libgedcom. There are also
some other functions that allow you to customize the behaviour of the library.
These are explained in the current section.<br>
374 <h3><a name="Debugging"></a>Debugging</h3>
The library can generate debugging output, both from the library itself
and from the underlying yacc parser. By default,
377 no debugging output is generated, but this can be customized using the following
380 <blockquote><code>void <b>gedcom_set_debug_level</b> (int level,
381 FILE* trace_output)</code><br>
383 The <code>level</code> can be one of the following values:<br>
386 <li>0: no debugging information (this is the default)</li>
387 <li>1: only debugging information from libgedcom
389 <li>2: debugging information from libgedcom and
393 If the <code>trace_output</code> is <code>NULL</code>, debugging information
394 will be written to <code>stderr</code>, otherwise the given file handle is
395 used (which must be open).<br>
398 <h3><a name="Error_treatment"></a>Error treatment</h3>
One of the previous sections already described the callback to be registered
to get error messages. The library also allows you to customize what happens
on an error, using the following function:<br>
403 <blockquote><code>void <b>gedcom_set_error_handling</b> (Gedcom_err_mech
404 mechanism)</code><br>
406 The <code>mechanism</code> can be one of:<br>
409 <li><code>IMMED_FAIL</code>: immediately fail the
410 parsing on an error (this is the default)</li>
411 <li><code>DEFER_FAIL</code>: continue parsing after
412 an error, but return a failure code eventually</li>
413 <li><code>IGNORE_ERRORS</code>: continue parsing after
414 an error, return success always</li>
417 This doesn't influence the generation of error or warning messages, only
418 the behaviour of the parser and its return code.<br>
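The difference between the three mechanisms can be modelled in a few lines. The function below is a toy model, not the library's code: it walks over an array of flags that say whether each line parses correctly, and reports how many lines were examined and which result code would come back. The enum is a local stand-in for the one in <code>gedcom.h</code>, and <code>model_parse</code> is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the type from gedcom.h, declared so the model compiles
   on its own (an assumption). */
typedef enum { IMMED_FAIL, DEFER_FAIL, IGNORE_ERRORS } Gedcom_err_mech;

/* Toy model: line_ok[i] is nonzero if line i parses correctly.
   Stores the number of lines examined in *lines_seen and returns the
   modelled result: 0 for success, 1 for failure. */
int model_parse(const int *line_ok, int n, Gedcom_err_mech mechanism,
                int *lines_seen)
{
    int errors = 0;
    int i;

    assert(line_ok != NULL && lines_seen != NULL);

    for (i = 0; i < n; i++) {
        if (!line_ok[i]) {
            errors++;
            if (mechanism == IMMED_FAIL) {
                i++;        /* the failing line itself was examined */
                break;      /* stop at the first error */
            }
        }
    }
    *lines_seen = i;

    if (mechanism == IGNORE_ERRORS)
        return 0;           /* success regardless of errors */
    return errors > 0 ? 1 : 0;
}
```

With three lines of which the second is faulty, the model stops after two lines and fails for <code>IMMED_FAIL</code>, examines all three and fails for <code>DEFER_FAIL</code>, and examines all three and succeeds for <code>IGNORE_ERRORS</code>.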
<h3><a name="Compatibility_mode"></a>Compatibility mode</h3>
Applications are not necessarily true to the GEDCOM spec (or use a different
version than 5.5). The intention is that the library is resilient to
this, and switches to compatibility mode for files written by specific programs
(detected via the HEAD.SOUR tag). This compatibility mode can be enabled
and disabled via the following function:<br>
429 <blockquote><code>void <b>gedcom_set_compat_handling</b>
430 (int enable_compat)</code><br>
432 The argument can be:<br>
435 <li>0: disable compatibility mode</li>
436 <li>1: allow compatibility mode (this is the default)<br>
440 Note that, currently, no actual compatibility code is present, but this
441 is on the to-do list.<br>
443 <hr width="100%" size="2">