X-Git-Url: https://git.dlugolecki.net.pl/?a=blobdiff_plain;f=doc%2Fusage.html;h=00003be0d0b4484dd762b1c4b4e7c0b174d1065f;hb=b1a87b467d9c6b38aa4a2b2e4537b0deff9a251e;hp=786cf1f1694714bb02dfe73eaf6bf5ec7e564767;hpb=440a08f883e410cb0c714ccdc3bf95d23be3d805;p=gedcom-parse.git

diff --git a/doc/usage.html b/doc/usage.html
index 786cf1f..00003be 100644
--- a/doc/usage.html
+++ b/doc/usage.html
@@ -1,9 +1,7 @@
Using the GEDCOM parser library
-
-
-
+

Using the GEDCOM parser library

@@ -27,7 +25,9 @@

Error treatment

Compatibility mode

- +

Converting character sets
+

Interface details

@@ -94,15 +94,14 @@ actual callback mechanism, although it also uses a callback...
result = gedcom_parse_file("myfamily.ged");
In the above piece of code, my_message_handler is the callback
- that will be called for errors (type=ERROR), warnings (type=WARNING) and messages (type=MESSAGE). The
+ that will be called for errors (type=ERROR), warnings (type=WARNING) and messages (type=MESSAGE). The
callback must have the signature as in the example. For errors, the msg passed to the callback will have the format:

Error on line <lineno>: <actual_message>

Note that the entire string will be properly internationalized, and
-encoded in UTF-8 (see "Why UTF-8?" LINK TBD). Also,
+encoded in UTF-8 (Why UTF-8?). Also,
no newline is appended, so that the application program can use it in any way it wants. Warnings are similar, but use "Warning" instead of "Error". Messages are plain text, without any prefix.
@@ -169,7 +168,7 @@ a GEDCOM file. First, have a look at the following piece of code:

void my_header_end_cb (Gedcom_ctxt self)
{
- printf("The header ends, context is %d\n", self); /* context
+ printf("The header ends, context is %d\n", (int)self); /* context
will print as "1" */
}

@@ -207,7 +206,7 @@ need to have the signatures as shown in the example.
gedcom.h (so no need to include gedcom-tags.h yourself).

The example passes a simple integer as context, but an application could
- e.g. pass a struct that will contain the information for the
+ e.g. pass a struct (or an object in a C++ application) that will contain the information for the
header. In the end callback, the application could then e.g. do some finalizing operations on the struct to put it in its database.

@@ -263,11 +262,11 @@ for this)
The subscription mechanism for elements is similar, only the signatures of the callbacks differ. The signature for the start callback shows
- that the context of the parent line (e.g. the struct that
+ that the context of the parent line (here e.g. the struct that
describes the header) is passed to this start callback. The callback
-itself returns here the same context, but this can be its own context object
+itself returns here in this example the same context, but this can be its own context object
of course. The end callback is called with both the context of the
-parent and the context of itself, which will be the same in the example.
+parent and the context of itself, which in this example will be the same.
Again, the list of identifiers to use as a first argument for the subscription function are detailed in the interface details.
@@ -278,7 +277,7 @@ level number (the initial number of the line in the GEDCOM file), the tag
raw value is just the raw string that occurs as value on the line next to the tag (in UTF-8 encoding). The parsed value is the meaningful value that is parsed from that raw string. The parsed tag is described in
- the section for record callbacks.
+ the section for record callbacks above.

The Gedcom_val type is meant to be an opaque type. The only thing that needs to be known about it is that it can contain specific
@@ -291,9 +290,8 @@ level number (the initial number of the line in the GEDCOM file), the tag

The Gedcom_val argument of the end callback is currently not used. It is there for future enhancements.
-There is also a Gedcom_val argument in
-the start callback for records. This argument is currently a string
-value giving the pointer in string form.
+There are also two Gedcom_val arguments in
+the start callback for records. The first one (xref) contains the xref_value corresponding to the cross-reference (or NULL if there isn't one), the second one (parsed_value) contains the value that is parsed from the raw_value. See the interface details.

@@ -361,6 +359,8 @@ specifically subscribed upon by the application, and can thus be called in various contexts. For simplicity, the example above doesn't take this into account (the parent could be of different types, depending on the context).
+
+Note also that the default callback is not called when the parent context is NULL. This is e.g. the case if none of the "upper" tags has been subscribed upon.

@@ -445,9 +445,115 @@ enabled and disabled via the following function:
Note that, currently, no actual compatibility code is present, but this is on the to-do list.
+

Converting character sets

+All strings passed by the GEDCOM parser to the application are in UTF-8 encoding.
+ Typically, an application needs to convert this to something else to
+be able to display it.
+
+The most common case is that the output character set is controlled by the locale mechanism (i.e. via the LANG, LC_ALL or LC_CTYPE environment variables), which also controls the gettext
+ mechanism in the application. For this, the following steps need to
+be taken by the application (more detailed info can be found in the info
+file of the GNU libc library in the "Generic Charset Conversion" section
+under "Character Set Handling" or online here):
+

inclusion of some headers:

#include <locale.h>    /* for setlocale */
#include <langinfo.h>  /* for nl_langinfo */
#include <iconv.h>     /* for iconv_* functions */

set the program's current locale to what the user configured in the environment:

setlocale(LC_ALL, "");

open a conversion handle for conversion from UTF-8 to the character set of the current locale (once for the entire program):

iconv_t iconv_handle;
...
iconv_handle = iconv_open(nl_langinfo(CODESET), "UTF-8");
if (iconv_handle == (iconv_t) -1)
  /* signal an error */

then, every string can be converted using the following:

/* char* in_buf is the input buffer,    size_t in_len is its length */
/* char* out_buf is the output buffer,  size_t out_len is its length */

size_t nconv;
char *in_ptr = in_buf;
char *out_ptr = out_buf;
nconv = iconv(iconv_handle, &in_ptr, &in_len, &out_ptr, &out_len);

If the output buffer is not big enough, iconv will return -1 and set errno to E2BIG. Also, the in_ptr and out_ptr will point just after the last successfully converted character in the respective buffers, and the in_len and out_len will be updated to show the remaining lengths. There can be two strategies here:
+
+
Make sure from the beginning
+that the output buffer is big enough. However, it's difficult to find
+an absolute maximum length in advance, even given the length of the input
+string.
+
+
+
Do the conversion in several steps, growing the output buffer each time to make more space, and calling iconv
+ consecutively until the conversion is complete. This is the preferred
+way (a function could be written to encapsulate all this).
+
+Another error case is when the conversion was unsuccessful (if one of the
+characters can't be represented in the target character set). The iconv function will then also return -1 and set errno to EILSEQ; the in_ptr will point to the character that couldn't be converted. In that case, again two strategies are possible:
+
+
Just fail the conversion, and show an error. This is not very user friendly, of course.
+
+
+
Skip over the character that can't be converted and append a "?" to the output buffer, then call iconv again. Skipping over a UTF-8 character is fairly simple, as follows from the encoding rules:
+
+
+
+
if the first byte is in binary 0xxxxxxx, then the character is only one byte long, just skip over that byte
+
+
+
if the first byte is in binary 11xxxxxx, then skip over that byte and all bytes 10xxxxxx that follow.
+
+
+
+

eventually, the conversion handle needs to be closed (when the program exits):
+

iconv_close(iconv_handle);

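The steps above (growing the output buffer on E2BIG, and skipping an unconvertible character while appending "?" on EILSEQ) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from utf8-locale.c: the names utf8_char_len and convert_with_fallback are hypothetical, the handle is assumed to come from iconv_open as shown earlier, the target character set is assumed to be stateless and able to represent "?" as a single byte, and the iconv implementation is assumed to report unconvertible characters with EILSEQ rather than substituting them silently.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: length in bytes of the UTF-8 character at p,
   looking at no more than avail bytes (rules as described above). */
static size_t utf8_char_len(const char *p, size_t avail)
{
  size_t n = 1;
  if (avail == 0)
    return 0;
  if (((unsigned char)p[0] & 0xC0) == 0xC0) {        /* lead byte 11xxxxxx */
    while (n < avail && ((unsigned char)p[n] & 0xC0) == 0x80)
      n++;                                           /* skip 10xxxxxx bytes */
  }
  return n;                 /* 0xxxxxxx (or a stray byte): one byte */
}

/* Hypothetical function: converts the NUL-terminated string in using the
   handle cd and returns a malloc'ed, NUL-terminated result that the caller
   must free, or NULL on allocation failure or truncated input. */
static char *convert_with_fallback(iconv_t cd, const char *in)
{
  size_t in_left, out_size, out_left;
  char *in_ptr, *out, *out_ptr;

  assert(in != NULL);
  in_left = strlen(in);
  in_ptr = (char *)in;        /* iconv wants char **, input is not modified */
  out_size = in_left + 16;    /* initial guess, grown below when needed */
  out = malloc(out_size);
  if (out == NULL)
    return NULL;
  out_ptr = out;
  out_left = out_size - 1;    /* keep one byte for the terminating NUL */

  iconv(cd, NULL, NULL, NULL, NULL);   /* reset the conversion state */
  while (in_left > 0) {
    if (iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left) != (size_t)-1)
      break;                           /* everything was converted */
    if (errno == EILSEQ && out_left > 0) {
      /* Unconvertible character: skip it and write "?" instead. */
      size_t skip = utf8_char_len(in_ptr, in_left);
      in_ptr += skip;
      in_left -= skip;
      *out_ptr++ = '?';
      out_left--;
    } else if (errno == E2BIG || errno == EILSEQ) {
      /* Output buffer is full: double it and continue where we stopped. */
      size_t used = (size_t)(out_ptr - out);
      char *bigger;
      out_size *= 2;
      bigger = realloc(out, out_size);
      if (bigger == NULL) {
        free(out);
        return NULL;
      }
      out = bigger;
      out_ptr = out + used;
      out_left = out_size - 1 - used;
    } else {                           /* e.g. EINVAL: truncated input */
      free(out);
      return NULL;
    }
  }
  *out_ptr = '\0';
  return out;
}
```

A caller would open the handle once (for instance with nl_langinfo(CODESET) as the target), call convert_with_fallback for every string received from the parser, and free each result after use. Returning a fresh buffer on every call is a design choice of this sketch; it costs an allocation per string, but the result stays valid across later calls.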
+ -

+ The source distribution of gedcom-parse contains an example implementation (utf8-locale.c and utf8-locale.h
+ in the top directory) that grows the output buffer dynamically and outputs
+"?" for characters that can't be converted. Feel free to use it in
+your source code (it is not part of the library, and it isn't installed anywhere,
+so you need to take over the source and header file in your application).
+
+
+Its interface is:
+

char *convert_utf8_to_locale (char *input);
char *convert_locale_to_utf8 (char *input);

+Both functions return a pointer to a static buffer that is overwritten on
+each call. To function properly, the application must first set the
+locale using the setlocale function (the second step above).
+ All other steps, including setting up and closing down the conversion
+handles, are transparently handled by the two functions.
+
+You can change the "?" that is output for characters that can't be converted
+to any string you want, using the following function before the conversion
+calls:
+

void convert_set_unknown (const char *unknown);

$Id$
$Name$