From 5609cf009e11653fec8314695d9897f9b94204d3 Mon Sep 17 00:00:00 2001
From: Peter Verthez
Date: Sun, 13 Jan 2002 12:22:04 +0000
Subject: [PATCH] Explanation about character encoding.

---
 doc/usage.html | 114 +++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 111 insertions(+), 3 deletions(-)

diff --git a/doc/usage.html b/doc/usage.html
index cd050df..00003be 100644
--- a/doc/usage.html
+++ b/doc/usage.html
@@ -25,7 +25,9 @@
  • Error treatment
  • Compatibility mode
  • - +
  • Converting character sets
    +
  • +
  • Interface details
  • @@ -99,7 +101,7 @@ actual callback mechanism, although it also uses a callback...
    Error on line <lineno>: <actual_message>
    Note that the entire string will be properly internationalized, and
    -encoded in UTF-8 (see "Why UTF-8?"  LINK TBD).  Also,
    +encoded in UTF-8 (Why UTF-8?).  Also,
    no newline is appended, so that the application program can use it in any way it wants.  Warnings are similar, but use "Warning" instead of "Error".  Messages are plain text, without any prefix.
    @@ -443,9 +445,115 @@ enabled and disabled via the following function:
    Note that, currently, no actual compatibility code is present, but this is on the to-do list.
    +
    +

    Converting character sets

    +All strings passed by the GEDCOM parser to the application are in UTF-8 encoding.
    +Typically, an application needs to convert this to something else to
    +be able to display it.
    +
    +The most common case is that the output character set is controlled by the locale mechanism (i.e. via the LANG, LC_ALL or LC_CTYPE environment variables), which also controls the gettext
    +mechanism in the application.  For this, the following steps need to
    +be taken by the application (more detailed info can be found in the info
    +file of the GNU libc library in the "Generic Charset Conversion" section
    +under "Character Set Handling" or online here):
    + +
    +
    +
    #include <locale.h>    /* for setlocale */
    #include <langinfo.h> /* for nl_langinfo */
    #include <iconv.h> /* for iconv_* functions */
    +
    +
    + +
    +
    +
    setlocale(LC_ALL, "");
    +
    +
    + +
    +
    +
    iconv_t iconv_handle;
    ...
    iconv_handle = iconv_open(nl_langinfo(CODESET), "UTF-8");

    if (iconv_handle == (iconv_t) -1) {
      /* signal an error */
    }
    +
    +
    + +
    +
    +
    /* char* in_buf is the input buffer,    size_t in_len is its length */
    /* char* out_buf is the output buffer, size_t out_len is its length */

    size_t nconv;
    char *in_ptr = in_buf;
    char *out_ptr = out_buf;
    nconv = iconv(iconv_handle, &in_ptr, &in_len, &out_ptr, &out_len);
    +
    +
    +
    If the output buffer is not big enough, iconv will return -1 and set errno to E2BIG.  Also, the in_ptr and out_ptr will point just after the last successfully converted character in the respective buffers, and the in_len and out_len will be updated to show the remaining lengths.  There can be two strategies here:
    +
    +Another error case is when the conversion was unsuccessful (if one of the
    +characters can't be represented in the target character set).  The iconv function will then also return -1 and set errno to EILSEQ; in_ptr will point to the character that couldn't be converted.  In that case, again two strategies are possible:
    + +
      +
      1. if the first byte is in binary 0xxxxxxx, then the character is only one byte long: just skip over that byte
      +
      2. if the first byte is in binary 11xxxxxx, then skip over that byte and all 10xxxxxx bytes that follow
      +
    +
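The skipping rule for a UTF-8 character that could not be converted can be sketched as a small helper. This is only an illustration: the name `utf8_skip_length` is hypothetical and not part of gedcom-parse, and treating a stray 10xxxxxx byte like a lead byte is an assumption made here so that the caller always makes progress.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of gedcom-parse): returns the number of
   bytes to skip for the UTF-8 character starting at p, never looking
   beyond the len bytes that remain in the input buffer. */
static size_t utf8_skip_length(const char *p, size_t len)
{
  const unsigned char *s = (const unsigned char *) p;
  size_t n = 1;

  assert(p != NULL);
  if (len == 0)
    return 0;

  /* 0xxxxxxx: the character is a single byte */
  if ((s[0] & 0x80) == 0x00)
    return 1;

  /* 11xxxxxx: skip the lead byte and every 10xxxxxx byte that follows.
     Assumption: a stray 10xxxxxx byte at s[0] is handled the same way. */
  while (n < len && (s[n] & 0xC0) == 0x80)
    n++;
  return n;
}
```

After an EILSEQ error, the application could advance `in_ptr` by the returned count, decrease `in_len` by the same amount, and call `iconv` again on the rest of the input.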
    + +
    +
    +
    iconv_close(iconv_handle);
    +
    +
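Put together, the open, convert and close steps might look like the sketch below, which grows the output buffer whenever `iconv` reports E2BIG. It is not the utf8-locale.c implementation shipped with gedcom-parse: the function name `utf8_to_codeset`, the initial capacity of 16 bytes and the choice to give up on EILSEQ are all assumptions made for illustration. The target codeset is passed in as a parameter; an application following the steps above would pass `nl_langinfo(CODESET)` after calling `setlocale(LC_ALL, "")`.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of gedcom-parse): converts a
   NUL-terminated UTF-8 string to to_codeset.  Returns a malloc'ed,
   NUL-terminated string that the caller must free, or NULL on error.
   Assumes a target encoding in which a single zero byte terminates
   a string. */
static char *utf8_to_codeset(const char *input, const char *to_codeset)
{
  iconv_t cd;
  char *in_ptr, *out_buf, *out_ptr;
  size_t in_len, out_size, out_len;

  assert(input != NULL && to_codeset != NULL);

  cd = iconv_open(to_codeset, "UTF-8");
  if (cd == (iconv_t) -1)
    return NULL;

  in_ptr = (char *) input;   /* iconv takes char**, but only reads the input */
  in_len = strlen(input);
  out_size = 16;             /* small on purpose, to show the growth path */
  out_buf = malloc(out_size);
  if (out_buf == NULL) {
    iconv_close(cd);
    return NULL;
  }
  out_ptr = out_buf;
  out_len = out_size - 1;    /* keep one byte for the terminating NUL */

  while (in_len > 0) {
    size_t used;
    char *bigger;

    if (iconv(cd, &in_ptr, &in_len, &out_ptr, &out_len) != (size_t) -1)
      break;                 /* all input converted */

    if (errno != E2BIG) {    /* EILSEQ, EINVAL, ...: give up in this sketch */
      free(out_buf);
      iconv_close(cd);
      return NULL;
    }

    /* E2BIG: double the buffer and continue where iconv stopped */
    used = (size_t) (out_ptr - out_buf);
    bigger = realloc(out_buf, out_size * 2);
    if (bigger == NULL) {
      free(out_buf);
      iconv_close(cd);
      return NULL;
    }
    out_buf = bigger;
    out_size *= 2;
    out_ptr = out_buf + used;
    out_len = out_size - 1 - used;
  }

  *out_ptr = '\0';
  iconv_close(cd);
  return out_buf;
}
```

The sketch leaves out the final `iconv` call with a NULL input that stateful encodings need in order to flush their shift state, and it opens a new conversion handle on every call; an application that converts many strings would normally keep the handle open.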
    + -
    +The source distribution of gedcom-parse contains an example implementation (utf8-locale.c and utf8-locale.h
    +in the top directory) that grows the output buffer dynamically and outputs
    +"?" for characters that can't be converted.  Feel free to use it in
    +your source code (it is not part of the library and is not installed anywhere,
    +so you need to copy the source and header file into your application).
    +
    +Its interface is:
    +
    +
    char *convert_utf8_to_locale (char *input);
    char *convert_locale_to_utf8 (char *input);
    +
    +Both functions return a pointer to a static buffer that is overwritten on
    +each call.  To function properly, the application must first set the
    +locale using the setlocale function (the second step above).
    +All other steps, including setting up and closing down the conversion
    +handles, are transparently handled by the two functions.
    +
    +You can change the "?" that is output for characters that can't be converted
    +to any string you want, using the following function before the conversion
    +calls:
    +
    +
    void convert_set_unknown (const char *unknown);
    +
    +
    $Id$
    $Name$

    -- 2.30.2