<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head><title>UTF-8 tools library</title>
  <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1"></head><body text="#000000" bgcolor="#ffffff" link="#000099" vlink="#990099" alink="#000099">

<h1 align="center">UTF-8 tools library</h1>
<br>

<h2>Index</h2>

<ul>
  <li><a href="#Introduction">Introduction</a></li>
  <li><a href="#UTF-8_string_functions">UTF-8 string functions</a></li>
  <li><a href="#Converting_character_sets">Converting character sets</a>
    <ul>
      <li><a href="#Generic_interface">Generic interface</a></li>
      <li><a href="#Specific_locale_conversion">Specific locale conversion</a></li>
    </ul>
  </li>
</ul>

<hr width="100%" size="2"><br>
<h2><a name="Introduction"></a>Introduction</h2>
The library <code>libutf8tools</code> is part of the GEDCOM parser library,
but it can also be used in unrelated programs. &nbsp;It provides helper
functions for handling the UTF-8 encoding. &nbsp;The following files are
installed:<br>
<ul>
  <li>a library '<code>libutf8tools.so</code>', which should be linked into your program</li>
  <li>a header '<code>utf8tools.h</code>', which should be included in the source code of your program</li>
</ul>
The following sections describe the features of the library.<br>
<hr width="100%" size="2">
<h2><a name="UTF-8_string_functions"></a>UTF-8 string functions</h2>
The following simple functions are available to handle UTF-8 strings in general:<br>
<blockquote>
  <pre><code>int   <b>is_utf8_string</b> (char *input);<br>int   <b>utf8_strlen</b> (char *input);<br></code></pre></blockquote>
The first function returns 1 if the given input is a valid UTF-8 string and
0 otherwise. &nbsp;The second function returns the number of UTF-8 characters
(not bytes) in the given input. &nbsp;Note that the second function assumes
that the input is valid UTF-8 and gives unpredictable results if it isn't,
so check untrusted input with <code>is_utf8_string</code> first.<br>
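<br>
To make the difference between bytes and characters concrete, the following is a minimal sketch of how such checks can be written. &nbsp;The names <code>sketch_is_utf8</code> and <code>sketch_utf8_strlen</code> are hypothetical, and this is not the implementation used by <code>libutf8tools</code>. &nbsp;Like <code>utf8_strlen</code>, the counting function assumes valid input.<br>

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if s is well-formed UTF-8, 0 otherwise. */
int sketch_is_utf8(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    while (*p) {
        int extra;           /* number of continuation bytes expected */
        unsigned long cp;    /* decoded code point */
        unsigned long min;   /* smallest code point allowed for this length */

        if (*p < 0x80) { p++; continue; }
        else if ((*p & 0xE0) == 0xC0) { extra = 1; cp = *p & 0x1F; min = 0x80; }
        else if ((*p & 0xF0) == 0xE0) { extra = 2; cp = *p & 0x0F; min = 0x800; }
        else if ((*p & 0xF8) == 0xF0) { extra = 3; cp = *p & 0x07; min = 0x10000; }
        else return 0;       /* stray continuation byte or invalid lead byte */

        p++;
        while (extra-- > 0) {
            /* this test also rejects a string that ends mid-sequence */
            if ((*p & 0xC0) != 0x80) return 0;
            cp = (cp << 6) | (*p & 0x3F);
            p++;
        }
        /* reject overlong forms, surrogates and values above U+10FFFF */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;
    }
    return 1;
}

/* Hypothetical helper: counts characters by skipping continuation bytes.
   Only meaningful for input that is already known to be valid UTF-8. */
int sketch_utf8_strlen(const char *s)
{
    int n = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p; p++)
        if ((*p & 0xC0) != 0x80)
            n++;
    return n;
}
```

For the six-byte string <code>"h\xC3\xA9llo"</code> the counting function returns 5, because the two bytes <code>C3 A9</code> form a single character.<br>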
<hr width="100%" size="2">

<h2><a name="Converting_character_sets"></a>Converting character sets</h2>
For conversion to and from UTF-8 there is a generic interface, which offers
full flexibility, and a specific interface for conversion to
and from the locale, which is less flexible but more straightforward.<br>
<h3><a name="Generic_interface"></a>Generic interface</h3>
<h4>Conversion handle</h4>
In general, the program needs to initialize a conversion handle before any
text can be converted to or from UTF-8. &nbsp;This initialization
(and the cleanup at the end) is performed via the following functions:
<blockquote>
  <pre><code>convert_t   <b>initialize_utf8_conversion</b> (const char *charset, int ext_outbuf);<br>void   <b>     cleanup_utf8_conversion</b> (convert_t conv);<br></code></pre></blockquote>
The first function returns a conversion handle, which needs to be passed
to all generic conversion functions. &nbsp;Through this handle, bidirectional
conversion can take place between UTF-8 and the given character set <code>charset</code>.
&nbsp;The implementation of this handle is not visible to the program that
uses it. &nbsp;In case of an error, the returned value is <code>NULL</code> and <code>errno</code> gives the error that occurred.<br>
<br>
The second parameter, <code>ext_outbuf</code>, should be non-zero if you want
to control the output buffer yourself (see below). &nbsp;Under normal circumstances,
you should pass 0 for this parameter.<br>
<br>
To avoid memory leaks, clean up conversion handles that are no longer needed, using the <code>cleanup_utf8_conversion</code> function. &nbsp;Note that after calling this function, any access to the handle results in undefined behaviour.<br>
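<br>
As an illustration of what a bidirectional handle with this error convention can look like, here is a minimal sketch built on the POSIX <code>iconv</code> interface. &nbsp;The type <code>sketch_conv</code> and both functions are hypothetical; whether <code>libutf8tools</code> is implemented this way is an assumption, not something this page states.<br>

```c
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>

/* Hypothetical handle: one iconv descriptor per conversion direction. */
typedef struct sketch_conv {
    iconv_t to_utf8;
    iconv_t from_utf8;
} sketch_conv;

/* Returns a new handle, or NULL with errno set if the charset is unknown
   or memory runs out. */
sketch_conv *sketch_conv_open(const char *charset)
{
    int saved;
    sketch_conv *c = malloc(sizeof *c);
    if (c == NULL)
        return NULL;

    c->to_utf8 = iconv_open("UTF-8", charset);
    if (c->to_utf8 == (iconv_t)-1) {
        saved = errno;            /* keep the error reported by iconv_open */
        free(c);
        errno = saved;
        return NULL;
    }

    c->from_utf8 = iconv_open(charset, "UTF-8");
    if (c->from_utf8 == (iconv_t)-1) {
        saved = errno;
        iconv_close(c->to_utf8);
        free(c);
        errno = saved;
        return NULL;
    }
    return c;
}

/* Releases both descriptors; the handle must not be used afterwards. */
void sketch_conv_close(sketch_conv *c)
{
    if (c == NULL)
        return;
    iconv_close(c->to_utf8);
    iconv_close(c->from_utf8);
    free(c);
}
```

A caller would pair every successful <code>sketch_conv_open</code> with exactly one <code>sketch_conv_close</code>, in the same way that <code>initialize_utf8_conversion</code> is paired with <code>cleanup_utf8_conversion</code>.<br>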
<h4>Conversion functions</h4>
Once a conversion handle is initialized, it can be used to convert text between
UTF-8 and the given character set. &nbsp;Three functions are available
to do so:
<blockquote>
  <pre><code>char* <b>convert_from_utf8</b> (convert_t conv, const char* input, int* conv_fails, size_t* output_len);<br><br>char* <b>convert_to_utf8</b> (convert_t conv, const char* input, size_t input_len);<br>char* <b>convert_to_utf8_incremental</b> (convert_t conv, const char* input, size_t input_len);<br></code></pre></blockquote>
All three functions take the conversion handle as first parameter and the
text to convert as second parameter. &nbsp;They return a pointer to an output
buffer, which is overwritten at each call of the functions (unless you control
your own output buffers, see below).<br>
<br>
The difference between the last two functions is that <code>convert_to_utf8</code> converts only entire strings (i.e. it resets the conversion state each time), whereas <code>convert_to_utf8_incremental</code>
takes previous conversions into account for the current conversion (input
bytes left over from the previous call are combined with
the current input). &nbsp;If you pass <code>NULL</code> as input to <code>convert_to_utf8_incremental</code>, the conversion restarts from a clean state.<br>
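<br>
The need for an incremental variant shows up as soon as a multi-byte character is split across two input chunks. &nbsp;The sketch below converts UTF-16LE to UTF-8 and keeps a dangling byte between calls. &nbsp;All names are hypothetical, surrogate pairs are deliberately left out, and the code only illustrates the idea of carried-over state; it is not the library's implementation.<br>

```c
#include <stddef.h>

/* Hypothetical decoder state: at most one byte of a 16-bit code unit can be
   left over between calls. */
typedef struct {
    int have_low;
    unsigned char low;
} sketch_utf16_state;

/* Converts in_len bytes of UTF-16LE to NUL-terminated UTF-8 in out, which
   must hold at least 3 * ((in_len + 1) / 2) + 1 bytes.  Passing in == NULL
   resets the state.  Only code points in the Basic Multilingual Plane are
   handled.  Returns the number of bytes written, excluding the NUL. */
size_t sketch_utf16le_to_utf8(sketch_utf16_state *st,
                              const unsigned char *in, size_t in_len, char *out)
{
    size_t o = 0;

    if (in == NULL) {
        st->have_low = 0;
        out[0] = '\0';
        return 0;
    }

    for (size_t i = 0; i < in_len; i++) {
        if (!st->have_low) {
            /* first half of a code unit: remember it, possibly until the next call */
            st->low = in[i];
            st->have_low = 1;
            continue;
        }
        unsigned int cp = st->low | ((unsigned int)in[i] << 8);
        st->have_low = 0;

        if (cp < 0x80) {
            out[o++] = (char)cp;
        } else if (cp < 0x800) {
            out[o++] = (char)(0xC0 | (cp >> 6));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        } else {
            out[o++] = (char)(0xE0 | (cp >> 12));
            out[o++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        }
    }
    out[o] = '\0';
    return o;
}
```

If the bytes <code>68 00 E9</code> arrive in one call and <code>00</code> in the next, the first call yields only <code>h</code>, and the second call completes U+00E9 and yields the two bytes <code>C3 A9</code>. &nbsp;A converter that reset its state on every call would lose the dangling <code>E9</code>.<br>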
<br>
Since conversion from UTF-8 to another character set can fail (some
characters may not be encodable in the target character set), the
function <code>convert_from_utf8</code> has a third parameter, <code>conv_fails</code>,
which can return the number of conversion failures in the input. &nbsp;Pass
a pointer to an integer if you're interested, or pass <code>NULL</code> otherwise. &nbsp;Note
that for each conversion failure the string '?' is put in the output instead
of the character that could not be converted. &nbsp;This string can be changed
using:<br>
<blockquote>
  <pre><code>int <b>conversion_set_unknown</b> (convert_t conv, const char *unknown);</code><br></pre>
  </blockquote>
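The behaviour described above (a replacement string plus a failure counter) can be sketched for one concrete target character set. &nbsp;The function below converts valid UTF-8 to ISO-8859-1; its name and signature are hypothetical and it is not part of <code>libutf8tools</code>.<br>

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: converts valid UTF-8 to ISO-8859-1.  Characters above
   U+00FF are replaced by the string `unknown` and counted in *fails (which
   may be NULL).  out must hold at least
   strlen(in) * max(1, strlen(unknown)) + 1 bytes.
   Returns the number of bytes written, excluding the NUL. */
size_t sketch_utf8_to_latin1(const char *in, const char *unknown,
                             int *fails, char *out)
{
    const unsigned char *p = (const unsigned char *)in;
    size_t ulen = strlen(unknown);
    size_t o = 0;
    int nfail = 0;

    while (*p) {
        if (*p < 0x80) {
            /* ASCII maps to itself */
            out[o++] = (char)*p++;
        } else if ((*p == 0xC2 || *p == 0xC3) && (p[1] & 0xC0) == 0x80) {
            /* U+0080..U+00FF: the two-byte form maps to a single byte */
            out[o++] = (char)(((*p & 0x03) << 6) | (p[1] & 0x3F));
            p += 2;
        } else {
            /* not representable: emit the replacement, skip the whole sequence */
            memcpy(out + o, unknown, ulen);
            o += ulen;
            nfail++;
            p++;
            while ((*p & 0xC0) == 0x80)
                p++;
        }
    }
    out[o] = '\0';
    if (fails != NULL)
        *fails = nfail;
    return o;
}
```

Converting the UTF-8 text for "caf&eacute;" followed by a space and a euro sign with <code>"?"</code> as replacement gives six output bytes ending in <code>?</code>, and the failure counter is set to 1, since the euro sign has no ISO-8859-1 encoding.<br>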
Some character sets use wide characters to encode text. &nbsp;Because the
conversion functions above, for simplicity, all take and return normal <code>char</code>
strings, the length of a string must in some cases be given explicitly: a
string that actually contains wide characters can contain zero bytes, so it
cannot be treated as a null-terminated string and <code>strlen</code> does not work on it.<br>
<br>
For this reason, the function <code>convert_from_utf8</code> has a fourth
parameter, which can return the length of the output string (pass <code>NULL</code> if
you know you don't need it), and the other functions have an <code>input_len</code> parameter, which should always be the length in bytes of the <code>input</code> string, even if it could also be retrieved via <code>strlen</code>.<br>
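<br>
A small example shows why <code>strlen</code> is unreliable here. &nbsp;The helper names are hypothetical and exist only for this illustration.<br>

```c
#include <stddef.h>
#include <string.h>

/* "AB" encoded as UTF-16LE: every second byte is zero. */
static const char sketch_utf16_ab[] = { 'A', 0, 'B', 0 };

/* What strlen sees: it stops at the first zero byte. */
size_t sketch_length_via_strlen(void)
{
    return strlen(sketch_utf16_ab);
}

/* The real length in bytes, which has to be tracked separately. */
size_t sketch_length_in_bytes(void)
{
    return sizeof sketch_utf16_ab;
}
```

Here <code>strlen</code> reports 1 although the data is 4 bytes long, which is why the length is passed and returned explicitly.<br>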
<h4>Controlling the output buffer</h4>
In some cases you may want to control the output buffer yourself, e.g. when
you want several output buffers so that converted strings don't have
to be copied. &nbsp;This can be done by declaring your intention at
the initialization of the conversion handle (see above). &nbsp;In that case,
the initialization doesn't allocate an output buffer itself, and you have
to set one up via the following functions before you can do any conversion:
<blockquote>
  <pre><code>conv_buffer_t <b>create_conv_buffer</b> (int initial_size);<br>void          <b>free_conv_buffer</b> (conv_buffer_t buf);<br><br>int           <b>conversion_set_output_buffer</b> (convert_t conv, conv_buffer_t buf);<br></code></pre></blockquote>
The first function returns a handle to a new conversion buffer with the given
initial size (the buffer is expanded dynamically when necessary). &nbsp;The
second function frees the buffer; any further access to the buffer handle
results in undefined behaviour.<br>
<br>
The third function sets the current output buffer for the given
conversion handle, which makes it possible to switch between output buffers. &nbsp;The
function returns 1 on success, 0 on failure.<br>
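<br>
A buffer that "is expanded dynamically when necessary" can be sketched as follows. &nbsp;The type <code>sketch_buffer</code> and its functions are hypothetical; the real <code>conv_buffer_t</code> is opaque and its growth strategy is not documented here, so doubling the size is an assumption made for the example.<br>

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable buffer. */
typedef struct {
    char  *data;
    size_t size;   /* allocated bytes */
    size_t used;   /* bytes in use */
} sketch_buffer;

/* Returns a new buffer with the given initial size, or NULL on failure. */
sketch_buffer *sketch_buffer_create(size_t initial_size)
{
    sketch_buffer *b = malloc(sizeof *b);
    if (b == NULL)
        return NULL;
    if (initial_size == 0)
        initial_size = 1;
    b->data = malloc(initial_size);
    if (b->data == NULL) {
        free(b);
        return NULL;
    }
    b->size = initial_size;
    b->used = 0;
    return b;
}

/* Appends len bytes, doubling the allocation until they fit.
   Returns 1 on success, 0 on failure (the buffer is then unchanged). */
int sketch_buffer_append(sketch_buffer *b, const char *bytes, size_t len)
{
    if (b->used + len > b->size) {
        size_t new_size = b->size;
        while (new_size < b->used + len)
            new_size *= 2;
        char *p = realloc(b->data, new_size);
        if (p == NULL)
            return 0;
        b->data = p;
        b->size = new_size;
    }
    memcpy(b->data + b->used, bytes, len);
    b->used += len;
    return 1;
}

/* Frees the buffer; it must not be used afterwards. */
void sketch_buffer_free(sketch_buffer *b)
{
    if (b != NULL) {
        free(b->data);
        free(b);
    }
}
```

Because the storage can move when it grows, a caller should not keep pointers into the data across appends.<br>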
<h3><a name="Specific_locale_conversion"></a>Specific locale conversion</h3>
For conversion to and from the current locale, there is a simpler interface available,
which takes care of the conversion handle implicitly. &nbsp;The following
functions are available:<br>
<blockquote>
  <pre><code>char *<b>convert_utf8_to_locale</b> (char *input, int *conv_failures);<br>char *<b>convert_locale_to_utf8</b> (char *input);<br></code></pre>
</blockquote>
Both functions return a pointer to a static buffer that is overwritten
on each call. &nbsp;For these functions to work properly, the application must first set
the locale using the <code>setlocale</code> function.<br>
<br>
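The following sketch shows the <code>setlocale</code> step and how a program can find out which character set the current locale uses, through the POSIX function <code>nl_langinfo</code>. &nbsp;The function name <code>sketch_locale_codeset</code> is hypothetical, and whether <code>libutf8tools</code> determines the locale's character set in this way is an assumption.<br>

```c
#include <langinfo.h>
#include <locale.h>

/* Hypothetical helper: adopts the locale from the environment and returns
   the name of its character set, e.g. "UTF-8" or "ISO-8859-1". */
const char *sketch_locale_codeset(void)
{
    /* Without this call the program stays in the default "C" locale. */
    setlocale(LC_ALL, "");
    return nl_langinfo(CODESET);
}
```

In a real program the <code>setlocale</code> call normally appears once, near the start of <code>main</code>, before any of the locale conversion functions are used.<br>
<br>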
If you pass a pointer to an integer to the first function, it will be
set to the number of conversion failures, i.e. characters that couldn't
be converted; you can also just pass <code>NULL</code> if you are not interested
(note that usually the interesting information is just whether there
<i>were</i> conversion failures, i.e. whether the integer
is greater than zero). &nbsp;The second function doesn't need this parameter,
because text in any locale can be converted to UTF-8.<br>
<br>
You can change the "?" that is output for characters that can't be converted
to any string you want, by calling the following function before the conversion
calls:<br>

<blockquote>
  <pre><code>void <b>convert_set_unknown</b> (const char *unknown);</code><br></pre>
  </blockquote>
</body></html>