X-Git-Url: https://git.dlugolecki.net.pl/?a=blobdiff_plain;f=utf8%2Fdoc%2Futf8tools.html;fp=utf8%2Fdoc%2Futf8tools.html;h=b7bd5d70b530cfec94aa317dbbde0cba375b56dd;hb=f4eb6ab0f45d0e81abdca3f121a8c91aaa11330d;hp=0000000000000000000000000000000000000000;hpb=ba80113991af72ec053d761276c6c7f4a4035880;p=gedcom-parse.git diff --git a/utf8/doc/utf8tools.html b/utf8/doc/utf8tools.html new file mode 100644 index 0000000..b7bd5d7 --- /dev/null +++ b/utf8/doc/utf8tools.html @@ -0,0 +1,175 @@ +Using the GEDCOM parser library + + + + +

UTF-8 tools library

+
+ +

Index

+ +

Introduction
UTF-8 string functions
+
Converting character sets

Generic interface
Specific locale conversion
+

+ +

Introduction

+The library libutf8tools is part of the GEDCOM parser library, but it can also be used in unrelated programs. It provides helper functions for handling the UTF-8 encoding. The following files are installed with it:
+

a library 'libutf8tools.so', which should be linked into your program
a header 'utf8tools.h', which should be included in the source code of your program

+The following sections describe the features of the library.
+

UTF-8 string functions

+The following simple functions are available to handle UTF-8 strings in general:
+ + +

int   is_utf8_string (char *input);
int   utf8_strlen (char *input);

+The first function returns 1 if the given input is a valid UTF-8 string and 0 otherwise. The second returns the number of UTF-8 characters in the given input. Note that the second function assumes that the input is valid UTF-8, and gives unpredictable results if it isn't.
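To make the distinction between the two functions concrete, here is a minimal sketch of how such a check and such a count can be written. The names sketch_is_utf8 and sketch_utf8_strlen are hypothetical and this is not the code of libutf8tools; in particular, whether the library also rejects overlong encodings and surrogates, as this sketch does, is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for is_utf8_string (not the library's code):
   returns 1 if the NUL-terminated input is well-formed UTF-8, else 0. */
int sketch_is_utf8(const char *input)
{
    const unsigned char *p = (const unsigned char *)input;
    assert(input != NULL);
    while (*p) {
        int trailing;        /* number of continuation bytes expected */
        unsigned long cp;    /* code point being decoded */
        unsigned long min;   /* smallest code point allowed for this length */
        if (*p < 0x80)                { trailing = 0; cp = *p;        min = 0; }
        else if ((*p & 0xE0) == 0xC0) { trailing = 1; cp = *p & 0x1F; min = 0x80; }
        else if ((*p & 0xF0) == 0xE0) { trailing = 2; cp = *p & 0x0F; min = 0x800; }
        else if ((*p & 0xF8) == 0xF0) { trailing = 3; cp = *p & 0x07; min = 0x10000; }
        else return 0;       /* stray continuation byte or invalid lead byte */
        p++;
        while (trailing-- > 0) {
            if ((*p & 0xC0) != 0x80) return 0;  /* also catches an early NUL */
            cp = (cp << 6) | (*p & 0x3F);
            p++;
        }
        if (cp < min) return 0;                      /* overlong encoding */
        if (cp > 0x10FFFF) return 0;                 /* outside Unicode */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogate */
    }
    return 1;
}

/* Hypothetical stand-in for utf8_strlen: counts every byte that is not a
   continuation byte (10xxxxxx). Like the documented function, it assumes
   the input is valid UTF-8 and does not check it. */
int sketch_utf8_strlen(const char *input)
{
    const unsigned char *p = (const unsigned char *)input;
    int count = 0;
    assert(input != NULL);
    for (; *p; p++)
        if ((*p & 0xC0) != 0x80)
            count++;
    return count;
}
```

The counting function never validates anything, which is why calling it on input that has not passed the validity check gives meaningless results.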
+

+ +

Converting character sets

+For conversion from and to UTF-8 there is a generic interface which gives +all the necessary flexibility, and a specific interface for conversion to +and from the locale, which is less flexible, but more straightforward.
+

Generic interface

Conversion handle

+In general, the program needs to initialize a conversion handle before some +actual text can be converted to and from UTF-8. This initialization +(and the cleanup at the end) is performed via the following functions: +

convert_t   initialize_utf8_conversion (const char *charset, int ext_outbuf);
void        cleanup_utf8_conversion (convert_t conv);

+The first function returns a conversion handle, which needs to be passed +to all generic conversion functions. Through this handle, bidirectional +conversion can take place between UTF-8 and the given character set 'charset'. + The implementation of this handle is not visible to the program that +uses it. In case of an error, the returned value is NULL and errno gives the error that occurred.
+
+The second parameter ext_outbuf should be non-zero if you want +to control the output buffer yourself (see below). For normal circumstances, +you should pass 0 for this parameter.
+
+To avoid memory leaks, clean up each conversion handle with the cleanup_utf8_conversion function once it is no longer needed. Note that after this function has been called, any access to the handle results in undefined behaviour.
+

Conversion functions

+Once a conversion handle is initialized, it can be used to convert text between +UTF-8 and the given character set. There are three functions available +to do so: +

char* convert_from_utf8 (convert_t conv, const char* input, int* conv_fails, size_t* output_len);

char* convert_to_utf8 (convert_t conv, const char* input, size_t input_len);
char* convert_to_utf8_incremental (convert_t conv, const char* input, size_t input_len);

+All three functions take the conversion handle as first parameter, and the +text to convert as second parameter. They return a pointer to an output +buffer, which is overwritten at each call of the functions (unless you control +your own output buffers, see below).
+
+The difference between the last two functions is that convert_to_utf8 converts only entire strings (i.e. it resets the conversion state each time), whereas convert_to_utf8_incremental +takes previous conversions into account for the current conversion (left +over input characters from the previous conversion can then be combined with +the current input characters). If you pass NULL as input to convert_to_utf8_incremental, the conversion restarts from a clean state.
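The following sketch illustrates why an incremental variant is useful: when input arrives in chunks, a character can be split across two chunks, and the leftover bytes have to be carried over to the next call. It is only an illustration under stated assumptions, not the library's implementation: the state struct and function name are hypothetical, the source character set is fixed to UTF-16LE, and surrogate pairs are not handled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical conversion state: at most one leftover input byte. */
struct sketch_state {
    int has_pending;
    unsigned char pending;
};

/* Converts a chunk of UTF-16LE bytes (BMP only, no surrogate handling) to
   UTF-8. An odd trailing byte is kept in the state and combined with the
   next chunk. Passing input == NULL resets the state, mirroring the
   documented behaviour. The caller must provide room in out for
   3 * ((input_len + 1) / 2) + 1 bytes. Returns the number of bytes written,
   not counting the terminating NUL. */
size_t sketch_to_utf8_incremental(struct sketch_state *st,
                                  const unsigned char *input,
                                  size_t input_len, char *out)
{
    size_t i = 0, n = 0;
    assert(st != NULL && out != NULL);
    if (input == NULL) {
        st->has_pending = 0;          /* restart from a clean state */
        out[0] = '\0';
        return 0;
    }
    while (i < input_len) {
        unsigned int lo, hi, unit;
        if (st->has_pending) {
            lo = st->pending;         /* byte left over from the last call */
            st->has_pending = 0;
        } else {
            lo = input[i++];
            if (i == input_len) {     /* incomplete unit: keep for later */
                st->pending = (unsigned char)lo;
                st->has_pending = 1;
                break;
            }
        }
        hi = input[i++];
        unit = lo | (hi << 8);
        if (unit < 0x80) {
            out[n++] = (char)unit;
        } else if (unit < 0x800) {
            out[n++] = (char)(0xC0 | (unit >> 6));
            out[n++] = (char)(0x80 | (unit & 0x3F));
        } else {
            out[n++] = (char)(0xE0 | (unit >> 12));
            out[n++] = (char)(0x80 | ((unit >> 6) & 0x3F));
            out[n++] = (char)(0x80 | (unit & 0x3F));
        }
    }
    out[n] = '\0';
    return n;
}
```

A non-incremental conversion would have to treat the dangling byte at the end of a chunk as an error, because it resets its state on every call.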
+
+Since conversion from UTF-8 to another character set can fail (it's possible +that some characters cannot be encoded in the target character set), the +function convert_from_utf8 has a third parameter, conv_fails, +which can return the number of conversion failures in the input. Pass +a pointer to an integer if you're interested, or pass NULL otherwise. Note +that for conversion failures the string '?' will be put in the output instead +of the character that could not be converted. This string can be changed +using:
+

int conversion_set_unknown (convert_t conv, const char *unknown);
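The interplay of the failure count and the replacement string can be sketched as follows. This is a hand-written illustration, not the library's code: the function name is hypothetical, the target character set is fixed to ISO-8859-1, and the replacement string is passed as a parameter instead of being stored in a conversion handle.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: converts UTF-8 to ISO-8859-1. Every character that
   has no ISO-8859-1 equivalent (anything above U+00FF) is replaced by the
   string `unknown` and counted. conv_fails may be NULL if the caller is not
   interested in the count. The caller must provide room in out for
   strlen(input) * max(1, strlen(unknown)) + 1 bytes. Returns the output
   length, not counting the terminating NUL. */
size_t sketch_from_utf8_latin1(const char *input, const char *unknown,
                               int *conv_fails, char *out)
{
    const unsigned char *p = (const unsigned char *)input;
    size_t n = 0, ulen;
    int fails = 0;
    assert(input != NULL && unknown != NULL && out != NULL);
    ulen = strlen(unknown);
    while (*p) {
        if (*p < 0x80) {
            out[n++] = (char)*p++;                  /* ASCII maps to itself */
        } else if ((*p == 0xC2 || *p == 0xC3) && (p[1] & 0xC0) == 0x80) {
            /* two-byte sequence encoding U+0080..U+00FF */
            out[n++] = (char)(((*p & 0x03) << 6) | (p[1] & 0x3F));
            p += 2;
        } else {
            /* not representable: emit the replacement and skip the lead
               byte together with its continuation bytes */
            memcpy(out + n, unknown, ulen);
            n += ulen;
            fails++;
            p++;
            while ((*p & 0xC0) == 0x80)
                p++;
        }
    }
    out[n] = '\0';
    if (conv_fails)
        *conv_fails = fails;
    return n;
}
```

Usually the caller only needs to know whether the count is greater than zero, for example to warn the user that the output is lossy.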

+Some character sets use wide characters to encode text. For simplicity, however, the conversion functions above all take and return plain char strings, so in some cases it is necessary to know how long the strings are: a string that actually holds wide characters cannot be treated as a null-terminated string, so strlen does not work on it.
+
+For this reason, the function convert_from_utf8 has a fourth parameter which can return the length of the output string (pass NULL if you know you don't need it), and the other functions have an input_len parameter, which must always be the length of the input string, even when it could also be retrieved via strlen.
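The problem with strlen can be seen on a very small example. The sketch below uses a hand-built UTF-16LE byte sequence for the text "Hi"; the array and function names are hypothetical and serve only as illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* "Hi" encoded as UTF-16LE: every second byte is zero. */
static const char sketch_utf16_hi[] = { 'H', 0, 'i', 0 };

/* Real length in bytes, known here from the array size. */
size_t sketch_true_len(void)
{
    return sizeof sketch_utf16_hi;
}

/* Length as seen by strlen, which stops at the first zero byte. */
size_t sketch_strlen_len(void)
{
    return strlen(sketch_utf16_hi);
}

/* Demonstrates the mismatch; both assertions hold for the data above. */
void sketch_show_mismatch(void)
{
    assert(sketch_strlen_len() == 1);
    assert(sketch_true_len() == 4);
}
```

This is why a length has to travel alongside the char pointer whenever the data may be in a wide character set.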
+

Controlling the output buffer
+

+In some cases, you'd like to control the output buffer yourself, e.g. when you want to have multiple output buffers to avoid copying the strings. This can be done by declaring your intention at the initialization of the conversion handle (see above). In that case, the initialization doesn't allocate an output buffer itself, and you have to control it via the following functions before you can do any conversion:

conv_buffer_t create_conv_buffer (int initial_size);
void          free_conv_buffer (conv_buffer_t buf);

int           conversion_set_output_buffer (convert_t conv, conv_buffer_t buf);

+The first function returns a handle to a new conversion buffer with given +initial size (the buffer is expanded dynamically when necessary). The +second function frees the buffer: all further access to the buffer handle +will result in undefined behaviour.
+
+The third function sets the current output buffer for the given conversion handle, which makes it possible to switch between output buffers. The function returns 1 on success, 0 on failure.
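As an illustration of a buffer that "is expanded dynamically when necessary", here is a minimal sketch of a growable buffer. The struct and function names are hypothetical, and the doubling strategy is an assumption; the documentation does not say how libutf8tools grows its buffers.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical growable buffer; not the library's conv_buffer_t. */
struct sketch_buffer {
    char *data;
    size_t size;
};

/* Allocates a buffer with the given initial size (at least 1 byte).
   Returns NULL if memory allocation fails. */
struct sketch_buffer *sketch_create_buffer(size_t initial_size)
{
    struct sketch_buffer *b = malloc(sizeof *b);
    if (b == NULL)
        return NULL;
    b->size = initial_size ? initial_size : 1;
    b->data = malloc(b->size);
    if (b->data == NULL) {
        free(b);
        return NULL;
    }
    b->data[0] = '\0';
    return b;
}

/* Makes sure the buffer holds at least `needed` bytes, doubling the size
   until it fits (assumed strategy; size overflow is not guarded against
   in this sketch). Returns 1 on success, 0 on failure. */
int sketch_reserve(struct sketch_buffer *b, size_t needed)
{
    size_t new_size;
    char *p;
    assert(b != NULL);
    if (needed <= b->size)
        return 1;
    new_size = b->size;
    while (new_size < needed)
        new_size *= 2;
    p = realloc(b->data, new_size);
    if (p == NULL)
        return 0;                 /* the old buffer is still valid */
    b->data = p;
    b->size = new_size;
    return 1;
}

/* Frees the buffer; the pointer must not be used afterwards. */
void sketch_free_buffer(struct sketch_buffer *b)
{
    if (b != NULL) {
        free(b->data);
        free(b);
    }
}
```

Because growing can move the data, pointers into the buffer obtained before a conversion should not be assumed to stay valid after a later conversion into the same buffer.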
+

Specific locale conversion
+

+For conversion to and from the current locale, there is a simpler interface available, which takes care of the conversion handle implicitly. The following functions are available:
+

char *convert_utf8_to_locale (char *input, int *conv_failures);
char *convert_locale_to_utf8 (char *input);

+

+Both functions return a pointer to a static buffer that is overwritten on each call. For these functions to work properly, the application must first set the locale using the setlocale function.
+
+ If you pass a pointer to an integer to the first function, it will be +set to the number of conversion failures, i.e. characters that couldn't +be converted; you can also just pass NULL if you are not interested +(note that usually, the interesting information is just whether there +were conversion failures or not, which is then given by the integer +being bigger than zero or not). The second function doesn't need this, +because any locale can be converted to UTF-8.
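The requirement to call setlocale first can be illustrated with standard POSIX calls. The sketch below selects a locale and asks which character encoding it uses; the function name is hypothetical, nl_langinfo is POSIX rather than ISO C, and it is an assumption that libutf8tools determines the locale's character set in a comparable way.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>

/* Hypothetical helper: selects the given locale for character handling and
   returns the name of its character encoding. Pass "" to take the locale
   from the environment (LANG, LC_ALL, ...). If the locale cannot be set,
   the previously active locale stays in effect. */
const char *sketch_locale_charset(const char *locale_name)
{
    assert(locale_name != NULL);
    setlocale(LC_CTYPE, locale_name);
    return nl_langinfo(CODESET);
}
```

A program that never calls setlocale runs in the "C" locale, whose character set is typically plain ASCII, so most non-ASCII characters would count as conversion failures.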
+
+ You can change the "?" that is output for characters that can't be converted + to any string you want, using the following function before the conversion + calls:
+ +

void convert_set_unknown (const char *unknown);

+ + +

+ +

$Id$
$Name$