From: Peter Verthez Date: Sat, 28 Dec 2002 13:21:32 +0000 (+0000) Subject: Added documentation. X-Git-Url: https://git.dlugolecki.net.pl/?a=commitdiff_plain;h=f4eb6ab0f45d0e81abdca3f121a8c91aaa11330d;hp=ba80113991af72ec053d761276c6c7f4a4035880;p=gedcom-parse.git Added documentation. --- diff --git a/utf8/.cvsignore b/utf8/.cvsignore index 77d059e..84183e4 100644 --- a/utf8/.cvsignore +++ b/utf8/.cvsignore @@ -3,4 +3,3 @@ configure config.h.in Makefile.in stamp-h.in -README diff --git a/utf8/README b/utf8/README new file mode 100644 index 0000000..e69de29 diff --git a/utf8/configure.in b/utf8/configure.in index 9967fe0..4e0e464 100644 --- a/utf8/configure.in +++ b/utf8/configure.in @@ -89,4 +89,5 @@ else fi AC_OUTPUT(Makefile + doc/Makefile libcharset/Makefile) diff --git a/utf8/doc/.cvsignore b/utf8/doc/.cvsignore new file mode 100644 index 0000000..70845e0 --- /dev/null +++ b/utf8/doc/.cvsignore @@ -0,0 +1 @@ +Makefile.in diff --git a/utf8/doc/Makefile.am b/utf8/doc/Makefile.am new file mode 100644 index 0000000..93f87c0 --- /dev/null +++ b/utf8/doc/Makefile.am @@ -0,0 +1,9 @@ +## Process this file with automake to produce Makefile.in +# $Id$ +# $Name$ + +DOC_FILES = utf8tools.html +EXTRA_DIST = $(DOC_FILES) +docdir = $(datadir)/doc/@PACKAGE@-@VERSION@ + +doc_DATA = $(DOC_FILES) diff --git a/utf8/doc/utf8tools.html b/utf8/doc/utf8tools.html new file mode 100644 index 0000000..b7bd5d7 --- /dev/null +++ b/utf8/doc/utf8tools.html @@ -0,0 +1,175 @@ +Using the GEDCOM parser library + + + + +

UTF-8 tools library

+
+ +

Index

+ +

Introduction
UTF-8 string functions
+
Converting character sets

Generic interface
Specific locale conversion
+

+ +

Introduction

+The library libutf8tools is part of the GEDCOM parser library, +but it can be used in unrelated programs too. It provides some helper +functions for handling UTF-8 encoding. The following files are +installed with it:
+

a library 'libutf8tools.so', which should be linked into your program
a header 'utf8tools.h', which should be included in the source code of your program

+The following sections describe the features of the library.
+

UTF-8 string functions

+The following simple functions are available to handle UTF-8 strings in general:
+ + +

int   is_utf8_string (char *input);
int   utf8_strlen (char *input);

+The +first one returns 1 if the given input is a valid UTF-8 string and +0 otherwise; the second gives the number of UTF-8 characters in the given +input. Note that the second function assumes that the input is valid +UTF-8, and gives unpredictable results if it isn't.
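To make the two checks concrete, here is a minimal standalone sketch of what validating and counting UTF-8 characters can look like. The names sketch_is_utf8 and sketch_utf8_strlen are hypothetical and this is not the library's implementation; the validity check is deliberately simplified and only inspects the lead/continuation byte structure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not the library's code: returns 1 if the
   NUL-terminated input has a well-formed UTF-8 lead/continuation byte
   structure, 0 otherwise. Simplification: overlong forms and
   surrogates are not rejected. */
int sketch_is_utf8(const char *input)
{
    const unsigned char *p = (const unsigned char *)input;

    assert(input != NULL);
    while (*p) {
        int follow;                       /* continuation bytes expected */

        if (*p < 0x80)                follow = 0;
        else if ((*p & 0xE0) == 0xC0) follow = 1;
        else if ((*p & 0xF0) == 0xE0) follow = 2;
        else if ((*p & 0xF8) == 0xF0) follow = 3;
        else return 0;                    /* stray continuation or bad lead */
        p++;
        while (follow-- > 0) {
            /* a NUL here also fails the test, so we never read past it */
            if ((*p & 0xC0) != 0x80) return 0;
            p++;
        }
    }
    return 1;
}

/* Hypothetical helper: counts characters by counting every byte that
   is not a continuation byte (10xxxxxx). Only meaningful for valid
   UTF-8 input. */
int sketch_utf8_strlen(const char *input)
{
    const unsigned char *p = (const unsigned char *)input;
    int count = 0;

    assert(input != NULL);
    for (; *p; p++)
        if ((*p & 0xC0) != 0x80) count++;
    return count;
}
```

The counting function shows why valid input matters: it never checks that continuation bytes follow a lead byte, so malformed input yields a meaningless count.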
+

+ +

Converting character sets

+For conversion from and to UTF-8 there is a generic interface which gives +all the necessary flexibility, and a specific interface for conversion to +and from the locale, which is less flexible, but more straightforward.
+

Generic interface

Conversion handle

+In general, the program needs to initialize a conversion handle before some +actual text can be converted to and from UTF-8. This initialization +(and the cleanup at the end) is performed via the following functions: +

convert_t   initialize_utf8_conversion (const char *charset, int ext_outbuf);
void        cleanup_utf8_conversion (convert_t conv);

+The first function returns a conversion handle, which needs to be passed +to all generic conversion functions. Through this handle, bidirectional +conversion can take place between UTF-8 and the given character set 'charset'. + The implementation of this handle is not visible to the program that +uses it. In case of an error, the returned value is NULL and errno gives the error that occurred.
+
+The second parameter ext_outbuf should be non-zero if you want +to control the output buffer yourself (see below). For normal circumstances, +you should pass 0 for this parameter.
+
+To avoid memory leaks, it is advised that conversion handles are cleaned up when not needed anymore, using the cleanup_utf8_conversion function. Note that after using this function, any access to the handle will result in undefined behaviour.
+

Conversion functions

+Once a conversion handle is initialized, it can be used to convert text between +UTF-8 and the given character set. There are three functions available +to do so: +

char* convert_from_utf8 (convert_t conv, const char* input, int* conv_fails, size_t* output_len);

char* convert_to_utf8 (convert_t conv, const char* input, size_t input_len);
char* convert_to_utf8_incremental (convert_t conv, const char* input, size_t input_len);

+All three functions take the conversion handle as first parameter, and the +text to convert as second parameter. They return a pointer to an output +buffer, which is overwritten at each call of the functions (unless you control +your own output buffers, see below).
+
+The difference between the last two functions is that convert_to_utf8 converts only entire strings (i.e. it resets the conversion state each time), whereas convert_to_utf8_incremental +takes previous conversions into account for the current conversion (left +over input characters from the previous conversion can then be combined with +the current input characters). If you pass NULL as input to convert_to_utf8_incremental, the conversion restarts from a clean state.
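The idea of carrying leftover input from one call to the next can be illustrated with a small standalone sketch. It assumes UTF-16LE input restricted to the Basic Multilingual Plane, and the names sketch_state and sketch_to_utf8_incremental are hypothetical; the library's own state handling is not shown in this document and may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical state carried between calls: at most one leftover byte
   of a UTF-16LE code unit that was split across two input chunks. */
struct sketch_state {
    int has_pending;
    unsigned char pending;
};

/* Converts a chunk of UTF-16LE (BMP only, no surrogate pairs) to UTF-8.
   Passing input == NULL resets the state. 'out' must have room for
   3 bytes per complete code unit plus a terminating NUL.
   Returns the number of bytes written, excluding the NUL. */
size_t sketch_to_utf8_incremental(struct sketch_state *st,
                                  const unsigned char *input,
                                  size_t input_len, char *out)
{
    size_t i = 0, n = 0;

    assert(st != NULL && out != NULL);
    if (input == NULL) {                 /* restart from a clean state */
        st->has_pending = 0;
        out[0] = '\0';
        return 0;
    }
    while (i < input_len) {
        unsigned lo, unit;

        if (st->has_pending) {           /* combine with the leftover byte */
            lo = st->pending;
            st->has_pending = 0;
        } else if (i + 1 < input_len) {
            lo = input[i++];
        } else {                         /* odd byte at the end: keep it */
            st->pending = input[i++];
            st->has_pending = 1;
            break;
        }
        unit = lo | ((unsigned)input[i++] << 8);
        if (unit < 0x80) {
            out[n++] = (char)unit;
        } else if (unit < 0x800) {
            out[n++] = (char)(0xC0 | (unit >> 6));
            out[n++] = (char)(0x80 | (unit & 0x3F));
        } else {
            out[n++] = (char)(0xE0 | (unit >> 12));
            out[n++] = (char)(0x80 | ((unit >> 6) & 0x3F));
            out[n++] = (char)(0x80 | (unit & 0x3F));
        }
    }
    out[n] = '\0';
    return n;
}
```

A non-incremental variant would simply clear the state at the start of every call, which is the behaviour described for whole-string conversion.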
+
+Since conversion from UTF-8 to another character set can fail (it's possible +that some characters cannot be encoded in the target character set), the +function convert_from_utf8 has a third parameter, conv_fails, +which can return the number of conversion failures in the input. Pass +a pointer to an integer if you're interested, or pass NULL otherwise. Note +that for conversion failures the string '?' will be put in the output instead +of the character that could not be converted. This string can be changed +using:
+

int conversion_set_unknown (convert_t conv, const char *unknown);
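The failure-counting behaviour can be sketched in standalone C. The example below assumes ISO-8859-1 as the target character set and valid UTF-8 input; sketch_utf8_to_latin1 is a hypothetical name, and it takes a single replacement character where the library accepts a string.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: convert valid UTF-8 to ISO-8859-1.
   Characters above U+00FF cannot be encoded, so 'unknown' is written
   instead and *conv_fails (if not NULL) receives their count.
   Returns the number of bytes written, excluding the terminating NUL;
   the output is truncated if outsize is too small. */
size_t sketch_utf8_to_latin1(const char *input, char *output,
                             size_t outsize, char unknown,
                             int *conv_fails)
{
    const unsigned char *p = (const unsigned char *)input;
    size_t n = 0;
    int fails = 0;

    assert(input != NULL && output != NULL && outsize > 0);
    while (*p && n + 1 < outsize) {
        if (*p < 0x80) {
            output[n++] = (char)*p++;            /* ASCII: copy as is */
        } else if ((*p == 0xC2 || *p == 0xC3) && (p[1] & 0xC0) == 0x80) {
            /* two-byte sequence encoding U+0080..U+00FF */
            output[n++] = (char)(((*p & 0x03) << 6) | (p[1] & 0x3F));
            p += 2;
        } else {
            /* not representable: emit the replacement, skip the sequence */
            output[n++] = unknown;
            fails++;
            p++;
            while ((*p & 0xC0) == 0x80) p++;
        }
    }
    output[n] = '\0';
    if (conv_fails != NULL) *conv_fails = fails;
    return n;
}
```

Passing NULL for conv_fails mirrors the option of ignoring the failure count when only the converted text matters.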

+Some character sets use wide characters to encode text. But since, for +simplicity, the conversion functions above all take and return normal char +strings, it is necessary in some cases to know how long the strings are (a +string that actually uses wide characters cannot be treated as +a null-terminated string, so strlen cannot work on it).
+
+For this reason, the function convert_from_utf8 has a fourth +parameter which can return the length of the output string (pass NULL if +you know you don't need it), and the other functions have an input_len parameter, which should always be the string length of the input string, even if it could also be retrieved via strlen.
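The reason an explicit length is needed can be seen in a few lines of standalone C. The helper sketch_ascii_to_utf16le is hypothetical and only widens ASCII to UTF-16LE for illustration; it is not part of the library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: widen an ASCII string to UTF-16LE (two bytes
   per character). Returns the number of bytes written; 'out' must
   hold at least 2 * strlen(ascii) bytes. */
size_t sketch_ascii_to_utf16le(const char *ascii, char *out)
{
    size_t n = 0;

    assert(ascii != NULL && out != NULL);
    for (; *ascii; ascii++) {
        out[n++] = *ascii;   /* low byte */
        out[n++] = '\0';     /* high byte is zero for ASCII */
    }
    return n;
}
```

For the input "abc" the helper writes 6 bytes, but strlen on the result stops at the first zero byte and reports 1, so only the returned length describes the data correctly.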
+

Controlling the output buffer
+

+In some cases, you'd like to control the output buffer yourself, e.g. when +you want to have multiple output buffers for the efficiency of not having +to copy the strings. This can be done by declaring your intention at +the initialization of the conversion handle (see above). In that case, +the initialization doesn't allocate an output buffer itself, and you have +to control it via the following functions, before you can do any conversion: +

conv_buffer_t create_conv_buffer (int initial_size);
void          free_conv_buffer (conv_buffer_t buf);

int           conversion_set_output_buffer (convert_t conv, conv_buffer_t buf);

+The first function returns a handle to a new conversion buffer with given +initial size (the buffer is expanded dynamically when necessary). The +second function frees the buffer: all further access to the buffer handle +will result in undefined behaviour.
+
+The third function sets the current output buffer for the given +conversion handle. This makes it possible to switch between output buffers. The +function returns 1 on success, 0 on failure.
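A buffer that starts at a given size and expands on demand might be structured as follows. This is only a sketch: struct sketch_buffer and the sketch_* functions are hypothetical, since the library's conv_buffer_t is opaque and its growth policy is not documented here; doubling is an assumption chosen for simplicity.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical buffer type; the library's conv_buffer_t is opaque. */
struct sketch_buffer {
    char  *data;
    size_t size;   /* allocated bytes */
    size_t used;   /* bytes in use, excluding the terminating NUL */
};

struct sketch_buffer *sketch_create_buffer(size_t initial_size)
{
    struct sketch_buffer *buf = malloc(sizeof *buf);

    if (buf == NULL) return NULL;
    if (initial_size == 0) initial_size = 1;   /* room for the NUL */
    buf->data = malloc(initial_size);
    if (buf->data == NULL) { free(buf); return NULL; }
    buf->size = initial_size;
    buf->used = 0;
    buf->data[0] = '\0';
    return buf;
}

/* Appends len bytes, doubling the allocation until they fit.
   Returns 1 on success, 0 on allocation failure. */
int sketch_buffer_append(struct sketch_buffer *buf, const char *src,
                         size_t len)
{
    assert(buf != NULL && src != NULL);
    if (buf->used + len + 1 > buf->size) {
        size_t new_size = buf->size;
        char *p;

        while (buf->used + len + 1 > new_size) new_size *= 2;
        p = realloc(buf->data, new_size);
        if (p == NULL) return 0;
        buf->data = p;
        buf->size = new_size;
    }
    memcpy(buf->data + buf->used, src, len);
    buf->used += len;
    buf->data[buf->used] = '\0';
    return 1;
}

void sketch_free_buffer(struct sketch_buffer *buf)
{
    if (buf != NULL) { free(buf->data); free(buf); }
}
```

Keeping several such buffers alive at once is what lets a program hold multiple converted strings without copying each result out before the next conversion.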
+

Specific locale conversion
+

+For conversion to and from the current locale, there is a simpler interface available, +which takes care of the conversion handle implicitly. The following +functions are available:
+

char *convert_utf8_to_locale (char *input, int *conv_failures);
char *convert_locale_to_utf8 (char *input);

+

+ + Both functions return a pointer to a static buffer that is overwritten + on each call. For these functions to work properly, the application must first set +the locale using the setlocale function. +
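The locale setup step can be sketched as below. The example assumes a POSIX system, where nl_langinfo(CODESET) reports the character set of the selected locale; sketch_locale_charset is a hypothetical helper and not part of the library, and whether the library itself determines the character set this way is not stated here.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: select the given locale ("" means "take it from
   the environment") and return the name of its character set, or NULL
   if the locale could not be set. */
const char *sketch_locale_charset(const char *locale_name)
{
    assert(locale_name != NULL);
    if (setlocale(LC_ALL, locale_name) == NULL)
        return NULL;
    return nl_langinfo(CODESET);
}
```

An application would typically call setlocale(LC_ALL, "") once at startup; without that call a C program runs in the "C" locale, whose character set is usually plain ASCII.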
+
+ If you pass a pointer to an integer to the first function, it will be +set to the number of conversion failures, i.e. characters that couldn't +be converted; you can also just pass NULL if you are not interested +(note that usually, the interesting information is just whether there +were conversion failures or not, which is then given by the integer +being bigger than zero or not). The second function doesn't need this, +because any locale can be converted to UTF-8.
+
+ You can change the "?" that is output for characters that can't be converted + to any string you want, using the following function before the conversion + calls:
+ +

void convert_set_unknown (const char *unknown);

+ + +

+ +

$Id$
$Name$