From f4eb6ab0f45d0e81abdca3f121a8c91aaa11330d Mon Sep 17 00:00:00 2001 From: Peter Verthez Date: Sat, 28 Dec 2002 13:21:32 +0000 Subject: [PATCH] Added documentation. --- utf8/.cvsignore | 1 - utf8/README | 0 utf8/configure.in | 1 + utf8/doc/.cvsignore | 1 + utf8/doc/Makefile.am | 9 +++ utf8/doc/utf8tools.html | 175 ++++++++++++++++++++++++++++++++++++++++ 6 files changed, 186 insertions(+), 1 deletion(-) create mode 100644 utf8/README create mode 100644 utf8/doc/.cvsignore create mode 100644 utf8/doc/Makefile.am create mode 100644 utf8/doc/utf8tools.html diff --git a/utf8/.cvsignore b/utf8/.cvsignore index 77d059e..84183e4 100644 --- a/utf8/.cvsignore +++ b/utf8/.cvsignore @@ -3,4 +3,3 @@ configure config.h.in Makefile.in stamp-h.in -README diff --git a/utf8/README b/utf8/README new file mode 100644 index 0000000..e69de29 diff --git a/utf8/configure.in b/utf8/configure.in index 9967fe0..4e0e464 100644 --- a/utf8/configure.in +++ b/utf8/configure.in @@ -89,4 +89,5 @@ else fi AC_OUTPUT(Makefile + doc/Makefile libcharset/Makefile) diff --git a/utf8/doc/.cvsignore b/utf8/doc/.cvsignore new file mode 100644 index 0000000..70845e0 --- /dev/null +++ b/utf8/doc/.cvsignore @@ -0,0 +1 @@ +Makefile.in diff --git a/utf8/doc/Makefile.am b/utf8/doc/Makefile.am new file mode 100644 index 0000000..93f87c0 --- /dev/null +++ b/utf8/doc/Makefile.am @@ -0,0 +1,9 @@ +## Process this file with automake to produce Makefile.in +# $Id$ +# $Name$ + +DOC_FILES = utf8tools.html +EXTRA_DIST = $(DOC_FILES) +docdir = $(datadir)/doc/@PACKAGE@-@VERSION@ + +doc_DATA = $(DOC_FILES) diff --git a/utf8/doc/utf8tools.html b/utf8/doc/utf8tools.html new file mode 100644 index 0000000..b7bd5d7 --- /dev/null +++ b/utf8/doc/utf8tools.html @@ -0,0 +1,175 @@ +Using the GEDCOM parser library + + + + +

UTF-8 tools library

+
+ +

Index

+ + + +

+

Introduction

+The library libutf8tools is part of the GEDCOM parser library, but it can also be used in unrelated programs. It provides helper functions for handling the UTF-8 encoding. It comes with the following installed:
+ +The following sections describe the features of the library.
+
+

UTF-8 string functions

+The following simple functions are available to handle UTF-8 strings in general:
+ + +
+
int   is_utf8_string (char *input);
int utf8_strlen (char *input);
+The first function returns 1 if the given input is a valid UTF-8 string and 0 otherwise. The second returns the number of UTF-8 characters in the given input. Note that the second function assumes that the input is valid UTF-8 and gives unpredictable results if it is not.
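+The two checks described above can be sketched as follows. This is only an illustration and not the library's implementation: the names sketch_is_utf8 and sketch_utf8_strlen are hypothetical, and the validity check is simplified (it verifies lead and continuation bytes but does not reject every overlong form or surrogate).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of libutf8tools.
   Returns 1 if s has a well-formed lead/continuation byte structure,
   0 otherwise. Simplified: some overlong forms and surrogates pass. */
int sketch_is_utf8(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    int extra;                              /* continuation bytes expected */

    assert(s != NULL);
    while (*p) {
        if (*p < 0x80)                      extra = 0;
        else if (*p >= 0xC2 && *p <= 0xDF)  extra = 1;
        else if (*p >= 0xE0 && *p <= 0xEF)  extra = 2;
        else if (*p >= 0xF0 && *p <= 0xF4)  extra = 3;
        else return 0;          /* stray continuation or invalid lead byte */
        p++;
        while (extra-- > 0) {
            /* A terminating NUL also fails this test, so we never
               read past the end of a truncated sequence. */
            if ((*p & 0xC0) != 0x80) return 0;
            p++;
        }
    }
    return 1;
}

/* Hypothetical helper: counts characters by counting every byte that is
   not a continuation byte (10xxxxxx). Only meaningful for valid UTF-8. */
size_t sketch_utf8_strlen(const char *s)
{
    const unsigned char *p;
    size_t n = 0;

    assert(s != NULL);
    for (p = (const unsigned char *)s; *p; p++)
        if ((*p & 0xC0) != 0x80)
            n++;
    return n;
}
```

+The counting function shows why invalid input gives meaningless results: it never checks that continuation bytes actually follow a lead byte.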
+
+ +

Converting character sets

+For conversion from and to UTF-8 there is a generic interface which gives +all the necessary flexibility, and a specific interface for conversion to +and from the locale, which is less flexible, but more straightforward.
+

Generic interface

+

Conversion handle

+A program needs to initialize a conversion handle before any text can be converted to or from UTF-8. This initialization (and the cleanup at the end) is performed via the following functions:
+
convert_t   initialize_utf8_conversion (const char *charset, int ext_outbuf);
void cleanup_utf8_conversion (convert_t conv);
+The first function returns a conversion handle, which needs to be passed to all generic conversion functions. Through this handle, bidirectional conversion can take place between UTF-8 and the character set given by charset. The implementation of this handle is not visible to the program that uses it. On error, the function returns NULL and errno indicates the error that occurred.
+
+The second parameter ext_outbuf should be non-zero if you want to control the output buffer yourself (see the section on controlling the output buffer). Under normal circumstances, pass 0 for this parameter.
+
+To avoid memory leaks, clean up conversion handles that are no longer needed, using the cleanup_utf8_conversion function. After this call, any access to the handle results in undefined behaviour.
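+The life cycle of such a handle (open, NULL plus errno on failure, explicit cleanup) can be illustrated with a small sketch built on POSIX iconv. This is an assumption made for illustration only: the text above does not say how convert_t is implemented, and sketch_conv_t, sketch_conv_open and sketch_conv_close are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>

/* Hypothetical opaque handle holding one iconv descriptor per direction.
   The real convert_t layout is not documented. */
typedef struct sketch_conv {
    iconv_t to_utf8;      /* charset -> UTF-8 */
    iconv_t from_utf8;    /* UTF-8 -> charset */
} *sketch_conv_t;

/* Returns a handle, or NULL with errno set if charset is unsupported
   or memory is exhausted. */
sketch_conv_t sketch_conv_open(const char *charset)
{
    sketch_conv_t c;
    int saved;

    assert(charset != NULL);
    c = malloc(sizeof *c);
    if (c == NULL)
        return NULL;

    c->to_utf8 = iconv_open("UTF-8", charset);
    if (c->to_utf8 == (iconv_t)-1) {
        saved = errno;            /* keep iconv_open's error code */
        free(c);
        errno = saved;
        return NULL;
    }
    c->from_utf8 = iconv_open(charset, "UTF-8");
    if (c->from_utf8 == (iconv_t)-1) {
        saved = errno;
        iconv_close(c->to_utf8);
        free(c);
        errno = saved;
        return NULL;
    }
    return c;
}

/* Releases both descriptors; the handle must not be used afterwards. */
void sketch_conv_close(sketch_conv_t c)
{
    if (c == NULL)
        return;
    iconv_close(c->to_utf8);
    iconv_close(c->from_utf8);
    free(c);
}
```

+A caller would check the result of sketch_conv_open against NULL, inspect errno on failure, and call sketch_conv_close exactly once when done.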
+

Conversion functions

+Once a conversion handle is initialized, it can be used to convert text between +UTF-8 and the given character set.  There are three functions available +to do so: +
+
char* convert_from_utf8 (convert_t conv, const char* input, int* conv_fails, size_t* output_len);

char* convert_to_utf8 (convert_t conv, const char* input, size_t input_len);
char* convert_to_utf8_incremental (convert_t conv, const char* input, size_t input_len);
+All three functions take the conversion handle as the first parameter and the text to convert as the second parameter. They return a pointer to an output buffer, which is overwritten on each call (unless you control your own output buffers, see below).
+
+The difference between the last two functions is that convert_to_utf8 converts only entire strings (i.e. it resets the conversion state on each call), whereas convert_to_utf8_incremental takes previous conversions into account: input bytes left over from the previous call, such as the first part of a multi-byte character, are combined with the current input. If you pass NULL as input to convert_to_utf8_incremental, the conversion restarts from a clean state.
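+The idea of carrying leftover input from one call to the next can be sketched with a tiny stateful converter from big-endian UCS-2 to UTF-8. This is not the library's code: sketch_state and sketch_ucs2be_to_utf8 are hypothetical, surrogate pairs are not handled, and the caller is assumed to supply a large enough output buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical decoder state: at most one input byte is carried over. */
typedef struct {
    int have_pending;
    unsigned char pending;
} sketch_state;

/* Writes the UTF-8 encoding of cp (0..0xFFFF) to out; returns bytes written. */
static size_t sketch_put_utf8(unsigned int cp, char *out)
{
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (char)(0xE0 | (cp >> 12));
    out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Converts one chunk of big-endian UCS-2 to NUL-terminated UTF-8.
   A trailing odd byte is remembered in st and combined with the next
   chunk. Passing in == NULL resets the state. The buffer out must hold
   at least 3 * (in_len + 1) / 2 + 1 bytes. Returns bytes written,
   not counting the terminator. */
size_t sketch_ucs2be_to_utf8(sketch_state *st, const unsigned char *in,
                             size_t in_len, char *out)
{
    size_t i, n = 0;

    assert(st != NULL && out != NULL);
    if (in == NULL) {                 /* restart from a clean state */
        st->have_pending = 0;
        out[0] = '\0';
        return 0;
    }
    for (i = 0; i < in_len; i++) {
        if (!st->have_pending) {
            st->pending = in[i];      /* high byte: wait for the low byte */
            st->have_pending = 1;
        } else {
            unsigned int cp = ((unsigned int)st->pending << 8) | in[i];
            st->have_pending = 0;
            n += sketch_put_utf8(cp, out + n);
        }
    }
    out[n] = '\0';
    return n;
}
```

+Feeding the bytes 00 41 00 and then E9 in two calls yields "A" from the first call and the two-byte UTF-8 form of U+00E9 from the second, because the dangling 00 is kept in the state between calls.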
+
+Since conversion from UTF-8 to another character set can fail (some characters may not be encodable in the target character set), the function convert_from_utf8 has a third parameter, conv_fails, through which it can return the number of conversion failures in the input. Pass a pointer to an integer if you are interested, or pass NULL otherwise. For each conversion failure, the string '?' is put in the output instead of the character that could not be converted. This string can be changed using:
+
+
int conversion_set_unknown (convert_t conv, const char *unknown);
+
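+The substitution and failure counting described above can be sketched with a simplified converter from UTF-8 to 7-bit ASCII. The function sketch_utf8_to_ascii is hypothetical and far less general than convert_from_utf8; it only illustrates how an "unknown" string and a failure counter might interact, and it assumes valid UTF-8 input.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of libutf8tools.
   Copies valid UTF-8 input to out as ASCII. Every non-ASCII character is
   replaced by the string unknown and counted in *conv_fails (if the
   pointer is non-NULL). Returns 0 on success, -1 if out is too small. */
int sketch_utf8_to_ascii(const char *in, const char *unknown,
                         char *out, size_t out_size, int *conv_fails)
{
    const unsigned char *p;
    size_t ulen, n = 0;
    int fails = 0;

    assert(in != NULL && unknown != NULL && out != NULL);
    ulen = strlen(unknown);
    p = (const unsigned char *)in;

    while (*p) {
        if (*p < 0x80) {
            if (n + 1 >= out_size) return -1;   /* keep room for NUL */
            out[n++] = (char)*p++;
        } else {
            if (n + ulen >= out_size) return -1;
            memcpy(out + n, unknown, ulen);     /* substitute the character */
            n += ulen;
            fails++;
            p++;                                /* skip the lead byte ... */
            while ((*p & 0xC0) == 0x80)         /* ... and its continuations */
                p++;
        }
    }
    if (n >= out_size) return -1;
    out[n] = '\0';
    if (conv_fails != NULL)
        *conv_fails = fails;
    return 0;
}
```

+Note that one multi-byte character counts as a single failure and produces a single copy of the replacement string, regardless of how many bytes it occupied.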
+Some character sets use wide characters to encode text. For simplicity, however, the conversion functions above all take and return normal char strings, so in some cases it is necessary to know how long the strings are: a string that actually contains wide characters cannot be treated as a null-terminated string, so strlen does not work on it.
+
+For this reason, the function convert_from_utf8 has a fourth parameter through which it can return the length of the output string (pass NULL if you know you don't need it), and the other functions have an input_len parameter, which must always be set to the length of the input string, even when that length could also be obtained via strlen.
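+A short sketch shows why strlen is unreliable for such strings. It uses UTF-16LE as an example of a wide encoding; the helper names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of bytes before the first two-byte zero
   code unit in a UTF-16 buffer of at most max_bytes bytes. */
size_t sketch_utf16_bytes(const char *buf, size_t max_bytes)
{
    size_t i;

    assert(buf != NULL);
    for (i = 0; i + 1 < max_bytes; i += 2)
        if (buf[i] == 0 && buf[i + 1] == 0)
            return i;
    return i;
}

/* Nonzero if strlen() reports fewer bytes than the text really holds.
   Assumes buf contains at least one zero byte within max_bytes. */
int sketch_strlen_is_short(const char *buf, size_t max_bytes)
{
    return strlen(buf) < sketch_utf16_bytes(buf, max_bytes);
}
```

+For the text "AB" in UTF-16LE (bytes 41 00 42 00 followed by a two-byte terminator), strlen stops at the first zero byte and reports 1, while the text is really 4 bytes long. This is why the length has to be passed and returned explicitly.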
+

Controlling the output buffer
+

+In some cases, you may want to control the output buffer yourself, for example to keep several output buffers and avoid copying the converted strings. This can be done by declaring your intention at the initialization of the conversion handle (see above). In that case, the initialization doesn't allocate an output buffer itself, and you have to set one up via the following functions before you can do any conversion:
+
conv_buffer_t create_conv_buffer (int initial_size);
void free_conv_buffer (conv_buffer_t buf);

int conversion_set_output_buffer (convert_t conv, conv_buffer_t buf);
+The first function returns a handle to a new conversion buffer with the given initial size (the buffer is expanded dynamically when necessary). The second function frees the buffer; any further access to the buffer handle results in undefined behaviour.
+
+The third function sets the current output buffer for the given conversion handle, which makes it possible to switch between output buffers. The function returns 1 on success, 0 on failure.
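+A buffer that is "expanded dynamically when necessary" could look like the following sketch. The type sketch_buffer_t and its functions are hypothetical; the real conv_buffer_t is opaque and its growth strategy is not documented, so doubling is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical growable buffer; not the library's conv_buffer_t. */
typedef struct sketch_buffer {
    char *data;
    size_t size;      /* number of bytes currently allocated */
} *sketch_buffer_t;

/* Creates a buffer with the given initial size, or returns NULL. */
sketch_buffer_t sketch_create_buffer(size_t initial_size)
{
    sketch_buffer_t b = malloc(sizeof *b);

    if (b == NULL)
        return NULL;
    if (initial_size == 0)
        initial_size = 1;             /* doubling needs a non-zero start */
    b->data = malloc(initial_size);
    if (b->data == NULL) {
        free(b);
        return NULL;
    }
    b->size = initial_size;
    return b;
}

/* Ensures at least needed bytes are allocated, doubling the size until
   it fits. Returns 1 on success, 0 on failure (the buffer stays valid).
   Overflow of the size computation is not handled in this sketch. */
int sketch_reserve(sketch_buffer_t b, size_t needed)
{
    size_t new_size;
    char *p;

    assert(b != NULL);
    if (needed <= b->size)
        return 1;
    new_size = b->size;
    while (new_size < needed)
        new_size *= 2;
    p = realloc(b->data, new_size);
    if (p == NULL)
        return 0;
    b->data = p;
    b->size = new_size;
    return 1;
}

/* Frees the buffer; the handle must not be used afterwards. */
void sketch_free_buffer(sketch_buffer_t b)
{
    if (b == NULL)
        return;
    free(b->data);
    free(b);
}
```

+Keeping several such buffers and switching between them lets each converted string stay alive in its own buffer instead of being overwritten by the next conversion.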
+

Specific locale conversion
+

+For conversion to and from the current locale, there is a simpler interface available, which takes care of the conversion handle implicitly. The following functions are available:
+
+
char *convert_utf8_to_locale (char *input, int *conv_failures);
char *convert_locale_to_utf8 (char *input);
+
+
+
+Both functions return a pointer to a static buffer that is overwritten on each call. For these functions to work properly, the application must first set the locale using the setlocale function.
+
+If you pass a pointer to an integer to the first function, it is set to the number of conversion failures, i.e. the number of characters that could not be converted. Pass NULL if you are not interested. Usually the only interesting information is whether there were any conversion failures at all, i.e. whether the integer is greater than zero. The second function does not need this parameter, because text in any locale's character set can be converted to UTF-8.
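+The reason the locale must be set first can be illustrated with standard POSIX calls: the character set of the locale is only known after setlocale has been called. The helper sketch_locale_charset is hypothetical, and whether the library determines the character set in this way is an assumption.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: selects the given locale for character handling
   ("" means "take it from the environment") and returns the name of its
   character set, or NULL if the locale cannot be set. */
const char *sketch_locale_charset(const char *locale)
{
    assert(locale != NULL);
    if (setlocale(LC_CTYPE, locale) == NULL)
        return NULL;
    return nl_langinfo(CODESET);
}
```

+An application would typically call setlocale(LC_ALL, "") once at startup; without that call the program runs in the "C" locale, and a conversion to the locale would target that locale's character set rather than the user's.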
+
+You can replace the "?" that is output for characters that can't be converted with any string you want, by calling the following function before the conversion calls:
+ +
+
void convert_set_unknown (const char *unknown);
+
+ + +
+ +
$Id$
$Name$

+ \ No newline at end of file -- 2.30.2