Tizen Native API  6.5

C API: Charset Detection API.

Required Header

#include <utils_i18n.h>

Overview

This module provides a facility for detecting the charset or encoding of character data in an unknown text format. The input data is supplied as an array of bytes.

Character set detection is at best an imprecise operation. The detection process will attempt to identify the charset that best matches the characteristics of the byte data, but the process is partly statistical in nature, and the results cannot be guaranteed to always be correct.

For best accuracy in charset detection, the input data should be primarily in a single language, and at least a few hundred bytes of plain text in that language are needed. The detection process will attempt to ignore HTML- or XML-style markup that could otherwise obscure the content.

An alternative to the ICU Charset Detector is the Compact Encoding Detector, https://github.com/google/compact_enc_det. It often gives more accurate results, especially with short input samples.

Functions

int i18n_ucsdet_create (i18n_ucharset_detector_h *ucsd)
 Creates an i18n_ucharset_detector_h.
int i18n_ucsdet_destroy (i18n_ucharset_detector_h ucsd)
 Destroys a charset detector.
int i18n_ucsdet_set_text (i18n_ucharset_detector_h ucsd, const char *text_in, int32_t len)
 Sets the input byte data whose charset is to be detected.
int i18n_ucsdet_set_declared_encoding (i18n_ucharset_detector_h ucsd, const char *encoding, int32_t length)
 Sets the declared encoding for charset detection.
int i18n_ucsdet_detect (i18n_ucharset_detector_h ucsd, i18n_ucharset_match_h *ucsm)
 Gets the charset that best matches the supplied input data.
int i18n_ucsdet_detect_all (i18n_ucharset_detector_h ucsd, int32_t *matches_found, i18n_ucharset_match_h **ucsm)
 Gets all charset matches that appear to be consistent with the input, returning an array of results.
int i18n_ucsdet_get_name (const i18n_ucharset_match_h ucsm, const char **name)
 Gets the name of the charset represented by an i18n_ucharset_match_h.
int i18n_ucsdet_get_confidence (const i18n_ucharset_match_h ucsm, int32_t *number)
 Gets a confidence number for the quality of the match of the byte data with the charset.
int i18n_ucsdet_get_language (const i18n_ucharset_match_h ucsm, const char **code)
 Gets the RFC 3066 code for the language of the input data.
int i18n_ucsdet_get_uchars (const i18n_ucharset_match_h ucsm, i18n_uchar *buf, int32_t cap, int32_t *number)
 Gets the entire input text as an i18n_uchar string, placing it into a caller-supplied buffer.
int i18n_ucsdet_get_all_detectable_charsets (i18n_ucharset_detector_h ucsd, i18n_uenumeration_h *iterator)
 Gets an iterator over the set of all detectable charsets - over the charsets that are known to the charset detection service.
int i18n_ucsdet_is_input_filter_enabled (i18n_ucharset_detector_h ucsd, i18n_ubool *result)
 Gets whether input filtering is enabled for this charset detector.
int i18n_ucsdet_enable_input_filter (i18n_ucharset_detector_h ucsd, i18n_ubool filter, i18n_ubool *previous_setting)
 Enables filtering of input text.

Typedefs

typedef void * i18n_ucharset_detector_h
 An i18n_ucharset_detector_h handle.
typedef void * i18n_ucharset_match_h
 An i18n_ucharset_match_h handle.

Typedef Documentation

typedef void* i18n_ucharset_detector_h

An i18n_ucharset_detector_h handle.

Since :
6.0
typedef void* i18n_ucharset_match_h

An i18n_ucharset_match_h handle.

Since :
6.0

Function Documentation

int i18n_ucsdet_create ( i18n_ucharset_detector_h *  ucsd )

Creates an i18n_ucharset_detector_h.

Since :
6.0
Remarks:
The ucsd should be released using i18n_ucsdet_destroy().
Parameters:
[out]  ucsd  The newly created charset detector.
Returns:
0 on success, otherwise a negative error value
Return values:
I18N_ERROR_NONE  Successful
I18N_ERROR_INVALID_PARAMETER  Invalid function parameter

int i18n_ucsdet_destroy ( i18n_ucharset_detector_h  ucsd )

Destroys a charset detector.

All storage and any other resources owned by this charset detector will be released. Failure to destroy a charset detector when finished with it can result in memory leaks in the application.

Since :
6.0
Parameters:
[in]  ucsd  The charset detector to be destroyed.
Returns:
0 on success, otherwise a negative error value
Return values:
I18N_ERROR_NONE  Successful
I18N_ERROR_INVALID_PARAMETER  Invalid function parameter

int i18n_ucsdet_detect ( i18n_ucharset_detector_h  ucsd,
i18n_ucharset_match_h *  ucsm
)

Gets the charset that best matches the supplied input data.

Note, though, that because the detection only looks at the start of the input data, the returned charset may fail to handle the full set of input data.

The returned match ucsm is owned by the detector ucsd. It will remain valid until the detector input is reset, or until the detector is destroyed.

Since :
6.0
Remarks:
The ucsm is valid until ucsd is released.
Parameters:
[in]  ucsd  The charset detector to be used.
[out]  ucsm  An i18n_ucharset_match_h representing the best matching charset, or NULL if no charset matches the byte data.
Returns:
0 on success, otherwise a negative error value
Return values:
I18N_ERROR_NONE  Successful
I18N_ERROR_INVALID_PARAMETER  Invalid function parameter
int i18n_ucsdet_detect_all ( i18n_ucharset_detector_h  ucsd,
int32_t *  matches_found,
i18n_ucharset_match_h **  ucsm 
)

Gets all charset matches that appear to be consistent with the input, returning an array of results.

The results are ordered with the best quality match first.

Because the detection only looks at a limited amount of the input byte data, some of the returned charsets may fail to handle all of the input data.

Since :
6.0
Parameters:
[in]  ucsd  The charset detector to be used.
[out]  matches_found  Pointer to a variable that will be set to the number of charsets identified that are consistent with the input data.
[out]  ucsm  A pointer to an array of i18n_ucharset_match_h handles. This array, and the i18n_ucharset_match_h instances it contains, are owned by the detector ucsd and remain valid until ucsd is destroyed or modified.
Returns:
0 on success, otherwise a negative error value
Return values:
I18N_ERROR_NONE  Successful
I18N_ERROR_INVALID_PARAMETER  Invalid function parameter
int i18n_ucsdet_enable_input_filter ( i18n_ucharset_detector_h  ucsd,
i18n_ubool  filter,
i18n_ubool *  previous_setting
)

Enables filtering of input text.

If filtering is enabled, text within angle brackets ("<" and ">") will be removed before detection, which will remove most HTML or XML markup.

Since :
6.0
Parameters:
[in]  ucsd  The charset detector to check.
[in]  filter  True to enable input text filtering.
[out]  previous_setting  The previous setting.
Returns:
0 on success, otherwise a negative error value
Return values:
I18N_ERROR_NONE  Successful
I18N_ERROR_INVALID_PARAMETER  Invalid function parameter
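
As a rough illustration of the effect described above, the following standalone sketch drops every byte from a "<" through the next ">". It is not the detector's actual implementation, which may apply further heuristics. The function name strip_angle_bracket_text is hypothetical.

```c
#include <stddef.h>

/* Illustration only: copies `in` to `out`, skipping every byte from '<'
 * through the next '>'. Returns the number of bytes written.
 * `out` must have room for at least `len` bytes. */
size_t strip_angle_bracket_text(const char *in, size_t len, char *out)
{
    size_t written = 0;
    int in_markup = 0;

    for (size_t i = 0; i < len; i++) {
        if (in[i] == '<') {
            in_markup = 1;               /* start of a tag: stop copying */
        } else if (in[i] == '>' && in_markup) {
            in_markup = 0;               /* end of a tag: resume copying */
        } else if (!in_markup) {
            out[written++] = in[i];      /* ordinary content byte */
        }
    }
    return written;
}
```

For the input "<p>Hi <b>there</b></p>" the sketch leaves the 8 bytes "Hi there", which is the text the detection heuristics would then examine.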

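int i18n_ucsdet_get_all_detectable_charsets ( i18n_ucharset_detector_h  ucsd,
i18n_uenumeration_h *  iterator
)

Gets an iterator over the set of all detectable charsets - over the charsets that are known to the charset detection service.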

The returned iterator provides access to the names of the charsets.

The state of the charset detector that is passed in does not affect the result of this function, but requiring a valid charset detector as a parameter ensures that the charset detection service has been safely initialized and that the required detection data is available.

Note: Multiple charset encodings in the same family may share a single name in this implementation. For example, the returned iterator includes "ISO-8859-1" (ISO Latin 1) but not "windows-1252" (Windows Latin 1). However, the actual detection result can be "windows-1252" when the input data matches Latin 1 code points and also contains code points available only in "windows-1252".

Since :
6.0
Remarks:
The iterator should be released using i18n_uenumeration_destroy().
Parameters:
[in]  ucsd  A charset detector.
[out]  iterator  An iterator providing access to the detectable charset names.
Returns:
0 on success, otherwise a negative error value
Return values:
I18N_ERROR_NONE  Successful
I18N_ERROR_INVALID_PARAMETER  Invalid function parameter
int i18n_ucsdet_get_confidence ( const i18n_ucharset_match_h  ucsm,
int32_t *  number 
)

Gets a confidence number for the quality of the match of the byte data with the charset.

Confidence numbers range from zero to 100, with 100 representing complete confidence and zero representing no confidence.

The confidence values are somewhat arbitrary. They define an ordering within the results for any single detection operation but are not generally comparable between the results for different input.

A confidence value of ten does have a general meaning - it is used for charsets that can represent the input data, but for which there is no other indication that suggests that the charset is the correct one. Pure 7 bit ASCII data, for example, is compatible with a great many charsets, most of which will appear as possible matches with a confidence of 10.

Since :
6.0
Parameters:
[in]  ucsm  The charset match object.
[out]  number  A confidence number for the charset match.
Returns:
0 on success, otherwise a negative error value
Return values:
I18N_ERROR_NONE  Successful
I18N_ERROR_INVALID_PARAMETER  Invalid function parameter
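
One way an application could act on the special meaning of a confidence of 10 is sketched below. The enum match_kind_e and the function classify_confidence are hypothetical and not part of the API. The sketch deliberately avoids any other threshold, since the values only order results within a single detection.

```c
#include <stdint.h>

/* Hypothetical classification of a confidence number (not part of the API). */
typedef enum {
    MATCH_NO_CONFIDENCE,    /* zero or below: no confidence at all */
    MATCH_COMPATIBLE_ONLY,  /* exactly 10: charset can represent the data,
                               but nothing else suggests it is correct */
    MATCH_SCORED            /* any other value: meaningful only relative to
                               other matches from the same detection */
} match_kind_e;

match_kind_e classify_confidence(int32_t confidence)
{
    if (confidence <= 0)
        return MATCH_NO_CONFIDENCE;
    if (confidence == 10)
        return MATCH_COMPATIBLE_ONLY;
    return MATCH_SCORED;
}
```

An application might, for example, fall back to a declared or default encoding when the best match is only MATCH_COMPATIBLE_ONLY.
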
int i18n_ucsdet_get_language ( const i18n_ucharset_match_h  ucsm,
const char **  code 
)

Gets the RFC 3066 code for the language of the input data.

The Charset Detection service is intended primarily for detecting charsets, not language. For some, but not all, charsets, a language is identified as a byproduct of the detection process, and that is what is returned by this function.

CAUTION:
1. Language information is not available for input data encoded in all charsets. In particular, no language is identified for UTF-8 input data.
2. Closely related languages may sometimes be confused. If more accurate language detection is required, a linguistic analysis package should be used.

The storage for the returned code is owned by ucsm, and will remain valid while ucsm is valid.

Since :
6.0
Remarks:
The code should be released using free().
Parameters:
[in]  ucsm  The charset match object.
[out]  code  The RFC 3066 code for the language of the input data, or an empty string if the language could not be determined.
Returns:
0 on success, otherwise a negative error value
Return values:
I18N_ERROR_NONE  Successful
I18N_ERROR_INVALID_PARAMETER  Invalid function parameter
int i18n_ucsdet_get_name ( const i18n_ucharset_match_h  ucsm,
const char **  name 
)

Gets the name of the charset represented by an i18n_ucharset_match_h.

The storage for the returned name string is owned by ucsm, and will remain valid while ucsm is valid.

The name returned is suitable for use with the ICU conversion APIs.

Since :
6.0
Remarks:
The name should be released using free().
Parameters:
[in]  ucsm  The charset match object.
[out]  name  The name of the matching charset.
Returns:
0 on success, otherwise a negative error value
Return values:
I18N_ERROR_NONE  Successful
I18N_ERROR_INVALID_PARAMETER  Invalid function parameter
int i18n_ucsdet_get_uchars ( const i18n_ucharset_match_h  ucsm,
i18n_uchar *  buf,
int32_t  cap,
int32_t *  number
)

Gets the entire input text as an i18n_uchar string, placing it into a caller-supplied buffer.

A terminating NUL character will be appended to the buffer if space is available.

The number of i18n_uchar characters in the output string, not including the terminating NUL, is returned.

If the supplied buffer is smaller than required to hold the output, the contents of the buffer are undefined. The full output string length (the number of i18n_uchar characters) is returned as always, and can be used to allocate a buffer of the correct size.

Since :
6.0
Parameters:
[in]  ucsm  The charset match object.
[out]  buf  An i18n_uchar buffer to be filled with the converted text data.
[in]  cap  The capacity of the buffer in i18n_uchar.
[out]  number  The number of i18n_uchar in the output string.
Returns:
0 on success, otherwise a negative error value
Return values:
I18N_ERROR_NONE  Successful
I18N_ERROR_INVALID_PARAMETER  Invalid function parameter

int i18n_ucsdet_is_input_filter_enabled ( i18n_ucharset_detector_h  ucsd,
i18n_ubool *  result
)

Gets whether input filtering is enabled for this charset detector.

Input filtering removes text that appears to be HTML or XML markup from the input before applying the code page detection heuristics.

Since :
6.0
Parameters:
[in]  ucsd  The charset detector to check.
[out]  result  TRUE if filtering is enabled.
Returns:
0 on success, otherwise a negative error value
Return values:
I18N_ERROR_NONE  Successful
I18N_ERROR_INVALID_PARAMETER  Invalid function parameter
int i18n_ucsdet_set_declared_encoding ( i18n_ucharset_detector_h  ucsd,
const char *  encoding,
int32_t  length 
)

Sets the declared encoding for charset detection.

The declared encoding of an input text is an encoding obtained by the user from an HTTP header or XML declaration or similar source that can be provided as an additional hint to the charset detector.

How and whether the declared encoding will be used during the detection process is TBD.

Since :
6.0
Parameters:
[in]  ucsd  The charset detector to be used.
[in]  encoding  An encoding for the current data obtained from a header or declaration or other source outside of the byte data itself.
[in]  length  The length of the encoding name, or -1 if the name string is NUL terminated.
Returns:
0 on success, otherwise a negative error value
Return values:
I18N_ERROR_NONE  Successful
I18N_ERROR_INVALID_PARAMETER  Invalid function parameter
int i18n_ucsdet_set_text ( i18n_ucharset_detector_h  ucsd,
const char *  text_in,
int32_t  len 
)

Sets the input byte data whose charset is to be detected.

Ownership of the input text byte array remains with the caller. The input string must not be altered or deleted until the charset detector is either destroyed or reset to refer to different input text.

Since :
6.0
Parameters:
[in]  ucsd  The charset detector to be used.
[in]  text_in  The input text of unknown encoding.
[in]  len  The length of the input text, or -1 if the text is NUL terminated.
Returns:
0 on success, otherwise a negative error value
Return values:
I18N_ERROR_NONE  Successful
I18N_ERROR_INVALID_PARAMETER  Invalid function parameter