Unicode


Unicode is a standard for encoding text characters (letters, syllables, ideograms, punctuation, special characters, digits). It is maintained by the Unicode Consortium and kept synchronized with the international standard ISO/IEC 10646.

It defines a character set that aims to cover all text characters in use worldwide. With Unicode you can represent not only the letters of the Latin alphabet, but also the Greek, Cyrillic, Arabic, Hebrew and Thai scripts, the Japanese syllabaries (Katakana, Hiragana), Chinese characters and the Korean script (Hangul). In addition, special mathematical, commercial and technical characters can be encoded. Unicode also includes control characters, among them end of line, end of paragraph, left-to-right writing direction, and right-to-left writing direction. The right-to-left direction control character is needed for Arabic and Hebrew, for example.
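To make the range of scripts concrete, the following Python sketch looks up the code point and official name of one sample character per script with the standard unicodedata module. The sample characters are arbitrary picks, and the mapping of the four control characters to U+2028, U+2029, U+200E and U+200F is an assumption about which code points the text refers to.

```python
import unicodedata

# One arbitrarily chosen sample character per script (illustrative only).
SAMPLES = {
    "Latin": "\u0041",
    "Greek": "\u03a9",
    "Cyrillic": "\u0416",
    "Arabic": "\u0639",
    "Hebrew": "\u05d0",
    "Thai": "\u0e01",
    "Katakana": "\u30ab",
    "Hiragana": "\u3072",
    "Chinese": "\u4e2d",
    "Hangul": "\ud55c",
}

# Assumed code points for the control characters named in the text.
CONTROLS = {
    "end of line": "\u2028",
    "end of paragraph": "\u2029",
    "left-to-right": "\u200e",
    "right-to-left": "\u200f",
}


def describe(ch):
    """Return 'U+XXXX NAME' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


if __name__ == "__main__":
    for label, ch in {**SAMPLES, **CONTROLS}.items():
        # bidirectional() reports the direction class, e.g. 'L', 'R' or 'AL'.
        bidi = unicodedata.bidirectional(ch)
        print(f"{label:18} {describe(ch):42} bidi={bidi}")
```

The direction class printed in the last column shows why a right-to-left control character matters: Hebrew letters report R and Arabic letters report AL, while Latin letters report L.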

The main difference between Unicode and traditional character sets is the encoding. A 7-bit encoding such as ASCII can represent at most 128 characters, and an 8-bit encoding at most 256. Because far more than 256 different characters exist worldwide, character sets were introduced that use more than one byte to encode each text character.
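The limits of 7-bit and 8-bit encodings can be checked with a few lines of Python. This is only a sketch: the helper name fits is made up for illustration, and Latin-1 (ISO 8859-1) is used here as one example of an 8-bit character set, which the text above does not name.

```python
ASCII_LIMIT = 2 ** 7       # 128 values in a 7-bit encoding
EIGHT_BIT_LIMIT = 2 ** 8   # 256 values in an 8-bit encoding


def fits(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    for word in ["Hello", "Gr\u00fc\u00dfe", "\u03a9"]:
        print(word, "ascii:", fits(word, "ascii"), "latin-1:", fits(word, "latin-1"))
```

A word with umlauts no longer fits into ASCII but still fits into the 8-bit set, while a Greek letter fits into neither. That gap is what multi-byte character sets close.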

When people speak of Unicode, they usually mean the character set UCS-2. The 2 in the name indicates that two bytes (16 bits) are used to encode each character. This allows 65,536 characters to be represented in the so-called first level of Unicode (the "Basic Multilingual Plane", BMP for short). The other levels of Unicode encode rarely used, mostly historical characters, such as ancient Egyptian hieroglyphs and Chinese characters that are hardly in use. 16 bits are not sufficient to represent these characters. They are therefore encoded with 32 bits each, which gives 4,294,967,296 possible values (the Unicode standard itself only uses code points up to U+10FFFF). This encoding is called UCS-4, with the 4 indicating that four bytes (32 bits) are used to encode each character.
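The relationship between levels and code widths can be sketched in Python as follows. Python has no codec named UCS-2 or UCS-4, so this example assumes the utf-16-le and utf-32-le codecs as stand-ins: for BMP characters UTF-16 produces the same 16-bit values as UCS-2, and UTF-32 matches UCS-4 for all valid code points. The function names are hypothetical.

```python
def level(ch):
    """Return the Unicode level (plane) of a character; 0 is the BMP."""
    return ord(ch) >> 16


def code_units(ch):
    """Return (number of 16-bit units, number of 32-bit units) for one character."""
    units16 = len(ch.encode("utf-16-le")) // 2
    units32 = len(ch.encode("utf-32-le")) // 4
    return units16, units32


if __name__ == "__main__":
    print(2 ** 16, 2 ** 32)              # sizes of the 16-bit and 32-bit value ranges
    for ch in ["A", "\u4e2d", "\U00013000"]:
        # U+13000 is an Egyptian hieroglyph and lies outside the BMP.
        print(f"U+{ord(ch):04X}", "level", level(ch), "units", code_units(ch))
```

The hieroglyph needs two 16-bit units in UTF-16 (a surrogate pair), which plain UCS-2 cannot express, but a single 32-bit unit.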

UCS-4 encoding thus allows any Unicode character, regardless of its level, to be stored in a single 32-bit data word. Because of its high storage requirements, however, this encoding is generally used only when characters from a level above the BMP are needed.
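The storage trade-off can be illustrated with a short Python sketch. It assumes the utf-32-le codec as a stand-in for UCS-4 and utf-16-le as a stand-in for UCS-2 (valid only for BMP text); the helper names are invented for this example.

```python
def ucs4_words(text):
    """Return one 32-bit integer (the code point) per character."""
    return [ord(ch) for ch in text]


def storage_cost(text):
    """Return (bytes with 16-bit units, bytes with 32-bit units)."""
    return len(text.encode("utf-16-le")), len(text.encode("utf-32-le"))


if __name__ == "__main__":
    text = "Unicode"
    print(ucs4_words(text))
    print(storage_cost(text))
    # Each 32-bit word written out as four bytes equals the UTF-32 encoding.
    raw = b"".join(word.to_bytes(4, "little") for word in ucs4_words(text))
    print(raw == text.encode("utf-32-le"))
```

For text that stays within the BMP, the 32-bit form takes twice the space of the 16-bit form without carrying any extra information.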

In TCE, all strings are treated as Unicode strings.

In TCE, you can generally use first-level Unicode characters.
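If strings are prepared outside TCE, a check along the following lines could flag characters that lie outside the first level before they are used. This is a minimal Python sketch with hypothetical helper names; it does not use any TCE interface and says nothing about how TCE itself validates input.

```python
BMP_MAX = 0xFFFF  # highest code point of the first level (BMP)


def non_bmp_characters(text):
    """Return the characters of text whose code point lies above the BMP."""
    return [ch for ch in text if ord(ch) > BMP_MAX]


def is_bmp_only(text):
    """Return True if every character of text belongs to the first level."""
    return not non_bmp_characters(text)


if __name__ == "__main__":
    for sample in ["Gr\u00fc\u00dfe \u4e2d\u6587", "hieroglyph \U00013000"]:
        offenders = [f"U+{ord(ch):04X}" for ch in non_bmp_characters(sample)]
        print(is_bmp_only(sample), offenders)
```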

If you need more complex characters that are not available in the first level, open Regional and Language Options in the Control Panel and select the Languages tab. The Supplemental language support section contains two check boxes that you can select to enable additional languages.