Section 04 | Unicode Standard
The Unicode Standard provides a unique number for each character, no matter the platform, program, or language. It is the universal character encoding standard used to represent text for computer processing.
Unicode provides a consistent way to encode multilingual plain text, making it easier to exchange text files internationally, as it defines codes for the characters of major languages. This includes punctuation marks, diacritics, mathematical and technical symbols, arrows, dingbats, etc.
Before Unicode was invented, there were hundreds of different encoding systems. The European Union alone required several different systems, while even a single language like English required more than one for all its letters, punctuation, and technical symbols.
Unlike older systems, Unicode allows multiple writing systems to coexist in one data file. Systems that recognize Unicode can consistently read and process data from different languages.
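As a rough illustration of this point, the Python sketch below stores one mixed-script string as UTF-8 and then tries the same string against three legacy Windows code pages. The sample string and the helper name `can_encode` are assumptions made for this example only.

```python
# One string mixing Latin, Greek, Cyrillic, Han/Kana, and Arabic text.
mixed = "English, Ελληνικά, Русский, 日本語, العربية"

# A Unicode encoding such as UTF-8 keeps every script in the same byte stream.
utf8_bytes = mixed.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")


def can_encode(text: str, codec: str) -> bool:
    """Return True if every character in text exists in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Legacy single-script code pages cannot hold the whole string.
legacy_results = {
    codec: can_encode(mixed, codec)
    for codec in ("cp1252", "cp1251", "cp1253")  # Western, Cyrillic, Greek
}

print(round_trip == mixed)  # True
print(legacy_results)       # {'cp1252': False, 'cp1251': False, 'cp1253': False}
```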
Unicode was originally designed as a 16-bit encoding that provides code points for more than 65,000 characters. To keep character coding simple and efficient, it assigns each character a unique value and does not use complex modes or escape codes. While 65,000 characters are sufficient for the thousands of characters used in the major languages of the world, the Unicode Standard and ISO/IEC 10646 provide an extension mechanism in UTF-16, known as surrogate pairs, that allows for encoding about a million more characters without the use of escape codes. The full code space runs from U+0000 to U+10FFFF (1,114,112 code points). This is sufficient for all known character-encoding requirements, including full coverage of all historic scripts of the world.
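The UTF-16 extension mechanism can be sketched in a few lines of Python. Characters up to U+FFFF fit in one 16-bit unit, while characters above that are split into two 16-bit units (a surrogate pair). The function name `utf16_code_units` is hypothetical, and the sketch assumes the input is a single valid character.

```python
def utf16_code_units(ch: str) -> list[int]:
    """Return the 16-bit code units that UTF-16 uses for one character."""
    cp = ord(ch)
    if cp < 0x10000:
        return [cp]                   # fits in a single 16-bit unit
    offset = cp - 0x10000             # 20-bit value for the upper planes
    high = 0xD800 + (offset >> 10)    # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits -> low surrogate
    return [high, low]


print([hex(u) for u in utf16_code_units("\u0419")])      # ['0x419']
print([hex(u) for u in utf16_code_units("\U00010348")])  # ['0xd800', '0xdf48']

# Cross-check against Python's built-in UTF-16 codec (big-endian, no BOM).
print("\U00010348".encode("utf-16-be").hex())            # d800df48
```

Ten bits in each half of the pair give 1,024 × 1,024 = 1,048,576 additional code points, which is where the figure of about a million more characters comes from.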
It should be noted that Unicode encodes scripts, rather than individual languages. When the writing systems of more than one language share sets of graphical symbols that have historically related derivations, the union of all those graphical symbols is treated as a single collection of characters for encoding and is identified as a single script. Many scripts (especially Latin) are used to write many languages.
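To make the distinction between script and language concrete, the following Python sketch guesses a script from the first word of each character's official Unicode name. This is only a heuristic chosen for the example (Python's standard library does not expose the Unicode Script property, and Han ideographs, for instance, have names beginning with "CJK"). The helper `script_hint` and the sample words are assumptions.

```python
import unicodedata


def script_hint(ch: str) -> str:
    """Rough script guess: the first word of the character's Unicode name."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]


# Sample words in four languages (illustrative only).
words = {
    "English": "cafe",
    "French": "garçon",
    "German": "Straße",
    "Russian": "улица",
}

scripts = {lang: {script_hint(c) for c in word} for lang, word in words.items()}
print(scripts)
# {'English': {'LATIN'}, 'French': {'LATIN'}, 'German': {'LATIN'}, 'Russian': {'CYRILLIC'}}
```

Three different languages resolve to the same Latin script, while Russian resolves to Cyrillic.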
Unicode covers all the languages that can be written in the following scripts: Latin, Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac, Thaana, Devanagari, Bengali, Gurmukhi, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Thai, Lao, Tibetan, Myanmar, Georgian, Hangul, Ethiopic, Cherokee, Canadian Aboriginal Syllabics, Khmer, Mongolian, Han (Japanese, Chinese, Korean ideographs), Hiragana, Katakana, Yi, and many more. Please see Appendix 03 for the complete list of supported scripts.
Depending on the level of Unicode support in the browser used and whether the necessary fonts are installed, you may have display problems for some of the translations, particularly with complex scripts such as Arabic.
The encoding standard that is saved with a text file provides the information that the computer needs to display the text on the screen. For example, in the Cyrillic (Microsoft Windows) encoding script, the character Й has the numeric value 201. When a file that contains this character is opened on a computer that uses the Cyrillic (Windows) encoding script, the computer reads the 201 numeric value and displays Й on the screen. However, if the same file is opened on a computer that uses a different encoding script, the computer displays whatever character corresponds to the 201 numeric value in its default encoding standard. For example, if the computer uses the Western European (Windows) encoding script, the character in the original Cyrillic-based file will be displayed as É instead.
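The Й/É example can be reproduced with a short Python sketch. It assumes that Python's `cp1251` and `cp1252` codecs stand in for the Cyrillic (Windows) and Western European (Windows) encodings described above.

```python
raw = bytes([201])  # the single byte value 201 (0xC9) as stored in the file

as_cyrillic = raw.decode("cp1251")  # read as Cyrillic (Windows)
as_western = raw.decode("cp1252")   # read as Western European (Windows)
print(as_cyrillic, as_western)      # Й É

# In Unicode the two characters have different code points,
# so neither can be mistaken for the other.
print(f"U+{ord(as_cyrillic):04X}", f"U+{ord(as_western):04X}")  # U+0419 U+00C9
print(as_cyrillic.encode("utf-8"), as_western.encode("utf-8"))  # b'\xd0\x99' b'\xc3\x89'
```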
In this section, GlobalVision describes best practices relating to the Unicode Standard.
4.1 Install all necessary encoding scripts on your computer
Text appears garbled, with erroneous characters, or as Wingdings-style symbols (question marks, boxes, stars, etc.), for example: £®Ð£¸CPa ¡£Õ.
Alphabetic writing systems vary from one language to another. As a result, computers need to use different encoding scripts. The character that is displayed depends on the encoding script installed in your operating system. NOTE: Some display languages/scripts are installed by default, while others require you to install additional language files.
✓ Install all required encoding systems (scripts) on your computer.
✗ Open files containing scripts that are not installed on your computer.
The following steps describe how to install or activate the necessary encoding scripts (if applicable).
- Retrieve the Microsoft Windows install CD.
- Install all the appropriate scripts/languages.
- Go to the Control Panel and click on the Keyboards and Languages tab.
- Choose a language under Display Language.
- Click OK.
4.2 Validate your fonts for all languages used
Screen-reader software for the blind and other programs misinterpret the content/text in PDF files.
- Fonts are sold with different encoding script packages.
- The ability to type in a specific font does not guarantee that the font contains the necessary encoding scripts for the language being used.
- Software programs cannot read characters if the font does not have all the necessary encoding scripts (e.g. mg = £®).
✓ Create an asset inventory.
✓ Purchase the “Pro” version of a font.
✓ Check the supporting scripts when purchasing a font.
✓ Purchase all the required scripts.
✓ Check the scripts that come with a font every time you switch languages and fonts.
✓ Standardize on fonts used for each language.
✗ Use fonts that support only a limited number of scripts.
The following steps detail how to validate a font with Apple Font Book:
- Open Font Book.
- Select the font in question from the font list.
- Click on File and select Validate Font.
- Check the validation results.
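Font validation can be complemented by a coverage check: comparing the characters in a piece of text against the code points a font actually contains. The Python sketch below is a minimal illustration under stated assumptions. The function `missing_characters` is hypothetical, and the set `latin_only` is a made-up stand-in for the coverage of a real font, which would normally be read from the font file with a font-inspection tool.

```python
import unicodedata


def missing_characters(text: str, supported: set[int]) -> list[str]:
    """List the characters in text whose code points the font does not cover."""
    missing = []
    for ch in sorted(set(text)):
        if ch.isspace():
            continue  # whitespace is not checked in this sketch
        if ord(ch) not in supported:
            missing.append(f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}")
    return missing


# Hypothetical coverage: Basic Latin plus Latin-1 Supplement only.
latin_only = set(range(0x20, 0x7F)) | set(range(0xA0, 0x100))

print(missing_characters("50 mg", latin_only))
# []
print(missing_characters("50 \u043c\u0433", latin_only))  # "мг", Cyrillic for "mg"
# ['U+0433 CYRILLIC SMALL LETTER GHE', 'U+043C CYRILLIC SMALL LETTER EM']
```

An empty list means every character in the text is covered. Any entries point to scripts that the font would still need to support.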
4.3 Use fonts that support Unicode
Character corruption may occur when transferring text to a PDF file.
Not all fonts support Unicode encoding.
✓ Use fonts that are Unicode-based.
✓ Standardize on fonts throughout the supply chain.
✗ Use bitmap (screen) fonts.
The following steps detail how to identify the Unicode value of a character:
Microsoft Word:
- Select the character in question.
- Press ALT+X to display its Unicode value.
Adobe Illustrator and InDesign:
- Go to the Type menu and select Glyphs.
- Click on the character in question to display its Unicode value.
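The same lookup can be scripted when the text is available outside a design application. The Python sketch below converts a character to its Unicode value and back. The helper names `to_code` and `from_code` are assumptions made for this example.

```python
def to_code(ch: str) -> str:
    """Format a single character's code point in the usual U+XXXX notation."""
    return f"U+{ord(ch):04X}"


def from_code(code: str) -> str:
    """Turn a U+XXXX string back into the character it names."""
    return chr(int(code.upper().replace("U+", ""), 16))


# Three capital letters that look alike in many fonts:
# Latin A, Greek Alpha, Cyrillic A.
lookalikes = ["A", "\u0391", "\u0410"]
codes = [to_code(ch) for ch in lookalikes]

print(codes)                # ['U+0041', 'U+0391', 'U+0410']
print(from_code("U+0419"))  # Й
```

Checking the Unicode value is one way to tell such look-alike characters apart when they cannot be distinguished on screen.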