☯ Mastering UTF-8

Prerequisites

If you are facing problems with displaying textual data that is (supposed to be) encoded in UTF-8, then this page might help you. You do not need specific technical knowledge to read this.

If you want more theory and can read German, you will find the gory details in my German tutorial about character encoding & UTF‑8.

Why deal with Unicode and UTF-8?

Unicode is a Character Set

A character set is a list of characters like digits, letters, or symbols. Unicode is a character set.

The Unicode character set includes a six-digit number of characters. Using a character encoding that is based on Unicode allows you to write and cite in any language you can imagine, and gives you access to many useful non-alphanumeric symbols. Even when writing in English the extended character set makes sense: currency symbols look quite professional (£, ¥, €); it is good style to write people's names (Lech Wałęsa, Søren Kierkegaard), cities (Haßfurt) or brands (Citroën) correctly; there is elegance in «quote» symbols; and you may want to show useful symbols without messing around with images. The next line is just text!
© ¾ ☎ ☮ ☯ ♬

As a very complete character set, Unicode is without competition, so you can't ignore it when dealing with textual data.

A character from the Unicode character set is also called a Code Point. Code Points are numbered. The number of a character is most often shown as a hexadecimal number (although you might as well write it as a decimal number or in any other number format). In case you have a string and would like to see which Code Points it contains (i.e. translate the characters to their Unicode numbers): "Text Inspector", my homegrown little tool for inspecting Unicode strings, might come in handy.
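If you prefer to look up Code Points yourself, a few lines of Python will do. This is only a minimal sketch; the function name code_points is my own choice for illustration and has nothing to do with the tool mentioned above.

```python
def code_points(text):
    """Return one (character, decimal number, 'U+XXXX' notation) tuple per character."""
    return [(ch, ord(ch), f"U+{ord(ch):04X}") for ch in text]

# Show the Code Point of each character, in decimal and in hexadecimal
for ch, decimal, hexadecimal in code_points("@ß€"):
    print(ch, decimal, hexadecimal)
# @ 64 U+0040
# ß 223 U+00DF
# € 8364 U+20AC
```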

UTF-8 is an Encoding

Encodings are needed because computers understand only bits and bytes. (Read it as: Computers can only deal with numbers between 0 and 255.) An Encoding translates the characters of its underlying Character Set to the corresponding numbers; one number, or one sequence of numbers, for each character. That is the purpose of any Encoding.

UTF-8 is an Encoding for Unicode characters. It translates each Unicode character into a unique sequence of 1 to 4 numbers, each number being 1 byte long, which just means that each number has a value in the range from 0 to 255.

Example 1:
In UTF-8, the @ sign (Unicode character number 64) is represented by the number 64.
Example 2:
The Yin-Yang sign ☯ (Unicode character number 9775) is represented by a sequence of 3 numbers, 226-152-175.
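The two examples can be reproduced in Python, whose built-in str.encode method returns the UTF-8 bytes of a string. The helper name utf8_numbers is just an illustrative choice.

```python
def utf8_numbers(text):
    """Return the UTF-8 encoding of text as a list of numbers between 0 and 255."""
    return list(text.encode("utf-8"))

print(utf8_numbers("@"))  # [64]
print(utf8_numbers("☯"))  # [226, 152, 175]
```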

There are alternatives to UTF-8, but UTF-8 is the most elegant for the Unicode character set. It combines significant advantages over its predecessors and alternatives (UCS-2, UTF-16, UTF-32):

UTF-8 is compatible with older standards, e.g. with 7-bit ASCII. That means legacy ASCII text files do not change when converted to UTF-8.
It leads to smaller file sizes compared to other encodings that use multiple bytes for each and every character per se.
UTF-8 has become the de-facto standard in the computer industry and on the Web.

All that makes UTF-8 the Unicode encoding of choice.
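The first two advantages are easy to check. The following sketch counts the bytes a text needs in three Unicode encodings and confirms that plain ASCII text is byte-for-byte identical in UTF-8. The function name encoded_sizes and the sample string are my own assumptions for illustration.

```python
def encoded_sizes(text):
    """Return the number of bytes that text occupies in several Unicode encodings."""
    # The "-le" variants are used so that Python does not prepend a BOM,
    # which would add 2 or 4 extra bytes to the count.
    return {name: len(text.encode(name)) for name in ("utf-8", "utf-16-le", "utf-32-le")}

sample = "Mastering UTF-8"
print(encoded_sizes(sample))
# {'utf-8': 15, 'utf-16-le': 30, 'utf-32-le': 60}

# Pure ASCII text does not change when it is stored as UTF-8
print(sample.encode("ascii") == sample.encode("utf-8"))  # True
```

For text that consists mostly of non-Latin characters the picture is less clear-cut, because those characters take 2 to 4 bytes each in UTF-8.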

In case you want to see how (i.e. by which numbers) the characters in your string are represented in UTF-8: Again, my tool for inspecting Unicode strings shows, besides the Code Points mentioned in the previous section, also the UTF-8 value for each character in your string.

What might cause problems with UTF-8

Reader Software/Device Limitations

Even if a device is able to process UTF-8 encoded files, some special letters or symbols may still require specific fonts to be available (or installable) on the client (e.g. the web browser). With UTF-8 having become the state-of-the-art character encoding, these limitations simply go away for most of what you may need. For quite a while I have not run into any problems with writing or reading web pages, even when they contained graphical symbols and characters of exotic languages like Arabic, Japanese, or Thai. Note, however, that there are still limits: at the time of writing, my contemporary browsers do not display “New Tai Lue” or “Old Turkic”. It could be worse: if my target audience were using VT100-compatible devices, they would not even see the letter “ß” in my last name. So we must not expect all readers' devices or applications to be able to display every UTF-8 encoded character.

Rule #1:
Know your audience's display tools and use only characters that they all support!

Wrong Encoding

To use UTF-8, the page needs to be encoded in UTF-8. This sounds silly, but often your text editor is a monster and does not really show you what it is doing with your valuable input. The solution: Master your tools! Every good editor allows you to set a specific encoding, and UTF-8 (without BOM) might be the way to go. (If you wonder about the BOM thingy: Just read on; we will get to that as well.)

You may want to test if your file is indeed encoded in UTF-8. For that you need to look at the file in hex mode. If Vi or Vim is available (Unix/Linux/MacOS command line) you can switch it to hex mode (hex mode on: :%!xxd, hex mode off: :%!xxd -r). On Windows an old but excellent tool is the hex editor XVI32. Your normal text editor might also offer a hex editing mode, but don't trust it! Now, in your standard editor, copy/paste the following characters to the end of your file:
Ä Ç ß
Open the file with the above mentioned hex editor and scroll to the end. You should see the following sequence:
C3 84 20 C3 87 20 C3 9F
If your file does not end like that now, it is not UTF-8.
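If no hex editor is at hand, the same test can be sketched in Python, which reads the raw bytes when a file is opened in binary mode. Both function names are my own for illustration. Note that the second check ignores trailing whitespace, since many editors silently append a line break to the end of a file.

```python
def is_valid_utf8(data):
    """Return True if the given bytes can be decoded as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def ends_with_test_characters(data):
    """Return True if the bytes end with the UTF-8 encoding of 'Ä Ç ß'."""
    expected = bytes.fromhex("C3 84 20 C3 87 20 C3 9F")
    # Trailing whitespace (such as a final line break) is ignored
    return data.rstrip().endswith(expected)

# Usage with a file (the file name is hypothetical):
#     with open("page.html", "rb") as f:
#         data = f.read()
#     print(is_valid_utf8(data), ends_with_test_characters(data))
```

Keep in mind that a file containing nothing but ASCII characters passes the first check no matter which encoding your editor believes it is using; that is why the test characters are needed.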

Side note: The above sequence of numbers, i.e. the encoding, is hexadecimal. Hexadecimal numbers are often shown with the prefix "0x", which translates to "here comes a hexadecimal number". Here we do not add such a prefix as that would be redundant, if not confusing.

Rule #2:
Make sure your file is indeed a UTF-8 encoded file!

Unspecified Character Encoding

We should tell the software (reader, browser, device firmware, ...) what encoding to expect. Systems may by default or by configuration assume an encoding different from UTF-8. For example in Central Europe systems are often configured to assume ISO-8859-1 or similar encodings. Authors of applications and websites should therefore make sure that the character encoding is communicated to the end user's device or application.
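What a wrong assumption looks like to the reader can be simulated in a few lines of Python: the text is encoded as UTF-8 and then decoded as if it were ISO-8859-1. The function name misread_as is an illustrative choice of mine.

```python
def misread_as(text, wrong_encoding="iso-8859-1"):
    """Encode text as UTF-8, then decode the bytes using a different encoding."""
    return text.encode("utf-8").decode(wrong_encoding)

print(misread_as("Citroën"))  # CitroÃ«n
```

The letter ë takes two bytes in UTF-8, and a system that assumes ISO-8859-1 displays each of those bytes as a character of its own.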

In HTML5 files, browsers should assume UTF-8 encoding as the default, but we never know what the reader has configured. Therefore it is recommended to include the following as the first tag inside the head section of the HTML5 source:
<meta charset="utf-8" />

In older HTML standards a meta tag like this does the trick:
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />

Communicating the encoding is highly recommended, even if the web server already sends an adequate Content-Type header. Our content should survive the move to a different web server or the creativity of the admin from hell.

Some applications (but not web browsers) extract the encoding information from a Byte Order Mark (BOM). A BOM is composed of two to four bytes at the beginning of each file. Good editing software allows the user to specify whether to add a BOM or not.

Depending on what is done with the data the BOM can be the source of the problem (browser display) or it can be the solution (e.g. for some data importers). More about this in the next section.

Rule #3:
In the head section of web pages always specify the content type including the encoding!

Byte Order Mark (BOM)

Problem caused because the BOM is missing:
As briefly explained above, some software expects your UTF-8 file to start with a corresponding BOM. Such systems can cause problems if your UTF-8 file does not have a UTF-8 BOM. In that case you need to configure your editor to add a BOM at the beginning of each file when creating the data.

Problem caused because the BOM is present:
Other software is not capable of handling BOMs and may cause a problem if it finds one. The most prominent example: Web pages that, in some browsers, show the following characters, usually in the upper left corner:
ï»¿
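Where these three characters come from can be shown in Python: they are simply the three bytes of the UTF-8 BOM, displayed by software that assumes ISO-8859-1 and does not know about BOMs. This is a minimal sketch; the page content is made up for illustration.

```python
import codecs

# A tiny page that starts with the UTF-8 BOM (EF BB BF)
page = codecs.BOM_UTF8 + "<p>Hello</p>".encode("utf-8")

# Software that assumes ISO-8859-1 shows the BOM as three visible characters
print(page.decode("iso-8859-1"))  # ï»¿<p>Hello</p>

# Python's "utf-8-sig" codec recognises the BOM and drops it while decoding
print(page.decode("utf-8-sig"))   # <p>Hello</p>
```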

Whether your UTF-8 file must or must not have a BOM, you may want to check if it has one, i.e. to verify whether your generator or editor has added a BOM or not. Do not use sophisticated software for such a check, because it may hide the BOM and show wrong information, even in hex mode! Reading individual bytes takes a low-level editor as mentioned above. On Unix/Linux/MacOS you can use vi / Vim in hex mode (on: :%!xxd, off: :%!xxd -r). On Windows my favourite is good old XVI32.

When opening the file with your hex editor you can see if it contains a BOM at the very beginning. The BOM for UTF-8 is:
EF BB BF

In case you see some other "strange characters" in front of your actual data, this might be a different BOM. Either your file is actually not encoded in UTF-8, or it simply has the wrong BOM. The BOMs of other popular encodings look like the following:
FE FF (UTF-16, big-endian)
FF FE (UTF-16, little-endian)
00 00 FE FF (UTF-32, big-endian)
FF FE 00 00 (UTF-32, little-endian)
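Checking the first bytes of a file for a BOM can also be sketched in Python. The BOM constants come from Python's standard codecs module; the function name detect_bom and the returned labels are my own choices for illustration.

```python
import codecs

# The 4-byte BOMs are tested first, because the little-endian UTF-32 BOM
# starts with the same two bytes as the little-endian UTF-16 BOM.
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32 (big-endian)"),
    (codecs.BOM_UTF32_LE, "UTF-32 (little-endian)"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16 (big-endian)"),
    (codecs.BOM_UTF16_LE, "UTF-16 (little-endian)"),
]

def detect_bom(data):
    """Return the name of the encoding whose BOM starts data, or None if there is no BOM."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

print(detect_bom(bytes.fromhex("EF BB BF") + b"<html>"))  # UTF-8
print(detect_bom(b"<html>"))                              # None
```

Only the first four bytes matter, so it is enough to read those from the file in binary mode.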

In case your file has a BOM and must not have one, the best way to remove it is to do so in the hex editor. After removing it, just save and close the file and re-open it with your normal editor. Make sure that your editor now continues to encode your input as UTF-8.
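If you have many files to clean up, removing the BOM by script may be more practical than a hex editor. The following is a minimal sketch in Python; the function name strip_utf8_bom is my own, and it only touches the three bytes of a UTF-8 BOM.

```python
import codecs

def strip_utf8_bom(data):
    """Return data without a leading UTF-8 BOM; all other bytes stay untouched."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

print(strip_utf8_bom(bytes.fromhex("EF BB BF") + b"<html>"))  # b'<html>'
print(strip_utf8_bom(b"<html>"))                              # b'<html>'
```

To apply it to a file, read the file in binary mode, pass the bytes through the function, and write the result back in binary mode. Work on a copy first.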

Rule #4:
Make sure the correct BOM is present when it is required!
Rule #5:
When dealing with web pages make sure they do not contain a BOM!