Mastering UTF-8

Why deal with UTF-8?

Readers who understand German and are interested in more details may refer to two other articles about character encoding in general and Unicode / UTF-8 in particular.

Character encodings that allow extended character sets beyond the ASCII standard are not only required to cover foreign languages (think of German umlauts and French accents, but also of Chinese or Korean characters). Even in an English text they make it possible to use a variety of symbols and thus avoid graphical icons, which are often hard to align inside text. Examples: ☎ ☮ ☯ ♬

What are the problems with UTF-8?

Reader Software/Device Limitations

Many special characters require specific fonts to be available to the reader's browser or other display. With UTF-8 becoming the state-of-the-art character encoding, these limitations simply go away for most of what you may need. For quite a while I have not run into any problems writing or reading web pages, even when they contained graphical symbols and characters of exotic languages like Arabic, Japanese, or Thai. However, there are still limits: at the time of writing, my current browsers do not display “New Tai Lue” or “Old Turkic”. It could be worse: if my target audience were using VT100-compatible devices, they would not even see the letter “ß” in my last name. The take-away from this paragraph should be: do not expect every reader's device to be able to display all UTF-8 characters. Or: know your target displays and use only characters that they all support.

Wrong Encoding

To use UTF-8, the page needs to be encoded in UTF-8. This sounds silly, but often your text editor is a monster and does not really show you what it is doing with your valuable input. The solution: Master your tools! Every good editor allows you to set a specific encoding, and UTF-8 (without BOM) might be the way to go. (If you wonder about the BOM thingy: just read on; we will get to that as well.)

You may want to test whether your file is indeed encoded in UTF-8. For this, get a good hex editor such as XVI32. Your normal text editor might also offer a hex editing mode, but don't trust it! Now, in your standard editor, copy and paste the following characters at the end of your file:
Ä Ç ß
Open the file with the above-mentioned hex editor and scroll to the end. You should see the following byte sequence:
C3 84 20 C3 87 20 C3 9F
If your file does not end like that now, it simply is not UTF-8.
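The same check can also be scripted when no hex editor is at hand. The following is only a minimal Python sketch, assuming the three test characters were pasted at the very end of the file, separated by single spaces; the names `ends_with_test_sequence` and `hex_dump` are made up for this illustration.

```python
# Expected UTF-8 bytes for the test string "Ä Ç ß": C3 84 20 C3 87 20 C3 9F
EXPECTED = "Ä Ç ß".encode("utf-8")


def ends_with_test_sequence(data: bytes) -> bool:
    """Return True if the data ends with the UTF-8 encoded test characters.

    Trailing spaces and line breaks are ignored, since many editors append them.
    """
    return data.rstrip(b"\r\n ").endswith(EXPECTED)


def hex_dump(data: bytes) -> str:
    """Format bytes the way a hex editor shows them, e.g. 'C3 84 20'."""
    return data.hex(" ").upper()
```

To apply it to a real file, read the file in binary mode (`open(path, "rb")`) and pass the result of `read()` to `ends_with_test_sequence`. Binary mode matters here: in text mode Python would decode the bytes first and hide exactly what you want to inspect.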

Unspecified Character Encoding

Computer systems are not very good at guessing the character encoding. If it is not specified, most devices or applications fall back to a default encoding, most often a classic one with a fixed length of 8 bits (1 byte) per character, such as ISO-8859-1 in western Europe. Authors of applications and websites should make sure that the character encoding is communicated to the end user's device or application.

For web designers this means that each page's head section should contain a meta tag like:
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
This is good practice and highly recommended even if the web server already sends an adequate Content-Type header. That may sound redundant, but the file itself is the place where the encoding is best known. You don't want to delegate the responsibility for the character encoding to your server administrator or hosting provider.
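To illustrate what a reader sees when the encoding is not communicated and the device falls back to ISO-8859-1, here is a small Python sketch. The function name `misread_as_latin1` is hypothetical; it merely simulates the wrong guess by decoding UTF-8 bytes with the wrong codec.

```python
def misread_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then decode the bytes as ISO-8859-1 (the wrong guess)."""
    return text.encode("utf-8").decode("iso-8859-1")


# The two bytes of "ü" (C3 BC) are shown as two separate characters.
print(misread_as_latin1("für"))  # prints: fÃ¼r
```

Plain ASCII text survives such a misreading unchanged, which is why the problem often goes unnoticed until the first umlaut or accent shows up.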

Some applications (but not web browsers) extract the encoding information from a Byte Order Mark (BOM). A BOM is composed of two to four bytes at the beginning of each file. Good editing software allows the user to specify whether to add a BOM or not.

Depending on what is done with the data, the BOM can be either the source of a problem (browser display) or the solution (e.g. for some data importers). More about this in the next section.

Byte Order Mark (BOM)

Problem caused because the BOM is missing:
As briefly explained above, some software expects your UTF-8 file to start with a corresponding BOM. Such systems can cause problems if your UTF-8 file does not have a UTF-8 BOM. In that case, configure your editor to add a BOM at the beginning of each file when creating the data.
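If the data is generated by a script rather than typed in an editor, the BOM can be added there as well. A minimal Python sketch, assuming the standard `utf-8-sig` codec is acceptable for your workflow; the function name `with_bom` is made up for this example.

```python
import codecs


def with_bom(text: str) -> bytes:
    """Encode text as UTF-8 with a leading BOM (EF BB BF).

    The "utf-8-sig" codec prepends the BOM when encoding.
    """
    return text.encode("utf-8-sig")


# The result starts with the three BOM bytes, followed by the normal UTF-8 data.
data = with_bom("ß")
print(data.startswith(codecs.BOM_UTF8))
```

When writing directly to a file, passing `encoding="utf-8-sig"` to Python's `open()` in write mode should have the same effect.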

Problem caused because the BOM is present:
Other software is not capable of handling BOMs and may run into problems if it finds one. The most prominent example: web pages that show the following characters in some browsers, usually in the upper left corner:
ï»¿

Whether your UTF-8 file must or must not have a BOM, you may want to check whether it has one, i.e. to verify whether your generator or editor has added a BOM. Do not use sophisticated software for such a check, because it will hide the BOM and might show wrong information, even in hex mode! To read individual bytes I recommend using a very basic hex editor. On Linux you can use vi in hex mode (on: :%!xxd, off: :%!xxd -r) or use Tweak. On Windows my favourite is XVI32.

When opening the file with your hex editor you can see if it contains a BOM at the very beginning. The BOM for UTF-8 is:
0xEF 0xBB 0xBF

In case you see some other "strange characters" in front of your actual data, this might be a different BOM. This can be either because your file is actually not encoded in UTF-8 or because it simply has a wrong BOM. The most common other BOMs are:
FE FF (UTF-16, big-endian)
FF FE (UTF-16, little-endian)
00 00 FE FF (UTF-32, big-endian)
FF FE 00 00 (UTF-32, little-endian)
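Looking up the first bytes of a file against a list of BOMs can also be automated. The sketch below uses the BOM constants from Python's standard `codecs` module; the function name `detect_bom` and the labels it returns are assumptions made for this example.

```python
import codecs

# Longer BOMs come first: the UTF-32 little-endian mark begins with the
# same two bytes (FF FE) as the UTF-16 little-endian mark.
KNOWN_BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32 BE"),
    (codecs.BOM_UTF32_LE, "UTF-32 LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),
]


def detect_bom(data: bytes):
    """Return a label for the BOM at the start of data, or None if there is none."""
    for bom, name in KNOWN_BOMS:
        if data.startswith(bom):
            return name
    return None
```

Only the first four bytes of the file are needed, so reading `open(path, "rb").read(4)` is enough input. A result of `None` only means that no BOM is present; it says nothing about the actual encoding of the file.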

In case your file has a BOM and must not have one, the best way to remove it is to do so in the hex editor. After removing it, just save and close the file and re-open it with your normal editor. Make sure that your editor now continues to encode your input as UTF-8.
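For many files, removing the BOM by hand gets tedious. As an alternative to the hex editor, a small script can do the same job; this is a minimal Python sketch with a hypothetical function name `strip_utf8_bom`, operating on the raw bytes of the file.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF); other data is unchanged."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

Read the file in binary mode, pass its content through the function, and write the result back in binary mode. Keeping a backup copy before overwriting the original is a sensible precaution.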

© Hermann F., 2011
