Mastering UTF-8
Why deal with UTF-8?
Readers who understand German and are interested in more details may refer to two other articles about character encoding in general and Unicode / UTF-8 in particular.
Character encodings that extend the character set beyond the ASCII standard are not only needed for languages other than English (think of German umlauts and French accents, but also of Chinese or Korean characters). Even in English text they let you use a variety of symbols and thus avoid graphical icons, which are often hard to align with the surrounding text. Examples: ☎ ☮ ☯ ♬
What are the problems with UTF-8?
Reader Software/Device Limitations
Many special characters require specific fonts to be available to the reader's browser or other display. With UTF-8 now the state-of-the-art character encoding, these limitations simply go away for most of what you may need. For quite a while I have not run into any problems writing or reading web pages, even when they contained graphical symbols or characters from non-Latin scripts such as Arabic, Japanese, or Thai. There are still limits, however: at the time of writing, my current browsers do not display “New Tai Lue” or “Old Turkic”. It could be worse: if my target audience were using VT100-compatible devices, they would not even see the letter “ß” in my last name. The takeaway from this paragraph: do not expect every reader's device to be able to display every character that UTF-8 can encode. Or: know your target displays and use only characters that they all support.
Wrong Encoding
To use UTF-8, the page needs to be encoded in UTF-8. This sounds silly, but text editors often do not show you what they are actually doing with your valuable input. The solution: master your tools! Every good editor lets you set a specific encoding, and UTF-8 (without BOM) might be the way to go. (If you wonder what a BOM is: just read on; we will get to that as well.)
You may want to test whether your file is indeed encoded in
UTF-8. For this, get a good hex editor, for example
XVI32. Your normal text editor might also
offer a hex editing mode, but don't trust it!
Now, in your standard editor, copy/paste the following
characters to the end of your file:
Ä Ç ß
Open the file with the above-mentioned hex editor and
scroll to the end. You should see the following
byte sequence:
C3 84 20 C3 87 20 C3 9F
If your file does not end like that now, it simply
is not UTF-8.
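The same check can also be scripted instead of done by eye. The following is a minimal Python sketch, assuming Python 3.8 or later; the function name `ends_with_test_characters` is made up for this illustration and is not part of any tool mentioned here.

```python
# The three test characters from the text, encoded as UTF-8.
EXPECTED_TAIL = "Ä Ç ß".encode("utf-8")


def ends_with_test_characters(data: bytes) -> bool:
    """Return True if data ends with the UTF-8 bytes of 'Ä Ç ß'.

    Trailing line breaks are ignored, because many editors append one.
    """
    return data.rstrip(b"\r\n").endswith(EXPECTED_TAIL)


# Show the expected bytes in the same notation a hex editor uses.
print(EXPECTED_TAIL.hex(" ").upper())  # C3 84 20 C3 87 20 C3 9F
```

To test a real file, read it in binary mode (`open(path, "rb").read()`) and pass the bytes to the function, so that Python does not decode anything behind your back.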
Unspecified Character Encoding
Computer systems are not very good at guessing the character encoding. If it is not specified, most devices and applications fall back to a default encoding, most often a legacy one with a fixed length of 8 bits (1 byte) per character, such as ISO-8859-1 (Latin-1) in much of Western Europe. Authors of applications and websites should make sure that the character encoding is communicated to the end user's device or application.
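What a wrong guess looks like can be reproduced in a few lines. This is only an illustrative Python sketch; the sample word is arbitrary.

```python
text = "Müller"
utf8_bytes = text.encode("utf-8")   # "ü" becomes the two bytes C3 BC

# Correct: decode with the encoding that was actually used.
right = utf8_bytes.decode("utf-8")

# Wrong guess: each byte is read as one ISO-8859-1 character.
wrong = utf8_bytes.decode("iso-8859-1")

print(right)  # Müller
print(wrong)  # MÃ¼ller
print(len(text), len(utf8_bytes))  # 6 7
```

The two-character garbage in place of every umlaut is the typical symptom of UTF-8 data being read with a single-byte default encoding.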
For web designers this means that each page's
head section should contain a meta tag like:
<meta http-equiv="Content-Type"
content="text/html;charset=utf-8" />
This is good practice and highly recommended even if the
web server already sends an adequate Content-Type header.
That may sound redundant, but the file itself is the place
where the encoding is best known. You don't want to delegate
the responsibility for the character encoding to your
server administrator or hosting provider.
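If you maintain many pages, you may want to check automatically that each one carries such a declaration. Here is a rough Python sketch using the standard library's `html.parser`; the class `MetaCharsetFinder` and the function `declared_charset` are hypothetical names chosen for this example, and the sketch only looks at the `http-equiv` form of the tag shown above.

```python
from html.parser import HTMLParser
from typing import Optional


class MetaCharsetFinder(HTMLParser):
    """Remember the first charset declared in a Content-Type meta tag."""

    def __init__(self) -> None:
        super().__init__()
        self.charset: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        # Attributes without a value arrive as None; treat them as empty.
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("http-equiv", "").lower() != "content-type":
            return
        # The content value looks like "text/html;charset=utf-8".
        for part in attributes.get("content", "").split(";"):
            key, _, value = part.strip().partition("=")
            if key.strip().lower() == "charset" and value.strip():
                self.charset = value.strip().lower()


def declared_charset(html: str) -> Optional[str]:
    """Return the charset named in the page's meta tag, or None."""
    finder = MetaCharsetFinder()
    finder.feed(html)
    return finder.charset
```

Called on a page whose head contains the tag above, `declared_charset` returns `"utf-8"`; for a page without the tag it returns `None`.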
Some applications (but not web browsers) extract the encoding information from a Byte Order Mark (BOM). A BOM is composed of two to four bytes at the beginning of each file. Good editing software allows the user to specify whether to add a BOM or not.
Depending on what is done with the data, the BOM can be either the source of a problem (browser display) or the solution (e.g. for some data importers). More about this in the next section.
Byte Order Mark (BOM)
Problem caused because the BOM is missing:
As briefly explained above, some software expects your UTF-8
file to start with a corresponding BOM. Such systems can
cause problems if your UTF-8 file does not have a UTF-8 BOM.
In that case you need to configure your editor to
add a BOM at the beginning of each file when creating the data.
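When the data is generated by a script rather than typed in an editor, the BOM has to be written by the script. In Python, for instance, the codec name `utf-8-sig` does this; the sketch below only illustrates the difference and makes no claim about what your importing software expects.

```python
import codecs

text = "Ä Ç ß"

with_bom = text.encode("utf-8-sig")   # UTF-8 with a leading BOM
without_bom = text.encode("utf-8")    # plain UTF-8

# The only difference is the three BOM bytes at the start.
print(with_bom[:3].hex(" ").upper())                  # EF BB BF
print(with_bom == codecs.BOM_UTF8 + without_bom)      # True
```

The same codec name works when writing files, e.g. `open(path, "w", encoding="utf-8-sig")`.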
Problem caused because the BOM is present:
Other software cannot handle BOMs and may run into
problems if it finds one. The most prominent
example: web pages that show the following characters
in some browsers, usually in the upper left corner:

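These three characters are what the UTF-8 BOM bytes look like when they are interpreted with a single-byte encoding instead of being recognized as a BOM. A short Python sketch to reproduce the effect, using Windows-1252 as an assumed example of such an encoding:

```python
import codecs

# Decode the three BOM bytes as if they were ordinary 8-bit text.
garbled = codecs.BOM_UTF8.decode("cp1252")

print(garbled)       # ï»¿
print(len(garbled))  # 3
```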
Whether your UTF-8 file must or must not have a BOM, you may want to check whether it has one, i.e. to verify whether your generator or editor has added a BOM. Do not use sophisticated software for this check, because it may hide the BOM and show misleading information, even in hex mode! To read individual bytes I recommend a very basic hex editor. On Linux you can use vi in hex mode (on: :%!xxd, off: :%!xxd -r) or use Tweak. On Windows my favourite is XVI32.
When opening the file with your hex editor you can see if it
contains a BOM at the very beginning.
The BOM for UTF-8 is:
0xEF 0xBB 0xBF
If you see some other "strange characters" in front
of your actual data, this might be a different BOM. Either
your file is actually not encoded in UTF-8, or it
simply has a wrong BOM. The most common other BOMs are:
FE FF (UTF-16, big-endian)
FF FE (UTF-16, little-endian)
00 00 FE FF (UTF-32, big-endian)
FF FE 00 00 (UTF-32, little-endian)
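If you prefer a script to a hex editor, the comparison against the known BOMs is easy to automate. The following Python sketch relies on the BOM constants in the standard `codecs` module; the function name `detect_bom` and the label strings are invented for this example. Note that it is only a heuristic: a UTF-16 little-endian file whose first character is U+0000 would be reported as UTF-32.

```python
import codecs
from typing import Optional

# Longer BOMs are checked first, because the UTF-32 little-endian BOM
# begins with the same two bytes as the UTF-16 little-endian BOM.
KNOWN_BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32 (big-endian)"),
    (codecs.BOM_UTF32_LE, "UTF-32 (little-endian)"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16 (big-endian)"),
    (codecs.BOM_UTF16_LE, "UTF-16 (little-endian)"),
]


def detect_bom(data: bytes) -> Optional[str]:
    """Return a label for the BOM at the start of data, or None."""
    for bom, label in KNOWN_BOMS:
        if data.startswith(bom):
            return label
    return None
```

Reading the first four bytes of a file in binary mode (`open(path, "rb").read(4)`) is enough input for `detect_bom`.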
If your file has a BOM and must not have one, the best way to remove it is to do so in the hex editor. After removing it, save and close the file and re-open it with your normal editor. Make sure that your editor now continues to encode your input as UTF-8.
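For more than a handful of files, removing the BOM by hand gets tedious. As an alternative to the hex editor, here is a small Python sketch that strips a leading UTF-8 BOM from raw bytes; `strip_utf8_bom` is a hypothetical helper name, and the sketch deliberately leaves other BOMs untouched.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM; other input is unchanged."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


print(strip_utf8_bom(b"\xef\xbb\xbf<html>"))  # b'<html>'
print(strip_utf8_bom(b"<html>"))              # b'<html>'
```

To apply it to a file, read the file with `open(path, "rb")`, pass the bytes through the function, and write the result back with `open(path, "wb")`. Working in binary mode ensures that nothing else in the file is re-encoded.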