An In-depth Understanding of Character Encoding

April 18, 2011 in Development by Kayla Knight

Character encoding is something many new web developers don’t care about, or even understand. Declaring the proper character encoding seems like something we should do, but since we don’t see an immediate effect, we don’t always think it’s important. So what is character encoding anyway? Why should we as web developers care?

Well, it is more important than one may think. In this article we’ll go over exactly what character encoding is, why we should care about it and always use the correct form of encoding, and also how to determine and declare character encoding within a webpage.


What is It?

At a technical level, character encoding is the way characters — letters, numbers, symbols — are represented in numerical values that a computer can understand. When we save an HTML document, or a web document of another type that would have to render some HTML code, it is saved with a certain character encoding. As web developers, our job is to declare what character encoding is being used so the browser knows, and can interpret it as such.
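As a rough illustration of "characters represented in numerical values", the short Python sketch below prints the numbers behind two characters. The sample string and variable names are made up for this example; the point is only that the same character can become different bytes depending on the encoding used to save it.

```python
# Illustrative only: how characters map to numbers under different encodings.
text = "A©"

# Unicode code points: the abstract number assigned to each character.
code_points = [ord(ch) for ch in text]

# The actual bytes written to disk depend on the encoding chosen when saving.
latin1_bytes = list(text.encode("iso-8859-1"))
utf8_bytes = list(text.encode("utf-8"))

print(code_points)   # [65, 169]
print(latin1_bytes)  # [65, 169]
print(utf8_bytes)    # [65, 194, 169]
```

The copyright sign is a single byte in ISO-8859-1 but two bytes in UTF-8, which is why the browser has to be told which encoding was used.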

You may have experienced the effects of character encoding issues when using special characters. Ever copied and pasted a copyright sign (©), not realizing you forgot to replace it with its HTML entity equivalent, only to get an ugly block or question mark where it’s supposed to be?


This is because the character encoding chosen or assumed for that webpage does not cover special characters, such as the copyright sign. Don’t worry, this is easy to fix, and replacing the few characters needed with HTML character entities, rather than switching the character encoding, is the more efficient way of doing things. Because the character encoding chosen doesn’t recognize the special character, we spell it out using characters it does understand: c, o, p, and y. We place that keyword (or the character’s number) between an ampersand and a semicolon, as in &amp;copy; written as "&copy;" or "&#169;" in the source, and the browser will understand that it needs to interpret it as one single character or symbol.
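The ampersand-and-semicolon form described above can be tried out with Python's standard `html` module. This is a minimal sketch; the helper name `to_ascii_with_entities` is hypothetical and only exists for this example.

```python
import html

# Named and numeric entities both resolve to the same single character.
named = html.unescape("&copy;")
numeric = html.unescape("&#169;")
print(named == numeric == "©")  # True


def to_ascii_with_entities(text):
    """Replace every non-ASCII character with a numeric character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(to_ascii_with_entities("Copyright © 2011"))  # Copyright &#169; 2011
```

The result contains only plain ASCII characters, so it should survive being saved and served in almost any encoding.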

Why Should I Care?

If the character encoding declared by the web developer is not the same as the actual character encoding of the HTML page, then many browsers (both new and old) may render the webpage and the characters within it incorrectly. Even if everything looks OK in your main testing browser, another browser may interpret an improperly encoded page in an entirely different way. Plus, older versions of those same browsers may render it differently again.

Search engines are likely to get confused too. Everything that reads a webpage, whether it be a browser or a search engine spider, needs to use the “computer code,” or the numerical values a webpage translates down to, in order to understand it. If we declare an incorrect character encoding, a search engine will try to read the webpage in the declared character encoding, rather than in the encoding it was actually saved in. It then cannot read everything accurately, or even at all.
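To make the mismatch concrete, here is a small Python sketch of what a reader (browser or spider) ends up with when the declared encoding differs from the saved one. The sample text is invented for illustration.

```python
original = "© 2011 Café"

# The file is actually saved as UTF-8...
saved_bytes = original.encode("utf-8")

# ...but the page declares ISO-8859-1, so the reader decodes with that instead.
misread = saved_bytes.decode("iso-8859-1")
print(misread)  # Â© 2011 CafÃ©

# The opposite mismatch: saved as ISO-8859-1, declared as UTF-8.
latin1_bytes = "©".encode("iso-8859-1")
replaced = latin1_bytes.decode("utf-8", errors="replace")
print(replaced == "\ufffd")  # True: the "question mark" replacement character

# Decoding with the encoding that was really used recovers the text.
print(saved_bytes.decode("utf-8") == original)  # True
```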

Different Languages

Many browsers have a pretty easy time interpreting Roman letters and numbers, and we can use some of the same character encodings across many websites without much thought. However, not all languages use Roman letters. Many languages in the Eastern world have entirely different character sets, and even many languages that primarily use Roman letters have several special characters that they use on a regular basis, such as accented vowels, tildes, etc.


When writing a webpage in a language that uses many special characters, it would not be practical to write out the entity versions of those characters all throughout the page. This is also true for a primarily English website that needs to be translated into one or more languages.

Considering the Right Character Encoding

So now we see why it’s important to consider the character encoding of a webpage, and also the importance of choosing the right one. How does one choose the correct character encoding though? What are the options? What affects the choice, and what questions should be answered in order to make the decision?

What characters will be used? Rather, what characters will be used often? Never change the character encoding for just a few characters throughout the page, such as special symbols, spaces, or em dashes. Make this decision primarily on the language that needs to be used, and any special characters based on that. The choice will reflect what characters will be needed on a regular basis, throughout the entire site, perhaps many times per sentence.
Which character encodings are supported by many browsers? Some browsers cannot read certain encodings, either because they are older or because they lack the add-ons or settings required to view those encodings. For example, some users may not have support enabled in their browser for encodings that cover non-Roman characters, because they never need to read those languages anyway. If a webpage used such a character encoding, and a wide portion of its audience did not have that support in their browser, there would be issues with reading the page. As a best practice, use common encodings that are supported widely across browsers.
What are the limitations of my webpage editor? Can it save correctly in the character encoding I’ve specified? There’s not a whole lot to worry about here, as most editors can handle a wide range of character encodings, but be wary of this issue, and try to stick to the more common character encodings.
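One way to answer the "what characters will be used?" question is to test sample content against a candidate encoding. The Python sketch below does that; the function name `unsupported_characters` and the sample sentence are assumptions made for this example, not part of any standard tool.

```python
def unsupported_characters(text, encoding):
    """Return the distinct characters in `text` that `encoding` cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            # Record each problem character once, in order of first appearance.
            if ch not in missing:
                missing.append(ch)
    return missing


sample = "Prix : 10 € – “déjà vu”"
print(unsupported_characters(sample, "iso-8859-1"))  # ['€', '–', '“', '”']
print(unsupported_characters(sample, "utf-8"))       # []
```

If only one or two characters come back, entities may be enough; a long list suggests the encoding itself is the wrong fit for the content.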

It’s true that most of the worry over character encoding comes with non-English websites. Yet we may also want to change the character encoding for typographical reasons. Curly quotes, proper apostrophes, and other correct typographical elements may be wanted, primarily for design purposes. Likewise, if certain math symbols need to be used on an English website, that would be another reason to use an encoding that covers more characters.

How To?

So we know what to consider when choosing a character encoding, but how does one implement this choice into a page, and which code is which? Adding a character encoding to a webpage is as easy as including a meta tag for it:

<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />

The character encoding is defined by the charset parameter inside the content attribute. The character encoding for the example above is “ISO-8859-1”.
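To check which encoding a page actually declares, the meta tag can be read back out with Python's standard `html.parser` module. This is only a sketch: the class name `MetaCharsetFinder` is hypothetical, and it ignores real-world details such as HTTP headers taking precedence over the meta tag.

```python
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Collects the first charset declared in a <meta> tag, if any."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        # Only the http-equiv="Content-Type" form shown above is handled here.
        if (attributes.get("http-equiv") or "").lower() != "content-type":
            return
        content = attributes.get("content") or ""
        for part in content.split(";"):
            name, _, value = part.strip().partition("=")
            if name.lower() == "charset":
                self.charset = value.strip()


page = '<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />'
finder = MetaCharsetFinder()
finder.feed(page)
print(finder.charset)  # ISO-8859-1
```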

One can also declare character encoding in an XML/XHTML document like so:

<?xml version="1.0" encoding="utf-8" ?>

Or, with access to the server’s configuration, it can be sent as a site-wide HTTP header:

Content-Type: text/html; charset=utf-8
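How the header gets set depends on the server in use. As one hedged illustration, the Python sketch below is a tiny WSGI application that sends the header itself; the names `app` and `serve`, the page body, and the port are all assumptions made for this example.

```python
from wsgiref.simple_server import make_server


def app(environ, start_response):
    """Tiny WSGI app that declares its encoding in the HTTP header."""
    # Encode the body with the same encoding the header announces.
    body = "<p>Copyright © 2011</p>".encode("utf-8")
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


def serve(port=8000):
    """Run the app locally until interrupted (hypothetical helper)."""
    with make_server("localhost", port, app) as server:
        server.serve_forever()
```

Calling `serve()` and opening `http://localhost:8000/` should show the copyright sign correctly, because the declared encoding and the bytes actually sent agree.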

Source: W3C: Character encodings

There are several character encodings one can use, and there’s a nice list of all the common HTML encodings here: HTML/XHTML Character Encodings. That page features a table of character encoding names, along with what each one supports and is best used for.

Conclusion

Character encoding is definitely important, even if we don’t see the immediate effects of an incorrectly encoded page directly. Using proper encoding is all about usability, both for the publishers of the website and for the viewers of its content. Fortunately, in most cases, finding the proper encoding can be easy, but when working with multilingual sites, or extensive typography, it’s a must to learn how to use it correctly.

Just take a look over some of the common character encodings, and know which is best in each situation. It’s truly not a hard concept to grasp, and the basis of it is quite easy if a bit of time is taken initially to understand it. By understanding why proper character encoding is important, we can make each webpage more user-friendly and easier to handle.