An In-depth Understanding of Character Encoding

Character encoding is not something many new web developers care about, or even understand. We know that declaring the proper character encoding is something we should do, but because we don’t see an immediate effect, we don’t always treat it as important. So what is character encoding anyway? Why should we as web developers care?

Well, it is more important than one may think. In this article we’ll go over exactly what character encoding is, why we should care about it and always use the correct form of encoding, and also how to determine and declare character encoding within a webpage.

What is It?

At a technical level, character encoding is the way characters — letters, numbers, symbols — are represented in numerical values that a computer can understand. When we save an HTML document, or a web document of another type that would have to render some HTML code, it is saved with a certain character encoding. As web developers, our job is to declare what character encoding is being used so the browser knows, and can interpret it as such.
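To make the idea of "numerical values" concrete, here is a minimal Python sketch. The helper name byte_values is hypothetical and only for illustration. It shows that the same character can be stored as different numbers depending on the encoding chosen:

```python
def byte_values(text, encoding):
    """Return the numeric byte values used to store `text` in `encoding`."""
    return list(text.encode(encoding))

# A plain Roman letter is stored as the same single number in all three encodings.
print(byte_values("A", "ascii"))        # [65]
print(byte_values("A", "utf-8"))        # [65]

# An accented letter is stored differently depending on the encoding.
print(byte_values("é", "iso-8859-1"))   # [233]
print(byte_values("é", "utf-8"))        # [195, 169]
```

This is why the browser needs to be told which encoding was used: the bytes alone do not say how they should be read.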

You may have experienced the effects of character encoding issues when using special characters. Ever copied and pasted a copyright sign (©), forgotten to replace it with its HTML entity equivalent, and ended up with an ugly block or question mark where it’s supposed to be?

This happens because the character encoding chosen or assumed for that webpage does not cover special characters such as the copyright sign. When only a few such characters are needed, replacing them with HTML character entities is simpler than changing the encoding of the whole page. An entity is written using only plain ASCII characters that every encoding understands, such as c, o, p, and y. We place the keyword (&copy;) or the character’s number (&#169;) between an ampersand and a semicolon, and the browser understands that it needs to interpret the sequence as one single character or symbol.
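As a rough illustration of how entities stand in for special characters, the Python sketch below converts non-ASCII characters to numeric entities and back using only the standard library. The function name to_entities is an assumption made for this example:

```python
import html

def to_entities(text):
    """Replace every non-ASCII character with a numeric character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

# The copyright sign becomes a sequence of plain ASCII characters.
print(to_entities("© 2011"))           # &#169; 2011

# A browser does the reverse; html.unescape mimics that step.
print(html.unescape("&copy; 2011"))    # © 2011
print(html.unescape("&#169; 2011"))    # © 2011
```

Both the named form and the numeric form resolve to the same single character.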

Why Should I Care?

If the character encoding declared by the web developer is not the same as the actual character encoding for the HTML page, then many browsers (both new and old) may render the webpage and characters within it incorrectly. Even if you’re seeing everything ok in your main testing browser, with improper character encoding, another browser may interpret it in an entirely different way. Plus, we have the issue of older versions of those same browsers rendering differently again.
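The mismatch described above can be reproduced in a few lines of Python. This is only a sketch: the helper misread is hypothetical, and real browsers apply additional detection rules, but the garbled result is the same kind of output a visitor would see:

```python
def misread(text, saved_as, declared_as):
    """Encode text one way, then decode it another way,
    as happens when the declared encoding is wrong."""
    return text.encode(saved_as).decode(declared_as, errors="replace")

# Saved as UTF-8 but declared as ISO-8859-1: a stray character appears.
print(misread("©", "utf-8", "iso-8859-1"))   # Â©

# Saved as ISO-8859-1 but declared as UTF-8: the byte is invalid,
# so it is shown as the replacement character (U+FFFD).
print(misread("©", "iso-8859-1", "utf-8"))

# When the declaration matches, the text survives intact.
print(misread("©", "utf-8", "utf-8"))        # ©
```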

Search engines are likely to get confused too. Everything that reads a webpage, whether it be a browser or a search engine spider, needs to use the “computer code,” or the numerical values a webpage boils down to, in order to understand it. If we declare an incorrect character encoding, a search engine will try to read the webpage in the declared character encoding rather than in the encoding it was actually saved in. It then cannot read everything accurately, or even at all.

Different Languages

Many browsers have a pretty easy time interpreting Roman letters and numbers, and we can use some of the same character encodings across many websites without much thought. However, not all languages use Roman letters. Many languages in the Eastern world have entirely different character sets, and even many languages that use primarily Roman letters have several special characters that they use on a regular basis, such as accented vowels and letters with tildes.

When writing a webpage in another language that uses several special characters, it would not be practical to write out the entity versions of those characters throughout the page. This is also true for a primarily English website that needs to be translated into one or more languages.

Considering the Right Character Encoding

So now we see why it’s important to consider the character encoding of a webpage, and also the importance of choosing the right one. How does one choose the correct character encoding though? What are the options? What affects the choice, and what questions should be answered in order to make the decision?

  • What characters will be used? Rather, what characters will be used often? Never change the character encoding for just a few characters throughout the page, such as special symbols, spaces, or em dashes. Make this decision primarily on the language that needs to be used, and any special characters based on that. The choice will reflect what characters will be needed on a regular basis, throughout the entire site, perhaps many times per sentence.
  • Which character encodings are widely supported by browsers? Some browsers cannot read certain encodings, either because they are older or because they lack the add-ons or settings required. For example, a browser may not have support enabled for encodings that cover many non-Roman characters, because its user never needs to read those languages. If a webpage used such an encoding and a large part of its audience lacked that support, those visitors would have trouble reading the page. As a best practice, use common encodings that are widely supported across browsers.
  • What are the limitations of my webpage editor? Can it save correctly in the character encoding you’ve specified? There’s not a whole lot to worry about here, as most editors can handle a wide range of character encodings, but just be wary of this issue, and try to stick to the more common character encodings.
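The first question in the list above, which characters will be used, can be checked mechanically. The following Python sketch tests whether a sample of the site's text fits in a candidate encoding. The function name supports and the sample string are assumptions for illustration only:

```python
def supports(text, encoding):
    """Return True if every character in `text` can be stored in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# Sample text with curly quotes and an accented letter.
sample = "Curly “quotes” and café"

for name in ("ascii", "iso-8859-1", "utf-8"):
    print(name, supports(sample, name))
# ascii False
# iso-8859-1 False  (the accented letter fits, the curly quotes do not)
# utf-8 True
```

Running a check like this against representative content is one way to see quickly which encodings are ruled out.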

It’s true that most of the worry over character encoding arises for non-English websites. Yet we may also want to change the character encoding for typographical reasons. Curly quotes, apostrophes, and other correct typographical elements may be wanted, primarily for design purposes. Likewise, if certain math symbols are needed on an English website, that is another reason to use a character encoding that covers more characters.

How To?

So we know what to consider when choosing a character encoding, but how does one implement this choice into a page, and which code is which? Adding a character encoding to a webpage is as easy as including a meta tag for it:

<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />

The character encoding is defined by the charset parameter inside the content attribute. In the example above, it is “ISO-8859-1”.

One can also declare character encoding in an XML/XHTML document like so:

<?xml version="1.0" encoding="utf-8" ?> 

Or, with access to the server configuration, it can be sent as an HTTP response header for the whole site:

Content-Type: text/html; charset=utf-8 

Source: W3C: Character encodings
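To see how a program might read the declared encoding from a header like the one above, here is a small Python sketch using the standard library's email.message module. The wrapper name charset_from_header is hypothetical, and browsers follow their own, more involved detection rules:

```python
from email.message import Message

def charset_from_header(content_type):
    """Extract the charset parameter from a Content-Type header value.
    Returns None when no charset is declared."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()

print(charset_from_header("text/html; charset=utf-8"))        # utf-8
print(charset_from_header("text/html; charset=ISO-8859-1"))   # iso-8859-1
print(charset_from_header("text/html"))                       # None
```

The last case, where nothing is declared, is the one in which the reader of the page is left to guess.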

There are several character encodings one can use, and a list of the common HTML encodings is available here: HTML/XHTML Character Encodings. That page features a table of each character encoding’s name, along with what it supports and what it is best used for.

Conclusion

Character encoding is definitely important, even if we don’t see the immediate effects of an incorrectly encoded page directly. Using proper encoding is all about usability, both for the publishers of the website and for the viewers of its content. Fortunately, in most cases, finding the proper encoding can be easy, but when working with multilingual sites, or extensive typography, it’s a must to learn how to use it correctly.

Just take a look over some of the common character encodings, and know which is best in each situation. It’s truly not a hard concept to grasp, and the basis of it is quite easy if a bit of time is taken initially to understand it. By understanding why proper character encoding is important, we can make each webpage more user-friendly and easier to handle.

Kayla Knight is a web designer and frontend web developer. She specializes in responsive web design, progressive web technologies, and also knows her way around most CMS's and PHP. You can find out more and check out her portfolio at kaylaknight.co.

Comments

    • Jenn,
    • April 20, 2011

    Great article. Just to add a point from a developer standpoint: make sure the encoding of your MySQL database matches your page encoding. This will save many headaches with different characters.

    • Andrew,
    • April 21, 2011

    Joel Spolsky has another great article with a bit more technical stuff (and history) for those interested: http://www.joelonsoftware.com/articles/Unicode.html

  1. / Reply

    First, I want to thank you for blogging
    about character encodings. I worked for a few
    years as a web developer without understanding
    and using them. Much like Andrew (comment above),
    I learned the ins and outs of Unicode from
    Joel Spolsky’s blog and it opened my eyes
    –that and having to deal with a lot of text
    copied and pasted from MS Word into CMS and
    NLP systems (world of pain when encoding not
    known!)

    Ordinary folks have at least one opinion.
    Programmers have many more. Suffice it to
    say: I respect yours, but disagree with
    some of your assertions.

    My main quibble:

    “Never change the character encoding for just
    a few characters throughout the page…”

    I posit that no disadvantage comes from
    ensuring that both the encoded content of
    site as well as the declared encoding are
    UTF-8.

    I reject the argument that the
    number of bytes, rendering speed, or
    lack of browser support limits
    the efficacy or correctness of text
    encoded in UTF-8. For at least a decade
    all modern browsers have fully supported
    UTF-8 within the Basic Multilingual Plane
    (going back to IE 6). Moreover, it’s a
    variable-width encoding of 1 to 4 bytes.

    On the other hand, using fonts that
    don’t support UTF-8 (aka fonts that aren’t
    web safe) often results in mangled rendering.

    There are some cases in which it might not be
    practical to attempt the conversion from
    ISO-8859-1 to UTF-8. For example: limited
    resources might prevent the headache of
    converting large amounts of text whose encoding
    remains unknown.

    Furthermore, standards bodies such as the
    W3C and RFC are increasingly opposed to
    character entities and XHTML in favor of
    using a UTF-8 encoded character and the
    meta tag declaration.

    My belief: UTF-8 is almost always the right choice
    for encoding text.

    I’m happy to provide credible sources for all
    of the statements I’ve made here, and appreciate
    any further debate or discussion.

  2. / Reply

    We have major issues with software that doesn’t support UTF-8, like Dreamweaver. We have issues saving the ASP/PHP files because the text then turns into question marks, and we have to edit again with Notepad to fix the non-English words.

    Notepad causes issues as well with Hebrew in HTML and other files, but these can be fixed with some options.

    • Michaux Kelley,
    • July 5, 2011

    Dreamweaver supports UTF-8, as do almost all modern IDEs and text editors.
