Do you understand i18n?

If you didn’t already know, “i18n” is the shorthand developers use for “internationalization”, which is the ability of applications to operate using things like international character sets and languages. Many developers assume – sometimes rightly, but more often wrongly – that the frameworks they are using will handle this for them. For example, Juliette Culver cites “character encoding” as number one in her article Mistakes I have made building web applications (and points to this handy article by Joel Spolsky).

However, this issue has become even more critical with the emergence of IRIs. An IRI is a generalization of a URI (such as a URL) that can contain Unicode characters from outside the ASCII range. For example:

http://الاعلي-للاتصالات.قطر

Not only is this URL in the Arabic alphabet, the direction of the text changes in the same line. So, “http” is read from left-to-right, while “‎الاعلي-للاتصالات.قطر” is right-to-left. This mixing together of scripts running in different directions is quite common – another common example is where there is a left-to-right acronym (such as “W3C”) in a right-to-left sentence.
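To see how a program might detect this kind of mixed-direction text, here is a minimal sketch using the standard java.text.Bidi class. The class name BidiSketch and the sample strings are my own illustrations (a short Arabic word stands in for the full domain name), and the strings are written as \u escapes so the source file stays ASCII.

```java
import java.text.Bidi;

final class BidiSketch {
    // Hypothetical sample: the three Arabic letters of the top-level domain used above.
    static final String ARABIC = "\u0642\u0637\u0631";

    // Analyze a string, taking the base direction from its first strongly
    // directional character (left-to-right if there is none).
    static Bidi analyze(String text) {
        return new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
    }

    public static void main(String[] args) {
        Bidi url = analyze("http://" + ARABIC);
        Bidi sentence = analyze(ARABIC + " W3C " + ARABIC);
        System.out.println("URL mixed: " + url.isMixed()
                + ", base LTR: " + url.baseIsLeftToRight());
        System.out.println("Sentence mixed: " + sentence.isMixed()
                + ", base LTR: " + sentence.baseIsLeftToRight());
    }
}
```

Both strings report as mixed, but with different base directions: the URL starts with a left-to-right run, while the sentence containing the acronym starts right-to-left.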

In some cases we can use existing tools and frameworks to make sure our applications can handle this, but there are also many gaps. For example, Java still doesn’t have a standard IRI class to accompany its standard URL and URI classes.
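As a rough illustration of filling that gap by hand, the sketch below converts a very simple IRI into an ASCII-only URI using two standard classes: java.net.IDN for the host name and java.net.URI for percent-encoding the path. The class IriSketch and its toUri method are hypothetical names of mine, and the parsing only handles the scheme://host/path shape (no port, user info, query, or fragment), so treat it as a starting point rather than a full IRI implementation.

```java
import java.net.IDN;
import java.net.URI;
import java.net.URISyntaxException;

final class IriSketch {
    // Convert an IRI of the simple form scheme://host/path into an ASCII-only URI.
    // Assumption: no port, user info, query, or fragment is present.
    static URI toUri(String iri) {
        int schemeEnd = iri.indexOf("://");
        if (schemeEnd < 0) {
            throw new IllegalArgumentException("expected scheme://host/path: " + iri);
        }
        String scheme = iri.substring(0, schemeEnd);
        int hostStart = schemeEnd + 3;
        int pathStart = iri.indexOf('/', hostStart);
        String host = pathStart < 0 ? iri.substring(hostStart) : iri.substring(hostStart, pathStart);
        String path = pathStart < 0 ? "" : iri.substring(pathStart);
        try {
            // The host becomes its "xn--" (Punycode) form.
            String asciiHost = IDN.toASCII(host);
            // The multi-argument constructor keeps non-ASCII path characters;
            // toASCIIString() then percent-encodes them as UTF-8 bytes.
            URI unicodePath = new URI(scheme, asciiHost, path, null);
            return new URI(unicodePath.toASCIIString());
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("cannot convert IRI: " + iri, e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical example IRI with a non-ASCII host and path.
        System.out.println(toUri("http://b\u00fccher.example/caf\u00e9"));
    }
}
```

Splitting the string by hand, instead of passing the whole IRI to new URI(String), is deliberate: the host and the path need different encodings, so they have to be separated before either is converted.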

Even using good frameworks, there are still many places where it is easy to make mistakes. For example, does your server actually send pages in UTF-8 encoding? Some common application servers will default to Western encodings such as ISO-8859-1, rendering many alphabets as question marks unless you remember to override each page with an explicit UTF-8 encoding.
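The question-mark effect is easy to reproduce without a server. This small sketch (EncodingSketch and roundTrip are names I have made up for illustration) encodes a string to bytes with a given charset and decodes it again, which is roughly what happens to page content between your application and the browser.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingSketch {
    // Encode text to bytes and decode it again with the same charset.
    // Characters the charset cannot represent are replaced during encoding.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String arabic = "\u0642\u0637\u0631"; // three Arabic letters
        System.out.println(roundTrip(arabic, StandardCharsets.ISO_8859_1));
        System.out.println(roundTrip(arabic, StandardCharsets.UTF_8).equals(arabic));
    }
}
```

In a servlet-based application the usual override is a call such as response.setCharacterEncoding("UTF-8") before any output is written, though where exactly you set it depends on your server and framework.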

Also, do you take i18n into account when manipulating strings? Many basic string functions aren’t completely Unicode-aware; for example, common methods for trimming or normalizing whitespace may not take account of the different Unicode whitespace characters.
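Java’s own String.trim() is a good example: it only removes characters up to U+0020, so it leaves a no-break space or an ideographic space in place. A minimal sketch of a more Unicode-aware trim follows; WhitespaceSketch, isUnicodeSpace and trimUnicode are hypothetical names, and treating no-break spaces as trimmable is my assumption, which may not suit every application.

```java
final class WhitespaceSketch {
    // Treat a code point as a space if Java considers it whitespace OR a
    // Unicode space separator (the latter also covers no-break spaces).
    static boolean isUnicodeSpace(int codePoint) {
        return Character.isWhitespace(codePoint) || Character.isSpaceChar(codePoint);
    }

    // Remove leading and trailing Unicode spaces, walking by code point.
    static String trimUnicode(String s) {
        int start = 0;
        int end = s.length();
        while (start < end) {
            int cp = s.codePointAt(start);
            if (!isUnicodeSpace(cp)) break;
            start += Character.charCount(cp);
        }
        while (end > start) {
            int cp = s.codePointBefore(end);
            if (!isUnicodeSpace(cp)) break;
            end -= Character.charCount(cp);
        }
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        // Ideographic space, no-break space, text, em space.
        String padded = "\u3000\u00A0hello\u2003";
        System.out.println(padded.trim().length());
        System.out.println(trimUnicode(padded).length());
    }
}
```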

The issue of bidirectional (BIDI) text can also be problematic. For example, you can use Unicode control characters to represent changes in text direction, or you can use CSS properties (style="direction:rtl; unicode-bidi:embed"), or you can use semantic markup (its:dir="rtl"). You also need to be able to traverse multiple levels of nesting: for example, a <p> tag with RTL, including a <span> with LTR, which in turn includes RTL control characters.
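For the control-character approach, the sketch below wraps text in the Unicode embedding characters (LRE, RLE and the closing PDF) and measures how deeply such embeddings are nested. The class BidiControls and its methods are illustrative names of mine, not code from any particular project, and the sketch ignores direction coming from CSS or markup.

```java
final class BidiControls {
    static final char LRE = '\u202A'; // left-to-right embedding
    static final char RLE = '\u202B'; // right-to-left embedding
    static final char PDF = '\u202C'; // pop directional formatting
    static final char LRO = '\u202D'; // left-to-right override
    static final char RLO = '\u202E'; // right-to-left override

    static String embedRtl(String s) {
        return RLE + s + PDF;
    }

    static String embedLtr(String s) {
        return LRE + s + PDF;
    }

    // Deepest level of nested embeddings or overrides in the text.
    // An unmatched PDF is ignored rather than treated as an error.
    static int maxDepth(String s) {
        int depth = 0;
        int max = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == LRE || c == RLE || c == LRO || c == RLO) {
                depth++;
                max = Math.max(max, depth);
            } else if (c == PDF && depth > 0) {
                depth--;
            }
        }
        return max;
    }

    public static void main(String[] args) {
        // An RTL run containing an LTR acronym: two levels of nesting.
        String nested = embedRtl("\u0642\u0637\u0631 " + embedLtr("W3C"));
        System.out.println(maxDepth(nested));
    }
}
```

A real implementation would need to combine this depth with whatever direction is inherited from the enclosing elements, which is where the nesting described above becomes awkward.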

I’ve had quite some fun making this stuff work for Apache Wookie (you can see some of the Java code I’ve created for i18n here.)

This entry was posted in development, i18n, standards.

One Response to Do you understand i18n?

  1. Simon Grant says:

    See http://www.cs.tut.fi/~jkorpela/chars/index.html from a chap I came across a couple of years ago…

    A very pleasant guy I think you would get on well with, Scott :-)

    Simon
