Sunday 1 April 2012

Avoid dashes and fancy quotes in blog titles


John Reed pointed me to a post by Jeremiah Owyang, which I failed to retrieve on my phone:



Ironic as it may seem, this is due to another bug which doesn't have clear ownership: let's call it the character translation bug (sorry, non-IT folks).

The IT world is a messy place. IT started in the United States, so they developed a codepage to cope with their problems: ASCII, the American Standard Code for Information Interchange. It contains 128 characters, enough to cover all the characters needed at the time (I think they looked at a common US typewriter).
Then the bloody Europeans joined the IT scene, with their funny French c-cedillas (ç), the Scandinavian æ, the German ß and dozens of other useless symbols Americans hadn't thought of (rant intended).
So the second half of the 8-bit byte, the 128 values that ASCII leaves unused, quickly became populated with those, which came to be known as the Latin-1 extension or ISO 8859-1.
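To make the ASCII versus Latin-1 split concrete, here is a minimal Python sketch. The helper name `fits` is made up for this illustration; it just checks whether a character can be written in a given codepage.

```python
def fits(ch: str, codec: str) -> bool:
    """Return True if the character can be encoded with the given codec."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

# ASCII covers values 0..127; Latin-1 (ISO 8859-1) uses the whole byte, 0..255.
for ch in ["A", "ç", "æ", "ß"]:
    print(ch, fits(ch, "ascii"), fits(ch, "latin-1"), ch.encode("latin-1").hex())
# A True True 41
# ç False True e7
# æ False True e6
# ß False True df
```

Every European character lands in the upper half of the byte (80 to FF hexadecimal), which is exactly the part ASCII never defined.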

Needless to say, that wasn't enough. Eventually Unicode saw the light, making the Asians happy, and offering plenty of space for tens of thousands of characters. Check out this list if you think you don't have much of a choice.
But even that wasn't enough: the web kicked in, and HTML and URLs started causing problems.
Why? Because the people who invented them had a broad view of their invention, but not of its application: yet another US invention, it was limited to only a few characters. (Unicode was started by Xerox's Joe Becker together with Apple's Lee Collins and Mark Davis, so not all American inventions are evil, in case you think that I think so.)

HTML had special characters that needed to be escaped, and so did URLs - and they didn't use the same ones. What is escaping? Wrapping, translating, converting: different words for the same action.
For example, if you put a space in a URL, it needs to be translated to %20 or the URL doesn't resolve to anything useful. URL encoding is so much fun because it has fewer than 70 non-reserved characters, and all the others have to be escaped. How? Haha...
You must escape a reserved character for use in a URL by taking its hexadecimal ASCII value and preceding that with the percent sign (%).
And therein lies somewhat of a problem, of course - see above. Characters you read (if you are a machine) have to be decoded, and characters you write have to be encoded. In between decoding and encoding, lots can go wrong.
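The rule for ASCII characters fits in one line of Python. This is only a sketch: `percent_encode_ascii` is a name invented here, while `quote` and `unquote` are the standard library's own encoder and decoder from `urllib.parse`.

```python
from urllib.parse import quote, unquote

def percent_encode_ascii(ch: str) -> str:
    """Percent-encode one ASCII character by hand: '%' plus two hex digits."""
    return "%{:02X}".format(ord(ch))

print(percent_encode_ascii(" "))   # %20
print(quote("hello world"))        # hello%20world
print(unquote("hello%20world"))    # hello world
```

Encoding and decoding are two separate steps, and every piece of software between the blog and your browser gets to do both.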
For the real diehards among you, here is the IETF's latest RFC for generic URI syntax (RFC 3986).

HTML and XML have similar needs: check out this Wikipedia list that shows some escape examples, and then there's XHTML, and different versions for all, and so on and so on and so on
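To see that HTML and URLs really do escape different characters in different ways, here is a small sketch using Python's standard library on a made-up sample string:

```python
import html
from urllib.parse import quote

text = "Tom & Jerry <3"

# HTML escaping replaces markup characters with named entities.
print(html.escape(text))   # Tom &amp; Jerry &lt;3

# URL escaping replaces everything outside the safe set with %XX.
print(quote(text))         # Tom%20%26%20Jerry%20%3C3
```

Same input, two completely different outputs, and a piece of text that ends up in a link on a web page has to survive both.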

Is that all? No it ain't. Blog posts get spread across the world via Twitter, StumbleUpon, and many, many other sites that use URL shorteners: those take a URL and turn it into something smaller, of course once again escaping the reserved characters. On top of that, Microsoft Word is somewhat to blame as well: it of course has its own codepage, and under the hood everything has been XML since Office 2007.
Getting dizzy? Let's get practical

Let's take Jeremiah's post title, in URL presentation:
http://www.web-strategist.com/blog/2012/03/28/coping-with-twitter%E2%80%99s-unfollow-bug/
The sharp observer will see that the %E2%80%99 part is meant to simply represent a single quote:

Coping With Twitter’s Unfollow Bug

(I know the URL doesn't look anything like that, but just copy and paste it into a smart editor and you'll magically see the single quote change to the awkward %E2%80%99)
However, it's not a mere single quote: that would be a simple %27 (%92 is what this same fancy quote becomes in Microsoft's Windows-1252 codepage). No, this is a fancy one: its proper name is Right Single Quotation Mark, and it is pure Unicode: code point U+2019, or hexadecimal 20 19.
(I could bore you with big-endian and little-endian, which tell you which of the two bytes comes first, but let's just leave it at this, shall we?)
Taking this Unicode value and translating it to UTF-8, you get E2 80 99. Simply prefix each of those with the percent sign, and you get %E2%80%99.
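For those who want to check the arithmetic, here is a minimal Python sketch that builds the three UTF-8 bytes by hand and compares them with what the standard library produces. The variable names are mine; the bit layout is the standard three-byte UTF-8 pattern used for code points from U+0800 to U+FFFF.

```python
from urllib.parse import quote

ch = "\u2019"   # RIGHT SINGLE QUOTATION MARK
cp = ord(ch)    # 0x2019

# Three-byte UTF-8 pattern: 1110xxxx 10xxxxxx 10xxxxxx
manual = bytes([
    0xE0 | (cp >> 12),           # top 4 bits    -> E2
    0x80 | ((cp >> 6) & 0x3F),   # middle 6 bits -> 80
    0x80 | (cp & 0x3F),          # low 6 bits    -> 99
])

print(manual.hex())                  # e28099
print(manual == ch.encode("utf-8"))  # True
print(quote(ch))                     # %E2%80%99
print(quote("'"))                    # %27, the plain ASCII single quote

# The endianness footnote: same two bytes, opposite order.
print(ch.encode("utf-16-be").hex())  # 2019
print(ch.encode("utf-16-le").hex())  # 1920
```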

If you use longurl.org to check the URL shortened by Twitter, you'll get a 404. That's what I got on my mobile using Hootsuite. If I use Hootsuite on my laptop, I go straight to the page. What URL gets decoded on my Hootsuite mobile account?

http://www.web-strategist.com/blog/2012/03/28/coping-with-twitter%C3%A2%C2%80%C2%99s-unfollow-bug/

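It seems the decoder did not treat E2 80 99 as one UTF-8 character: it read each of the three bytes as a separate Latin-1 character and then re-encoded each one as UTF-8, giving C3 A2, C2 80 and C2 99. I suspect this is an Android issue rather than a Hootsuite mobile one. Nonetheless, the outcome is the same: an invalid URL.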
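The broken URL can be reproduced in a few lines of Python. This is a sketch of one plausible failure path (decode the bytes as Latin-1 instead of UTF-8, then encode again), not a claim about what Android or Hootsuite actually do internally.

```python
from urllib.parse import quote, unquote, unquote_to_bytes

good = "coping-with-twitter%E2%80%99s-unfollow-bug"

raw = unquote_to_bytes(good)      # the original UTF-8 bytes
misread = raw.decode("latin-1")   # wrong codepage: each byte becomes its own character
broken = quote(misread)           # re-encoded as UTF-8 and percent-escaped again

print(broken)
# coping-with-twitter%C3%A2%C2%80%C2%99s-unfollow-bug

# The damage is reversible as long as you know which mistake was made.
repaired = unquote(broken).encode("latin-1").decode("utf-8")
print(repaired)
# coping-with-twitter’s-unfollow-bug
```

One character in, three bytes out, six bytes after the second round: that is how a single fancy quote turns into a 404.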
Did you make it all the way down here? Wow, you must be really into bits and bytes and we should probably meet. Chances are, you do integration work or at least work across platforms and programming languages.
I hope I've shown that there are a lot of dependencies involved and that the same URL might make it via shortener A but not B, or Twitter client A but not Twitter client B, or device A but not device B, and that all this might lead to a lot of confusion and misunderstanding.
Want to stay on the safe side, and have your URL work for everyone? Avoid special characters such as dashes and fancy quotes in your post title, and stick the dazzling stuff into the content itself.
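If you generate your URLs yourself, the advice can be automated. The function below is a hypothetical helper of my own (`safe_slug` is not part of any blogging platform), a minimal sketch that boils a title down to lowercase ASCII letters, digits and plain hyphens.

```python
import re
import unicodedata

def safe_slug(title: str) -> str:
    """Hypothetical helper: reduce a title to lowercase ASCII letters, digits and hyphens."""
    # Drop straight and typographic quotes so "Twitter’s" becomes "Twitters".
    title = re.sub("[\u2018\u2019\u201C\u201D'\"]", "", title)
    # Decompose accented letters, then throw away everything that is not ASCII.
    ascii_only = unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    # Collapse every run of other characters into a single hyphen.
    return re.sub(r"[^a-z0-9]+", "-", ascii_only.lower()).strip("-")

print(safe_slug("Coping With Twitter’s Unfollow Bug"))
# coping-with-twitters-unfollow-bug
```

With a slug like that there is nothing left to escape, so there is nothing left for a shortener, client or device to get wrong.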
