URL Encoding and character set confusion

There are a number of interesting things to consider regarding input into a web application. Our application receives and handles incoming SMS messages. Since some of our customers target Spanish speakers, we are getting input that falls outside the ASCII alphabet. The trouble word of the day is 'Sueño'; specifically the eñe.

This particular letter is supported by a number of different character encodings. Internally, everything is UTF-8. The eñe in UTF-8 is the two bytes 0xC3 0xB1, and URL encoded, this becomes %C3%B1. However, one of our sources of SMS traffic uses a character encoding other than UTF-8. In that case, we see %F1. After some research, I found that 0xF1 is the ñ in the ISO-8859-1 character set.
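To make the difference concrete, here is a small sketch that percent-decodes both forms into raw bytes. It assumes a Ruby version that ships URI.decode_www_form_component; the raw_bytes helper is made up for illustration and is not part of our application.

```ruby
require 'uri'

# Hypothetical helper: percent-decode a value into raw bytes without
# interpreting them as any particular character set.
def raw_bytes(encoded)
  URI.decode_www_form_component(encoded, Encoding::BINARY).bytes
end

utf8_bytes   = raw_bytes('Sue%C3%B1o')  # => [83, 117, 101, 195, 177, 111]
latin1_bytes = raw_bytes('Sue%F1o')     # => [83, 117, 101, 241, 111]
```

The same letter arrives as two bytes (0xC3 0xB1) from a UTF-8 sender and as one byte (0xF1) from an ISO-8859-1 sender, so the receiver has to know, or guess, which encoding was used.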

To deal with this in Ruby (we are using Rails), I did the following...

   # Iconv.conv(to, from, string): treat the incoming bytes as ISO-8859-1 and convert them to UTF-8
   smsmsg = Iconv.conv('utf-8', 'ISO-8859-1', params[:smsmsg].to_s)

This was my first introduction to Iconv, which is part of the Ruby standard library.  Unfortunately, it isn't very well documented.  Maybe this post will help.
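For comparison, here is a sketch of the same conversion using String#encode instead of Iconv. This assumes Ruby 1.9 or later, where strings carry their own encoding; the raw variable is a stand-in for params[:smsmsg] and is not taken from our code.

```ruby
# Stand-in for params[:smsmsg]: the raw ISO-8859-1 bytes of the trouble word.
raw = "Sue\xF1o".b

# String#encode(to, from) takes its arguments in the same order as Iconv.conv:
# target encoding first, source encoding second.
smsmsg = raw.encode(Encoding::UTF_8, Encoding::ISO_8859_1)

smsmsg.encoding  # => #<Encoding:UTF-8>
smsmsg.bytes     # => [83, 117, 101, 195, 177, 111]
```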

It seems to me that this is something Rails should be doing for me, although I might be underestimating the task. It would be nice to have params[:some_value] always return a consistently encoded string.
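A rough sketch of what such a normalizing step could look like, assuming Ruby 1.9 or later and assuming that anything that is not valid UTF-8 was sent as ISO-8859-1. The to_utf8 name and the fallback choice are mine, not something Rails provides.

```ruby
# Hypothetical helper: return the value as a UTF-8 string.
# If the bytes are already valid UTF-8 they are kept as-is; otherwise they
# are assumed to be in the fallback encoding and transcoded.
def to_utf8(value, fallback = Encoding::ISO_8859_1)
  str = value.to_s.dup.force_encoding(Encoding::UTF_8)
  return str if str.valid_encoding?

  str.force_encoding(fallback).encode(Encoding::UTF_8)
end

from_utf8_source   = to_utf8("Sue\xC3\xB1o".b)  # already UTF-8, left alone
from_latin1_source = to_utf8("Sue\xF1o".b)      # transcoded from ISO-8859-1
```

This is only a heuristic. Some ISO-8859-1 byte sequences also happen to be valid UTF-8, so when the sender's encoding is known (per SMS source, for example) it is safer to convert explicitly than to guess.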

 
