Funny characters, Webservices and UTF8

I was importing the Foss.In speakers and talks data into Glancer, and it started giving me some strange errors. This is what I got in OpenLaszlo:
error: java.io.UTFDataFormatException: Invalid byte 1 of 1-byte UTF-8 sequence.

And when I tried using PHP, this was the error:
SOAP-ENV:Server SOAP-ERROR: Encoding: string 'some string here' is not a valid utf-8 string

So I looked into what was causing this. At first I thought it was MySQL: I had mixed collations across different tables. Some used the default latin1_swedish_ci, some utf8_general_ci and some utf8_unicode_ci. I changed all the tables and fields to utf8_unicode_ci first. But that did not solve the problem!

At this point I switched to PHP to test, and found that the string in the data was not a valid Unicode / UTF-8 string. I checked whether this was due to the size of the string (the columns were text data types), but that wasn’t the problem. Then I discovered the problem was caused by some funny characters in the text.
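A quick way to confirm that a string is not well-formed UTF-8 is to ask PHP directly. The following is a minimal sketch, not the check used in the post: `isValidUtf8` is a hypothetical helper name, and the sample bytes are an assumption about what Word-pasted text might look like.

```php
<?php
// Hypothetical helper: returns true when $text is well-formed UTF-8.
// With the /u modifier PCRE validates the subject as UTF-8 first;
// preg_match() then returns 1 for valid input and false for malformed input.
function isValidUtf8(string $text): bool
{
    return preg_match('//u', $text) === 1;
}

// Assumed sample: 0x93 and 0x94 are curly double quotes in Windows-1252,
// but on their own they are stray continuation bytes in UTF-8.
$fromWord = "\x93quoted\x94";
$plain    = "plain ASCII";

var_dump(isValidUtf8($fromWord));
var_dump(isValidUtf8($plain));
```

Running a check like this over each row before handing it to the web service would point at the offending records instead of failing the whole response.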

Primarily, these were the “auto convert” characters that Word typically inserts, e.g. straight double quotes converted to curly quotes, three dots to an ellipsis, the trademark sign, and so on. The web service would not accept them!

This was strange to me, as Unicode should support any character in the world, but I just went ahead and removed all such characters and things started working!
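Unicode does contain all of these characters, so an alternative to removing them is to re-encode them. The post does not say which encoding the imported text was actually in. The sketch below assumes it was Windows-1252 (CP1252), the usual legacy encoding for Word's curly quotes, ellipsis and trademark sign, and assumes PHP's iconv extension is available. `cp1252ToUtf8` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: re-encode bytes that are ASSUMED to be Windows-1252 as UTF-8.
function cp1252ToUtf8(string $bytes): string
{
    $utf8 = iconv('CP1252', 'UTF-8', $bytes);
    if ($utf8 === false) {
        // iconv() returns false when it cannot convert the input.
        throw new InvalidArgumentException('Input could not be converted from Windows-1252');
    }
    return $utf8;
}

// In CP1252: 0x93/0x94 = curly double quotes, 0x85 = ellipsis, 0x99 = trademark sign.
$fromWord  = "\x93Glancer\x94 \x85 \x99";
$converted = cp1252ToUtf8($fromWord);

// Print the UTF-8 bytes as hex so the result does not depend on the terminal encoding.
echo bin2hex($converted), "\n";
```

This only makes sense for strings that are not already valid UTF-8. Converting a string that is already UTF-8 would double-encode it.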

FYI, here are the characters that I replaced:

Comment

  1. Using utf8_encode() on the results will eliminate this problem. Another solution that I found on the net (which I have not tried) is to downgrade libxml to 2.6.8.

  2. Using utf8_encode() on the results will NOT eliminate this problem.
    As long as the funny characters that M$ Word uses are in the text, the problem remains. (Characters in the above list.)

    I had this problem while writing a small CMS, when users started copying and pasting MS Word content. It was hell.

  3. I was having this problem while the MySQL client’s encoding was set to latin1. Setting it to utf8 solved the problem.

    mysql_set_charset('utf8', $this->DB) or die("ERROR: Unable to set client's character set");

    Pavel

  4. UTF-8 is not Unicode… Unicode is also known as UTF-16 and every character is 2 bytes. UTF-8 is a mix/stop-gap between ASCII and Unicode where characters can vary between 1-3 bytes.

  5. Unicode has a variety of encoding methods, of which UTF-8 is extremely common. Some people think that UTF-16 is Unicode and that’s all there is. Wrong: UTF-16 is used a lot on Windows, e.g. in SQL Server, because 2-byte units are more manageable. UTF-8 is much cooler because it uses a variable byte count (1-4 bytes), signalled by the high bits of the first byte. So regular ASCII strings are stored as one byte per character (with the high bit off); however, the minute you get a fancy character the high bit is turned on and one or more bytes follow with the details. But UTF-8 and UTF-16 are flavours of the same concept: there is no need for special language encodings because all languages are handled in the one encoding.
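The variable byte count of UTF-8 is easy to see in PHP, because strlen() counts bytes rather than characters. This is a minimal sketch with sample characters chosen purely for illustration. It needs PHP 7 or later for the `\u{...}` escapes, and `utf8ByteLength` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: number of bytes used by a UTF-8 encoded string.
// strlen() counts bytes, so for a single character this is its encoded length.
function utf8ByteLength(string $char): int
{
    return strlen($char);
}

// Illustrative samples, one for each possible UTF-8 sequence length.
$samples = [
    'A'       => "A",          // U+0041, plain ASCII
    'e-acute' => "\u{00E9}",   // U+00E9
    'euro'    => "\u{20AC}",   // U+20AC
    'emoji'   => "\u{1F600}",  // U+1F600, outside the Basic Multilingual Plane
];

foreach ($samples as $name => $char) {
    // The high bit of the first byte is clear for ASCII and set for multi-byte sequences.
    $highBit = (ord($char[0]) & 0x80) !== 0;
    printf("%-8s %d byte(s), high bit %s\n", $name, utf8ByteLength($char), $highBit ? 'set' : 'clear');
}
```

The four samples take one, two, three and four bytes respectively, and only the ASCII letter has the high bit of its first byte clear.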