Funny characters, Webservices and UTF8

I was importing the Foss.In speakers and talks data into Glancer, and it started giving me some strange errors. This is what I got in OpenLaszlo:
error: java.io.UTFDataFormatException: Invalid byte 1 of 1-byte UTF-8 sequence.

And when I tried using PHP, this was the string:
SOAP-ENV:Server SOAP-ERROR: Encoding: string 'some string here' is not a valid utf-8 string

So I looked into what was causing this. At first I thought it was MySQL: I had mixed up the collations on different tables. Some used the default latin1_swedish_ci, some utf8_general_ci and some utf8_unicode_ci. I changed all the tables and fields to utf8_unicode_ci first. But that did not solve the problem!

That is when I switched to PHP to test, and found that the string in the data was not a valid Unicode / UTF-8 string. I checked whether this was due to the size of the string, since I had used TEXT data types, but that wasn’t the problem. Then I discovered that the problem was caused by some funny characters in the text.

Primarily, these were the “auto convert” characters that Word typically substitutes as you type: straight double quotes converted to curly double quotes, three dots to an ellipsis, the trademark sign, and so forth. The webservice would not accept them!

This was strange to me, as Unicode should support any character in the world, but I just went ahead and removed all such characters and things started working!

FYI, here are the characters that I replaced:
” “ ’ ™ … • ‘ — –
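
One plausible explanation, offered here as an assumption rather than something confirmed in this post: the text was stored or delivered as Windows-1252 bytes (what Word on Windows typically produces) but handed to the webservice as if it were UTF-8. A minimal Python sketch shows why that fails for exactly the characters listed above, and how decoding with the right codec keeps them instead of deleting them. The helper names are made up for illustration.

```python
# Sketch only: assumes the offending text arrived as Windows-1252 (cp1252) bytes.
# The characters from the list above, written as escapes:
# right/left double quote, right single quote, trade mark, ellipsis,
# bullet, left single quote, em dash, en dash.
FUNNY_CHARS = "\u201d\u201c\u2019\u2122\u2026\u2022\u2018\u2014\u2013"


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def cp1252_to_utf8(raw: bytes) -> bytes:
    """Re-encode cp1252 bytes as UTF-8, keeping the characters intact."""
    return raw.decode("cp1252").encode("utf-8")


for ch in FUNNY_CHARS:
    raw = ch.encode("cp1252")      # one byte in the 0x80-0x9F range
    fixed = cp1252_to_utf8(raw)    # a three-byte UTF-8 sequence
    print(f"U+{ord(ch):04X}: cp1252={raw.hex()} "
          f"valid_utf8={is_valid_utf8(raw)} utf8={fixed.hex()}")
```

Each of these characters is a single byte between 0x80 and 0x9F in Windows-1252. In UTF-8 such a byte may only appear as a continuation byte, never on its own, which is consistent with the "Invalid byte 1 of 1-byte UTF-8 sequence" error.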

Nirav

June 8, 2006

Using utf8_encode() on the results will eliminate this problem. Another solution that I found on the net (which I have not tried) is to downgrade libxml to 2.6.8.

mxcd

April 25, 2007

Using utf8_encode() on the results will NOT eliminate this problem.
As long as the funny characters that M$ Word uses (the ones in the list above) are in the data, the problem remains.

I had this problem while writing a small CMS, where users started copying and pasting MS Word content. It was hell.
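
A possible reason for the disagreement above: PHP's utf8_encode() converts from ISO-8859-1 to UTF-8, so its output is always valid UTF-8, but Word's characters are Windows-1252 bytes, which ISO-8859-1 maps to invisible control codes rather than to quotes and dashes. The Python sketch below imitates that conversion. It assumes the input really is Windows-1252, and latin1_to_utf8 is a hypothetical stand-in, not PHP's implementation.

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """Imitate an ISO-8859-1 -> UTF-8 conversion (hypothetical helper)."""
    return raw.decode("latin-1").encode("utf-8")


raw = b"\x93hello\x94"  # the word hello in curly quotes, as Windows-1252 bytes

as_latin1 = latin1_to_utf8(raw).decode("utf-8")  # what an ISO-8859-1 reading gives
as_cp1252 = raw.decode("cp1252")                 # what the author actually typed

# Both results are valid Unicode text, but only the cp1252 reading
# recovers the real quotation marks.
print([f"U+{ord(c):04X}" for c in (as_latin1[0], as_latin1[-1])])
print([f"U+{ord(c):04X}" for c in (as_cp1252[0], as_cp1252[-1])])
```

The first line prints U+0093 and U+0094 (control codes), the second U+201C and U+201D (the curly quotes). So a conversion of this kind can silence the encoding error while still leaving the wrong characters in the data.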

FOXXFiles

June 20, 2007

You can dump the database to an SQL script, remove the COLLATE clauses, re-encode the script and restore the database. It works.

Pavel Benisek

March 10, 2008

I was having this problem while the MySQL client’s encoding was set to latin1. Setting it to utf8 solved the problem.

mysql_set_charset('utf8', $this->DB) or die("ERROR: Unable to set Client's character set");

Pavel

Tim

April 25, 2008

UTF-8 is not Unicode… Unicode is also known as UTF-16 and every character is 2 bytes. UTF-8 is a mix/stop-gap between ASCII and Unicode where characters can vary between 1-3 bytes.

Laurents C. R. Meyer

July 1, 2008

Successfully used the “Pavel Benisek Solution”. Works great.

Fesh

July 9, 2008

Thanks for that solution, Pavel… worked perfectly.

Dan

November 6, 2009

Unicode has a variety of encodings, of which UTF-8 is extremely common. Some people think that UTF-16 is Unicode and that’s all there is. Wrong. UTF-16 is used a lot on Windows, for example in SQL Server, because its 2-byte units are more manageable (characters outside the Basic Multilingual Plane take 4 bytes). UTF-8 is much cooler because it uses a variable byte count (1-4 bytes), signalled by the high bits of the first byte. So regular ASCII strings are stored as one byte per character (with the high bit off); however, the minute you get a fancy character, the high bit is turned on and one or more bytes follow with the details. But UTF-8 and UTF-16 are flavours of the same concept: there is no need for special language encodings because all languages are handled in the one encoding.
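
To make the byte counts concrete, here is a small Python sketch that encodes a few sample characters both ways. The sample characters and the helper name are chosen only for illustration.

```python
# Sample characters, written as escapes, mapped to a short description.
SAMPLES = {
    "A": "ASCII letter",
    "\u00e9": "e with acute accent",
    "\u2122": "trade mark sign",
    "\U0001F600": "emoji outside the Basic Multilingual Plane",
}


def encoded_sizes(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    # utf-16-le is used so that no byte order mark is added to the count.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


for ch, label in SAMPLES.items():
    u8, u16 = encoded_sizes(ch)
    print(f"U+{ord(ch):04X} {label}: UTF-8={u8} bytes, UTF-16={u16} bytes")
```

The same code point gets a different byte sequence in each encoding, but both cover all of Unicode, so neither one "is" Unicode more than the other.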

Published

November 24, 2005

Nirav Mehta in Technology | November 24, 2005
