Entrepreneur Geek

Nirav Mehta on life, technology and future

Archive for the ‘unicode’ tag

Handling Unicode with PHP

Unicode characters and web services always cause me one problem or another ;-) Working on the PlannerX backend was no different: I spent some good hours fixing Unicode / UTF-8 related issues.

And while I was searching for some solutions, I found an excellent “PHP UTF-8 Cheat Sheet” by Nick Nettleton of DropSend. I highly recommend it if you are going to do anything with PHP and Unicode!

And BTW, be careful when mixing Base64 encoding with UTF-8. It did not work for me!

Written by Nirav

March 27th, 2009 at 4:32 pm

Posted in PHP, Recommended Reading

Removing Word special characters from text / XML in PHP

Microsoft Word converts certain characters into "smart characters": double quotes, dashes (em dash / en dash), bullets and so on.

These characters break PHP's XML handling (or at least they broke it for me, using simplexml_load_string!).

How do you clean them?

There is an old post that suggests using ereg_replace on a set of characters, essentially converting them to HTML entities.

Unfortunately, that did not work for me. Since the text is UTF-8, the replace logic replaced letters too.
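
A likely reason, sketched here under the assumption that the old approach matched single-byte Windows-1252 codes (such as 0x93 for the left double quote): in UTF-8 the same byte values also occur inside multi-byte characters, so a byte-wise replace damages them. The sample string and variable names are made up for illustration, and str_replace stands in for ereg_replace, which was removed in PHP 7.

```php
<?php
// Assumption: the replacement list targets single-byte Windows-1252 codes,
// e.g. 0x93 for the left "smart" double quote.
$cp1252LeftQuote = "\x93";

// In UTF-8, the letter "Ó" (U+00D3) is the two bytes 0xC3 0x93,
// so its second byte has the same value as the Windows-1252 quote.
$utf8Text = "\xC3\x93scar";

// A byte-wise replace cannot tell the difference and rewrites
// the second half of the letter.
$broken = str_replace($cp1252LeftQuote, '&ldquo;', $utf8Text);

// preg_match with the /u modifier only succeeds on valid UTF-8,
// which makes it a cheap validity check.
$isValidBefore = preg_match('//u', $utf8Text) === 1; // input is valid UTF-8
$isValidAfter  = preg_match('//u', $broken) === 1;   // result is no longer valid

var_dump($isValidBefore, $isValidAfter);
```

The lone 0xC3 byte left behind is what makes the result invalid, and that would be enough to upset an XML parser on its own.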

I tried hard to find a solution, but could not find anything that worked. Finally, I just stripped out all non-printing characters except line breaks and tabs.

PHP:
  // Class method: strips everything except printable characters,
  // line breaks and tabs.
  private function cleanWordSpecialCharacters($body)
  {
    $body = preg_replace( '/[^[:print:]\n\r\t]/', '', $body );
    return $body;
  }

This too breaks with non-English characters. Any suggestions?
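
One direction that might work, as a minimal sketch assuming the input is valid UTF-8 and PHP 7 or later: without the /u modifier, [:print:] is matched byte by byte, so the bytes of non-English letters typically get stripped. The sketch instead maps a handful of Word's typographic characters to plain ASCII and then removes only control characters, using a Unicode-aware pattern. The function name and the replacement list are hypothetical and not a complete inventory of what Word produces.

```php
<?php
// Hypothetical helper, not the method from the post above.
function cleanWordSpecialCharactersUtf8(string $body): string
{
    // Map common Word "smart" characters (UTF-8 encoded) to ASCII.
    // Illustrative list only; extend it as needed.
    $map = [
        "\u{2018}" => "'",   // left single quote
        "\u{2019}" => "'",   // right single quote
        "\u{201C}" => '"',   // left double quote
        "\u{201D}" => '"',   // right double quote
        "\u{2013}" => '-',   // en dash
        "\u{2014}" => '-',   // em dash
        "\u{2022}" => '*',   // bullet
        "\u{2026}" => '...', // ellipsis
    ];
    $body = strtr($body, $map);

    // Remove control characters except tab, line feed and carriage return.
    // The /u modifier makes PCRE treat the subject as UTF-8, so letters
    // outside ASCII are left alone.
    $cleaned = preg_replace('/(?![\n\r\t])\p{Cc}/u', '', $body);

    // preg_replace returns null when the input is not valid UTF-8;
    // in that case hand back the mapped text unchanged.
    return $cleaned ?? $body;
}

echo cleanWordSpecialCharactersUtf8("\u{201C}Caf\u{00E9}\u{201D} \u{2014} ok\x07"), "\n";
```

The accented letter survives here because the pattern works on whole characters rather than bytes. If the incoming text is actually Windows-1252 rather than UTF-8, it would need converting first, which this sketch does not attempt.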

Written by Nirav

November 15th, 2008 at 5:58 pm

Posted in PHP
