Entrepreneur Geek

Nirav Mehta on life, technology and future


Removing Word special characters from text / XML in PHP


Microsoft Word converts certain characters into "smart characters": curly double quotes, dashes (em dash / en dash), bullets and so on.

These characters break PHP's XML handling (or at least they broke it for me, using simplexml_load_string!).
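The post does not say which error the parser reported, so here is a rough sketch of one way to find out. It uses libxml's internal error collection, which SimpleXML is built on. The helper name xmlParseErrors and the sample input are made up for illustration. The input assumes Word text that arrived as Windows-1252 bytes (0x93 / 0x94 for the curly quotes) inside a document the parser reads as UTF-8, which may or may not be the situation here. It also assumes PHP 7 or later for the type declarations.

```php
<?php
// Hypothetical helper: returns libxml's error messages for an XML string,
// or an empty array if the string parses cleanly.
function xmlParseErrors(string $xml): array
{
    // Collect parser errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $doc = simplexml_load_string($xml);

    $messages = [];
    if ($doc === false) {
        foreach (libxml_get_errors() as $error) {
            $messages[] = trim($error->message);
        }
    }

    // Restore the previous error-handling state.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $messages;
}

// Assumed input: Windows-1252 smart quotes in a document with no encoding
// declaration, so the parser treats the bytes as UTF-8.
print_r(xmlParseErrors("<p>\x93quoted\x94</p>"));

// A plain ASCII document for comparison.
print_r(xmlParseErrors("<p>plain</p>"));
```

If the messages complain about the input encoding, the fix probably belongs in how the text is converted before parsing rather than in stripping characters.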

How do you clean them?

There is an old post that suggests using ereg_replace on a set of characters, essentially converting them to HTML entities.

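Unfortunately, that did not work for me. Since the text is UTF-8, the replace logic replaced ordinary letters too, presumably because it matches single bytes and the same byte values also occur inside multi-byte UTF-8 letters.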
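A small sketch of how that kind of damage can happen. It assumes the old approach matched single Windows-1252 bytes such as 0x93 (the left curly double quote), which I have not verified against that post. ereg_replace no longer exists in current PHP, so str_replace stands in for it here. The function name replaceWin1252LeftQuote is made up for this example, which assumes PHP 7 or later.

```php
<?php
// Hypothetical stand-in for a byte-wise "smart quote" replacement:
// it swaps the single byte 0x93 for an HTML entity.
function replaceWin1252LeftQuote(string $text): string
{
    return str_replace("\x93", '&ldquo;', $text);
}

// The Cyrillic letter U+0453 is encoded in UTF-8 as the two bytes D1 93,
// so its second byte looks like a Windows-1252 curly quote.
$letter = "\xD1\x93";
$result = replaceWin1252LeftQuote($letter);

var_dump($result === $letter);              // the letter was altered
var_dump(preg_match('//u', $result) === 1); // is the result still valid UTF-8?

// A real UTF-8 left curly quote (U+201C, bytes E2 80 9C) contains no 0x93
// byte, so this replacement does not touch it at all.
$quote = "\xE2\x80\x9C";
var_dump(replaceWin1252LeftQuote($quote) === $quote);
```

So on UTF-8 input a byte-level replacement can miss the characters it was meant to catch and corrupt letters it was never meant to touch.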

I spent a lot of time looking for a solution, but could not find anything that worked. Finally, I just stripped out all non-printing characters except line breaks and tabs.

PHP:
  private function cleanWordSpecialCharacters($body)
  {
    // Strip every byte that is not printable, keeping line breaks and tabs.
    $body = preg_replace( '/[^[:print:]\n\r\t]/', '', $body );
    return $body;
  }

This too breaks with non-English characters. Any suggestions?
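One possible direction, offered as a sketch rather than a tested fix for this exact case: treat the input as UTF-8 throughout. The function name cleanWordCharsUtf8, the choice of characters in the map and their ASCII replacements are all my own assumptions. It needs PHP 7 or later for the "\u{...}" escapes and expects input that is already valid UTF-8.

```php
<?php
// Hypothetical UTF-8-aware variant of the cleanup function.
function cleanWordCharsUtf8(string $body): string
{
    // Map common Word "smart" characters to plain ASCII equivalents.
    $map = [
        "\u{2018}" => "'",   // left single quote
        "\u{2019}" => "'",   // right single quote
        "\u{201C}" => '"',   // left double quote
        "\u{201D}" => '"',   // right double quote
        "\u{2013}" => '-',   // en dash
        "\u{2014}" => '-',   // em dash
        "\u{2026}" => '...', // ellipsis
        "\u{2022}" => '*',   // bullet
    ];
    $body = strtr($body, $map);

    // Remove control characters except tab, line feed and carriage return.
    // The /u modifier makes PCRE read the subject as UTF-8, so multi-byte
    // letters are matched as whole characters and left intact.
    $clean = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x{9F}]/u', '', $body);

    // preg_replace returns null on failure (for example, invalid UTF-8
    // input); in that case return the mapped text without the control strip.
    return $clean ?? $body;
}

echo cleanWordCharsUtf8("\u{201C}caf\u{00E9}\u{201D} \u{2014} ok\x07\n");
```

Because the pattern works on characters rather than bytes, accented and non-Latin letters should pass through unchanged. If the input turns out to be Windows-1252 rather than UTF-8, it would need converting first (for example with iconv) before a function like this is useful.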

Written by Nirav

November 15th, 2008 at 5:58 pm

Posted in PHP
