Entrepreneur Geek

Nirav Mehta on life, technology and future

Removing Word special characters from text / XML in PHP

Microsoft Word converts certain characters into "smart characters": double quotes, dashes (em dash / en dash), bullets, and so on.

These characters break PHP's XML handling (or at least they broke it for me, using simplexml_load_string!).

How do you clean them?

There is an old post that suggests using ereg_replace on a set of characters, essentially converting them to HTML entities.

Unfortunately, that did not work for me. Because the text is UTF-8, the replace logic also replaced ordinary letters, presumably because the byte-oriented ereg_replace matched individual bytes inside multi-byte characters.

I tried hard to find a solution but could not find anything that worked. In the end, I just stripped out all non-printing characters except line breaks and tabs.
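
To illustrate how a byte-wise replace can damage UTF-8 text, here is a minimal sketch. It assumes the old approach targeted Windows-1252 byte values such as 0x93 (the left smart double quote); that detail is my guess, not something the old post is quoted on here. The helper name isValidUtf8 is hypothetical.

```php
<?php
// Hypothetical helper: true when $s is well-formed UTF-8.
// An empty pattern with the /u modifier fails on malformed input.
function isValidUtf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

// "Ó" in UTF-8 is two bytes: 0xC3 0x93.
$original = "\xC3\x93scar";

// Assumption: the cleanup replaces the single byte 0x93 (a Windows-1252
// smart quote). In UTF-8 text that byte is also the second half of "Ó".
$damaged = str_replace("\x93", '"', $original);

var_dump(isValidUtf8($original)); // bool(true)
var_dump(isValidUtf8($damaged));  // bool(false)
```

The second string is no longer valid UTF-8, which is the kind of damage an XML parser will reject.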

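PHP:
  // Removes every byte that [:print:] does not match, keeping line breaks and tabs.
  // There is no /u modifier, so the pattern sees bytes rather than UTF-8 characters.
  private function cleanWordSpecialCharacters($body)
  {
    return preg_replace('/[^[:print:]\n\r\t]/', '', $body);
  }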

This, too, breaks with non-English characters. Any suggestions?
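
One possible direction, offered as a sketch rather than a tested fix for the original problem: replace the common Word characters with plain ASCII using complete UTF-8 sequences, then strip control characters with a Unicode-aware pattern. The function name normalizeWordCharacters and the mapping table are my own assumptions, and the sketch assumes the input really is valid UTF-8 and that PHP 7 or later is available (for the "\u{...}" escapes).

```php
<?php
// Hypothetical helper, not from the original post.
function normalizeWordCharacters(string $body): string
{
    // Assumed mapping; extend it for whatever Word output you actually see.
    $map = [
        "\u{2018}" => "'",   // left single quote
        "\u{2019}" => "'",   // right single quote / apostrophe
        "\u{201C}" => '"',   // left double quote
        "\u{201D}" => '"',   // right double quote
        "\u{2013}" => '-',   // en dash
        "\u{2014}" => '-',   // em dash
        "\u{2026}" => '...', // ellipsis
        "\u{2022}" => '*',   // bullet
    ];

    // strtr() matches each complete multi-byte sequence, so other
    // letters such as accented characters are left untouched.
    $body = strtr($body, $map);

    // With /u the pattern works on code points: drop control characters
    // except line feed, carriage return, and tab.
    $clean = preg_replace('/(?![\n\r\t])\p{Cc}/u', '', $body);

    // preg_replace() returns null on failure (for example malformed UTF-8),
    // in which case the mapped text is returned unchanged.
    return $clean ?? $body;
}

// The bell character (\x07) is removed; the accented letters survive.
echo normalizeWordCharacters("\u{201C}Caf\u{00E9}\u{201D} \u{2013} na\u{00EF}ve\u{2026}\x07\n");
// prints: "Café" - naïve...
```

If the text turns out to be Windows-1252 rather than UTF-8, it would need converting to UTF-8 first; this sketch does not attempt that.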


Written by Nirav

November 15th, 2008 at 5:58 pm

Posted in PHP
