
Removing Word special characters from text / XML in PHP

Microsoft Word converts certain characters into “smart characters”: curly double quotes, dashes (em dash / en dash), bullets and so on.

These characters break PHP’s XML handling (or at least they broke it for me, using simplexml_load_string!).

How do you clean them?

There is an old post that suggests using ereg_replace on a set of characters, which essentially converts them to HTML entities.

Unfortunately, that did not work for me. Since the text is UTF-8, the replace logic replaced ordinary letters too.

I tried hard to find a solution, but could not find anything that worked. In the end, I just stripped out all non-printing characters except line breaks and tabs.
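One plausible explanation, assuming the old approach matched single Windows-1252 bytes such as 0x93 (the left curly quote in that encoding): in UTF-8 the same byte also appears inside multi-byte letters, so a byte-wise replace cuts those letters in half. The sample letter and the use of str_replace below are my own illustration, not the code from that post.

```php
<?php
// "\xD0\x93" is the UTF-8 encoding of the Cyrillic capital letter GHE (U+0413).
$text = "\xD0\x93";

// A byte-oriented replace knows nothing about multi-byte characters,
// so it matches the second byte of the letter.
$broken = str_replace("\x93", '&ldquo;', $text);

echo bin2hex($text), "\n";            // d093
var_dump(preg_match('//u', $text));   // int(1): valid UTF-8
var_dump(preg_match('//u', $broken)); // bool(false): no longer valid UTF-8
```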

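[php]
private function cleanWordSpecialCharacters($body)
{
    // Strip every byte that is not printable, a line break, or a tab.
    // There is no /u modifier, so the match is byte-wise.
    $body = preg_replace('/[^[:print:]\n\r\t]/', '', $body);
    return $body;
}
[/php]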

This too breaks with non-English characters. Any suggestions?
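One direction that might work, as a minimal sketch and assuming the input really is valid UTF-8: replace the common smart characters by their full UTF-8 byte sequences, then remove only control characters using the `u` pattern modifier. The function name cleanWordCharactersUtf8 and the choice of ASCII substitutes are hypothetical, and I have not tried this against real Word output.

```php
<?php
// Hypothetical helper: map common Word "smart" characters to plain ASCII,
// then drop the remaining control characters.
function cleanWordCharactersUtf8(string $body): string
{
    // Keys are spelled out as UTF-8 byte sequences, so the encoding of
    // this source file does not matter.
    $map = [
        "\xE2\x80\x98" => "'",   // U+2018 left single quote
        "\xE2\x80\x99" => "'",   // U+2019 right single quote
        "\xE2\x80\x9C" => '"',   // U+201C left double quote
        "\xE2\x80\x9D" => '"',   // U+201D right double quote
        "\xE2\x80\x93" => '-',   // U+2013 en dash
        "\xE2\x80\x94" => '-',   // U+2014 em dash
        "\xE2\x80\xA6" => '...', // U+2026 ellipsis
        "\xE2\x80\xA2" => '*',   // U+2022 bullet
    ];
    $body = strtr($body, $map);

    // With /u the subject is treated as UTF-8, so \p{Cc} matches whole
    // control characters and leaves accented or non-Latin letters alone.
    // The lookahead keeps tab, line feed and carriage return.
    $cleaned = preg_replace('/(?![\t\n\r])\p{Cc}/u', '', $body);

    // preg_replace returns null if the input is not valid UTF-8;
    // in that case fall back to the mapped string unchanged.
    return $cleaned ?? $body;
}

// "Hi" in curly quotes, an em dash, "café", a stray BEL byte, a line break.
echo cleanWordCharactersUtf8("\xE2\x80\x9CHi\xE2\x80\x9D \xE2\x80\x94 caf\xC3\xA9\x07\n");
```

The mapped characters become plain ASCII and everything else that is not a control character passes through untouched, which is the part the byte-wise version above gets wrong.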
