Removing Word special characters from text / XML in PHP
Microsoft Word converts certain characters into "smart characters": curly double quotes, dashes (em dash / en dash), bullets and so on.
These characters can break PHP's XML handling (or at least they broke it for me, using simplexml_load_string!).
How do you clean them?
There is an old post that suggests using ereg_replace on a set of characters, essentially converting them to HTML entities. (Note that ereg_replace was deprecated in PHP 5.3 and removed in PHP 7.0.)
Unfortunately, that did not work for me. Since the text is UTF-8, the replace logic also mangled ordinary letters: the replacement works on single bytes, and the same byte values occur inside multi-byte UTF-8 characters.
I tried hard to find a solution, but could not find one that worked. In the end, I just stripped out all non-printing characters except line breaks and tabs.
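To illustrate the kind of breakage I mean, here is a minimal sketch. It assumes the old approach replaced single bytes such as 0x93 (the left double quote in Windows-1252); I have not reproduced the old post's exact code, and str_replace stands in for ereg_replace here. It needs PHP 7+ for the "\u{...}" escapes.

```php
<?php
// "Ó" (U+00D3) is stored in UTF-8 as the two bytes 0xC3 0x93.
$text = "\u{00D3}";

// Assumption: a byte-oriented cleanup targets 0x93, the Windows-1252
// left double quote, and turns it into an HTML entity.
$broken = str_replace("\x93", '&ldquo;', $text);

// preg_match with the /u modifier returns 1 for valid UTF-8 subjects
// and false for invalid ones, so it doubles as a validity check.
$validBefore = preg_match('//u', $text) === 1;
$validAfter  = preg_match('//u', $broken) === 1;
```

The second byte of "Ó" gets replaced, leaving a lone 0xC3 followed by the entity text, which is no longer valid UTF-8.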
private function cleanWordSpecialCharacters($body)
{
    return $body;
}
This too breaks with non-English characters. Any suggestions?
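One direction that might be worth trying is to keep everything UTF-8 aware: replace whole multi-byte sequences rather than single bytes, and use the /u modifier on the regex. The sketch below is only that, a sketch. The function name cleanWordSpecialCharactersUtf8 and the replacement map are my own assumptions, it expects valid UTF-8 input, and it needs PHP 7+.

```php
<?php
// Hypothetical helper, not the original function from this post.
// Assumes $body is valid UTF-8.
function cleanWordSpecialCharactersUtf8(string $body): string
{
    // Map some common Word "smart" characters to plain ASCII.
    // The choice of replacements is an assumption; extend as needed.
    $map = [
        "\u{2018}" => "'",   // left single quote
        "\u{2019}" => "'",   // right single quote
        "\u{201C}" => '"',   // left double quote
        "\u{201D}" => '"',   // right double quote
        "\u{2013}" => '-',   // en dash
        "\u{2014}" => '-',   // em dash
        "\u{2026}" => '...', // ellipsis
        "\u{2022}" => '*',   // bullet
    ];
    // strtr matches complete byte sequences, so other multi-byte
    // characters (accented letters etc.) are left untouched.
    $body = strtr($body, $map);

    // Strip control characters except tab (0x09), LF (0x0A) and CR (0x0D).
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    $clean = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/u', '', $body);

    // preg_replace returns null if the subject is not valid UTF-8;
    // fall back to the mapped text in that case.
    return $clean ?? $body;
}
```

For example, passing in a string with curly quotes, an en dash, accented letters and a stray bell character (0x07) should give back straight quotes, a hyphen and the accented letters intact, with the bell character removed.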
