Removing Word special characters from text / XML in PHP
Microsoft Word converts certain characters into "smart characters": curly double quotes, dashes (em dash / en dash), bullets and so on.
These characters break PHP's XML handling (or at least they broke it for me, using simplexml_load_string!).
How do you clean them?
There is an old post that suggests using ereg_replace on a set of characters, essentially converting them to HTML entities.
Unfortunately, that did not work for me. Since the text is UTF-8, the replace logic replaced ordinary letters too.
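One way to avoid touching ordinary letters is to replace whole UTF-8 byte sequences rather than individual bytes. The following is only a minimal sketch under that assumption: the function name replaceSmartCharacters and the ASCII stand-ins are hypothetical choices, not taken from the old post, and it assumes the input really is valid UTF-8.

```php
<?php
// Hypothetical helper, not from the original post: maps common Word "smart"
// characters (written as UTF-8 byte sequences) to plain ASCII stand-ins.
function replaceSmartCharacters($text)
{
    $map = array(
        "\xE2\x80\x98" => "'",   // U+2018 left single quote
        "\xE2\x80\x99" => "'",   // U+2019 right single quote
        "\xE2\x80\x9C" => '"',   // U+201C left double quote
        "\xE2\x80\x9D" => '"',   // U+201D right double quote
        "\xE2\x80\x93" => '-',   // U+2013 en dash
        "\xE2\x80\x94" => '--',  // U+2014 em dash
        "\xE2\x80\xA6" => '...', // U+2026 ellipsis
        "\xE2\x80\xA2" => '*',   // U+2022 bullet
    );

    // strtr() with an array replaces whole substrings, so only the complete
    // three-byte sequences above are matched; other UTF-8 letters are left alone.
    return strtr($text, $map);
}

// Curly quotes and an en dash are replaced, the accented "e" survives.
echo replaceSmartCharacters("\xE2\x80\x9CHello\xE2\x80\x9D \xE2\x80\x93 caf\xC3\xA9"), "\n";
```

If the text turns out to be in a Windows code page rather than UTF-8, this map will not match anything and the input would need converting first.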
I tried hard to find a solution, but could not find anything that worked. In the end, I just stripped out all non-printing characters except line breaks and tabs.
private function cleanWordSpecialCharacters($body)
{
    return $body;
}
This too breaks with non-English characters. Any suggestions?
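One possible direction is to make the stripping step Unicode-aware, so that it removes characters rather than bytes. The sketch below is not the code used above; stripControlCharacters is a hypothetical name, and it assumes the input is UTF-8 and that PCRE was built with Unicode property support.

```php
<?php
// Hypothetical helper, not the original cleanWordSpecialCharacters():
// removes control characters but keeps tabs, line feeds and carriage returns.
function stripControlCharacters($text)
{
    // \p{Cc} is the Unicode "control" category. The /u modifier makes PCRE
    // treat the subject as UTF-8, so multi-byte letters are matched as whole
    // characters instead of byte by byte.
    $clean = preg_replace('/(?![\t\n\r])\p{Cc}/u', '', $text);

    // With /u, preg_replace() returns null when the input is not valid UTF-8;
    // this sketch falls back to the untouched input in that case.
    return $clean === null ? $text : $clean;
}
```

Because the pattern only targets the control category, accented and non-Latin letters pass through unchanged. A null result from preg_replace is also a useful hint that the text is not actually UTF-8.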
