
Fixing bad XML, any recommendations?

I am using the Text_Diff classes in PHP to generate the differences between two XML documents. The output is not always well-formed XML: the tag nesting is sometimes incorrect. This happens because my source files are XML and contain their own tags. When Text_Diff inserts its <ins> and <del> tags around the changed text, a changed span can cross one of the source tags, which breaks the tag hierarchy.
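
To make the failure concrete, here is a minimal sketch of how I understand the problem. It is written in Python only for illustration and is not the Text_Diff code; `wrap_range` and `is_well_formed` are hypothetical helpers, and the sample document is made up. Wrapping a change that stays inside one text node is harmless, while wrapping a change that crosses a tag boundary produces overlapping tags.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def wrap_range(text: str, start: int, end: int, tag: str = "ins") -> str:
    """Naively wrap text[start:end] in <tag>...</tag>, ignoring any markup
    inside the range (an assumed stand-in for what a text-level diff does)."""
    return f"{text[:start]}<{tag}>{text[start:end]}</{tag}>{text[end:]}"


# Made-up sample document.
source = "<doc><p>old text</p><p>more</p></doc>"

# Case 1: the change lies inside a single text node.
inside = wrap_range(source, source.index("old"), source.index(" text"))

# Case 2: the change starts in one <p> and ends in the next one.
start = source.index("text")
end = source.index("more") + len("more")
crossing = wrap_range(source, start, end)

print(inside, is_well_formed(inside))
print(crossing, is_well_formed(crossing))
```

In the second case the `<ins>` element opens inside the first `<p>` and closes inside the second, so a parser rejects the result. This is the kind of output I need to repair.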

I am looking for a clean, fast and safe way to fix such invalid XML.

I have looked at Tidy, its PHP library, and htmLawed. I like htmLawed because it is a pure PHP implementation, but I don't know how fast it is compared to Tidy. Moreover, I need an XML cleaner, not necessarily an XHTML cleaner, so even if I use these libraries I will have to strip the HTML parts out of the output.

Do you have any suggestions / recommendations?
