Remove illegal XML characters when converting HTML to XML
There are certain Unicode characters that are prohibited by the XML spec. I've written the following method, which should remove these characters from a document:
String cleanHtml(String source) {
    Document document = Jsoup.parse(source);
    document.outputSettings().syntax(Document.OutputSettings.Syntax.xml);
    return document.html();
}
If I test this using the following HTML input
| Field Value | before after |
The XML entities representing illegal Unicode characters are removed, and the resulting document can be parsed by an XML parser. However, if I add 
then parsing the String returned by cleanHtml as XML throws the following exception:
org.xml.sax.SAXParseException; lineNumber: 10; columnNumber: 17; An invalid XML character (Unicode: 0xb) was found in the element content of the document.
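The exception points at a literal U+000B (vertical tab) in the output, which falls outside the XML 1.0 Char production. One possible workaround, sketched below, is to post-process the string returned by cleanHtml and drop every code point that XML 1.0 does not allow. The class XmlCharFilter and the method stripIllegalXmlChars are hypothetical names introduced for this illustration; they are not part of jsoup, and this is a sketch under that assumption rather than a confirmed fix.

```java
import java.util.regex.Pattern;

/**
 * Hypothetical helper (not part of jsoup): removes every code point that
 * falls outside the XML 1.0 "Char" production.
 */
final class XmlCharFilter {

    // XML 1.0 allows #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD and
    // #x10000-#x10FFFF. The negated character class matches everything else,
    // including unpaired surrogates.
    private static final Pattern ILLEGAL_XML_10_CHARS = Pattern.compile(
            "[^\\t\\n\\r\\x{20}-\\x{D7FF}\\x{E000}-\\x{FFFD}\\x{10000}-\\x{10FFFF}]");

    private XmlCharFilter() {
        // Static utility class, not meant to be instantiated.
    }

    /** Returns the input with all characters that are illegal in XML 1.0 removed. */
    static String stripIllegalXmlChars(String text) {
        return ILLEGAL_XML_10_CHARS.matcher(text).replaceAll("");
    }
}
```

Under this assumption, cleanHtml could end with return XmlCharFilter.stripIllegalXmlChars(document.html()); instead of returning document.html() directly. Note that the sketch only removes literal characters. A numeric character reference to an illegal character that is still present in escaped form in the output would pass through unchanged and would still be rejected by an XML parser.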