Remove illegal XML characters when converting HTML to XML

Certain Unicode characters are prohibited by the XML spec. I've written the following method, which should strip these characters from a document:

    String cleanHtml(String source) {
        Document document = Jsoup.parse(source);
        document.outputSettings().syntax(Document.OutputSettings.Syntax.xml);
        return document.html();
    }

If I test this using the following HTML input:

Field Value before after

The XML entities representing illegal Unicode characters are removed, and the resulting document can be parsed by an XML parser. However, if I add 

before after

then the String returned by cleanHtml throws the following exception when parsed as XML:

org.xml.sax.SAXParseException; lineNumber: 10; columnNumber: 17; An invalid XML character (Unicode: 0xb) was found in the element content of the document.
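
For reference, the XML 1.0 spec allows only these code points in a document: #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD and #x10000-#x10FFFF. The character in the exception, 0xb (vertical tab), is outside that set. Below is a minimal sketch of a filter that drops everything else from a string. The class name XmlCharFilter and both method names are made up for illustration and are not part of jsoup or any XML library. It only removes literal characters; a numeric character reference that is still written out as text (such as &#11;) would pass through untouched.

```java
// Hypothetical helper, not part of jsoup: keeps only code points allowed by
// the XML 1.0 "Char" production.
class XmlCharFilter {

    // True if the code point may appear in an XML 1.0 document.
    static boolean isLegalXml10(int codePoint) {
        return codePoint == 0x9
                || codePoint == 0xA
                || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    // Returns a copy of the text with every illegal code point removed.
    static String stripIllegalXmlChars(String text) {
        StringBuilder cleaned = new StringBuilder(text.length());
        // Iterate by code point so supplementary characters are checked as a
        // whole; unpaired surrogates fall in D800-DFFF and are dropped.
        text.codePoints()
                .filter(XmlCharFilter::isLegalXml10)
                .forEach(cleaned::appendCodePoint);
        return cleaned.toString();
    }
}
```

Assuming the offending character ends up as a literal in the serialized output, one option would be to run the string returned by cleanHtml through XmlCharFilter.stripIllegalXmlChars before handing it to the XML parser.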