Its the first release of 2014, and its got a nice patch from Rafael that makes it run a lot faster (who knew that just checking whether a String is a valid Double in the XPath processor would cause so much stress?) and another patch from Chris that makes it output proper XML ID attributes in DOM.
My contributions this time around were more enhancements to “foreign markup” handling, which is important when cleaning up HTML that contains valid SVG or schema.org content. HtmlCleaner wasn’t really written with that sort of use in mind when Vladimir started on it back in 2006, so it involved a fair bit of wrangling, but I think we’re nearly there now.
Its good to see the number of contributions going up – last release we had an entire GUI contributed by Marton – and I think having a faster release schedule is helping with that. Hopefully one day I’ll be able to make releases that just consist of having applied other people’s patches 🙂
HtmlCleaner is HTML parser written in Java. It transforms dirty HTML to well-formed XML following the same rules that the most web-browsers use. Download HtmlCleaner 2.8 here, or you can get it from Maven Central.