How are things going with HtmlCleaner?

HtmlCleaner is the FOSS project I’ve been maintaining since 2013. So, how is it going so far?

Downloads and users

I can get download stats from Sourceforge, which hosts the binaries, but perhaps a better perspective is the number of times the project is being downloaded via Maven Central as a component of other projects – Sonatype handily provide these statistics.

Last month (June 2015), there were 1,184 downloads from Sourceforge. There were also 10,468 downloads from Maven Central. Its averaged around that number per month over the year. Sourceforge downloads were at their peak in 2014 when they hit 1500 a month; but have been stable at 1000 or so a month since.

How meaningful these statistics are I’m not sure; they do seem to show a pretty stable level of interest in HtmlCleaner, which is encouraging. Also, a total of 130,000 downloads a year seems like a lot – there must be quite a few users out there!


Although we haven’t added more committers to the project (something I’m keen to do) we have had a lot more patches over the past year being submitted by users and included in releases. Most recent releases have included at least one user-submitted patch.

(My general philosophy on patches is that, if they work and are well tested, they go in – I don’t have any sort of ideological preferences for how code is written or whether I think a feature is necessary; if a user wants something to the extent they create a patch for it, its a valid feature request by definition.)

We have had lots of users submitting bugs and questions too, which is a great sign. (No, really, I like seeing bug reports! Only software with no users has no visible bugs…)


Release frequency has been a bit patchy, something thats entirely my fault as all kinds of other priorities get in the way. We’ve had 3 releases so far in 2015, 3 in 2014, and 6 in 2013. Still before that there was only one in 2010 and two in 2008 so we’re still doing well!

The Code

We added some new features this year, finally updating to full Html 5 tag set support, and adding some much nicer command-line operations. However, I think we’re getting close to the time that I need to strip down and rebuild the engine for a 3.0 version as we’re coming up against the limits of tweaking the existing engine.

If cleaning up shoddy HTML is something that interests you, pop along to HtmlCleaner and help out!

Posted in development, open source | Tagged | Leave a comment

Creating a simple SQL query editor

I recently needed to add a simple visual interface for building SQL queries based on existing schema models – you know the sort of thing, a bit like the UI you get in Access, but running in the browser.

I really, really tried to find something out there that was open source, lightweight, easy to integrate, and preferably client-side only. Well, the best I could find was maybe 3 stars out of 4.

So I ended up building one from scratch using JQueryUI and JsPlumb. I’m quite pleased with the result:

screenshot of visual query editor

You can see a live version over at BitBalloon. The whole thing should be available under an open source license and up on Github soon – it really is a very simple bit of scripting.

Posted in development, open source | Leave a comment

Going old skool with SQL: from Access to PostgreSQL

Well, after several years where I mostly worked on Redis, MongoDB and client-side browser storage, I’m currently working on a project that is most definitely old-skool SQL. So I spent much of today writing code to handle importing Access databases into PostgreSQL with all the nasty little details that involves.

I’ve now got a two-stage import process with preflight checks working with some of the truly awful Access databases my client sent over – I asked for the worst ones they had to hand and they didn’t disappoint. They are full of gnarly nastiness like OLE fields, bad keys, duplicate rows, weird Access-specific stuff like Switchboard, dangling references and more. Still, nothing better to tax my tests.

Jackcess has proved itself very useful, as you might imagine, but a lot comes down to how the transactions are orchestrated to handle all the kinds of errors that happen when real-world access data is turned into Postgres commands.

I contemplated importing the data before setting up constraints, but in the end went with deferrable constraints and running all the batches within transactions. The only downside is I’ve had to opt for a two-stage process where I exclude the worst offending tables in a first pass preflight check that I then rollback, after which I commit the final import in a second pass.

(The reason for this is basically that if a statement throws a batch exception, I have to either commit or rollback the transaction and start again; if I commit then start a new transaction, I lose any data relying on referential integrity checks from the previous transaction. I can’t just try to continue the transaction after an error – the driver quite rightly tells you off for that sort of thing)

Also, the lack of an UPSERT command or IGNORE directive in PostgreSQL is a real pain. Glad thats on its way in 9.5.

Still, there’s time to try a different tack if this proves problematic – right now I’m just happy I’m able to extract a working schema and good quantity of valid data from these basket cases databases.

It also makes a change from all that REST and JSON stuff I mostly seem to have done in the past few years :)

Posted in development | Leave a comment

HtmlCleaner 2.10 released

Get it here!

Posted in Uncategorized | Leave a comment

Tracking student blogs using Google Spreadsheets and WordPress

I’m currently teaching a first year module on Open Source Software, one of the requirements for which is students write their findings up as blog posts. For that reason I thought it might be useful to be able to keep track of how much my students have been writing. Given there are 80 students on the module, some automation here would also be useful!

I remembered an impressive set of techniques developed by Martin Hawksey for tracking Jisc project blogs, and so I decided to use these as a starting point.

I created a Google Spreadsheet, and a form for capturing each student’s name and blog URL. I then added a couple of scripts – based on Martin’s examples above – to retrieve the WordPress RSS feed for each blog, and put the post count and the date of the last post into cells in the same row.

Now, WordPress only gives you a maximum of 10 entries in a feed by default, but for my purposes thats still enough to get a sense of which students are struggling with their writing tasks. I just use some conditional formatting to show me anyone who hasn’t posted anything, and anyone who has made less than 5 posts this semester.

I’ve experimented with some other layouts, for example using D3.js visualisations, but just a list of blogs with some red-amber-green coding seems to be the most practical.

Another benefit of having the list of blogs in a spreadsheet is it made it quite simple to generate an OPML file to share with students to import into WordPress Reader and  follow everyone else on the course.

One limitation is I don’t seem to be able to get the functions to automatically be added to each new row created by the form – I have to paste them over the new rows. Still, overall its not a bad solution.

You can see a copy of my spreadsheet here (without the actual student blogs on it), the form for collecting blog URLs, and you can get the scripts from Github.

Posted in mashups | 1 Comment

HtmlCleaner 2.9

I’ve finally released version 2.9 of HtmlCleaner!

This month I also had to answer my first ever official support request for HtmlCleaner in my “day job” – it appears a researcher at the University of Oxford is using it as part of a project analysing legislation passed by the State Duma of the Russian Federation :)

Posted in Uncategorized | 1 Comment

Looking at Kwan and West’s Open Source Adoption Model

In my last post I looked at the first part of Kwan and West’s paper, which concerns the phases of technology adoption. Now I’m turning to the second part, which concerns the adoption model.

Kwan and West propose the following model for open source adoption:

Open Source Adoption - Kwan and West 2005

The model has two decision points: the first where the choice set of applicable solutions is identified, and the second where a selection is made from the set available.

What I find interesting here is how strongly they position software policy in the first decision point. They I think rightly identify policy as being at least partly reactive; that policy is determined to some extent not by the intentions of the organisation so much as its past history (the feedback loop in the diagram). There is also the contextual influence of the culture of the firm and the industry it sits within.

Within UK higher education we have a reasonable data set for how open source is represented in software policies from 10 years of OSS Watch surveys:

open source in software policies

To some extent “policy” is a little bit misleading in Kwan and West’s paper as they elaborate this as not just formal policies but also the “pattern of attitudes and decisions that constitute a de facto policy”, so could be said to encompass institutional cultural attitudes.

(In practice I’ve found that in higher education, policy is set at an organisational level, but cultural attitudes towards, and awareness of open source can vary by department or even by group within a department.)

The second factor influencing the choice set is the application context, which is where we refer back to the model of phases of strategic importance to determine the relative weighting of features, risks, and costs.

Finally, there is the set of available products that can be considered.

These three factors determine the choice set, which are then compared in the second decision point for selection.

In my experience the first decision  point, of determining the choice set, is where many open source candidate solutions tend to be excluded; for quite a few different reasons:

  • The process of procurement is primarily passive in nature, expecting suppliers to respond to RFPs at their own expense, a route that tends to favour suppliers of proprietary systems
  • The organisation has low awareness of open source options, and does not look beyond its set of usual suppliers when identifying available products
  • The procuring organisation has a set view on the composition of the solution, requiring a single product from a single supplier, for example, rather than considering an integration of different components to perform the same functions.

When we looked at the OSS Watch survey results for 2013, we found that a perception that “there isn’t a solution that meets our needs” was the overall number one supply-side reason given for deciding against open source, while interoperability and migration issues were the number one demand-side reason.

For me this indicates that there is perhaps something missing from Kwan and West’s model, or that perhaps the “policy” component is trying to cover too many different factors. For example, would a passive-RFP approach rather than an active discovery approach fit within “policy” or is it an influence on “available products”? Also, where do constraints such as contractual obligations and interoperability issues (particularly lock-in) reside in the model?

So, thats the first decision point. The second decision point is I think fairly standard, though here the “selection metrics” factor has some special considerations for open source that can be overlooked.

When I talk to organisations about procurement (with my OSS Watch hat on) I tend to talk a fair amount about sustainability, and in particular how to evaluate open source projects for their likely future viability. This isn’t necessarily a criteria used in traditional selection metrics, or if it is, it isn’t one which is easy to evaluate or have good data for.

Another aspect of selection is pre-sales activities such as demonstrations and pilot systems; again there is sometimes a problem here with procuring organisation’s expectations being conditioned to how closed-source companies tend to work, and this isn’t included anywhere in the model.  I guess generally Kwan and West have operated from a presumption of general good procurement practice by the organisation, whereas in the context I’m interested in (UK HE) I’m very much aware of the existence of gaps between policy, process and practice, and the impact this has, and so can’t really exclude them from the model.

Finally, I think the model is suited to procurement of systems at all phases of strategic alignment apart from strategic systems; in these cases I think its more likely that a solution does not already exist, but has to be created in partnership with suppliers as a “pre-procurement” activity. However, as I’ve already pretty much concluded that there is no strategic IT in HE, its not something I need to dwell on.


I think Kwan and West have made a good stab at a useful conceptual model here, and in particular I think the strategic alignment phases concept elaborated in the first half of the chapter is a very useful one. The adoption model has some good features, but seems to miss out or have insufficient emphasis on factors that I know do affect selection practices, so it would require some further tweaking before I could make good use of it. The authors have done a fair amount of follow up work which I’ve also put on my reading list, and hopefully they revisited the model.

Posted in open source, phd, Uncategorized | Leave a comment

Is there such a thing as “strategic IT” in Higher Education?

In their 2005 paper, A conceptual model for enterprise adoption of open source software [1], Kwan and West use as a starting point the “Strategic Grid” developed originally by McFarlan, McKenney and Pyburn (1983) [2] which divides firms or divisions of a company into four categories depending on how valuable IT is to the performance of the division:

  • Strategic
  • Factory
  • Support
  • Turnaround

However, Kwan and West contend that not all IT systems are of equal importance within a division or company – just because IT is “strategic” to a division doesn’t mean all of its IT systems are!

So Kwan and West adapt the model as a set of stages of business alignment for individual systems:

  • Strategic
  • Mission Critical
  • Support
  • Laboratory

Strategic systems are those that provide actual competitive advantage over competitors; mission critical systems are those needed to support the operation of the business, and whose interruption can cause loss of revenue; support systems provide business value through internal efficiency; and finally laboratory systems are non-production systems used to pilot innovations and experiment with new technologies.

Why “stages” and not “categories”? Well, the authors point out that many IT systems in organisation exhibit strategic shifts over time – software originally deployed as a pilot at the laboratory stage may be implemented as a strategic system; then eventually the system becomes commodified in the market and downshifts to being mission critical.

It seems a pretty sensible model – but how does it look when applied to IT in the Higher Education sector? While the model seems reasonably straightforward, applying it to actual deployed systems is not. Some of the systems deployed in a typical HE institution include:

  • Student information system
  • Library management system
  • Research management
  • Learning management system (VLE)
  • Network management system
  • User and account management system
  • Timetabling and enterprise resource management
  • Human resources
  • Finance information system
  • Mail and groupware systems
  • Content management system
  • Institutional repository
  • Business intelligence and analytics system
  • Curriculum management system

… and thats before we get into the various blogs, wikis, specialized research tools and teaching tools.

What I find striking is that while some of these are undoubtedly mission critical (the student information system in particular), its difficult to make a case for any systems as being strategic using Kwan and West’s definition. Do any of these systems provide actual (not imagined) competitive advantage? I think business intelligence and analytics is frequently pitched as such, but in practice would support internal efficiencies in many institutions.

For some institutions that specialise in online delivery the VLE/LMS might be considered strategic; and for the Open University both the LMS and content management may have a more strategic role than for a typical university, but even there its not 100% clear.

So, is it true that “IT doesn’t matter”as Carr (2003) claimed [3]? Carr argues that information technologies, like previous technologies, become ubiquitous and commodified over time. Therefore, they are now strategically unimportant because individual organisations can no longer gain a competitive advantage from IT.

McCredie (2003) considered Carr’s views from a HE perspective [4] and concurred that IT in education is “strategically invisible”, being a requirement to stay in business but not offering competitive differentiation. However, he does contend that “If staying in business is not a strategic concern, I do not understand the meaning of the term” which is certainly true; though it does still place IT systems in the business critical rather than the strategic phase of the Kwan and West model.

However, McCredie also points out that, if current IT systems are not strategic, this doesn’t rule out the development of future systems that will provide competitive advantage. 

(It may be argued that MOOCs are an attempt to elevate learning management to have strategic value within a modified higher education business model. If the business of the organisation  — e.g. Coursera, Udacity, EdX — is fundamentally about the online delivery of teaching, then the system deployed may indeed be strategic.)

So what does this imply?

Well,  Kwan and West go further than just identifying the phases of IT systems, they also look at the tradeoffs made in adoption between features, cost and risk as being dependent of the strategic positioning of the system.

For example, for business critical phase systems, organisations will trade off cost and features to minimise risks; for support systems organisations will trade off features and risk to minimise costs, and for strategic systems they are willing to trade off costs to maximise features. (Features being defined here as “everything that isn’t risk or cost”)

For laboratory systems its less clear cut as it depends on the prospective phase of the system being piloted.

This implies that most IT systems in Higher Education will be either mission-critical systems primarily selected for their relatively low levels of risk, and support systems selected for their low cost (in TCO terms). Or, to put it another way, the characteristics that institutions are looking for most in their IT systems are reliability and cost-effectiveness; so the key innovations will be those that enable lower costs or improved reliability – such as virtualisation and software-as-a-service – rather than new capabilities.

One could also put this the other way around and state that, as HE organisations are typically not interested in radically changing their business models, it is unlikely that they will be looking for IT to deliver strategic value, but instead to improve the reliability and costs associated with their operations. Instead, systems that offer new capabilities will be adopted by organisations that are looking to either disrupt the existing market, or operate within a niche where IT can offer strategic value. (The one potential exception is business intelligence and analytics, but we’ve yet to see examples of actual competitive advantage being realised; its also unclear how long such an advantage would persist before the technology is commodified).

Given this context, it would seem that the future of IT in higher education will be focussed on realising the benefits of commodification.

This includes migrating more support systems from expensive proprietary products, either to open source technologies with lower TCO (e.g. moving more content management to Drupal) or to commodity software-as-a-service offerings. It will also include realising  reliability and cost improvements from moving more systems to take advantage of commodity cloud computing.

I’m about to embark on a PhD looking at adoption of open source technologies in Higher Education, so I’m catching up on my reading by doing some blog posts on interesting papers on the topic. So there’ll be more posts like this in the coming months.

The second half of Kwan and West’s paper deals with creating a model for open source adoption using this background, but I’ll leave that for another post.

[1] Kwan, S. K., & West, J. (2005). A conceptual model for enterprise adoption of open source software. The standards edge: Open season, 51-62.

[2] McFarlan, F. W., McKenney, J. L., & Pyburn, P. (1983). The information archipelago-plotting a course.

[3] Carr, N. G. (2003). IT doesn’t matter. Harvard Business Review, May 2003.

[4] McCredie, J. (2003). Does IT matter to higher educationEducause Review,38(6), 14-22.

Posted in open source, phd | 2 Comments

DOMImplementation hates HTML5

This doesn’t work:

documentType = impl.createDocumentType("html", "", "");
document = impl.createDocument(null, "html", documentType);

assertEquals("head", document.getChildNodes().item(0).getChildNodes().item(0).getNodeName());

In fact, it just silently fails to add any child nodes. No exceptions, nada.

This gives the same result:

documentType = impl.createDocumentType("html", null, null);
document = impl.createDocument(null, "html", documentType);

assertEquals("head", document.getChildNodes().item(0).getChildNodes().item(0).getNodeName());

But *this* does work:

documentType = impl.createDocumentType("html", " ", " ");
document = impl.createDocument(null, "html", documentType);

assertEquals("head", document.getChildNodes().item(0).getChildNodes().item(0).getNodeName());


Posted in Uncategorized | Leave a comment

HtmlCleaner 2.8 is out

Its the first release of 2014, and its got a nice patch from Rafael that makes it run a lot faster (who knew that just checking whether a String is a valid Double in the XPath processor would cause so much stress?) and another patch from Chris that makes it output proper XML ID attributes in DOM.

My contributions this time around were more enhancements to “foreign markup” handling, which is important when cleaning up HTML that contains valid SVG or content. HtmlCleaner wasn’t really written with that sort of use in mind when Vladimir started on it back in 2006, so it involved a fair bit of wrangling, but I think we’re nearly there now.

Its good to see the number of contributions going up – last release we had an entire GUI contributed by Marton – and I think having a faster release schedule is helping with that. Hopefully one day I’ll be able to make releases that just consist of having applied other people’s patches :)

HtmlCleaner is HTML parser written in Java. It transforms dirty HTML to well-formed XML following the same rules that the most web-browsers use. Download HtmlCleaner 2.8 here, or you can get it from Maven Central.

Posted in Uncategorized | Leave a comment