Weekly Update May 8-12

Weekly update time! So what have I been up to? Actually, it’s been a pretty quiet week.

Working on agile business intelligence with one of my Jisc Analytics Labs teams, getting them up to speed with data reshaping in Alteryx during our face-to-face meeting in Salford, and supporting another team with Tableau. I’m off to London next week for their face-to-face meeting, so will do a fair amount of coaching then too.

Discussing a proposal for some open source work with the OECD.

Chatting with some lovely people at Microsoft about our use of the Azure platform. Azure is usually my first choice when it comes to cloud services these days.

Releasing HtmlCleaner 2.21. We had a regression that wasn’t picked up by the rather extensive test suite.

Participating in a video interview for a MOOC being developed for the University of Edinburgh – the whole course is aimed at businesses and non-profits on utilising open resources, but my bit was on the subject of open source and commercialisation. So I spent an afternoon under the lights being interviewed for four weeks’ worth of course videos.

Watching season 2 of The Killing. Nearly done!

Growing things in the garden. Potting on my two heritage perennial kales, sowing more lettuces, and eagerly watching the chillies, peppers and tomatoes grow in the greenhouse.

Preparing our rather splendid new online research data service for launch soon.

Enrolling on the Stanford machine learning MOOC.

Reading the Fables graphic novel.

Meeting (virtually) with my business partners at Cetis LLP, and welcoming some new partners into the cooperative. So much going on in the company right now, and still no time to update the website since we launched over two years ago. We’re also looking at joining up with other technology-focussed coops in the UK.

Ich Lerne noch Deutsch. 45%! Danke, Duolingo.

Visiting my eldest daughter’s school for parents evening. They think she’s awesome, but we knew that already.

In case you don’t know me, I’m a consultant and partner in Cetis LLP, specialising in BI/analytics, open source software, and technology discovery and pre-procurement. Always interested in new clients, so feel free to get in touch.

Posted in Uncategorized | 1 Comment

Weekly Update May 1-5

Another week goes by … this week I’ve been

Working with my Analytics Labs teams once more. I stood in as Scrum Master for my PGT analysis team, took charge of the Trello board, and helped out with various Tableau-related queries.

Organising a video interview for next week on open source.

Tinkering with machine learning in Weka. This week I’ve been creating some classification models. I’ve started training a model using the UNISTATS dataset with the J48 decision tree algorithm to try to predict future NSS scores based on course structure.

Releasing HtmlCleaner version 2.20. I had 12,000 downloads last month from Maven Central, and 445 direct from Sourceforge.

Watching season 2 of The Killing. No spoilers, please!

Learning German. Jetzt ich kann sage “ich habe ein wichtig Ente”. Vielen danke, Duolingo.

Pondering a bid to the UfI VocTech Seed funding challenge. We have a very cool ready-to-go project on employability skills, but is it a good fit?

Developing the venerable Cetis/K-Int Content Transcoder, a service-based tool for converting various types of e-learning content formats, following a tip off from my Cetis colleague @wilm that people still need it. I’ve created a new fork with updated dependencies. Next, I’ll rewrite it to run as a desktop app as well as an online app.

Potting on various plants in the greenhouse.

Preparing our rather splendid new online research data service for launch. More on this soon…

Reading the Heart of Darkness by Joseph Conrad.

Posted in Uncategorized | Leave a comment

Weekly Update Apr 24-28

I’m going to try to blog some weekly updates, inspired by Doug. Let’s see how it goes, eh?

This week I’ve been:

Working with a team of HEI data analysts looking into Taught Postgraduate courses as part of Analytics Labs, a fabulous programme from Jisc and HESA that our company is supporting. As usual I’ve been sourcing data and reshaping it for analysis using Alteryx.

Working with another team of HEI data analysts from around the UK looking at apprenticeships, and also estates data. Once again I get to play with the Working Futures datasets! Plus I get to use the estates returns for the first time, possibly to connect up with some open LA planning datasets.

Talking with University of Edinburgh about providing subject matter expertise on FOSS for a new MOOC they’re running.

Talking with researchers at the University of Manchester on making some interesting health data analytics software they’ve developed into a sustainable open source project, under the OSS Watch mantle.

Tinkering with machine learning using Weka.

Working with Manchester Metropolitan University on planning the student systems integration for their major change programme.

Developing an improved attribute handling process for HtmlCleaner with input from community members.

Playing with Mastodon (the microblogging platform, not the band.)

Watching The Killing. OK, everyone else has seen it, but we only just got Netflix in our house.

Learning German. (Zufolge nach Duolingo, ich kenne 45%! Ja, wirklich…)

Growing tomatoes, chillies, and peppers in the greenhouse. Potatoes, garlic, shallots and peas are doing nicely in containers.

Wow, I’ve been busier than I thought. Maybe I’ll keep doing this.


Busy busy busy!


OK, I admit it, I’ve been ignoring this blog for way too long! But that doesn’t mean nothing has been happening – quite the opposite.

About a year ago I became one of the founding partners of Cetis LLP, and since then have been extremely busy working for a range of clients on everything from student systems integration and business intelligence, to research data apps and digital credentialing.

I’m still very much involved in open source and open standards – they’re a key ingredient in almost every solution I’ve been involved in. I’ve also fitted in a small amount of teaching and training on these topics, but haven’t been able to keep up with any academic writing.

My next challenge is working with partners to build our own cloud-based data service offering. Watch this space! Just not too closely, as I may be too busy to write anything for a while 🙂

bee photo by Bill Damon used under CC-BY


How are things going with HtmlCleaner?

HtmlCleaner is the FOSS project I’ve been maintaining since 2013. So, how is it going so far?

Downloads and users

I can get download stats from Sourceforge, which hosts the binaries, but perhaps a better perspective is the number of times the project is being downloaded via Maven Central as a component of other projects – Sonatype handily provide these statistics.

Last month (June 2015), there were 1,184 downloads from Sourceforge. There were also 10,468 downloads from Maven Central. It’s averaged around that number per month over the year. Sourceforge downloads were at their peak in 2014 when they hit 1,500 a month, but have been stable at 1,000 or so a month since.

How meaningful these statistics are I’m not sure; they do seem to show a pretty stable level of interest in HtmlCleaner, which is encouraging. Also, a total of 130,000 downloads a year seems like a lot – there must be quite a few users out there!


Although we haven’t added more committers to the project (something I’m keen to do) we have had a lot more patches over the past year being submitted by users and included in releases. Most recent releases have included at least one user-submitted patch.

(My general philosophy on patches is that, if they work and are well tested, they go in – I don’t have any sort of ideological preferences for how code is written or whether I think a feature is necessary; if a user wants something to the extent they create a patch for it, it’s a valid feature request by definition.)

We have had lots of users submitting bugs and questions too, which is a great sign. (No, really, I like seeing bug reports! Only software with no users has no visible bugs…)


Release frequency has been a bit patchy, something that’s entirely my fault as all kinds of other priorities get in the way. We’ve had 3 releases so far in 2015, 3 in 2014, and 6 in 2013. Still, before that there was only one in 2010 and two in 2008, so we’re still doing well!

The Code

We added some new features this year, finally updating to full HTML5 tag set support, and adding some much nicer command-line operations. However, I think we’re getting close to the time when I need to strip down and rebuild the engine for a 3.0 version, as we’re coming up against the limits of tweaking the existing engine.

If cleaning up shoddy HTML is something that interests you, pop along to HtmlCleaner and help out!

Posted in development, open source | Tagged | Leave a comment

Creating a simple SQL query editor

I recently needed to add a simple visual interface for building SQL queries based on existing schema models – you know the sort of thing, a bit like the UI you get in Access, but running in the browser.

I really, really tried to find something out there that was open source, lightweight, easy to integrate, and preferably client-side only. Well, the best I could find was maybe 3 stars out of 4.

So I ended up building one from scratch using jQuery UI and jsPlumb. I’m quite pleased with the result:

screenshot of visual query editor

You can see a live version over at BitBalloon. The whole thing should be available under an open source license and up on GitHub soon – it really is a very simple bit of scripting.

Posted in development, open source | Leave a comment

Going old skool with SQL: from Access to PostgreSQL

Well, after several years where I mostly worked on Redis, MongoDB and client-side browser storage, I’m currently working on a project that is most definitely old-skool SQL. So I spent much of today writing code to handle importing Access databases into PostgreSQL with all the nasty little details that involves.

I’ve now got a two-stage import process with preflight checks working with some of the truly awful Access databases my client sent over – I asked for the worst ones they had to hand and they didn’t disappoint. They are full of gnarly nastiness like OLE fields, bad keys, duplicate rows, weird Access-specific stuff like Switchboard, dangling references and more. Still, nothing better to tax my tests.

Jackcess has proved itself very useful, as you might imagine, but a lot comes down to how the transactions are orchestrated to handle all the kinds of errors that happen when real-world Access data is turned into Postgres commands.

I contemplated importing the data before setting up constraints, but in the end went with deferrable constraints and running all the batches within transactions. The only downside is I’ve had to opt for a two-stage process where I exclude the worst offending tables in a first-pass preflight check that I then roll back, after which I commit the final import in a second pass.

(The reason for this is basically that if a statement throws a batch exception, I have to either commit or roll back the transaction and start again; if I commit then start a new transaction, I lose any data relying on referential integrity checks from the previous transaction. I can’t just try to continue the transaction after an error – the driver quite rightly tells you off for that sort of thing.)
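To make the two-pass idea concrete, here is a minimal sketch in Python. It is not the importer described above: it uses SQLite from the standard library as a stand-in for PostgreSQL, and the schema, table names and the `connect`, `load`, `preflight` and `final_import` helpers are all hypothetical. The preflight pass restarts the transaction whenever a table fails, collects the offenders, and is always rolled back; the second pass loads only the surviving tables and commits. Because the preflight never commits, it only catches errors raised at insert time (duplicate keys, NULLs), not deferred foreign key violations, which surface at commit in the second pass.

```python
import sqlite3

# Hypothetical schema: the foreign key is deferred, so it is only checked at COMMIT.
SCHEMA = """
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL
        REFERENCES customer(id) DEFERRABLE INITIALLY DEFERRED
);
CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT NOT NULL);
"""


def connect():
    # isolation_level=None leaves BEGIN/COMMIT/ROLLBACK entirely to us.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def load(conn, table, rows):
    # Table names come from our own schema here, never from user input.
    if rows:
        marks = ", ".join("?" * len(rows[0]))
        conn.executemany(f"INSERT INTO {table} VALUES ({marks})", rows)


def preflight(conn, batches):
    """Pass 1: find tables that cannot be loaded; nothing is ever committed."""
    bad = set()
    while True:
        current = None
        conn.execute("BEGIN")
        try:
            for table, rows in batches.items():
                if table not in bad:
                    current = table
                    load(conn, table, rows)
        except sqlite3.DatabaseError:
            # Mimic the PostgreSQL behaviour: after an error, roll back and start again.
            bad.add(current)
            conn.execute("ROLLBACK")
            continue
        conn.execute("ROLLBACK")
        return bad


def final_import(conn, batches, bad):
    """Pass 2: load only the tables that survived preflight, then commit."""
    conn.execute("BEGIN")
    try:
        for table, rows in batches.items():
            if table not in bad:
                load(conn, table, rows)
        conn.execute("COMMIT")  # deferred foreign keys are checked here
    except sqlite3.DatabaseError:
        conn.execute("ROLLBACK")
        raise


# Child rows arrive before their parents, and "notes" has a duplicate key.
batches = {
    "orders": [(1, 10), (2, 11)],
    "customer": [(10, "Ada"), (11, "Grace")],
    "notes": [(1, "ok"), (1, "duplicate key")],
}
conn = connect()
bad = preflight(conn, batches)
final_import(conn, batches, bad)
print("excluded tables:", sorted(bad))
```

The deferred constraint is what lets `orders` load before `customer` within one transaction; without it the batches would need to be sorted into dependency order first.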

Also, the lack of an UPSERT command or IGNORE directive in PostgreSQL is a real pain. Glad that’s on its way in 9.5.
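For anyone wondering what the usual workaround looks like, a common pattern is to try an UPDATE first and fall back to an INSERT when no row was touched. The sketch below shows that pattern in Python against SQLite, purely as an illustration; the `lookup` table, its columns and the `upsert` helper are hypothetical, and this is only safe when a single writer is doing the import, since two concurrent writers could both decide to insert.

```python
import sqlite3


def upsert(conn, code, label):
    # Try to update an existing row first...
    cur = conn.execute("UPDATE lookup SET label = ? WHERE code = ?", (label, code))
    if cur.rowcount == 0:
        # ...and insert only when nothing matched.
        conn.execute("INSERT INTO lookup (code, label) VALUES (?, ?)", (code, label))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lookup (code TEXT PRIMARY KEY, label TEXT)")
upsert(conn, "colour", "red")   # inserts a new row
upsert(conn, "colour", "blue")  # updates the same row
rows = conn.execute("SELECT code, label FROM lookup").fetchall()
print(rows)
```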

Still, there’s time to try a different tack if this proves problematic – right now I’m just happy I’m able to extract a working schema and a good quantity of valid data from these basket-case databases.

It also makes a change from all that REST and JSON stuff I mostly seem to have done in the past few years 🙂

Posted in development | Leave a comment

HtmlCleaner 2.10 released

Get it here!


Tracking student blogs using Google Spreadsheets and WordPress

I’m currently teaching a first year module on Open Source Software, one of the requirements for which is that students write their findings up as blog posts. For that reason I thought it might be useful to be able to keep track of how much my students have been writing. Given there are 80 students on the module, some automation here would also be useful!

I remembered an impressive set of techniques developed by Martin Hawksey for tracking Jisc project blogs, and so I decided to use these as a starting point.

I created a Google Spreadsheet, and a form for capturing each student’s name and blog URL. I then added a couple of scripts – based on Martin’s examples above – to retrieve the WordPress RSS feed for each blog, and put the post count and the date of the last post into cells in the same row.

Now, WordPress only gives you a maximum of 10 entries in a feed by default, but for my purposes that’s still enough to get a sense of which students are struggling with their writing tasks. I just use some conditional formatting to show me anyone who hasn’t posted anything, and anyone who has made fewer than 5 posts this semester.
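The logic is simple enough to sketch outside the spreadsheet. The following Python is not the Google Apps Script used in the sheet; it is a rough equivalent, assuming a standard RSS 2.0 feed, that counts the posts, finds the most recent post date, and maps the count to a red-amber-green status. `SAMPLE_FEED`, `summarise_feed` and `rag_status` are made-up names, and the feed is parsed from a string rather than fetched so the example stays self-contained.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# A tiny made-up feed standing in for a student's WordPress RSS feed.
SAMPLE_FEED = """
<rss version="2.0">
  <channel>
    <title>A student blog</title>
    <item>
      <title>Second post</title>
      <pubDate>Thu, 16 Oct 2014 14:00:00 +0000</pubDate>
    </item>
    <item>
      <title>First post</title>
      <pubDate>Mon, 06 Oct 2014 09:30:00 +0000</pubDate>
    </item>
  </channel>
</rss>
"""


def summarise_feed(rss_xml):
    """Return (number of posts in the feed, datetime of the latest post or None)."""
    items = ET.fromstring(rss_xml).findall("./channel/item")
    raw_dates = (item.findtext("pubDate") for item in items)
    dates = [parsedate_to_datetime(d) for d in raw_dates if d]
    return len(items), max(dates, default=None)


def rag_status(post_count, threshold=5):
    """Red for no posts, amber for fewer than `threshold`, green otherwise."""
    if post_count == 0:
        return "red"
    if post_count < threshold:
        return "amber"
    return "green"


count, latest = summarise_feed(SAMPLE_FEED)
print(count, latest, rag_status(count))
```

Because the default feed tops out at 10 entries, the count is capped, but any threshold at or below 10 still works.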

I’ve experimented with some other layouts, for example using D3.js visualisations, but just a list of blogs with some red-amber-green coding seems to be the most practical.

Another benefit of having the list of blogs in a spreadsheet is it made it quite simple to generate an OPML file to share with students to import into WordPress Reader and follow everyone else on the course.
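Generating OPML from a list of names and blog URLs takes only a few lines. This is a sketch in Python rather than the method used with the spreadsheet; `build_opml` and the example.com blogs are hypothetical, and it assumes each blog exposes the default WordPress feed at `/feed/`.

```python
import xml.etree.ElementTree as ET


def build_opml(title, blogs):
    """Build an OPML 2.0 document from (name, blog_url) pairs."""
    opml = ET.Element("opml", version="2.0")
    head = ET.SubElement(opml, "head")
    ET.SubElement(head, "title").text = title
    body = ET.SubElement(opml, "body")
    for name, url in blogs:
        base = url.rstrip("/")
        ET.SubElement(
            body,
            "outline",
            type="rss",
            text=name,
            title=name,
            xmlUrl=base + "/feed/",  # default WordPress feed location
            htmlUrl=base,
        )
    return ET.tostring(opml, encoding="unicode")


# Made-up entries; the real list would come from the spreadsheet rows.
blogs = [
    ("Ada", "https://example.com/ada/"),
    ("Grace", "https://example.com/grace"),
]
opml_text = build_opml("Course blogs", blogs)
print(opml_text)
```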

One limitation is I don’t seem to be able to get the functions to be added automatically to each new row created by the form – I have to paste them over the new rows. Still, overall it’s not a bad solution.

You can see a copy of my spreadsheet here (without the actual student blogs on it), the form for collecting blog URLs, and you can get the scripts from GitHub.

Posted in mashups | 1 Comment

HtmlCleaner 2.9

I’ve finally released version 2.9 of HtmlCleaner!

This month I also had to answer my first ever official support request for HtmlCleaner in my “day job” – it appears a researcher at the University of Oxford is using it as part of a project analysing legislation passed by the State Duma of the Russian Federation 🙂
