DOMImplementation hates HTML5

This doesn’t work:

DocumentType documentType = impl.createDocumentType("html", "", "");
Document document = impl.createDocument(null, "html", documentType);

assertEquals("head", document.getChildNodes().item(0).getChildNodes().item(0).getNodeName());

In fact, it just silently fails to add any child nodes. No exceptions, nada.

This gives the same result:

DocumentType documentType = impl.createDocumentType("html", null, null);
Document document = impl.createDocument(null, "html", documentType);

assertEquals("head", document.getChildNodes().item(0).getChildNodes().item(0).getNodeName());

But *this* does work:

DocumentType documentType = impl.createDocumentType("html", " ", " ");
Document document = impl.createDocument(null, "html", documentType);

assertEquals("head", document.getChildNodes().item(0).getChildNodes().item(0).getNodeName());
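For reference, here is a minimal, self-contained version of the last case. Note this is a sketch on my part: it obtains a generic DOMImplementation via JAXP, and which children (if any) get auto-created under the root element is implementation-specific, which is exactly the quirk described above.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;

public class DoctypeQuirk {
    public static void main(String[] args) throws Exception {
        // One way to get hold of a DOMImplementation (an assumption on my
        // part; the behaviour above depends on which implementation you use).
        DOMImplementation impl = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().getDOMImplementation();

        // The "working" variant from above: single-space public and system IDs.
        DocumentType documentType = impl.createDocumentType("html", " ", " ");
        Document document = impl.createDocument(null, "html", documentType);

        // The DOM spec guarantees a document element named "html"; whether
        // <head>/<body> appear beneath it is up to the implementation.
        System.out.println(document.getDocumentElement().getNodeName());
    }
}
```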


Posted in Uncategorized | Leave a comment

HtmlCleaner 2.8 is out

It's the first release of 2014, and it's got a nice patch from Rafael that makes it run a lot faster (who knew that just checking whether a String is a valid Double in the XPath processor would cause so much stress?) and another patch from Chris that makes it output proper XML ID attributes in the DOM.

My contributions this time around were more enhancements to “foreign markup” handling, which is important when cleaning up HTML that contains valid SVG or other embedded content. HtmlCleaner wasn’t really written with that sort of use in mind when Vladimir started on it back in 2006, so it involved a fair bit of wrangling, but I think we’re nearly there now.

It's good to see the number of contributions going up – last release we had an entire GUI contributed by Marton – and I think having a faster release schedule is helping with that. Hopefully one day I'll be able to make releases that just consist of applying other people's patches 🙂

HtmlCleaner is an HTML parser written in Java. It transforms dirty HTML into well-formed XML, following the same rules that most web browsers use. Download HtmlCleaner 2.8 here, or get it from Maven Central.

Posted in Uncategorized | Leave a comment

5 lessons for OER from Open Source and Free Software

While the OER community owes some of its genesis to the open source and free software movements, there are some aspects of how and why these movements work that I think are missing or need greater emphasis.

open education week 2014

1. It's not what you share, it's how you create it

One of the distinctive elements of the open source software movement is open development projects. These are projects where software is developed cooperatively (though not necessarily collaboratively) in public, often by people contributing from multiple organisations. All the processes that lead to the creation and release of software – design, development, testing, planning – happen using publicly visible tools. Projects also actively try to grow their contributor base.

When a project has open and transparent governance, it's much easier to encourage people to voluntarily provide effort, free of charge, that far exceeds what you could afford to pay for within a closed in-house project. (Of course, you have to give up a lot of control, but really, what was that worth?)

While there are some cooperative projects in the OER space, for example some of the open textbook projects, for the most part the act of creating the resources tends to be private; either the resources are created and released by individuals working alone, or developed by media teams privately within universities.

Also, in the open source world it's very common for multiple companies to put effort into the same software projects as a way of reducing their development costs and improving the quality and sustainability of the software. I can't think offhand of any examples of education organisations collaborating on designing materials on a larger scale – for example, cooperating to build a complete course.

Generally, the kind of open source activity OER most often resembles is the “code dump” where an organisation sticks an open license on something it has essentially abandoned. Instead, OER needs to be about open cooperation and open process right from the moment an idea for a resource occurs.

Admittedly, the most popular forms of OER today tend to be things like individual photos, powerpoint slides, and podcasts. That may partly be because there is not an open content creation culture that makes bigger pieces easier to produce.

2. Always provide “source code”

Many OERs are distributed without any sort of “source code”. In this respect, license aside, they don’t resemble open source software so much as “freeware” distributed as executables you can’t easily pick apart and modify.

Distributing the original components of a resource makes it much easier to modify and improve. For example, where the resource is in a composite format such as a PDF, eBook or slideshow, provide all the embedded images separately too, in their original resolution, or in their original editable forms for illustrations. For documents, provide the original layout files from the DTP software used to produce them (but see also point 5).

Even where an OER is a single photo, it doesn’t hurt to distribute the original raw image as well as the final optimised version. Likewise for a podcast or video the original lossless recordings can be made available, as individual clips suitable for re-editing.

Without “source code”, resources are hard to modify and improve upon.

3. Have an infrastructure to support the processes, not just the outputs

So far, OER infrastructure has mostly been about building repositories of finished artefacts but not the infrastructure for collaboratively creating artefacts in the open (wikis being an obvious exception).

I think a good starting point would be to promote GitHub as the go-to tool for managing the OER production process. (I'm not the only one to suggest this; Audrey Watters also blogged the idea.)

It's such an easy way to create projects that are open from the outset, and it has a built-in mechanism for creating derivative works and contributing back improvements. It may not be the most obvious thing to use from the point of view of educators, but I think it would make it much clearer how to create OERs as an open process.

There have also been initiatives to do a sort of “GitHub for education” such as CourseFork that may fill the gap.

4. Have some clear principles that define what it is, and what it isn’t

There has been a lot written about OER (perhaps too much!). However, what there isn't is a clear set of criteria that something must meet in order to be considered OER.

For Free Software we have the Four Freedoms as defined by FSF:

  • Freedom 0: The freedom to run the program for any purpose.
  • Freedom 1: The freedom to study how the program works, and change it to make it do what you wish.
  • Freedom 2: The freedom to redistribute copies so you can help your neighbor.
  • Freedom 3: The freedom to improve the program, and release your improvements (and modified versions in general) to the public, so that the whole community benefits.

If a piece of software doesn't support all of these freedoms, it cannot be called Free Software. And there is a whole army of people out there who will make your life miserable if you try to pass it off as such anyway.

Likewise, to be "open source" means to meet every point of the Open Source Definition published by the OSI. Again, if you try to pass off a project as open source when it doesn't meet all of the points of the definition, there are a lot of people who will be happy to point out the error of your ways. And quite possibly sue you if you misuse one of the licenses.

If it isn’t open source according to the OSI definition, or free software according to the FSF definition, it isn’t some sort of “open software”. End of. There is no grey area.

It's also worth pointing out that while there is a lot of overlap between Free Software and Open Source at a functional level, how the criteria are expressed is also fundamentally important to their respective cultures and viewpoints.

The same distinctive viewpoints or cultures that underlie Free Software vs. Open Source are also present within what might be called the “OER movement”, and there has been some discussion of the differences between what might broadly be called “open”, “free”, and “gratis” OERs which could be a starting point.

However, while there are a lot of definitions of OER floating around, none has emerged as a recognised definition with a label attached – there are no banners to rally to for those espousing these distinctions.

Now, it may seem odd to suggest that splitting into factions would be a way forward for a movement, but the tension between the Free Software and Open Source camps has, I think, been a net positive (of course, those in each camp might disagree!). By aligning yourself with one or the other group you make it clear what you stand for. You'll probably also spend more of your time criticising the other group, and less time on infighting within your own!

Until some clear lines are drawn about what it really stands for, OER will continue to be whatever you want to make of it according to any of the dozens of competing definitions, leaving it vulnerable to openwashing.

5. Don’t make OERs that require proprietary software

OK, so most teachers and students still use Microsoft Office, and many designers use Adobe tools. However, it's not that hard to develop resources that can be opened and edited using free or open source software.

The key to this is to develop resources using open standards that allow interoperability with a wider range of tools.

This could become more of an issue if (or rather, when) MOOC platforms start to "embrace and extend" common formats in order for authors to make use of their platform features. Again, there are open standards (such as IMS LTI and the Experience API) that mitigate this. This is of course where CETIS comes in!

Is that it?

As I mentioned at the beginning of this post, OER is to some extent inspired by Open Source and Free Software, so it already incorporates many of the important lessons learned, such as building on (and to some extent simplifying and improving) the concept of free and open licenses. However, it's about more than just licensing!

There may be other useful lessons to be learned and parallels drawn – add your own in the comments.

Posted in cetis, open education, standards | 6 Comments

Why CKAN’s datastore_search_sql gives you a syntax error, and how to fix it

If you’re using the DataStore extension with CKAN, one of the first things you’re likely to try is to execute a SQL query on your data. However, you’ll likely see something like this:

["(ProgrammingError) syntax error at or near \"-\"\nLINE 1: SELECT * FROM 2da8c567-9d09-4b00-b098-c3e036170a86

This is because, by default, CKAN creates resource IDs that PostgreSQL won't accept as unquoted table names: they typically start with a digit and contain hyphens.

To get around this, just put quotes around the resource ID in the URL like so:
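As a sketch (using the resource ID from the error message above, and a hypothetical CKAN host), the quoted query and the resulting request URL look like this:

```java
import java.net.URLEncoder;

public class CkanQuery {
    public static void main(String[] args) throws Exception {
        // The resource ID from the error above. Double-quoting it makes it a
        // valid PostgreSQL identifier, even though it starts with a digit and
        // contains hyphens.
        String resourceId = "2da8c567-9d09-4b00-b098-c3e036170a86";
        String sql = "SELECT * FROM \"" + resourceId + "\"";

        // demo.ckan.org is a placeholder host; datastore_search_sql is the
        // standard CKAN action API endpoint.
        String url = "http://demo.ckan.org/api/3/action/datastore_search_sql?sql="
                + URLEncoder.encode(sql, "UTF-8");

        System.out.println(sql);
        System.out.println(url);
    }
}
```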

Posted in development | Leave a comment

Sign and encrypt your email. Please.

(This was prompted by the news that Groklaw is shutting down, in large part due to concerns over conducting business by email now that there is no legal or constitutional protection for its privacy. You can find out more about this story here)

Email is wonderful and terrible. It's pretty much the one technology that no business or organisation can live without. It's also, by default, insecure enough that anyone can snoop it with little more than basic networking tools.

But there are some simple measures that you can take to make it much, much more robust.

Simplest of all is to use servers that encrypt the communication channel (TLS). This is nice and easy for users, because they don't even need to know about it, and it prevents casual eavesdropping over the network. Most providers these days use encrypted communication channels for email.

However, the big hole in this scheme is that, while your communication is encrypted from others on the network, it's plain to read for your provider. Not a problem if you trust your provider with your privacy and security. But these days, why would you?

To close this gap, you need to actually encrypt the messages themselves, not just the channel they are sent over. The tool I use for this is GPG, and a handy plugin for Apple’s Mail program called GPGMail. This automatically signs emails you send (preventing forgery) and also automatically encrypts email if you have the public key of the person you are sending it to. (If you’re interested, mine is here).

You can see this working by, for example, sending encrypted email from your GMail account, then looking at the message in the GMail web interface – all you get is a big block of seemingly random characters as Google can’t decipher the message and read it. Even though I’m using their service to deliver it! How cool is that?

The system only really starts to work if more people use it, so that encrypted messages become a significant part of the total traffic. If only a few messages on the network are encrypted, it's easy enough for Bad People to just target those and break their encryption. If there are billions of encrypted emails flying around, it becomes an untenable and expensive proposition to break them open, and mining all emails by default looks far less attractive for both companies and governments.

So, even if you are of the “I have nothing to hide” point of view, there is still a good reason to use encrypted communications if you can.

Understanding how keys work is the main educational barrier to getting more people using the system. It would be nicer if email applications made encryption and signing easier by default, but I guess they have plenty of incentives not to…

For a much better guide for how to set it all up, try this article on LifeHacker.

Posted in Uncategorized | Leave a comment


Well, this whole Prism business has been quite interesting from a personal perspective, especially the focus on "metadata".

A long time ago, though not actually that far away from where I am now, I worked as a junior technical writer at a software company called Harlequin (later Xanalys). I worked in the intelligence systems division, and had been involved in some very cool projects for things like crime mapping, network analysis, and homicide case management. We had some great news clippings on the office walls of crimes solved using our technology.

One day my supervisor informed me that I needed to update the manual for one of the company’s more popular but least-liked products, something called CaseCall. Everyone within the company hated CaseCall as far as I could tell, and after a few days working on it I could tell why.

What CaseCall did was basically automate its way around some otherwise quite sensible restrictions on police extracting metadata from telecommunication providers.

In principle, any investigating officer could get in touch with any provider and ask for details about who-called-whom over a particular time period and analyse the data, but in practice not many did because the law put in place a number of steps you needed to go through to have your request approved.

What CaseCall did was turn things like “The absence of this information will prejudice this investigation because…” into a drop-down list of boilerplate non-answers so that the officer could press the submit button at one end, and the service provider press the accept button at the other, and the metadata could flow into the very clever analysis tools that the company had developed (and indeed still sell today).

Harlequin won the first Big Brother award for a product in 1998 for CaseCall. Sadly, no-one from the company went to collect it, as it would have looked great in the office:

Product: Software by Harlequin that examines telephone records and is able to compare numbers dialled in order to group users into 'friendship networks' won this category. It avoids the legal requirements needed for phone tapping.

(Friendship network analysis in 1998! Pretty good, huh? Mark Zuckerberg would have been about 14 around then.)

Basically, what we were doing was avoiding the whole business of phone tapping and collecting content, and instead going after metadata. After all, the metadata was usually sufficient to identify important network nodes, identify useful patterns of behaviour, and corroborate other types of information acquired by other means, such as interviews and field officer reports.

The metadata was actually in many ways more useful than the "data" (the content of the phone calls in this case), which would have taken a lot of work to transcribe and analyse, and may not have provided much more analytic value than the metadata alone. (It was great to read this little sketch by Cory Doctorow about metadata today, which kind of makes the same point.)

So don't underestimate "metadata"!

Posted in Uncategorized | 3 Comments

Phishing for peer reviewers

Today I got an email from the ICL 2013 conference:

Dear Scott Wilson,

This is a short reminder for you to complete your reviews for the ICL/IGIP 2013 Conference.

Overall 2 submissions were recently assigned to you for reviewing, 2 are not yet completed.

We kindly remind you that we implement a “review-to-present” model. At least one of the authors from each full paper is expected to act as a reviewer of other submissions in order to have their paper(s) published in the conference proceedings.

Please log into your ConfTool account. On the “Overview” page you can find now a review section where you will be able to download the papers and enter your reviews.

We need your reviews latest until 22 May 2013. We can’t exceed this deadline, because we would like to inform the authors about the acceptance in time.

Thank you for your support.

Best regards,

Your organizers of ICL 2013.

OMG! My two reviews are due in! I’d better go complete them!

Except of course I’ve never been asked, let alone accepted, to be on the review committee for ICL. As far as I can remember, I’ve never even been to one.

Thankfully, as soon as I read this it rang a vague bell. Where had I heard this before? Oh, I know, back in 2012:

Dear Scott Wilson,

We are now ready to start the review process of the Full Paper extended abstracts submissions for the ICL/IGIP Conference 2012.

You have been assinged up to three papers to review. To download the paper(s) and to enter your reviews please login to your ConfTool account. On the “Overview” page you can find now a review section.

For all the authors of full papers we kindly remind you that “Full Paper” is a “review-to-present” submission type. Each paper MUST have at least one author participating in the reviewing process.

We would appreciate to receive your reviews by 14 May 2012. We can’t exceed this deadline, because we would like to inform the authors about the cceptance in time.

Thank you for your support during the ICL/IGIP review process.

Best regards,

Danilo G. Zutin
Technical Program Chair

I’d just ignored it at that time. Why? Well, because back in 2011 I received:

On 1 Jun 2011, at 21:23, ICL Conference Secretariat wrote:

Dear Scott Wilson,

may we kindly remind you, that the deadline for submitting the reviews assigned to you is on Monday, 06 June 2011.

We can’t extend the review phase, because we have to inform authors about the results in time.

Thank you for your understanding and kind regards,

Conference Chairs

14th International Conference on Interactive Collaborative Learning

To which I replied, somewhat snarkily:

As far as I’m aware I have never agreed to have any connection with this conference. Please remove me from your mailing list.

Of course they ignored this, as they had when back in 2010 they sent me this:

On 25 May 2010, at 20:41, ICL Conference Secretariat wrote:

Dear Scott Wilson,

this is a gentle reminder that the paper review deadline for the ICL2010 Conference  is in four days (Saturday, 29 May 2010).

For your convenience the input of your reviews via the ConfTool is still possible until Sunday evening.

Please note that this is a hard deadline, so that the chairs can perform their duties in a timely manner and inform the authors about acceptance/rejection in time.

For your information: We have received more than 140 full paper submissions from over 45 countries and we are looking forward to a successful conference.

Calls for some submission types are still open.

Best regards,

Jeanne Schreurs

Michael Auer

13th International Conference on Interactive Computer aided Learning

To which I’d replied, confusedly:

I have no idea why I received this email – I’m not attending the conference, nor have I volunteered to join the programme committee for it?

Now, fake conferences and academic spam are becoming a real nuisance. However, I think ICL is a real conference, because some real people I know have had papers accepted there! Also, the very real Sandra Schaffert organised a mashup workshop at ICL 2009 for which I actually was on the programme committee!

But the fact is that in its communications ICL is behaving like some sort of bizarre phishing scheme rather than an academic event.

You don't normally allocate paper reviews to random people on the Internet; you invite them onto your programme committee and give them the option of saying "no".

Maybe this is just a bug in the conference organising software they’re using. Then again, who knows, maybe a spam-bait-phish model of conference peer review actually works?

Posted in Uncategorized | Leave a comment

HtmlCleaner 2.5 is out!

Yesterday we released HtmlCleaner 2.5, which fixes up a few bugs, and also has a rewritten DOCTYPE handler based on the latest guidance issued with HTML5.


Posted in Uncategorized | Leave a comment

Open Source Meets Open Standards

I’ve just published a post over on the OSS Watch team blog on open source and open standards, introducing a new OSS Watch briefing paper on the topic.

This is where my CETIS and OSS Watch roles cross over! The post linked above is from a policy point of view, whereas the briefing paper is more aimed at developers and project managers. But what does this look like from a standards wonk viewpoint?

We often espouse the virtues of having open source implementations (or reference implementations) for driving adoption of standards; however, there can also be barriers to open source that may be less obvious.

The paid publishing model often used by de jure standards organisations is definitely a barrier, in the sense that developers are less likely to browse the standard and decide to use it. However, I think on the whole it's far less of an obstacle than a lack of clarity on the issues of patent licensing, copyright, conformance claims, and trademarks. If the standard is critical for interoperability, paying $120 for the specs isn't as big a deal as potentially getting sued by Oracle or IBM for patent infringement.

For major standards-setting organisations like W3C it's not much of an issue, as developers generally know they can implement W3C standards freely, whether or not they really understand the legal detail. However, for less well known standards-setting communities and consortia, it's necessary to clearly spell out their position if they want to encourage open source implementations.

Posted in cetis, standards | 2 Comments

Understanding Glass: An SF perspective

When the iPhone (and later the iPad) arrived, the first things people started comparing them to were the movie Minority Report, for its interface, and the technologies used in Star Trek, with its many glossy black touch-panel devices.

Science Fiction sometimes does a good job of – if not predicting the future – exploring the implications of many possible futures.

I think Google Glass is another good example of this.

My first thought on seeing Google Glass was a fairly recent novel by Vernor Vinge, Rainbows End (2006).

Rainbows End book cover

This is a novel about augmented reality and virtualization, and it's well worth reading for its exploration of how an augmented reality layer may affect society and technology. In particular, his vision of a locked-down technological future, where the only way of interacting with "black boxes" is via virtual interfaces, has some echoes of today's trend towards walled-garden networks.

However, the “overlay” aspects of Google Glass are perhaps not the most interesting.

In a recent post, Mark Hurst discusses the "lifebits" recording feature of Glass. This immediately brought to mind two very different SF stories.

In Other Days, Other Eyes (1972), Bob Shaw introduced us to a world of "Slow Glass", a material that delays the transmission of light from one side to the other.


He takes us through the use of the technology as a means of ubiquitous surveillance: find a piece of glass, however small, and look through it to see what happened in the past. Google Glass, though built on a very different technology, has similar properties to going around wearing a piece of Slow Glass, in that it offers the capability of looking through it into the past, via lifestreaming of video (and audio – not something Slow Glass could accomplish!).

Shaw does a very thorough job of exploring the use of this technology, and I highly recommend reading the novel. In particular, there is an exploration of how society adapts to ubiquitous surveillance which has something of the ring of the Kübler-Ross stages of grief. Initially, efforts are made to avoid Glass: for example, a secret meeting in the book is held in a room whose walls are hosed down with a fresh layer of plastic each day, just in case any fragments of Glass have been placed in it. By the end of the story, however, everyone simply accepts that anything, anywhere, and anywhen may be seen by others, with the vivid closing image of particles of Glass suspended in raindrops.

Shaw also touches on the nostalgic aspects of Glass, with a story of a man watching his lost family from outside his house.

This nostalgic theme is also a key part of a very interesting short story by John Crowley, Snow (1985).

Snow is told from the perspective of a widower, whose wife’s first husband had bought her a “wasp” that continually recorded video of her for posterity. In the story, the protagonist visits a futuristic memorial park where he compulsively reviews the recordings, until eventually entropy starts to set in.

Again, the technology that Crowley uses here isn't very much like Google Glass, but the implications of the story feel quite close to the near future. Crowley presents a compelling use for lifestreaming (or lifebits, or whatever you want to call it) and also explores some of the potential downsides. It's quite short, and well worth a read. Would you use Google Glass to do this?

I'm sure there are many other examples of fiction anticipating a Glass-like invention. If you know of any, let me know in the comments!

Posted in augmented reality, google | 4 Comments