For a while now I’ve been working on the standardisation of XCRI-CAP, an XML specification for exchanging data about courses, used in things like the HEAR and the JISC Course Data programme. As part of that process I’ve also been building a parser for XCRI in Java (Xcri4j) — partly because I thought it might be useful, and partly because it’s helped me to identify ambiguities and problems in the specification from an implementer’s perspective. In structure it’s quite similar to XcriCap-Net, a C# library for doing the same kind of thing.

What I found interesting is that it is far easier to write a code-based parser than to use any kind of schema-driven or schema-generated system. That’s partly because XML Schema is, frankly, crap, but also because many of the business rules you’re interested in from a data viewpoint fall outside its scope.

I’ve also found it much easier to write a more forgiving parser starting from code; while XML purists may scoff, I think it’s important to create systems that don’t simply throw your data back in your face if you mix up which of the six namespaces in the document apply to which elements, or indeed just give up and don’t bother using namespaces at all. And while XML is supposed to be case-sensitive, in practice I’d rather accept the data than get all particular about capital letters in tag names.
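
To make that concrete, here’s a minimal sketch of the kind of forgiving lookup I mean, written against the standard org.w3c.dom API so it’s self-contained (Laxml itself layers over JDOM, and its real method names may well differ from this):

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Illustrative helper, not the actual Laxml API.
public class LaxFinder {

    // Find child elements matching a local name, ignoring both case and
    // namespace, instead of rejecting the document outright.
    public static List<Element> childrenNamed(Element parent, String name) {
        List<Element> out = new ArrayList<>();
        NodeList kids = parent.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            Node n = kids.item(i);
            if (n.getNodeType() != Node.ELEMENT_NODE) continue;
            // getLocalName() is null when the document was parsed without
            // namespace awareness, so fall back to the raw node name.
            String local = (n.getLocalName() != null) ? n.getLocalName() : n.getNodeName();
            if (local.equalsIgnoreCase(name)) out.add((Element) n);
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<catalog xmlns:x='urn:example'>"
                   + "<Course/><x:course/><provider/></catalog>";
        DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
        f.setNamespaceAware(true);
        Document d = f.newDocumentBuilder()
                      .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        // Both <Course/> and <x:course/> match, despite the stray capital
        // and the namespace prefix.
        System.out.println(childrenNamed(d.getDocumentElement(), "course").size()); // prints 2
    }
}
```

A schema-validating pipeline would reject both variants before your application ever saw the data; this way you get the elements back and can decide for yourself how fussy to be.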

So I’ve built my parser so that it can still throw exceptions for these kinds of common problems but also gives you corrected data, so an application can log an exception or provide feedback, but still actually use the data. Or you can ask the parser to quietly process stuff, write warnings to your logs, and not bother you with any exceptions. In either case you can support the model of being generous when importing, but providing better data validity when exporting.
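
The two modes can be sketched roughly like this — the names here (LaxLog, RecoverableException) are illustrative only, not the actual Xcri4j API:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the strict/lenient dual-mode pattern described
// above; not the real Xcri4j classes.
public class LaxLog {

    public static class RecoverableException extends Exception {
        public final String corrected;  // the repaired value, still usable
        public RecoverableException(String message, String corrected) {
            super(message);
            this.corrected = corrected;
        }
    }

    private final boolean lenient;
    private final List<String> warnings = new ArrayList<>();

    public LaxLog(boolean lenient) { this.lenient = lenient; }

    // Normalise a tag name that differs only in case from what the spec
    // expects: warn-and-continue in lenient mode, or, in strict mode,
    // throw an exception that still carries the corrected name.
    public String fixCase(String raw, String expected) throws RecoverableException {
        if (raw.equals(expected)) return raw;
        String msg = "Expected <" + expected + "> but found <" + raw + ">";
        if (lenient) {
            warnings.add(msg);  // quietly note it in the logs...
            return expected;    // ...and hand back usable data anyway
        }
        throw new RecoverableException(msg, expected);
    }

    public List<String> warnings() { return warnings; }

    public static void main(String[] args) throws Exception {
        LaxLog lax = new LaxLog(true);
        System.out.println(lax.fixCase("Provider", "provider")); // prints "provider"

        LaxLog strict = new LaxLog(false);
        try {
            strict.fixCase("Provider", "provider");
        } catch (RecoverableException e) {
            // Log or report the problem, but still use e.corrected.
            System.out.println(e.corrected); // prints "provider"
        }
    }
}
```

Either way the application ends up holding corrected data; the only difference is whether the problem surfaces as an exception to handle or a warning to read later.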

I’ve abstracted the utility classes that perform these particularly forgiving XML operations into a little project of their own on GitHub called Laxml (“less anal XML”), as they may be of use in other XML-based projects. They overlay the popular JDOM XML library for Java.

I haven’t yet converted the whole parser to use Laxml, or covered the whole XCRI model, but it’s on my to-do list.

This entry was posted in development, xcri, xml.

4 Responses to Xcri4j

  1. Tavis Reddick says:

    Scott, could you please expand on your use cases for an XCRI CAP parser (in Java or another programming language)?
    I would not call myself an XML purist, and have used both client- and server-side scripting as part of XML processing solutions. However, I think there may be some advantages in using the XML family of technologies (XML, XML Schema, XPath, XSLT and so on), where you get the benefit of declarative, publishable, documentable (especially in Schema), reusable, accessible and platform-independent code which can form a robust contract with information services with widely-supported standards*.
    There are some shortcomings with some of these standards (although they are intended to evolve through improved versions), and their implementations; perhaps you can say what problems your parser design is intended to solve?
    By the way, as a (web and database) developer, I tend to be a bit wary of non-normative methods that can lead to a double-debugging scenario (is the problem I see before me from a validation error from the XML-or-SQL-schema or a bug in the parser?). Although perhaps you envisage a separate two-step validation anyway, using first schema then advanced coded business rules to check a valid subset.
    * When-oh-when are Microsoft going to support XPath 2.0?!

    • scottbw says:

      Partly this comes from a mismatch between parts of the XML specification stack. So, for example, in some ways DTD is more expressive than XML Schema. And there are valid constructs in XML that are practically impossible to produce or verify when using XML Schema, particularly extensible data models.

      There are some solutions to these sorts of problems in draft revisions to XML Schema, but they seem unlikely to reach adoption, simply because many products that use it have scaled down their work to fit what can work within that framework. This is OK if you have a simple data model produced and consumed within a single company (now the dominant use case for XML web services) but not so much when you have documents in the wild with a fair amount of variety both in terms of systems that produce and consume them, and organisational context.

      I think Schema is probably the worst roadblock to using XML effectively in a standardised fashion as it produces both Type I and Type II errors: it falls over for things that are quite reasonable, and it lets things through that are nonsensical in business terms. In some respects it was a step backwards from DTD.

      Interestingly enough, W3C itself never seems to bother with XSDs any more for XML-based specifications. Which should tell us something…

    • scottbw says:

      Ah… maybe it just varies with W3C WG. I haven’t seen any XSDs from WebApps.

  2. I checked the latest news from W3C, and the latest recommendation-related announcement was for EmotionML. Emotion Markup Language (EmotionML) 1.0 is a W3C Candidate Recommendation from 10 May 2012, and declares:

    “Since the Last Call Working Draft in April 2011, an XML Schema and a MIME-type for EmotionML were defined.”
