14 September 2005

SPARQL support

I thought that I'd write a bit more about the initial end-to-end support for SPARQL in JRDF. While it's nowhere near complete, it does parse simple queries and apply them to a graph, giving real results. It's taken a while to get it to this point as I've been concentrating on test-driving the functionality, partly as an exercise in test-driving (can it be done with a SableCC parser?) and partly to ensure better quality code that's well structured and loosely coupled. This has drawn out a very nice architecture for the connection layer (connection, parser, executor), which means that it's easy to test and easy to swap out implementations when we come to supporting new languages. That will probably never happen, but the testability makes it all worthwhile even without this feature. So here's a rough sketch of the current architecture in HTML glory, with a minimal code sketch of the flow after the list.

  • org.jrdf.connection.JrdfConnection - Top level way to send textual queries to a JRDF graph.
    • org.jrdf.query.QueryBuilder - Turns textual queries into org.jrdf.query.Query objects.
      • org.jrdf.sparql.parser.QueryParser - Adapter to the compiler-specific (i.e. SableCC) parser class.
        • org.jrdf.sparql.analysis.SparqlAnalyser - Parses the queries into domain objects (e.g. Triple, Query, etc.).
          • org.jrdf.sparql.builder.TripleBuilder - Part of a set of builders that take SableCC org.jrdf.sparql.parser.node.Nodes and turn them into corresponding domain objects, such as ObjectNodes, Triple, etc. As more of SPARQL is implemented, these will be fleshed out more, and probably governed by a higher level class (perhaps called a herder ;).
    • org.jrdf.query.JrdfQueryExecutor - Takes org.jrdf.query.Query objects and executes them against a graph. This is the class that hooks into a query layer, a very bad one at present!
      • Query layer - currently does a select all and filters using an iterator based on the constraints in the query.
      • org.jrdf.graph.Graph - The JRDF implementation of an RDF graph.
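
To make the flow concrete, here's a minimal sketch of how the pieces above hang together. It's illustrative only: the interfaces mirror the roles in the list, but the method names and the Answer type are my shorthand, not the real JRDF signatures.

    // Illustrative only: roles from the list above, with assumed signatures.
    interface Query { }
    interface Answer { }

    interface QueryBuilder {
        Query buildQuery(String queryText); // text -> domain object
    }

    interface JrdfQueryExecutor {
        Answer executeQuery(Query query); // domain object -> results from a graph
    }

    // Top-level entry point. It knows nothing about SPARQL or SableCC; it just
    // delegates to whichever builder and executor it was given, so supporting
    // a new query language means swapping the builder only.
    final class JrdfConnection {
        private final QueryBuilder builder;
        private final JrdfQueryExecutor executor;

        JrdfConnection(QueryBuilder builder, JrdfQueryExecutor executor) {
            this.builder = builder;
            this.executor = executor;
        }

        Answer executeQuery(String queryText) {
            return executor.executeQuery(builder.buildQuery(queryText));
        }
    }

This shape is what makes the testing story work: hand the connection a mock builder and a mock executor and it can be tested in complete isolation.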

So why didn't I just implement it in Kowari? Well, initially I did; however, it proved to be too difficult for a number of reasons. Not all of these were my initial drivers, but it's proved to be a good decision in hindsight.

  1. Kowari has a convoluted build system which, while fine for managing a large team of developers, was not agile enough to enable me to develop on the train on my small PowerBook. Kowari's build is broken into components, which is great for plugging things in, but not so great when you need to build everything from scratch and the dependencies are not always checked. This is one of those issues about being "owned" by the build rather than the other way around. Large Ant-based build files need constant maintenance and refactoring; on a project as large as Kowari, this could almost be a full-time job. The same issue also manifested itself in it being quite hard to build Kowari in IntelliJ. It was doable, but required a roundabout solution that wasn't optimal.
  2. Slow turn-around on tests. The full test suite (excluding load/performance tests) in Kowari takes a long time to run (much longer than the full test suite in JRDF), which meant that I'd be spending my entire train ride (and battery) waiting for tests to complete. Most of this is down to the build system which, because of its componentised nature, has lots of dependencies between components that aren't always checked properly, resulting in recompilation and rebuilding of JARs that have only just been built. I could have taken the time to fix it (I believe Mark was working on this before Tucana ceased operations), but my PowerBook just isn't grunty enough to handle that kind of workload.
  3. The unit tests are not unit tests. Most of the tests in Kowari are integration tests; that is, while they may test only a single class, they test that class with the roles of its dependent classes filled by real concrete instances rather than mocks. So the tests exercise not only the class under test, but also every other class hanging off it. Consequently, the tests take a long time to run.
  4. The Kowari code is not test-driven. Even though at Tucana we did our best to ensure we had tests to cover everything (and we did a pretty decent job), these tests are not unit tests and we were not test-driving the code; at best we were writing tests before the code without letting them drive the design. I wanted to test-drive the code for this, and unfortunately that was harder to do with Kowari than with a fresh(er) codebase.
  5. The code is tightly coupled. This was the main reason I stopped work in the Kowari codebase and moved to JRDF. It's all over the codebase (see for example recent comments on the Kowari developers list about this), but the kicker for me was that the code to execute a query was embedded in the parser, ItqlInterpreter (Paul also commented on this a while back). Seeing as I wrote a lot of this, I have only myself to blame, but pulling the query execution code out of the parser was too big a job. I wanted to do it so that I could pull the common code between SPARQL and iTQL out into a set of shared classes, leaving the language-specific bits as the parsers only. All the parsers should need to do is generate Query objects for sending to the query layer; something else should take care of the execution.
  6. Andrew and I have been bitten by the agile, quality, TDD and refactoring bug, and having our own small project to play with makes this much easier. It also allows us to enforce higher standards on the codebase, as there are only two of us who need to agree at the moment.
  7. Andrew had some ideas about building a query layer, learning from what was done in Kowari.
  8. We're interested in seeing if we can use some DI principles in the code, now that we've decoupled the classes and have clear responsibilities. I don't think you'll see a Spring requirement just yet, but you'll see a lot of comments saying things like // FIXME TJA: Set builder using IoC, which will make the job of plugging in different implementations of things like org.jrdf.query.JrdfQueryExecutor really easy. It should also make JRDF a lot more embeddable than it currently is. There's a rough sketch of this kind of wiring after this list.
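
As a taste of what point 8 might look like in practice, here's a hand-wired sketch building on the interfaces sketched earlier; SparqlQueryBuilder and IteratorQueryExecutor are hypothetical implementation names. The point is just that the caller assembles the object graph, so no container is required.

    // Hypothetical wiring code: the caller builds the object graph by hand,
    // so swapping in, say, a Kowari-backed executor is a one-line change.
    QueryBuilder builder = new SparqlQueryBuilder();          // hypothetical
    JrdfQueryExecutor executor = new IteratorQueryExecutor(); // hypothetical
    JrdfConnection connection = new JrdfConnection(builder, executor);
    Answer answer = connection.executeQuery("SELECT ?s WHERE { ?s ?p ?o }");

A framework like Spring just automates this wiring; the design win is that the classes never look up their own dependencies.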

So what's next?

  • Completion of the SPARQL grammar as it stands in the candidate recommendation (this may be dependent on the next point).
  • Implementation of a better query layer. The current iterator-based approach will not scale, as it reads the entire graph into memory and then filters out the results that don't match the constraints (see the sketch after this list). Andrew has already started work on this, and I think that it'll be a lot simpler than the model used in Kowari. Kowari also has more features, such as resolvers, that JRDF does not need (nor perhaps want), so we can afford to make it simpler and learn from what was done in Kowari.
  • More long term, and perhaps beyond the scope of JRDF, we could learn from what guys like Obie Fernandez are doing in the Ruby arena, especially some of the thinking about a Rails-like framework. I'm not convinced as yet, but the semweb still needs a killer app (yes, we all know it's WonderGui).
  • Andrew wants to release a 0.4 release soon, so perhaps we'll have a SPARQL parser and a stable API in that release. At the moment we've been a bit eager with the refactoring, which is fine up to a point, but we'll need to decide on a stable API as the release nears.
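
For the curious, the scaling problem with the current query layer boils down to something like the sketch below. It's a caricature, not the real code; Triple and Constraint stand in for the real JRDF types.

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    // "Select all and filter": every triple in the graph is visited and tested
    // against the constraint, so the cost is proportional to the size of the
    // graph no matter how selective the query is.
    final class SelectAllAndFilter {

        interface Triple { }
        interface Constraint {
            boolean matches(Triple triple);
        }

        static List<Triple> execute(Iterable<Triple> graph, Constraint constraint) {
            List<Triple> results = new ArrayList<Triple>();
            for (Iterator<Triple> it = graph.iterator(); it.hasNext();) {
                Triple triple = it.next();
                if (constraint.matches(triple)) {
                    results.add(triple); // keep only the matching triples
                }
            }
            return results;
        }
    }

An indexed store only ever touches the triples that can match, which is where Andrew's new layer (and Kowari) come in.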

My plan for Kowari is to complete the SPARQL implementation in JRDF and hopefully just update the JRDF version in Kowari, thus getting SPARQL support for free. Of course, this assumes there will be none of the normal problems such an upgrade entails. The downside of this is that Kowari is heavily optimised for speed, something that I haven't spent any time on in the work I've been doing. I'm not overly worried by this, as the architecture lends itself to plugging in different implementations, and really all I've done is create a parser that turns query strings into Query objects; this parsing does not constitute a significant amount of the overall query execution time. We're lucky in this regard, as I fully expect that someone with more skill in this area will be able to implement an efficient query executor that plugs directly into Kowari's session or some other mid-level API. In fact, because of the decoupling of the components in JRDF, this is very easy to do. Some easy options are:

  • Create a translator that translates org.jrdf.query.Query instances into org.kowari.query.Query instances. These queries can then be sent directly to an org.kowari.server.Session for execution (a sketch of this option follows the list).
  • Create an implementation of org.jrdf.query.JrdfQueryExecutor that takes org.jrdf.query.Query instances and executes them directly against a Kowari API.
  • Then there's the other option of plugging the JRDF query layer into Kowari's layer. I'm not sure that this is either the right way to go or the most efficient; it smells like a hack to me.
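
To make the first option concrete, here's a rough sketch of what the translator might look like. The class name and method signatures are my own invention, it assumes both the JRDF and Kowari jars are on the classpath, and the real Session API may well differ.

    // Hypothetical adapter: translate a JRDF query into a Kowari query and
    // hand it straight to a Kowari session for execution.
    final class JrdfToKowariExecutor {

        private final org.kowari.server.Session session;

        JrdfToKowariExecutor(org.kowari.server.Session session) {
            this.session = session;
        }

        Object executeQuery(org.jrdf.query.Query jrdfQuery) throws Exception {
            org.kowari.query.Query kowariQuery = translate(jrdfQuery);
            return session.query(kowariQuery); // assumed Session method
        }

        private org.kowari.query.Query translate(org.jrdf.query.Query query) {
            // Walk the JRDF query's constraints and build the Kowari equivalent.
            throw new UnsupportedOperationException("sketch only");
        }
    }

This keeps all the Kowari-specific knowledge behind one seam, which is exactly what the decoupling was meant to buy.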

I hope none of this comes off as criticism of Kowari, as that wasn't my intention. It's certainly the best triplestore there is, and the project is going great guns now. It's basically a reflection that Kowari is too heavyweight to do what we want to do and that both projects have different goals. Kowari is effectively a scalable, fully transactional database, while JRDF is a Java RDF API. The two can (and should) happily coexist, learning from the work each of us does. Fortunately, this is easy, as we all used to work together (bar the NG guys) and know each other well. Oh, and it helps that Paul is a committer on JRDF and Andrew and I are on Kowari...

1 Comment:

At 26/9/05 8:33 pm, Blogger Tom Adams said...

Very nice! It's good to see a lot more work in the API area for RDF; I've been saying for years that it's these higher-level operations that most developers will want to play with. Even better will be when even higher-level APIs are built on top of this, à la Hibernate, although I'm not sure there's a direct correlation. Please feel free to offer comments on the direction of the API. I've been trying to learn from what we did with iTQL in Kowari, as well as provide decent abstractions, so any feedback is appreciated! Thanks.
