Saturday, May 31, 2008

SQL for distributed caches

Once again, I find myself on a project that is struggling to get data in and out of the database fast enough.  Given the steps involved in persisting and retrieving objects from the database, it's hardly surprising.  Okay, I know there are better ways to go about solving the cost of this process, but here are the steps:-

  1. Use XStream to serialise most of the object data to XML.
  2. Take the resulting smaller object, and use Hibernate to map it to a table in the database.
  3. Fetch the next number in a hibernate generated HiLo sequence.
  4. Store the record, with the XML part in a Blob.
  5. Triggers create audit details
  6. Re-query the object, to fetch the timestamp audit data
So that's my data persisted.  Obviously there are similar steps involved in retrieving the data.  So what is the purpose of the database?  Well, we want to be able to deal with large volumes of data, without running out of memory.  We want to retain data once the vm stops.  We want the data to be available to all the other servers either as federated services, or as multiple different services with some overlap in data requirements.  We want to be able to report on the information.  We want to be able to query the data to pull out just the results we are interested in.  All the usual reasons.

Given the time critical nature of the system though, performance is the biggest priority.  If you've got to run thousands of simulations across a complex trade portfolio in order to feed back to customers waiting to find out price for the products they have requested, the faster you can provide results, the more business you can accept.  After analysing where the bottlenecks are, as per usual the database is the slow part.  Now, this isn't intended to be a database rant.  When I say 'It's the database thats the bottleneck', I'm actually covering far more than the database.  The whole object marshalling, XML generation, Hibernate overheads, sub-optimal queries generated by a mis-match between a generic tool (hibernate) and a specific database setup, network overheads, poorly indexed or structured schemas ... the list goes on.  It's all persistence layer stuff.

So I want to remove the overhead of all that object relational mapping stuff, but retain the benefits of a relational database.  I want to be able to use external reporting tools which use sql and expect relational data back.  I want to be able to query data and just retrieve the subset I ask for, using all the usual joins and sub-queries.  I just don't want to have to marshal the objects to and from the database.  I don't want the additional layer in my primary data owning system which is causing so much overhead.

Okay, that's the wants, now where's my solution?  From a storing and retrieving data perspective, I really just want a Map which I can throw my objects in and forget about them, and a filter approach for pulling the matching records out again.  Nothing new, strange or unusual about that.  We can do that with Coherence, or Terracotta.  There's a whole heap more, but in my little corner of the world, they are the main ones.  These are great technologies for eliminating the relational part, but they don't support a SQL like approach.

Is there an existing product that allows me to store my objects in a distributed cache, and permits me to query that cache with SQL?  Sort of like a JDBC driver that reads from Terracotta?  I think that would be the ideal solution.  That allows me to have my relational representation of the data, and not have to marshal to and from the DB all the time.

Any suggestions??