This is Floyd Marinescu at the Spring Experience Conference interviewing Patrick Linskey. Patrick can you tell us a bit about yourself and what you're up to?
Sure. I am the EJB Team lead at BEA and I am involved in the OpenJPA Apache Project. I'm here at the conference to give some talks about JPA and OpenJPA and the way things are going.
Tell us a bit about OpenJPA, how did it come to be?
OpenJPA is an Apache project in the incubator right now; we are working on getting it to be a full-fledged Apache project as we speak. It's a code base that came to BEA from the BEA/SolarMetric acquisition. So it came from the Kodo codebase, which has been around for about four and a half, five years solving object/relational mapping problems. We started the OpenJPA project from that code within BEA about 9 months ago, in the spring of 2006, and we got the initial code drop out there over the summer, in June or July. We have been pretty actively working on getting it fully up to speed and getting people out there, getting people using it, getting projects adopting it and whatnot.
What are the goals for OpenJPA?
The goals of the project are to be a fully compliant, compelling JPA implementation, primarily for object/relational mapping use. The project itself is a subset of what is available in BEA's Kodo product. From a BEA standpoint, we made the decision to contribute a lot of the bits in OpenJPA when we were looking at how best to work with the open source community as part of BEA's blended strategy; OpenJPA came out of that. Within the Apache community, our goals are to have a high-performance, top-notch, enterprise-grade object/relational mapping framework that people can use in any environment.
What kind of resources is BEA committing to make OpenJPA a success?
From BEA's standpoint, OpenJPA is used inside the Kodo product and also inside WebLogic Server, so obviously it's key to BEA's EJB story. There are a number of people on the BEA EJB team who are working roughly full time on OpenJPA, and other people within BEA who do performance work or other types of spot work around the company; a lot of those changes make their way back into OpenJPA as well. Outside of BEA there are also a number of other committers and people involved in the project from other companies.
How functional is OpenJPA versus the enterprise products? Is it a viable competitor to Hibernate, for example?
Absolutely. It is a fully functional, fully enterprise-grade product. I certainly think it is a wise choice if you are choosing a JPA implementation. There are a lot of high-end, enterprise-level features in there, particularly regarding memory management, scale-building, cluster-building, that we have really put a lot of energy and engineering time into over the years. I think people will be pretty pleased with the characteristics they see in it. The Kodo product focuses on a lot of the areas that BEA tends to focus a lot of its engineering efforts on. We are really trying to look at where there are complementary abilities in the app server.
What you will see there is that a lot of the things that are in the Kodo product but missing from the OpenJPA product are things that deal with automatic profiling (data profiling, not method profiling), management and monitoring capabilities in terms of JMX consoles, and some of the other features related to app server integration, scalability concerns with large deployments, and environments that really need that full app server environment. What we are looking to do is to target OpenJPA at the developer market, at the developers who are looking to take the product and build their solutions with it, and target Kodo at the enterprise market, where you are looking at not just the developers involved in the project but also the people who are going to be maintaining the project over time, giving them easy consoles rather than making them dig into code paths, and giving them easy ways to say "Find me inefficiencies!" and resolve them. Those are the types of things we are really focusing on as the differentiators.
So clearly Hibernate is the market-dominating O/R mapper today. How do you see OpenJPA comparing to Hibernate and how do you see the adoption rate for OpenJPA going, considering Hibernate?
I think that this is one of the really great things about the JPA specification: standards and specifications, when adopted by the industry, really help drive innovation and keep vendors honest. Without a spec in place, you have a variety of different implementations out there and no one ever really gets the chance to evaluate them closely, because by the time they get down their evaluation path to the point where they would like to look at performance, they have already baked in their API set. I think that JPA is really going to help make it possible for people to easily evaluate what the right O/R mapping choice for them is. And as that starts to happen, you are already seeing people looking at some of the performance characteristics and memory characteristics of OpenJPA versus products like Hibernate and really being pleased with what they are seeing. So I think that's going to be a no-brainer as the JPA groundswell continues.
What are some of the features of the JPA spec itself that you find interesting beyond simply standardizing O/R mapping?
I think there are a couple that come to mind in particular. One is that the JPA interfaces were designed to not require that a relational database be used for the backend. That has a lot of interesting implications for people who have data stored in other legacy formats and whatnot. Also, the JPA spec has done some interesting stuff with locking, adding an optimistic read-lock API which allows people to guarantee consistency in their reference data during the course of a transaction. I think we have also got some cool stuff in the APIs that we've put in place to allow, in the future, more complicated mapping support, more complicated locking scenarios and things like that. We have done some future design work in the APIs even though we didn't get to the point of standardizing some of that stuff yet.
OK. Let's talk about locking. What are some of the interesting locking support and transaction isolation features of JPA?
From a locking standpoint, we chose an optimistic locking model instead of a pessimistic model, which is new in the EJB spec; previous versions were focused on pessimistic more than optimistic. We left pessimistic support for later because of time constraints, but when we did the optimistic locking support there were some concerns from some of the vendors about consistency issues and isolation issues that would make things difficult for dealing with large amounts of data. So what we ended up doing was putting in place an optimistic read lock method call, where you can explicitly obtain read locks on particular objects (actually locks in general, either read or write locks). The most interesting one to me is the read lock. We're used to optimistic write locks, where you check the consistency at commit time; the optimistic read lock enlists the objects that you have obtained that read lock on, such that at commit time the implementation asserts that no one else has modified that data during the transaction. It doesn't actually increment any counters in the database, but it does ensure that nothing has changed, which is really cool for largely reference data, where you might be reading data based on some pricing schedule: you want to know if someone changed that data, but you don't want to actually go and mutate it yourself.
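As a rough illustration of the commit-time check described above, here is a minimal plain-Java sketch. It is a toy model, not OpenJPA's implementation and not the JPA API (in the spec, the corresponding call is `EntityManager.lock(entity, LockModeType.READ)`). The class names `VersionedStore` and `ReadLockingTransaction` are hypothetical, and a map of version numbers stands in for the database.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy stand-in for a database: maps a record id to its current version number. */
class VersionedStore {
    private final Map<String, Integer> versions = new HashMap<>();

    void put(String id) { versions.put(id, 1); }

    int versionOf(String id) { return versions.get(id); }

    /** Simulates another transaction committing an update to the record. */
    void updateByOtherTransaction(String id) { versions.merge(id, 1, Integer::sum); }
}

/** Toy transaction that remembers the versions of the records it read-locked. */
class ReadLockingTransaction {
    private final VersionedStore store;
    private final Map<String, Integer> readLocked = new HashMap<>();

    ReadLockingTransaction(VersionedStore store) { this.store = store; }

    /** Enlist a record: remember the version seen now, change nothing. */
    void readLock(String id) { readLocked.put(id, store.versionOf(id)); }

    /** At commit time, verify that every enlisted record is unchanged. */
    boolean commit() {
        for (Map.Entry<String, Integer> e : readLocked.entrySet()) {
            if (store.versionOf(e.getKey()) != e.getValue()) {
                return false; // someone else modified the record, so the commit fails
            }
        }
        return true; // versions were only compared, never incremented
    }
}
```

In this model a transaction that read-locks a pricing record commits cleanly if nobody touched it, fails if `updateByOtherTransaction` ran in between, and in neither case changes the stored version itself.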
In the past lots of vendors had read-only optimizations, a way to mark your methods as read-only. How does this work now with the optimistic locking?
Because of this optimistic read lock capability, and because of the optimistic nature of the JPA spec in general, the read-only optimizations that we saw from app server vendors in the past in the EJB world aren't really that necessary, because there are no significant differences between how the database interactions happen for a read-only entity versus a read-mostly one. If you do have read-mostly data, that just means that you are not going to update it very often, but the optimistic read lock capabilities and the optimistic checks in general mean that the database or the implementation doesn't need any extra hints about the fact that the object might be read-only.
What about the optimistic concurrency collisions that can happen for data that spans web requests? Is there any support for that in the spec?
A common paradigm in the JPA spec, and in O/R mapping in general, is to access, mutate and commit some changes, and then take those changes and make further changes over the course of several user interaction screens. The JPA spec has a few different facilities for that. From an optimistic locking standpoint, one of the more commonly used ones would probably be the detachment APIs: when you make some modifications and send them off to a different tier, those objects become detached, and when you bring them back into your persistence tier there is a method called "merge()" which allows you to reattach those records and then work with them some more. The spec allows implementations to check at merge time or at commit time to make sure that no changes happened to those objects concurrently while you were off doing your additional page changes. So it is not really a full transaction spanning multiple requests, but you do have some consistency checks across those different requests that ensure that your data is still meaningful and correct.
So how does it actually do that check? What kind of bits are being compared?
One of the restrictions in the spec is that, for portability, optimistic locking can be implemented using either a single numeric identifier that gets incremented or a single timestamp or date indicator in a class that gets reset every time an update happens. Within the confines of those two implementation mechanisms, you create a version field in your class and annotate it as "@Version". You don't maintain the version data, the implementation does, but when that data gets sent out over the wire during detachment the old version is still available, so when we recommit we have that data available to check. OpenJPA also has some other facilities for version checking and locking that are extensions of what is available in the spec, and those operate in a similar manner: at commit time or at consistency check time, we ensure that the database has the same values as what is in the object.
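The numeric-version variant of that check can be sketched in plain Java as follows. This is a simplified model under stated assumptions, not the JPA API and not OpenJPA code: `PersonRecord` and `ToyPersistenceTier` are hypothetical names, an in-memory map stands in for the database, and an `IllegalStateException` stands in for the `OptimisticLockException` a real JPA provider would raise.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy entity; the version field plays the role of an @Version-annotated field. */
class PersonRecord {
    final long id;
    String name;
    int version; // maintained by the persistence tier, not by application code

    PersonRecord(long id, String name, int version) {
        this.id = id;
        this.name = name;
        this.version = version;
    }

    PersonRecord copy() { return new PersonRecord(id, name, version); }
}

/** Toy persistence tier holding the authoritative rows. */
class ToyPersistenceTier {
    private final Map<Long, PersonRecord> rows = new HashMap<>();

    void insert(long id, String name) { rows.put(id, new PersonRecord(id, name, 1)); }

    /** Hand out a detached copy; it carries the version it was read at. */
    PersonRecord detach(long id) { return rows.get(id).copy(); }

    /** Reattach: accept the change only if the row is still at the copy's version. */
    void merge(PersonRecord detached) {
        PersonRecord current = rows.get(detached.id);
        if (current.version != detached.version) {
            throw new IllegalStateException("stale version for id " + detached.id);
        }
        current.name = detached.name;
        current.version++; // every successful update bumps the version
    }

    PersonRecord find(long id) { return rows.get(id).copy(); }
}
```

If two web requests detach the same record and both try to merge, the first merge succeeds and bumps the version, and the second is rejected because its copy still carries the old version.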
So what are some limitations of the JPA spec?
I think some of the big limitations right now are largely in the form of absences, not in the form of things that are in the spec but wrong. I think we have done a good job of keeping what's there in the spec pretty well designed. Some of the big ones, for example pessimistic locking, are not defined in the spec; we plan to come back and revise that in the future. There are also a number of more complex O/R mapping needs, like mixing inheritance models within a given hierarchy, which is not currently allowed, but we see how to do it and we understand what we need to do in order to make it happen; it is just a matter of writing up the spec work on that. In general, I think you are going to see that the spec gives you enough to cover 80 to 90-95% of your application use cases. Then you will need to go a bit outside the spec from time to time, often in a way that is very common across different vendors, using the same strategies, possibly even just using spec APIs in ways that are not quite fully nailed down in the spec itself. Typically, that extra 10 to 20% that is not really all there in the spec just yet tends to fall into the category of configuration, system configuration and mapping configuration, and looking at different mechanisms for optimizing, tuning and configuring some of the database interaction patterns you want to use at runtime.
You mentioned earlier some additional features of OpenJPA, like some performance and scalability enhancements and caching. Can you tell us more about what OpenJPA offers beyond the JPA spec?
Sure. One of the things that is missing from the JPA spec in general, and for the most part always will be, is a detailed description of how data caching should work. There is some stuff to do with transactional caching in the spec, but details about how an implementation should cache data obtained from the database between requests for performance optimization are not there in the spec right now. And that is good, because there is a lot of vendor innovation and differentiation in how caching happens. So OpenJPA provides an EntityManager factory-level data cache, which will cache data lookups between different requests. If you look up a given object by ID twice, you only actually issue SQL once, even if it is from different EntityManagers. And then, of course, that also integrates with our query cache: if you execute a query, we cache the results of that query, all the IDs involved in that query, so we can look them up directly in memory also. One of the O/R mapping goals is to get to the point where you do zero database requests for a given transaction. Obviously that is an optimistic number, and it's hard to get quite to 0, but caching is one of the ways we get there.
So, caching is one, and I think for the most part a lot of caching will stay outside the spec, although I do expect some cache control to make its way into the spec. Another OpenJPA performance and scalability-related feature is how we manage memory. We tend to be very memory-friendly because we only hold hard references to data that is "unflushed and dirty", that is, records that you have changed in the course of a given transaction but have not yet written to the database. Because there is a flush API in the JPA spec, if you periodically invoke flush we get to weaken our references to the data that no longer has unflushed changes. That means that the memory footprint of a given transaction tends to correspond to the number of objects that you have modified in the transaction; not the number of objects you access, but the number of objects you modify. And as you modify those objects, if you call flush then we can weaken some of those references too. Hand in hand with that, at flush time, when it does come time to write to the database, we know exactly which objects have been changed, so we don't need to do any state comparisons or any "dirty checking" or anything like that. We know these were the 17 objects that you changed in the course of the transaction: write them into the database, weaken references, done. So commits tend to be very fast and the memory footprint tends to be very friendly.
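The reference-handling idea can be modelled in a few lines of plain Java. This is a hedged sketch only: `ToyPersistenceContext` and its methods are hypothetical, and it assumes weak references for clean objects and an explicit dirty set, which is one plausible way to get the behaviour described, not a description of OpenJPA internals.

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy persistence context: hard references only for dirty, unflushed objects. */
class ToyPersistenceContext {
    // Clean objects are only weakly referenced, so the garbage collector may reclaim them.
    private final Map<Long, WeakReference<Object>> clean = new HashMap<>();
    // Dirty objects are strongly referenced until they are flushed.
    private final Map<Long, Object> dirty = new LinkedHashMap<>();

    /** Register an object that was read but not modified. */
    void load(long id, Object entity) { clean.put(id, new WeakReference<>(entity)); }

    /** Record a modification: the object moves to the hard-referenced dirty set. */
    void markDirty(long id, Object entity) {
        clean.remove(id);
        dirty.put(id, entity);
    }

    /** Number of objects kept alive by the context itself. */
    int hardReferenceCount() { return dirty.size(); }

    /** Write exactly the dirty objects (no state comparison), then weaken references. */
    List<Long> flush() {
        List<Long> written = new ArrayList<>(dirty.keySet());
        for (Map.Entry<Long, Object> e : dirty.entrySet()) {
            clean.put(e.getKey(), new WeakReference<>(e.getValue()));
        }
        dirty.clear();
        return written; // ids that a real implementation would write to the database
    }
}
```

The number of hard references tracks the objects modified since the last flush, not the objects read, and the flush step never has to compare object state to find out what changed.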
Tell us more about zero database requests for reading, even queries? How does that work?
It is often a hard number to hit, but if you are building a system that is read-mostly then, generally speaking, most of what's happening is reads, and you can cache that data. If you execute a query selecting all the employees who are part of a given company, for example, or who have a salary between $80,000 and $100,000, and you issue that query once and then come back and issue it again, we already have the query cached. So we know that the objects with IDs 17, 34 and 11 are in that query result, and assuming we also cached those objects when they were loaded, we can materialize that query without having to do any database interaction at all, without even necessarily getting a connection to the database. Obviously if you change data we need to write it back to the database, so as you modify data we will mutate it. In a clustered environment, if you change data on one machine, the other machines will need to reload that data at some point, so the database will definitely still be involved in most applications, but we can minimize that as much as possible.
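The interplay between a query cache that stores IDs and a data cache that stores objects can be sketched like this. It is a toy model in plain Java under stated assumptions: `ToyQueryCachingReader` is a hypothetical class, two maps stand in for the database, and a counter stands in for round trips; none of it is OpenJPA's actual cache API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy reader with a query cache (query -> ids) and a data cache (id -> row). */
class ToyQueryCachingReader {
    private final Map<String, List<Long>> queryResultsInDb; // stand-in database
    private final Map<Long, String> rowsInDb;               // stand-in database
    private final Map<String, List<Long>> queryCache = new HashMap<>();
    private final Map<Long, String> dataCache = new HashMap<>();
    int databaseHits = 0;

    ToyQueryCachingReader(Map<String, List<Long>> queryResultsInDb,
                          Map<Long, String> rowsInDb) {
        this.queryResultsInDb = queryResultsInDb;
        this.rowsInDb = rowsInDb;
    }

    List<String> run(String query) {
        List<Long> ids = queryCache.get(query);
        if (ids == null) {
            // Query cache miss: one round trip loads the ids and the rows together.
            databaseHits++;
            ids = queryResultsInDb.getOrDefault(query, new ArrayList<Long>());
            queryCache.put(query, ids);
            for (Long id : ids) {
                dataCache.put(id, rowsInDb.get(id));
            }
        }
        List<String> result = new ArrayList<>();
        for (Long id : ids) {
            String row = dataCache.get(id);
            if (row == null) {
                // Data cache miss (for example after an eviction): reload this one row.
                databaseHits++;
                row = rowsInDb.get(id);
                dataCache.put(id, row);
            }
            result.add(row);
        }
        return result;
    }

    /** Drop one object from the data cache, as a change notification would. */
    void evict(long id) { dataCache.remove(id); }
}
```

Running the same query twice costs one database hit in this model; the second run is materialized from the cached IDs and cached objects, and only an evicted object forces another trip.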
So what about clustering, what support is available there?
That would be OpenJPA's data cache. The key thing that tends to be cluster-aware in a JPA implementation is the cache, because as you make changes to objects in one VM you need to make sure that other VMs maintain consistent views of the data over time, so OpenJPA's data cache is cluster-aware. We push metadata around the network, not the data itself. That means that when you commit a transaction in which you changed a couple of objects, we notify all other VMs in the cluster that you have changed certain objects. They then just drop that data from their cache, and the next time someone requests it in those remote VMs there will be a cache miss and they will end up going to the database. The reason that's our default strategy is that if you have a cluster of, say, 50 machines and one machine changes a record, you don't want the other 49 machines to immediately hit the database to refresh their data. So usually that lazy eviction style approach is what you want, rather than metadata notification plus an immediate, centralized ping back to the store.
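The lazy-eviction strategy described above can be illustrated with a small plain-Java model. This is a sketch under stated assumptions rather than OpenJPA's remote-commit machinery: `ToyClusterNode` is a hypothetical class, direct method calls stand in for network notifications, and a shared map stands in for the database.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy cluster member with a local cache and a shared stand-in database. */
class ToyClusterNode {
    private final Map<Long, String> database; // shared by all nodes
    private final Map<Long, String> cache = new HashMap<>();
    private final List<ToyClusterNode> peers = new ArrayList<>();
    int databaseReads = 0;

    ToyClusterNode(Map<Long, String> database) { this.database = database; }

    /** Link two nodes so that each notifies the other on commit. */
    void connect(ToyClusterNode other) {
        peers.add(other);
        other.peers.add(this);
    }

    String read(long id) {
        String value = cache.get(id);
        if (value == null) {
            databaseReads++; // cache miss: go to the database
            value = database.get(id);
            cache.put(id, value);
        }
        return value;
    }

    void commit(long id, String value) {
        database.put(id, value);
        cache.put(id, value);
        for (ToyClusterNode peer : peers) {
            peer.evict(id); // only the id travels, not the changed data
        }
    }

    /** Drop the entry; it is reloaded lazily on the next read, if there is one. */
    void evict(long id) { cache.remove(id); }
}
```

A commit on one node costs the other nodes nothing until they actually read the changed record again, which is the point of evicting rather than eagerly refreshing.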
We also support plugging into more aggressive caches like Tangosol's Coherence product or GemStone's GemFire product. Those products actually ship data around the network. So when you change a person record, say to change the first name and the last name, those implementations will actually move the changed person data around the network, whereas our cache just notifies other VMs that changes have happened and that they need to go and refresh. Those caching strategies tend to be a little bit more network-friendly: they tend to optimize network interactions, minimize database access, and avoid the hub-and-spoke syndrome a little bit more fully.
What is in store for the future of OpenJPA?
I think as we are moving forward there are a number of other projects out there looking at OpenJPA and getting involved in OpenJPA. Spring is shipping with OpenJPA. I expect to see a number of other enterprise environments start to ship with OpenJPA, and as that happens I expect to see those different vendors start to chime in and say: "We need this feature, we need this feature". Well, it's open source, so get involved in the implementation of them. I think you will see even more broadening of the feature sets available in OpenJPA. I hope to see some people contributing non-relational implementations of what we call our "store manager" interface to OpenJPA as well, to get some other LDAP support or CICS mainframe support available to OpenJPA users.