http://highscalability.com/sharding-hibernate-way
To scale you are supposed to partition your data. Sounds good, but how do you do it? When you actually sit down to work out all the details it’s not that easy. Hibernate Shards to the rescue! Hibernate Shards is an extension to the core Hibernate product that adds facilities for horizontal partitioning. If you know the core Hibernate API you know the Shards API. No learning curve at all. Here is what a few members of the core group had to say about the Hibernate Shards open source project. Although there are some limitations, from the sound of it they are doing useful stuff in the right way and it’s very much worth looking at, especially if you use Hibernate or some other ORM layer.
Information Sources
Google Developer Podcast Episode Six: The Hibernate Shards Open Source Project. This is the document summarized here.
Hibernate Shards Project Page
Hibernate Shards Dev Discussion Group.
Ryan Barrett’s Scaling on the Cheap presentation. Many of the lessons from here are in Hibernate Shards.
What is Hibernate Shards?
Shard: one piece of a split-up data set. If your data doesn't fit on one machine you split it up into pieces and each piece is called a shard.
Sharding: the process of splitting up data.
Sharding is used when you have too much data to fit in one single relational database. If your database has a JDBC adapter that means Hibernate can talk to it and if Hibernate can talk to it that means Hibernate Shards can talk to it.
Hibernate was chosen because it's a good ORM tool that is used internally at Google. At Google scale (really, really big), Hibernate didn’t support that much data out of the box, so sharding had to be added.
The learning curve for a Hibernate user is zero because the Hibernate API is the same. The shard implementation hasn’t violated the API (yet).
How does it compare to MySQL's horizontal partitioning? Shards is for situations where you have too much data to fit in a single database. MySQL partitioning may allow you to delay when you need to shard, but it is still a single database and you’ll eventually run into limits.
Schema Design for Shards
When sharding you have to consider the general issues of distributed data design for high data volumes. These aren’t Hibernate Shards specific issues, but are general to the problem space.
Schema design is the most important part of the sharding process and you’ll have to do it up front.
You need to pick a dimension, a root level entity, that is easily sharded. Users and customers are common examples.
Accept the fact that those entities and all the entities that hang off those entities will be stored in separate physical spaces. Querying across different shards will be difficult. As will management and just about anything else you take for granted.
How data are distributed is controlled by a pluggable strategies layer.
Plan for the future by picking a strategy that will last you a long time. Repartitioning/resharding the data is operationally very difficult. No management tools for this yet.
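To see why resharding is painful, here is a small worked example. It assumes a naive placement rule, key modulo shard count, which is only one possible strategy and is not stated in the podcast; the class and method names are hypothetical.

```java
// Hypothetical sketch: counts how many keys change shards when the number of
// shards changes, assuming placement by "key modulo shard count".
class ReshardCost {
    static int movedKeys(int keyCount, int oldShards, int newShards) {
        int moved = 0;
        for (int key = 0; key < keyCount; key++) {
            // A key must be physically moved if its shard index changes.
            if (key % oldShards != key % newShards) {
                moved++;
            }
        }
        return moved;
    }
}
```

Under that assumption, `ReshardCost.movedKeys(1000, 4, 5)` reports that 800 of 1000 keys would have to move when growing from four shards to five, which is why the choice of strategy deserves thought up front.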
Build simpler models that don't contain as many relationships because you don't have cross shard relationships. Your object graphs should be contained on one shard as much as possible.
A model with lots and lots of objects pointing to each other may not be a good candidate for sharding.
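A minimal sketch of keeping an object graph on one shard, assuming a hypothetical model in which users are the root entity and orders hang off users. The class, the `shardFor` method, and the modulo placement rule are illustrative assumptions, not part of Hibernate Shards.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: every entity that hangs off a root entity (here, a user)
// is stored on the same shard as that root, so the graph never spans shards.
class ColocatedStore {
    private final int shardCount;
    // Shard index -> rows on that shard (a stand-in for separate databases).
    private final Map<Integer, List<String>> shards = new HashMap<>();

    ColocatedStore(int shardCount) {
        this.shardCount = shardCount;
        for (int i = 0; i < shardCount; i++) {
            shards.put(i, new ArrayList<>());
        }
    }

    // Assumed placement rule: the root entity's ID decides the shard.
    int shardFor(long userId) {
        return (int) Math.floorMod(userId, (long) shardCount);
    }

    void saveUser(long userId, String name) {
        shards.get(shardFor(userId)).add("user:" + userId + ":" + name);
    }

    // An order carries its owner's ID, so it lands on the owner's shard.
    void saveOrder(long userId, String item) {
        shards.get(shardFor(userId)).add("order:" + userId + ":" + item);
    }

    List<String> rowsOnShard(int shard) {
        return shards.get(shard);
    }
}
```

Because the order is placed by its owner's ID rather than its own, loading a user together with their orders only ever touches one shard.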
Because the shards design doesn’t modify Hibernate core, you can design using shards from the start, even though you only have one database. Then when you need to start scaling it will be easier to grow.
Existing systems with shardable tables shouldn’t take very long to get up and running.
The Sharding Code’s Relationship to Hibernate
Shards doesn't have full support for Hibernate’s query interface. Hibernate offers both a criteria interface and a query interface. The criteria interface is robust, but not good for JPA (Java Persistence API), which is query based.
Sharding should work across all databases Hibernate works on since Shards is a layer that sits on top of Hibernate core, beneath the standard Hibernate interfaces. Programmers aren’t aware of it.
What they are doing is figuring out how to do standard things like save objects, update, and query objects across multiple databases using standard Hibernate interfaces. If Hibernate can talk to it they can talk to it.
A sharded session is used to contain Hibernate’s sessions so Hibernate capabilities are preserved.
Cannot manage cross shard relationships (yet). There are runtime checks to detect when cross shard relations are used accidentally. There is no foreign key constraint checking and no Hibernate lazy loading across shards. From a programming perspective you can have IDs that reference other objects on other shards, it’s just that Hibernate won’t know about these relationships.
Now that the base software is done these more advanced features can be considered. It may take changes in Hibernate core.
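The idea of a runtime check for accidental cross shard relations can be sketched as follows. This is not how Hibernate Shards implements it; the entity type, field names, and choice of exception are all assumptions made for illustration.

```java
// Hypothetical sketch of a runtime cross shard check. The entity type, fields,
// and exception are assumptions, not Hibernate Shards internals.
class ShardedEntity {
    final String name;
    final int shardId;
    ShardedEntity parent; // object reference: only allowed within one shard

    ShardedEntity(String name, int shardId) {
        this.name = name;
        this.shardId = shardId;
    }

    // Reject an association whose two ends live on different shards.
    void setParent(ShardedEntity candidate) {
        if (candidate.shardId != this.shardId) {
            throw new IllegalStateException(
                name + " (shard " + shardId + ") cannot reference "
                + candidate.name + " (shard " + candidate.shardId + ")");
        }
        this.parent = candidate;
    }
}
```

When a reference really must cross shards, the alternative described above is to store the other object's plain ID and look it up yourself, since the ORM will not track that relationship.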
Pluggable Strategies Determine How Data Are Split Across Shards
A Strategy dictates how data are spread across the shards. It’s an interface you need to implement. There are three Strategies:
* Shard Resolution Strategy - how you will retrieve your objects.
* Shard Selection Strategy – define where objects are saved to.
* Access Strategy – once you figure out which shard you are talking to, how do you want to access those shards (serially, 2 at a time, in parallel, etc)?
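The three roles can be sketched as plain Java interfaces. These are simplified, hypothetical signatures written for this summary; the real Hibernate Shards interfaces are not reproduced here. Shards are represented by integer indexes, and two possible access strategies (serial and parallel) are shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Simplified, hypothetical versions of the three strategy roles.
interface SelectionStrategy {
    int selectShardForNewObject(Object obj);    // where a new object is saved
}

interface ResolutionStrategy {
    List<Integer> resolveShards(Object id);     // which shards may hold an ID
}

interface AccessStrategy {
    // How an operation is applied to the shards that were chosen.
    <T> List<T> apply(List<Integer> shards, Function<Integer, T> operation);
}

// Visit the shards one after another.
class SerialAccess implements AccessStrategy {
    @Override
    public <T> List<T> apply(List<Integer> shards, Function<Integer, T> operation) {
        List<T> results = new ArrayList<>();
        for (int shard : shards) {
            results.add(operation.apply(shard));
        }
        return results;
    }
}

// Visit the shards concurrently; results keep the order of the shard list.
class ParallelAccess implements AccessStrategy {
    @Override
    public <T> List<T> apply(List<Integer> shards, Function<Integer, T> operation) {
        return shards.parallelStream().map(operation).collect(Collectors.toList());
    }
}
```

Keeping access separate from selection and resolution means the same query code can be switched from serial to parallel execution without touching how data are placed.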
Goal is to have Strategies as flexible as possible so you can decide how your data are sharded.
A couple of implementations are provided out of the box:
* Round Robin - First one goes to the first shard, second to the second shard, and then it loops back.
* Attribute Based – Look at attributes in the data to determine which shard. You can shard users by country, for example.
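Both approaches are easy to sketch. The classes below are hypothetical stand-ins, not the implementations that ship with Hibernate Shards; the country-to-shard map and the default shard are assumptions made for the example.

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical round robin selection: new objects cycle through the shards.
class RoundRobinSelector {
    private final int shardCount;
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinSelector(int shardCount) {
        this.shardCount = shardCount;
    }

    int selectShard() {
        // floorMod keeps the index non-negative even if the counter overflows.
        return Math.floorMod(next.getAndIncrement(), shardCount);
    }
}

// Hypothetical attribute based selection: a user's country picks the shard.
class CountrySelector {
    private final Map<String, Integer> shardByCountry;
    private final int defaultShard;

    CountrySelector(Map<String, Integer> shardByCountry, int defaultShard) {
        this.shardByCountry = shardByCountry;
        this.defaultShard = defaultShard;
    }

    int selectShard(String country) {
        // Countries without an explicit mapping fall back to the default shard.
        return shardByCountry.getOrDefault(country, defaultShard);
    }
}
```

Round robin spreads load evenly but gives no hint about where an object lives later, while an attribute based rule lets you find the shard again from the data itself.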
Some Limitations
Full Hibernate HQL is not yet supported (maybe it is now, but I couldn’t tell).
Distributed queries are handled by applying a standard HQL query to each shard, merging the results, and applying the filters. This all happens in the application server so using very large data sets could be a problem. It’s left to the intelligence of the developers to do the right thing to manage performance.
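A rough sketch of that fan-out-and-merge pattern, with in-memory lists standing in for the shard databases. The class and method names are hypothetical, and treating ordering and a row limit as the post-merge step is an assumption about what applying the filters involves.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical sketch of a distributed query handled in the application server.
class FanOutQuery {
    static List<Integer> topN(List<List<Integer>> shards,
                              Predicate<Integer> where,
                              int limit) {
        List<Integer> merged = new ArrayList<>();
        for (List<Integer> shard : shards) {
            // Stand-in for running the same query against one shard.
            merged.addAll(shard.stream().filter(where).collect(Collectors.toList()));
        }
        // Ordering and limits span shards, so they run after the merge,
        // in application memory.
        return merged.stream()
                .sorted(Comparator.reverseOrder())
                .limit(limit)
                .collect(Collectors.toList());
    }
}
```

Note that every matching row from every shard is held in `merged` before the limit is applied, which is exactly why very large result sets can become a problem.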
No mirroring or data replication.
No clean way to manage read only data you want on every shard for performance and referential integrity reasons. Say you have country data. It makes sense to replicate that data on each shard so all queries using that data can stay on the shard.
No handling of fail over situations, which is just like Hibernate. You could handle it in your connection pool or some other layer. It’s not considered part of the shard/OR mapping layer.
There’s a need for management tools that work across shards.
It’s possible to shard across different databases as long as you keep the same schema in each database.
Related Articles
An Unorthodox Approach to Database Design: The Coming of the Shard.