Horizontal Database Partitioning with Spring and Hibernate
Keywords: Database
Introduction
About a year ago we decided to scale our database horizontally - that is, partition it. We had many millions of users in the database, and we were contemplating allowing a lot more user-generated content on our site, as well as collecting much more data on what our users were doing. We had been burned many times by the vertical-scaling strategy ("buy a bigger box") - it's harder and harder to get the money for the next bigger box, you can only get one or two big boxes at a time, everything ends up on that box ("it's the only box powerful enough"), and when it crashes the entire world goes down. So we decided to partition horizontally with commodity hardware.
A MySQL consultant specializing in scalability recommended that we partition horizontally based on user: a user and all her data (profile, user-generated content, etc.) would be held in a particular partition. A global user database (GLUD) would be the key to this array of databases: GLUD would store each user's primary key and the ID of the partition that the user resided in.
So we went to work. Our initial idea was to create a Hibernate session factory for each partition. Let's say we have two user databases, user1 and user2. Then we'd have two session factories, one for each database. The services that used those databases (e.g., the ProfileService) would have one instance created for each database. Profile1Service would connect to profile1Dao, which would use user1SessionFactory. Repeat for N partitions. Calls to the service would encounter a Spring AOP interceptor that would grab the user's identifier, call the GLUD to determine which partition the user's data was in, and then route the call to the correct instance of the ProfileService.
We implemented a prototype of this, and it worked OK. Then we came across two ideas. The first was a blog post by Interface21's Mark Fisher in which he introduced the AbstractRoutingDataSource. The second was Hibernate Shards. Mark's scheme would have us create only one ProfileService, one ProfileDao, and one UserSessionFactory, and have the datasource be aware of the multiple user databases. Hibernate Shards was a project that had just been released and that worked similarly to our first idea, creating a separate session factory instance for each database.
We really wanted to use Shards, rather than write our own partitioning system. But Shards had just been released as a beta. In the end we decided not to use Shards for several reasons. First, we watched for several weeks, but there was very little activity in the Shards source code repository. We didn't feel safe betting our core infrastructure on a project that was so new and uncertain. Second, the many-session-factory strategy is inherently unscalable: you need a new Hibernate session factory for each new database partition. If you become MySpace-like successful, you'd need a hundred session factories.
Given all the literature talking about session factories being resource intensive, we weren't comfortable with that thought (this was also the rap on our initial stab at partitioning, above). Finally, looking at the initial Shards docs, it wasn't clear how one would integrate it and configure it with Spring. Spring's LocalSessionFactoryBean wouldn't work. I didn't relish the idea of digging into Spring's transaction infrastructure to build a ShardsSessionFactoryBean that properly integrated with the transaction management in Spring. So we decided to go with the routing-datasource method.
Implementation
I'll take you through how we set this up, and then what we see as the pros and cons.
First is the GLUD database. This database contains the master_user table, which has the primary key and email address of all the users in all the partitions. In truth, it contains all the uniquely constrained attributes of a user, as it's the only place where a unique database constraint can be applied to the column, but for this explanation let's assume the only unique field (other than the PK) is email.
Given a user's email address, the master_user table can be used to locate the user's primary key. Another table is the partition_map. This contains a mapping of a hash of the user's primary key (PK) to a partition id. So once you have a user's PK, you hash it, then look up the partition in the partition_map. The hash function we used was just the last three digits of the PK (that is, the PK modulo 1000), so we allocated 1000 virtual partitions. The number of physical partitions can be any number from 1 to 1000 in this scheme. For example, you could map partitions 000-499 to user database 1, and 500-999 to user database 2 if you only used two partitions (or you could go even/odd, or whatever). The point is that once you have the user's PK or email, you can determine the partition id of the database that has her data.
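The lookup described above can be sketched in a few lines of plain Java. This is only an illustration under assumptions: PartitionLocator and its method names are hypothetical, the partition_map table is replaced by an in-memory TreeMap of range starts, and primary keys are assumed to be non-negative.

```java
import java.util.TreeMap;

// Hypothetical helper: an in-memory stand-in for the partition_map table.
final class PartitionLocator {
    static final int VIRTUAL_PARTITIONS = 1000;

    // Key: first virtual partition of a range; value: physical database id.
    private final TreeMap<Integer, Integer> rangeStartToDatabase = new TreeMap<>();

    void assignRange(int firstVirtualPartition, int databaseId) {
        rangeStartToDatabase.put(firstVirtualPartition, databaseId);
    }

    // The "hash": the last three decimal digits of a non-negative primary key.
    static int virtualPartition(long userPk) {
        return (int) (userPk % VIRTUAL_PARTITIONS);
    }

    // Finds the range whose start is at or below the virtual partition.
    // Assumes a range starting at 0 has been assigned.
    int databaseFor(long userPk) {
        return rangeStartToDatabase.floorEntry(virtualPartition(userPk)).getValue();
    }

    public static void main(String[] args) {
        PartitionLocator locator = new PartitionLocator();
        locator.assignRange(0, 1);   // virtual partitions 000-499 -> user database 1
        locator.assignRange(500, 2); // virtual partitions 500-999 -> user database 2
        System.out.println(locator.databaseFor(1234567L)); // virtual partition 567
    }
}
```

Because the physical assignment is just data, moving a block of virtual partitions to a new database means changing the map (and migrating the rows), not changing the hash.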
So who is in charge of doing this partition location calculation? We wrote a Spring AOP interceptor to wrap all the services that use the partitioned database. The interceptor was able to use the GLUD database (via an intermediary GludService) to determine which partition to route to.
The final question was, how does the interceptor know which user the current operation is associated with? Rather than rely on magic, we made a decision that the first parameter of each method call that should be partition aware would identify the user: it would be either the User object itself, or the user's PK or email.
Any of these would serve to identify the partition that the data was in. This is a leaky abstraction: the presence of the partitioning system now manifests in goofy method signatures in services that use the partitioned database. Here is how a method in the interceptor might look:
public Object selectExistingPartitionWithUser(ProceedingJoinPoint jp, LocatePreexistingUser annotation, User user)
        throws Throwable
{
    // Ask GLUD which partition holds this user's data
    GludEntry gludEntry = getGludService().getGludEntryForExistingUser(user);
    int partitionNumber = gludEntry.getDatabasePartition();
    // Publish the partition id for the current thread
    datasourceNumberCache.set(partitionNumber);
    Object returnValue = null;
    try
    {
        returnValue = jp.proceed();
    }
    finally
    {
        // Always clear the ThreadLocal so the id cannot leak into later work on this thread
        datasourceNumberCache.remove();
    }
    return returnValue;
}
Here datasourceNumberCache is a public static final ThreadLocal<Integer> that holds the partition id for the user associated with this operation. We'll see who reads this ThreadLocal a little later on.
We used the AspectJ pointcut language to describe our pointcuts. This allowed us to write type-safe method signatures for our interceptor, as you see above (no Method objects or Object[] of parameters). Also, we realized that many different types of interception would be necessary. Above we see the simplest case, that of looking up data associated with a user. But what if the user is updating her email (or other unique field held in the GLUD database)?
What if we are creating a new user? What if it's an operation that should be "broadcast" to all partitions (count all the items created by all the users in the last week)? What if we need to load all user-generated content for an indexing process, and we need to batch load? All of these operations require different methods in the interceptor. How to bind the proper methods from the interceptor to the appropriate methods on the services?
For this we used annotations. You can see the annotation instance (yes, there are such things!) being passed in by the Spring infrastructure in the above method signature. Now you are prepared to appreciate the pointcut for the method in all its glory:
@Around(value="@annotation(annotation) && args(user, ..)", argNames="annotation,user")
You can poke around the Spring docs and the AspectJ site to understand this completely, but basically it says "bind to any method that is annotated with LocatePreexistingUser and takes a User object as its first parameter". The "argNames" attribute was necessary to get the annotation and the User object passed in correctly. As I remember, this was only needed when the pointcut bound more than one argument, or something like that; I just remember that it was really difficult to get working until I stumbled across "argNames".
What's way cool about annotations used this way is that you can pass data from the annotated method to the
interceptor. For example, here's the definition of the above annotation:
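@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
public @interface LocatePreexistingUser
{
    // Which form the first method parameter takes (qualified here; a static import would also work)
    public UserIdentifier userIdentifier() default UserIdentifier.USER_OBJECT;
    // True if the method changes a uniquely constrained field held in GLUD
    public boolean userUpdate() default false;
}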
Here, UserIdentifier is an enum with values USER_OBJECT, EMAIL, and USER_PK. If you are updating one of the
uniquely constrained fields held in the GLUD database ("email", for example), you can annotate the method on your
UserProfileService like this:
@LocatePreexistingUser(userUpdate=true)
public void updateEmail(User user, String newEmail)
{ ... }
And then the interceptor can contain code like this:
if (annotation.userUpdate())
{
    // tell GLUD service to update its master_user record
}
This is really nifty. You can pass in information to the interceptor that tells it how to process invocations of that
particular method, and the annotation that specifies that information is right there at the method definition. Again, I
think this is neat.
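To show the mechanism in isolation, here is a small sketch that reads the same kind of annotation with plain reflection, without Spring or AspectJ. DemoProfileService and AnnotationInspector are hypothetical names made up for this example, and the annotation and enum are redeclared so the snippet compiles on its own; in the real system the AOP infrastructure hands the annotation instance to the interceptor.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

enum UserIdentifier { USER_OBJECT, EMAIL, USER_PK }

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface LocatePreexistingUser {
    UserIdentifier userIdentifier() default UserIdentifier.USER_OBJECT;
    boolean userUpdate() default false;
}

// Hypothetical service; only the annotations matter here.
class DemoProfileService {
    @LocatePreexistingUser(userIdentifier = UserIdentifier.EMAIL)
    public String loadProfile(String email) { return "profile:" + email; }

    @LocatePreexistingUser(userIdentifier = UserIdentifier.EMAIL, userUpdate = true)
    public void updateEmail(String oldEmail, String newEmail) { }
}

final class AnnotationInspector {
    // Describes what an interceptor could learn from the annotation alone.
    static String describe(String methodName, Class<?>... parameterTypes) {
        Method method;
        try {
            method = DemoProfileService.class.getMethod(methodName, parameterTypes);
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException("No such method: " + methodName, e);
        }
        LocatePreexistingUser a = method.getAnnotation(LocatePreexistingUser.class);
        if (a == null) {
            return methodName + ": not partition-aware";
        }
        return methodName + ": locate by " + a.userIdentifier()
                + (a.userUpdate() ? ", then update GLUD" : "");
    }

    public static void main(String[] args) {
        System.out.println(describe("loadProfile", String.class));
        System.out.println(describe("updateEmail", String.class, String.class));
    }
}
```

Note that the annotation must have RUNTIME retention for any of this to work; with the default retention it would not be visible to reflection.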
When Hibernate is ready to send some SQL to the database, it calls the datasource to get a connection. The
PartitionRoutingDataSource reads the partition id out of the ThreadLocal and returns a connection to that database. It
extends Spring's AbstractRoutingDataSource, and the operative method looks something like this:
@Override
protected Object determineCurrentLookupKey()
{
    Integer datasourceNumber = DatasourceSwitchingAspect.datasourceNumberCache.get();
    return datasourceNumber;
}
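The interplay between the interceptor and the routing datasource can be modeled without Spring. The sketch below is a toy under assumptions: RoutingSketch is a hypothetical class, data sources are represented by plain strings, and currentTarget() merely plays the role that a connection request plays in the real datasource.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Toy model: resolves a target by a per-thread lookup key.
final class RoutingSketch {
    static final ThreadLocal<Integer> PARTITION = new ThreadLocal<>();

    private final Map<Integer, String> targets = new HashMap<>();

    void addTarget(int partitionId, String dataSourceName) {
        targets.put(partitionId, dataSourceName);
    }

    // Stands in for a connection request: look up the target for this thread.
    String currentTarget() {
        Integer key = PARTITION.get();
        String target = targets.get(key);
        if (target == null) {
            throw new IllegalStateException("No data source for partition key " + key);
        }
        return target;
    }

    // Stands in for the interceptor: set the key, run the work, always clear it.
    static <T> T inPartition(int partitionId, Supplier<T> work) {
        PARTITION.set(partitionId);
        try {
            return work.get();
        } finally {
            PARTITION.remove();
        }
    }

    public static void main(String[] args) {
        RoutingSketch routing = new RoutingSketch();
        routing.addTarget(1, "user1DataSource");
        routing.addTarget(2, "user2DataSource");
        System.out.println(inPartition(2, routing::currentTarget));
    }
}
```

The finally block matters on an application server: worker threads are pooled, so a partition id left behind by one request would silently route the next request on that thread to the wrong database.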
Datasource configuration in Spring shows two normal datasources configured (for the two partition case),
user1DataSource and user2DataSource. These are standard lookups (we use JBoss's connection pools, looked up
through JNDI) pointing directly at the physical databases. However, the datasource that we feed to the (single)
Hibernate session factory is configured like this:
<bean id="userDataSource" class="PartitionRoutingDataSource">
    <property name="targetDataSources">
        <map key-type="java.lang.Integer">
            <entry key="1" value-ref="user1DataSource"/>
            <entry key="2" value-ref="user2DataSource"/>
        </map>
    </property>
</bean>
The Hibernate session factory is created in the usual way by Spring using this magical datasource:
<bean id="userSessionFactory" class="org.springframework.orm.hibernate3.LocalSessionFactoryBean">
    <property name="dataSource" ref="userDataSource"/>
    <!-- etc etc -->
</bean>
Nothing special here. Now, to enable more database partitions, one simply configures the connection pools in the
appserver, adds the datasource to Spring, and then adds the reference to the PartitionRoutingDataSource. Done! And
the nice thing is that you can handle arbitrary numbers of database partitions without creating a zillion session
factories.
There are some more interesting complications and details to worry about. For example, you want to make sure that
you've set the partition number before you open a transaction in Spring, as Hibernate sometimes aggressively grabs
a connection. In other words, you need to make sure that the "order" attribute on the DatasourceSwitchingAspect is
lower than the one on the transaction interceptor. Here, the DatasourceSwitchingAspect is set to order=1, and the
transaction interceptor is set to order=2.
<aop:config>
    <aop:pointcut id="profileServicePointcut" expression="execution(* *..ProfileService.*(..))"/>
    <aop:advisor advice-ref="userTxAdvice" pointcut-ref="profileServicePointcut" order="2"/>
</aop:config>
The "userTxAdvice" is the usual old transaction advice in Spring ("<tx:advice ...>"). The transaction manager is a
normal HibernateTransactionManager.
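Why the order values matter can be seen in a toy interceptor chain. This is not Spring code: OrderedChainSketch and its Interceptor class are hypothetical, and the only rule modeled is that the interceptor with the lower order value wraps the one with the higher value, so it runs first on the way in and last on the way out.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class OrderedChainSketch {
    // A named interceptor with an order value; lower values wrap higher ones.
    static final class Interceptor {
        final String name;
        final int order;

        Interceptor(String name, int order) {
            this.name = name;
            this.order = order;
        }
    }

    // Sorts by order, runs the nested chain, and returns the sequence of events.
    static List<String> run(List<Interceptor> interceptors) {
        List<Interceptor> sorted = new ArrayList<>(interceptors);
        sorted.sort(Comparator.comparingInt(i -> i.order));
        List<String> log = new ArrayList<>();
        invoke(sorted, 0, log);
        return log;
    }

    private static void invoke(List<Interceptor> chain, int index, List<String> log) {
        if (index == chain.size()) {
            log.add("service method");
            return;
        }
        Interceptor current = chain.get(index);
        log.add("enter " + current.name);
        try {
            invoke(chain, index + 1, log);
        } finally {
            log.add("exit " + current.name);
        }
    }

    public static void main(String[] args) {
        List<Interceptor> configured = new ArrayList<>();
        configured.add(new Interceptor("transaction", 2));
        configured.add(new Interceptor("partition routing", 1));
        run(configured).forEach(System.out::println);
    }
}
```

With the values swapped, the transaction would begin (and possibly grab a connection) before the partition id is known, which is exactly the failure the ordering is meant to prevent.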
One of the comments on Mark Fisher's blog indicated that this sort of configuration would be a mess with the
second-level cache in Hibernate unless the id spaces were kept separate. We thought of several ways to do this
(assigning a range to each database, etc.) but the DBAs really liked the idea of a high-low table in GLUD, so we
decided to implement that. Hibernate has a high-low primary key generator, but it assumes that the sequence tables
are in the same database that you are inserting into, while ours were going to be in the GLUD database. To implement
this without writing our own high-low key generator, we had to write a wrapper class for Hibernate's key generator.
The wrapper class simply grabs a session from the GLUD Hibernate session factory to send to the key generator. The
GLUD session comes from an ApplicationContextAware Singleton object (gasp!) that holds a reference to the Spring
application context and grabs the GLUD session when necessary. The gludSessionFactory couldn't be
dependency-injected via Spring because Hibernate creates the key generators under the covers at an undisclosed
location, hence the resort to the Singleton (anti)pattern.
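import java.io.Serializable;
import java.util.Properties;
import org.hibernate.HibernateException;
import org.hibernate.MappingException;
import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.Transaction;
import org.hibernate.dialect.Dialect;
import org.hibernate.engine.SessionImplementor;
import org.hibernate.id.Configurable;
import org.hibernate.id.IdentifierGenerator;
import org.hibernate.id.MultipleHiLoPerTableGenerator;
import org.hibernate.type.Type;

public class UserDbIdGenerator implements IdentifierGenerator, Configurable
{
    private MultipleHiLoPerTableGenerator generator;

    public UserDbIdGenerator()
    {
        generator = new MultipleHiLoPerTableGenerator();
    }

    public Serializable generate(SessionImplementor profileSession, Object entity) throws HibernateException
    {
        // The partition session is ignored; keys come from the GLUD database
        Session gludSession = getGludSessionFactory().openSession();
        try
        {
            Transaction txn = gludSession.beginTransaction();
            // Pass through to the wrapped id generator
            Long key = (Long) generator.generate((SessionImplementor) gludSession, entity);
            txn.commit();
            return key;
        }
        finally
        {
            // Close the GLUD session even if key generation fails
            gludSession.close();
        }
    }

    protected SessionFactory getGludSessionFactory()
    {
        return (SessionFactory) SpringContextSingleton.getInstance().getBean("gludSessionFactory");
    }

    public void configure(Type type, Properties props, Dialect dialect) throws MappingException
    {
        generator.configure(type, props, dialect);
    }
}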
In our Hibernate mapping files, now the objects need to use this id generator class:
<class name="Foo" table="foo">
    <id name="id" column="id">
        <generator class="UserDbIdGenerator">
            <param name="primary_key_value">foo</param>
            <param name="max_lo">5000</param>
        </generator>
    </id>
</class>
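For readers who haven't met high-low generators, the general idea can be sketched in plain Java. This is a simplified block-allocation model, not Hibernate's exact algorithm: HiLoSketch, SharedHiTable, and Allocator are hypothetical names, and an AtomicLong stands in for the sequence row that would live in the GLUD database.

```java
import java.util.concurrent.atomic.AtomicLong;

final class HiLoSketch {
    // Stand-in for the shared sequence row kept in one central database.
    static final class SharedHiTable {
        private final AtomicLong nextHi = new AtomicLong(0);

        long reserveNextHi() {
            return nextHi.getAndIncrement();
        }
    }

    // Each allocator reserves a whole block of ids with one trip to the shared table.
    static final class Allocator {
        private final SharedHiTable table;
        private final int blockSize;
        private long hi;
        private int lo;

        Allocator(SharedHiTable table, int blockSize) {
            this.table = table;
            this.blockSize = blockSize;
            this.lo = blockSize; // forces a reservation on first use
        }

        synchronized long nextId() {
            if (lo == blockSize) {
                hi = table.reserveNextHi(); // the only access to shared state
                lo = 0;
            }
            return hi * blockSize + lo++;
        }
    }

    public static void main(String[] args) {
        SharedHiTable table = new SharedHiTable();
        Allocator first = new Allocator(table, 3);
        Allocator second = new Allocator(table, 3);
        System.out.println(first.nextId() + " " + second.nextId() + " " + first.nextId());
    }
}
```

Since every block comes from the same table, ids handed out for different partitions never collide, and the central database is only consulted once per block rather than once per insert.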
Problems
So overall, this partitioning scheme works pretty nicely: it's in production and seems to be fairly performant. However,
if you are thinking of implementing horizontal partitioning for your application, I'd like to point out several gotchas that
you need to know about.
2nd level cache
We have had a fair number of glitches with the Hibernate second-level cache. In short, faking out a Hibernate session
factory (which believes with all its heart and soul that it's connected to a single database) to work with multiple
databases is fraught with peril. For object caching, it's generally OK, as our id space is unique and Foo#1 is only found
in one partition. However, query caching is a nightmare.
Let's say you issue a query "give me all the blog entries since last Sunday". First the query runs against partition 1, and
the results are cached in the query cache. Then the interceptor attempts to run the query against partition 2, but since the
query is cached the same result set comes back. Those objects are not in partition 2, but they are in the cache, so in
general you end up getting dupes of everything in the first partition and nothing in later partitions. Think really hard
about any query caching you attempt in this scheme.
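The failure is easy to reproduce in a toy model. The sketch below is not Hibernate's query cache; QueryCacheSketch is a hypothetical class in which a cache keyed only by the query text is blind to the current partition. The second method shows, purely within this model, that including the partition id in the key removes the collision; whether a given cache can be configured that way is a separate question that this sketch does not answer.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class QueryCacheSketch {
    // Toy "databases": partition id -> rows that the query would return there.
    private final Map<Integer, List<String>> partitions = new HashMap<>();
    private final Map<String, List<String>> cache = new HashMap<>();

    void addPartition(int id, String... rows) {
        partitions.put(id, Arrays.asList(rows));
    }

    // Keyed by query text only: the first partition to answer wins for everyone.
    List<String> queryNaive(int partitionId, String query) {
        return cache.computeIfAbsent(query, q -> partitions.get(partitionId));
    }

    // Keyed by partition id plus query text: each partition gets its own entry.
    List<String> queryPartitionAware(int partitionId, String query) {
        return cache.computeIfAbsent(partitionId + "|" + query, q -> partitions.get(partitionId));
    }

    public static void main(String[] args) {
        QueryCacheSketch db = new QueryCacheSketch();
        db.addPartition(1, "blog-a");
        db.addPartition(2, "blog-b");
        db.queryNaive(1, "recent blogs");
        System.out.println(db.queryNaive(2, "recent blogs"));          // partition 1's rows again
        db.queryPartitionAware(1, "recent blogs");
        System.out.println(db.queryPartitionAware(2, "recent blogs")); // partition 2's own rows
    }
}
```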
In general, you can be operating in a session attached to partition N, but working with objects in partition M (because
you found them in the second-level cache). If Hibernate ever decides it wants to go to the db to fill out those objects,
you are hosed, because you are attached to the wrong db.
If you went with a Shards-style solution with one session factory (and hence one second-level cache) per database,
this sort of thing would be completely eliminated.
JTA
The system of the partitioned user databases and the GLUD database really forms a unit: you don't want transactions
committing in one database and not committing in the other. If you want to wrap them in a single transaction, you
might want to use JTA. I'm not convinced JTA will work in this scheme. Imagine this scenario: you open your JTA
transaction, and then touch the Hibernate session factory and talk to database partition 1. If you make some updates,
Hibernate generally holds onto the SQL until the end of the transaction. Now, in the same JTA transaction you talk to
partition 2. I can see one of two really bad things happening: 1) Hibernate says "hey I already have a connection in this
session" and uses the partition 1 connection, or 2) when the transaction commits, the session has SQL for partition 1
and partition 2 stored up. How does it know which statements to send to which partition?
There are some scattered statements in the Hibernate documentation that lead me to think that you can configure
Hibernate to aggressively issue the SQL and release the connection on a statement-by-statement basis. However,
I've not verified this. I once attempted to set up a JTA transaction manager in our app but couldn't get it to work
(Spring's JtaTransactionManager refused to find JBoss's installed transaction manager). I only spent an hour or so
on it, and now I think I know what the problem was (duplicate jta.jar files in the classpath).
Again, in a one-session-factory-per-database style partitioning scheme, JTA should work fine.
Testing
Testing becomes painful when partitioning is involved (and this is not specific to our style of partitioning). We've
evolved two types of tests against the partitioned databases. The first are ROITs (regular old integration tests), which
test DAOs against a single partition. The second are the partitioned integration tests, which use GLUD and the partitions
together. You have to write these tests at least to test the interceptor, routing datasource, and all the XML config to make
sure it all works properly. However, it takes some creativity to set up the application context for these tests to avoid either
instantiating the entire world or writing duplicate configuration files for everything. Using DbUnit is a bit of a challenge
as well, as you generally need to insert/update data in both GLUD and the partitions (possibly several).
Shared Objects
Finally, one more pain point that you need to be aware of: objects with shared identity across partitions are a mess.
Let's say our Blog object has a Category, and it's a many-to-many relationship. If you want the Category objects in the
same database as the Blogs, they need to appear in each partitioned database. So Category(id=1, name="java")
appears in two databases. When they are loaded into the second-level cache they will fight to the death and visit pain
and suffering upon all transactions that dare to visit there. You could turn off caching for these things, turn off
optimistic locking (version), or put them in another database (GLUD?). Again, if you had separate session factories (with
the concomitant separate second-level caches) this sort of thing wouldn't hurt so bad.
Summary
I hope you found this description interesting. If you are going to partition, you might want to consider doing it this way.
However, do be aware of all the problems noted above: the second-level cache (and possibly JTA) will not quite work
correctly. The one-session-factory-per-database configuration will consume more resources (and slow down the
startup of the app) but would most likely solve these issues.
I think if I were to start it again I'd go back to the original way of multiple session factories (but god how I hate watching
them all start up when I bounce my app server). Or I'd check to see what's up with Shards these days, that might even
be better. We may yet change to one of those methods because of the issues noted above. Nevertheless, it's quite an
interesting challenge to implement partitioning with these technologies. Let me know what your experiences are!
http://www.jroller.com/kenwdelong/entry/horizontal_database_partitioning_with_spring