Ehcache is a cache library introduced in October 2003 with the key goal of improving performance by reducing the load on underlying resources. Ehcache is not just for general-purpose caching, however, but also for caching Hibernate (second-level cache), data access objects, security credentials, and web pages. It can also be used for SOAP and RESTful server caching, application persistence, and distributed caching.
cache: Wiktionary defines a cache as "a store of things that will be required in future, and can be retrieved rapidly." That is the nub of it. In computer science terms, a cache is a collection of temporary data which either duplicates data located elsewhere or is the result of a computation. Once in the cache, the data can be repeatedly accessed inexpensively.
cache-hit: When a data element is requested of the cache and the element exists for the given key, it is referred to as a cache hit (or simply 'hit').
cache-miss: When a data element is requested of the cache and the element does not exist for the given key, it is referred to as a cache miss (or simply 'miss').
system-of-record: The core premise of caching assumes that there is a source of truth for the data. This is often referred to as a system-of-record (SOR). The cache acts as a local copy of data retrieved from or stored to the system-of-record. This is often a traditional database, although it may be a specialized file system or some other reliable long-term storage. For the purposes of using Ehcache, the SOR is assumed to be a database.
SOR: See system-of-record.
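The terms above can be illustrated with a minimal sketch of a read-through cache. This is not Ehcache code: the `ReadThroughCache` class and the dict standing in for the database are hypothetical names chosen for the example.

```python
class ReadThroughCache:
    """Tiny in-memory cache in front of a system-of-record (SOR)."""

    def __init__(self, load_from_sor):
        self._load_from_sor = load_from_sor  # callable: key -> value
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._store:
            self.hits += 1       # cache hit: the element exists for the key
            return self._store[key]
        self.misses += 1         # cache miss: fall back to the SOR
        value = self._load_from_sor(key)
        self._store[key] = value
        return value


# A plain dict stands in for the database (the SOR) in this sketch.
database = {"user:1": "Ada", "user:2": "Grace"}
cache = ReadThroughCache(database.__getitem__)

for key in ["user:1", "user:2", "user:1", "user:1"]:
    cache.get(key)

print(cache.hits, cache.misses)  # 2 2
```

The first request for each key is a miss and is served by the SOR; every repeat request is a hit.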
While Ehcache concerns itself with Java objects, caching is used throughout computing, from CPU caches to the DNS system. Why? Because many computer systems exhibit "locality of reference": data that is near other data or has just been used is more likely to be used again. (Translator's note: this covers temporal, spatial, and sequential locality; see http://en.wikipedia.org/wiki/Memory_locality)
Chris Anderson, of Wired Magazine, coined the term "The Long Tail" to refer to Ecommerce systems. The idea is that a small number of items may make up the bulk of sales, a small number of blogs might get the most hits, and so on. While there is a small list of popular items, there is a long tail of less popular ones.
The Long Tail is itself a vernacular term for a Power Law probability distribution. Such distributions appear not just in ecommerce, but throughout nature. One form of a Power Law distribution is the Pareto distribution, commonly known as the 80:20 rule. This phenomenon is useful for caching. If 20% of objects are used 80% of the time and a way can be found to reduce the cost of obtaining that 20%, then the system performance will improve.
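To get a feel for the 80:20 rule, the sketch below computes how much of the traffic the most popular items receive under one particular power law. It assumes, purely for illustration, a Zipf-like popularity where the item of rank k is requested with weight 1/k; the function name and the item count are hypothetical.

```python
def top_share(num_items, top_fraction, exponent=1.0):
    """Share of all requests that go to the most popular items.

    Assumption: the item of rank k (1 = most popular) is requested
    with weight 1 / k**exponent.
    """
    weights = [1.0 / rank ** exponent for rank in range(1, num_items + 1)]
    top_count = int(num_items * top_fraction)
    return sum(weights[:top_count]) / sum(weights)


share = top_share(num_items=1000, top_fraction=0.2)
print(f"top 20% of items receive {share:.0%} of requests")
```

With these assumed parameters the top 20% of items receive roughly 79% of requests, which is why caching only the popular items can pay off.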
The short answer is that it often does, due to the effects noted above.
The medium answer is that it often depends on whether the application is CPU bound or I/O bound. If an application is I/O bound, then the time taken to complete a computation depends principally on the rate at which data can be obtained. If it is CPU bound, then the time taken principally depends on the speed of the CPU and main memory.
While the focus for caching is on improving performance, it is also worth realizing that it reduces load. The time it takes something to complete is usually related to the expense of it. So, caching often reduces load on scarce resources.
CPU bound applications are often sped up by:
The role of caching, if there is one, is to temporarily store computations that may be reused. An example from Ehcache would be large web pages that have a high rendering cost. Another is caching of authentication status, where authentication requires cryptographic transforms.
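Storing a computation for reuse can be sketched with Python's standard `functools.lru_cache`. This is not Ehcache's page caching; `render_page` is a hypothetical stand-in for an expensive render.

```python
from functools import lru_cache


@lru_cache(maxsize=128)
def render_page(page_id):
    """Hypothetical stand-in for a page that is expensive to render."""
    return f"<html><body>page {page_id}</body></html>"


for page_id in [1, 2, 1, 1, 2]:
    render_page(page_id)

info = render_page.cache_info()
print(f"renders: {info.misses}, served from cache: {info.hits}")
# renders: 2, served from cache: 3
```

Five requests cost only two renders, because each page is computed once and then reused.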
Many applications are I/O bound, either by disk or network operations. In the case of databases they can be limited by both.
There is no Moore's law for hard disks. A 10,000 RPM disk was fast 10 years ago and is still fast. Hard disks are speeding up mainly by caching blocks in their own memory.
Network operations can be bound by a number of factors:
The caching of data can often help a lot with I/O bound applications. Some examples of Ehcache uses are:
The flip side of increased performance is increased scalability. Say you have a database which can do 100 expensive queries per second. Beyond that rate it backs up, and if more connections are added to it, it slowly dies.
In this case, caching may be able to reduce the workload required. If caching turns 90 of those 100 queries into cache hits that never reach the database, then the database can scale 10 times higher than otherwise.
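The arithmetic behind the 10 times figure can be written out as a short sketch. The function name is hypothetical, and the model assumes every cache miss costs exactly one database query.

```python
def max_request_rate(db_queries_per_second, hit_ratio):
    """Request rate the system sustains when only misses reach the database.

    Assumption: each miss costs exactly one database query.
    """
    if not 0 <= hit_ratio < 1:
        raise ValueError("hit_ratio must be in [0, 1)")
    return db_queries_per_second / (1 - hit_ratio)


capacity = max_request_rate(100, 0.9)
print(f"{capacity:.0f} requests per second")  # 1000 requests per second
```

With a 90% hit ratio only one request in ten reaches the database, so 100 queries per second of database capacity serves about 1000 requests per second.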
The short answer is that it depends on a multitude of factors:
In applications that are I/O bound, which is most business applications, most of the response time is getting data from a database. Therefore the speed up mostly depends on how much reuse a piece of data gets.
In a system where each piece of data is used just once, it is zero. In a system where data is reused a lot, the speed up is large.
The long answer, unfortunately, is complicated and mathematical. It is considered next.
Amdahl's law, after Gene Amdahl, is used to find the system speed up from a speed up in part of the system.
1 / ((1 - Proportion Sped Up) + Proportion Sped Up / Speed up)
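The formula translates directly into a small function. The name `amdahl_speedup` and the sample inputs are chosen for illustration only.

```python
def amdahl_speedup(proportion_sped_up, speedup):
    """System speedup when `proportion_sped_up` of the work runs `speedup` times faster."""
    return 1.0 / ((1.0 - proportion_sped_up) + proportion_sped_up / speedup)


# Illustrative input: half of the work made 10 times faster.
print(f"{amdahl_speedup(0.5, 10):.2f}")  # 1.82
```

Even though half of the work became 10 times faster, the untouched half limits the overall gain to less than 2 times.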
The following examples show how to apply Amdahl's law to common situations. In the interests of simplicity, we assume:
A Hibernate Session.load() for a single object is about 1000 times faster from cache than from a database.
A typical Hibernate query will return a list of IDs from the database, and then attempt to load each. If Session.iterate() is used Hibernate goes back to the database to load each object.
Imagine a scenario where we execute a query against the database which returns a hundred IDs and then load each one. The query takes 20% of the time and the roundtrip loading takes the rest (80%). Within that loading, the database query itself accounts for 75% of the time. The proportion being sped up is thus 60% (75% * 80%).
The expected system speedup is thus:
1 / ((1 - .6) + .6 / 1000) = 1 / (.4 + .0006) = 2.5 times system speedup
An observed speed up from caching a web page is 1000 times. Ehcache can retrieve a page from its SimplePageCachingFilter in a few ms.
Because the web page is the end result of a computation, it has a proportion of 100%.
The expected system speedup is thus:
1 / ((1 - 1) + 1 / 1000) = 1 / (0 + .001) = 1000 times system speedup
Caching the entire page is a big win. Sometimes the liveness requirements vary in different parts of the page. Here the SimplePageFragmentCachingFilter can be used.
Let's say we have a 1000 fold improvement on a page fragment that takes 40% of the page render time.
The expected system speedup is thus:
1 / ((1 - .4) + .4 / 1000) = 1 / (.6 + .0004) = 1.67 times system speedup
In real life cache entries do not live forever. Some examples that come close are "static" web pages or fragments of same, like page footers, and in the database realm, reference data, such as the currencies in the world.
Factors which affect the efficiency of a cache are:
Ehcache keeps these statistics for each Cache and each element, so they can be measured directly rather than estimated.
Also in real life, we generally do not find a single server. Assume a round robin load balancer where each hit goes to the next server. The cache has one entry which has a variable lifespan of requests, say caused by a time to live. The following table shows how that lifespan can affect hits and misses.
Server 1   Server 2   Server 3   Server 4
M          M          M          M
H          H          H          H
H          H          H          H
H          H          H          H
H          H          H          H
...        ...        ...        ...
The cache hit ratios for the system as a whole are as follows:
Entry Lifespan   Hit Ratio   Hit Ratio   Hit Ratio   Hit Ratio
in Hits          1 Server    2 Servers   3 Servers   4 Servers
2                1/2         0/2         0/2         0/2
4                3/4         2/4         1/4         0/4
10               9/10        8/10        7/10        6/10
20               19/20       18/20       17/20       16/20
50               49/50       48/50       47/50       46/50
The efficiency of a cluster of standalone caches is generally:
(Lifespan in requests - Number of Standalone Caches) / Lifespan in requests
Where the lifespan is large relative to the number of standalone caches, cache efficiency is not much affected. However when the lifespan is short, cache efficiency is dramatically affected. (To solve this problem, Ehcache supports distributed caching, where an entry put in a local cache is also propagated to other servers in the cluster.)
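The round robin scenario can be checked with a small simulation. The sketch below assumes one entry, standalone caches that start empty, and strict round robin dispatch; the function names are hypothetical. It also floors the formula at zero, since the hit ratio cannot be negative when there are more servers than requests.

```python
from fractions import Fraction


def simulated_hit_ratio(lifespan_in_requests, num_servers):
    """Hit ratio for one entry that is requested `lifespan_in_requests` times.

    Requests are dealt round robin to servers that each hold a standalone
    cache, so the first request reaching each server is a miss.
    """
    warmed = set()  # servers whose cache already holds the entry
    hits = 0
    for request in range(lifespan_in_requests):
        server = request % num_servers
        if server in warmed:
            hits += 1
        else:
            warmed.add(server)
    return Fraction(hits, lifespan_in_requests)


def formula_hit_ratio(lifespan_in_requests, num_servers):
    """The cluster efficiency formula from the text, floored at zero."""
    return Fraction(max(lifespan_in_requests - num_servers, 0), lifespan_in_requests)


# The simulation agrees with the formula for every cell of the table.
for lifespan in (2, 4, 10, 20, 50):
    for servers in (1, 2, 3, 4):
        assert simulated_hit_ratio(lifespan, servers) == formula_hit_ratio(lifespan, servers)

print(simulated_hit_ratio(20, 4))  # 4/5
```

`Fraction` keeps the ratios exact, so 16/20 is printed in its reduced form 4/5.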
From the above we now have:
1 / ((1 - Proportion Sped Up * effective cache efficiency) + (Proportion Sped Up * effective cache efficiency) / Speed up)

effective cache efficiency = (cache efficiency) * (cluster efficiency)
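As a sketch, the combined formula can be evaluated as follows. The function name is hypothetical and the inputs are illustrative values, not figures from the text.

```python
def effective_speedup(proportion_sped_up, speedup, cache_efficiency, cluster_efficiency):
    """Amdahl's law with the sped-up proportion scaled by effective cache efficiency."""
    effective_cache_efficiency = cache_efficiency * cluster_efficiency
    effective_proportion = proportion_sped_up * effective_cache_efficiency
    return 1.0 / ((1.0 - effective_proportion) + effective_proportion / speedup)


# Illustrative inputs: a whole-page cache (proportion 1.0) that is 1000 times
# faster than rendering, with 50% cache efficiency and 80% cluster efficiency.
result = effective_speedup(1.0, 1000, cache_efficiency=0.5, cluster_efficiency=0.8)
print(f"{result:.2f}")  # 1.67
```

With both efficiencies at 1.0 the function reduces to the plain Amdahl's law formula.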
Applying this to the earlier web page cache example, where we have a cache efficiency of 35%, an average entry lifespan of 10 requests, and two servers:

cache efficiency = .35
cluster efficiency = (10 - 2) / 10 = .8
effective cache efficiency = .35 * .8 = .28
1 / ((1 - 1 * .28) + 1 * .28 / 1000) = 1 / (.72 + .00028) = 1.39 times system speedup

What if, instead, the cache efficiency is 70%, a doubling of efficiency? We keep to two servers.

cache efficiency = .70
cluster efficiency = (10 - 2) / 10 = .8
effective cache efficiency = .70 * .8 = .56
1 / ((1 - 1 * .56) + 1 * .56 / 1000) = 1 / (.44 + .00056) = 2.27 times system speedup

What if, instead, the cache efficiency is 90%? We keep to two servers.

cache efficiency = .90
cluster efficiency = (10 - 2) / 10 = .8
effective cache efficiency = .9 * .8 = .72
1 / ((1 - 1 * .72) + 1 * .72 / 1000) = 1 / (.28 + .00072) = 3.56 times system speedup
Why is the reduction so dramatic? Because Amdahl's law is most sensitive to the proportion of the system that is sped up.
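This sensitivity can be seen by varying the two inputs of Amdahl's law separately. The sketch below uses illustrative proportions and speedups, not values from the text.

```python
def amdahl_speedup(proportion_sped_up, speedup):
    """System speedup when `proportion_sped_up` of the work runs `speedup` times faster."""
    return 1.0 / ((1.0 - proportion_sped_up) + proportion_sped_up / speedup)


# Rows vary the proportion sped up; columns vary the speedup (10 and 1000).
for proportion in (0.3, 0.6, 0.9):
    row = [f"{amdahl_speedup(proportion, s):6.2f}" for s in (10, 1000)]
    print(f"proportion {proportion:.1f}:", *row)
```

In this sweep a 10 fold speedup applied to 90% of the work (about 5.3 times overall) beats a 1000 fold speedup applied to 60% of the work (about 2.5 times overall).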