GC Tuning

Tuning Garbage Collection is no different from any other performance-tuning activity. Instead of jumping straight in to tweak random parts of the application, you need to make sure you understand the current situation and the desired outcome. In general, it is as easy as following this process:

  1. State your performance goals
  2. Run tests
  3. Measure
  4. Compare to goals
  5. Make a change and go back to running tests

So, as a first step, we need to set clear performance goals with regard to Garbage Collection. The goals fall into three categories, common to all performance monitoring and management topics:

  • Latency
  • Throughput
  • Capacity

After explaining the concepts in general, we will apply them in the context of Garbage Collection, so if you are already familiar with the terms, you may decide to skip this part.

Throughput vs Latency vs Capacity

Let us start with an example from manufacturing to illustrate the concepts. For this, we monitor an assembly line in action. The line assembles bikes from ready-made components, one step after another in a sequential manner. Walking along the line, we measure that it takes four hours to complete a bike, from the start, where the frame enters the line, to the finish, when the packaging is concluded.

Measuring the very same assembly line in action we can also see that one completed bike is ready after each minute, 24 hours a day, every day. Simplifying the example and ignoring maintenance windows etc. we can forecast that in any given day such an assembly line is going to assemble 1,440 bikes.

Those two measurements give us crucial information about the current performance of the system measured in latency and throughput:

  • Latency of the system: 4 hours
  • Throughput of the system: 1,440 bikes/day

Note that latency is measured in time units suitable for the task, so whether you use millennia, hours or milliseconds is up to you. Throughput of a system is measured in completed operations per time unit. Operations can be anything relevant to the specific system. In this example the operations were the assembled bikes.

Business has been steady, and the supply and demand for bikes have been in balance for a while. Let’s now imagine a situation where the sales team has been successful and the demand for the bikes suddenly doubles. The call from the sales team to the factory immediately raises concerns, as the performance of the factory is no longer in line with the demand.

It is important to notice that the four hours it takes to complete a single bike is nothing to worry about, but the total number of bikes produced per day is going to be an issue. So, being in the shoes of the business owner, we would immediately take steps to build and set up additional capacity.

After finishing the set-up we now have another assembly line ready to be used. The factory now has two equivalent assembly lines both shipping the very same bike every minute every day. By doing this, our imaginary factory has doubled the number of bikes produced per day. Instead of the 1,440 bikes the factory is now capable of shipping 2,880 bikes each day. It is important to note that we have not reduced the time to complete an individual bike by even a millisecond – it still takes 4 hours to complete a bike from the start to finish.
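The arithmetic behind the factory example can be sketched in a few lines of Java. The class and method names below are hypothetical and exist only for illustration; the numbers are the ones used in the text.

```java
// Illustrative sketch only: AssemblyPlant and its methods are hypothetical names.
final class AssemblyPlant {

    // Bikes completed per day by `lines` identical assembly lines,
    // each finishing one bike per minute around the clock.
    static long bikesPerDay(int lines) {
        long minutesPerDay = 24L * 60L;
        return lines * minutesPerDay;
    }

    // Time needed to build a single bike, in hours.
    // Adding lines in parallel does not change it.
    static int latencyHours(int lines) {
        return 4;
    }

    public static void main(String[] args) {
        for (int lines = 1; lines <= 2; lines++) {
            System.out.println(lines + " line(s): " + bikesPerDay(lines)
                    + " bikes/day, latency " + latencyHours(lines) + " h");
        }
    }
}
```

Throughput scales with the number of lines, while latency stays constant because each bike still passes through every step of one line.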

In the example above a performance optimization task was carried out, coincidentally impacting both throughput and capacity. As in any good example we started by measuring the system’s current performance then set a new target and optimized the system only in aspects required to meet the target.

In this example an important decision was made: the focus was on increasing throughput, not on reducing latency. While increasing the throughput, we also needed to increase the capacity of the system: instead of a single assembly line we now needed two assembly lines to produce the required quantity. So in this case the added throughput was not free; we needed to scale out the solution in order to meet the increased throughput need we were facing.

GC Tuning in practice

Now that we have covered the three dimensions of performance tuning, we can start investigating what setting and hitting such goals looks like in practice.

For this purpose, let’s take a look at some example code:

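import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class Producer implements Runnable {

  // Two threads, one per scheduled job
  private static final ScheduledExecutorService executorService = Executors.newScheduledThreadPool(2);

  private final Deque<byte[]> deque;
  private final int objectSize;
  private final int queueSize;

  public Producer(int objectSize, int ttl) {
    this.deque = new ArrayDeque<>();
    this.objectSize = objectSize;
    // 1,000 objects are added per second, so ttl is the object lifespan in seconds
    this.queueSize = ttl * 1000;
  }

  @Override
  public void run() {
    for (int i = 0; i < 100; i++) {
      deque.add(new byte[objectSize]);
      // Drop the oldest object once the queue is full, making it eligible for GC
      if (deque.size() > queueSize) {
        deque.poll();
      }
    }
  }

  public static void main(String[] args) throws InterruptedException {
    // Short-lived objects: about 200 MB/s, each kept for about 5 seconds
    executorService.scheduleAtFixedRate(new Producer(200 * 1024 * 1024 / 1000, 5), 0, 100, TimeUnit.MILLISECONDS);
    // Long-lived objects: about 50 MB/s, each kept for about 120 seconds
    executorService.scheduleAtFixedRate(new Producer(50 * 1024 * 1024 / 1000, 120), 0, 100, TimeUnit.MILLISECONDS);
    TimeUnit.MINUTES.sleep(10);
    executorService.shutdownNow();
  }
}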

The code submits two jobs to run every 100 ms. Each job emulates objects with a specific lifespan: it creates objects, lets them live for a predetermined amount of time and then forgets about them, allowing GC to reclaim the memory.
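To get a feeling for the load this places on the heap, the allocation rate and the retained data of each job can be estimated from the constructor arguments. The following is a rough sketch: WorkloadEstimate is a hypothetical helper that is not part of the original example, and it ignores per-array header overhead.

```java
// Back-of-the-envelope estimate of what each Producer allocates and keeps alive.
// WorkloadEstimate is a hypothetical helper; array headers are ignored.
final class WorkloadEstimate {

    static final int OBJECTS_PER_RUN = 100;  // loop count in Producer.run()
    static final int RUNS_PER_SECOND = 10;   // each job is scheduled every 100 ms

    // Bytes allocated per second by one job.
    static long allocationRate(int objectSize) {
        return (long) objectSize * OBJECTS_PER_RUN * RUNS_PER_SECOND;
    }

    // Bytes retained by one job once its deque holds ttl * 1000 objects.
    static long liveSet(int objectSize, int ttl) {
        return (long) objectSize * ttl * 1000L;
    }

    public static void main(String[] args) {
        // Same expressions as in Producer.main()
        int shortLived = 200 * 1024 * 1024 / 1000;
        int longLived = 50 * 1024 * 1024 / 1000;

        System.out.println("short-lived job: " + allocationRate(shortLived)
                + " B/s allocated, " + liveSet(shortLived, 5) + " B retained");
        System.out.println("long-lived job:  " + allocationRate(longLived)
                + " B/s allocated, " + liveSet(longLived, 120) + " B retained");
        System.out.println("total retained:  "
                + (liveSet(shortLived, 5) + liveSet(longLived, 120)) + " B");
    }
}
```

Under these assumptions the two jobs together allocate roughly 262 MB per second and, once both queues are full, retain about 7.3 GB (roughly 6.8 GiB), which suggests why the heap size plays such a large role in the measurements that follow.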

When running the example with GC logging turned on with the following parameters

-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps

we start seeing the impact of GC immediately in the log files, similar to the following:

2015-06-04T13:34:16.119-0200: 1.723: [GC (Allocation Failure) [PSYoungGen: 114016K->73191K(234496K)] 421540K->421269K(745984K), 0.0858176 secs] [Times: user=0.04 sys=0.06, real=0.09 secs] 
2015-06-04T13:34:16.738-0200: 2.342: [GC (Allocation Failure) [PSYoungGen: 234462K->93677K(254976K)] 582540K->593275K(766464K), 0.2357086 secs] [Times: user=0.11 sys=0.14, real=0.24 secs] 
2015-06-04T13:34:16.974-0200: 2.578: [Full GC (Ergonomics) [PSYoungGen: 93677K->70109K(254976K)] [ParOldGen: 499597K->511230K(761856K)] 593275K->581339K(1016832K), [Metaspace: 2936K->2936K(1056768K)], 0.0713174 secs] [Times: user=0.21 sys=0.02, real=0.07 secs] 

Based on the information in the log, we can start improving the situation with three different goals in mind:

  1. Making sure the worst-case GC pause does not exceed a predetermined threshold
  2. Making sure the total time during which application threads are stopped does not exceed a predetermined threshold
  3. Reducing infrastructure costs while making sure we can still achieve reasonable latency and/or throughput targets.

For this, the code above was run for 10 minutes on three different configurations resulting in three very different results summarized in the following table:

Heap      GC Algorithm              Useful work   Longest pause
-Xmx12g   -XX:+UseConcMarkSweepGC   89.8%         560 ms
-Xmx12g   -XX:+UseParallelGC        91.5%         1,104 ms
-Xmx8g    -XX:+UseConcMarkSweepGC   66.3%         1,610 ms

The experiment ran the same code with different GC algorithms and different heap sizes to measure the duration of garbage collection pauses with regard to latency and throughput. Details of the experiments and interpretation of results are given in the following sections.

Note that in order to keep the example as simple as possible, only a limited number of input parameters were changed. For example, the experiments do not test a different number of cores or a different heap layout.

Tuning for Latency

Latency goals are typically expressed in a form similar to any of the following:

  • All user transactions must respond in less than 10 seconds
  • 90% of the invoice payments must be carried out in under 3 seconds
  • Recommended products must be rendered on the purchase screen in less than 100 ms.

When facing performance goals expressed similarly to any of the above, we need to make sure that the duration of an individual GC pause does not contribute too much to exceeding the goal.

To achieve this, we should focus on the relevant part of the GC logs, namely the duration of GC pauses. This information is present in the log snippets in various locations, so let’s look into which parts of the date/time data are actually relevant:

2015-06-04T13:34:16.974-0200: 2.578: [Full GC (Ergonomics) [PSYoungGen: 93677K->70109K(254976K)] [ParOldGen: 499597K->511230K(761856K)] 593275K->581339K(1016832K), [Metaspace: 2936K->2936K(1056768K)], 0.0713174 secs] [Times: user=0.21 sys=0.02, real=0.07 secs]

From the above we can see that the GC event was triggered at 13:34:16 on June 4, 2015, just 2,578 ms after the JVM was started.

The event stopped the application threads for 0.0713174 seconds. In greater detail: it took 210 ms of CPU time across multiple cores in user mode and 20 ms in system (kernel) mode, and stopped the application threads for a total of 70 ms (71.3174 ms to be precise).

Having this information in our hands, we can compare it to the requirement at hand, which can for example be derived from “All jobs must be processed in under 1,000 ms”. Knowing that the actual job processing takes just 100 ms, we can translate this requirement into the context of Garbage Collection pauses, namely “No GC pause can stop the application threads for longer than 900 ms”. Verifying this is easy: one would just need to parse the log files and find the maximum pause time for an individual GC pause.
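Finding the longest pause in a log can be sketched as follows. This is a minimal example that assumes the ParallelGC log format shown in this section, where the pause duration appears as ", N secs]" at the end of each event; MaxPauseFinder is a hypothetical name, and logs produced by other collectors or JVM versions would likely need a different pattern.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds the longest GC pause in a log of the format shown above.
final class MaxPauseFinder {

    // Matches the event duration, e.g. ", 0.0713174 secs]".
    // The "real=0.07 secs]" part is not matched because it is preceded by '='.
    private static final Pattern PAUSE = Pattern.compile(", (\\d+\\.\\d+) secs\\]");

    // Returns the longest pause in milliseconds, or 0.0 if no event was found.
    static double maxPauseMillis(List<String> logLines) {
        double max = 0.0;
        for (String line : logLines) {
            Matcher m = PAUSE.matcher(line);
            if (m.find()) {
                max = Math.max(max, Double.parseDouble(m.group(1)) * 1000.0);
            }
        }
        return max;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java MaxPauseFinder <gc-log-file>");
            return;
        }
        double worst = maxPauseMillis(Files.readAllLines(Paths.get(args[0])));
        // 900 ms is the example threshold derived in the text
        System.out.println("Longest pause: " + worst + " ms, within 900 ms: " + (worst <= 900.0));
    }
}
```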

Looking again at the three configuration options used in the test:

Heap      GC Algorithm              Useful work   Longest pause
-Xmx12g   -XX:+UseConcMarkSweepGC   89.8%         560 ms
-Xmx12g   -XX:+UseParallelGC        91.5%         1,104 ms
-Xmx8g    -XX:+UseConcMarkSweepGC   66.3%         1,610 ms

we can see that there is one configuration already matching this requirement – running the code with:

java -Xmx12g -XX:+UseConcMarkSweepGC Producer

results in a maximum GC pause of 560 ms, which stays comfortably under the 900 ms threshold set for satisfying the latency requirement. If neither the throughput nor the capacity requirements are violated, we can conclude that we have fulfilled our GC tuning task and can finish the tuning exercise.

Tuning for Throughput

Throughput requirements are different from latency requirements. Looking at a couple of examples, we can immediately see why:

  • The solution must be able to process 1,000,000 invoices/day
  • The solution must support 1,000 authenticated users each invoking one of the functions A, B or C every five to ten seconds
  • Weekly statistics for all customers have to be composed in no more than six hours each Sunday night between 12 AM and 6 AM

Instead of setting requirements for a single operation, these requirements specify how many operations the system must process in a given time unit.

Again, to verify whether or not GC poses a threat to fulfilling the goal, we need to take a look at the GC logs:

2015-06-04T13:34:16.974-0200: 2.578: [Full GC (Ergonomics) [PSYoungGen: 93677K->70109K(254976K)] [ParOldGen: 499597K->511230K(761856K)] 593275K->581339K(1016832K), [Metaspace: 2936K->2936K(1056768K)], 0.0713174 secs] [Times: user=0.21 sys=0.02, real=0.07 secs]

This time we are interested in user and system times instead of real time, so in the current case we should focus on the 210 + 20 ms during which the particular GC pause kept the CPUs busy. After parsing the logs and aggregating the numbers, taking the number of cores into account, we can again start answering the question.
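The aggregation can be sketched along the following lines. GcCpuTime is a hypothetical helper that assumes the "[Times: user=... sys=..., real=... secs]" layout shown above; the sample data in main is taken from the three log lines quoted earlier.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: sums the CPU time reported in the [Times: ...] part of each GC event.
final class GcCpuTime {

    private static final Pattern TIMES = Pattern.compile("user=(\\d+\\.\\d+) sys=(\\d+\\.\\d+)");

    // Total user + sys CPU seconds spent in GC across all log lines.
    static double gcCpuSeconds(List<String> logLines) {
        double total = 0.0;
        for (String line : logLines) {
            Matcher m = TIMES.matcher(line);
            if (m.find()) {
                total += Double.parseDouble(m.group(1)) + Double.parseDouble(m.group(2));
            }
        }
        return total;
    }

    // Percentage of the machine's total CPU capacity consumed by GC,
    // given the wall-clock duration of the run and the number of cores.
    static double gcCpuPercent(double gcCpuSeconds, double elapsedSeconds, int cores) {
        return 100.0 * gcCpuSeconds / (elapsedSeconds * cores);
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "[Times: user=0.04 sys=0.06, real=0.09 secs]",
                "[Times: user=0.11 sys=0.14, real=0.24 secs]",
                "[Times: user=0.21 sys=0.02, real=0.07 secs]");
        System.out.println("GC CPU seconds in sample: " + gcCpuSeconds(sample));
    }
}
```

Applied to a complete log, gcCpuPercent relates the summed CPU time to the CPU time available during the whole run; the remainder corresponds to the "useful work" column in the table.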

In this particular case: if processing a single job takes 1 ms of CPU time and our goal is to process 13,000,000 jobs/hour on a four-core machine, the example configurations again give us one configuration where the requirement is fulfilled:

Heap      GC Algorithm              Useful work   Longest pause
-Xmx12g   -XX:+UseConcMarkSweepGC   89.8%         560 ms
-Xmx12g   -XX:+UseParallelGC        91.5%         1,104 ms
-Xmx8g    -XX:+UseConcMarkSweepGC   66.3%         1,610 ms

Running this configuration as:

java -Xmx12g -XX:+UseParallelGC Producer

we can see that the CPUs are blocked by GC for 8.5% of the time, leaving 91.5% of the computing power for useful work. Now taking into account that

  1. One job is processed in 1 ms by a single core
  2. In one minute, one core could thus process 60,000 jobs
  3. In one hour, a single core could thus process 3.6 M jobs
  4. Four cores could thus process 14.4 M jobs in an hour

we can make a simple calculation and conclude that in one hour we can in reality process 91.5% of the theoretical maximum of 14.4 M jobs, resulting in 13,176,000 processed jobs/hour and fulfilling our requirement.
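The same calculation can be written down as a small sketch. ThroughputEstimate is a hypothetical name, and the example assumes 1 ms of CPU time per job, which is the figure implied by 60,000 jobs per minute per core.

```java
// Hypothetical helper reproducing the throughput arithmetic from the text.
final class ThroughputEstimate {

    // Jobs per hour the machine can complete when only `usefulWorkPercent`
    // of the CPU time is left for the application.
    static long jobsPerHour(double jobCpuMillis, int cores, double usefulWorkPercent) {
        double millisPerHour = 60.0 * 60.0 * 1000.0;
        double theoreticalMax = millisPerHour / jobCpuMillis * cores;
        return Math.round(theoreticalMax * usefulWorkPercent / 100.0);
    }

    public static void main(String[] args) {
        long goal = 13_000_000L;                       // requirement from the text
        long achieved = jobsPerHour(1.0, 4, 91.5);     // -Xmx12g -XX:+UseParallelGC
        System.out.println(achieved + " jobs/hour, goal met: " + (achieved >= goal));
    }
}
```

Passing the useful-work percentage of another configuration shows how quickly the margin disappears as GC overhead grows.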

It is important to note that if we simultaneously needed to fulfill the latency requirement set in the previous section, we would be in trouble: the worst-case latency of this configuration is roughly twice that of the previous one, as the longest GC pause on record blocked the application threads for 1,104 ms.

Tuning for Capacity

Capacity goals add icing to the cake, setting additional constraints on the environment in which the throughput and latency goals must be met. These requirements might be expressed either in terms of computing resources or in cold hard cash:

  • The system must be deployed on Android devices with less than 512MB of memory.
  • The system must be deployed on Amazon EC2 instances. Maximum required instance size must not exceed c3.xlarge (8G, 4 cores) configuration.
  • The monthly invoice from Amazon EC2 for running the system must not exceed $12,000

Capacity is thus a dimension to be taken into account when fulfilling the latency and throughput requirements. With unlimited computing power available, all kinds of latency and throughput targets could be met, yet in the real world budgets and other constraints tend to limit the resources one can use.

If in our case we needed to deploy the solution on Amazon EC2 and c3.xlarge was the largest instance allowed, we would need to turn our eyes to the third configuration the test was run on:

Heap      GC Algorithm              Useful work   Longest pause
-Xmx12g   -XX:+UseConcMarkSweepGC   89.8%         560 ms
-Xmx12g   -XX:+UseParallelGC        91.5%         1,104 ms
-Xmx8g    -XX:+UseConcMarkSweepGC   66.3%         1,610 ms

The application is able to run on this configuration as

java -Xmx8g -XX:+UseConcMarkSweepGC Producer

but both the latency and especially throughput numbers fall drastically:

  • Instead of the 91.5% of the time CPUs are available for useful work in the best case, this configuration leaves only 66.3% of the CPU time for useful work. This means that the throughput on this configuration would drop from the best-case scenario of 13,176,000 jobs/hour to a meager 9,547,200 jobs/hour
  • Instead of the 560 ms we are now facing 1,610 ms of added latency in the worst case.

Walking through the three dimensions it is indeed obvious that you cannot just optimize for “performance” but instead you need to think in three different dimensions, measuring and tuning both latency and throughput, and taking the capacity constraints into account.
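As a closing illustration, the three measured configurations can be checked against one example goal per dimension. This is only a sketch: GoalCheck and Config are hypothetical names, the thresholds (900 ms, 13,000,000 jobs/hour, 8 GB) are the example goals used in the sections above, and the theoretical maximum of 14.4 M jobs/hour again assumes four cores and 1 ms of CPU time per job.

```java
// Illustrative check of the measured configurations against one goal per dimension.
// All names and thresholds are assumptions assembled from the examples in the text.
final class GoalCheck {

    static final class Config {
        final String flags;
        final int heapGb;
        final double usefulWorkPercent;
        final int longestPauseMs;

        Config(String flags, int heapGb, double usefulWorkPercent, int longestPauseMs) {
            this.flags = flags;
            this.heapGb = heapGb;
            this.usefulWorkPercent = usefulWorkPercent;
            this.longestPauseMs = longestPauseMs;
        }
    }

    static final int MAX_PAUSE_MS = 900;                        // latency goal
    static final long MIN_JOBS_PER_HOUR = 13_000_000L;          // throughput goal
    static final int MAX_HEAP_GB = 8;                           // capacity goal
    static final long THEORETICAL_JOBS_PER_HOUR = 14_400_000L;  // 4 cores, 1 ms per job (assumed)

    static long jobsPerHour(Config c) {
        return Math.round(THEORETICAL_JOBS_PER_HOUR * c.usefulWorkPercent / 100.0);
    }

    static boolean meetsLatency(Config c)    { return c.longestPauseMs <= MAX_PAUSE_MS; }
    static boolean meetsThroughput(Config c) { return jobsPerHour(c) >= MIN_JOBS_PER_HOUR; }
    static boolean meetsCapacity(Config c)   { return c.heapGb <= MAX_HEAP_GB; }

    // The three rows of the results table.
    static Config[] measured() {
        return new Config[] {
            new Config("-Xmx12g -XX:+UseConcMarkSweepGC", 12, 89.8, 560),
            new Config("-Xmx12g -XX:+UseParallelGC", 12, 91.5, 1104),
            new Config("-Xmx8g -XX:+UseConcMarkSweepGC", 8, 66.3, 1610)
        };
    }

    public static void main(String[] args) {
        for (Config c : measured()) {
            System.out.println(c.flags
                    + " latency=" + meetsLatency(c)
                    + " throughput=" + meetsThroughput(c)
                    + " capacity=" + meetsCapacity(c));
        }
    }
}
```

With these particular thresholds, each configuration satisfies exactly one of the three goals, which is the kind of trade-off the tuning process has to resolve.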
