Performance Tool Comparison: How LoadRunner, OpenSTA and JMeter stack up at runtime - 2

jMeter | Mercury LoadRunner | OpenSTA | performance testing

In _*The Republic*_, _Plato_ conjectured a dual-level reality, pictured in the analogy of the divided line:

!http://oregonstate.edu/instruct/phl201/images/philosophers/plato/divided_line.gif!

Above the line are the attributes of objective reality; below the line are the attributes of relative reality. This is not very different from the relationship between user-experienced times and the response times measured in engineered tests. The problem, then, is to know whether a tool, even a favourite one, tells an objective truth or a relative truth! :)

So here's the real *_masala_* in our performance testing tool shootout: how did the numbers stack up? We ran tests for *1* virtual user (vUser) over multiple iterations, then *8* virtual users over multiple iterations. We stopped at *8* because our server couldn't handle more requests without timing out - apparently *9* is a magic figure. Still, these figures are "all things being equal", i.e. the same application on the same hardware.


**1 vUser, 20 Iterations**

|_. Transaction|_. LR|_. OpenSTA|_. JMeter|
(dark). |MITOURS_01_HOME|1.152|1.16|0.466|
(dark). |MITOURS_02_LOGIN|1.239|0.7985|0.447|
(dark). |MITOURS_03_CLICK_FLIGHTS|1.598|0.6375|0.669|
(dark). |MITOURS_04_SEARCH_FLIGHT|0.4|0.172|0.223|
(dark). |MITOURS_05_SELECT_FLIGHT|0.222|0.2015|0.22|
(dark). |MITOURS_06_PURCHASE_FLIGHT|0.221|0.21|0.222|
(dark). |MITOURS_07_LOGOUT|0.982|0.4|0.449|

**8 vUsers, 10 Minutes**

|_. Transaction|_. LR|_. OpenSTA|_. JMeter|
(dark). |MITOURS_01_HOME|1.343|1.247213115|0.561|
(dark). |MITOURS_02_LOGIN|1.384|0.835081967|0.537|
(dark). |MITOURS_03_CLICK_FLIGHTS|1.778|0.720901639|0.736|
(dark). |MITOURS_04_SEARCH_FLIGHT|0.488|0.071065574|0.266|
(dark). |MITOURS_05_SELECT_FLIGHT|0.354|0.258688525|0.235|
(dark). |MITOURS_06_PURCHASE_FLIGHT|0.349|0.305|0.265|
(dark). |MITOURS_07_LOGOUT|1.114|0.444672131|0.48|

**8 vUsers, 20 Iterations**

|_. Transaction|_. LR|_. OpenSTA|_. JMeter|
(dark). |MITOURS_01_HOME|1.323|1.266125|0.5|
(dark). |MITOURS_02_LOGIN|1.427|0.827|0.484|
(dark). |MITOURS_03_CLICK_FLIGHTS|1.761|0.7168125|0.743|
(dark). |MITOURS_04_SEARCH_FLIGHT|0.48|0.0801875|0.248|
(dark). |MITOURS_05_SELECT_FLIGHT|0.312|0.2435625|0.253|
(dark). |MITOURS_06_PURCHASE_FLIGHT|0.364|0.3183125|0.255|
(dark). |MITOURS_07_LOGOUT|1.16|0.4451875|0.481|

Just to see how the figures stack up, I got MS Excel to spit out a correlation matrix of all three tools' response times for the **8 vUsers, 20 iterations** test. Here's what came out.

|_. |_. LR|_. OpenSTA|_. JMeter|
(dark). |*LR*|1|-|-|
(dark). |*OpenSTA*|0.746747501|1|-|
(dark). |*JMeter*|0.961989279|0.650244097|1|
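
If you'd rather reproduce the matrix without Excel, here is a minimal sketch using pandas, with the values typed in from the 8 vUsers/20 iterations table above (any tool that computes Pearson correlation will give the same matrix):

```python
# Sketch: reproduce the Excel correlation matrix from the 8 vUsers / 20 iterations
# table above. pandas' corr() computes Pearson correlation by default.
import pandas as pd

times = pd.DataFrame({
    "LR":      [1.323, 1.427, 1.761, 0.480, 0.312, 0.364, 1.160],
    "OpenSTA": [1.266125, 0.827, 0.7168125, 0.0801875, 0.2435625, 0.3183125, 0.4451875],
    "JMeter":  [0.500, 0.484, 0.743, 0.248, 0.253, 0.255, 0.481],
}, index=["HOME", "LOGIN", "CLICK_FLIGHTS", "SEARCH_FLIGHT",
          "SELECT_FLIGHT", "PURCHASE_FLIGHT", "LOGOUT"])

# Expect roughly: LR-JMeter ~0.96, LR-OpenSTA ~0.75, OpenSTA-JMeter ~0.65
print(times.corr())
```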

*A simple observation of my own*:

It looks like the JMeter and LoadRunner results are very closely correlated, while the OpenSTA figures are off the mark by about 30%. Could it be because of the type of replay engine used by LoadRunner/JMeter? The reference here is to the native socket engine, of course. Might OpenSTA be using a WinInet-style engine? A look into the OpenSTA source should answer that question.

Hardware Used also determines the results

Hi, I found this forum useful.

Suresh, can you please describe the hardware configuration you used? I have tried similar things with the latest JMeter build and fairly simple scripts, but response times were very high for JMeter compared to WebLoad (we used 70-80 users).

I think heap size (on JMeter) was shooting up very quickly (this may not be the only culprit). Is the same hardware configuration (as used for LR/WebLoad) sufficient for JMeter, or does JMeter need lavish treatment in terms of hardware if the scripts are complex? Remember, I am talking here about real-life performance testing (not the kind devs do just to get it over with). In our case the JMeter machine did hang.


bq. The only problem is that in performance testing, vendors push tools onto buyers with the promise that tool X closely emulates real-life conditions; that results coming out of it will generate confidence before going live.

I don't think anyone who really understands performance testing really buys the claims that vendors make. Your experiment is a perfect example of that. If you believed what they told you, you never would have bothered with the comparison. In addition, performance test tools are still software, and any tester knows there are bugs lurking in there ready to strike at any moment. I think in the end, regardless of the vendor, a tool is just a tool and the data it produces will have to stand on its own merits (much like you are forcing it to do).

bq. From the results, one observes between 50% and 75% variance from a single-user experience. Ergo, isn't the verity of the aforementioned premise at stake? Despite the recording technology being similar, despite the replay mechanism being similar, we see such disparity. I suppose the real question we should be asking is exactly how much disparity is an acceptable amount?

Again, I don't know that this matters. Let's assume that all three tools returned the exact same values. What does this tell you? Does it say that all three vendors are measuring accurately, or does it tell you that all three of them are measuring inaccurately? How do you really know the time shown is the user-experienced time? I think we still make the assumption that the tools are telling us valuable numbers, and I don't know that that is the case. Experiments like yours highlight the fact that the tools don't match each other, but how do you check to see if they match reality?

bq. We call performance testing an engineering science because we expect that under controlled conditions, we should be able to repeat experiments with analogous results. ... If we strike down the basic premise of repeatability, maybe we should be applying chaos theory and non-linear dynamics instead of engineered tests.

I think this is why we call testing a brain-engaged activity. While we try to add science and controlled experiments to what we do, we still need to question the validity of those experiments and the tools we use (which is what you are doing). All of the good performance testers I know question the data their tools give them. They open a browser while a test is running and see what the user experience is for them while the application is under load. They use multiple tools and correlate that data. Or they don't rely so much on the numbers as on the trends in the numbers (as I indicated before).

Many thanks!

Thanks for your appreciation, Mike. My team is exceptional, which is what makes the work both doable and exciting!

Not even twins are alike; I think we all understand and accept that there will be differences. The only problem is that in performance testing, vendors push tools onto buyers with the promise that tool X closely emulates real-life conditions; that results coming out of it will generate confidence before going live.

From the results, one observes between 50% and 75% variance from a single-user experience. Ergo, isn't the verity of the aforementioned premise at stake? Despite the recording technology being similar, despite the replay mechanism being similar, we see such disparity. I suppose the real question we should be asking is exactly how much disparity is an acceptable amount?
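
One way to arrive at a figure in that range from the 1 vUser/20 iterations table (assuming the spread is measured between the fastest and slowest tool per transaction, relative to the slowest - the exact metric is not stated above, so treat this as a plausible reading rather than the definitive calculation):

```python
# Sketch (assumption): per-transaction spread between the fastest and slowest tool,
# relative to the slowest, using the 1 vUser / 20 iterations figures above.
# How the 50%-75% range was actually computed is not stated; this is one plausible reading.
single_user = {                      # (LR, OpenSTA, JMeter)
    "HOME":            (1.152, 1.160, 0.466),
    "LOGIN":           (1.239, 0.7985, 0.447),
    "CLICK_FLIGHTS":   (1.598, 0.6375, 0.669),
    "SEARCH_FLIGHT":   (0.400, 0.172, 0.223),
    "SELECT_FLIGHT":   (0.222, 0.2015, 0.220),
    "PURCHASE_FLIGHT": (0.221, 0.210, 0.222),
    "LOGOUT":          (0.982, 0.400, 0.449),
}

for name, samples in single_user.items():
    spread = (max(samples) - min(samples)) / max(samples)
    print(f"{name:16s} spread between tools: {spread:.0%}")
```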

We call performance testing an engineering science because we expect that under controlled conditions, we should be able to repeat experiments with analogous results. This experiment, by the way, was in an isolated lab in a separate network segment and insulated from the corporate network. If we strike down the basic premise of repeatability, maybe we should be applying chaos theory and non-linear dynamics instead of engineered tests.

Excellent!

I really like the comparison, and I appreciate the effort you must have put into getting it all organized and documented. Excellent experiment!

I may be talking above my performance-test pay grade here, but I wanted to comment on your second point: "The implication, then, is even more dire, since none of the tools will agree with each other, nor with the usage timings! Who, then, do we trust?"

Isn't this always a problem even without doing a side by side comparison? Aren't we always implicitly trusting the tools we use? This is the reason we establish baselines (much like you did in your experiment) and then as we increase the number of users we notice the differences in performance along with the actual performance numbers.

If I work on the assumption that my tool is just generating numbers (any number - does not have to be accurate) using a consistent algorithm, then I can get a feel for performance based on the differences in those numbers. The actual validity of the numbers "might be" irrelevant as long as they are consistently generated.

Page X loads in Y seconds with one user.
Page X loads in Y+n seconds with twenty users.

Regardless of the value of Y, I should be able to determine the impact of the twenty users on performance.
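
A trivial sketch of that reasoning - the baseline value Y matters less than the change from it. The numbers below are only illustrative, borrowed from the JMeter MITOURS_01_HOME figures in the tables above:

```python
# Trivial sketch of the point above: judge the impact of load by the change from a
# baseline rather than by trusting the absolute number. Values are illustrative
# (JMeter's MITOURS_01_HOME figures; the 8 vUser number stands in for "twenty users").
def degradation(baseline_s: float, loaded_s: float) -> float:
    """Relative slowdown of a page under load versus its single-user baseline."""
    return (loaded_s - baseline_s) / baseline_s

y = 0.466          # page X with one user
y_plus_n = 0.561   # same page under load
print(f"Slowdown under load: {degradation(y, y_plus_n):.0%}")   # roughly 20%
```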

Does that make sense to anyone other than me?

No conclusions, just observations!


Scott, appreciate your taking the time to review my results.

To answer some of your concerns:

1. We took manual timings for these as well, in IE and Opera. And like you note, they are different. Additionally, I've got timings from VuGen for the Socket and WinInet replay engines.

|_. Transaction|_. Socket|_. WinInet|_. IE|_. Opera|
(dark). |MITOURS_01_HOME|1.25744|1.25136|1.312|1.504|
(dark). |MITOURS_02_LOGIN|1.28014|0.98952|1.546|1.666|
(dark). |MITOURS_03_CLICK_FLIGHTS|1.589075|1.39503|1.344|1.612|
(dark). |MITOURS_04_SEARCH_FLIGHT|0.451935|0.4246|0.954|1.498|
(dark). |MITOURS_05_SELECT_FLIGHT|0.28021|0.26706|0.966|0.92|
(dark). |MITOURS_06_PURCHASE_FLIGHT|0.28361|0.284765|1.05|1.344|
(dark). |MITOURS_07_LOGOUT|1.005285|1.01065|1.254|1.718|

2. The start and end points of measurement in all three tools use a timer mechanism which maps ultimately to the system timer and its ticks.

However, if JMeter brings back TTFB, then yes, we are comparing apples to oranges. The implication, then, is even more dire, since none of the tools will agree with each other, nor with the usage timings! Who, then, do we trust?
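
To make the apples-to-oranges point concrete, here is a rough sketch of the two measurements being contrasted: time to (roughly) first byte versus full download time for a single request. It is illustrative only - the URL is made up, and this is not how LR, OpenSTA or JMeter actually implement their timers.

```python
# Illustrative sketch only: "time to first byte"-style timing vs. full-download timing
# for a single request. Uses the `requests` library; the URL is hypothetical, and this
# is not how LR, OpenSTA or JMeter actually implement their timers.
import time
import requests

url = "http://testserver/WebTours/home.html"   # hypothetical target

start = time.perf_counter()
resp = requests.get(url, stream=True)   # stream=True: returns once headers arrive
ttfb = time.perf_counter() - start      # roughly "time to first byte"

_ = resp.content                        # now read the whole body
full = time.perf_counter() - start      # full response time for this request

print(f"TTFB-ish: {ttfb:.3f}s   full download: {full:.3f}s")
```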

3. LR recording does insert lr_think_time(), as does OpenSTA. The scripts were deliberately cleansed of these.

4. We tweaked the scripts in all three tools to bring back exactly the *same* elements. Whether the app internally makes JDBC/ODBC connections is irrelevant to us from a tool standpoint; the only protocol we use is HTTP/S.

My objective here is to showcase results, not conclusions. It remains to be seen how these results are to be interpreted.

-Suresh

Not sure I agree with your conclusions...

It looks to me like, excluding MITOURS_04_SEARCH_FLIGHT, OpenSTA looks more reasonable than the other two. It definitely looks like Search_Flight is not being handled properly by OpenSTA, but is that the fault of the program or the script?

Here are some of my thoughts/concerns (obviously without having been there or having seen the scripts):

1) What times did you measure for these same activities manually? I understand that manual measurements are +/- 0.2 seconds, but some of these numbers are a second apart.
2) Has it been proven that they are actually measuring the same thing? Same start point and end point for measurements? For instance, I BELIEVE that LR's default measures from the start of the first request to the end of delivery of the last requested object. I believe that JMeter defaults from the start of the first request until the START of the delivery of the last requested object.
3) LR & OpenSTA also insert wait times between requests... I know that the inserted waits are different between the two. I thought that JMeter didn't. Have these waits been normalized in some way?
4) It looks to me like Search_Flight has a component that isn't handled by OpenSTA. Remember OpenSTA only supports HTTP/HTTPS while the others support pure socket. If Search_Flight makes a separate JDBC connection for the search, this wouldn't be captured and could explain the difference.

All of these items may have been considered and appropriately handled, I simply don't know. I'm just saying that before I drew too many conclusions I'd want the answers to these questions.

--
Scott Barber
PerfTestPlus
Software Performance Specialist,
Consultant, Author, Educator
[email protected]
www.perftestplus.com
