How do you know if a load testing tool is accurate?

There are several different aspects to a question like this. The aspect that I’m going to address is: “How do I know that the load generator is accurate when it’s running a lot of users?” Or put another way, “How do I know that the load generator itself isn’t overloaded and is reporting inaccurate results?”

I could get all theoretical on you, citing a variety of technologies and design approaches used in our software to ensure it is accurate. I’d show some nice, official-looking white papers and report all the successes that our other customers have experienced using our software. But a whole lot of words doesn’t add up to real proof. Instead, I’m going to describe how we test Load Tester, both to assure ourselves that it is accurate and to help plan the resource requirements for tests we run for our customers. The process is not particularly difficult, though it can be a little time-consuming. You can follow this same methodology to test our software and demonstrate to yourself that it is accurate. Then you’ll know it is not only accurate in our test lab, but in yours as well.

How accurate is accurate?

It is worth mentioning that accuracy is not an easy thing to define, particularly in the realm of web performance. A single page can exhibit a variety of load times depending on the state of the server(s) and the backing database, the cache states of both the servers and the browser (and any proxies in the middle), the network conditions, the parameters included with the page request and a host of other variables. This means that every measurement of the page might be different, which makes it challenging to decide what is accurate.

In the past, I’ve blogged about how accurate your load test scenarios need to be. Both that post and this one may seem odd coming from a vendor of load testing software. Indeed, most vendors would consider it heresy to say anything other than “our tool is 100% accurate”. But we live in the real world, where there are varying degrees of accuracy. 100% accuracy is virtually impossible to achieve. And if you could achieve it, the cost would be tremendous. As performance engineers and testers, we must weigh the project requirements and risk factors to determine how accurate we need to be. The more we spend, the more accurate we can get (and spending on software will only get you so far — after a point, more accuracy requires more labor). The methods here can be used to determine how accurate the load test results are for a given configuration of test resources (specifically, the number of load generators).

Step 1: Establish a baseline

To determine whether load test results are accurate, we must have a baseline to compare them to. For load testing, we define the baseline as the performance of the system under a load of 1 user. It would be easy to assume that a tool is accurate for a single user (particularly one that uses real browsers), but the rest of this proof builds on step 1, so don’t skimp on your effort here.

There are a couple of ways to proceed, depending on the type of tool you are using. If you are using Load Tester’s virtual browsers (or another traditional, HTTP-centric web load testing tool), then you might use another page performance analysis tool, such as Firebug’s network tab, to look at the response times for the transactions that make up the page.

If you are using Load Tester’s real browsers, then you can use a stopwatch and measure the time it takes a page to load while replaying the testcase. You should be looking for the browser to report the page complete (not when it looks complete), which it typically does by animating an icon while the page loads. Since the precision of the eyeball-to-stopwatch interface is pretty low, you may need to repeat this several times to get a number you can confidently compare to the tool’s result.
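One way to turn several stopwatch readings into a number you can compare is to take their mean and note how much they scatter. The following is a minimal sketch; the function name and the readings are hypothetical and are not measurements from our lab.

```python
import statistics


def summarize_timings(seconds):
    """Return (mean, sample standard deviation) of repeated stopwatch readings."""
    if len(seconds) < 2:
        raise ValueError("need at least two readings to estimate the spread")
    return statistics.mean(seconds), statistics.stdev(seconds)


# Hypothetical stopwatch readings (seconds) for one page; not measured data.
readings = [2.4, 2.1, 2.6, 2.3, 2.2]
mean, spread = summarize_timings(readings)
print(f"baseline: {mean:.2f} s, spread: +/- {spread:.2f} s")
```

The spread gives a rough idea of how precise the manual baseline is, which matters when you compare it to the tool’s result in the next step.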

Step 2: Prove that a single virtual user is measured accurately

With the baseline established, the next step is to run a single-user load test using a dedicated load engine. While the local load engine built into the Load Tester UI will generally be just as accurate, there’s no reason to take a chance here; you’re going to need some dedicated load engines for the next step anyway. Compare the durations reported by the load test results to your manual measurements. If the delta between the baseline and the single-VU measurements falls within the precision of the eyeball-to-stopwatch interface, then you can proceed to step 3, satisfied that a single user is simulated and measured accurately.

In this example, the deltas are pretty low. I don’t expect human precision to be closer than a few tenths of a second, so these numbers are actually better than I’d expect. I was probably anticipating the page load completions, which is one of the many reasons that human measurements are not reliable. However, in this case it is good enough for me to feel good about the accuracy of the VU.

If the delta does not satisfy you, then you’ll need to track down the source of the discrepancy or choose another tool. If you are not sure where to begin with this step while evaluating Load Tester, please feel free to contact us for help.
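The comparison in this step can be sketched in a few lines. The 0.3-second tolerance below is an assumption standing in for “a few tenths of a second” of human precision, and the page names and timings are hypothetical.

```python
def delta_within_precision(baseline_s, tool_s, precision_s=0.3):
    """Return (delta, ok): the absolute difference in seconds and whether
    it falls within the assumed precision of the manual measurement."""
    delta = abs(tool_s - baseline_s)
    return delta, delta <= precision_s


# Hypothetical numbers: (stopwatch baseline, duration reported by the single-VU test)
pages = {
    "home": (2.32, 2.45),
    "search": (3.10, 3.62),
}
for name, (baseline_s, tool_s) in pages.items():
    delta, ok = delta_within_precision(baseline_s, tool_s)
    verdict = "ok" if ok else "investigate"
    print(f"{name}: delta {delta:.2f} s -> {verdict}")
```

Choose the tolerance from the scatter you saw in your own stopwatch readings rather than from this default.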

Step 3: Now that I know 1 user is accurate, what about 10 users?

This step is the real guts of how we determine that the tool is accurate at scale. So far, you know that the load engine can simulate 1 user accurately; you’ll use this as the reference point. So the question is: Given what you know, how can you prove that an engine can run 5 or 10 users accurately? Answer: You’ll run 2 tests that exercise the web application equally but exercise the load engines differently, giving you the data you need to determine whether the engine’s performance differs based on the number of users it is running.

The first test will use 10 load engines, each running a single user. Since you know that a single engine running a single user is accurate, you can use this test to accurately measure the performance of the web application at a load level of 10 users.

The second test will use a single load engine running 10 users. Because the previous test has established the performance of the web application at 10 users, differences in results between the two tests will most likely be caused by the load engine. If the test results are equal, then you have proven that the load engine can accurately generate 10 users.

The example below comes from two sets of short tests running the same number of users (10 real-browser virtual users) using either a single load engine or ten load engines. The single-engine result is very close to the 10-engine result, within ~5%. This is very good, given that the average load times of the individual 10-engine test runs range from 0.777 to 1.132, with the upper limit of that range more than 25% above the average of 0.905.
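To put the run-to-run variation in perspective, here is the arithmetic behind those percentages as a small sketch. The function name is my own; the three numbers are the ones quoted above.

```python
def spread_about_mean(low, high, mean):
    """Return how far the lowest and highest runs sit from the mean,
    each expressed as a fraction of the mean."""
    return (mean - low) / mean, (high - mean) / mean


# Figures quoted above for the 10-engine test runs.
below, above = spread_about_mean(low=0.777, high=1.132, mean=0.905)
print(f"fastest run: {below:.1%} below average")
print(f"slowest run: {above:.1%} above average")
```

A single-engine result within ~5% of the average sits comfortably inside that band.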

A plot of the data shows some variations and, to a degree, reflects that VUs running together on an engine may report higher load times than VUs running alone. This is entirely expected. The question here is: is this accurate enough for YOUR needs?

If the results do not compare well, then you have found that the engine capacity was exceeded. Or you might have found an obscure performance problem in the web application that is not due to the quantity of the load (number of VUs), but some other condition that is exercised by using a single engine to run multiple users. I’ll leave that topic for another day.

Step 4: Scale up the proof

To scale up this proof, step 3 can be repeated to determine the accuracy of the load engine at a higher level. For example, if you have successfully proved that the engine is accurate at 10 users, you can use that information to prove that it is accurate at 100 users using the same method: run 2 tests (one with 10 engines running 10 users each and the second with a single engine running 100 users). If the results match, you can repeat this step again, scaling up as needed until you reach your testing goal.
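The ladder of test pairs can be written down ahead of time. This is only a sketch: the class and function names are hypothetical, and the factor of 10 between levels simply mirrors the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnginePair:
    reference_engines: int           # engines in the multi-engine reference test
    users_per_reference_engine: int  # load per engine already shown to be accurate
    single_engine_users: int         # load placed on the one engine being evaluated


def scale_up_plan(goal_users_per_engine, factor=10):
    """List the test pairs needed to verify one engine up to the goal.

    If the goal is not a power of `factor`, the last pair overshoots it.
    """
    pairs = []
    verified = 1  # steps 1 and 2 established accuracy at one user per engine
    while verified < goal_users_per_engine:
        pairs.append(EnginePair(factor, verified, verified * factor))
        verified *= factor
    return pairs


for pair in scale_up_plan(100):
    print(
        f"{pair.reference_engines} engines x {pair.users_per_reference_engine} users"
        f"  vs  1 engine x {pair.single_engine_users} users"
    )
```

Each pair applies the same total load to the application, so only the work done per engine changes between the two tests.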

Here is a continuation of the previous example, running 20 real-browser users on either 10 engines or a single engine. The single-engine tests show a few points that are noticeably higher than the range of the 10-engine tests. On the other hand, the fastest result is from a single-engine test. The single-engine result of 0.684 is about 7% higher than the 10-engine result (0.639), a slightly larger gap than in the previous test. But that is still well within the 0.602 – 0.716 range of the individual 10-engine tests (the upper limit is about 12% above the average).
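The check I am applying to each pair can be sketched as follows. The function name is hypothetical; the inputs are the figures quoted above, with 0.639 taken as the average of the 10-engine runs.

```python
def compare_pair(single, multi_mean, multi_low, multi_high):
    """Return (relative_delta, within_range) for one single-engine vs
    multi-engine test pair.

    relative_delta is the single-engine result's offset from the
    multi-engine average, as a fraction of that average.
    within_range is True when the single-engine result falls inside the
    span of the individual multi-engine runs.
    """
    relative_delta = (single - multi_mean) / multi_mean
    within_range = multi_low <= single <= multi_high
    return relative_delta, within_range


delta, ok = compare_pair(single=0.684, multi_mean=0.639,
                         multi_low=0.602, multi_high=0.716)
print(f"single engine is {delta:.1%} above the 10-engine average; "
      f"within observed range: {ok}")
```

Whether falling inside the observed range is a strict enough criterion depends on the project, as discussed next.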

Since the difference between the 10-engine and single-engine tests is well within the observed variations of performance of the site, this will be acceptable for most projects. If this variation is acceptable for the project at hand, then I can use these results to allocate resources for a load test that will be as accurate as is needed. If the observed variation of the site was much lower and the performance requirements are very strict, then I might either (1) use the previously-established capacity as the maximum for future testing or (2) back down the test a bit to narrow down the possible range of the maximum capacity, perhaps re-testing at a load of 15 users per engine.

Conclusion

We can test the accuracy of a load test configuration (engine hardware, testing software, etc.) by running a series of increasingly large test pairs to determine how the engine scales. At each level, the results from a single-engine test are compared to the results of a multiple-engine test that subjected the application to the same amount of load. The difference within each pair of tests indicates when the performance limits of a single load engine have affected the results, and by how much.

As always, if you have any questions, feel free to contact us!

Chris Merrill,
Chief Engineer

Chris

When his dad brought home a Commodore PET computer, Chris was drawn into computers. 7 years later, after finishing his degree in Computer and Electrical Engineering at Purdue University, he found himself writing software for industrial control systems. His first foray into testing software resulted in an innovative control system for testing lubricants in automotive engines. The Internet grabbed his attention and he became one of the first Sun Certified Java Developers. His focus then locked on performance testing of websites. As Chief Engineer for Web Performance since 2001, Chris now spends his time turning real-world testing challenges into new features for the Load Tester product.
