
When Quality depends on the company you keep

A recent report from the CDC identified a number of issues with current proficiency testing. Among those problems: the group standard deviation type of quality requirement. What is it? Why is it a problem? What can be done about it?

A discussion of group-SD Quality Requirements

June 2008

Sten Westgard, MS

Recently, a report on Proficiency Testing was released by the CDC Division of Laboratory Systems, as part of an overall effort to identify Laboratory Best Practices. The report, Review of Proficiency Testing Services for Clinical Laboratories in the United States - Final Report of a Technical Working Group – compiled by experts working with the help of Battelle Memorial Institute, addresses a lot of interesting challenges facing the PT industry. But one of their findings is of particular interest to readers of Westgard Web – quality requirements, their types and value.

Usually on Westgard Web, when we talk about quality requirements, we talk about selecting requirements for test methods and assessing the performance of those methods based on those requirements. However, this discussion is going to go one step deeper – and talk about how the quality requirements themselves are defined and designed.

The Group Standard Deviation Quality Requirement

There are three types of quality requirements presented in the CLIA proficiency testing criteria. The first type is the fixed limit, where the laboratory's result must be within a specific number of units of the target value (e.g. +/- 6 mg/dL). The second type is a percentage limit, where the laboratory's result must be within a specific percentage of the target value (e.g. +/- 20%). The third and final type is a standard deviation requirement, where the laboratory's result must be within a specified number of standard deviations of the target value. For example, CLIA states that Thyroid Stimulating Hormone (TSH) results must be within +/- 3 standard deviations.
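To make the three types concrete, here is a minimal sketch in Python (our own illustration; the function names and the example values are assumptions, not taken from the CLIA regulations):

def passes_fixed_limit(result, target, limit=6.0):
    # Fixed limit: result must fall within +/- limit units of the target.
    return abs(result - target) <= limit

def passes_percentage_limit(result, target, pct=20.0):
    # Percentage limit: result must fall within +/- pct% of the target.
    return abs(result - target) <= (pct / 100.0) * abs(target)

def passes_sd_limit(result, target, group_sd, n_sd=3.0):
    # SD limit: result must fall within +/- n_sd group standard deviations.
    return abs(result - target) <= n_sd * group_sd

# Example: a TSH result of 0.55 mIU/L against a target of 0.43 mIU/L,
# with a group SD of 0.07, passes the +/- 3 SD criterion:
print(passes_sd_limit(0.55, 0.43, 0.07))   # True: |0.12| <= 0.21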

Historically, CDC seems to have used the +/- 3 SD range to set the original criteria for acceptability. For some tests, performance was stable enough that the criteria could be reduced to a fixed percentage or a fixed concentration. Other tests were still rapidly changing in technology and performance at the time the CDC was determining quality requirements. Not only was there a lot of within-method variation (i.e. the method itself had a lot of variation), there was also significant between-method variation (i.e. the values of one method differed greatly from the values of a different method). For these troublesome methods, neither a fixed limit nor a fixed percentage could contain most of the values encountered during proficiency testing, so +/- 3 SD was used to represent the "state of the art." You may recall that in the Stockholm hierarchy of quality requirements, state-of-the-art criteria rank last, or lowest – the least desirable way to establish criteria for acceptability.

The PT Report further describes this type of quality requirement:

“In instances where definitive or reference methods are not available, or a specific method’s results demonstrate bias not observed with actual patient samples as determined by a defensible scientific protocol, a comparative method or a method group (peer group) may be used…. Despite the requirement that peer group means be used for target value determinations only in instances where a matrix effect has been demonstrated by a defensible scientific protocol, this approach has become commonplace.”[1]

Thus, the laboratory's results had only to fit within a certain number of standard deviations of the testing group to which the laboratory belonged. In other words, the quality requirement depended not just on the values obtained by your lab, but on those of the entire group of labs. As long as you were close to the values reported by your peers, you were fine.

At first glance, there’s nothing wrong with this type of requirement. After all, this is what proficiency testing is all about: you compare your results with everyone else’s results, and you want to be close to the group. But the problem is that we’re no longer seeking the “right” answer – like an authoritative value established by a reference method. Instead, we’re just seeking a common answer – a value close to what everyone else got.

As we discuss these problems, we’re going to take advantage of some data on six TSH methods collected by Rawlins and Roberts in 2004 [2]. We’re going to pair that data up with some 2004 proficiency testing data from the New York State Wadsworth CLEP PT program and the API PT program. Bear in mind, we’re not conducting a scientific study with this data. We’re just using these numbers to illustrate some of the issues with this type of quality requirement.

We won’t dwell on the debate over whether the upper normal reference range limit for TSH is 2.5 or 4.5 mIU/L. Instead, we will focus on the lower end of the reference range, which is around 0.40 mIU/L. Values below this threshold are often used to differentiate between nonthyroidal illnesses and primary hyperthyroidism.

Problem: Following the Herd is a Moving Target

One of the first problems you encounter with group standard deviation quality requirements is the variability of the requirement itself. Each PT event yields a different group standard deviation and, as a result, a different quality requirement. For example, for TSH around the 0.40 mIU/L level, the quality requirement for the NY PT group varied from 34% (January) to 43% (May) to 45% (September) during 2004. That’s a spread of 11 percentage points within the same group of the same proficiency testing program. Remember, analytes like cholesterol have a fixed requirement of just 10%; the spread alone on the TSH quality requirement exceeds the entire fixed requirement for cholesterol.

When quality requirements jump around from event to event, it becomes difficult to get a clear picture of the quality a method is achieving. For example, consider a method with a 5% CV and a 15% bias (not an unlikely bias, since the range of biases observed by Rawlins and Roberts was 3 to 35%). Given that data, performance is world class with a 45% quality requirement, but falls to marginal when the quality requirement is 34%.
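To see the arithmetic, apply the Sigma equation used later in this article, Sigma = (Quality Requirement – Bias) / CV:

Sigma = (45% – 15%) / 5% = 6.0 (world class)

Sigma = (34% – 15%) / 5% = 3.8 (marginal)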

In other words, it's hard to benchmark your quality when the bench is moving. What might be a world class method in September might be marginal in January. We already know that medical care suffers during overnight and weekend shifts - do we now have to worry that care suffers during some PT events?

Problem: With Group-based quality requirements, Everything’s Relative

Another problem that occurs is this: “How am I to know if my laboratory is doing well when my performance depends upon my peers?” If we introduce “quality relativism” to laboratory testing, it may matter more how your peer labs perform than how well your lab performs.

In the traditional “grading on a curve” scenario, you might want to be in a group of lower-performing peers, in order to ensure that you get the highest grade. In a proficiency testing scenario, you would want a group with a lot of variation, because that gives the largest standard deviation. Large standard deviations mean larger quality requirements, which means a higher Sigma-metric.

Let’s use our TSH data now as an example. We’ll compare the six methods (A through F) and use possible proficiency testing groups at or near the 0.4 mIU/L level.

Group          Standard Deviation          Quality Requirement
New York PT    0.06 at level 0.52 mIU/L    34.6%
API            0.07 at level 0.43 mIU/L    48.8%
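These percentages follow directly from the +/- 3 SD criterion, expressed relative to the target level (this is how the figures above work out):

Quality Requirement = (3 * Group SD / Level) * 100%

New York PT: (3 * 0.06 / 0.52) * 100% = 34.6%

API: (3 * 0.07 / 0.43) * 100% = 48.8%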

Notice the calculated allowable errors for the two different groups, NY and API. There’s a difference of more than 14 percentage points between the two quality requirements. When we apply those quality requirements to the method performance, the differences become significant.

Method    CV
A         6.3%
B         5.7%
C         5.2%
D         3.7%
E         4.3%
F         1.6%

For these methods, an imprecision study was performed 2 days a week for 3 weeks, with 24 replicates in total at the control level we’re focusing on, 0.40 mIU/L. On their own, those CVs aren’t that helpful, although you can make rough comparisons of the ratio of the CV to the quality requirement. To convert this into something more tangible, let’s make some simple Sigma-metric calculations. Assume there is no bias and apply the Sigma equation: Sigma = (Quality Requirement – Bias) / CV.

Method    CV      NY Sigma    API Sigma
A         6.3%    5.49        7.75
B         5.7%    6.07        8.57
C         5.2%    6.66        9.39
D         3.7%    9.36        13.20
E         4.3%    8.05        11.36
F         1.6%    21.6        30.5
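If you want to verify these numbers yourself, here is a minimal Python sketch (our own illustration, using the CVs and quality requirements given above) that reproduces the table to within rounding:

# CVs (%) for the six methods and the two group quality requirements (%)
cvs = {"A": 6.3, "B": 5.7, "C": 5.2, "D": 3.7, "E": 4.3, "F": 1.6}
quality_requirements = {"NY": 34.6, "API": 48.8}

bias = 0.0  # assumption stated in the text
for method, cv in cvs.items():
    for group, tea in quality_requirements.items():
        sigma = (tea - bias) / cv   # Sigma = (Quality Requirement - Bias) / CV
        print(f"Method {method}, {group}: Sigma = {sigma:.2f}")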

In this case (note that we’re not taking bias into account), we’ve got potentially great news: world class performance, no matter how you slice it. But also note how the Sigma-metrics differ from group to group. For method A, if it were as simple as “switching” groups, you could move from good performance to beyond Six Sigma.

Another conclusion you could reach: sometimes the group variation may be so large that it’s “too easy” to achieve acceptable or superb performance. To paraphrase the advice on making it to the Olympics: if you want to achieve world class performance, choose your peer group well.

Problem: What if your peer group is “too good”?

Here’s a different problem, one that might result from a peer group with excellent precision: the effective “range” of allowable variation may get “too tight.” An all-participant standard deviation might be wide and forgiving, but if everyone in your peer group has excellent performance, there is far less allowance for error.

For example, if you’re 6 feet tall on an NBA team, you might be considered “out of control” (i.e. too short). In contrast, if you are 6 feet tall compared to a combined group of office workers, medical technologists, dentists, and the NBA team, you might find that your height is more likely to be “in control” (you’re neither the tallest nor the shortest).

In a purely abstract sense, the more coherent your peer group, the tighter the quality requirement. If you extrapolate to the extreme case, where every peer performs exactly like your laboratory, the peer SD will be the same as your individual laboratory SD. Then, by definition, you will be able to achieve at most 3 Sigma. The calculation works out like this:

Quality Requirement = 3 * YourSD

Sigma = (Quality Requirement – Bias) / YourSD

      = ((3 * YourSD) – 0) / YourSD

      = 3
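More generally, if the group SD is k times your laboratory's individual SD, the same algebra gives (still assuming zero bias):

Sigma = ((3 * k * YourSD) – 0) / YourSD = 3k

So a peer group with twice your lab's variation hands you Six Sigma automatically, while a perfectly coherent peer group caps you at 3 Sigma.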

Back in the real world, we know that even the most similar peer groups have wider group SDs than individual SDs. For example, if we used the peer groups found in API PT (where groups of instruments using the same reagents are grouped together), this is what some of our Sigma metrics would look like:

Method    CV      NY Group Sigma    API Group Sigma    API Peer Quality Requirement    API Peer Sigma
A         6.3%    5.49              7.75               15.00%                          2.38
B         5.7%    6.07              8.57               no peer group data available
C         5.2%    6.66              9.39               no peer group data available
D         3.7%    9.36              13.20              21.95%                          5.93
E         4.3%    8.05              11.36              no peer group data available
F         1.6%    21.6              30.5               26.79%                          16.7

Again, the problem here is that the choice of group has a huge impact on your apparent performance. In this simple example, method A would have a Sigma-metric of 2.38, 5.49, or 7.75, depending on which group you chose to measure it against. The lower variation of the peer group makes for a much lower quality requirement, resulting in an unacceptable Sigma-metric. Depending on which comparison you chose, Method A could be unacceptable, good, or world class. Obviously, with quality requirements like this, it pays to be careful about your choice of peers.

Problems: What if everyone in the peer group is wrong? Or what if everyone is wrong and you’re right?

The PT Report notes several other critical problems that can result from peer group comparisons in PT.

“While it is understandable that PT providers prefer [the peer group] approach since it is more likely that they will achieve 80% participant agreement, this practice can mask true analytical bias that may affect patient samples, as well as PT samples. Rej et al reported on two cases seen in the New York State Clinical Laboratory Evaluation Program that illustrate the pitfalls of using peer group means instead of an all-participant mean. In one case, a problem in one manufacturer’s reagent quality would have been overlooked. In the other case, a problem with another manufacturer’s calibrators was uncovered when results for these instruments were compared to the overall mean.” [1]

This is the danger of peer group means: if everyone in your peer group is experiencing the same bias, it won’t look like a problem. If all the values in your peer group figuratively jump off a cliff (due to a problem with reagents or calibrators, etc.), the question isn’t whether you will jump, too. The question is whether you will notice it.

If you take a retrospective look at the stock market, you can find similar situations where groups drift into error through a blinding consensus. Take the recurring pattern of economic bubbles. You can go back centuries, to the tulip mania of the 17th century, or you can look more recently at the Internet stock craze of the 1990s and the now-collapsing subprime mortgage bubble. In all those cases, reasonable, well-intentioned investors followed the herd of their peers (overwhelmed by “irrational exuberance”) and invested large sums of money in things that weren’t that valuable. And in every case, there were nay-sayers who resisted the crowd but were ignored, usually to their peril. In the case of economic bubbles, the damage is usually only financial. In the case of laboratory testing, the damage could be wasted repeat testing, misdiagnosis, or worse.

Solutions: Better Methods, Different Quality Requirements

So now we’ve seen the downsides of the standard deviation-type quality requirement. What, if anything, can be done about it? The PT report concludes with some general advice on this topic:

“Where available, target values should be established or verified by reference methods. Comparative methods or peer group targets should be used more rarely and only when it is shown that specific methods demonstrate a bias with PT samples not observed with patient samples. It is recognized that changing the method or target value assignment from peer group to a reference method value could have disruptive effects both for PT providers and participating laboratories. If an instrument has a problem in providing results comparable to the “true” value as determined by a reference method or reference material, the laboratories using it should not be penalized for their choice of instrumentation. Some accommodation for this possible outcome should be made by providing a transitional period that allows manufacturers to correct problems or laboratories to change instrumentation.” [1]

The Proficiency Testing Working Group also tried to reach a more forceful recommendation but couldn’t quite do it:

“The PTWG considered a recommendation that CDC identify reference or standard methods for as many CLIA analytes as possible and establish guidelines for manufacturers and PT providers to utilize these methods to assign target values for analytes. The group was unable to reach a consensus in this discussion, and no recommendation was made. The principal impediment to consensus was that reference or standard methods exist for relatively few clinically important analytes, and establishing such methods is often a difficult, time-consuming, and expensive effort.”[1]

The performance side of this problem seems to be resolving itself, at least for some of the CLIA-regulated analytes. The latest generation of methods enjoys better precision and comparability – methods are getting better and more alike. This improved performance could be the trigger for switching from the standard deviation-type requirement to an absolute requirement.

Unfortunately, not all methods are ready for this switch. Cao, Soldin, and Rej presented a poster at the 2007 AACC conference in which they established target values for endocrinology PT results using HPLC-tandem mass spectrometry as a reference method. They found that while failure grades increased only slightly for T4, free T4, and cortisol, failure grades doubled to quadrupled for testosterone, progesterone, and DHEA-S.[3]

Norton-Wenzel, Cao, Russo, and Rej, in another 2007 poster[4], noted another approach that could be taken for these methods. While between-method variation was high, they observed that “within-method CVs were reasonably consistent over proficiency test events and specimen concentration, except at low concentrations of analyte.” Thus, they recommended using median values for %CV. While an improvement, this again presents a moving target for a quality requirement.

As a practical matter, more recent regulations tend to favor fixed unit or fixed percentage requirements. Anecdotally, we hear from some proficiency testing providers that they now avoid assigning the standard deviation-type quality requirement to non-regulated analytes. If you look at the Australian RCPA proficiency testing requirements or the German Rilibak, you don’t find that type of quality requirement at all. The Rilibak specifications are unique in that they specify both the allowable “relative deviation” (error) for an individual value and the allowable “relative deviation” for interlaboratory test results. For TSH, Rilibak has set a 13.5% individual allowable error and a 24% interlaboratory allowable error.

The final lesson here is that we still need better methods, better requirements for those methods, and most of all, we need better professionals - people in the laboratory with knowledge, experience and the ability to exercise judgment in the definition and use of quality requirements.

References

  1. Review of Proficiency Testing Services for Clinical Laboratories in the United States - Final Report of a Technical Working Group, Division of Laboratory Systems, CDC, April 2008, p. 24. http://www.futurelabmedicine.org/WorkGroupReport.aspx Accessed May 29, 2008.
  2. Mindy L. Rawlins, William L. Roberts, "Performance Characteristics of Six Third-Generation Assays for Thyroid-Stimulating Hormone," Clin Chem 2004;50(12).
  3. Z. Cao, S.J. Soldin, R. Rej, "Are external quality assessment programs ready to adopt absolute target values?" Poster D-134, 2007 AACC Annual Meeting.
  4. C.S. Norton-Wenzel, Z. Cao, K. Russo, R. Rej, "Ranges of acceptability for proficiency testing/external quality assurance programs," Poster D-142, 2007 AACC Annual Meeting.