This question comes from Robbie Keith of Summit Laboratory: We are in the process of evaluating our QC program. Our techs monitor Levey-Jennings charts for shifts and trends weekly. We would like to know what you consider to define a shift or trend (e.g., how many points are required increasing or decreasing to define a trend?)

Consider control rules such as 41s, 10mean, etc., as good indicators of shifts and trends. The number of consecutive observations needed increases as the control limit approaches the mean of the control material, in order to keep false rejections down. The minimum number of consecutive observations above or below the mean should probably be set at 6. There are some recommendations, particularly in Germany, to use 7 above or below the mean, or 7 trending consecutively in one direction.
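As a rough illustration of the run counts mentioned above, the following Python sketch measures the longest run of control results on one side of the mean and the longest run moving in one direction. The function names and the example values are hypothetical, and the thresholds (6 or 7) are left to the caller.

```python
def longest_run_one_side(values, mean):
    """Longest run of consecutive points strictly above or strictly below the mean."""
    longest = current = 0
    last_side = 0
    for v in values:
        side = (v > mean) - (v < mean)  # +1 above, -1 below, 0 exactly on the mean
        if side != 0 and side == last_side:
            current += 1
        elif side != 0:
            current = 1   # a new run starts on this side
        else:
            current = 0   # a point on the mean breaks the run
        last_side = side
        longest = max(longest, current)
    return longest


def longest_monotonic_run(values):
    """Longest run of consecutive points that keep rising or keep falling."""
    if not values:
        return 0
    longest = current = 1
    last_direction = 0
    for previous, value in zip(values, values[1:]):
        direction = (value > previous) - (value < previous)
        if direction != 0 and direction == last_direction:
            current += 1
        elif direction != 0:
            current = 2   # two points define the start of a new trend
        else:
            current = 1   # a repeated value breaks the trend
        last_direction = direction
        longest = max(longest, current)
    return longest


# Hypothetical control results with an assumed mean of 100
controls = [101, 102, 103, 101, 104, 102, 99, 100]
shift_suspected = longest_run_one_side(controls, 100) >= 6
trend_suspected = longest_monotonic_run(controls) >= 7
```

A laboratory would apply these counts per control material and choose the threshold that matches its own false-rejection tolerance.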

Basic Method Validation, 3rd Edition, FAQs

James O. Westgard, PhD

METHOD VALIDATION:

The Frequently-Asked-Questions

Why is it necessary to validate method performance when the manufacturer already has?
What analytical performance is needed for a laboratory test?
Who should perform the validation studies in a laboratory?
In setting up a new method for validation studies, how important is it to calibrate the method using primary standards instead of commercial calibrators?
What performance characteristics are usually validated?
What experiments are usually performed?
Does linearity have to be validated?
Does detection limit have to be validated for all tests?
How many materials need to be analyzed in a replication experiment?
What comparison method should be used in the comparison of methods experiment?
Why is there so much emphasis placed on the comparison of methods experiment?
Why can't the correlation coefficient be used to judge the agreement between methods in a comparison of methods study?
Why are regression statistics still recommended?
What is “Bland-Altman” analysis of method-comparison data?
Isn't Bland-Altman simpler to use than regression?
What's the proper way to use t-test statistics?
What's the proper way to use regression statistics?
What's Deming regression?
What’s Passing-Bablok regression?
What computer programs are available to calculate Deming and Passing-Bablok regression?
What's the alternative to more complicated regression calculations?
What tests are likely to have a narrow range of data and require more care?
Why can't acceptability be judged by tests of significance, such as t-test and F-test?
How does the "method decision chart" approach compare with the "performance criteria" approach?
Where can I find more detailed protocols and statistical guidelines for method validation?

Why is it necessary to validate method performance when the manufacturer has already performed extensive studies?

It's important to demonstrate that the method performs well under the operating conditions of your laboratory and that it provides reliable test results for your patients. There are many factors that can affect method performance, such as different lots of calibrators and reagents, changes in supplies and suppliers of instrument components, changes in manufacturing from the production of prototypes to final field instruments, effects of shipment and storage, as well as local climate control conditions, quality of water, stability of electric power, and, of course, the skills of the analysts. In US laboratories, method validation studies are actually required by the CLIA regulations.

What analytical performance is needed for a laboratory test?

In the US, CLIA defines minimum standards of analytical quality in the form of the criteria for acceptability in proficiency testing surveys. These criteria define the allowable total error around a target value (TV). For example, acceptable test results for cholesterol are described as TV plus/minus 10%, which means that test results should be within 10% of the correct value. Most countries have proficiency testing or external quality assessment schemes that define standards for analytical quality in a similar manner. Note that use of an allowable total error does not provide a specification for an individual characteristic, such as imprecision, inaccuracy, interference, recovery, etc., but provides a requirement of the total amount of error when all sources are combined.
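The allowable total error criterion described above can be expressed as a simple check. This is a minimal sketch; the function name and the example results are hypothetical, and only the 10% cholesterol criterion comes from the text.

```python
def within_allowable_total_error(result, target_value, allowable_percent):
    """True if the result lies within target value +/- allowable_percent of the target."""
    allowed = abs(target_value) * allowable_percent / 100.0
    return abs(result - target_value) <= allowed


# Cholesterol example: target value 200 mg/dL, allowable total error 10%
acceptable = within_allowable_total_error(215.0, 200.0, 10.0)    # 15 off, limit is 20
unacceptable = within_allowable_total_error(225.0, 200.0, 10.0)  # 25 off, limit is 20
```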

Who should perform the validation studies in a laboratory?

One possibility is that the manufacturer's technical personnel will perform validation studies when installing a new system in your laboratory. This seems to be a growing trend in the US, probably due both to tight laboratory staffing and also the strategy of purchasing whole systems and holding the manufacturer accountable for all problems. If the manufacturer performs the studies, it's important that you review the experimental design, monitor the data collection, and perform your own statistical analysis and interpretation of the data.

In many other cases, however, the studies will need to be organized and carried out by the laboratory itself. It is advisable to have one analyst organize the studies, monitor the data collection, review the data, perform the statistical analysis of the data, and be responsible for the interpretation and conclusions. Other analysts can participate as operators and perform the tests needed in the different validation experiments.

In setting up a new method for validation studies, how important is it to calibrate the method using primary standards instead of commercial calibrators?

The method should be operated in the way intended under routine service conditions. If routine service operation will make use of commercial calibrators, then those calibrators must be part of the testing process that is validated. It is generally advisable to analyze both commercial calibrators and primary standards together, when possible, to see if they agree. Any disagreement should be resolved prior to performing the recovery, interference, and comparison of methods experiments.

What performance characteristics are usually validated?

These almost always include the reportable range, precision (or imprecision), accuracy (or inaccuracy, bias), and the reference interval. Sometimes the studies include detection limit (or sensitivity), interference, and recovery. In US laboratories, the CLIA regulations define which characteristics need to be validated for methods with different classifications of complexity. Fewer studies are required with less complex methods. More extensive testing is necessary for methods developed by the laboratory or modified by the laboratory.

What experiments are usually performed?

Reportable range is validated by a linearity experiment, imprecision (or random error) is determined from a replication study, and inaccuracy (or systematic error) is assessed from a comparison of methods experiment, as well as from experiments for interference (constant systematic error) and recovery (proportional systematic error). Sensitivity is determined by a detection limit experiment. Reference intervals can be verified by testing samples from healthy people.

Does linearity have to be validated?

It's actually the reportable range that must be validated. The objective in determining reportable range is to define the highest value that can be reported without diluting the sample. This is usually done by performing a linearity type of experiment, but there is no strict requirement that the method response has to be linear. However, the readout from instrument systems often is linear in the units that are reported.

Does detection limit have to be validated for all tests?

No, for most tests it is sufficient to validate the reportable range using a linearity type of experiment. A more exact estimate of analytical performance around zero is needed only when there is special significance attached to low values for the test. Drug tests are an obvious example. Tumor markers are another example.

How many materials need to be analyzed in a replication experiment?

Good planning would be to analyze the number of materials that will be used in routine quality control for that test. In US laboratories, CLIA places certain requirements on the number of materials to be used for different tests - e.g., a minimum of 2 levels or materials. Laboratory practices commonly include 3 materials for certain tests, such as blood gases and hematology. When possible, select control materials that can be continued for QC once the test is implemented in your laboratory.

What comparison method should be used in the comparison of methods experiment?

Ideally, the comparison method should be a method that is free of systematic errors, i.e., a method whose inaccuracy or bias is minimal. In practice, most studies involve the routine service method that is to be replaced by the new method. In such studies, the objective is really to assess whether there will be any systematic changes in test values between the "old" method and the "new" method. If such systematic changes are uncovered, then it is important to document which method has the problem. Interference and recovery experiments are often helpful for pinpointing the problem and the method at fault.

Why is there so much emphasis placed on the comparison of methods experiment?

Probably because this experiment uses real patient samples and reveals the kind of errors that will be encountered when the tests are used for patient care, which is particularly important when a laboratory changes methods. It also reveals different kinds of errors - proportional systematic, constant systematic, random error between methods - therefore providing a lot of quantitative information about method performance. Some of the other experiments seem to test conditions that may not be observed very often - e.g., interference, recovery, and detection limit.

Why can't the correlation coefficient be used to judge the agreement between methods in a comparison of methods study?

Perfect correlation, i.e., a correlation coefficient of 1.000, means that the values by the test method increase in direct proportion to increases in the values by the comparison method. However, a value of 1.000 doesn't mean that the test method values are identical to those of the comparison method. Systematic differences can be present, e.g., the test method could be running 100 units higher than the comparison method, or the test method could be providing results that are only half of the values by the comparison method, yet the correlation coefficient could still give a value near 1.000. Because the comparison of methods experiment is performed to validate the accuracy of a method, the statistical analysis must provide estimates of systematic errors, not just the correlation of results.

The best use of the correlation coefficient is to help decide whether ordinary linear regression will provide reliable estimates of slope and intercept. If r=0.975 or greater, it is generally accepted that ordinary linear regression calculations are adequate for estimating the errors between the methods.
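The point about systematic differences can be checked numerically. The sketch below uses made-up comparison values, a test method that runs 100 units high, and a test method that reads half the comparison value; both give a correlation coefficient of 1 even though neither agrees with the comparison method.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical comparison-method results
comparison = [50.0, 100.0, 150.0, 200.0, 250.0]
shifted = [value + 100.0 for value in comparison]  # constant error of +100 units
halved = [0.5 * value for value in comparison]     # proportional error, half the value

r_shifted = pearson_r(comparison, shifted)
r_halved = pearson_r(comparison, halved)
```

Both `r_shifted` and `r_halved` come out at 1 (to within rounding), which is why r by itself says nothing about bias.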

Why are regression statistics still recommended, given recent publications that emphasize the use of a difference plot as the primary way to present the data from the comparison of methods experiment?

Remember that the purpose of the comparison of methods experiment is to estimate systematic errors, which may be constant or proportional in nature. Regression statistics can provide estimates of these components of systematic error by the y-intercept and slope, as well as estimation of the overall systematic error or bias at any decision level concentration of interest by calculation from the regression equation. The difference plot, on the other hand, emphasizes the random errors between the methods. You actually need to calculate the average difference or bias from paired t-test statistics to get a good estimate of the systematic error, thus the difference plot by itself (without statistical calculations) does not provide sufficient information about the systematic error of the method. Regression statistics are preferred over t-test statistics in order to calculate the systematic error at any decision level, as well as getting estimates of the proportional and constant components of systematic error.

In 2007, the Clinical Chemistry journal – which originally promoted the use of the Bland-Altman approach – provided the following guidance in their “Information for Authors” (accessed February 5, 2007, at www.clinchem.org/info_ar/anal_meth.shtml):

“If regression analysis is used for statistical evaluation of the data, supply slopes and intercepts (and their standard deviations) and standard deviations of residuals (Sy/x, often called standard errors of estimates). Unbiased (e.g., Deming) regression is typically required… The correlation coefficient has limited utility. Residuals plots (e.g., Bland-Altman) are often useful. On the horizontal axis, plot the mean of results by the two studied methods, not the result of one method.”

Thus, regression analysis seems to be coming back into favor!

What is “Bland-Altman” analysis of method-comparison data?

Bland and Altman recommend plotting the difference between the results by the test and comparative methods on the y-axis versus the average of the results by the two methods on the x-axis. They argue that since the true value of a sample is not known, it is best to use the average of the test and comparative values as the estimate of the true value. In the ordinary difference plot discussed in this book, the value on the x-axis is the result from the comparative method, rather than the average of the results by the test and comparative methods as used for the Bland-Altman plot.

Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986;i:307-310.
Bland JM, Altman DG. Statistical methods for assessing agreement between measurements. Biochim Clin 1987;11:399-404.
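The difference between the two kinds of plot is only in the x-coordinate, as this small sketch shows. The function names are hypothetical; each returns the (x, y) points to be plotted with whatever graphing tool is at hand.

```python
def bland_altman_points(test, comparison):
    """(average of the two methods, test minus comparison) for each sample."""
    return [((t + c) / 2.0, t - c) for t, c in zip(test, comparison)]


def ordinary_difference_points(test, comparison):
    """(comparison value, test minus comparison) for each sample."""
    return [(c, t - c) for t, c in zip(test, comparison)]


# Hypothetical paired results
test_results = [110.0, 205.0, 148.0]
comparison_results = [100.0, 200.0, 150.0]

ba_points = bland_altman_points(test_results, comparison_results)
diff_points = ordinary_difference_points(test_results, comparison_results)
```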

Isn’t Bland-Altman simpler to use than regression?

That’s the claim being made by advocates of the Bland-Altman approach. They think that regression is too complicated because the slope and intercept can be affected by lack of linearity, presence of outliers, and a narrow range of test results. On the other hand, Bland-Altman analysis requires the use of t-test statistics to make a quantitative estimate of systematic error. Remember that this estimate of bias is reliable only at the mean of the data if there is any proportional error present. It turns out that a difference plot must be very carefully constructed if an objective decision is to be made about method performance. See the following discussions in the literature for more detail about the strengths and weaknesses of this approach:

Petersen PH, Stockl D, Blaabjerg O, Pedersen B, Birkemose E, Thienpont L, Lassen JF, Kjeldsen J. Graphical interpretation of analytical data from comparison of a field method with a reference method by use of difference plots. Clin Chem 1997;43:2039-2046.
Hollis S. Analysis of method comparison studies. Editorial. Ann Clin Biochem 1996;33:1-4.
Stockl D. Beyond the myths of difference plots. Ann Clin Biochem 1996;33:575-577.

What's the proper way to use t-test statistics?

There are two cases where t-test statistics will provide reliable estimates of systematic errors.

Case 1: proportional error is absent, therefore the estimate of systematic error or bias is applicable throughout the range of the data.
Case 2: the estimate of systematic error or bias is interpreted at a decision level near the mean of the data.

Plot the data on a comparison plot (test value on the y-axis, comparison value on the x-axis) to assess whether proportional error is present or absent. If absent, then plot the data on a difference plot, i.e., plot the difference of the test minus comparison values on the y-axis versus the comparison values on the x-axis.

When using t-test statistics, present the following:

bias,
standard deviation of the differences,
mean of the data,
t-value, and the
difference plot.
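The numerical items in this list can be computed from the paired results as sketched below. This is an illustration under stated assumptions: the function name is hypothetical, "mean of the data" is taken here to be the mean of the comparison-method values, and the t-value is the usual paired t statistic (bias divided by the standard error of the differences).

```python
import math
import statistics


def paired_t_statistics(test, comparison):
    """Bias, SD of differences, mean of the comparison data, and paired t-value."""
    differences = [t - c for t, c in zip(test, comparison)]
    n = len(differences)
    bias = statistics.mean(differences)         # average difference, test minus comparison
    sd_diff = statistics.stdev(differences)     # sample SD of the differences (n - 1)
    t_value = bias / (sd_diff / math.sqrt(n))   # undefined if all differences are equal
    return {
        "n": n,
        "bias": bias,
        "sd_diff": sd_diff,
        "mean_comparison": statistics.mean(comparison),
        "t_value": t_value,
    }


# Hypothetical paired results
stats = paired_t_statistics([102.0, 105.0, 99.0, 108.0], [100.0, 102.0, 98.0, 104.0])
```

The t-value only indicates whether the bias is statistically distinguishable from zero; whether the bias is acceptable still has to be judged against the allowable error.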

What's the proper way to use regression statistics?

Plot the test value on the y-axis versus the comparison value on the x-axis, then inspect the data for:

nonlinearity
outliers
wide range of data

Calculate the correlation coefficient as a measure of the range of data. However, you should first inspect a graph to be sure the data are spread fairly uniformly over the range, so that the r value is not being influenced by a few high or low points. If r=0.99 or greater, the range of data is wide enough to provide reliable estimates of the slope and y-intercept using ordinary linear regression analysis. If r<0.95, it is generally advised to use an alternate statistical technique, such as t-test statistics, to estimate the overall systematic error; or use an alternate regression technique, such as Deming regression, to calculate the slope and y-intercept.

Calculate the slope, y-intercept, and standard deviation of points about the regression line. Interpret the deviation of the slope from an ideal value of 1.000 as proportional error, the deviation of the y-intercept from an ideal value of 0.00 as an estimate of constant systematic error, and the value of the standard deviation of the points about the regression line as a measure of the random error between the methods.

Calculate the systematic error at medically important decision concentrations (Xc) using the regression equation. SE = Yc - Xc = (a + bXc) - Xc.
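A minimal sketch of these calculations with ordinary linear regression follows. The function names and the example data are hypothetical; a is the y-intercept, b is the slope, and the standard deviation about the line uses n - 2 degrees of freedom.

```python
import math


def ols_regression(x, y):
    """Return (a, b, s_yx): y-intercept, slope, and SD of points about the line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = s_xy / s_xx                      # slope; deviation from 1.000 is proportional error
    a = mean_y - b * mean_x              # y-intercept; deviation from 0 is constant error
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s_yx = math.sqrt(sum(r * r for r in residuals) / (n - 2))  # random error between methods
    return a, b, s_yx


def systematic_error(a, b, xc):
    """SE = Yc - Xc = (a + b*Xc) - Xc at decision concentration Xc."""
    return (a + b * xc) - xc


# Hypothetical data lying exactly on y = 2 + 1.05x
x_values = [50.0, 100.0, 150.0, 200.0]
y_values = [2.0 + 1.05 * xi for xi in x_values]
a, b, s_yx = ols_regression(x_values, y_values)
se_at_200 = systematic_error(a, b, 200.0)   # about 12 units for this made-up line
```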

Present the following:

slope,
y-intercept,
standard deviation of points about the regression line,
standard deviation of the slope (when available),
standard deviation of the y-intercept (when available),
correlation coefficient, and the
comparison plot.

What's Deming regression?

This refers to an alternate way of calculating regression statistics when the range of data isn't as wide as desired for ordinary linear regression (i.e., the correlation coefficient doesn't satisfy the criterion of being 0.99 or greater). An assumption in ordinary linear regression is that the x-values are well known and any difference between x and y-values is assignable to error in the y-value. In Deming regression, the errors between methods are assigned to both methods in proportion to the variances of the methods. The calculations are not commonly available in standard statistical programs. However, special computer programs for laboratory method validation will often include Deming regression.

For a detailed discussion of Deming regression and the calculations, see Cornbleet PJ, Gochman N. Incorrect least-squares regression coefficients in method-comparison analysis. Clin Chem 1979;25:432-438.
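For orientation, here is a minimal sketch of the Deming slope and intercept. It is not taken from the reference above: it uses the common textbook form in which `delta` is the assumed ratio of the error variance of the y-method to the error variance of the x-method (other sources define the ratio the other way round), and the function name is hypothetical. With `delta=1.0` the two methods are assumed to be equally imprecise.

```python
import math


def deming_regression(x, y, delta=1.0):
    """Return (intercept, slope) from Deming regression.

    delta: assumed ratio (error variance of y) / (error variance of x).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xx = sum((xi - mean_x) ** 2 for xi in x) / (n - 1)
    s_yy = sum((yi - mean_y) ** 2 for yi in y) / (n - 1)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)
    diff = s_yy - delta * s_xx
    # Undefined when s_xy is zero (no linear relationship between the methods)
    slope = (diff + math.sqrt(diff ** 2 + 4.0 * delta * s_xy ** 2)) / (2.0 * s_xy)
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical paired results with scatter in both methods
x_noisy = [1.0, 2.0, 3.0, 4.0, 5.0]
y_noisy = [1.2, 1.9, 3.2, 3.8, 5.1]
intercept, slope = deming_regression(x_noisy, y_noisy, delta=1.0)
```

As `delta` becomes very large (all error assigned to y), the slope approaches the ordinary linear regression slope, which is the assumption described in the paragraph above.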

What’s Passing-Bablok regression?

Another alternate regression procedure is called Passing-Bablok regression, after the authors who described the technique. The slopes are calculated for every combination of two points in the data set, then the slopes are ordered and ranked, and the median value is selected as the best estimate. There is no need for additional information about the relative SDs of the test and comparative methods, thus Passing-Bablok might be more generally applicable than Deming regression. However, Stockl, Dewitte, and Thienpont seem to recommend Deming regression over Passing-Bablok regression in a study of comparative regression statistics. The special calculations needed for Passing-Bablok are not usually found in general purpose statistical packages, so a specialized program is needed.

Passing H, Bablok W. A new biometrical procedure for testing the equality of measurements from two different analytical methods. J Clin Chem Clin Biochem 1983;21:709-720.
Stockl D, Dewitte K, Thienpont LM. Validity of linear regression in method comparison studies: Is it limited by the statistical model or the quality of the analytical input data? Clin Chem 1998;44:2340-2346.
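The core idea described above, taking the median of all pairwise slopes, can be sketched as follows. This is a simplified illustration only: it omits the offset correction and the confidence intervals of the published procedure, so it should not be taken as a full implementation. The function name is hypothetical, and the intercept is estimated here as the median of y - slope*x, which is an assumption added for completeness.

```python
import statistics
from itertools import combinations


def median_pairwise_slope_fit(x, y):
    """Return (intercept, slope) using the median of all pairwise slopes."""
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(zip(x, y), 2)
        if x2 != x1                       # skip pairs with identical x-values
    ]
    slope = statistics.median(slopes)
    intercept = statistics.median(yi - slope * xi for xi, yi in zip(x, y))
    return intercept, slope


# Hypothetical data: four points on y = 2x plus one gross outlier
x_data = [1.0, 2.0, 3.0, 4.0, 5.0]
y_data = [2.0, 4.0, 6.0, 8.0, 30.0]
robust_intercept, robust_slope = median_pairwise_slope_fit(x_data, y_data)
```

Because the median ignores the extreme slopes produced by the outlier, the fit for this made-up data stays on the line through the other four points.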

What computer programs are available to calculate Deming and Passing-Bablok regression?

This is by no means an exhaustive list, but reflects computer programs we know are available at the time of publication:

• Analyse-it add-on for Microsoft Excel spreadsheet, available from Analyse-it Software, Ltd (http://www.analyse-it.com)

• CBStat from Kristian Linnet (http://www.cbstat.com)

• EP Evaluator from David G. Rhoads Associates (http://www.dgrhoads.com)

• Method Validator from Philippe Marquis at Marquis-soft (http://www.marquis-soft.com)

What's the alternative to more complicated regression calculations (such as Deming or Passing-Bablok regression)?

You can collect the data very carefully to permit the application of other statistical calculations. Consider strategies to:

Expand the analytical range of the test results so ordinary linear regression statistics will be valid.
Reduce the variation of the comparison method by performing duplicate measurements, i.e., reduce the error in the x-value to better satisfy the assumption in ordinary linear regression.
Collect the data around the medically important decision concentrations, then analyze subsets of data using t-test statistics.
Interpret the data only at the mean of the data set to minimize the effect of the regression technique on the estimate of systematic error.

What tests are likely to have a narrow range of data and require more care and attention to data collection and statistical calculations?

Tests that may have a narrow analytical range include analytes such as calcium, chloride, and sodium, where the body itself attempts to maintain a narrow range of concentrations. Other tests, such as creatinine, may have a narrow concentration range in a healthy population and therefore need to be evaluated using a patient population from a hospital. Therapeutic drug levels, of course, will depend on obtaining patient specimens for varying doses and varying times following the doses. As a general strategy, make use of specimens from a hospital population to obtain a wide range of concentrations.

Why can't acceptability be judged by tests of significance, such as t-test and F-test?

Tests of significance are useful mainly to assess whether there are sufficient data to support a conclusion that a difference or error exists (statistical significance), not whether that difference or error is large enough to invalidate the usefulness of a test (clinical significance). It is best to judge the acceptability of method performance by comparison of the observed errors to the total error that is allowable (such as defined in the CLIA criteria for acceptability of proficiency testing performance).

How does the "method decision chart" approach compare with the "performance criteria" used in the past to judge the acceptability of a method?

The method decision chart provides a graphical way of comparing the observed errors with standards of performance, whereas the earlier performance criteria provided a mathematical comparison. Therefore, the method decision chart is easier to use. In addition, the method decision chart permits simultaneous assessment against the different definitions of allowable total error, such as bias + 2s, bias + 3s, and bias + 4s, which have evolved since the original description of "performance criteria."

For the original discussion of "performance criteria", see Westgard JO, Carey RN, Wold S. Criteria for judging the precision and accuracy in method development and evaluation. Clin Chem 1974;20:825-833.
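The arithmetic behind the bias + 2s, bias + 3s, and bias + 4s criteria can be sketched as a comparison of the observed errors with the allowable total error. The function name and the example figures are hypothetical, and the sketch only reports which criteria are met; it does not reproduce the chart itself. Bias, s, and the allowable total error must be in the same units (for example, all in percent).

```python
def total_error_criteria(bias, sd, allowable_total_error, multipliers=(2, 3, 4)):
    """For each multiplier k, report whether |bias| + k*s fits within the allowable total error."""
    return {k: abs(bias) + k * sd <= allowable_total_error for k in multipliers}


# Hypothetical method: bias 2.0%, s 2.5%, allowable total error 10%
criteria_met = total_error_criteria(2.0, 2.5, 10.0)
```

For these made-up figures the method satisfies the bias + 2s and bias + 3s criteria (7.0 and 9.5) but not bias + 4s (12.0).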

Where can I find more detailed protocols and statistical guidelines for method validation experiments?

The Clinical and Laboratory Standards Institute (CLSI, 940 West Valley Road, Suite 1400, Wayne, PA 19087-1898, phone 610-688-0100) publishes a series of documents that provide extensive information about individual experiments:

EP5-A2. Evaluation of precision performance of quantitative measurement procedures, 2004.
EP6-P2. Evaluation of the linearity of quantitative measurement procedures, 2003.
EP7-A. Interference testing in clinical chemistry, 2005.
EP9-A2. Method comparison and bias estimation using patient samples, 2002.
EP10-A3. Preliminary evaluation of quantitative clinical laboratory measurement procedures, 2006.
EP12-A. User protocol for evaluation of qualitative test performance, 2002.
EP14-A2. Evaluation of matrix effects, 2005.
EP15-A2. User verification of performance for precision and trueness, 2005.
EP17-A. Protocols for determination of limits of detection and limits of quantitation, 2004.
EP21-A. Estimation of total analytical error for clinical laboratory methods, 2003.
C28-P3. Defining, establishing, and verifying reference intervals in the clinical laboratory, 2008.
