# Method Validation

## Points of Care in Using Statistics in Method-Comparison Studies

**A top 10 list of things you should know about method validation. If you're not sure you're using the right statistics, the right regression, or the right plots and graphs, read here.**

- Introduction
- Use statistics to provide estimates of errors, not as indicators of acceptability.
- Recognize that the main purpose of the method-comparison experiment is to obtain an estimate of systematic error or bias.
- Obtain estimates of systematic error at important medical decision concentrations.
- When there is a single medical decision level, make the estimate of systematic error near the mean of the data.
- When there are two or more medical decision levels, use the correlation coefficient, r, to assess whether the range of data is adequate for using ordinary regression analysis.
- When r is high, use the comparison plot along with ordinary linear regression statistics.
- When r is low, improve the data or change the statistical technique.
- When r is low and a difference plot is used, calculate t-test statistics to provide a quantitative estimate of systematic error.
- When in doubt about the validity of the statistical technique, see whether the choice of statistics changes the outcome or decision on acceptability.
- Plan the experiment carefully and collect the data appropriate for the statistical technique to be used.
- References

*This is an annotated version of an editorial that appeared in the November 1998 issue of Clinical Chemistry, volume 44, pages 2240-2242. This version includes links to supporting materials available on this website.*

As clinical chemists and laboratory scientists, we are often concerned when personnel who have little laboratory training begin to perform laboratory tests, such as in point-of-care applications. It may be easy to perform such tests today with modern analytical systems, but there still are things that could go wrong. We hope that some kind of quality system is used to check that everything is working okay with point-of-care analyses.

Imagine how statisticians might feel about the powerful statistics programs that are now in our hands. It's so easy to key in a set of data and calculate a wide variety of statistics, regardless of what those statistics are or what they mean. There also is a need to check that things are done correctly in the statistical analyses we perform in our laboratories.

In the November 1998 issue of Clinical Chemistry, Stockl, Dewitte, and Thienpont (1) provide an interesting discussion of linear regression techniques in method-comparison studies, pointing out that the quality of the data may be more important than the quality of the regression technique (e.g., ordinary linear regression vs Deming regression vs Passing-Bablok regression). In the Clinical Chemistry journal, the standard method for analyzing the data from a method-comparison experiment has been to prepare a "comparison plot" that shows the test method results on the y-axis and the comparative method results on the x-axis, then calculate regression statistics to determine the best line of fit for the data. Different regression techniques may be appropriate, depending on the characteristics of the data - particularly the analytical range that is covered relative to the test values that are critical for medical applications.

Elsewhere in the literature (2), there is a movement to discourage the use of regression analysis altogether and replace it with a simple graphical presentation of method-comparison data in the form of a "difference plot," which displays the difference between the test and comparative results on the y-axis vs the mean of the test and comparative results on the x-axis. This difference plot has become known as the Bland-Altman plot (3). Hyltoft Petersen et al. (4) have shown that a difference plot must be carefully constructed to make an objective decision about method performance. The difference plot is actually not so simple when an objective interpretation is to be made.

In spite of these recent reports and recommendations on the use of statistics, many analysts and investigators still have difficulties with method comparison data. We studied some of the problems twenty-five years ago (5) [see the downloads section of this website]. For the most part, there are similar problems today - with the exception that the calculations are much easier to perform with today's computer programs. There has not been much improvement, if any, in the basic statistical knowledge and skills available in laboratories today, not only for method validation studies but also for statistical quality control. That doesn't mean there haven't been improvements in the theory and recommendations appearing in the literature, but rather that the practices in laboratories haven't really changed very much.

Therefore, we still need to exercise a great deal of care in collecting, analyzing, and interpreting method comparison data. Here are some points to be careful about.

**Point #1: Use statistics to provide estimates of errors, not as indicators of acceptability.** This is perhaps the most fundamental point for making practical sense of statistics in method validation studies. Remember that the purpose of a method validation study is to estimate or validate claims for method performance characteristics. [See MV - The inner, hidden, deeper, secret meaning.] Application and methodology characteristics should be dealt with during the selection of the method. [See MV - The selection of a method to validate.] The statistics are simply tools for quantitating the size of the errors from data collected in different method validation experiments. [See MV - The data analysis tool kit for an introduction or review of commonly used statistics.]

The statistics don't directly tell you whether the method is acceptable; rather, they provide estimates of errors that allow you to judge the acceptability of a method. You do this by comparing the amount of error observed with the amount of error that would be allowable without compromising the medical use and interpretation of the test result. Method performance is judged acceptable when the observed error is smaller than the defined allowable error. Method performance is not acceptable when the observed error is larger than the allowable error. This decision-making process can be facilitated by mathematical criteria (6) or by graphic tools (7). [See MV - The decision on method performance.]

**Point #2: Recognize that the main purpose of the method-comparison experiment is to obtain an estimate of systematic error or bias.** The comparison of methods experiment is performed to study the accuracy of a new method. [See MV - The experimental plan.] The essential information is the average systematic error, or bias. It is also useful to obtain information about the proportional and constant nature of the systematic error and to quantify the random error between the methods. The components of error are important because they relate to the things we can manage in the laboratory to control the total error of the testing process (e.g., reduce proportional systematic error by improved calibration). The total error is important in judging the acceptability of a method and can be calculated from the components.
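As a rough illustration of the judgment described in Point #1, the sketch below combines a bias and a random error into a total error and compares it with an allowable error. The additive form `abs(bias) + multiplier * sd`, the multiplier of 2, the function names, and the example numbers are all assumptions made for illustration; see reference (6) for the actual criteria.

```python
def total_error(bias: float, sd: float, multiplier: float = 2.0) -> float:
    """Combine systematic and random error into one total error estimate.

    Assumption: an additive model, |bias| + multiplier * sd, with a
    multiplier of 2. The editorial does not specify the formula.
    """
    if sd < 0:
        raise ValueError("sd must be non-negative")
    return abs(bias) + multiplier * sd


def judge_performance(observed_error: float, allowable_error: float) -> str:
    """Judge performance by comparing observed error with allowable error.

    Acceptable when the observed error is smaller than the allowable error.
    Assumption: an observed error exactly equal to the allowable error is
    reported as not acceptable (the text does not cover this case).
    """
    if allowable_error <= 0:
        raise ValueError("allowable_error must be positive")
    if observed_error < allowable_error:
        return "acceptable"
    return "not acceptable"


# Hypothetical example: bias of -2.0 units, SD of 1.5 units,
# allowable total error of 10.0 units at the decision level.
observed = total_error(bias=-2.0, sd=1.5)
print(observed, judge_performance(observed, allowable_error=10.0))
```

Note that the statistics only supply the observed error; the allowable error has to come from the medical requirements for the test.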

**Point #3: Obtain estimates of systematic error at important medical decision concentrations.** The collection of specimens and choice of statistics can be optimized by focusing on the concentration (or concentrations) where the interpretation of a test result will be most critical for the medical application of the test. [See the summary of medical decision levels from Dr. Bernard Statland.] If there is only a single medical decision concentration, the method comparison data may be collected around that level (i.e., a wide range of data will not be necessary) and a difference plot should be useful (along with an estimate of bias from t-test analysis - see point #8 below). If there are two or more decision levels, it is desirable to collect specimens that cover a wide analytical range, use a comparison plot to display the results, and calculate regression statistics to estimate the systematic error at each of the decision levels.
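For the case of two or more decision levels, a minimal sketch of the regression approach might look like the following. It fits an ordinary least-squares line and estimates the systematic error at each decision level as the difference between the fitted value and the level itself. The data and the decision levels of 100 and 200 are hypothetical, and the function names are ours.

```python
from statistics import mean


def ols_fit(x, y):
    """Ordinary least-squares fit of y on x; returns (intercept a, slope b)."""
    x_av, y_av = mean(x), mean(y)
    sxx = sum((xi - x_av) ** 2 for xi in x)
    sxy = sum((xi - x_av) * (yi - y_av) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    b = sxy / sxx
    a = y_av - b * x_av
    return a, b


def systematic_error(a, b, x_c):
    """Systematic error at decision concentration x_c: (a + b*x_c) - x_c."""
    return (a + b * x_c) - x_c


# Hypothetical comparison data: x = comparative method, y = test method.
comparative = [50, 100, 150, 200, 250]
test = [54, 105, 156, 207, 258]

a, b = ols_fit(comparative, test)
for level in (100, 200):  # hypothetical medical decision levels
    se = systematic_error(a, b, level)
    print(f"decision level {level}: SE = {se:.2f}")
```

Because the slope and intercept are used together, the estimate of systematic error can differ from one decision level to the next, which is exactly why a wide range of data is needed in this situation.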

**Point #4: When there is a single medical decision level, make the estimate of systematic error near the mean of the data.** The main consideration when there is a single medical decision level is to collect the data around that medical decision level. The choice of statistics will not be critical when there is only one medical decision level of interest and it falls near the mean of the data. The bias statistic from paired t-test calculations and the systematic error calculated from regression statistics will provide the same estimate of the error.

[Note of explanation: The bias is the difference between the means of the two methods ($\text{bias} = Y_{av} - X_{av}$), which is also equivalent to the average of the paired differences from paired t-test calculations. With regression statistics, the systematic error (SE) is estimated at a critical concentration, $X_{C}$, as follows: $SE = Y_{C} - X_{C}$, where $Y_{C}$ is calculated from the regression statistics by the equation $Y_{C} = a + bX_{C}$, where $a$ is the y-intercept and $b$ is the slope of the regression line. In ordinary linear regression, the slope is calculated first, then the y-intercept is determined from $a = Y_{av} - bX_{av}$. When the decision concentration equals $X_{av}$, then $SE = (a + bX_{av}) - X_{av} = Y_{av} - bX_{av} + bX_{av} - X_{av} = Y_{av} - X_{av}$, i.e., the same estimate of systematic error will be obtained from regression statistics as from t-test statistics, even if the range of data is narrow and the values for the slope and intercept are not reliable.]
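The identity in the note can be checked numerically. The sketch below uses a small, deliberately narrow-range data set (hypothetical values) and computes the bias from the paired differences and the systematic error from ordinary regression evaluated at the mean of the comparative results. The function name is ours.

```python
from statistics import mean


def bias_and_se_at_mean(x, y):
    """Return (bias, SE at x mean) for paired results x (comparative), y (test)."""
    x_av, y_av = mean(x), mean(y)

    # Bias as the average of the paired differences.
    bias = mean(yi - xi for xi, yi in zip(x, y))

    # Ordinary linear regression: slope first, then the intercept.
    sxx = sum((xi - x_av) ** 2 for xi in x)
    sxy = sum((xi - x_av) * (yi - y_av) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_av - b * x_av

    # Systematic error at the decision concentration X_C = x_av.
    se_at_mean = (a + b * x_av) - x_av
    return bias, se_at_mean


# Hypothetical narrow-range data clustered around one decision level.
comparative = [98, 101, 103, 99, 104, 100]
test = [101, 102, 107, 100, 106, 104]

bias, se = bias_and_se_at_mean(comparative, test)
print(f"bias = {bias:.4f}, SE at mean = {se:.4f}")
```

The two numbers agree to within floating-point rounding, whatever the slope turns out to be, which is the point of the note.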

**Point #5: When there are two or more medical decision levels, use the correlation coefficient, r, to assess whether the range of data is adequate for using ordinary regression analysis.** As confirmed by Stockl, Dewitte, and Thienpont (1), when r is 0.99 or greater, the range of data should be wide enough for ordinary linear regression to provide reliable estimates of the slope and intercept. They caution that when r is less than 0.975, ordinary linear regression may not be reliable, and they recommend that the data be improved or alternative statistics be used. Note that r is not used to judge the acceptability of method performance here, but to judge the acceptability of the concentration range of the data being used to calculate the regression statistics.
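A minimal sketch of this check follows. It computes r and maps it onto the two cutoffs quoted above (0.99 and 0.975). The "borderline" label for values between the cutoffs is our own assumption, since the text does not say how to treat that zone, and the function names are ours.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient between paired results x and y."""
    x_av, y_av = mean(x), mean(y)
    sxx = sum((xi - x_av) ** 2 for xi in x)
    syy = sum((yi - y_av) ** 2 for yi in y)
    sxy = sum((xi - x_av) * (yi - y_av) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each contain some variation")
    return sxy / sqrt(sxx * syy)


def range_assessment(r):
    """Classify the adequacy of the data range, not the method itself."""
    if r >= 0.99:
        return "adequate"    # ordinary linear regression should be reliable
    if r < 0.975:
        return "inadequate"  # improve the data or change the statistics
    return "borderline"      # assumption: zone not addressed in the text


# Hypothetical data covering a wide analytical range.
comparative = [50, 100, 150, 200, 250]
test = [54, 103, 158, 205, 259]
r = pearson_r(comparative, test)
print(f"r = {r:.4f}: range is {range_assessment(r)}")
```

The label describes the concentration range of the data, so an "adequate" result says nothing yet about whether the method's errors are acceptable.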

**Point #6: When r is high, use the comparison plot along with ordinary linear regression statistics.** The reliability of the slope and intercept is affected by outliers and non-linearity, as well as the concentration range of the data (5). [See the downloads section of this website.] Outliers need to be identified, preferably at the time of analysis by immediately plotting the data on the comparison graph; discrepant results can then be investigated while the specimens are still available. Non-linearity can usually be identified from visual inspection of the comparison plot, the range can be restricted to the linear portion, and the statistics recalculated. Stockl, Dewitte, and Thienpont (1) recommend using the residual plot that is available as part of regression analysis and inspecting the sign-sequence of the residuals for making this assessment.
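One simple way to look at the sign-sequence of the residuals is sketched below: fit the ordinary regression line, list the sign of each residual in order of increasing comparative value, and count the runs of like signs. This is only an illustration of the idea, not the specific procedure of reference (1). The function names and the curved example data are hypothetical, and residuals of exactly zero are skipped by assumption.

```python
from statistics import mean


def residual_signs(x, y):
    """Return the residual signs ('+'/'-') from an OLS fit, ordered by x."""
    pairs = sorted(zip(x, y))
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    x_av, y_av = mean(xs), mean(ys)
    sxx = sum((xi - x_av) ** 2 for xi in xs)
    sxy = sum((xi - x_av) * (yi - y_av) for xi, yi in zip(xs, ys))
    b = sxy / sxx
    a = y_av - b * x_av
    residuals = [yi - (a + b * xi) for xi, yi in zip(xs, ys)]
    # Assumption: residuals that are exactly zero carry no sign and are skipped.
    return "".join("+" if r > 0 else "-" for r in residuals if r != 0)


def count_runs(signs):
    """Count the runs of consecutive identical signs in the sequence."""
    if not signs:
        return 0
    return 1 + sum(1 for prev, cur in zip(signs, signs[1:]) if prev != cur)


# Hypothetical curved relationship: the test method reads high at both ends.
comparative = [1, 2, 3, 4, 5, 6]
test = [1, 4, 9, 16, 25, 36]
signs = residual_signs(comparative, test)
print(signs, "runs:", count_runs(signs))
```

A few long runs of the same sign, as in this example, suggest curvature; residuals that scatter randomly about the line tend to change sign often.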

**Point #7: When r is low, improve the data or change the statistical technique.** Consider the alternatives of improving the range of data, reducing the variation from the comparison method by replicate analyses, estimating the systematic error at the mean of the data, dividing the data into subgroups whose means agree with the medical decision levels (which can then be analyzed by t-test statistics and the difference plot), or using a more complicated regression technique. Stockl, Dewitte, and Thienpont find that the Deming regression technique is more satisfactory than the Passing-Bablok technique. Note that these regression techniques are not standard in ordinary statistics programs, but they are available in special programs designed for laboratory method evaluation studies.
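For readers without a special program at hand, the standard closed-form Deming estimate can be sketched in a few lines. The editorial itself does not give the formula. Here `delta` is the assumed ratio of the error variance of the test method (y) to that of the comparison method (x), and the default of 1 is an assumption that both methods are equally imprecise. Texts differ in how they define this ratio, so check the convention before reusing the sketch.

```python
from math import sqrt
from statistics import mean


def deming_fit(x, y, delta=1.0):
    """Deming regression; returns (intercept a, slope b).

    delta: assumed ratio var(error in y) / var(error in x).
    delta = 1 corresponds to orthogonal regression.
    """
    x_av, y_av = mean(x), mean(y)
    sxx = sum((xi - x_av) ** 2 for xi in x)
    syy = sum((yi - y_av) ** 2 for yi in y)
    sxy = sum((xi - x_av) * (yi - y_av) for xi, yi in zip(x, y))
    if sxy == 0:
        raise ValueError("slope is undefined when the covariance is zero")
    diff = syy - delta * sxx
    b = (diff + sqrt(diff ** 2 + 4 * delta * sxy ** 2)) / (2 * sxy)
    a = y_av - b * x_av
    return a, b


# Hypothetical comparison data: x = comparative method, y = test method.
comparative = [50, 100, 150, 200, 250]
test = [54, 103, 158, 205, 259]

a, b = deming_fit(comparative, test)
print(f"Deming: slope = {b:.4f}, intercept = {a:.4f}")
```

Unlike ordinary regression, this estimate allows for error in the comparison method as well, which is why it behaves better when the range of data is narrow relative to the imprecision.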

**Point #8: When r is low and a difference plot is used, calculate t-test statistics to provide a quantitative estimate of systematic error.** Given the objective of estimating systematic error from the method-comparison experiment, the usefulness of the difference plot by itself is questionable since visual interpretation will be mainly influenced by the scatter or random error observed between the methods. The bias, or average difference of paired sample results, should be calculated. Computer routines for calculating t-test statistics will provide this estimate, along with an estimate of the standard deviation of the differences, which gives a quantitative measure of the scatter between the methods. Note that this scatter between the methods depends on the imprecision of the test method, the imprecision of the comparison method, and any interferences that affect individual samples differently by the two methods. The t-value itself is a ratio of systematic to random error, and is mainly useful for determining whether sufficient data have been collected to make a reliable estimate of the bias (again, avoid using a statistic as an indicator of the acceptability of the method). While Bland and Altman also recommend calculation of the mean difference and the standard deviation of the differences and suggest that the mean difference plus/minus 2 standard deviations be drawn on the chart (3), it is wrong to judge the acceptability of the observed differences by comparison to themselves. See Hyltoft Petersen et al. (4) for an extensive discussion of judging method acceptability on the basis of the difference plot.
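The paired t-test statistics mentioned here are easy to compute directly. The sketch below returns the bias, the standard deviation of the differences, and the t-value for a set of paired results. The data and the function name are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def paired_difference_stats(test, comparative):
    """Return (bias, SD of differences, t-value) for paired results."""
    if len(test) != len(comparative):
        raise ValueError("test and comparative must have the same length")
    diffs = [t - c for t, c in zip(test, comparative)]
    n = len(diffs)
    if n < 2:
        raise ValueError("at least two paired results are required")

    bias = mean(diffs)      # average difference = estimate of systematic error
    sd_diff = stdev(diffs)  # sample SD of differences = scatter between methods
    if sd_diff == 0:
        raise ValueError("t-value is undefined when all differences are equal")

    # t is the bias relative to its standard error, sd_diff / sqrt(n).
    t_value = bias / (sd_diff / sqrt(n))
    return bias, sd_diff, t_value


# Hypothetical paired patient results near a single decision level.
test = [102, 105, 99, 108, 101]
comparative = [100, 102, 98, 104, 101]

bias, sd_diff, t_value = paired_difference_stats(test, comparative)
print(f"bias = {bias:.2f}, SD of differences = {sd_diff:.2f}, t = {t_value:.2f}")
```

The bias and the standard deviation of the differences are the error estimates to carry forward; the t-value only indicates how well the bias has been pinned down by the amount of data collected.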

**Point #9: When in doubt about the validity of the statistical technique, see whether the choice of statistics changes the outcome or decision on acceptability.** Given the ease with which the calculations can be performed with computer programs, the effect of the statistical technique on the estimates of performance can be assessed by comparing the results from the different techniques. If the statistical technique affects your decision on the acceptability of the method, then be careful. Usually it will be best to collect more data and be sure these new data satisfy the assumptions of the data analysis technique.

**Point #10: Plan the experiment carefully and collect the data appropriate for the statistical technique to be used.** You can collect the data to fit the assumptions of the statistics, or you can change the statistics to compensate for limitations in the data. An understanding of the proper use and application of the statistics will help you plan the experiment and minimize the difficulties in interpreting the results. [See MV - The comparison of methods experiment for a discussion of the factors that are important in planning the experiment.] If you are establishing a standard method validation process in your laboratory, it may be best to put your efforts into collecting the appropriate data - my personal recommendation as the best approach for most healthcare laboratories. This emphasis on getting good data also involves collecting the right specimens under the right conditions, processing those specimens properly, storing the samples appropriately, operating the method or analytical system under representative conditions, and analyzing the patient samples with a process that is under statistical control. This point requires the most care and should have the highest priority. The statistics really don't matter if you don't take care with the data.

In quality management terms, the proper use of statistics is a chronic problem that will continue to flare up until the process is fixed. The process that needs fixing here is the education and training process in clinical chemistry, clinical pathology, and clinical laboratory science. There's a deficiency in a core competency - the ability to use basic statistics in method validation studies as well as for statistical quality control. Correcting this deficiency requires courses for students in undergraduate programs, continuing education workshops and seminars for professionals already in the field, and periodic articles in the scientific literature to remind investigators of the problems and difficulties. There also is a need for easy-to-use statistical software that is designed specifically to deal with the needs and applications in healthcare laboratories. It might even be appropriate for professional organizations such as the AACC or IFCC to support a continuing education curriculum to deal with this on-going need. With today's Internet technology, basic training courses and improved software tools could be delivered to anyone, anywhere, anytime.

### References

1. Stockl D, Dewitte K, Thienpont LM. Validity of linear regression in method comparison studies: limited by the statistical model or the quality of the analytical data? Clin Chem 1998;44:2340-6.
2. Hollis S. Analysis of method comparison studies [editorial]. Ann Clin Biochem 1996;33:1-4.
3. Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986;1:307-10.
4. Hyltoft Petersen P, Stockl D, Blaabjerg O, Pedersen B, Birkemose E, Thienpont L, Flensted Lassen J, Kjeldsen J. Graphical interpretation of analytical data from comparison of a field method with a reference method by use of difference plots [opinion]. Clin Chem 1997;43:2039-46.
5. Westgard JO, Hunt MR. Use and interpretation of common statistical tests in method-comparison studies. Clin Chem 1973;19:49-57.
6. Westgard JO, Carey RN, Wold S. Criteria for judging precision and accuracy in method development and evaluation. Clin Chem 1974;20:825-33.
7. Westgard JO. A method evaluation decision chart for judging method performance. Clin Lab Science 1995;8:277-83.