# Basic Method Validation

## A Review of Predictive Value of Laboratory Tests

Expanding on a previous lesson on Clinical Agreement, Dr. Westgard discusses the Predictive Value of a Laboratory Test.

#### James O. Westgard, Sten A. Westgard

May 2020

In an earlier discussion [1], we considered the use of a Clinical Agreement Study to evaluate the performance of a qualitative test. In such a study, the new or candidate test is compared to an established or comparative test for a group of patients who are positive for the disease and another group who are negative for the disease. The results are then tabulated in a 2x2 contingency table, as shown below:

| Candidate Method (Test) | Comparative Method “Gold Std” Positive | Comparative Method “Gold Std” Negative | Total |
|---|---|---|---|
| Positive | TP | FP | TP+FP |
| Negative | FN | TN | FN+TN |
| Total | TP+FN | FP+TN | Total |

Where TP = Number of results where both tests are positive;

FP = Number of results where the candidate method is positive, but the comparative is negative;

FN = Number of results where the candidate method is negative, but the comparative is positive;

TN = Number of results where both methods are negative.

In this discussion, we are using the terminology True Positives (TP), False Negatives (FN), False Positives (FP), and True Negatives (TN) because our interest is to discuss the Clinical Sensitivity and Clinical Specificity of a test and the predictive value of positive and negative results.

Clinical Sensitivity (Se) and Clinical Specificity (Sp) are calculated as follows:

**Clinical Sensitivity = [TP/(TP+FN)]*100**

**Clinical Specificity = [TN/(TN+FP)]*100**

Keep in mind, these terms correspond to the Percent Positive Agreement (PPA) and Percent Negative Agreement (PNA) in the earlier discussion of the 2x2 Contingency Calculator. The difference is that we are now assuming the comparative method is the “gold standard” for correctly classifying the patients’ disease condition.
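
As a minimal sketch of these two formulas, the Python function below computes Se and Sp from the four cells of the contingency table. The function name and the example counts are assumptions made for illustration; they are not data from a real study.

```python
def sensitivity_specificity(tp, fp, fn, tn):
    """Return (Se, Sp) in percent from the cells of a 2x2 table."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one diseased and one non-diseased subject")
    se = 100.0 * tp / (tp + fn)   # Clinical Sensitivity = [TP/(TP+FN)]*100
    sp = 100.0 * tn / (tn + fp)   # Clinical Specificity = [TN/(TN+FP)]*100
    return se, sp


# Hypothetical counts for illustration only: 50 diseased, 50 non-diseased.
se, sp = sensitivity_specificity(tp=45, fp=2, fn=5, tn=48)
print(f"Se = {se:.1f}%, Sp = {sp:.1f}%")
```

With these made-up counts, 45 of 50 diseased subjects and 48 of 50 non-diseased subjects are classified correctly, so the sketch reports Se = 90.0% and Sp = 96.0%.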

### Acceptable Sensitivity and Specificity

CDC provides some guidance for acceptable performance of rapid influenza diagnostic tests, suggesting that they should achieve 80% sensitivity for detection of influenza A and influenza B viruses and recommending they must achieve 95% specificity where the comparative method is RT-PCR [2]. They also discuss the expected test performance when the prevalence of influenza is 2.5% (very low), 20% (moderate), or 40% (high). The criteria for performance are the predictive values of positive and negative test results, i.e., what’s the chance that a positive result indicates the presence of disease and what’s the chance that a negative result indicates the absence of disease. Those conditions can be evaluated by calculating the **predictive value** of test results.

### Predictive Value

The primary performance characteristics are clinical sensitivity and clinical specificity, but the clinical usefulness of a test depends on the expected *prevalence of disease* (Prev) in the population being tested. The subjects in a Clinical Agreement Study seldom represent the real population that will be tested. For example, the CLSI guidance suggests 50 positive and 50 negative patient specimens to provide minimally reliable estimates of Se and Sp, which is a 50% rate of disease prevalence. What if the prevalence in the population were 20%, or 2%, or 0.2%?

**Case with 20% Prevalence.** For example, assume that Se is 80% and Sp is 95%, which would be considered good performance according to the CDC guidance for infectious disease testing. If you tested 1000 subjects in a population that had 20% prevalence of disease, which might be representative of New York City during the COVID-19 pandemic, how would you interpret the test results?

- In our test population, 200 patients have the disease (20% of 1000), 80% or 160 of those would give positive results (TP=0.80*200) and the other 40 would give false negative results (FN).
- For the 800 negatives (1000-200), 95% or 760 patients (0.95*800) would give negative results (TN) and the other 40 would give positive results (FP).

With this information, we can fill in the numbers in the contingency table.

| Candidate Method (Test) | Comparative Method “Gold Std” Positive | Comparative Method “Gold Std” Negative | Total |
|---|---|---|---|
| Positive | 160 | 40 | 200 |
| Negative | 40 | 760 | 800 |
| Total | 200 | 800 | 1000 |

- The chance that a positive result correctly identifies a patient with disease is the ratio of TP to the total number of positive results, TP+FP, which is 160/200 or 80%, i.e., there is an 80% chance that a positive test result will correctly classify the patient as having the disease.
**PVpositive = TP/(TP+FP) = 160/200 = 80%**

- The chance that a negative result correctly identifies a patient without disease is the ratio of TN to the total number of negative results, TN+FN, which is 760/800, or 95%.
**PVnegative = TN/(TN+FN) = 760/800 = 95%**
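
The reasoning in this case can be sketched in Python: start from Se, Sp, Prev, and the number of subjects tested, derive the expected counts, then compute the predictive values. The function and variable names are assumptions made for illustration, and the counts are expected values rather than study data.

```python
def expected_table(se, sp, prev, n):
    """Expected 2x2 counts for n subjects; se, sp, prev are proportions (0-1)."""
    diseased = prev * n
    healthy = n - diseased
    tp = se * diseased        # diseased subjects who test positive
    fn = diseased - tp        # diseased subjects who test negative
    tn = sp * healthy         # non-diseased subjects who test negative
    fp = healthy - tn         # non-diseased subjects who test positive
    return tp, fp, fn, tn


tp, fp, fn, tn = expected_table(se=0.80, sp=0.95, prev=0.20, n=1000)
pv_positive = tp / (tp + fp)
pv_negative = tn / (tn + fn)
print(f"TP={tp:.0f} FP={fp:.0f} FN={fn:.0f} TN={tn:.0f}")
print(f"PVpositive = {pv_positive:.0%}, PVnegative = {pv_negative:.0%}")
```

Changing `prev` in the call is enough to explore other populations with the same test.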

**Case with 2% Prevalence.** Now consider the case for a prevalence of 2.0%, perhaps representative of California.

- For 20 patients with disease (2% of 1000), the number of TP would be 0.80*20 or 16, which leaves 4 FN patients.
- For the 980 patients without disease (1000-20), the number of TN would be 0.95*980 or 931, which leaves 49 FP.

| Candidate Method (Test) | Comparative Method “Gold Std” Positive | Comparative Method “Gold Std” Negative | Total |
|---|---|---|---|
| Positive | 16 | 49 | 65 |
| Negative | 4 | 931 | 935 |
| Total | 20 | 980 | 1000 |

- The chance that a positive result correctly identifies a patient with disease is given by TP/(TP+FP), or 16/(16+49), or 25%.
- The chance that a negative result correctly identifies a patient without disease is given by TN/(TN+FN), or 931/(931+4), or 99.6%.

This test would clearly be more useful in California for identifying patients without disease rather than identifying patients with disease. In New York, however, a positive test result is more likely a good indication of disease, while a negative result is still useful for excluding disease. In California, a subject with a positive test result has about a 25% chance of having the disease. Out of every 10 positives, 7 to 8 will NOT have the disease.

### Alternate Calculations

PVpositive and PVnegative can be calculated directly from Se, Sp, and Prev using the following equations:

PVpositive = Se*Prev/[(Se*Prev) +(1-Sp)*(1-Prev)]

PVnegative = Sp*(1-Prev)/[(1-Se)*Prev +Sp*(1-Prev)]

In these equations, Se, Sp, and Prev should be proportions between 0.00 and 1.00. You can multiply the figures for PVpos and PVneg by 100 to express as percent, or modify the equations by substituting 100 for 1 and entering Se, Sp, and Prev as percentages. Many find it more informative to reason through the steps for calculating the number of TP, etc., to better understand the effects of sensitivity and specificity. However, these formulas allow you to set up a spreadsheet and easily study the interactions of Se, Sp, and Prev for optimizing the predictive value of tests for different scenarios. Alternatively, MedCalc [3] provides an online calculator that will do all these calculations from the contingency table and an entry for prevalence.
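
For readers who prefer a script to a spreadsheet, the two equations can be sketched in Python as follows. The function names are assumptions made for illustration; the inputs are proportions, as described above, and the Se and Sp values are the ones used in the worked cases.

```python
def pv_positive(se, sp, prev):
    """PVpositive = Se*Prev / [Se*Prev + (1-Sp)*(1-Prev)], all as proportions."""
    return se * prev / (se * prev + (1 - sp) * (1 - prev))


def pv_negative(se, sp, prev):
    """PVnegative = Sp*(1-Prev) / [(1-Se)*Prev + Sp*(1-Prev)], all as proportions."""
    return sp * (1 - prev) / ((1 - se) * prev + sp * (1 - prev))


# Se = 80% and Sp = 95% at prevalences of 20%, 2%, and 0.2%.
for prev in (0.20, 0.02, 0.002):
    ppv = pv_positive(0.80, 0.95, prev)
    npv = pv_negative(0.80, 0.95, prev)
    print(f"Prev {prev:6.1%}  PVpos {ppv:6.1%}  PVneg {npv:6.1%}")
```

The sketch does not guard against a zero denominator, which can only occur at the extremes (for example, Prev = 0 together with Sp = 1).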

### Trade-off between Sensitivity and Specificity

It is difficult to achieve perfect performance of 100% sensitivity and 100% specificity for any diagnostic test. Sometimes, by adjusting the cutoff or decision limit between the population for non-disease and the population for disease, it is possible to optimize either sensitivity or specificity. Typically, that involves improving sensitivity at the expense of specificity, or alternatively improving specificity at the expense of sensitivity.
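
The effect of moving the cutoff can be illustrated with a small Python sketch. It assumes, purely for illustration, that test values in the non-disease and disease populations follow two overlapping normal distributions and that results above the cutoff are called positive. The distributions, their parameters, and the cutoffs are hypothetical and do not describe any real assay.

```python
from statistics import NormalDist

# Hypothetical test-value distributions, chosen only for illustration.
non_disease = NormalDist(mu=0.0, sigma=1.0)
disease = NormalDist(mu=2.0, sigma=1.0)


def se_sp_at_cutoff(cutoff):
    """Return (Se, Sp) when results above the cutoff are called positive."""
    se = 1.0 - disease.cdf(cutoff)   # diseased subjects above the cutoff
    sp = non_disease.cdf(cutoff)     # non-diseased subjects at or below it
    return se, sp


for cutoff in (0.5, 1.0, 1.5):
    se, sp = se_sp_at_cutoff(cutoff)
    print(f"cutoff {cutoff:.1f}: Se = {se:.1%}, Sp = {sp:.1%}")
```

In this model, raising the cutoff lowers sensitivity and raises specificity, and lowering it does the opposite.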

### Optimizing Performance for Prevalence

The value of a positive test result improves as the prevalence of disease increases and as specificity increases. By applying a test to patients with symptoms of disease, a higher prevalence population is being selected, which should be a valuable strategy when testing is limited and diagnosis of disease is critical. Increasing sensitivity, perhaps by parallel use of two tests, could also be valuable. That means a patient would be classified as positive if either of the two tests were positive. It has been suggested that for diagnosis of COVID-19 after 5 days of symptoms, parallel testing of viral load and total immunoglobulins might improve sensitivity, i.e., if either test is positive, the patient has the disease.

### The Difficulty with Surveillance

On the other hand, if testing patients as part of surveillance, the prevalence of disease is likely to be very low. This surveillance might utilize tests for IgG or Total IG, with a goal of identifying those people who have already been exposed to the virus and hopefully have developed immunity.

If we assume a prevalence of 0.20% and test 1000 patients, there will be 2 patients with disease and 998 without disease. If the test has an ideal sensitivity of 1.00 or 100%, then both of the patients with disease will be classified as positive (TP=2, FN=0). If the test has a specificity of 95%, there will be 948 TN and 50 FP.

PVpositive = TP/(TP+FP) = 2/(2+50) = 3.8%

PVnegative = TN/(TN+FN) = 948/948 = 100%

It is almost counter-intuitive that a test with perfect sensitivity will not be reliable for identifying subjects with antibodies present because specificity (which is also very high at 95%) allows so many false positives. There is only a 4% chance that a positive test indicates a patient has antibodies to the virus. On the other hand, a negative test result almost certainly means that the subject has not been exposed to the virus. But that is not very useful if the aim of surveillance is to identify those in the population who are potentially immune to the disease!
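
To see how strongly specificity drives this result, the following Python sketch recomputes PVpositive for the surveillance scenario (Prev = 0.20%, Se = 100%) at several specificities. The 99% and 99.9% values are hypothetical what-if inputs, not the performance of any particular test, and the function name is an assumption.

```python
def pv_positive(se, sp, prev):
    """Predictive value of a positive result; inputs are proportions (0-1)."""
    true_pos = se * prev
    false_pos = (1 - sp) * (1 - prev)
    return true_pos / (true_pos + false_pos)


# Surveillance scenario from the text: Prev = 0.20%, Se = 100%.
# The 99% and 99.9% specificities are hypothetical what-if values.
for sp in (0.95, 0.99, 0.999):
    print(f"Sp {sp:.1%}: PVpositive = {pv_positive(1.00, sp, 0.002):.1%}")
```

For Sp = 95% the formula gives about 3.9%, slightly different from the 3.8% above because the worked example rounds to whole patients. In this sketch, PVpositive only exceeds 50% when specificity approaches 99.9%.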

### An example from the AACC Blog

What is the value of repeat testing of positives when screening for antibodies to COVID-19? Evidently there is some guidance from CDC or FDA that positive antibody tests should be repeated to ensure their accuracy. Opinions of clinical chemists vary, some thinking this is a waste of resources because laboratories won’t get paid for doing a second test and some believing that there really won’t be any improvement anyway.

There should be a more objective way of addressing this issue, which was illustrated by Drs. Galen and Gambino in their famous book “Beyond Normality” that was published in 1975 [4]. The important pages are 42-44, where they describe a scenario in which Test A has an Se of 95% and Sp of 90%, Test B has an Se of 80% and Sp of 95%, and the prevalence of disease is 1.0%. Note that this example presumes that Test A and Test B are independent tests, e.g., the tests may employ different synthetic antigens that present different binding sites.

The “trick” in making the calculations is to start with Prev of 1.0% and determine the PVpos of Test A, then use the PVpos as the prevalence of disease in calculating the PVpos for Test B. Remember, you are retesting with Test B all the positives seen from Test A, which means the prevalence of disease in that repeat population is actually the PVpos yielded by Test A. In short, you make 2 passes in calculating predictive value, the first with the starting prevalence of 1.0% and the 2nd with the resulting PVpos as the prevalence for applying Test B.

The PVpos from Test A is 8.76%. The PVpos from Test B is then 60.6%. This means that 6 out of 10 patients who are positive on repeat testing (A followed by B) will truly have the disease, compared to only about 1 out of 10 patients who are positive on Test A alone. Interestingly, if the repeat strategy used Test B first and then Test A, the final PVpos would still be 60.6%, but the prevalence of disease in the repeat population would be 13.9%, so fewer patients would need to be retested.
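
The two-pass calculation can be sketched in Python as follows. The function names and the tuple layout are assumptions made for illustration, and the sketch relies on the same presumption as the example, namely that the two tests are independent.

```python
def pv_positive(se, sp, prev):
    """Predictive value of a positive result; inputs are proportions (0-1)."""
    return se * prev / (se * prev + (1 - sp) * (1 - prev))


def serial_pv_positive(first, second, prev):
    """first and second are (Se, Sp) tuples; returns PVpos after each pass."""
    after_first = pv_positive(first[0], first[1], prev)
    # Only first-pass positives are retested, so their PVpos
    # becomes the prevalence for the second pass.
    after_second = pv_positive(second[0], second[1], after_first)
    return after_first, after_second


test_a = (0.95, 0.90)   # Se, Sp of Test A
test_b = (0.80, 0.95)   # Se, Sp of Test B
for label, first, second in (("A then B", test_a, test_b),
                             ("B then A", test_b, test_a)):
    pv1, pv2 = serial_pv_positive(first, second, prev=0.01)
    print(f"{label}: first pass {pv1:.2%}, second pass {pv2:.1%}")
```

Both orders end at the same final PVpos; only the intermediate value, and therefore the share of patients who need a second test, differs.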

But, the value of repeat testing does depend on the prevalence of disease in the original patient population, with repeat testing being more useful for low rather than high prevalence, as shown in the table below.

| Prevalence | First Test PVpos | Repeat Test PVpos |
|---|---|---|
| 20% | 70% | 97% |
| 10% | 51% | 94% |
| 4% | 28% | 86% |
| 2% | 16% | 75% |
| 1% | 8.7% | 61% |

Again, the testing strategy for the situation in New York (20%) should be different from the strategy for California (2%). Repeat testing will be necessary in California, but not in NY.

### What is the point?

In summary, the predictive value of a positive test result depends primarily on the *specificity* of the test, whereas the predictive value of a negative test result depends primarily on the *sensitivity* of the test. This is counter-intuitive, but can be explained by the effects of False Positive and False Negative results, respectively. When Sp is 100%, there are no False Positives. When Se is 100%, there are no False Negatives.

Parallel testing (Test A *OR* Test B) classifies the patient as positive if either test is positive, which improves sensitivity and reduces false negative results. Serial testing (Test A *AND* Test B) classifies the patient as positive only if both tests are positive, which improves specificity and reduces false positive results. There may also be practical issues to consider, such as the relative costs of the tests, the relative number of tests that need to be repeated for an A-then-B vs. a B-then-A strategy, the time required to reach a diagnostic decision, etc.
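
A rough Python sketch of how the two strategies change the combined sensitivity and specificity is given below. It assumes the two tests err independently of each other in both diseased and non-diseased subjects, which is an idealization, and the function names are hypothetical. The Se and Sp values are those of Test A and Test B in the Galen and Gambino example.

```python
def parallel(se_a, sp_a, se_b, sp_b):
    """Positive if A OR B is positive (tests assumed independent)."""
    se = 1 - (1 - se_a) * (1 - se_b)   # disease is missed only if both miss it
    sp = sp_a * sp_b                   # negative only if both are negative
    return se, sp


def serial(se_a, sp_a, se_b, sp_b):
    """Positive only if A AND B are positive (same independence assumption)."""
    se = se_a * se_b                   # detected only if both detect it
    sp = 1 - (1 - sp_a) * (1 - sp_b)   # false positive only if both are wrong
    return se, sp


# Test A: Se 95%, Sp 90%; Test B: Se 80%, Sp 95%.
for name, rule in (("parallel", parallel), ("serial", serial)):
    se, sp = rule(0.95, 0.90, 0.80, 0.95)
    print(f"{name}: Se = {se:.1%}, Sp = {sp:.1%}")
```

Under this assumption the parallel rule gains sensitivity at the cost of specificity, and the serial rule does the reverse. Tests whose errors are correlated would show smaller gains than this sketch suggests.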

To add to the confusion about COVID-19 testing, the objective with diagnostic testing is to identify patients with disease, meaning that a positive result is bad news, but leads to confinement or treatment, whereas a false negative result may lead to further exposure of the community. With antibody testing, a positive result is good news, meaning the patient may have developed immunity, a false negative may confine a healthy worker, but a false positive may lead back to the workplace and further exposure of the community.

### What to do?

You may find it very useful to set up a predictive value calculator in an Excel spreadsheet. Use the equations based on Se, Sp, and Prev to enter these figures as proportions between 0.0 and 1.0. If you want results in %, then set up the equations using 100 instead of 1 and enter Se, Sp and Prev as percentages. You will find it interesting to play with the values for Sp and see its critical importance for population surveillance by antibody testing.

### References

1. Westgard JO, Garrett PA, Schilling P. Estimating clinical agreement for a qualitative test: A web calculator for 2x2 contingency test. www.westgard.com/qualitative-test-clinical-agreement.htm
2. CDC. Rapid diagnostic testing for influenza: Information for clinical laboratory directors. https://www.cdc.gov/flu/professionals/diagnosis/rapidlab.htm
3. MedCalc. Diagnostic test evaluation calculator. Accessed 4/27/2020. www.medcalc.org/calc/diagnostic_test.php
4. Galen RS, Gambino SR. Beyond Normality: The Predictive Value and Efficiency of Medical Diagnosis. New York: John Wiley, 1975.