Assessing the performance of diagnostic tests for repeated binary measurements is important in clinical trials and diagnostic medicine. The gastric emptying studies [1,2] involve hourly measurements of emptying over four hours. Empirical results suggest that gastric emptying results over time are correlated, so statistical measures that ignore this correlation may not produce satisfactory assessments. If these measurements are highly correlated, early gastric emptying can be used to predict late gastric emptying, which offers an opportunity to reduce the number of repeated measurements. Although the likelihood ratio has been the most widely used statistical measure for assessing test performance, choosing a compromise between the likelihood ratio positive and the likelihood ratio negative requires the investigators' clinical experience. In this article, we propose a correlation statistic to assess the performance of a test for repeated binary measurements. The statistic aims to identify an acceptable balance between sensitivity and specificity and to produce a unique assessment of test performance by using the estimator of the correlation coefficient based on a correlated bivariate Bernoulli distribution. Simulation studies are conducted to compare our proposed statistical measure with other measures, and the simulated results indicate that it performs well. Using our theoretical approach, we find that the greatest correlation statistic always corresponds to the smallest average absolute likelihood ratio and approximately to the greatest Youden's index. We further analyze a real gastric emptying data set and identify a threshold using our proposed statistical measure.

Repeated binary measurements, Bivariate Bernoulli distribution, Correlation statistic, Gastric emptying

A diagnostic test is an approach used to gather clinical information for making a clinical diagnosis. Assessment of diagnostic tests, especially by likelihood ratios, is very important in clinical trials and diagnostic medicine [3-6]. There are two versions of the likelihood ratio, known as the positive likelihood ratio and the negative likelihood ratio. The positive likelihood ratio is the ratio of the probability that an individual with the disease tests positive to the probability that an individual without the disease tests positive; the negative likelihood ratio is the ratio of the probability that an individual with the disease tests negative to the probability that an individual without the disease tests negative. An asymptotic hypothesis test has been constructed to compare the positive and negative likelihood ratios [7]. Likelihood ratios have received much attention as measures of diagnostic tests for repeated binary outcomes [1,8-10].
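As a small illustration of the two definitions above, the following sketch computes both likelihood ratios from the four cell counts of a 2x2 table. The function name and the example counts are hypothetical and are not taken from any study cited here.

```python
def likelihood_ratios(tp, fn, fp, tn):
    """Return (LR+, LR-) from the cell counts of a 2x2 diagnostic table.

    tp, fn: diseased subjects testing positive / negative
    fp, tn: non-diseased subjects testing positive / negative
    """
    sensitivity = tp / (tp + fn)   # P(test positive | disease)
    specificity = tn / (fp + tn)   # P(test negative | no disease)
    # LR+ is infinite when no non-diseased subject tests positive.
    lr_pos = float("inf") if fp == 0 else sensitivity / (1 - specificity)
    # LR- is infinite when no non-diseased subject tests negative.
    lr_neg = float("inf") if tn == 0 else (1 - sensitivity) / specificity
    return lr_pos, lr_neg


# Hypothetical counts, for illustration only.
lr_pos, lr_neg = likelihood_ratios(tp=80, fn=20, fp=10, tn=90)
print(lr_pos, lr_neg)
```

The two ratios are reported separately, which is why a single threshold choice based on them requires a compromise between the two values.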

Although likelihood ratios have many advantages in clinical findings, laboratory tests, imaging studies, etc., they have some limitations [11-16]. Furthermore, identifying an appropriate compromise between the likelihood ratio positive and the likelihood ratio negative may depend on the investigators' knowledge and experience in clinical trials. Youden's index is widely used as a single statistic to evaluate different test algorithms [17], but it may not perform well when the prevalence is low.

Considering these limitations, we propose a correlation statistic based on a bivariate Bernoulli distribution to assess diagnostic tests for repeated binary outcomes. A bivariate Bernoulli distribution for binary variables was studied in [18], but the authors did not give an explicit expression for the distribution. In contrast with the likelihood ratio positive and the likelihood ratio negative, our correlation statistic strikes a balance between sensitivity and specificity. It is based on the difference between the chance that two independent subjects yield a true positive and a true negative and the chance that they yield a false positive and a false negative.

This paper is organized as follows. In Section 2, we present a new form of the bivariate Bernoulli distribution with an explicit expression and give the maximum likelihood estimators of the parameters. The correlation statistic is proposed as a measure for evaluating test performance in Section 3.

Through the simulated studies in Section 4, we demonstrate that the estimators perform well and that our proposed statistical measure is advantageous when the prevalence is low. The advantage of our statistical measure is further demonstrated by analyzing a data set on gastric emptying in Section 5. Some discussion is given in Section 6. Lemmas and theorems and their proofs are provided in the Appendix.

First, we introduce a motivating example for our paper. Delayed gastric emptying results from a variety of chemical and functional etiologies. Gastric emptying studies are the standard for measuring gastric emptying, and the protocol has been standardized [19]. Our aim is to establish predictors and assess the possibility of decreasing the study time. This is a retrospective chart review study covering February 2009 to May 2011, with 600 subjects. Gastric emptying is recorded as the percentage of remaining radioactivity in the stomach, measured at 0, 1, 2, 3 and 4 hours after ingestion.

According to the standard protocol, a gastric emptying result is abnormal at 2 hours if the percentage of remaining radioactivity is greater than 60%, and abnormal at 4 hours if it is greater than 10%; otherwise the result is normal. Here the gastric emptying results measured at 4 hours are used as the subjects' actual results, while the results measured at 2 hours are used as predicted results. The gastric emptying results at 2 hours and at 4 hours on the same subject are repeated measurements over time and are therefore often correlated with each other.

This motivates us to develop a unique and feasible statistical measure that accounts for this correlation in the assessment of test performance. Let $Y_1$ and $Y_2$ be the actual result and predicted result for a subject, respectively, where each result is either positive or negative. For simplicity, let 1 and 0 denote a positive result and a negative result, respectively. It is known that if either $p_1 = P(Y_1 = 1)$ or $p_2 = P(Y_2 = 1)$ equals 0 or 1, then $Y_1$ and $Y_2$ are independent. To avoid these degenerate cases, we assume from now on that $0 < p_1 < 1$ and $0 < p_2 < 1$. The correlation coefficient of $Y_1$ and $Y_2$ is given by

$$\rho = \frac{p_{11} - p_1 p_2}{\sqrt{p_1(1-p_1)\,p_2(1-p_2)}} \tag{1}$$

or equivalently,

$$\rho = \frac{p_{11}p_{00} - p_{10}p_{01}}{\sqrt{p_1(1-p_1)\,p_2(1-p_2)}} \tag{2}$$

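where $p_{y_1 y_2} = P(Y_1 = y_1, Y_2 = y_2)$ for $y_1, y_2 = 0, 1$. The two forms agree because $p_{11} - p_1 p_2 = p_{11}(p_{11} + p_{10} + p_{01} + p_{00}) - (p_{11} + p_{10})(p_{11} + p_{01}) = p_{11}p_{00} - p_{10}p_{01}$. The numerator is therefore the difference between the probability that two independent subjects yield a true positive and a true negative and the probability that they yield a false positive and a false negative. The correlation coefficient may then be viewed as a standardized difference on a certain range. The following lemma shows the range for the correlation coefficient.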

Lemma 1.

(i) $$-\min\left\{\sqrt{\frac{p_1 p_2}{(1-p_1)(1-p_2)}},\ \sqrt{\frac{(1-p_1)(1-p_2)}{p_1 p_2}}\right\} \le \rho \le \min\left\{\sqrt{\frac{p_1(1-p_2)}{(1-p_1)p_2}},\ \sqrt{\frac{(1-p_1)p_2}{p_1(1-p_2)}}\right\}.$$

(ii) $\rho = \min\left\{\sqrt{\frac{p_1(1-p_2)}{(1-p_1)p_2}},\ \sqrt{\frac{(1-p_1)p_2}{p_1(1-p_2)}}\right\}$ if and only if $p_{11} = \min\{p_1, p_2\}$. In particular, $\rho = 1$ if and only if $p_{11} = p_1 = p_2$.

(iii) $\rho = -\min\left\{\sqrt{\frac{p_1 p_2}{(1-p_1)(1-p_2)}},\ \sqrt{\frac{(1-p_1)(1-p_2)}{p_1 p_2}}\right\}$ if and only if $p_{11} = \max\{0, p_1 + p_2 - 1\}$. In particular, $\rho = -1$ if and only if $p_{10} = p_1 = 1 - p_2$.

(iv) $\rho = 0$ if and only if $Y_1$ and $Y_2$ are independent.

Proof. See the Appendix. The results from Lemma 1 inspire us to consider the following bivariate function

$$p(y_1, y_2, p_1, p_2, \rho) = p_1^{y_1}(1-p_1)^{1-y_1}\,p_2^{y_2}(1-p_2)^{1-y_2} + (-1)^{y_1+y_2}\,\rho\,\sqrt{p_1(1-p_1)\,p_2(1-p_2)}, \tag{3}$$

where $y_1, y_2 = 0, 1$, and $p_1$, $p_2$ and $\rho$ are defined as above. For this bivariate function to be a probability distribution that can be used for statistical inference, we need the following theorem.

Theorem 1. The bivariate function given in equation (3) is a well-defined and identifiable probability distribution.

The proof is given in the Appendix. Any two correlated binary variables may be assumed to follow this bivariate probability distribution with parameters $p_1$, $p_2$ and $\rho$, which are the marginal probabilities of the variables and their correlation coefficient. The distribution may be viewed as a generalization of the well-known univariate Bernoulli distribution to two dimensions, and we call it a bivariate Bernoulli distribution, denoted by $(Y_1, Y_2) \sim BB(p_1, p_2, \rho)$.
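To make the definition concrete, here is a minimal sketch of equation (3) together with the admissible range of ρ from Lemma 1(i). The function names `rho_bounds` and `bb_pmf` are illustrative choices, not part of the original paper, and the parameter values are the same as scenario (a) of the simulation study.

```python
import math


def rho_bounds(p1, p2):
    """Admissible range of rho for given marginals, following Lemma 1(i)."""
    q1, q2 = 1 - p1, 1 - p2
    lower = -min(math.sqrt(p1 * p2 / (q1 * q2)), math.sqrt(q1 * q2 / (p1 * p2)))
    upper = min(math.sqrt(p1 * q2 / (q1 * p2)), math.sqrt(q1 * p2 / (p1 * q2)))
    return lower, upper


def bb_pmf(y1, y2, p1, p2, rho):
    """P(Y1 = y1, Y2 = y2) under BB(p1, p2, rho), following equation (3)."""
    if not (0 < p1 < 1 and 0 < p2 < 1):
        raise ValueError("p1 and p2 must lie strictly between 0 and 1")
    lower, upper = rho_bounds(p1, p2)
    if not (lower <= rho <= upper):
        raise ValueError("rho is outside the range allowed by Lemma 1")
    # Product of the two Bernoulli marginals (the independence part).
    independent_part = (p1 ** y1 * (1 - p1) ** (1 - y1)
                        * p2 ** y2 * (1 - p2) ** (1 - y2))
    # Correlation term: added on the diagonal cells, subtracted off-diagonal.
    scale = math.sqrt(p1 * (1 - p1) * p2 * (1 - p2))
    return independent_part + (-1) ** (y1 + y2) * rho * scale


cells = {(y1, y2): bb_pmf(y1, y2, 0.2, 0.4, 0.3)
         for y1 in (0, 1) for y2 in (0, 1)}
total = sum(cells.values())                  # should equal 1 up to rounding
marginal_y1 = cells[(1, 0)] + cells[(1, 1)]  # should recover p1
marginal_y2 = cells[(0, 1)] + cells[(1, 1)]  # should recover p2
print(cells, total, marginal_y1, marginal_y2)
```

The range check matters: outside the bounds of Lemma 1 at least one of the four cell values becomes negative, so the function would no longer be a probability distribution.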

Now we present the maximum likelihood estimators of the parameters $p_1$, $p_2$ and $\rho$. Let $(y_{1i}, y_{2i})$, $i = 1, \ldots, n$, be a sample from the bivariate Bernoulli distribution $BB(p_1, p_2, \rho)$. The likelihood function is given by

$$L = \prod_{i=1}^{n} p(y_{1i}, y_{2i}, p_1, p_2, \rho) = \bigl(p(1,1,p_1,p_2,\rho)\bigr)^{N_{11}} \bigl(p(1,0,p_1,p_2,\rho)\bigr)^{N_{10}} \bigl(p(0,1,p_1,p_2,\rho)\bigr)^{N_{01}} \bigl(p(0,0,p_1,p_2,\rho)\bigr)^{N_{00}}, \tag{4}$$

where $N_{11}$, $N_{10}$, $N_{01}$ and $N_{00}$ are the numbers of subjects with true positive, false negative, false positive and true negative results, respectively. Standard maximum likelihood methods maximize the log-likelihood function by setting its derivatives with respect to the parameters to zero, and the Newton-Raphson iteration is frequently used to solve the resulting equations. However, this iteration yields only an approximation to the maximum likelihood estimators. More importantly, its convergence is difficult to establish and the computation can be time-consuming. This motivates us to find explicit, closed-form expressions for the maximum likelihood estimators, which we derive using the following lemma.

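Lemma 2. Suppose that $k$ is a positive integer and, for $t = 1, \ldots, k$, $j_t$ is a nonnegative integer such that $\sum_{t=1}^{k} j_t < n$. Then

$$\max_{(\theta_1,\ldots,\theta_k)\in\Theta_k} \left(\prod_{t=1}^{k}\theta_t^{j_t}\right)\left(1-\sum_{t=1}^{k}\theta_t\right)^{n-\sum_{t=1}^{k} j_t} = \left(\prod_{t=1}^{k}\left(\frac{j_t}{n}\right)^{j_t}\right)\left(1-\sum_{t=1}^{k}\frac{j_t}{n}\right)^{n-\sum_{t=1}^{k} j_t}, \tag{5}$$

where $\Theta_k = \left\{(\theta_1,\ldots,\theta_k) : \sum_{t=1}^{k}\theta_t < 1,\ 0 < \theta_t < 1,\ t = 1,\ldots,k\right\}$.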

The proof is given in the Appendix. From this lemma, we can derive the following theorem.

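Theorem 2. The maximum likelihood estimators of the parameters are

$$\hat{p}_1 = \frac{N_{11}+N_{10}}{n}, \qquad \hat{p}_2 = \frac{N_{11}+N_{01}}{n}, \qquad \hat{\rho} = \frac{N_{11}N_{00} - N_{10}N_{01}}{\sqrt{(N_{11}+N_{10})(N_{01}+N_{00})(N_{11}+N_{01})(N_{10}+N_{00})}},$$

where $N_{11}$, $N_{10}$, $N_{01}$ and $N_{00}$ are the numbers of subjects with true positive, false negative, false positive and true negative results, respectively.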

The proof is given in the Appendix.
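The closed-form estimators are straightforward to compute. The sketch below evaluates them for a hypothetical 2x2 table (the counts are invented for illustration) and checks numerically that the log of likelihood (4) is no larger at a few nearby parameter values. The function names are assumptions of this sketch, and the numerical check is only a sanity test, not a substitute for the proof in the Appendix.

```python
import math


def bb_mle(n11, n10, n01, n00):
    """Closed-form estimates (p1_hat, p2_hat, rho_hat) from the cell counts."""
    n = n11 + n10 + n01 + n00
    p1_hat = (n11 + n10) / n
    p2_hat = (n11 + n01) / n
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        raise ValueError("a marginal total is zero; rho is not estimable")
    rho_hat = (n11 * n00 - n10 * n01) / denom
    return p1_hat, p2_hat, rho_hat


def bb_log_likelihood(counts, p1, p2, rho):
    """Log of likelihood (4); counts maps (y1, y2) to the cell count."""
    scale = math.sqrt(p1 * (1 - p1) * p2 * (1 - p2))
    total = 0.0
    for (y1, y2), count in counts.items():
        prob = (p1 ** y1 * (1 - p1) ** (1 - y1) * p2 ** y2 * (1 - p2) ** (1 - y2)
                + (-1) ** (y1 + y2) * rho * scale)
        if count > 0:
            total += count * math.log(prob)
    return total


# Hypothetical counts keyed by (actual, predicted), for illustration only.
counts = {(1, 1): 30, (1, 0): 10, (0, 1): 20, (0, 0): 40}
p1_hat, p2_hat, rho_hat = bb_mle(counts[(1, 1)], counts[(1, 0)],
                                 counts[(0, 1)], counts[(0, 0)])
ll_at_mle = bb_log_likelihood(counts, p1_hat, p2_hat, rho_hat)
shifts = [(0.02, 0, 0), (0, -0.02, 0), (0, 0, 0.05), (0, 0, -0.05)]
ll_nearby = [bb_log_likelihood(counts, p1_hat + d1, p2_hat + d2, rho_hat + d3)
             for d1, d2, d3 in shifts]
print(p1_hat, p2_hat, rho_hat)
print(all(ll_at_mle >= value for value in ll_nearby))
```

No iteration is needed, which is the practical advantage over a Newton-Raphson search mentioned above.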

Repeated binary outcomes are often correlated, and failure to consider their correlation in statistical measures may lead to poor assessments in association analysis. We now show that the correlation coefficient determines the strength and direction of association between two categorical variables.

Let $Y_1$ and $Y_2$ be the actual and predicted results for a subject, respectively. Assume that $(Y_1, Y_2)$ has the bivariate Bernoulli distribution $BB(p_1, p_2, \rho)$, where $p_1 = P(Y_1 = 1)$, $p_2 = P(Y_2 = 1)$ and $\rho$ is the correlation coefficient. We discuss the correlation coefficient in the following three cases.

(i) A zero correlation coefficient implies that $P(Y_2 = 1 \mid Y_1 = 1) = P(Y_2 = 1 \mid Y_1 = 0) = P(Y_2 = 1)$, which suggests that the actual result has no effect on a positive predicted result. It also implies that $P(Y_2 = 0 \mid Y_1 = 1) = P(Y_2 = 0 \mid Y_1 = 0) = P(Y_2 = 0)$, which suggests that the actual result has no effect on a negative predicted result. Thus, a zero correlation coefficient implies that the two results are independent.

(ii) A positive correlation coefficient implies that $P(Y_2 = 1 \mid Y_1 = 1) > P(Y_2 = 1)$ and $P(Y_2 = 0 \mid Y_1 = 0) > P(Y_2 = 0)$, which suggests that the actual result makes the same predicted result more likely. It also implies that $P(Y_2 = 1 \mid Y_1 = 0) < P(Y_2 = 1)$ and $P(Y_2 = 0 \mid Y_1 = 1) < P(Y_2 = 0)$, which suggests that the actual result makes the opposite predicted result less likely. Thus, a positive correlation coefficient implies that the two results are positively associated.

(iii) A negative correlation coefficient implies that $P(Y_2 = 1 \mid Y_1 = 1) < P(Y_2 = 1)$ and $P(Y_2 = 0 \mid Y_1 = 0) < P(Y_2 = 0)$, which suggests that the actual result makes the same predicted result less likely. It also implies that $P(Y_2 = 1 \mid Y_1 = 0) > P(Y_2 = 1)$ and $P(Y_2 = 0 \mid Y_1 = 1) > P(Y_2 = 0)$, which suggests that the actual result makes the opposite predicted result more likely. Thus, a negative correlation coefficient implies that the two results are negatively associated.

Therefore, the sign of the correlation coefficient gives the direction of association between the actual and predicted results.
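The three cases can be read off directly from equation (3). As a brief sketch, write $s = \sqrt{p_1(1-p_1)p_2(1-p_2)} > 0$ (a shorthand introduced here, not part of the original notation). Dividing each cell probability by the corresponding marginal probability of $Y_1$ gives:

```latex
\begin{aligned}
P(Y_2 = 1 \mid Y_1 = 1) &= \frac{p_{11}}{p_1} = p_2 + \frac{\rho s}{p_1}, &
P(Y_2 = 1 \mid Y_1 = 0) &= \frac{p_{01}}{1 - p_1} = p_2 - \frac{\rho s}{1 - p_1}, \\
P(Y_2 = 0 \mid Y_1 = 0) &= \frac{p_{00}}{1 - p_1} = (1 - p_2) + \frac{\rho s}{1 - p_1}, &
P(Y_2 = 0 \mid Y_1 = 1) &= \frac{p_{10}}{p_1} = (1 - p_2) - \frac{\rho s}{p_1}.
\end{aligned}
```

Each conditional probability differs from the corresponding marginal probability by a term whose sign is that of $\rho$ for agreeing results and opposite to it for disagreeing results, which is the content of cases (i) to (iii).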

Moreover, for a nonzero correlation coefficient, whether positive or negative, the absolute value measures the strength of the association between the actual and predicted results: the greater the absolute value, the stronger the association. Thus, the correlation coefficient provides both the strength and the direction of association between categorical variables.

Since the correlation coefficient is unknown, we use its maximum likelihood estimator

$$CS = \frac{N_{11}N_{00} - N_{10}N_{01}}{\sqrt{(N_{11}+N_{10})(N_{01}+N_{00})(N_{11}+N_{01})(N_{10}+N_{00})}} \tag{6}$$

as a statistical measure of the performance of the diagnostic test and call it the correlation statistic (CS). Such a measure has the potential to evaluate the association between correlated binary variables.

Compared with other measures such as the likelihood ratio positive, likelihood ratio negative, average absolute likelihood ratio and Youden's index, our CS has advantages for assessing test performance. The CS estimates the correlation between the actual result and the predicted result under a general bivariate Bernoulli model, so it has the potential to provide a direct evaluation of the association between true and predicted results. The other measures depend on sensitivity and specificity and therefore evaluate this association only indirectly. This suggests that the CS may predict the true result better than the other measures, as the simulation studies in Section 4 show.
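The relation between the correlation coefficient and Youden's index can be made explicit. As a sketch, write SENS $= P(Y_2 = 1 \mid Y_1 = 1)$ and SPEC $= P(Y_2 = 0 \mid Y_1 = 0)$, treating $Y_1$ as the actual result and $Y_2$ as the predicted result as above, so that $p_1$ plays the role of the prevalence and $p_2 = p_1\,\mathrm{SENS} + (1-p_1)(1-\mathrm{SPEC})$. Then:

```latex
\begin{aligned}
p_{11} - p_1 p_2
  &= p_1\,\mathrm{SENS} - p_1\bigl[p_1\,\mathrm{SENS} + (1 - p_1)(1 - \mathrm{SPEC})\bigr] \\
  &= p_1 (1 - p_1)\,(\mathrm{SENS} + \mathrm{SPEC} - 1), \\
\rho
  &= (\mathrm{SENS} + \mathrm{SPEC} - 1)\,\sqrt{\frac{p_1 (1 - p_1)}{p_2 (1 - p_2)}}.
\end{aligned}
```

Under these assumptions the correlation coefficient equals Youden's index rescaled by a factor that depends on the prevalence $p_1$ and on the proportion of positive predictions $p_2$; the two measures coincide when $p_1(1-p_1) = p_2(1-p_2)$.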

To evaluate the performance of the proposed estimators $\hat{p}_1$, $\hat{p}_2$ and $\hat{\rho}$, we conduct a moderate-scale simulation experiment. We generate n = 60, 120, 150 observations from the bivariate Bernoulli distribution:

$$p_{y_1 y_2} = p_1^{y_1}(1-p_1)^{1-y_1}\,p_2^{y_2}(1-p_2)^{1-y_2} + (-1)^{y_1+y_2}\,\rho\,\sqrt{p_1(1-p_1)\,p_2(1-p_2)}, \qquad y_1, y_2 = 0, 1.$$

We consider the following four scenarios: (a) p1 = 0.2, p2 = 0.4, ρ = 0.3. (b) p1 = 0.6, p2 = 0.8, ρ = 0.5. (c) p1 = 0.2, p2 = 0.4, ρ = -0.3. (d) p1 = 0.6, p2 = 0.8, ρ = -0.2. We generate 1000 data sets in each of the configurations. The simulation results on the simulated means, standard errors and mean squared errors are summarized in Table 1.
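A simulation of this kind can be sketched as follows for scenario (a) with n = 150. The code draws each observation from the four cell probabilities and summarizes the closed-form estimates. It assumes that the reported "standard error" is the empirical standard deviation of the estimates across replications. The function names and the seed are illustrative, and the output is not intended to reproduce Table 1.

```python
import math
import random


def bb_cell_probs(p1, p2, rho):
    """Cell probabilities of BB(p1, p2, rho), keyed by (y1, y2)."""
    scale = math.sqrt(p1 * (1 - p1) * p2 * (1 - p2))
    return {
        (y1, y2): p1 ** y1 * (1 - p1) ** (1 - y1) * p2 ** y2 * (1 - p2) ** (1 - y2)
        + (-1) ** (y1 + y2) * rho * scale
        for y1 in (0, 1) for y2 in (0, 1)
    }


def simulate_estimates(p1, p2, rho, n, n_rep, seed=0):
    """Return a list of (p1_hat, p2_hat, rho_hat), one per simulated data set."""
    rng = random.Random(seed)
    probs = bb_cell_probs(p1, p2, rho)
    cells = list(probs)
    weights = [probs[cell] for cell in cells]
    estimates = []
    for _ in range(n_rep):
        sample = rng.choices(cells, weights=weights, k=n)
        n11, n10 = sample.count((1, 1)), sample.count((1, 0))
        n01, n00 = sample.count((0, 1)), sample.count((0, 0))
        denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
        if denom == 0:
            continue  # rho_hat is undefined when a marginal total is zero
        estimates.append(((n11 + n10) / n, (n11 + n01) / n,
                          (n11 * n00 - n10 * n01) / denom))
    return estimates


def summarize(values, truth):
    """Mean, empirical standard deviation and mean squared error."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
    mse = sum((v - truth) ** 2 for v in values) / len(values)
    return mean, sd, mse


truth = (0.2, 0.4, 0.3)  # scenario (a): p1, p2, rho
estimates = simulate_estimates(*truth, n=150, n_rep=1000, seed=1)
summary = {name: summarize([e[i] for e in estimates], truth[i])
           for i, name in enumerate(("p1", "p2", "rho"))}
for name, values in summary.items():
    print(name, values)
```

Changing `truth` and `n` covers the other scenarios, provided ρ stays within the range given by Lemma 1 so that all four cell probabilities are nonnegative.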

To compare the performance of our proposed statistical measure with that of the likelihood ratio, we conduct a second moderate-scale simulation experiment. We generate samples of size n = 300 in (a) and 600 in (b) from the bivariate normal distribution:

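$$f(y_1, y_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-r^2}} \exp\left\{-\frac{1}{2(1-r^2)}\left(\frac{(y_1-\mu_1)^2}{\sigma_1^2} - \frac{2r(y_1-\mu_1)(y_2-\mu_2)}{\sigma_1\sigma_2} + \frac{(y_2-\mu_2)^2}{\sigma_2^2}\right)\right\}$$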

We consider the following two scenarios: (a) µ1 = 8, µ2 = 8, σ1 = 3, σ2 = 3, r = 0.8; (b) µ1 = 30, µ2 = 30, σ1 = 6, σ2 = 6, r = 0.9. For each of (a) and (b), we classify Y1 and Y2 as abnormal or normal based on a threshold: Y1 (or Y2) is abnormal if it is less than or equal to the threshold and normal otherwise. We treat Y1 as the predictor and Y2 as the actual status. We generate 1000 data sets in each configuration. For each simulation run, the thresholds for Y1 are the quantiles corresponding to the probabilities 10%, 20%, ..., 90%, while the thresholds for Y2 are the quantiles corresponding to the probabilities 30% and 60%. We use our proposed method to identify the threshold that leads to a good predictor.

The simulation results for the sensitivity and specificity (SENS and SPEC), likelihood ratio positive and likelihood ratio negative (LRP and LRN), average absolute likelihood ratio (AALR), Youden's index (YOUDEN), correlation statistic (CS) and assigned threshold values (THRES 1 and THRES 2) are summarized in Table 2. Each entry in the table is the average over 1000 simulations for the corresponding measure. For the CS, YOUDEN and LRP, the greater the measure, the better the test; for the LRN and AALR, the smaller the measure, the better the test. Table 2 shows that in all the simulated cases the CS is able to identify the correct threshold. The CS also strikes an acceptable balance between sensitivity and specificity: the greatest CS approximately corresponds to the greatest sensitivity (or specificity) given that specificity (or sensitivity) is at least about 0.8.

The well-known YOUDEN, defined as SENS + SPEC - 1, is often used to identify the threshold. However, it does not predict well, especially for the cases (Scenario (a), n = 300, THRES 1 = 8.76) and (Scenario (a), n = 600, THRES 1 = 8.76). It should be noted that the greatest YOUDEN does not achieve a good balance between sensitivity and specificity. For example, for the case (Scenario (a), n = 300, THRES 1 = 8.76), the greatest YOUDEN approximately corresponds to a relatively low specificity (0.663).

If the likelihood ratio positive or the likelihood ratio negative is used on its own to identify the threshold, it misses the correct value in each case. Also, the greatest LRP or the smallest LRN does not achieve an acceptable balance between sensitivity and specificity in many cases. It remains a challenge to find a compromise between the likelihood ratio positive and the likelihood ratio negative that balances sensitivity and specificity. One disadvantage of the likelihood ratio positive is that its value may be infinite in some extreme cases. The smallest AALR corresponds to the greatest CS. Surprisingly, the average absolute likelihood ratio, which is conceptually an average of likelihood ratios over a population or sample, is therefore also able to identify the correct threshold, if we are willing to use it in a counterintuitive way: the smaller the measure, the better the test.
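The threshold search described in this section can be sketched for a single simulated data set as follows. The sketch assumes that the candidate thresholds are the theoretical normal quantiles (the text does not say whether population or sample quantiles were used), draws one data set under scenario (a) with n = 300, and selects the predictor threshold with the greatest CS. The function names and the seed are illustrative only.

```python
import math
import random
from statistics import NormalDist


def correlation_statistic(n11, n10, n01, n00):
    """CS from the 2x2 counts; first index = actual, second = predicted."""
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom if denom > 0 else float("nan")


def select_threshold(pairs, actual_cut, candidate_cuts):
    """Return (cut, CS) for the predictor cut-off with the greatest CS.

    pairs holds (predictor, actual) values; a value is abnormal (coded 1)
    when it is less than or equal to its threshold, as in the simulation.
    """
    best_cut, best_cs = None, -math.inf
    for cut in candidate_cuts:
        counts = {(a, p): 0 for a in (0, 1) for p in (0, 1)}
        for predictor, actual in pairs:
            counts[(int(actual <= actual_cut), int(predictor <= cut))] += 1
        cs = correlation_statistic(counts[(1, 1)], counts[(1, 0)],
                                   counts[(0, 1)], counts[(0, 0)])
        if not math.isnan(cs) and cs > best_cs:
            best_cut, best_cs = cut, cs
    return best_cut, best_cs


# Scenario (a): mu = 8, sigma = 3, r = 0.8, n = 300 (one simulated data set).
mu, sigma, r, n = 8.0, 3.0, 0.8, 300
rng = random.Random(2024)
pairs = []
for _ in range(n):
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    predictor = mu + sigma * z1
    # Mixing z1 and z2 gives the actual value correlation r with the predictor.
    actual = mu + sigma * (r * z1 + math.sqrt(1 - r ** 2) * z2)
    pairs.append((predictor, actual))

marginal = NormalDist(mu, sigma)
candidate_cuts = [marginal.inv_cdf(k / 10) for k in range(1, 10)]  # 10%..90%
actual_cut = marginal.inv_cdf(0.3)                                # 30% quantile
best_cut, best_cs = select_threshold(pairs, actual_cut, candidate_cuts)
print(round(best_cut, 2), round(best_cs, 3))
```

Repeating this over many data sets and averaging each measure per threshold would give a table of the same shape as Table 2, although this sketch makes no claim to reproduce its entries.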

The gastric emptying study is the standard for measuring gastric emptying, and the aim of this study is to establish predictors and to assess the possibility of shortening the study time [2,19,20].

In our data analysis, we consider the gastric emptying at 2 hours as the predictor and the gastric emptying at 4 hours as the actual status. Although the gastric emptying results at 2 hours and at 4 hours are count variables, we classify them based on thresholds. The gastric emptying result at 4 hours is defined as abnormal if the percentage of remaining radioactivity is greater than 10% and as normal otherwise. For the gastric emptying results at 2 hours, the candidate thresholds are 30%, 40%, ..., 80%; a result at 2 hours is defined as abnormal if it is greater than the threshold and as normal otherwise. Our proposed method identifies the threshold of 50% for classifying the gastric emptying results at 2 hours, and the sensitivity and specificity results show that this threshold is feasible. Some results are summarized in Table 3, where THRES is the threshold value.

The area under the ROC curve (AUC) is the average sensitivity of the biomarker over the range of specificities and is often used to evaluate the overall performance of the biomarker. In our data the estimated AUC is 0.816, which provides evidence that our correlation statistic is useful for correctly classifying subjects with predicted positive and subjects with predicted negative results. From the table and the AUC value above, we conclude that the threshold of 50% is the best choice.

As is well known, the sensitivity and specificity are the true positive rate and the true negative rate, respectively. The positive likelihood ratio considers only predicted positives and ignores predicted negatives, while the negative likelihood ratio considers only predicted negatives and ignores predicted positives. In contrast, our correlation statistic provides a comprehensive and powerful measure and has the potential to assess a diagnostic test correctly and reliably. It can be used as an alternative to the average absolute likelihood ratio and Youden's index when they are not applicable. Also, our correlation statistic provides the direction of association between two categorical variables.

Table 3: Comparison between the correlation statistic and likelihood ratio measure in the gastric emptying study.

Due to the importance of assessing tests in medicine and clinical trials, many researchers have paid attention to statistical measures. The likelihood ratio positive and the likelihood ratio negative are the most commonly used measures of a diagnostic test. However, they often show opposite tendencies, and determining a compromise between them relies on the investigators' sound knowledge and experience in medicine and clinical trials. In this article, we proposed a statistical measure. Compared with the likelihood ratio positive and the likelihood ratio negative, our measure has a unique solution and produces an acceptable balance between sensitivity and specificity.

We should also point out that our bivariate Bernoulli distribution for correlated binary variables has been used here to construct a statistical measure of test performance. It is also important to explore its applicability to bivariate interobserver agreement, since the correlation coefficient indicates the direction and strength of the difference between the chance that two uncorrelated individuals yield a true positive and a true negative and the chance that they yield a false positive and a false negative. The correlation statistic can be used as an alternative to the average absolute likelihood ratio (AALR) and Youden's index when they are not applicable. In addition, we can explore the applicability of our proposed bivariate Bernoulli distribution in other settings, such as logistic regression for two binary outcomes and repeated measurements, and extend the bivariate distribution to higher dimensions.

Dr. Rai is grateful for generous support from Dr. DM Miller, Director, James Graham Brown Cancer Center, and the Wendell Cherry Chair in Clinical Trial Research.

The authors have declared no conflict of interest.