For two independent random variables Yj (j = 1, 2), let p(x) = P(Y1 < Y2 | X = x), where X is some covariate of interest. For m values of the covariate, x1, . . . , xm, the paper deals with the goal of testing H0 : p(xj) = 0.5 (j = 1, . . . , m) in a manner that controls the family wise error rate (FWE), the probability of one or more Type I errors. If m is relatively small, extant multiple comparison methods can be used to control FWE. But if m is relatively large, the actual level can be substantially smaller than the nominal level, raising concerns about relatively poor power. The paper describes a method for addressing this issue when p(x) is estimated via a running interval smoother.

Keywords: Effect size, Smoothers, Multiple comparisons

For two independent groups, let Yj (j = 1, 2) be some random variable associated with the jth group. As is evident, one approach to comparing these groups is in terms of some measure of location. As is well known, there is vast literature regarding how this might be done. Another approach is to focus on

p = P(Y1 < Y2) (1)

the probability that a randomly sampled observation from the first group is less than a randomly sampled observation from the second group. In the event tied values can occur, (1) is replaced with

p = P(Y1<Y2) + 0.5P(Y1=Y2) (2)
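For concreteness, (2) can be estimated by averaging over all n1 × n2 pairs of observations. A minimal Python sketch (the function name is illustrative, not from the paper):

```python
import numpy as np

def phat(y1, y2):
    """Estimate p = P(Y1 < Y2) + 0.5 * P(Y1 = Y2) from two independent
    samples by averaging over all n1 * n2 pairs of observations."""
    y1 = np.asarray(y1, dtype=float)
    y2 = np.asarray(y2, dtype=float)
    # Broadcast to compare every observation in group 1 with every
    # observation in group 2.
    less = (y1[:, None] < y2[None, :]).mean()
    ties = (y1[:, None] == y2[None, :]).mean()
    return less + 0.5 * ties
```

For example, phat([1, 2, 3], [2, 3, 4]) averages over nine pairs: six are strictly less, two are tied, giving 6/9 + 0.5(2/9) = 7/9.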

Cliff [1], Acion, et al. [2], Kraemer and Kupfer [3], and Vargha and Delaney [4], among others, summarize arguments for focusing on p when comparing groups. Indeed, it seems fairly evident that p provides a useful perspective.

Certainly the best-known method for making inferences about p is the Wilcoxon-Mann-Whitney (WMW) test. It is well known, however, that under general conditions, the WMW method uses an incorrect estimate of the standard error. Numerous methods have been derived for dealing with this issue. Wilcox [5] summarizes the relevant literature.

This paper is focused on comparing two independent groups based on p when there is a covariate, say X. Let p(x) denote the value of p given that X=x. Here, no particular parametric model is assumed regarding the nature of p(x). Rather, p(x) is estimated with a particular nonparametric regression estimator, which is described in section 3. Let pk(xk)=P(Y1<Y2|X=xk) (k=1, . . . , m), where x1, . . . , xm are m covariate values to be determined. The basic goal is to test

H0:pk(xk)=0.5 (3)

for each k=1, . . . , m such that the family wise error rate (FWE), meaning the probability of one or more Type I errors, is equal to some specified value, α. If the number of covariate values, m, is reasonably small, one can simply proceed along the lines in Wilcox [5]. A concern, however, is that important differences might be missed due to using too few covariate values. One might guess that a method in Wilcox [5] is readily adapted to the situation at hand when dealing with a large number of covariate values, but preliminary simulations made it clear that this approach is unsatisfactory: the actual FWE can be substantially smaller than the nominal level.

The goal of this paper is to suggest a modification of the method in Wilcox [5] that performs better in simulations. The proposed method is based in part on Cliff's [1] method for testing (1). Cliff's method was chosen based on results in Neuhauser, et al. [6], where several techniques were compared [7]. Particularly important here, for reasons made clear in section 2, is that Cliff's method has been found to perform relatively well even with sample sizes as small as eight.

The paper is organized as follows. Section 2 reviews Cliff's method. Section 3 describes the proposed method and section 4 reports simulation results. Section 5 illustrates the method using data from a study dealing with the emotional and physical well being of older adults.

Momentarily ignoring the covariate, Cliff's method for making inferences about p is applied as follows. Let Yij(i = 1, . . . , nj; j = 1, 2) be a random sample of nj observations from the jth group. Let

d_{ih} = \begin{cases} -1 & \text{if } Y_{i1} < Y_{h2}, \\ 0 & \text{if } Y_{i1} = Y_{h2}, \\ 1 & \text{if } Y_{i1} > Y_{h2}. \end{cases}

An estimate of δ = P(Y1 > Y2) − P(Y1 < Y2) is

\hat{\delta} = \frac{1}{n_1 n_2} \sum_{i=1}^{n_1} \sum_{h=1}^{n_2} d_{ih}.

Let

\bar{d}_{i.} = \frac{1}{n_2} \sum_{h=1}^{n_2} d_{ih}, \qquad \bar{d}_{.h} = \frac{1}{n_1} \sum_{i=1}^{n_1} d_{ih},

s_1^2 = \frac{1}{n_1 - 1} \sum_{i=1}^{n_1} (\bar{d}_{i.} - \hat{\delta})^2, \qquad s_2^2 = \frac{1}{n_2 - 1} \sum_{h=1}^{n_2} (\bar{d}_{.h} - \hat{\delta})^2,

\tilde{\sigma}^2 = \frac{1}{n_1 n_2 - 1} \sum_{i=1}^{n_1} \sum_{h=1}^{n_2} (d_{ih} - \hat{\delta})^2.

Then

\hat{\sigma}^2 = \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2 + \tilde{\sigma}^2}{n_1 n_2}

estimates the squared standard error of δˆ. Let z be the 1−α/2 quantile of a standard normal distribution. Rather than use the more obvious confidence interval for δ, Cliff [1] recommends

\frac{\hat{\delta} - \hat{\delta}^3 \pm z \hat{\sigma} \sqrt{(1 - \hat{\delta}^2)^2 + z^2 \hat{\sigma}^2}}{1 - \hat{\delta}^2 + z^2 \hat{\sigma}^2}.

Cliff's confidence interval for δ is readily modified to give a confidence interval for p. Letting

C_l = \frac{\hat{\delta} - \hat{\delta}^3 - z \hat{\sigma} \sqrt{(1 - \hat{\delta}^2)^2 + z^2 \hat{\sigma}^2}}{1 - \hat{\delta}^2 + z^2 \hat{\sigma}^2}

and

C_u = \frac{\hat{\delta} - \hat{\delta}^3 + z \hat{\sigma} \sqrt{(1 - \hat{\delta}^2)^2 + z^2 \hat{\sigma}^2}}{1 - \hat{\delta}^2 + z^2 \hat{\sigma}^2},

a 1 − α confidence interval for p is

\left( \frac{1 - C_u}{2}, \frac{1 - C_l}{2} \right).
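The computations above can be collected into a short sketch. The following Python version is an illustration of the formulas (not the author's R functions) and returns the 1 − α confidence interval for p:

```python
import numpy as np
from statistics import NormalDist

def cliff_ci(y1, y2, alpha=0.05):
    """Cliff's 1 - alpha confidence interval for p = P(Y1 < Y2) + 0.5 P(Y1 = Y2),
    obtained from the confidence interval for delta = P(Y1 > Y2) - P(Y1 < Y2)."""
    y1, y2 = np.asarray(y1, float), np.asarray(y2, float)
    n1, n2 = len(y1), len(y2)
    d = np.sign(y1[:, None] - y2[None, :])      # d_ih in {-1, 0, 1}
    dhat = d.mean()                             # delta-hat
    s1 = ((d.mean(axis=1) - dhat) ** 2).sum() / (n1 - 1)
    s2 = ((d.mean(axis=0) - dhat) ** 2).sum() / (n2 - 1)
    sig = ((d - dhat) ** 2).sum() / (n1 * n2 - 1)
    se2 = ((n1 - 1) * s1 + (n2 - 1) * s2 + sig) / (n1 * n2)  # squared SE
    z = NormalDist().inv_cdf(1 - alpha / 2)
    num = dhat - dhat ** 3
    rad = z * np.sqrt(se2) * np.sqrt((1 - dhat ** 2) ** 2 + z ** 2 * se2)
    den = 1 - dhat ** 2 + z ** 2 * se2
    cl, cu = (num - rad) / den, (num + rad) / den
    # p = (1 - delta) / 2, so the interval for delta maps to ((1-Cu)/2, (1-Cl)/2).
    return (1 - cu) / 2, (1 - cl) / 2
```

With two identical samples the interval is symmetric about 0.5, as one would expect.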

Now let (Yij , Xij) (i=1, . . . , nj ; j=1, 2) be a random sample with the covariate, X, included. The proposed method is based in part on a basic component of the running interval smoother, which has been studied extensively [5]. Momentarily focus on a single value of the covariate x, in which case the goal is to test H0: p(x)=0.5. The basic strategy is quite simple: compute a confidence interval for p based on the Yij values such that Xij is close to x.

The precise details are as follows. For the jth group, let Mj be the usual sample median of the covariate values and let MADj, the median absolute deviation estimator, be the median of

|X1j − Mj|, . . . , |Xnjj − Mj|. Let MADNj = MADj/0.6745. Under normality, MADNj estimates the standard deviation. Let

Nj(x)={i : |Xij− x| ≤ f× MADNj},

where the constant f, called the span, is to be determined. That is, for fixed j, Nj(x) indexes the observed covariate values that are close to x. The Xij values for which i∈Nj(x) are called the nearest neighbors. Let p̂(x) be the estimate of p(x) based on the Yij values for which i∈Nj(x). Choices for the span that generally perform well are f=0.8 or 1 (e.g., Wilcox [5]). Here, f = 1 is used. As is evident, H0: p(x)=0.5 can be tested simply by applying Cliff's method using the Yij values for which i∈Nj(x).
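A minimal Python sketch of the nearest-neighbor set Nj(x) for one group, assuming its covariate values are held in an array (the function name is illustrative):

```python
import numpy as np

def near(xvals, x, f=1.0):
    """Indices of the covariate values close to x in the running-interval
    sense: |X_i - x| <= f * MADN, with MADN = MAD / 0.6745."""
    xvals = np.asarray(xvals, float)
    m = np.median(xvals)
    madn = np.median(np.abs(xvals - m)) / 0.6745
    return np.nonzero(np.abs(xvals - x) <= f * madn)[0]
```

Note that because the window width is based on MADN rather than the standard deviation, an extreme covariate value (such as 100 below) does not inflate the neighborhood.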

There remains the problem of choosing the covariate values. In some situations there might be substantive reasons for using particular values. But otherwise it is clearly prudent to choose a reasonably wide range of covariate values. Here, two approaches are described and their relative merits are discussed in section 4.

To describe the first approach, let |Nj(x)| denote the cardinality of the set Nj(x). The basic strategy is to focus on covariate values where |Nj(x)| ≥ η, where η is some constant to be determined. That is, focus on situations where the sample sizes are sufficiently large when applying Cliff's method. Given η, let zlj be the smallest Xij value such that simultaneously |N1(Xij)| ≥ η and |N2(Xij)| ≥ η. In a similar manner, let zuj be the largest Xij value such that simultaneously |N1(Xij)| ≥ η and |N2(Xij)| ≥ η. Let zl = max{zl1, zl2} and zu = min{zu1, zu2}. The covariate values are taken to be the m values evenly spaced between zl and zu, inclusive, which are labeled x1, . . . , xm. Here, m = 25 is used. Extant results suggest that Cliff's method performs reasonably well when η ≥ 8, so η = 8 is used henceforth unless stated otherwise. This approach to choosing the covariate values will be called method S henceforth.
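Method S can be sketched in Python as follows; the function name and interface are illustrative:

```python
import numpy as np

def grid_S(x1, x2, eta=8, m=25, f=1.0):
    """Method S sketch: m covariate values evenly spaced over the region
    where both groups have at least eta nearest neighbors."""
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    madn = lambda x: np.median(np.abs(x - np.median(x))) / 0.6745
    h1, h2 = f * madn(x1), f * madn(x2)

    def enough(x):  # at least eta neighbors in each group at covariate value x
        return ((np.abs(x1 - x) <= h1).sum() >= eta
                and (np.abs(x2 - x) <= h2).sum() >= eta)

    bounds = []
    for xj in (x1, x2):  # z_lj and z_uj for each group
        good = np.sort(xj[np.array([enough(v) for v in xj])])
        bounds.append((good[0], good[-1]))
    zl = max(bounds[0][0], bounds[1][0])
    zu = min(bounds[0][1], bounds[1][1])
    return np.linspace(zl, zu, m)
```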

The second approach to choosing the covariate values is as follows. Let γ̂qj be an estimate of the qth quantile of the covariate associated with the jth group, q < 0.5. Let wl = max{γ̂q1, γ̂q2} and wu = min{γ̂1−q,1, γ̂1−q,2}. The covariate values are taken to be m values evenly spaced between wl and wu. Here, both q = 0.1 and q = 0.05 are considered. This alternative approach to choosing the covariate values will be called method Q henceforth. The relative merits of methods S and Q are summarized in section 6.
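A corresponding sketch of method Q; note that np.quantile's default estimator may differ from the quantile estimator used in the paper, so this is an approximation:

```python
import numpy as np

def grid_Q(x1, x2, q=0.1, m=25):
    """Method Q sketch: m covariate values evenly spaced between the larger
    of the two groups' q quantiles and the smaller of their 1-q quantiles."""
    wl = max(np.quantile(x1, q), np.quantile(x2, q))
    wu = min(np.quantile(x1, 1 - q), np.quantile(x2, 1 - q))
    return np.linspace(wl, wu, m)
```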

There remains the issue of controlling FWE. There is a wide range of techniques that might be used (e.g., Wilcox [5]). One possibility is to use the sequentially rejective technique derived by Hochberg [8], which has been found to perform well in simulations when m = 5. But as m increases, this approach results in FWE levels well below the nominal level. A similar concern arises using a critical value based on the Studentized maximum modulus distribution (with infinite degrees of freedom), as well as the method derived by Hommel [9]. This is not surprising for the following reason. In various situations, the set Nj(xk) ∩ Nj(xl), k ≠ l, will not be empty, in which case the test statistics corresponding to these two covariate values will be correlated. In terms of controlling FWE, what is needed is a method that takes this into account. Results in Wilcox [5] suggest how to proceed. The basic idea is to determine a critical p-value, pα, when both Y and X have a normal distribution and there is no association between Y and X. Then simulations are used to check on the impact of non-normality as well as of situations where there is an association between Y and X.

To be a bit more precise, let pk be the p-value when testing H0: p(xk)=0.5 (k=1, . . . , m). Let pmin denote min{p1, . . . , pm} and let pα denote the α quantile of the distribution of pmin. So for any k, H0: p(xk)=0.5 is rejected when pk ≤ pα, in which case FWE will be equal to α. So the strategy is to use simulations to determine pα when X and Y are independent and both have a standard normal distribution. Then simulations are used to determine the impact on FWE when X and Y are dependent and when Y has a non-normal distribution.
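The simulation step for determining pα can be sketched as follows; here pvalue_fun is a hypothetical stand-in for whatever routine computes the m p-values (e.g., Cliff's method applied at each of the m covariate values):

```python
import numpy as np

def critical_pvalue(pvalue_fun, n1, n2, alpha=0.05, nrep=2000, seed=1):
    """Estimate p_alpha, the alpha quantile of the minimum p-value over the
    m tests, by simulating independent standard normal (X, Y) data.
    pvalue_fun takes (y1, x1, y2, x2) and returns the m p-values."""
    rng = np.random.default_rng(seed)
    pmin = np.empty(nrep)
    for r in range(nrep):
        y1, x1 = rng.standard_normal(n1), rng.standard_normal(n1)
        y2, x2 = rng.standard_normal(n2), rng.standard_normal(n2)
        pmin[r] = min(pvalue_fun(y1, x1, y2, x2))
    # Rejecting H0: p(x_k) = 0.5 whenever p_k <= p_alpha then yields FWE = alpha.
    return np.quantile(pmin, alpha)
```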

Table 1 shows estimates of pα when using method S with m = 25 and α = 0.05, where there is a common sample size n ranging between 30 and 800. (The same pα values are used by method Q.) These estimates are based on 2000 replications. When n is small, execution time is reasonably low, but as n increases, execution time becomes an issue. Note that generally the estimates of pα decrease as n increases. Overall, the rate of the decrease is very small, particularly for n ≥ 200. Here, for sample sizes not included in Table 1, Cleveland's [10] smoother is used to estimate pα based on 1/n and the values in Table 1. (The R function lplot.pred in Wilcox [5] is used here.) For unequal sample sizes, pα is determined for both n1 and n2 and the results are averaged.

This section reports simulation results on the ability of the methods in the previous section to control FWE under non-normality and when there is an association between Y and X.

Data were generated based on

Y = X^a + ε, (6)

where ε is some random variable having a median of zero, and a = 1 or 2. The distribution of the error term, ε, was taken to be one of four g-and-h distributions [11], one of which is the standard normal distribution. If Z has a standard normal distribution, then by definition

V = \begin{cases} \dfrac{\exp(gZ) - 1}{g} \exp(hZ^2/2), & \text{if } g > 0, \\ Z \exp(hZ^2/2), & \text{if } g = 0, \end{cases}

has a g-and-h distribution, where g and h are parameters that determine the first four moments. The four distributions used here were the standard normal (g=h=0.0), a symmetric heavy-tailed distribution (h=0.2, g=0.0), an asymmetric distribution with relatively light tails (h=0.0, g=0.2), and an asymmetric distribution with heavy tails (g=h=0.2). Table 2 shows the skewness (κ1) and kurtosis (κ2) for each distribution. Figure 1 shows plots of these distributions. Additional properties of the g-and-h distribution are summarized by Hoaglin [11]. The results reported here are for situations where the distribution of X was taken to be standard normal. A few simulations were run where X and ε have the same g-and-h distribution; no new insights were found, so the results are not reported.
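Generating g-and-h deviates for such simulations amounts to transforming standard normal deviates per the definition above; a Python sketch (the function name is illustrative):

```python
import numpy as np

def ghdist(n, g=0.0, h=0.0, seed=None):
    """Generate n observations from a g-and-h distribution by transforming
    standard normal deviates Z.  g = h = 0 returns Z unchanged."""
    z = np.random.default_rng(seed).standard_normal(n)
    if g == 0.0:
        return z * np.exp(h * z ** 2 / 2)
    return (np.exp(g * z) - 1) / g * np.exp(h * z ** 2 / 2)
```

For g > 0 and h = 0 the transformation is monotone and bounded below by −1/g, which is the source of the skewness.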

It is noted that if there is no covariate, transforming the data to some g-and-h distribution does not alter the results based on Cliff's method, because the method depends only on the ranks of the data. But when there is a covariate and there is an association between Y and X, this is no longer the case. So an issue is understanding the impact on FWE when there is an association.

Table 3 shows the results for a = 1 and 2 and sample sizes (n1, n2) = (30, 30), (30, 60) and (150, 200), where the covariate values were chosen based on method S. Again 2000 replications were used. Although the importance of a Type I error depends on the situation, Bradley [12] suggested that as a general guide, when testing at the 0.05 level, the actual level should be between 0.025 and 0.075. Based on this criterion, the proposed method is satisfactory for all of the situations considered when a = 1. However, when a = 2, this is no longer the case. For (n1, n2) = (30, 60) and g=h=0, the estimate is 0.079. For (n1, n2) = (150, 200), estimates are approximately equal to 0.1. For sample sizes (n1, n2) = (40, 100), and (n1, n2) = (50, 100), not shown in Table 3, again estimates exceeded 0.075. Lowering the span to 0.8 and even 0.6 does not correct this problem. Using η = 14 was found to be unsatisfactory as well. A closer examination of the simulation results revealed that when a = 2, inferences based on the more extreme covariate values result in FWE values greater than the nominal level. This motivated the second approach to choosing the covariate values, method Q in the previous section.

Some additional simulations were run where Y = I_{X>0} X + ε, where the indicator function I_{X>0} = 1 if X > 0; otherwise I_{X>0} = 0. The results were very similar to those where a = 1. So there are indications that method S can perform reasonably well when the regression line is not straight. But if there is sufficient curvature for the more extreme covariate values, this is no longer the case.

Table 4 shows the simulation results when using method Q. As can be seen, now the largest estimate is 0.066. Note that given the sample sizes, altering the distribution of the error term has an even smaller impact on the estimate of FWE compared to method S. The main difficulty is that the estimate drops below 0.025 in some situations, the lowest estimate being 0.020. For a = 2 and (n1, n2) = (600, 600), the estimates are nearly identical to those when (n1, n2) = (150, 200). For (n1, n2) = (50, 600), g=h=0 and a = 2 the estimate is 0.050. Simulations were also run using q = 0.05. The estimates differed from those in Table 4 by at most three units in the third decimal place.

Data from the Well Elderly 2 study [13] are used to illustrate the proposed method. A general goal in the Well Elderly 2 study was to assess the efficacy of an intervention strategy aimed at improving the physical and emotional health of older adults. (The data are available at http://www.icpsr.umich.edu/icpsrweb/landing.jsp). A portion of the study was aimed at understanding the impact of intervention on a Measure of Meaningful Activities (MAPA). A covariate of interest was the Cortisol Awakening Response (CAR), which is defined as the change in cortisol concentration that occurs during the first 30-45 minutes after waking from sleep. (CAR is taken to be the cortisol level upon awakening minus the level of cortisol after the participants were awake). Extant studies [14,15] indicate that the CAR is associated with various measures of stress. Here, a control group is compared to a group that received intervention.

Figure 2 shows the estimate of P(Y1<Y2|X), where Y1 is the MAPA score before intervention, Y2 is MAPA after intervention, and X is CAR. (Leverage points, outliers among the CAR values, were removed). Selecting covariate points via method S, significant differences are found for CAR ranging between -0.36 and -0.27. Plots of the regression lines, using Cleveland's [10] method, suggested that there is little or no curvature, which in turn suggests that method S controls FWE reasonably well. So there is an indication that when the CAR is sufficiently negative (cortisol increases after awakening), MAPA scores tend to be higher for the group receiving intervention. The covariate values used by method S range between -0.38 and 0.32. In contrast, when using method Q with q = 0.1, they range between -0.19 and 0.15 and no significant results are found. For q = 0.05 the covariate values range between -0.27 and 0.22 with a single significant result when the CAR is equal to -0.24.

In summary, all indications are that method S for choosing the covariate values performs reasonably well except when there is a sufficient amount of curvature near the extreme ends of the covariate. Method Q avoids FWE well above the nominal level in situations where method S breaks down. But method S has the potential of providing comparisons for a wider range of covariate values. As was illustrated, this can make a practical difference.

A criticism of method S is that there is no formal method for justifying the assumption that curvature among the more extreme covariate values is not an issue. For now, the best that can be done is to inspect the plot returned by some nonparametric regression estimator. So an argument for using method Q might be that it is safer in terms of controlling FWE.

Finally, R functions for applying methods S and Q are available at Dornsife.usc.edu/cf/labs/wilcox/wilcox-faculty-display.cfm and are stored in the file Rallfun-v34. The function ancdetwmw applies method S and ancdetwmwQ applies method Q.

[5] Wilcox RR (2017) Introduction to Robust Estimation and Hypothesis Testing (4th edn), Academic Press, San Diego, CA. https://www.elsevier.com/books/introduction-to-robust-estimation-and-hypothesis-testing/wilcox/978-0-12-804733-0