We propose and study a method for partial covariate selection, which selects only the covariate values that fall within their effective ranges. The coefficient estimates based on the resulting data are more interpretable, as they are based on the effective covariates. This is in contrast to existing variable selection methods, in which variables are selected or deleted as a whole. To test the validity of the partial variable selection, we extend the Wilks theorem to handle this case. Simulation studies are conducted to evaluate the performance of the proposed method, and it is applied to a real data analysis as illustration.

Keywords: Covariate, Effective range, Partial variable selection, Linear model, Likelihood ratio test

Variable selection is a common practice in biostatistics, and there is a vast literature on the topic. Commonly used methods include the likelihood ratio test [1], Akaike information criterion (AIC) [2], Bayesian information criterion (BIC) [3], the minimum description length [4,5], stepwise regression, and the Lasso [6], etc. The principal components method, which forms linear combinations of the original covariates, reduces a large number of covariates to a handful of major principal components, but the result is not easy to interpret in terms of the original covariates. Stepwise regression starts from the full model and deletes covariates one by one according to some measure of statistical significance. May, et al. [7] addressed variable selection in artificial neural network models, Mehmood, et al. [8] gave a review of variable selection with partial least squares models, Wang, et al. [9] addressed variable selection in generalized additive partial linear models, and Liu, et al. [10] addressed variable selection in semiparametric additive partial linear models. The Lasso [6,11] and its variations [12,13] are used to select a few significant variables in the presence of a large number of covariates.

However, existing methods only select whole variables to enter the model, which may not be the most desirable in some biomedical practice. For example, in two heart disease studies [14,15], more than ten risk factors were identified by medical researchers over long-term investigations. With existing variable selection methods, some of these risk factors would be deleted wholly from the investigation. This is undesirable, since risk factors are truly risky only when they fall into certain risk ranges. Deleting whole variables in this case therefore seems unreasonable; a more reasonable approach is to find the risk ranges of these variables and delete the variable values in the non-risky ranges. In some other studies, some covariate values may be just random errors that do not contribute to the response, and removing these values makes the model interpretation more accurate. In this sense we select a variable only where its values fall within some range. To our knowledge, a method for this kind of partial variable selection has not appeared in the literature, and it is the goal of our study here. Note that in existing variable selection methods, whole variables are selected or deleted, while in our method, variables are partially selected or deleted, i.e., only some proportions of the observations of some variables are selected or deleted. The latter is very different from the existing methods. In summary, with traditional variable selection methods such as stepwise regression or the Lasso, each covariate is removed either wholly or not at all. This is not very reasonable: some of the removed covariates may be partially effective, so removing all their values may yield misleading results, or at least lose information; meanwhile, for the variables remaining in the model, not all their values are necessarily effective for the analysis.
With the proposed method, only the non-effective values of the covariates are removed, and the effective values are kept in the analysis. This is more reasonable than the existing all-or-nothing removal.

In the existing method of deleting whole variables, the validity of the selection can be justified using Wilks' result: under the null hypothesis that the deleted variables have no effect, twice the log-likelihood ratio is asymptotically chi-squared distributed. We extend the Wilks theorem to the proposed partial variable deletion and use it to justify the partial deletion procedure. Simulation studies are conducted to evaluate the performance of the proposed method, and it is applied to analyze a real data set as illustration.

The observed data are (y_i, x_i) (i=1,...,n), where y_i is the response and x_i ∈ R^d the covariate vector of the i-th subject. Denote y_n = (y_1,…,y_n)' and X_n = (x_1',…,x_n')'. Consider the linear model

y_n = X_n β + ε_n, (1)

where β = (β_1,…,β_d)' is the vector of regression parameters and ε_n = (ε_1,…,ε_n)' is the vector of random errors, or residual departures from the linear model assumption. Without loss of generality we consider the case where the ε_i's are independently and identically distributed (iid), i.e. with variance matrix Var(ε_n) = σ²I_n, where I_n is the n-dimensional identity matrix. When the ε_i's are not iid, it is often assumed that Var(ε_n) = Ω for some known positive-definite Ω; making the transformations ỹ_n = Ω^{-1/2} y_n, X̃_n = Ω^{-1/2} X_n and ε̃_n = Ω^{-1/2} ε_n yields the model ỹ_n = X̃_n β + ε̃_n, in which the ε̃_i's are iid with Var(ε̃_n) = I_n. When Ω is unknown, it can be estimated in various ways. So below we only need to discuss the case where the ε_i's are iid.
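The Ω^{-1/2} transformation above is easy to carry out numerically with a Cholesky factor: if Ω = LL', then pre-multiplying by L^{-1} also whitens the errors, since Var(L^{-1}ε_n) = L^{-1}ΩL^{-T} = I_n, and L^{-1} differs from the symmetric root Ω^{-1/2} only by an orthogonal rotation, which leaves the likelihood-based analysis unchanged. A minimal sketch (the function name and interface are ours, assuming a known positive-definite Ω):

```python
import numpy as np

def whiten(y, X, Omega):
    """Whiten the correlated-error model y = X b + e, Var(e) = Omega,
    via the Cholesky factor Omega = L L': returns (L^{-1} y, L^{-1} X),
    whose transformed errors are iid with identity covariance."""
    L = np.linalg.cholesky(Omega)
    return np.linalg.solve(L, y), np.linalg.solve(L, X)
```

After this transformation, the iid-error methods below apply directly to the whitened data.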

Summary of existing work

We first give a brief review of existing variable selection methods. Assume the model residual ε = y − x'β has a known density function f(·) (such as the normal), possibly with some unknown parameters. For simplicity of discussion we assume there are no unknown parameters. Then the log-likelihood is

l_n(β) = Σ_{i=1}^n log f(y_i − x_i'β).

Let β̂ be the maximum likelihood estimate (MLE) of β (when f(·) is the standard normal density, β̂ is just the least squares estimate). If we delete k (≤ d) columns of X_n and the corresponding components of β, denote the remaining covariate matrix by X_n⁻, the resulting parameter by β⁻, and the corresponding MLE by β̂⁻. Then under the hypothesis H0 that the deleted columns of X_n have no effect, or equivalently that the deleted components of β are all zero, asymptotically [1]

2[l_n(β̂) − l_n(β̂⁻)] →_D χ²_k,

where χ²_k is the chi-squared distribution with k degrees of freedom. For a given nominal level α, let χ²_k(1−α) be the (1−α)-th upper quantile of the χ²_k distribution. If 2[l_n(β̂) − l_n(β̂⁻)] ≥ χ²_k(1−α), then H0 is rejected at significance level α and it is not appropriate to delete these columns of X_n; otherwise we accept H0 and delete these columns of X_n.
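This whole-column test is straightforward to implement. A sketch assuming standard normal errors, so that the log-likelihood difference reduces to a difference of residual sums of squares, with the χ²_k cut-off obtained by simulation (the function names are ours, not from the original work):

```python
import numpy as np

def rss(y, M):
    """Residual sum of squares of the least-squares fit of y on M."""
    beta, *_ = np.linalg.lstsq(M, y, rcond=None)
    r = y - M @ beta
    return r @ r

def lrt_whole_columns(y, X, drop_cols, alpha=0.05, seed=0):
    """Wilks test for deleting whole columns: with standard normal errors,
    2*[l_n(beta_hat) - l_n(beta_hat_minus)] = RSS(reduced) - RSS(full),
    compared against the simulated chi2_k upper (1 - alpha) quantile."""
    keep = [j for j in range(X.shape[1]) if j not in drop_cols]
    stat = rss(y, X[:, keep]) - rss(y, X)
    rng = np.random.default_rng(seed)
    crit = np.quantile(rng.chisquare(len(drop_cols), 100_000), 1 - alpha)
    return stat, crit  # stat >= crit: reject H0, so keep the columns
```

With a truly irrelevant column the statistic stays on the χ²_1 scale; deleting an effective column inflates it far beyond the cut-off.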

There are other methods for selecting columns of X_n, such as AIC, BIC and their variants from the model selection field. In these methods, the optimal deletion of columns of X_n corresponds to the best model selection, which optimizes the AIC or BIC. These methods are not as solid as the likelihood ratio test above, as they may sometimes depend on eye inspection to choose the model that optimizes the criterion.

All the above methods require the models under consideration to be nested within each other, i.e., one is a sub-model of the other. Another, more general model selection criterion is the minimum description length (MDL), a measure of complexity developed by Kolmogorov [4], Wallace and Boulton [16], and others. Kolmogorov complexity is closely related to entropy: it is the output of a Markov information source, normalized by the length of the output, and it converges almost surely (as the length of the output goes to infinity) to the entropy of the source. Let G = {g(⋅,⋅)} be a finite set of candidate models under consideration, and Θ = {θ_j : j=1,…,h} the set of parameters of interest. A θ_i may or may not be nested within some other θ_j, and two θ_i, θ_j ∈ Θ may have the same dimension but different parametrizations. Next consider a fixed density f(·|θ_j), with parameter θ_j running through a subset Γ_j ⊂ R^{k_j}. To emphasize the index of the parameter, we denote the MLE of θ_j under model f(·|·) by θ̂_j (rather than θ̂_n, which would emphasize the dependence on the sample size), I(θ_j) the Fisher information for θ_j under f(·|·), |I(θ_j)| its determinant, and k_j the dimension of θ_j. The MDL criterion (for example, Rissanen [17], the review paper by Hansen and Yu [5], and references therein) chooses θ_j to minimize

−Σ_{i=1}^n log f(Y_i|θ̂_j) + (k_j/2) log(n/(2π)) + log ∫_{Γ_j} |I(θ_j)|^{1/2} dθ_j, (j=1,…,h). (3)
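For a concrete feel, the first two terms of (3) can be computed for competing Gaussian linear sub-models; the third (Fisher-information integral) term is omitted here for simplicity, so this is only a rough two-term sketch of the criterion, not the full MDL:

```python
import numpy as np

def mdl_two_terms(y, X, cols):
    """First two terms of criterion (3) for a unit-variance Gaussian linear
    model on columns `cols`: the negative log-likelihood at the MLE plus the
    (k_j / 2) * log(n / (2*pi)) complexity penalty (integral term omitted)."""
    n = len(y)
    M = X[:, cols]
    beta, *_ = np.linalg.lstsq(M, y, rcond=None)
    r = y - M @ beta
    neg_loglik = 0.5 * (r @ r) + 0.5 * n * np.log(2 * np.pi)
    return neg_loglik + 0.5 * len(cols) * np.log(n / (2 * np.pi))
```

Smaller is better: the sub-model minimizing the criterion is chosen, and no nesting between candidate column sets is required.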

This method does not require the models to be nested, but it still selects or deletes whole columns. The other existing methods for variable selection, such as stepwise regression and the Lasso, likewise delete or keep whole variables, and do not apply to our problem.

Now we come to our question, which is non-standard; we are not aware of a formal method that addresses it. However, we think the following formulation is of practical meaning. Consider deleting some of the components within k fixed (k ≤ d) columns of X_n, with deleted proportions γ_1,...,γ_k (0 < γ_j < 1) for these columns. Denote by X_n⁻ the remaining covariate matrix, which is X_n with the deleted entries replaced by 0's. Before the partial deletion, the model is

y_n = X_n β + ε_n

After the partial deletion of covariates, the model becomes

y_n = X_n⁻ β⁻ + ε_n

Note that here β and β⁻ have the same dimension, as no covariate is completely deleted. β is the vector of effects of the original covariates; β⁻ is the vector of effects of the covariates after partial deletion, i.e., the effects of the effective covariates. As an oversimplified example, suppose we have n=5 individuals, with responses y_n = (y_1, y_2, y_3, y_4, y_5)' and covariate vectors x_1 = (1.3, 0.2, −1.5)', x_2 = (−0.1, 0.9, −1.3)', x_3 = (1.1, 1.4, −0.3)', x_4 = (0.8, 1.2, −1.7)', x_5 = (1.0, 2.1, −1.1)', and X_n = (x_1, x_2, x_3, x_4, x_5)'. Then β is the vector of effects in the regression of y_n on X_n. If we remove some seemingly insignificant covariate components, for example, let x_1⁻ = (1.3, 0, −1.5)', x_2⁻ = (0, 0.9, −1.3)', x_3⁻ = (1.1, 1.4, 0)', x_4⁻ = (0.8, 1.2, −1.7)', x_5⁻ = (1.0, 2.1, −1.1)' and X_n⁻ = (x_1⁻, x_2⁻, x_3⁻, x_4⁻, x_5⁻)', then β⁻ is the vector of effects in the regression of y_n on X_n⁻. Thus, though β and β⁻ have the same structure, they have different interpretations. The problem can be formulated as testing the hypothesis:
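In code, a partial deletion just replaces selected entries of the design matrix by zero rather than dropping whole columns. A toy sketch with the (hypothetical) numbers above, zeroing a few small-magnitude entries:

```python
import numpy as np

def partial_delete(X, deleted):
    """Return a copy of X with the (row, col) entries in `deleted` set to 0,
    i.e. the partially deleted design matrix X_n^-."""
    Xm = X.copy()
    for i, j in deleted:
        Xm[i, j] = 0.0
    return Xm

X = np.array([[ 1.3, 0.2, -1.5],
              [-0.1, 0.9, -1.3],
              [ 1.1, 1.4, -0.3],
              [ 0.8, 1.2, -1.7],
              [ 1.0, 2.1, -1.1]])
# zero the seemingly insignificant entries: rows 1-3, one small entry each
X_minus = partial_delete(X, [(0, 1), (1, 0), (2, 2)])
```

Both matrices keep the same shape, so β and β⁻ have the same dimension, exactly as in the text.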

H0: β = β⁻  vs.  H1: β ≠ β⁻

If H0 is accepted, the partial deletion is valid.

Note that, unlike the standard null hypothesis that some components of the parameter are zero, the above null hypothesis is not nested: β⁻ is not a sub-vector of β, so the existing Wilks theorem for the likelihood ratio statistic does not directly apply here.

Denote by l_n⁻(β) the log-likelihood based on the data (y_n, X_n⁻), and by β̂⁻ the corresponding MLE. Since, after the partial deletion, β̂⁻ is the MLE of β under a constrained log-likelihood while β̂ is the MLE under the full likelihood, we have l_n⁻(β̂⁻) ≤ l_n(β̂). Parallel to the log-likelihood ratio statistic for whole-variable deletion, for our case let

Λ_n = 2[l_n(β̂) − l_n⁻(β̂⁻)].

Let (j_1,...,j_k) be the columns with partial deletions, C_{j_r} = {i : x_{i,j_r} is deleted, 1 ≤ i ≤ n} the index set of the deleted covariate values in the j_r-th column (r=1,...,k), and |C_{j_r}| the cardinality of C_{j_r}, so that γ_r = |C_{j_r}|/n (r=1,...,k). For different j_r and j_s, C_{j_r} and C_{j_s} may or may not share common elements. We first give Theorem 1 for the simple case in which the index sets C_{j_r} are mutually exclusive. Corollary 1 then gives the result in the more general case in which the C_{j_r}'s need not be mutually exclusive.

For a given X_n there are many possible partial column deletions, and Theorem 1 can be used to test each of them. Given a significance level α, a deletion is valid at level α if Λ_n < Q(1−α), where Q(1−α) is the (1−α)-th upper quantile of the Σ_{j=1}^k γ_j χ²_j distribution, which can be computed by simulation for given (γ_1,...,γ_k).
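The quantile Q(1−α) of this chi-squared mixture has no closed form in general, but is easy to obtain by Monte Carlo, as the text suggests. A sketch (the function name is ours); by default all mixture components are χ²_1 as in the mutually-exclusive case, while the `dfs` argument accommodates the larger degrees of freedom that arise in the general case:

```python
import numpy as np

def mixture_quantile(gammas, alpha=0.05, dfs=None, n_sim=200_000, seed=0):
    """Upper (1 - alpha) quantile of sum_j gamma_j * chi2_{df_j},
    estimated by Monte Carlo simulation."""
    rng = np.random.default_rng(seed)
    if dfs is None:
        dfs = [1] * len(gammas)  # all components chi2 with 1 df
    draws = sum(g * rng.chisquare(df, n_sim) for g, df in zip(gammas, dfs))
    return float(np.quantile(draws, 1 - alpha))
```

For example, `mixture_quantile([0.1, 0.2, 0.25])` gives the cut-off for deletion proportions (1/10, 1/5, 1/4); with a single weight of 1 it reduces to the usual χ²_1 quantile.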

The following theorem is a generalization of the Wilks theorem [1]: deleting whole columns of X_n corresponds to γ_j = 1 (j=1,...,k) in the theorem, in which case we recover the existing Wilks theorem.

Theorem 1: Under H0, suppose C_{j_r} ∩ C_{j_s} = ∅, the empty set, for all 1 ≤ r ≠ s ≤ k. Then we have

Λ_n →_D Σ_{j=1}^k γ_j χ²_j,

where χ²_1,...,χ²_k are iid chi-squared random variables with 1 degree of freedom.

Note that in the Wilks problem the null hypothesis is that the coefficients corresponding to some variables are zero, so the null is nested within the alternative; the null hypothesis in our problem concerns coefficients corresponding to partially deleted variables and is not nested within the alternative. So the results of the two methods are not directly comparable.

The case where the C_{j_r}'s are not mutually exclusive is a bit more complicated. We first rewrite the sets C_{j_r} so that

∪_{r=1}^k C_{j_r} = ∪_{r=1}^k ∪_{j_1,...,j_r} D_{j_1,...,j_r},

where the D_{j_1,...,j_r}'s are mutually exclusive: D_{j_1},...,D_{j_k} are index sets belonging to one column of X_n only; the D_{j_1,j_2}'s are index sets common to columns j_1 and j_2 only; the D_{j_1,j_2,j_3}'s are common to columns j_1, j_2 and j_3 only; and so on. In general some of the D_{j_1,...,j_r}'s are empty sets. Let |D_{j_1,…,j_r}| be the cardinality of D_{j_1,…,j_r} and γ_{j_1,...,j_r} = |D_{j_1,…,j_r}|/n (r=1,...,k).

By examining the proof of Theorem 1, we get the following corollary which gives the result in the more general case.

Corollary 1: Under H0, we have

Λ_n = 2[l_n(β̂) − l_n⁻(β̂⁻)] →_D Σ_{r=1}^k Σ_{j_1,…,j_r} γ_{j_1,…,j_r} χ²_{j_1,…,j_r},

where the χ²_{j_1,...,j_r}'s are independent chi-squared random variables with r degrees of freedom (r=1,...,k).

Below we give two examples to illustrate the use of Theorem 1 and Corollary 1.

Example 1: n=1000, d=5, k=3. Columns (1,2,4) have partial deletions with C_1 = {201, 202, ..., 300}, C_2 = {351, 352, ..., 550}, C_3 = {601, 602, ..., 850}; the C_j's have no overlap, and γ_1 = 1/10, γ_2 = 1/5, γ_3 = 1/4. So by Theorem 1, under H0 we have

2[l_n(β̂) − l_n⁻(β̂⁻)] →_D (1/10)χ²_1 + (1/5)χ²_2 + (1/4)χ²_3,

where all the chi-squared random variables are independent, each has 1 degree of freedom.

Example 2: n=1000, d=5, k=3. Columns (1,2,4) have partial deletions with C_1 = {101, ..., 300} ∪ {651, ..., 750}, C_2 = {201, ..., 350}, C_3 = {251, ..., 300} ∪ {701, ..., 800}. In this case the C_j's overlap, so Theorem 1 cannot be used directly, and we use the Corollary. Then D_1 = {101, ..., 200} ∪ {651, ..., 700}, D_2 = {301, ..., 350}, D_3 = {751, ..., 800}, D_{1,2} = {201, ..., 250}, D_{1,3} = {701, ..., 750}, D_{2,3} = ∅, D_{1,2,3} = {251, ..., 300}; γ_1 = 3/20, γ_2 = 1/20, γ_3 = 1/20, γ_{1,2} = 1/20, γ_{1,3} = 1/20, γ_{2,3} = 0, γ_{1,2,3} = 1/20. So by the Corollary, under H0 we have

2[l_n(β̂) − l_n⁻(β̂⁻)] →_D (3/20)χ²_1 + (1/20)χ²_2 + (1/20)χ²_3 + (1/20)χ²_{1,2} + (1/20)χ²_{1,3} + (1/20)χ²_{1,2,3},

where all the chi-squared random variables are independent: χ²_1, χ²_2 and χ²_3 each have 1 degree of freedom, χ²_{1,2} and χ²_{1,3} each have 2 degrees of freedom, and χ²_{1,2,3} has 3 degrees of freedom.

Next, we discuss the consistency of β̂⁻ under the null hypothesis H0. Let x⁻ = x⁻_r with probability γ_r (r=0,1,...,k), where x⁻_r is an i.i.d. copy of the x⁻_i's whose indices lie in C_{j_r}; in particular, C_{j_0} is the index set of the observations without partial deletion, with γ_0 = 1 − Σ_{r=1}^k γ_r.

Theorem 2: Under conditions of Theorem 1,

i) β̂⁻ → β_0 (a.s.);

ii) √n (β̂⁻ − β_0) →_D N(0, Ω),

where

Ω = E_{β_0}[l̇(β_0) l̇'(β_0)] = E[(x⁻ − μ⁻)(x⁻ − μ⁻)'] ∫ ḟ²(ε)/f(ε) dε.

To extend the results of Theorem 2 to the general case, we need more notation. Let x⁻ = x⁻_{j_1,…,j_r} with probability γ_{j_1,...,j_r} (r=0,1,...,k), where x⁻_{j_1,…,j_r} is an i.i.d. copy of the x⁻_i's whose indices lie in D_{j_1,...,j_r}.

Corollary 2: Under conditions of Corollary 1, results of Theorem 2 hold with x− given above.

Computationally E[(x−−μ−)(x−−μ−)'] is well approximated by

E[(x⁻ − μ⁻)(x⁻ − μ⁻)'] ≈ Σ_{r=0}^k (|D_{j_1,…,j_r}|/n) (1/|D_{j_1,…,j_r}|) Σ_{(i,j)∈D_{j_1,…,j_r}} (x⁻_{i,j} − μ̂⁻_{j_1,…,j_r})(x⁻_{i,j} − μ̂⁻_{j_1,…,j_r})',

where the notation Σ_{(i,j)∈D_{j_1,...,j_r}} means summation over those x⁻_{i,j}'s with deletion index in D_{j_1,...,j_r}, and μ̂⁻_{j_1,…,j_r} = (1/|D_{j_1,…,j_r}|) Σ_{(i,j)∈D_{j_1,...,j_r}} x⁻_{i,j}.
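The plug-in approximation above is a |D|/n-weighted average of within-group second moments about each group's own mean. A sketch under the simplifying assumption that each group D is supplied as a list of row indices of X_n⁻ (the indexing in the display is slightly more general):

```python
import numpy as np

def groupwise_second_moment(X_minus, groups):
    """Approximate E[(x- - mu-)(x- - mu-)'] by a |D|/n-weighted average of
    within-group averaged outer products about each group's mean."""
    n, d = X_minus.shape
    S = np.zeros((d, d))
    for idx in groups:
        G = X_minus[np.asarray(idx)]
        mu_hat = G.mean(axis=0)  # group mean, the mu-hat of that group
        S += (len(idx) / n) * ((G - mu_hat).T @ (G - mu_hat)) / len(idx)
    return S
```

With a single group covering all rows this reduces to the usual (biased) sample covariance matrix, which is a quick sanity check on the implementation.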

We illustrate the proposed method with two examples, Examples 3 and 4 below: the former rejects the null hypothesis H0 while the latter accepts it. In each case we simulate n=1000 i.i.d. data points with response y_i and covariates x_i = (x_{i1}, x_{i2}, x_{i3}, x_{i4}, x_{i5})' (i=1,...,n). We first generate the covariates, sampling the x_i's from the 5-dimensional normal distribution with mean vector μ = (3.1, 1.8, −0.5, 0.7, 1.5)' and a given covariance matrix Γ.

Then, given the covariates, the response data are generated as

y_i = x_i'β_0 + ε_i, (i = 1, …, n),

where β_0 = (0.42, 0.11, 0.65, 0.83, 0.72)' and the ε_i's are i.i.d. N(0,1).

A hypothesis test is conducted to examine whether the partial deletion is valid. The significance level is set at α=0.05. The experiment is repeated 1000 times; Prop denotes the proportion of replications with Λ_n > Q(1−α), where Q(1−α) is the (1−α)-th upper quantile of the distribution Σ_{j=1}^k γ_j χ²_j given in Theorem 1, computed via simulation.
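One replication of this experiment can be sketched as follows, assuming standard normal errors (so Λ_n reduces to a difference of residual sums of squares) and an identity covariance Γ; to make the rejection behaviour visible we delete clearly effective values (the largest 10% of the first covariate), so H0 should be rejected. This is our own illustration, not the authors' exact code:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
beta0 = np.array([0.42, 0.11, 0.65, 0.83, 0.72])
mu = np.array([3.1, 1.8, -0.5, 0.7, 1.5])
X = mu + rng.normal(size=(n, d))         # covariates (Gamma = I assumed)
y = X @ beta0 + rng.normal(size=n)       # responses under the linear model

def rss(y, M):
    """Residual sum of squares of the least-squares fit of y on M."""
    b, *_ = np.linalg.lstsq(M, y, rcond=None)
    r = y - M @ b
    return r @ r

# partial deletion: zero out the largest 10% of column 0 (effective values)
X_minus = X.copy()
cut = np.quantile(X[:, 0], 0.9)
X_minus[X[:, 0] >= cut, 0] = 0.0
gamma = np.mean(X[:, 0] >= cut)          # deleted proportion, about 0.1

Lambda_n = rss(y, X_minus) - rss(y, X)   # Lambda_n under N(0,1) errors
Q = np.quantile(gamma * rng.chisquare(1, 100_000), 0.95)
reject = Lambda_n >= Q                   # reject H0: this deletion is not valid
```

Repeating this over many replications and recording how often Λ_n > Q(1−α) yields the Prop column described above.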

Example 3: In this example, five data sets are generated by the above method, with five different values of β_0. We are interested in whether covariate values with |x_{ij}| < 1/10 can be deleted. The proportions γ = (γ_1,…,γ_k) of the x_{ij}'s with |x_{ij}| < 1/10 are shown for each data set, and the results are given in Table 1; the five rows of Table 1 correspond to the five data sets. For each data set, the parameter β is estimated, the test is conducted with the given γ, Λ_n is computed, Q(1−α) is given, and the corresponding p-value is provided. Note that for our problem, a p-value smaller than α means a significant value of Λ_n, i.e., a significant difference between the regression coefficients of the original covariates and those of the covariates after partial deletion, which in turn implies that the null hypothesis should be rejected and the partial deletion should not be conducted (Table 1).

We see that the p-values for rejecting H0 are all smaller than 0.05 for the five sets of β_0. This suggests that covariate values with |x_{ij}| < 1/10 should not be deleted at significance level α=0.05.

Example 4: In this example, the original X is as in Example 3, but now we replace the entries in the first 100 rows and first three columns by noise ε, where ε ~ N(0, 1/9). The deletion proportion γ = (0.1, 0.1, 0.1) is fixed, with the x_{ij}'s with absolute values below the lower 0.1 quantile being deleted. We are interested in whether these noise values can be deleted in this case, i.e., whether H0 is accepted. The results are shown in Table 2.

We see that the p-values for rejecting H0 are all greater than 0.95 for the five sets of β_0. This suggests that the data provide strong evidence that the deleted values are noise and are not necessary to the analysis at the 0.05 significance level.

We analyze a data set from the Deprenyl and Tocopherol Antioxidative Therapy of Parkinsonism trial, obtained from the National Institutes of Health (NIH) (for a detailed description and data link, see https://www.ncbi.nlm.nih.gov/pubmed/2515723). It is a multi-center, placebo-controlled clinical trial that aimed to determine a treatment for early Parkinson's disease patients to prolong the time before levodopa therapy is required. The number of patients enrolled was 800. The selected subjects were untreated patients with Parkinson's disease (stage I or II) of less than five years' duration who met other eligibility criteria. They were randomly assigned according to a two-by-two factorial design to one of four treatment groups: 1) placebo; 2) active tocopherol; 3) active deprenyl; 4) active deprenyl and tocopherol. Patients were observed for 14±6 months and reevaluated every 3 months. At each visit the Unified Parkinson's Disease Rating Scale (UPDRS), including its motor, mental and activities-of-daily-living components, was evaluated. The statistical analysis was based on 800 subjects. The results revealed no beneficial effect of tocopherol, while deprenyl was found to significantly prolong the time until levodopa therapy was required, reducing the risk of disability by 50 percent as measured by the UPDRS.

Our goal is to examine whether some of the covariates can be partially deleted. If traditional variable selection methods such as stepwise regression or the Lasso were used, some covariates would be removed wholly from the analysis. This is not very reasonable, since some of the removed covariates may be partially effective, and removing all their values may yield misleading results or at least lose information. We use the proposed method to examine three response variables, PDRS, TREMOR and PIGD, with three covariates, Age, Motor and ADL, for each response. The deleted covariate values are those below the γ-th data quantile, with γ = 0.01, 0.02, 0.03 and 0.05. We examine each response and covariate one by one. The results are shown in Table 3, Table 4 and Table 5 below.

In Table 3, the response TREMOR is examined. For the covariate Age, the likelihood ratio Λ_n is larger than the cut-off point Q(1−α) at all deletion proportions, suggesting that no partial deletion at these proportions should be performed for Age. For the covariate Motor, Λ_n is smaller than the cut-off Q(1−α) at the 0.01 proportion, so this covariate can be partially deleted at this proportion. In other words, the values of Motor below its 1% quantile have no impact on the analysis; they can be treated as noise and removed. For the covariate ADL, with deletion proportions 0.01-0.1, Λ_n is smaller than Q(1−α), which suggests that the lower 1%-10% of this covariate's values have no impact on the analysis and should be deleted. After removing the corresponding proportions of Motor and ADL, the model is re-fitted to obtain the parameter estimates shown there. These estimates are more meaningful than those based on the whole covariate data, since the noise values of the covariates have been removed and only the effective covariate values enter the analysis. However, if traditional variable selection methods such as stepwise regression or the Lasso were used, the whole covariate Motor, ADL, or both might be removed, leading to information loss or even misleading results.

In Table 4, the response PIGD is investigated. For the covariate Age, Λ_n is larger than the cut-off point Q(1−α) at the 0.02, 0.03 and 0.05 proportions, suggesting that partial deletion at these proportions is not appropriate. For the covariate Motor, Λ_n is smaller than the cut-off Q(1−α) at the deletion proportions 0.02 and 0.03, suggesting that the lower 2-3% of its values should be deleted from the analysis. For the covariate ADL, Λ_n is larger than the cut-off Q(1−α) at the deletion proportions 0.02, 0.03 and 0.05, hence partial deletion at these proportions is not valid. After deleting the 3% smallest values of Motor, the model is re-fitted to obtain the parameter estimates shown in Table 4. The new estimates are more meaningful since the non-effective values of the covariate Motor have been removed from the analysis.

In Table 5, the response is PDRS. The likelihood ratios Λ_n for Age, Motor and ADL are all larger than Q(1−α) at the deletion proportions 0.01, 0.02, 0.03 and 0.05. Thus the null hypotheses are rejected at all these proportions, no deletion is valid, and the analysis should be based on the original full data, with the parameter estimates shown in the tables (Table 3, Table 4, and Table 5).

Note that the coefficient for Age is insignificant, and hence the corresponding Λ_n values at the deletion proportions are not meaningful.

We proposed a method for partial variable deletion, in which only some proportions of covariate values are deleted. This is in contrast to existing methods, which either select or delete entire variables; the proposed method is thus new and generalizes existing variable selection. The question is motivated by practical problems. The method can be used to find the effective ranges of the covariates, or to remove possible noise in the covariates, so that the corresponding estimated effects are more interpretable. The proposed test statistic is a generalization of the Wilks likelihood ratio statistic; its asymptotic distribution is generally a chi-squared mixture distribution, whose cut-off point can be computed by simulation. Simulation studies were conducted to evaluate the performance of the method, and it was applied to analyze a real Parkinson's disease data set as illustration. A drawback of the current version of the method is that the proportions of possible deletions must be specified in advance, which makes the optimal proportions difficult to find. In our next step of research we will implement an algorithm that finds the optimal proportions automatically and is easier to use. As suggested by a reviewer, simulation studies comparing the statistical significance of the proposed method against existing variable selection methods should be performed to quantify its contribution; this is a direction for our future research (Appendix).

This research was supported by the Intramural Research Program of the National Institutes of Health. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.