We propose and study a method for partial covariate selection, which selects only those covariate values that fall within their effective ranges. The coefficient estimates based on the resulting data are more interpretable, being based on the effective covariates. This is in contrast to existing variable selection methods, in which variables are selected or deleted as a whole. To test the validity of the partial variable selection, we extend Wilks' theorem to handle this case. Simulation studies are conducted to evaluate the performance of the proposed method, and it is applied to a real data analysis as illustration.
Covariate, Effective range, Partial variable selection, Linear model, Likelihood ratio test
Variable selection is a common practice in biostatistics and there is a vast literature on this topic. Commonly used methods include the likelihood ratio test [1], the Akaike information criterion, AIC [2], the Bayesian information criterion, BIC [3], the minimum description length [4,5], stepwise regression, and the Lasso [6], etc. Principal components analysis models linear combinations of the original covariates and reduces a large number of covariates to a handful of major principal components, but the result is not easy to interpret in terms of the original covariates. Stepwise regression starts from the full model and deletes covariates one by one according to some measure of statistical significance. May, et al. [7] addressed variable selection in artificial neural network models, Mehmood, et al. [8] gave a review of variable selection with partial least squares models, Wang, et al. [9] addressed variable selection in generalized additive partial linear models, and Liu, et al. [10] addressed variable selection in semiparametric additive partial linear models. The Lasso [6,11] and its variations [12,13] are used to select a few significant variables in the presence of a large number of covariates.
However, existing methods only select whole variables to enter the model, which may not be the most desirable in some biomedical practice. For example, in two heart disease studies [14,15], more than ten risk factors were identified by medical researchers over their long investigations. With the existing variable selection methods, some of these risk factors would be deleted wholly from the investigation. This is not desirable, since risk factors are really risky only when they fall into certain risk ranges; deleting whole variables in this case is not reasonable. A more reasonable approach is to find the risk ranges of these variables and delete the variable values in the non-risky ranges. In some other studies, some of the covariate values may be just random errors which do not contribute to the response, and removing these covariate values will make the model interpretation more accurate. In this sense we select a variable only where its values fall within some range. To our knowledge, a method for this kind of partial variable selection has not appeared in the literature, and it is the goal of our study here. Note that in existing variable selection methods, whole variables are selected or deleted, while in our method, variables are partially selected or deleted, i.e., only some proportions of the observations of some variables are selected or deleted. The latter is very different from the existing methods. In summary, with traditional variable selection methods, such as stepwise regression or the Lasso, each covariate is removed either wholly or not at all from the analysis. This is not very reasonable, since some of the removed covariates may be partially effective, and removing all their values may yield misleading results, or at least cause a loss of information; meanwhile, for the variables remaining in the model, not all their values are necessarily effective for the analysis. With the proposed method, only the non-effective values of the covariates are removed, and the effective values are kept in the analysis. This is more reasonable than the existing all-or-nothing removal.
In the existing method of deleting whole variables, the validity of the selection can be justified using Wilks' result: under the null hypothesis of no effect of the deleted variables, two times the log-likelihood ratio is asymptotically chi-squared distributed. We extend Wilks' theorem to the case of the proposed partial variable deletion and use it to justify the partial deletion procedure. Simulation studies are conducted to evaluate the performance of the proposed method, and it is applied to analyze a real data set as illustration.
The observed data are $(y_i, \boldsymbol{x}_i)$ $(i = 1, \dots, n)$, where $y_i$ is the response and $\boldsymbol{x}_i = (x_{i1}, \dots, x_{id})'$ is the covariate vector of the $i$-th subject. Denote $\boldsymbol{y} = (y_1, \dots, y_n)'$ and $X = (\boldsymbol{x}_1, \dots, \boldsymbol{x}_n)'$. Consider the linear model
$$\boldsymbol{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon},$$
where $\boldsymbol{\beta}$ is the $d$-dimensional vector of regression parameters and $\boldsymbol{\varepsilon} = (\varepsilon_1, \dots, \varepsilon_n)'$ is the vector of random errors, or residual departures from the linear model assumption. Without loss of generality we consider the case in which the $\varepsilon_i$'s are independently and identically distributed (iid), i.e. with variance matrix $\mathrm{Var}(\boldsymbol{\varepsilon}) = \sigma^2 I_n$, where $I_n$ is the $n$-dimensional identity matrix. When the $\varepsilon_i$'s are not iid, it is often assumed that $\mathrm{Var}(\boldsymbol{\varepsilon}) = \sigma^2 V$ for some known positive-definite matrix $V$; making the transformation $\tilde{\boldsymbol{y}} = V^{-1/2}\boldsymbol{y}$, $\tilde{X} = V^{-1/2}X$ and $\tilde{\boldsymbol{\varepsilon}} = V^{-1/2}\boldsymbol{\varepsilon}$, we get the model $\tilde{\boldsymbol{y}} = \tilde{X}\boldsymbol{\beta} + \tilde{\boldsymbol{\varepsilon}}$, in which the components of $\tilde{\boldsymbol{\varepsilon}}$ are iid with variance matrix $\sigma^2 I_n$. When $V$ is unknown, it can be estimated in various ways. So below we only need to discuss the case in which the $\varepsilon_i$'s are iid.
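As a concrete illustration of this whitening step, here is a minimal sketch (assuming a known error covariance $V$ and Gaussian errors; the AR(1)-type $V$ and all variable names are our own illustrative choices, not part of the paper):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(0)
n, d = 100, 3

# hypothetical AR(1)-type error covariance V (assumed known here)
V = 0.5 ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))

X = rng.normal(size=(n, d))
beta = np.array([1.0, -2.0, 0.5])
L = cholesky(V, lower=True)              # V = L L'
eps = L @ rng.normal(size=n)             # errors with Var(eps) = V
y = X @ beta + eps

# whiten: multiply by V^{-1/2} (here realized via the Cholesky factor L)
y_t = solve_triangular(L, y, lower=True)
X_t = solve_triangular(L, X, lower=True)

# after the transformation the errors are iid, so ordinary least squares applies
beta_hat, *_ = np.linalg.lstsq(X_t, y_t, rcond=None)
print(beta_hat)
```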
We first give a brief review of the existing method of variable selection. Assume the model residual $\varepsilon_i$ has some known density function $f(\cdot)$ (such as the normal), possibly with some unknown parameter(s). For simplicity of discussion we assume there are no unknown parameters. Then the log-likelihood is
$$\ell_n(\boldsymbol{\beta}) = \sum_{i=1}^{n} \log f(y_i - \boldsymbol{x}_i'\boldsymbol{\beta}).$$
Let $\hat{\boldsymbol{\beta}} = \arg\max_{\boldsymbol{\beta}} \ell_n(\boldsymbol{\beta})$ be the maximum likelihood estimate (MLE) of $\boldsymbol{\beta}$ (when $f$ is the standard normal density, $\hat{\boldsymbol{\beta}}$ is just the least squares estimate). If we delete $q$ columns of $X$ and the corresponding components of $\boldsymbol{\beta}$, denote the remaining covariate matrix by $X_{(q)}$, the resulting parameter by $\boldsymbol{\beta}_{(q)}$, and the corresponding MLE by $\hat{\boldsymbol{\beta}}_{(q)}$. Then under the hypothesis $H_0$: the deleted columns of $X$ have no effects, or equivalently the deleted components of $\boldsymbol{\beta}$ are all zeros, we have asymptotically [1]
$$2[\ell_n(\hat{\boldsymbol{\beta}}) - \ell_n(\hat{\boldsymbol{\beta}}_{(q)})] \xrightarrow{D} \chi^2_q,$$
where $\chi^2_q$ is the chi-squared distribution with $q$ degrees of freedom. For a given nominal level $\alpha$, let $\chi^2_q(\alpha)$ be the $\alpha$-th upper quantile of the $\chi^2_q$ distribution. If $2[\ell_n(\hat{\boldsymbol{\beta}}) - \ell_n(\hat{\boldsymbol{\beta}}_{(q)})] > \chi^2_q(\alpha)$, then $H_0$ is rejected at significance level $\alpha$, and it is not good to delete these columns of $X$; otherwise we accept $H_0$ and delete these columns of $X$.
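For reference, here is a minimal sketch of this whole-column deletion test under standard normal errors (the data-generating values and the choice of deleted columns are our own, for illustration only):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def loglik(y, X):
    """Gaussian log-likelihood at the least-squares fit (sigma = 1 assumed)."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    return -0.5 * resid @ resid - 0.5 * len(y) * np.log(2 * np.pi)

n, d = 500, 5
X = rng.normal(size=(n, d))
y = X[:, :3] @ np.array([1.0, -0.5, 2.0]) + rng.normal(size=n)  # last 2 columns inactive

q = 2                                               # delete the last two columns
lr = 2 * (loglik(y, X) - loglik(y, X[:, :d - q]))
p = stats.chi2.sf(lr, df=q)
print(f"LR = {lr:.2f}, p = {p:.3f}")                # large p: deletion acceptable
```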
There are some other methods to select columns of $X$, such as AIC, BIC and their variants, as in the model selection literature. In these methods, the optimal deletion of columns of $X$ corresponds to the best model selection, which optimizes the AIC or BIC. These methods are not as solid as the one above, as they may sometimes depend on eye inspection to choose the model which optimizes the AIC or BIC.
All the above methods require the models under consideration to be nested within each other, i.e., one is a sub-model of the other. Another, more general model selection criterion is the minimum description length (MDL), a measure of complexity, developed by Kolmogorov [4], Wallace and Boulton [16], and others. The Kolmogorov complexity is closely related to entropy: it is the output of a Markov information source, normalized by the length of the output, and it converges almost surely (as the length of the output goes to infinity) to the entropy of the source. Let $\mathcal{M}$ be a finite set of candidate models under consideration, and $\Theta_M$ be the set of parameters of interest under model $M \in \mathcal{M}$. A model $M$ may or may not be nested within some other $M' \in \mathcal{M}$, or $M$ and $M'$ both in $\mathcal{M}$ may have the same dimension but different parametrizations. Next consider a fixed density $f(\cdot \mid \theta)$, with parameter $\theta$ running through a subset $\Theta_M$. To emphasize the index of the parameter, we denote the MLE of $\theta$ under model $M$ by $\hat{\theta}_M$ (instead of by $\hat{\theta}_n$, which emphasizes the dependence on the sample size), the Fisher information for $\theta$ under $M$ by $I_M(\theta)$, its determinant by $|I_M(\theta)|$, and the dimension of $\Theta_M$ by $d_M$. Then the MDL criterion (for example, Rissanen [17] and the review paper by Hansen and Yu [5], and references therein) chooses the model $M$ to minimize
$$\mathrm{MDL}(M) = -\log f(\boldsymbol{y} \mid \hat{\theta}_M) + \frac{d_M}{2}\log\frac{n}{2\pi} + \log \int_{\Theta_M} |I_M(\theta)|^{1/2}\, d\theta.$$
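As a rough numerical illustration, the following sketch computes a simplified two-part description length for a Gaussian linear model, keeping the fit term and the dimension penalty but omitting the Fisher-information integral (which is model-specific); this simplification is ours, not the full criterion:

```python
import numpy as np

def two_part_mdl(y, X):
    """Simplified MDL score: negative Gaussian log-likelihood at the MLE plus
    the dimension penalty (d/2) * log(n / (2*pi)); the integral term of the
    full criterion is omitted in this sketch."""
    n, d = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / n
    neg_loglik = 0.5 * n * (np.log(2 * np.pi * sigma2_hat) + 1)
    return neg_loglik + 0.5 * d * np.log(n / (2 * np.pi))

# choose between candidate column subsets by minimizing the score, e.g.
# scores = {cols: two_part_mdl(y, X[:, cols]) for cols in [(0, 1, 2), (0, 1, 2, 3, 4)]}
```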
This method does not require the models to be nested, but it still selects or deletes whole columns. The other existing methods for variable selection, such as stepwise regression and the Lasso, are likewise designed for deleting or keeping whole variables, and do not apply to our problem.
Now we come to our question, which is non-standard; we are not aware of a formal method to address this problem. However, we think the following question is of practical meaning. Consider deleting some of the components within $k$ fixed columns of $X$, with deleted proportions $\alpha_1, \dots, \alpha_k$ for these columns. Denote by $X^0$ the remaining covariate matrix, which is $X$ with some entries replaced by 0's, corresponding to the deleted elements. Before the partial deletion, the model is
$$\boldsymbol{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}.$$
After the partial deletion of covariates, the model becomes
$$\boldsymbol{y} = X^0\boldsymbol{\beta}^0 + \boldsymbol{\varepsilon}^0.$$
Note that here $\boldsymbol{\beta}$ and $\boldsymbol{\beta}^0$ have the same dimension, as no covariate is completely deleted. $\boldsymbol{\beta}$ gives the effects of the original covariates, while $\boldsymbol{\beta}^0$ gives the effects of the covariates after some possible partial deletion; it is the effect of the effective covariates. As an oversimplified example, suppose we have $n = 5$ individuals, with five responses $y_1, \dots, y_5$ and covariate vectors $\boldsymbol{x}_1, \dots, \boldsymbol{x}_5$. Then $\boldsymbol{\beta}$ gives the effects of the regression of $(y_1, \dots, y_5)$ on $(\boldsymbol{x}_1, \dots, \boldsymbol{x}_5)$. If we remove some seemingly insignificant covariate components, setting them to zero to obtain $\boldsymbol{x}_1^0, \dots, \boldsymbol{x}_5^0$, then $\boldsymbol{\beta}^0$ gives the effects of regressing $(y_1, \dots, y_5)$ on $(\boldsymbol{x}_1^0, \dots, \boldsymbol{x}_5^0)$. Thus, though $\boldsymbol{\beta}$ and $\boldsymbol{\beta}^0$ have the same structure, they have different interpretations. The problem can be formulated as testing the hypothesis
$$H_0: \boldsymbol{\beta}^0 = \boldsymbol{\beta} \quad \text{vs.} \quad H_1: \boldsymbol{\beta}^0 \neq \boldsymbol{\beta}.$$
If $H_0$ is accepted, the partial deletion is valid.
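To make the construction concrete, here is a small sketch (the data-generating values and the 10% deletion threshold are our own illustrative choices) that zeroes out one covariate's values below a quantile and contrasts the two estimates:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, 0.5, -1.0]) + rng.normal(size=n)

# partial deletion: zero out column 0's values below its 10% quantile
X0 = X.copy()
X0[X[:, 0] < np.quantile(X[:, 0], 0.10), 0] = 0.0

beta_hat  = np.linalg.lstsq(X,  y, rcond=None)[0]   # effects of the original covariates
beta0_hat = np.linalg.lstsq(X0, y, rcond=None)[0]   # effects of the effective covariates
print(beta_hat.round(3), beta0_hat.round(3))
```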
Note that, unlike the standard null hypothesis that some components of the parameter are zero, the above null hypothesis is not nested: the model under $H_0$ is not a sub-model of the full model, so the existing Wilks theorem for the likelihood ratio statistic does not directly apply here.
Denote by $\ell_n^0(\cdot)$ the corresponding log-likelihood based on the data $(\boldsymbol{y}, X^0)$, and the corresponding MLE by $\tilde{\boldsymbol{\beta}} = \arg\max_{\boldsymbol{\beta}} \ell_n^0(\boldsymbol{\beta})$. Since, after the partial deletion, $\tilde{\boldsymbol{\beta}}$ is the MLE under a constrained log-likelihood, while $\hat{\boldsymbol{\beta}}$ is the MLE under the full likelihood, we have $\ell_n(\hat{\boldsymbol{\beta}}) \geq \ell_n^0(\tilde{\boldsymbol{\beta}})$. Parallel to the log-likelihood ratio statistic for (whole) variable deletion, let, for our case,
$$\Lambda_n = 2[\ell_n(\hat{\boldsymbol{\beta}}) - \ell_n^0(\tilde{\boldsymbol{\beta}})].$$
Let $j_1, \dots, j_k$ be the columns with partial deletions, $D_{j_r} = \{i : x_{ij_r} \text{ is deleted}\}$ be the index set for the deleted covariate values in the $j_r$-th column $(r = 1, \dots, k)$, and $|D_{j_r}|$ be the cardinality of $D_{j_r}$, so that $|D_{j_r}| \approx n\alpha_r$. For different $r$ and $s$, $D_{j_r}$ and $D_{j_s}$ may or may not have some common components. We first give the result in the simple case in which the index sets $D_{j_r}$'s are mutually exclusive (Theorem 1). Then in Corollary 1 we give the result in the more general case in which the index sets $D_{j_r}$'s need not be mutually exclusive.
For given $\alpha_1, \dots, \alpha_k$, there are many different ways of partial column deletion; we may use Theorem 1 to test each of these deletions. Given a significance level $\alpha$, a deletion is valid at level $\alpha$ if $\Lambda_n \leq \lambda(\alpha)$, where $\lambda(\alpha)$ is the $\alpha$-th upper quantile of the limiting distribution in Theorem 1, which can be computed by simulation for a given deletion pattern.
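Because the limiting law in Theorem 1 below is a chi-squared mixture rather than a single chi-squared, its upper quantile is conveniently approximated by Monte Carlo. A minimal sketch follows; the weights and degrees of freedom passed in are placeholders to be supplied from the theorem:

```python
import numpy as np

def mixture_quantile(weights, dfs, alpha=0.05, n_sim=200_000, seed=0):
    """Upper-alpha quantile of sum_j w_j * chi2(df_j) with independent terms,
    approximated by Monte Carlo simulation."""
    rng = np.random.default_rng(seed)
    draws = sum(w * rng.chisquare(df, size=n_sim) for w, df in zip(weights, dfs))
    return np.quantile(draws, 1.0 - alpha)

# e.g. three non-overlapping partially deleted columns, unit weights assumed
print(mixture_quantile(weights=[1.0, 1.0, 1.0], dfs=[1, 1, 1]))
```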
The following theorem is a generalization of the Wilks theorem [1]. Deleting some whole columns of $X$ corresponds to taking the deletion proportions $\alpha_r = 1$ in the theorem, in which case we recover the existing Wilks theorem.
Theorem 1: Under $H_0$, suppose $D_{j_r} \cap D_{j_s} = \emptyset$, the empty set, for all $r \neq s$; then we have
$$\Lambda_n \xrightarrow{D} \sum_{r=1}^{k} a_{j_r}\,\chi^2_{1,r},$$
where the $\chi^2_{1,r}$'s are iid chi-squared random variables with 1 degree of freedom and the weights $a_{j_r}$ are determined by the deletion pattern.
Note that in Wilks' problem the null hypothesis is that the coefficients corresponding to some variables are zero, and the null hypothesis is nested within the alternative; the null hypothesis in our problem concerns the coefficients corresponding to some partial variables, and it is not nested within the alternative. So the results of the two methods are not directly comparable.
The case in which the $D_{j_r}$'s are not mutually exclusive is a bit more complicated. We first re-write the sets $D_{j_1}, \dots, D_{j_k}$ as unions of sets
$$E_{j_r}, \quad E_{j_r j_s}, \quad E_{j_r j_s j_t}, \ \dots,$$
where these sets are mutually exclusive: the $E_{j_r}$'s are index sets for one column of $X$ only; the $E_{j_r j_s}$'s are index sets common to columns $j_r$ and $j_s$ only; the $E_{j_r j_s j_t}$'s are index sets common to columns $j_r$, $j_s$ and $j_t$ only, and so on. Generally some of the $E$'s are empty sets. Let $|E|$ denote the cardinality of each such set $E$.
By examining the proof of Theorem 1, we get the following corollary which gives the result in the more general case.
Corollary 1: Under $H_0$, we have
$$\Lambda_n \xrightarrow{D} \sum_{E} a_E\, \chi^2_{r(E)},$$
where the sum is over the non-empty sets $E$ above, and the $\chi^2_{r(E)}$'s are all independent chi-squared random variables with $r$ degrees of freedom, $r(E)$ being the number of columns sharing the index set $E$.
Below we give two examples to illustrate the use of Theorem 1 and Corollary 1.
Example 1: Columns $j_1$, $j_2$ and $j_3$ of $X$ have some partial deletions, with index sets $D_{j_1}$, $D_{j_2}$ and $D_{j_3}$ having no overlap. So by Theorem 1, under $H_0$ we have
$$\Lambda_n \xrightarrow{D} a_{j_1}\chi^2_{1,1} + a_{j_2}\chi^2_{1,2} + a_{j_3}\chi^2_{1,3},$$
where all the chi-squared random variables are independent, each with 1 degree of freedom.
Example 2: Columns $j_1$, $j_2$ and $j_3$ of $X$ have some partial deletions whose index sets overlap. In this case Theorem 1 cannot be used directly, so we use Corollary 1. Decompose the deletion index sets into the mutually exclusive sets $E_{j_1}$, $E_{j_2}$, $E_{j_3}$ (each belonging to one column only), $E_{j_1 j_2}$, $E_{j_1 j_3}$, $E_{j_2 j_3}$ (each common to two columns only) and $E_{j_1 j_2 j_3}$ (common to all three columns). So by Corollary 1, under $H_0$ we have
$$\Lambda_n \xrightarrow{D} \sum_{r=1}^{3} a_{j_r}\chi^2_{1,r} + \sum_{r<s} a_{j_r j_s}\chi^2_{2,rs} + a_{j_1 j_2 j_3}\chi^2_{3},$$
where all the chi-squared random variables are independent, those for the one-column sets each of 1 degree of freedom, those for the two-column sets each of 2 degrees of freedom, and that for the three-column set of 3 degrees of freedom.
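Using the `mixture_quantile` helper sketched earlier, the cut-off for a limit with the structure of Example 2 could be approximated as follows (unit weights are assumed purely for illustration; the actual weights come from Corollary 1):

```python
# three one-column sets (1 df each), three two-column sets (2 df each),
# and one three-column set (3 df); unit weights assumed here
cutoff = mixture_quantile(weights=[1.0] * 7, dfs=[1, 1, 1, 2, 2, 2, 3])
print(round(cutoff, 2))
```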
Next, we discuss the consistency of the estimate of $\boldsymbol{\beta}^0$ under the null hypothesis $H_0$. Let $\boldsymbol{x}^0 = \boldsymbol{x}_{(E)}$ with probability $\alpha_E$, where $\boldsymbol{x}$ is an i.i.d. copy of the $\boldsymbol{x}_i$'s and $\boldsymbol{x}_{(E)}$ is $\boldsymbol{x}$ with the components with index in $E$ set to zero; in particular, $E = \emptyset$ is the index set for those covariates without partial deletion.
Theorem 2: Under the conditions of Theorem 1, $\tilde{\boldsymbol{\beta}}$ is a consistent estimate of $\boldsymbol{\beta}^0$, and
$$\sqrt{n}\,(\tilde{\boldsymbol{\beta}} - \boldsymbol{\beta}^0) \xrightarrow{D} N(\boldsymbol{0}, \Sigma^{-1}),$$
where
$$\Sigma = E(\boldsymbol{x}^0 \boldsymbol{x}^{0\prime}).$$
To extend the results of Theorem 2 to the general case, we need some more notation. Let $\boldsymbol{x}_E$ be an i.i.d. copy of the data in the set $E$. Let $\boldsymbol{x}^0 = \boldsymbol{x}_{(E)}$ with probability $\alpha_E$, where $\boldsymbol{x}_{(E)}$ is an i.i.d. copy of the $\boldsymbol{x}_i$'s whose components with index in $E$ are set to zero.
Corollary 2: Under the conditions of Corollary 1, the results of Theorem 2 hold with the $\Sigma$ given above.
Computationally, $\Sigma$ is well approximated by
$$\hat{\Sigma} = \frac{1}{n}\sum_{E}\sum_{i \in E} \boldsymbol{x}_i^0 \boldsymbol{x}_i^{0\prime},$$
where the notation $\sum_{i \in E}$ means summation over those $i$'s with deletion index set $E$, and the outer sum is over all the index sets $E$ (including the empty set, for observations without deletions).
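Since the deletion index sets partition the sample, the double sum collapses to a single matrix product over the partially deleted design matrix; a one-function sketch:

```python
import numpy as np

def sigma_hat(X0):
    """Empirical approximation of Sigma = E[x0 x0'] from the partially deleted
    design matrix X0 (deleted entries already set to 0); because the index
    sets partition the sample, the double sum equals X0'X0 / n."""
    return X0.T @ X0 / X0.shape[0]
```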
We illustrate the proposed method with two examples, Examples 3 and 4 below. The former rejects the null hypothesis while the latter accepts it. In each case we simulate $n$ i.i.d. data points with responses $y_i$ and covariates $\boldsymbol{x}_i = (x_{i1}, \dots, x_{i5})'$. We first generate the covariates, sampling the $\boldsymbol{x}_i$'s from the 5-dimensional normal distribution with a given mean vector and a given covariance matrix.
Then we generate the response data given the covariates. The $y_i$'s are generated as
$$y_i = \boldsymbol{x}_i'\boldsymbol{\beta} + \varepsilon_i,$$
where the $\varepsilon_i$'s are i.i.d. random errors.
A hypothesis test is conducted to examine whether the partial deletion is valid or not. The significance level is set at $\alpha = 0.05$. The experiment is repeated 1000 times, and we record the proportion of replications with $\Lambda_n > \lambda(\alpha)$, where $\lambda(\alpha)$ is the $\alpha$-th upper quantile of the limiting distribution given in Theorem 1, computed via simulation.
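A sketch of one such replication experiment is given below. Unit-variance Gaussian errors, a single partially deleted column, and a one-term weight configuration for the cut-off are our assumptions for illustration; `mixture_quantile` is the helper sketched earlier:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, alpha, n_rep = 500, 5, 0.05, 1000
beta = np.array([1.0, -0.5, 0.8, 0.3, -1.2])
cutoff = mixture_quantile(weights=[1.0], dfs=[1], alpha=alpha)  # helper from above

def lr_stat(y, X, X0):
    """2*(max loglik on full data - max loglik on partially deleted data);
    with unit-variance Gaussian errors this is the difference in RSS."""
    rss = lambda A: np.sum((y - A @ np.linalg.lstsq(A, y, rcond=None)[0]) ** 2)
    return rss(X0) - rss(X)

rejections = 0
for _ in range(n_rep):
    X = rng.multivariate_normal(np.zeros(d), np.eye(d), size=n)
    y = X @ beta + rng.normal(size=n)
    X0 = X.copy()
    X0[X[:, 0] < np.quantile(X[:, 0], 0.10), 0] = 0.0   # delete lowest 10% of column 0
    rejections += lr_stat(y, X, X0) > cutoff
print("rejection proportion:", rejections / n_rep)
```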
Example 3: In this example, five data sets are generated according to the method above, with five different values of $\boldsymbol{\beta}$. We are interested in whether the covariate values below a given threshold can be deleted. The proportion of deleted values is reported for each data set, and the results are shown in Table 1; the five rows of Table 1 correspond to the five data sets. For each data set, the parameter $\boldsymbol{\beta}$ is estimated, the test is conducted for the given deletion, the statistic $\Lambda_n$ is computed, the cut-off value is given, and the corresponding p-value is provided. Note that for our problem, a p-value smaller than $\alpha = 0.05$ means a significant value of $\Lambda_n$, i.e. a significant difference between the regression coefficients of the original covariates and those of the covariates after partial deletion, which in turn implies that the null hypothesis should be rejected, i.e. the partial deletion should not be conducted (Table 1).
Table 1: Simulation results: $\Lambda_n$, cut-off value, and p-value for each $\boldsymbol{\beta}$.
We see that the p-values for rejecting $H_0$ are all smaller than 0.05 for the five sets of $\boldsymbol{\beta}$. This suggests that the covariate values in question should not be deleted at significance level 0.05.
Example 4: In this example, the original $X$ is as in Example 3, but now we replace the entries in the first 100 rows and first three columns by random noise. The deletion proportion is fixed, with the values whose absolute values are smaller than the lower 0.1 percent quantile being deleted. We are interested in whether these noise values can be deleted in this case, i.e. whether $H_0$ is accepted or not. The results are shown in Table 2.
Table 2: Simulation results: $\Lambda_n$, cut-off value, and p-value for each $\boldsymbol{\beta}$.
We see that the p-values for rejecting $H_0$ are all greater than 0.95 for the five sets of $\boldsymbol{\beta}$. This suggests that the data provide strong evidence that the deleted values are noise and are not needed in the data set, at the 0.05 significance level.
We analyze a data set from the Deprenyl and Tocopherol Antioxidative Therapy of Parkinsonism (DATATOP) trial, obtained from the National Institutes of Health (NIH) (for a detailed description and data link, see https://www.ncbi.nlm.nih.gov/pubmed/2515723). It is a multi-center, placebo-controlled clinical trial that aimed to determine whether treatment of patients with early Parkinson's disease could prolong the time before levodopa therapy is required. The number of patients enrolled was 800. The eligible subjects were untreated patients who had had Parkinson's disease (stage I or II) for less than five years and who met other eligibility criteria. They were randomly assigned according to a two-by-two factorial design to one of four treatment groups: 1) placebo; 2) active tocopherol; 3) active deprenyl; 4) active deprenyl and tocopherol. Patients were followed up and re-evaluated every 3 months. At each visit, the Unified Parkinson's Disease Rating Scale (UPDRS), including its motor, mental and activities-of-daily-living components, was evaluated. The statistical analysis was based on the 800 subjects. The results revealed no beneficial effect of tocopherol, while deprenyl was found to significantly prolong the time before levodopa therapy was required, reducing the risk of disability by 50 percent as measured by the UPDRS.
Our goal is to examine whether some of the covariates can be partially deleted. If traditional variable selection methods were used, such as stepwise regression or the Lasso, some covariate(s) would be removed wholly from the analysis. This is not very reasonable, since some of the removed covariates may be partially effective, and removing all their values may yield misleading results, or at least cause a loss of information. We use the proposed method to examine three of the response variables, PDRS, TREMOR and PIGD, and three covariates, Age, Motor and ADL, for all these responses. The deleted covariate values are the ones below the $q$-th data quantile, with $q = 0.01, 0.02, 0.03$ and 0.05. We examine each response and covariate one by one. The results are shown in Table 3, Table 4 and Table 5 below.
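Schematically, this covariate-by-covariate, proportion-by-proportion screening can be organized as in the following sketch. The data loading is omitted and replaced by synthetic stand-ins, `lr_stat` and `mixture_quantile` are the helpers from the earlier sketches, and the single-term weight configuration is a placeholder:

```python
import numpy as np

rng = np.random.default_rng(4)
# placeholder stand-ins for the Parkinson's data (real data loading omitted)
X = rng.normal(size=(800, 3))                               # columns: Age, Motor, ADL
y = X @ np.array([0.1, 0.6, 0.4]) + rng.normal(size=800)    # e.g. response TREMOR

for j, name in enumerate(["Age", "Motor", "ADL"]):
    for q in [0.01, 0.02, 0.03, 0.05]:
        X0 = X.copy()
        X0[X[:, j] < np.quantile(X[:, j], q), j] = 0.0      # delete values below q-quantile
        stat = lr_stat(y, X, X0)                            # helper from the simulation sketch
        cutoff = mixture_quantile(weights=[1.0], dfs=[1])   # placeholder weight configuration
        print(name, q, round(stat, 2), "delete OK" if stat <= cutoff else "keep all values")
```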
Table 3: Response TREMOR: $\Lambda_n$ values and estimated regression coefficients.
Table 4: Response PIGD: $\Lambda_n$ values and estimated regression coefficients.
Table 5: Response PDRS: $\Lambda_n$ values and estimated regression coefficients.
In Table 3, the response TREMOR is examined. For the covariate Age, the likelihood ratio statistic $\Lambda_n$ is larger than the cut-off point at all the deletion proportions, suggesting that no partial deletion of Age at these proportions should be made. For the covariate Motor, $\Lambda_n$ is smaller than the cut-off point at the 0.01 proportion, so this covariate can be partially deleted at this proportion. In other words, the values of Motor smaller than its 1% quantile have no impact on the analysis, can be treated as noise, and should be removed from the analysis. For the covariate ADL, at some of the deletion proportions the likelihood ratio statistic is smaller than the cut-off point, suggesting that the corresponding lower percentage of this covariate's values has no impact on the analysis and should be deleted. After removing the corresponding proportions of Motor and ADL, the model is re-fitted to obtain the parameter estimates shown there. These estimates are more meaningful than the ones based on the whole covariate data, since the noise values of the covariates are removed and only the effective covariate values enter the analysis. However, if traditional variable selection methods were used, such as stepwise regression or the Lasso, the whole covariate Motor, ADL, or both might end up being removed, leading to loss of information or even misleading results.
In Table 4, the response PIGD is investigated. For the covariate Age, $\Lambda_n$ is larger than the cut-off point at the 0.02, 0.03 and 0.05 proportions, suggesting that partial deletion at these proportions is not appropriate. For the covariate Motor, $\Lambda_n$ is smaller than the cut-off point at the deletion proportions 0.02 and 0.03, suggesting that the corresponding lower percentages of its values should be deleted from the analysis. For the covariate ADL, $\Lambda_n$ is larger than the cut-off point at the deletion proportions 0.02, 0.03 and 0.05, hence partial deletion at these proportions is not valid. After deleting the 3% smallest values of Motor, the model is re-fitted to obtain the parameter estimates shown in Table 4. The new estimates are more meaningful since the non-effective values of the covariate Motor are removed from the analysis.
In Table 5, the response is PDRS. The likelihood ratio statistics for Age, Motor and ADL are all larger than the cut-off point at the deletion proportions 0.01, 0.02, 0.03 and 0.05. Thus the null hypothesis is rejected at all these proportions, i.e. no deletion is valid at these proportions, and the analysis should be based on the original full data, with the parameter estimates shown in the tables (Table 3, Table 4 and Table 5).
Note that the coefficient for Age is insignificant, and hence the corresponding values under the deletion proportions are not meaningful.
We have proposed a method for partial variable deletion, in which only some proportion(s) of the values of some covariate(s) are deleted. This is in contrast to the existing methods, which either select or delete entire variables; thus the method is new and is a generalization of existing variable selection. The question is motivated by practical problems. The method can be used to find the effective ranges of the covariates, or to remove possible noise in the covariates, so that the corresponding estimated effects are more interpretable. The proposed test statistic is a generalization of the Wilks likelihood ratio statistic; the asymptotic distribution of the proposed statistic is generally a chi-squared mixture, and the corresponding cut-off point can be computed by simulation. Simulation studies are conducted to evaluate the performance of the method, and it is applied to analyze a real Parkinson's disease data set as illustration. A drawback of the current version of the method is that it requires specifying the proportions of possible deletions for the variables, which makes the optimal proportions not easy to find. In our next step of research we will try to implement an algorithm which finds the optimal proportions automatically and is easier to use. As suggested by a reviewer, simulation studies should be performed for statistical significance tests comparing the proposed method with existing variable selection method(s) to address the contribution of the proposed method. This is a potential direction for our future research work (Appendix).
This research was supported by the Intramural Research Program of the National Institutes of Health. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.