Several efforts have been made to address the multicollinearity problem, which arises from correlated regressors in the linear regression model and renders the Ordinary Least Squares (OLS) estimator inefficient. In this paper, a new modified Liu ridge-type estimator, called the Liu Dawoud-Kibria (LDK) estimator, is proposed in place of the OLS estimator for estimating the parameters of the general linear model. The theoretical comparisons and the simulation study show that the proposed estimator outperforms the existing estimators under some conditions, using the mean squared error criterion. A real-life dataset is used to bolster the findings of the paper.
Liu Dawoud-Kibria, Multicollinearity, OLS estimator, Ridge regression estimator, Monte Carlo simulation, Mean squared error
One of the assumptions of the multiple linear regression model is that the explanatory variables are independent of each other. However, according to Frisch [1], this independence assumption is often violated in real-life situations, leading to the multicollinearity problem. The ordinary least squares (OLS) estimator has been regarded as one of the most important ways of estimating the parameters of the general linear model since it has minimum variance. However, in the presence of multicollinearity, the OLSE is no longer a good estimator. Multicollinearity is generally agreed to be present if there is an approximately linear relationship (i.e., shared variance) among some predictor variables in the data [2]; that is, the term refers to a situation in which there is an exact (or nearly exact) linear relation among two or more explanatory variables [3]. One of the diagnostics for the presence of multicollinearity in a dataset is the variance inflation factor (VIF). VIF values between one and five indicate a moderate level of collinearity, values above five indicate high collinearity, and values above ten are considered extreme. The pairwise correlation of two explanatory variables can also be used to detect multicollinearity; an absolute correlation of 0.7 or above indicates its presence.
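For concreteness, the sketch below (in Python, with hypothetical data) computes VIFs from the inverse of the predictor correlation matrix and a pairwise correlation, illustrating the two diagnostics just described; the variable names and the simulated data are assumptions, not part of the study.

```python
import numpy as np

def vif(X):
    """Variance inflation factors from the inverse of the predictor
    correlation matrix (the j-th diagonal entry equals 1 / (1 - R_j^2))."""
    R = np.corrcoef(X, rowvar=False)          # p x p correlation matrix
    return np.diag(np.linalg.inv(R))

# Illustrative data with two nearly collinear columns (hypothetical).
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)     # strongly correlated with x1
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

print("VIFs:", np.round(vif(X), 2))                              # values > 5 flag collinearity
print("corr(x1, x2):", np.round(np.corrcoef(x1, x2)[0, 1], 3))   # |corr| > 0.7 flags collinearity
```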
Many methods have been proposed to solve this problem by various researchers (Hoerl and Kennard [4]; Liu [5]; Kibria [6]; Sakallioglu and Kaciranlar [7]; Baye and Parker [8]; Yang and Chang [9]; Wu and Yang [10]; Dorugade [11]; Lukman, et al. [12]; Lukman, et al. [13]; Kibria and Lukman [14]; Dawoud and Kibria [15]; and, recently, Owolabi, et al. [16,17]). This study proposes a new two-parameter estimator to circumvent the problem of multicollinearity. Aside from multicollinearity, a few basic terms are defined in Table 1 below.
Table 1: Definition of some basic terms.
The organization of the paper is as follows. The model, the proposed estimator, and the existing estimators are given in Section 2. A comparison of the proposed estimator with some existing ones is presented in Section 3. The biasing parameters are obtained in Section 4, while a Monte Carlo simulation study and a numerical example are presented in Section 5. Section 6 contains concluding remarks.
Consider the following linear regression model:
y = Xβ + ε,          (1)
where y is an n × 1 vector of responses, X is an n × p full column rank matrix, where n and p refer to the sample size and the number of explanatory variables, respectively, β is a p × 1 vector of unknown parameters, and ε is an n × 1 vector of errors assumed to be distributed with mean vector 0 and variance-covariance matrix σ2In.
Based on the Gauss-Markov theorem, the ordinary least squares estimator (OLSE) is given as:
β̂ = (X'X)-1X'y.
The canonical form of Eq. (1) is rewritten as:
y = Zα + ε,          (2)
where Z = XQ, α = Q'β, and Q is the orthogonal matrix whose columns constitute the eigenvectors of X'X. Then Z'Z = Q'X'XQ = Λ = diag(λ1, ..., λp),
where λ1 ≥ λ2 ≥ ... ≥ λp > 0 are the ordered eigenvalues of X'X. The ordinary least squares estimator (OLSE) of equation (2) can be defined as:
α̂ = Λ-1Z'y,
and the mean squared error matrix (MSEM) of α̂ is defined as
MSEM(α̂) = σ2Λ-1.
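The following sketch illustrates, with simulated data, how the canonical form can be obtained via the eigendecomposition of X'X and how the canonical OLSE relates to the usual OLSE; the sample size, coefficients, and seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.normal(size=(n, p))
beta = np.array([1.0, 2.0, 3.0])
y = X @ beta + rng.normal(size=n)

# OLS in the original coordinates: beta_hat = (X'X)^{-1} X'y
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Canonical form: Q holds the eigenvectors of X'X, Z = XQ, alpha = Q'beta,
# and Z'Z = Lambda, the diagonal matrix of eigenvalues.
lam, Q = np.linalg.eigh(X.T @ X)
order = np.argsort(lam)[::-1]                 # lambda_1 >= ... >= lambda_p
lam, Q = lam[order], Q[:, order]
Z = X @ Q
alpha_ols = (Z.T @ y) / lam                   # Lambda^{-1} Z'y

# The two parameterisations agree: beta_hat = Q alpha_hat
assert np.allclose(beta_ols, Q @ alpha_ols)
```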
The ordinary ridge regression (ORR) estimator of Hoerl and Kennard [4] is given as:
α̂ORR = AZ'y = AΛα̂,
where A = (Λ + kI)-1 and k is a non-negative biasing parameter. Its MSEM is given as:
MSEM(α̂ORR) = σ2AΛA + (AΛ − I)αα'(AΛ − I)'.
The Liu estimator is defined as:
α̂d = Eα̂,
where E = (Λ + I)-1(Λ + dI) and d is the biasing parameter of the Liu estimator. The MSEM of α̂d is defined as:
MSEM(α̂d) = σ2EΛ-1E + (E − I)αα'(E − I)'.
The Kibria-Lukman (KL) estimator is defined as:
α̂KL = Pα̂,
where P = (Λ + kI)-1(Λ − kI), and the MSEM of α̂KL is given as
MSEM(α̂KL) = σ2PΛ-1P + (P − I)αα'(P − I)'.
The modified ridge-type (MRT) estimator is defined as:
α̂MRT = Hα̂,
where H = Λ(Λ + k(1 + d)I)-1, and the MSEM is given as
MSEM(α̂MRT) = σ2HΛ-1H + (H − I)αα'(H − I)'.
The Dawoud-Kibria (DK) estimator is defined as:
α̂DK = KMα̂,
where K = (Λ + k(1 + d)I)-1 and M = (Λ − k(1 + d)I), and the MSEM is given as
MSEM(α̂DK) = σ2KMΛ-1KM + (KM − I)αα'(KM − I)'.
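Since Λ and the shrinkage matrices above are all diagonal, the existing estimators can be computed element-wise from the eigenvalues and the canonical OLS estimate. The sketch below does this for illustrative values of k and d; it is a numerical illustration of the definitions above, not code from the paper.

```python
import numpy as np

def canonical_estimators(lam, alpha_ols, k, d):
    """Return the existing estimators as shrinkage transforms of the
    canonical OLS estimate alpha_ols; lam is the 1-D array of eigenvalues,
    so each diagonal matrix is represented by its diagonal."""
    A  = 1.0 / (lam + k)                               # (Lambda + kI)^{-1}
    E  = (lam + d) / (lam + 1.0)                       # Liu
    P  = (lam - k) / (lam + k)                         # Kibria-Lukman
    H  = lam / (lam + k * (1.0 + d))                   # Modified ridge-type
    KM = (lam - k * (1.0 + d)) / (lam + k * (1.0 + d)) # Dawoud-Kibria
    return {
        "ORR": A * lam * alpha_ols,   # (Lambda + kI)^{-1} Lambda alpha_ols
        "Liu": E * alpha_ols,
        "KL":  P * alpha_ols,
        "MRT": H * alpha_ols,
        "DK":  KM * alpha_ols,
    }

# Illustrative usage with a hypothetical ill-conditioned spectrum.
lam = np.array([10.0, 1.0, 0.01])
alpha_ols = np.array([1.0, 0.5, 2.0])
print(canonical_estimators(lam, alpha_ols, k=0.5, d=0.4))
```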
In this paper, we propose a new biasing estimator, called the Liu Dawoud-Kibria (LDK) estimator, by following a method similar to those of Liu [5]; Kaciranlar, et al. [20]; Yang and Chang [9]; and Dawoud, et al. [21]. The proposed LDK estimator of α is obtained by replacing α̂ with the Dawoud-Kibria estimator α̂DK in the Liu estimator, and it becomes:
α̂LDK = Wα̂DK = WSα̂,
where W = (Λ + I)-1(Λ + dI), S = (Λ + k(1 + d)I)-1(Λ − k(1 + d)I), and d and k are the biasing parameters.
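A minimal sketch of the proposed LDK estimator, again exploiting the diagonal structure; the eigenvalues, canonical OLS estimate, and the values of k and d shown are illustrative assumptions.

```python
import numpy as np

def ldk_estimator(lam, alpha_ols, k, d):
    """Proposed Liu Dawoud-Kibria (LDK) estimate: alpha_LDK = W S alpha_ols,
    with W = (Lambda + I)^{-1}(Lambda + dI) and
    S = (Lambda + k(1+d)I)^{-1}(Lambda - k(1+d)I); lam holds the eigenvalues."""
    W = (lam + d) / (lam + 1.0)
    S = (lam - k * (1.0 + d)) / (lam + k * (1.0 + d))
    return W * S * alpha_ols

# Illustrative values (hypothetical), e.g. from the canonical-form sketch above.
lam = np.array([10.0, 1.0, 0.01])
alpha_ols = np.array([1.0, 0.5, 2.0])
print(ldk_estimator(lam, alpha_ols, k=0.5, d=0.4))
```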
The properties of the new estimator, namely the bias vector, the covariance matrix, and the mean squared error matrix (MSEM), are given as follows:
Bias(α̂LDK) = (WS − I)α,
Cov(α̂LDK) = σ2WSΛ-1WS,
MSEM(α̂LDK) = σ2WSΛ-1WS + (WS − I)αα'(WS − I)'.
The following lemmas will be used to make some theoretical comparisons among estimators in the next section.
Lemma 1: Let M and N be n × n matrices with M > 0 and N ≥ 0 (or N > 0). Then M > N if and only if λ1(NM-1) < 1, where λ1(NM-1) is the largest eigenvalue of the matrix NM-1 [22].
Lemma 2: Let M be an n × n positive definite matrix, that is, M > 0, and let α be some vector. Then M − αα' ≥ 0 if and only if α'M-1α ≤ 1 [23].
Lemma 3: Let α̂i = Aiy, i = 1, 2, be two linear estimators of α. Suppose that D = Cov(α̂1) − Cov(α̂2) > 0, where Cov(α̂i), i = 1, 2, denotes the covariance matrix of α̂i, and let bi = Bias(α̂i), i = 1, 2. Consequently,
MSEM(α̂1) − MSEM(α̂2) ≥ 0 if and only if b2'[D + b1b1']-1b2 ≤ 1, where MSEM(α̂i) = Cov(α̂i) + bibi' [24].
Theoretical Comparisons among the Proposed LDK Estimator and the OLSE, Ordinary Ridge Regression (ORR), Liu, Kibria-Lukman (KL), Dawoud-Kibria (DK), and Modified Ridge-Type (MRT) Estimators.
Theorem 1: The estimator α̂LDK is superior to the estimator α̂ (OLSE) in the MSEM sense if and only if
α'(WS − I)'[σ2(Λ-1 − WSΛ-1WS)]-1(WS − I)α ≤ 1,
provided Λ-1 − WSΛ-1WS > 0.
Proof: The difference between the dispersion matrices is given as
Cov(α̂) − Cov(α̂LDK) = σ2(Λ-1 − WSΛ-1WS).
Λ-1 − WSΛ-1WS will be positive definite if and only if (λi + 1)2(λi + k(1 + d))2 > (λi + d)2(λi − k(1 + d))2 for all i = 1, ..., p. Since the OLSE is unbiased, the result follows from Lemma 3 with b1 = 0.
Theorem 2: The estimator α̂LDK is superior to the estimator α̂ORR if and only if
α'(WS − I)'[σ2(AΛA − WSΛ-1WS) + (AΛ − I)αα'(AΛ − I)']-1(WS − I)α ≤ 1,
provided AΛA − WSΛ-1WS > 0.
Proof: The difference between the dispersion matrices is given as
Cov(α̂ORR) − Cov(α̂LDK) = σ2(AΛA − WSΛ-1WS).
AΛA − WSΛ-1WS will be positive definite if and only if λi2(λi + 1)2(λi + k(1 + d))2 > (λi + k)2(λi + d)2(λi − k(1 + d))2 for all i = 1, ..., p. The result then follows from Lemma 3.
Theorem 3: The estimator α̂LDK is superior to the estimator α̂d if and only if
α'(WS − I)'[σ2(EΛ-1E − WSΛ-1WS) + (E − I)αα'(E − I)']-1(WS − I)α ≤ 1,
provided EΛ-1E − WSΛ-1WS > 0.
Proof: The difference between the dispersion matrices is given as
Cov(α̂d) − Cov(α̂LDK) = σ2(EΛ-1E − WSΛ-1WS).
EΛ-1E − WSΛ-1WS will be positive definite if and only if (λi + k(1 + d))2 > (λi − k(1 + d))2 for all i = 1, ..., p. The result then follows from Lemma 3.
Theorem 4: The estimator α̂LDK is superior to the estimator α̂KL if and only if
α'(WS − I)'[σ2(PΛ-1P − WSΛ-1WS) + (P − I)αα'(P − I)']-1(WS − I)α ≤ 1,
provided PΛ-1P − WSΛ-1WS > 0.
Proof: The difference between the dispersion matrices is given as
Cov(α̂KL) − Cov(α̂LDK) = σ2(PΛ-1P − WSΛ-1WS).
PΛ-1P − WSΛ-1WS will be positive definite if and only if (λi − k)2(λi + 1)2(λi + k(1 + d))2 > (λi + k)2(λi + d)2(λi − k(1 + d))2 for all i = 1, ..., p. The result then follows from Lemma 3.
Theorem 5: The estimator α̂LDK is superior to the estimator α̂MRT if and only if
α'(WS − I)'[σ2(HΛ-1H − WSΛ-1WS) + (H − I)αα'(H − I)']-1(WS − I)α ≤ 1,
provided HΛ-1H − WSΛ-1WS > 0.
Proof: The difference between the dispersion matrices is given as
Cov(α̂MRT) − Cov(α̂LDK) = σ2(HΛ-1H − WSΛ-1WS).
HΛ-1H − WSΛ-1WS will be positive definite if and only if λi2(λi + 1)2 > (λi + d)2(λi − k(1 + d))2 for all i = 1, ..., p. The result then follows from Lemma 3.
Theorem 6: The estimator α̂LDK is superior to the estimator α̂DK if and only if
α'(WS − I)'[σ2(KMΛ-1KM − WSΛ-1WS) + (KM − I)αα'(KM − I)']-1(WS − I)α ≤ 1,
provided KMΛ-1KM − WSΛ-1WS > 0.
Proof: The difference between the dispersion matrices is given as
Cov(α̂DK) − Cov(α̂LDK) = σ2(KMΛ-1KM − WSΛ-1WS).
KMΛ-1KM − WSΛ-1WS will be positive definite if and only if (λi + 1)2 > (λi + d)2 for all i = 1, ..., p. The result then follows from Lemma 3.
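As a numerical companion to Theorem 1, the sketch below builds the MSEM of the OLSE and of the LDK estimator from the formulas above and inspects the smallest eigenvalue of their difference; the spectrum, coefficients, σ2, k, and d used here are illustrative choices, not values from the paper.

```python
import numpy as np

def msem_ols(lam, sigma2):
    """MSEM of the canonical OLS estimator: sigma^2 * Lambda^{-1}."""
    return sigma2 * np.diag(1.0 / lam)

def msem_ldk(lam, alpha, sigma2, k, d):
    """MSEM of the LDK estimator:
    sigma^2 * WS * Lambda^{-1} * WS + (WS - I) alpha alpha' (WS - I)'."""
    ws = (lam + d) / (lam + 1.0) * (lam - k * (1 + d)) / (lam + k * (1 + d))
    bias = (ws - 1.0) * alpha
    return sigma2 * np.diag(ws ** 2 / lam) + np.outer(bias, bias)

# Illustrative ill-conditioned setting (hypothetical values).
lam = np.array([44676.2, 5965.4, 810.0, 105.4])
alpha = np.array([0.5, 0.3, 0.2, 0.1])
sigma2, k, d = 9.0, 0.5, 0.4

diff = msem_ols(lam, sigma2) - msem_ldk(lam, alpha, sigma2, k, d)
# Theorem 1 asserts superiority of LDK when this difference is nonnegative definite.
print("smallest eigenvalue of the difference:", np.linalg.eigvalsh(diff).min())
```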
Various researchers have introduced different estimators of k and d for different kinds of regression models; some of these authors are Hoerl and Kennard [4]; Dorugade [25]; Lukman and Ayinde [26]; and Aslam and Ahmad [27], among others. The optimal values of k and d for the proposed estimator are obtained as follows. In determining the optimal value of k, d is fixed. The optimal value of k is taken to be the value that minimizes the scalar mean squared error
g(k, d) = Σi [σ2wi2si2/λi + (wisi − 1)2αi2],
where wi = (λi + d)/(λi + 1) and si = (λi − k(1 + d))/(λi + k(1 + d)).
Taking the partial derivative of the function g(k,d) with respect to k gives
Let
For practical purposes, σ2 and αi are replaced with their estimates σ̂2 and α̂i, respectively. Consequently, (30) becomes
and,
The biasing parameter d proposed by Liu [5], given in equation (38), will be adopted in this study. It is given as follows:
For practical purposes, σ2 and αi can be replaced by σ̂2 and α̂i, respectively. Consequently, (38) becomes
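The sketch below collects plug-in choices for the biasing parameters. The formula for d follows the form of Liu's [5] estimator as it is commonly quoted in the ridge/Liu literature (assumed here to correspond to equation (38)), while the Hoerl-Kennard ridge k is used only as a stand-in, since the paper's own k, obtained by minimizing g(k, d), is not reproduced here.

```python
import numpy as np

def plugin_biasing_parameters(lam, alpha_hat, sigma2_hat):
    """Plug-in choices for the biasing parameters (assumptions, see lead-in).
    k_hat: Hoerl-Kennard ridge estimator, a stand-in for the paper's k.
    d_hat: the estimator attributed to Liu [5] in its commonly quoted form."""
    k_hat = sigma2_hat / np.max(alpha_hat ** 2)
    d_hat = 1.0 - sigma2_hat * np.sum(1.0 / (lam * (lam + 1.0))) \
                  / np.sum(alpha_hat ** 2 / (lam + 1.0) ** 2)
    return k_hat, d_hat
```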
The simulation procedure used by McDonald and Galarneau [28]; Wichern and Churchill [29]; Gibbons [30]; and Lukman and Ayinde [26] was utilized to generate the predictor variables in this study. It is given as:
xij = (1 − ρ2)1/2 zij + ρ zi,p+1,   i = 1, ..., n, j = 1, ..., p,
where the zij are independent standard normal pseudo-random numbers with mean zero and unit variance, ρ is the correlation between any two explanatory variables, and p is the number of explanatory variables. For this study, we considered the values of ρ to be 0.8, 0.9, 0.95, and 0.99, and the number of explanatory variables (p) was taken to be three (3) and seven (7). The error terms were generated following Firinguetti [31], and the parameter values were chosen following Newhouse and Oman [32]. The standard deviations considered in this simulation study were σ = 3, 5, and 10.
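A sketch of the predictor-generation scheme described above; the sample size, seed, and the quick correlation check are illustrative additions.

```python
import numpy as np

def generate_predictors(n, p, rho, rng):
    """Generate predictors via the McDonald-Galarneau scheme:
    x_ij = sqrt(1 - rho^2) * z_ij + rho * z_{i,p+1}.
    The shared z_{p+1} column induces the desired collinearity."""
    Z = rng.standard_normal((n, p + 1))
    return np.sqrt(1.0 - rho ** 2) * Z[:, :p] + rho * Z[:, [p]]

rng = np.random.default_rng(2023)
for rho in (0.8, 0.9, 0.95, 0.99):
    X = generate_predictors(100, 3, rho, rng)
    # Average off-diagonal sample correlation as a quick check of collinearity.
    off_diag = np.corrcoef(X, rowvar=False)[np.triu_indices(3, k=1)]
    print(rho, np.round(off_diag.mean(), 3))
```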
Simulation results and discussion
Table 2 and Table 3 show that as σ and ρ increase, the estimated MSE values increase, whereas as n increases, the estimated MSE values decrease. As expected from previous simulation and empirical studies, when the multicollinearity problem exists, the OLS estimator gives the highest MSE values and performs the worst among all the estimators. Additionally, the results show that the proposed LDK estimator performs best and dominates the rest of the estimators considered in this study. Thus, the findings agree with the theoretical results.
Table 2: Estimated MSE when p = 3, n = 50.
Table 3: Estimated MSE when p = 3, n = 100.
In this section, the Portland cement data are used to demonstrate the performance of the proposed estimator. This dataset was originally used by Woods, et al. [33] and was later analyzed by Li and Yang [34] and Ayinde, et al. [35]. The regression model for these data relates the response y (heat evolved after 180 days of curing, measured in calories per gram of cement) to four explanatory variables: X1 = tricalcium aluminate, X2 = tricalcium silicate, X3 = tetracalcium aluminoferrite, and X4 = β-dicalcium silicate. The variance inflation factors (VIFs) are 38.50, 254.42, 46.87, and 282.51, respectively. The eigenvalues of the matrix X'X are λ1 = 44676.206, λ2 = 5965.422, λ3 = 809.952, and λ4 = 105.419, and the condition number of X'X is approximately 424. The VIFs, the eigenvalues, and the condition number all indicate severe multicollinearity. The estimated parameters and the MSE values of the estimators are presented in Table 4.
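The diagnostics reported above (VIFs, eigenvalues, and the condition number λ1/λ4) can be reproduced along the following lines; "cement.csv" is a hypothetical file name assumed to hold the four regressors and the response.

```python
import numpy as np

# "cement.csv" is a hypothetical file with columns X1, X2, X3, X4, y.
data = np.loadtxt("cement.csv", delimiter=",", skiprows=1)
X, y = data[:, :4], data[:, 4]

# VIFs from the inverse correlation matrix of the predictors.
vifs = np.diag(np.linalg.inv(np.corrcoef(X, rowvar=False)))

# Eigenvalues of X'X and the condition number as the ratio of the
# largest to the smallest eigenvalue (the convention used above).
lam = np.sort(np.linalg.eigvalsh(X.T @ X))[::-1]
cond = lam[0] / lam[-1]

print("VIFs:", np.round(vifs, 2))
print("eigenvalues:", np.round(lam, 3))
print("condition number:", round(cond, 1))
```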
From Table 4, the proposed LDK estimator performs best among the estimators considered, as it gives the smallest MSE value. Just as observed in the simulation study, the OLS estimator did not perform well in the presence of multicollinearity, as it has the highest MSE.
Table 4: The results of the regression coefficients and the corresponding MSE values.
In this paper, a new two-parameter estimator (LDK) is proposed. A theoretical comparison of the proposed estimator with six existing estimators shows its superiority under certain conditions. Results from the simulation study reveal that the proposed estimator performs better than the other existing estimators used in this study under those conditions, which further strengthens the theoretical findings. Application to a real-life dataset also reveals the dominance of the proposed LDK estimator. The newly proposed LDK estimator is therefore recommended for parameter estimation in the linear regression model in the presence of multicollinearity.
The authors declare that they have no known competing interests.
No financial assistance was received for this study.