Statistical classification analysis has been widely used in many fields. In this article, we applied and compared three different classification procedures (logistic regression, Fisher's linear discriminant function and the second order Bahadur representation) on two datasets from two surveys on asthma among healthcare professionals in Texas. The first dataset contained 102 subjects and the second dataset 2963. The concordance of the classifications from the three statistical procedures with possible asthma identified by a physician and with airway responsiveness to methacholine challenge was assessed through Cohen's κ statistic via a series of 2 × 2 contingency tables.
Bahadur representation, Classification, Discrete discriminant analysis, Fisher's linear discriminant function, Logistic regression
The incidence and prevalence of asthma, a chronic inflammatory disease of the airways, are on the rise in the US and have increased by 75% in the past two decades [1]. Estimates of the prevalence of asthma differ based on the definition used and range from 4.5% to as high as 16.4% [2]. It is estimated that more than 14 million persons in the United States suffer from asthma. Community-based studies have reported asthma incidence rates from 0.5 to 2.5 per 1000 [3,4]. Questionnaires have long been a cornerstone of asthma epidemiology studies, and much work has gone into standardizing asthma questionnaires for use in the general population by groups such as the British Medical Research Council (MRC) (1960) [5], the American Thoracic Society (ATS) [6] and the International Union Against Tuberculosis and Lung Disease (IUATLD) [7]. However, in the absence of a gold standard, the definitions of asthma used in surveys vary and may not necessarily correspond to its clinical definition. Relatively few studies have been published with information on formal validation of asthma questionnaires [7-10]. Accurate detection of asthma in epidemiological studies is critical for the proper characterization of etiologic risk factors and triggers and for the identification of prevention and intervention opportunities. Many different approaches have been proposed and revised for the diagnosis of asthma. The current operational definition of asthma was given in the International Consensus Report on the Diagnosis and Treatment of Asthma and is based on three components: chronic airway inflammation, reversible airflow obstruction and enhanced bronchial reactivity, which lead to symptoms of wheezing, breathlessness, chest tightness, cough and sputum production [11].
The Southwest Center for Occupational and Environmental Health at The University of Texas School of Public Health conducted a two-phase survey of asthma among healthcare professionals in Texas [12]. In the first phase, an initial questionnaire was given to a convenience sample of 102 subjects. A methacholine challenge was administered to the 102 subjects in addition to self-administered questions regarding asthmatic symptoms, environmental risk factors and basic demographic characteristics [13]. In Delclos, et al., the logistic regression models were based on 118 subjects (16 subjects in the testing stage were included) [13]. In the current article, however, the 16 subjects in the testing stage were excluded. For the second phase, the refined questionnaire was administered to a random sample of healthcare professionals in Texas. The second phase of the study used a cross-sectional group-comparison design, with a mail survey administered to a sample (n = 5600) of four groups (n = 1400 per group) of Texas healthcare workers: physicians, nurses, respiratory therapists and occupational therapists. Questionnaires were received from 3528 participants, for an overall response rate of 63%. After removing subjects with missing values, we used 2963 subjects with complete responses for model-based classification analysis. In the second phase, no methacholine challenge was given.
For an accurate estimation of prevalence, a proper diagnosis of asthma is necessary. Because of the multivariate nature of the risk factors and the unknown etiology of asthma, there is always uncertainty in the diagnosis [14]. A reasonable diagnosis of asthma for a person by a medical doctor generally requires some period of follow up and sufficient clinical and physiologic information documented during this follow up. One of the purposes of developing the questionnaire from the surveys was to provide a useful instrument for assessing the asthma burden among healthcare professionals in Texas [13]. In the questionnaire, in addition to a sequence of questions on symptoms, environmental risk factors and demographic characteristics, subjects were also asked if they had ever been diagnosed as having asthma by a physician (MD asthma) [13]. Preliminary analysis based on logistic regression identified a subset of eight symptom items that exhibited the best combination of sensitivity and specificity when compared to MD asthma, PC_{20}4 and PC_{20}8, where PC_{20}4 = 1 denotes a ≥ 20% decline in the subject's FEV_{1} (forced expiratory volume in one second) at ≤ 4 mg/ml methacholine challenge and PC_{20}8 = 1 indicates an FEV_{1} fall of at least 20% at ≤ 8 mg/ml for the challenge. The eight symptom items were: 1) Have you ever had trouble with your breathing? 2) Have you had an attack of shortness of breath at any time in the last 12 months? 3) Have you had wheezing or whistling in your chest at any time in the last 12 months? 4) Have you been awakened during the night by an attack of cough in the last 12 months? 5) Have you been awakened during the night by an attack of chest tightness in the last 12 months? 6) When you are near animals, feathers or in a dusty part of the house, do you ever get itchy or watery eyes? 7) When you are near animals, feathers or in a dusty part of the house, do you ever get a feeling of tightness in your chest? 8) When you are near trees, grass or flowers, or when there is a lot of pollen around, do you ever get itchy or watery eyes?
There are many widely used classification procedures in the statistical literature [15]. Clinicians and statisticians have been engaged in discussions of the consistency and discrepancies of the various methods. In this article, we applied and compared three classification analysis methods for the diagnosis of asthma. Two methods (the logistic regression model and Fisher's linear discriminant function) are commonly used, and the third (the Bahadur model) is less commonly used. Our comparative study intends to highlight the utility of these models through real datasets. The three methods were applied to two data sets. The first was a small data set from our phase I survey with 102 subjects. The second was a large data set from our phase II survey with 2963 subjects with complete responses. In Section 2, the three classification techniques studied in this article are briefly reviewed. In Section 3, the results from the three methods on the two data sets are tabulated in a series of 2 × 2 contingency tables and the agreement among the three methods is quantified via the κ statistic. Discussions and concluding remarks are given in Section 4.
The three classification analysis tools applied to the two data sets were logistic regression [16], Fisher's linear discriminant function [17] and the second order Bahadur model [18]. Logistic regression has been widely used in many fields to model a binary dependent variable in response to risk factors [19]. In this study, we coded the dependent variable as 1 for asthma positive and 0 for asthma negative. As noted in the introduction, there is no gold standard for the detection of asthma. For the phase I data, we modeled three dependent variables: asthma diagnosed by a physician (MD asthma = 1) and two levels of response to methacholine challenge (PC_{20}4 = 1 or PC_{20}8 = 1). In the phase II survey, methacholine challenge testing was not performed.
Logistic regression models the probability of asthma in relation to symptoms (risk factors). In our setting, let y = 1 denote MD asthma = 1, PC_{20}4 = 1 or PC_{20}8 = 1, depending on the dependent variable being modeled. The eight symptom variables were described in the previous section. Mathematically, logistic regression establishes a generalized linear model:
$$p\left(y=1 \mid x_{1},x_{2},\ldots,x_{p}\right)=\frac{e^{\alpha+\beta_{1}x_{1}+\beta_{2}x_{2}+\ldots+\beta_{p}x_{p}}}{1+e^{\alpha+\beta_{1}x_{1}+\beta_{2}x_{2}+\ldots+\beta_{p}x_{p}}} \quad \text{(1)}$$
where $x_{j}$, j = 1, 2, ..., p, denotes the p dichotomous symptom variables used in our study. In our case, p = 8. We used R 3.5.0 [20] to estimate the parameters in the model. In general, if the estimated probability satisfied P(y = 1) > 0.5, we would classify a subject with the given combination of symptoms as asthmatic. However, a more careful assessment of the threshold value for classification may be needed in some cases, as discussed in Section 4.
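The classification rule built on Equation (1) can be illustrated with a minimal Python sketch. The intercept and coefficients below are hypothetical values chosen only for illustration; they are not the estimates obtained in this study, and the function names are our own.

```python
import math

def logistic_probability(x, alpha, beta):
    """Equation (1): P(y = 1 | x) for a vector x of 0/1 symptom indicators."""
    eta = alpha + sum(b * xj for b, xj in zip(beta, x))
    # 1 / (1 + exp(-eta)) is algebraically equal to exp(eta) / (1 + exp(eta)).
    return 1.0 / (1.0 + math.exp(-eta))

def classify(x, alpha, beta, threshold=0.5):
    """Return 1 (asthmatic) when the probability exceeds the threshold, else 0."""
    return 1 if logistic_probability(x, alpha, beta) > threshold else 0

# Hypothetical intercept and coefficients for p = 8 symptoms (illustration only).
ALPHA = -3.0
BETA = [1.2, 0.8, 1.5, 0.4, 0.9, 0.3, 1.1, 0.2]

no_symptoms = [0] * 8
all_symptoms = [1] * 8
for x in (no_symptoms, all_symptoms):
    print(x, round(logistic_probability(x, ALPHA, BETA), 4), classify(x, ALPHA, BETA))
```

The `threshold` argument is exposed because the cut-off of 0.5 is only a default; other values may be more appropriate in some applications.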
Fisher's linear discriminant function is another widely used technique in classification analysis. The simplest form of Fisher's linear discriminant function, applied to the classification of two populations, is based on two multivariate normal distributions with equal covariance [17]. In our two survey phases, the symptom variables were generally binary, with "yes" or "no" answers, so the normality assumption behind Fisher's linear discriminant function is clearly questionable for our data. Nevertheless, we included this method for comparison with the other two methods in our study. In applying Fisher's linear discriminant function, we may assume that there was an underlying quantitative process behind the symptoms. For example, subjects answering a question on shortness of breath would dichotomize the underlying obstruction of the airway into a "yes" or "no" response according to a subjective feeling.
Let $N\left(\mu_{1},\Sigma\right)$ and $N\left(\mu_{2},\Sigma\right)$ be the distributions of the symptom vector for the asthmatic and nonasthmatic subjects, respectively, where $\mu_{1}$ and $\mu_{2}$ are the group mean vectors (for dichotomous symptoms, the vectors of proportions of positive responses, $X_{j}=1$, in the asthmatic and nonasthmatic groups) and $\Sigma$ is the common variance-covariance matrix for both populations. Let $X=(X_{1},X_{2},\ldots,X_{p})$ be the vector of symptoms of an individual. Fisher's linear discriminant function would classify a subject with X as asthmatic if
$$x^{\prime}\widehat{\Sigma}^{-1}\left(\widehat{\mu}_{1}-\widehat{\mu}_{2}\right)-\frac{1}{2}\left(\widehat{\mu}_{1}+\widehat{\mu}_{2}\right)^{\prime}\widehat{\Sigma}^{-1}\left(\widehat{\mu}_{1}-\widehat{\mu}_{2}\right)\ge \log\left(k\right), \quad \text{(2)}$$
where x is the observed value of X, the hat on $\mu_{1}$, $\mu_{2}$ and $\Sigma$ denotes the sample version of the parameters and
$$k=\frac{q_{0}C\left(1|0\right)}{q_{1}C\left(0|1\right)},$$
where $q_{1}$ and $q_{0}$ are the prior probabilities of asthma and absence of asthma, respectively, $C\left(1|0\right)$ is the cost of misclassifying a nonasthmatic as asthmatic and $C\left(0|1\right)$ is the cost of misclassifying an asthmatic as nonasthmatic. In our application, we assumed k = 1, which is a commonly used criterion.
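The rule in Equation (2) can be sketched in a few lines of Python. This is an illustrative sketch rather than the code used in this study: the two small groups of two-symptom responses are made-up toy data, the pooled sample covariance is assumed as the estimate of the common $\Sigma$, and the names `mu1` and `mu0` stand for $\widehat{\mu}_{1}$ and $\widehat{\mu}_{2}$.

```python
import math

def mean_vector(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

def pooled_covariance(g1, g0):
    """Pooled sample covariance of two groups (assumed estimate of the common Sigma)."""
    p = len(g1[0])
    s = [[0.0] * p for _ in range(p)]
    for rows in (g1, g0):
        m = mean_vector(rows)
        for r in rows:
            for a in range(p):
                for b in range(p):
                    s[a][b] += (r[a] - m[a]) * (r[b] - m[b])
    dof = len(g1) + len(g0) - 2
    return [[v / dof for v in row] for row in s]

def solve(a, b):
    """Solve the linear system a w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (m[r][n] - sum(m[r][k] * w[k] for k in range(r + 1, n))) / m[r][r]
    return w

def fisher_rule(g1, g0, k=1.0):
    """Return a classifier implementing the inequality in Equation (2)."""
    mu1, mu0 = mean_vector(g1), mean_vector(g0)
    diff = [a - b for a, b in zip(mu1, mu0)]
    w = solve(pooled_covariance(g1, g0), diff)  # Sigma^{-1} (mu1 - mu0)
    midpoint = 0.5 * sum(wi * (a + b) for wi, a, b in zip(w, mu1, mu0))
    def classify(x):
        score = sum(wi * xi for wi, xi in zip(w, x)) - midpoint
        return 1 if score >= math.log(k) else 0
    return classify

# Toy 0/1 responses to two symptoms (made-up data, for illustration only).
asthmatic = [[1, 1], [1, 1], [1, 0], [0, 1], [1, 1]]
nonasthmatic = [[0, 0], [0, 0], [0, 1], [1, 0], [0, 0]]
rule = fisher_rule(asthmatic, nonasthmatic)
print(rule([1, 1]), rule([0, 0]))
```

The sketch assumes the pooled covariance matrix is nonsingular; with dichotomous data this can fail, for example when a symptom is constant in both groups.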
The third method applied to our data sets was the second order Bahadur representation [18,21]. In our application, the symptom variables were all correlated and dichotomous. Let $\theta_{j}=P\left(X_{j}=1\right)$, $j=1,2,\ldots,p$, where $X_{j}$ is one of the symptom variables of asthma, such as cough or shortness of breath; $X_{j}=1$ denotes the presence of the symptom and $X_{j}=0$ its absence. As mentioned previously, in our study, we identified eight symptoms for our comparative classification analysis.
Let ${X}_{j}$ be a binary random variable. The standardized version of ${X}_{j}$ is given by
$${Z}_{j}=\text{}\frac{{X}_{j}-{\theta}_{j}}{\sqrt{{\theta}_{j}\left(1-{\theta}_{j}\right)}}\text{(3)}$$
Define the expectations of their cross products as $\rho_{jk}=E\left(Z_{j}Z_{k}\right)$, $\rho_{jkl}=E\left(Z_{j}Z_{k}Z_{l}\right)$, ..., $\rho_{12\ldots p}=E\left(Z_{1}Z_{2}\ldots Z_{p}\right)$. Bahadur showed that the joint distribution of $X=(X_{1},X_{2},\ldots,X_{p})$ can be written as (Goldstein and Dillon 1978)
$$f\left({x}_{1},{x}_{2},\text{}\mathrm{...}\text{,}{x}_{p}\right)\text{=}P\left({x}_{1},{x}_{2},\text{}\mathrm{...}\text{,}{x}_{p}\right){P}_{\left[1\right]}\left({x}_{1},{x}_{2},\text{}\mathrm{...}\text{,}{x}_{p}\right),\text{(4)}$$
where
$$P\left(x_{1},x_{2},\ldots,x_{p}\right)=1+\sum_{j<k}\rho_{jk}Z_{j}Z_{k}+\sum_{j<k<l}\rho_{jkl}Z_{j}Z_{k}Z_{l}+\ldots+\rho_{12\ldots p}Z_{1}Z_{2}\ldots Z_{p}$$
and
$$P_{\left[1\right]}\left(x_{1},x_{2},\ldots,x_{p}\right)=\prod_{j=1}^{p}\theta_{j}^{x_{j}}\left(1-\theta_{j}\right)^{1-x_{j}}.$$
Assuming that the correlation coefficients of order higher than 2 are zero and using the sample means and sample Pearson correlation coefficients, we obtain the second order (sample) Bahadur representation
$$\widehat{f}\left(x_{1},x_{2},\ldots,x_{p}\right)=\left(\prod_{j=1}^{p}\widehat{\theta}_{j}^{x_{j}}\left(1-\widehat{\theta}_{j}\right)^{1-x_{j}}\right)\left(1+\sum_{j<k}\widehat{\rho}_{jk}\widehat{z}_{j}\widehat{z}_{k}\right), \quad \text{(5)}$$
where the estimates of the mean, the standardized observation and the empirical pairwise correlation coefficient are calculated as follows: $\widehat{\theta}_{j}=\frac{1}{n}\sum_{i=1}^{n}I\left(X_{ij}=1\right)$, $\widehat{z}_{j}=\frac{x_{j}-\widehat{\theta}_{j}}{\sqrt{\widehat{\theta}_{j}\left(1-\widehat{\theta}_{j}\right)}}$ and $\widehat{\rho}_{jk}=\frac{\sum_{i=1}^{n}I\left(X_{ij}=1,X_{ik}=1\right)/n-\widehat{\theta}_{j}\widehat{\theta}_{k}}{\sqrt{\widehat{\theta}_{j}\left(1-\widehat{\theta}_{j}\right)\widehat{\theta}_{k}\left(1-\widehat{\theta}_{k}\right)}}$. Here $X_{ij}$ denotes the response of subject $i$ ($i=1,\ldots,n$) to symptom $j$, and I(condition) is an indicator function that takes a value of 1 if the condition is true and 0 otherwise. The probability $\widehat{f}$ can be estimated based on the sample values of $\theta$ and $\rho$ from the asthmatic and the nonasthmatic group. Let $\widehat{f}_{1}\left(x_{1},x_{2},\ldots,x_{p}\right)$ and $\widehat{f}_{0}\left(x_{1},x_{2},\ldots,x_{p}\right)$ be the probabilities estimated from the asthmatic and the nonasthmatic group, respectively. We would classify a subject with symptoms $x=\left(x_{1},x_{2},\ldots,x_{p}\right)$ into the asthmatic group if
$$\delta\,\widehat{f}_{1}\left(x_{1},x_{2},\ldots,x_{p}\right)>\left(1-\delta\right)\widehat{f}_{0}\left(x_{1},x_{2},\ldots,x_{p}\right), \quad \text{(6)}$$
where $\delta$ is the prior probability of asthma. We assumed $\delta$ = 0.5 in our comparative study of these three classification procedures.
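The second order Bahadur classifier can be sketched as follows. This is a minimal illustration, not the code used in this study: the two groups of two-symptom responses are made-up toy data, the function names are our own, and the comparison in `classify` assumes the rule assigns a subject to the asthmatic group when $\delta\widehat{f}_{1}$ exceeds $(1-\delta)\widehat{f}_{0}$.

```python
import math
from itertools import combinations

def fit_bahadur(rows):
    """Estimate theta_j and the pairwise correlations rho_jk from 0/1 symptom vectors."""
    n, p = len(rows), len(rows[0])
    theta = [sum(r[j] for r in rows) / n for j in range(p)]
    sd = [math.sqrt(t * (1 - t)) for t in theta]  # requires 0 < theta_j < 1
    rho = {}
    for j, k in combinations(range(p), 2):
        joint = sum(1 for r in rows if r[j] == 1 and r[k] == 1) / n
        rho[(j, k)] = (joint - theta[j] * theta[k]) / (sd[j] * sd[k])
    return theta, sd, rho

def bahadur_density(x, model):
    """Second order representation: independence term times (1 + sum rho_jk z_j z_k)."""
    theta, sd, rho = model
    indep = 1.0
    for xj, t in zip(x, theta):
        indep *= t if xj == 1 else 1 - t
    z = [(xj - t) / s for xj, t, s in zip(x, theta, sd)]
    correction = 1.0 + sum(r * z[j] * z[k] for (j, k), r in rho.items())
    return indep * correction

def classify(x, model1, model0, delta=0.5):
    """Return 1 (asthmatic) when delta * f1(x) > (1 - delta) * f0(x), else 0."""
    f1 = bahadur_density(x, model1)
    f0 = bahadur_density(x, model0)
    return 1 if delta * f1 > (1 - delta) * f0 else 0

# Toy 0/1 responses to two symptoms (made-up data, for illustration only).
asthmatic = [[1, 1], [1, 1], [1, 0], [0, 1], [1, 1]]
nonasthmatic = [[0, 0], [0, 0], [0, 1], [1, 0], [0, 0]]
m1, m0 = fit_bahadur(asthmatic), fit_bahadur(nonasthmatic)
print(classify([1, 1], m1, m0), classify([0, 0], m1, m0))
```

With only two symptoms the second order representation reproduces the empirical joint distribution exactly. With more symptoms the truncated expression is an approximation and is not guaranteed to be nonnegative for every symptom pattern.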
We applied the three classification methods to the two data sets from our surveys on Texas healthcare professionals. The pairwise comparisons of the classification results were based on the κ statistic, computed from a series of 2 × 2 contingency tables such as the one shown in Table 1 [22].
Table 1: The agreement table for two classification methods.
In Table 1, $p_{ij}$, $i=0,1$ and $j=0,1$, is the proportion of subjects in category $i$ by method A and in category $j$ by method B. The estimate of the κ statistic is defined in Equation (7). If two methods are in complete agreement, κ = 1. If the observed agreement is greater than chance, κ > 0, and if the observed agreement is less than chance, κ < 0. We used R 3.5.0 to compute the estimates of the κ statistic and its standard deviation (R Project 2019).
$$\widehat{\kappa}=\text{}\frac{{p}_{0}-{p}_{e}}{1-{p}_{e}},\text{(7)}$$
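where $p_{0}=p_{00}+p_{11}$ is the observed agreement and $p_{e}=p_{0.}p_{.0}+p_{1.}p_{.1}$ is the agreement expected by chance, with $p_{i.}$ and $p_{.j}$ denoting the row and column marginal proportions of the table.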
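Equation (7) can be computed directly from the four cell counts of a 2 × 2 agreement table. The following Python sketch is for illustration only (the analyses in this study were run in R); the function name and the example counts are hypothetical and do not come from our data.

```python
def cohen_kappa(n00, n01, n10, n11):
    """Cohen's kappa from the cell counts of a 2 x 2 agreement table.

    n_ij is the number of subjects placed in category i by method A
    and in category j by method B.
    """
    n = n00 + n01 + n10 + n11
    p00, p01, p10, p11 = (c / n for c in (n00, n01, n10, n11))
    p_obs = p00 + p11                            # observed agreement
    p_exp = (p00 + p01) * (p00 + p10) \
        + (p10 + p11) * (p01 + p11)              # agreement expected by chance
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical counts for illustration only (not from the surveys).
print(round(cohen_kappa(40, 10, 5, 45), 4))  # approximately 0.7
```

In this example the observed agreement is 0.85 and the chance agreement is 0.5, so κ = (0.85 - 0.5)/(1 - 0.5) = 0.7. The function is undefined when the chance agreement equals 1, which happens when both methods place every subject in the same single category.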
The values of the κ statistic are shown in Table 2 and Table 3 for phase I and phase II data, respectively.
Table 2: Selected pairwise comparisons of the three classification procedures with MD asthma, PC_{20}4 and PC_{20}8 for the phase I survey of 102 subjects.
Table 3: Pairwise comparisons of the three classification procedures with MD asthma for the phase II survey of 2963 subjects.
Results in Table 2 show that the classification based on logistic regression was highly concordant with a prior physician diagnosis of asthma (MD asthma) for the phase I data. We observed p_{01} = p_{10} = 2/102, and the estimate of the κ statistic was 0.8651, with a standard deviation of 0.0659. Methacholine challenge was given during phase I, and two indicator variables (PC_{20}4 and PC_{20}8) were generated from the outcomes as described in Section 2. The concordance between the logistic regression classification based on the methacholine challenge and MD asthma was low. Similarly low concordance was observed between MD asthma and PC_{20}4 and PC_{20}8 used directly, without logistic regression. The methacholine challenge seemed much more sensitive than the physician's diagnosis. For the phase I data, the concordance between MD asthma and Fisher's linear discriminant function was high: the value of the κ statistic was 0.7414, with a standard deviation of 0.0864. The comparison of the second order Bahadur representation with MD asthma produced a κ statistic of 0.5885. For the three pairwise comparisons among logistic regression, the second order Bahadur representation and Fisher's linear discriminant function, logistic regression and Fisher's method had a high concordance, with κ = 0.8061 and a standard deviation of 0.0761. The Bahadur representation and Fisher's linear discriminant function had κ = 0.7253 with a standard deviation of 0.0861. The κ statistic for concordance between the logistic model and the Bahadur representation was 0.5885 with a standard deviation of 0.1023, the same value as for MD asthma compared directly with the second order Bahadur representation.
In phase II of the survey, no methacholine challenge was given. The data set used in this comparative analysis consisted of 2963 subjects without missing values. Table 3 summarizes the κ statistics and corresponding standard deviations for the six pairwise comparisons. Compared to MD asthma, the three statistical classification procedures showed relatively large κ values, ranging from 0.4917 to 0.5826, although they were lower than those in phase I. For the pairwise comparisons among the three statistical procedures, logistic regression produced an overly sensitive classification since p_{10} = 0. The κ statistics were 0.5224 and 0.7362 when comparing results from logistic regression to the second order Bahadur representation and Fisher's linear discriminant function, respectively. Comparison of the Bahadur representation to Fisher's method resulted in a κ statistic of 0.6749 with a standard deviation of 0.0164.
In this study, we applied three widely used statistical classification techniques to data obtained from two surveys on asthma in Texas healthcare professionals. All three procedures showed a high concordance with a prior physician diagnosis of asthma, although concordance decreased as sample size increased. Fisher's linear discriminant function relies on an assumption of normality for the explanatory variables that was clearly not true in our study. However, our study demonstrated its robustness when applied to dichotomous variables. For the classification based on logistic regression, we used a default cut-off value of 0.5 when classifying a subject as asthmatic or nonasthmatic, which may not be appropriate in other applications. A Bayesian approach, together with a cost function of misclassification, may add more insight into classification. However, this was beyond the scope of this study since it is hard to justify a particular cost function in a general setting. Due to the correlation among the dichotomous variables, we would expect use of the Bahadur representation to produce better classification results. However, we were unable to confirm or reject this in the absence of a true gold standard. Compared with the doctor's classification, logistic regression and Fisher's linear discriminant function provided higher κ values than the Bahadur representation. This is consistent with current practice, since the Bahadur representation is less commonly used in data analysis. The estimates of the κ statistic between logistic regression and Fisher's linear discriminant function showed relatively high consistency for these two methods. For simplicity, only the second order Bahadur representation was used in this study. Analysis with higher order representations was also beyond the scope of our current study.
Statistical classification tools have been widely applied in clinical evaluations. For example, multiple logistic regression was used to design the clinical trial of the Early Treatment for Retinopathy of Prematurity study and to enroll its patients [23]. Gregori, et al. [24] studied and compared statistical classifiers in evaluating coronary artery disease.
This study was partially supported by Grants No. 5R01OH03945-01A1, T42CCT610417 and 5T42OH008421 from the National Institute for Occupational Safety and Health/Centers for Disease Control and Prevention and by Grant No. 71161011/G0107 from the National Natural Science Foundation of China.