Penalized weighted least-squares estimate for variable selection on correlated multiply imputed data (original) (raw)

Journal Article

,

Center for Applied Statistics and School of Statistics, Renmin University of China

,

Beijing

,

China

Search for other works by this author on:

,

School of Statistics, Renmin University of China

,

Beijing

,

China

Search for other works by this author on:

,

School of Statistics, Renmin University of China

,

Beijing

,

China

Search for other works by this author on:

,

College of Public Health, University of Georgia

,

Athens, Georgia

,

USA

Search for other works by this author on:

College of Public Health, University of Georgia

,

Athens, Georgia

,

USA

Address for correspondence: Ye Shen, College of Public Health, University of Georgia, 101 Buck Rd, Athens, Georgia 30602, USA. Email: yeshen@uga.edu

Search for other works by this author on:

Conflict of interest: The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Author Notes

Received:

30 September 2021

Revision received:

13 September 2022

Cite

Yang Li, Haoyu Yang, Haochen Yu, Hanwen Huang, Ye Shen, Penalized weighted least-squares estimate for variable selection on correlated multiply imputed data, Journal of the Royal Statistical Society Series C: Applied Statistics, Volume 72, Issue 3, June 2023, Pages 703–717, https://doi.org/10.1093/jrsssc/qlad028
Close

Navbar Search Filter Mobile Enter search term Search

Abstract

Considering the inevitable correlation among different datasets within the same subject, we propose a framework of variable selection on multiply imputed data with penalized weighted least squares (PWLS–MI). The methodological development is motivated by an epidemiological study of A/H7N9 patients from Zhejiang province in China, where nearly half of the variables are not fully observed. Multiple imputation is commonly adopted as a missing data processing method. However, it generates correlations among imputed values within the same subject across datasets. Recent work on variable selection for multiply imputed data does not fully address such similarities. We propose PWLS–MI to incorporate the correlation when performing the variable selection. PWLS–MI can be considered as a framework for variable selection on multiply imputed data since it allows various penalties. We use adaptive LASSO as an illustrating example. Extensive simulation studies are conducted to compare PWLS–MI with recently developed methods and the results suggest that the proposed approach outperforms in terms of both selection accuracy and deletion accuracy. PWLS–MI is shown to select variables with clinical relevance when applied to the A/H7N9 database.

1 Introduction

Avian influenza A H7N9 (A/H7N9) is a virus that causes excessive production of cytokines, rapidly progressive pneumonia, respiratory failure, acute respiratory distress syndrome (ARDS), and fatal outcomes (Li et al., 2014). The A/H7N9 virus has the highest risk score for the potential threat to human health among the 12 novel influenza A viruses assessed by the Influenza Risk Assessment Tool to date (Wang et al., 2017). Since the first identification of the A/H7N9 virus in 2013, it has caused five epidemic waves, resulting in more than 1,500 infections with a mortality rate close to 40% (Shi et al., 2017). The case fatality risk was estimated to be between 160 and 2,800 per 100,000 symptomatic cases (Yu et al., 2013). The earlier start and larger size of recent A/H7N9 epidemics have prompted concerns about whether the epidemiology has changed to indicate an increasing pandemic threat (Iuliano et al., 2017; Zhou et al., 2017). Therefore, it is of great interest to study the influencing factors for preventing and controlling A/H7N9. An important indicator of the severity and prognosis in patients with A/H7N9 infection is the C-reactive protein (CRP).

CRP is synthesized primarily by the liver; it is a nonspecific acute-phase protein produced in response to most forms of infection, inflammation, and tissue injury (Fang et al., 2015). Both acute systemic Gram-positive and Gram-negative bacterial infections, as well as systemic fungal infections, cause a marked increase in CRP, even in immunodeficient patients (Póvoa, 2002). Thus, this protein has been considered a nonspecific biomarker for early diagnosis and prognostic measurements of infections. The majority of laboratory-confirmed A/H7N9 patients have symptoms of pneumonia, for which CRP is often used as an early, but unspecific, acute-phase inflammatory biomarker (Silvestre et al., 2009). Previous studies also suggest that CRP expression levels are higher in A/H7N9-infected individuals. As the expression level of CRP is negatively associated with prognosis, it is often used as a biomarker for death (Ho et al., 2008). Thus, it is of great clinical interest to study the impact factors of CRP in early prevention and prognosis efforts. Wu et al. (2016) examined the association between CRP and cytokines and further explored the association between cytokines and CRP in H7N9 infections. This article explores A/H7N9 patients’ characteristics associated with CRP through the analysis of an A/H7N9 cohort from Zhejiang province in China. Our motivating dataset was collected from all laboratory-diagnosed A/H7N9 patients in Zhejiang province since the first case of A/H7N9 virus infection occurred in April 2013. The data include demographic, clinical, and laboratory information of 305 patients. We consider CRP as the response variable and other 28 potential influencing factors as predictors, including demographic characteristics, exposure-related characteristics, clinical features, and laboratory indicators. Full details on these variables can be found in the attached tables. We see that a significant portion of the variables in our data is not completely observed, and the missing percentage of the response variable CRP is higher than 50%. Figure 1 illustrates the variable’s missing patterns. However, the existing methods of variable selection have their own challenges. Therefore, we propose the method herein to solve this problem.

Missing schematic of the A/H7N9 dataset. The white area represents the missing.

Figure 1.

Missing schematic of the A/H7N9 dataset. The white area represents the missing.

Missing data are often associated with loss of precision and lead to biased estimation (Ibrahim et al., 2012). They present challenges to the validity of analysis in social sciences, epidemiology, biology, computer science, and many other fields. The most widely implemented strategy for dealing with missing data is to omit observations with missing values and perform a complete case (CC) analysis (Little & Rubin, 2019). However, CC analysis may lead to a selection bias. When there are many variables with missing information, a large portion of valuable research data is discarded when the CC is conducted, which, in turn, is likely to dramatically reduce the statistical power and compromise the accuracy of estimation. The A/H7N9 database under investigation comprises 305 observations and 28 variables; however, many of the variables have missing data. Only eight observations are completely observed across all patients, and half of the variables have missing values. In this case, a CC analysis is not a valid option for data analysis. Johnson et al. (2008) propose a method based on weighted penalized estimating functions (PSR) to combine variable selection techniques with inverse probability weighting for handling missing data. Garcia et al. (2010) propose a likelihood-based method for handling missing data, which incorporates variable selection techniques such as lasso and smoothly clipped absolute deviation (SCAD) penalty into the expectation-maximization algorithm to maximize the penalized observed data likelihood.

Another commonly adopted approach to handle missing data is multiple imputation (MI) (Little & Rubin, 2019). While MI can effectively impute missing data following ignorable missing mechanisms, it also creates multiple datasets after imputation; this makes it challenging to implement reliable variable selection. Based on an MI framework, Heymans et al. (2007) propose to choose common predictors based on the selection results from each dataset and Wood et al. (2008) present an MI–stepwise method, that is, a stepwise variable selection method by repeated use of Rubin’s rules. Besides, Chen and Wang (2013) propose the MI–least absolute shrinkage and selection operator (MI–LASSO) method to deal with the variable selection problem of missing data. The MI–LASSO method treats the estimated regression coefficients of the same variable across all imputed datasets as a group and applies the group LASSO penalty to reach a consistent variable selection across multiply-imputed datasets. Long and Johnson (2015) investigate a general resampling approach (BI–SS) that combines bootstrap imputation and stability selection. BI-SS aims to draw strength from bootstrap imputation and stability selection.

However, similarities among the multiply-imputed datasets are not fully addressed in the MI–LASSO method. The same observations across multiply-imputed datasets are often highly correlated and can be considered repeated measures on a specific subject. Such a correlation structure is often ignored in existing variable selection methods for multiply-imputed data. To capture this correlation structure and prevent potential information loss, we propose a general framework of variable selection on multiply-imputed data with penalized weighted least squares (PWLS–MI) for handling repeated measurements. Specifically, when performing variable selection through the penalized function, the loss function is derived with a working matrix introduced to model the correlation among imputed observations. The proposed PWLS–MI framework is rather flexible, as it can be adjusted to different penalties, such as the LASSO, adaptive LASSO (ALASSO), Elastic Net, SCAD, and minimax concave penalty.

The remaining paper is organized as follows. In Section 2, we propose a PWLS–MI framework to model the correlation structure when performing the variable selection. Several simulation settings are considered in Section 3. An applied analysis of the A/H7N9 data is illustrated in Section 4, followed by a discussion and concluding remarks in Section 5.

2 Methods

Suppose that there are n observations with p covariates. Let Y∈Rn be the n×1 response vector, and X∈Rn×(p+1) be the augmented covariates matrix where the first column is all 1’s. We assume that both the response and the covariates in original dataset may have missing values. As Figure 1 illustrates, the missing patterns of different variables may vary among the observations. Thus, the sample size of the CCs may be small.

2.1 PWLS–MI method

We assume that the complete data are generated from the following linear regression model,

where βtrue is the (p+1)×1 regression coefficient with the first element represents the intercept, and the error term ϵ∼N(0,σ2)⁠. We are interested in selecting important covariates that predict Y⁠.

We assume that missing data can occur in either Y or X and the missing mechanism is missing at random (MAR). Then, we impute D datasets by the MI (Azur et al., 2011).

The Dn×1 response vector is denoted as

Y~=(Y1(1),…,Y1(D),…,Yn(1),…,Yn(D))T,

while the Dn×(p+1) covariate matrix is X~ = (Xij(k)),i=1,…,n;j=1,…,p+1;k=1,…,D⁠. Denote Yi(k) and Xij(k) as the _k_th imputation of the response and imputed covariate j∈[2,p+1] for subject i, respectively. Thus, the linear model can be converted into

where Xi(k) = (Xi1(k),…,Xip(k)),i=1,…,n,k=1,…,D⁠.

For each subject i, we have

where Yi=(Yi(1),…,Yi(D))T⁠, Xi=((Xi(1))T,…,(Xi(D))T)T⁠, and εi=(εi(1),…,εi(D))T⁠. Note that E(Yi|Xi)=Xiβtrue and Var(Yi|Xi)=Var(εi)=σ2R⁠.

Compared with MI–LASSO, we introduce the concept of a D×D working matrix R to model the correlated structure among the imputed datasets within the same subject. Since the within-subject correlation in our framework is derived from the MI using the same model, it is straightforward to assume that the working covariance matrix is exchangeable, that is,

where ρ is the correlation coefficient which is related to the imputation method.

Then, the PWLS–MI method can be written as the optimization problem in the following way:

minβ,R,σ2∑i=1n1σ2(Yi−Xiβ)TR−1(Yi−Xiβ)+Pλ(β).

(1)

In the optimization problem (1), we refer β⁠, σ2⁠, and R to the formal parameters to be optimized, and let β^ represent the estimation of the true coefficient βtrue⁠. We would like to emphasize that the problem (1) is optimized iteratively and thus the parameters β⁠, σ2⁠, and R are estimated jointly. Moreover, during the optimization, there are two points worth noting. On the one hand, we need to specify the form of the penalty term Pλ(β)⁠. On the other hand, parameters σ2 and R−1 also need to be estimated jointly.

For the penalty term, we use the ALASSO penalty as an illustration. That is, Pλ(β)=λ∑j=1pw^j|β^j|⁠, where λ is the tuning parameter, and w^j is the _j_th element of w^=1/|β~|γ⁠, where β~ is the pilot estimator which can be obtained through the ordinary least-square estimation and γ>0 is also a tuning parameter (Zou, 2006). We then apply the local quadratic approximation (LQA) method for optimization (Fan & Li, 2001).

Then, we consider estimation the of σ2 and R−1 in the optimization problem (1). Based on the form of R−1⁠, we need to estimate the correlation coefficient ρ⁠, which can be estimated by the following equation:

ρ^=1n∑i=1n1D(D−1)∑j≠lri(j)ri(l),

where the indexes j and l are from 1 to D with the constraint j≠l⁠, and hence there are D(D−1) terms in the inner sum. Meanwhile, ri(k)=(Yi(k)−μ^i(k))/σ^ is defined as the Pearson residual. The variance σ2 is estimated by σ^2=∑i=1n(Yi−Xiβ^)TR^−1(Yi−Xiβ^)/(n−p), where R^−1 is the estimated working covariance matrix.

2.2 Selection of tuning parameter

To select the optimal tuning parameter λ, we define a novel information criterion, termed the multiply imputed information criterion (MIC), by combining the weighted Bayesian information criterion (BIC) and quasi-generalized cross-validation (QGCV) (Fu, 2005).

First, we would like to denote two sets. Let I1 represent the set of CC indices and I0 represent the set of missing indices, with |I1|=n1 and |I0|=n0⁠. Then, MIC is formulate as

To be specific, BIC is defined as

log{∑k=1D∑i∈I0(Yi(k)−Xi(k)β^)2/n0}+df×log(n0)/n0,

where df is the degrees of freedom of the fitted model, which can be estimated by the number of nonzero fitted coefficients (Chen & Wang, 2013; Zou & Hastie, 2005).

According to Fu (2005), the QGCV with the adaptive lasso penalty is defined as

where λ and γ are the tuning parameters of the penalty function as described in the previous section. For the sake of simplicity, please refer to the appendix for the details of the QGCV calculation.

In addition, note that the BIC is calculated by the samples with completely observed responses in the original dataset, while QGCV is calculated by the samples with missing responses in the imputed datasets. We select the optimal tuning parameter λ by minimizing MIC.

We conduct a simulation to evaluate the performance of MIC in the selection of tuning parameters under the proposed method. Here we consider the situation where both response Y and variable X1 have 50% missing. The missing generation setting of X1 is X1∼X4⁠, and that of response Y is Y∼X1+X3+X5⁠. The simulation results for PWLS-MI (with ALASSO) are shown in Table 1. We measure the variable selection performance for each criterion by sensitivity (SEN), specificity (SPE), and Gmeans. Specifically,

SEN=number of selected important covariatesnumber of true important covariates,SPE=number of removed unimportant covariatesnumber of true unimportant covariates,

Sensitivity measures the proportion of selected important covariates while specificity measures the proportion of non-selected unimportant covariates. Higher sensitivity and specificity reflect a better variable selection performance. The other one is the geometric mean of sensitivity and specificity, which gives an overall performance measure,

The values of the three measures range from 0 to 1. It implies that a variable selection method works better if its values are more close to 1.

Table 1.

The simulation results for PWLS-MI (with ALASSO) with different tuning selection criteria

Measurement
Criterion SEN SPE Gmeans
BIC 0.108 0.947 0.319
QGCV 0.998 0.787 0.886
MIC 0.988 0.933 0.960
Measurement
Criterion SEN SPE Gmeans
BIC 0.108 0.947 0.319
QGCV 0.998 0.787 0.886
MIC 0.988 0.933 0.960

Table 1.

The simulation results for PWLS-MI (with ALASSO) with different tuning selection criteria

Measurement
Criterion SEN SPE Gmeans
BIC 0.108 0.947 0.319
QGCV 0.998 0.787 0.886
MIC 0.988 0.933 0.960
Measurement
Criterion SEN SPE Gmeans
BIC 0.108 0.947 0.319
QGCV 0.998 0.787 0.886
MIC 0.988 0.933 0.960

Table 1 shows the variable selection performance of the optimal tuning parameter with BIC, QGCV, and MIC, respectively. We can find that the optimal tuning parameters with both BIC and QGCV lead to low Gmeans on variable selection. Intuitively, MIC can reach the best performance by incorporating information on CCs and missing cases well.

Moreover, to evaluate the performance of MIC further, we take the MI–Stacked, which applies lasso to one large data set that stacks the M imputed data sets together (Long & Johnson, 2015; Wood et al., 2008) for illustration. Numerical results show that using the MIC criterion can enhance the performance of the variable selection method based on multiple implementation. Please refer to the appendix for the detailed results.

2.3 Algorithms

In this section, we would like to provide more details on the implementation of the proposed method.

Suppose there are missing values in the original data and the missing mechanism is MAR or missing completely at random (MCAR). We adopt the MI procedure to impute the missing data by chained equations (MICE) (Azur et al., 2011). This procedure can be easily implemented through the MICE package in R. Specifically, during the MI procedure, the conditional distribution of each variable should be modelled using the appropriate regression model given the other variables. The imputed values are drawn from the posterior predictive distribution of the missing data given the observed data. The imputed variables are then used in the imputation of the next variables until all variables with missing values are imputed. This process repeats until it converges.

Based on the multiply imputed data, the parameters are estimated through the optimization problem (1). Since the objective function does not have an analytical solution, an iterative algorithm is needed. To be specific, without loss of generality, suppose we have done s iterations of the estimate so far, and we are now at the (s+1)th iteration. First, we can calculate

Δ=[∑i=1nXiTR(s)−1Xiσ2(s)+∂2Pλ(β)∂β2|β=β(s)]−1[∑i=1nXiTR(s)−1(Yi−μ^i(s))σ2(s)−∂Pλ(β)∂β|β=β(s)],

where R(s)−1⁠, σ2(s)⁠, and β(s) represents the estimations of the _s_th estimate. Then, we can obtain (s+1)th estimation of βtrue through β(s+1)=β(s)+Δ⁠. Meanwhile, the estimations of ρ and σ2 are also updated as ρ(s+1) and σ2(s+1) accordingly. We repeat the iterative estimation process until the estimation of βtrue satisfies max(|β(s+1)−β(s)|)≥ϵ⁠, where ϵ represents the pre-defined convergence value.

In addition, in this work, we apply the LQA method for optimization (Fan & Li, 2001). Let βi(s) be the _i_th coefficient estimate of β(s)⁠. As long as (βi(s))2>0⁠, we can make the approximation βi2≈βi2/(βi(s))2⁠. To keep the condition (βi(s))2>0 as well as the coefficient is shrunken to 0, we fix βi(s)=δ when (βi(s))2<δ and set criterion δ as 1×10−10⁠, as discussed in Chen and Wang (2013).

3 Numerical study

In this section, we conduct simulations to evaluate the performance of the proposed method, PWLS–MI (with ALASSO) and PWLS–MI (with LASSO), under different scenarios. The comparable methods are CC analysis with LASSO (CC–LASSO), MI–LASSO (Chen & Wang, 2013), MI–Stacked (Wood et al., 2008), a resampling approach (BI–SS) (Long & Johnson, 2015), and a method based on weighted PSR (Johnson et al., 2008).

3.1 Simulation settings

To fulfil the MAR assumption, we generate missing data with the model

logit{Pr(Zimissing|Ui)}=α+Ui,

indicating that the missing probability for variable Zi depends on covariate Ui⁠. The intercept α is chosen to control the missing percentage.

We simulate 1,000 datasets of size n containing outcome Y and p covariates X under six different scenarios. Scenario 0 is the benchmark setting, we take n=200 and p=20⁠, both Y and X are missing. In scenario 1, only Y is missing with different missing proportions. In scenario 2, we simulate X has different missing proportions. In scenario 3, scenario 4, and scenario 5, we set different correlations between X⁠, different n and p were compared with the benchmark setting respectively. In all cases, covariates are generated from a multivariate normal distribution with zero mean and variance-covariance matrix Σ⁠, wherein the variance is 3. The correlations among the covariates is set to 0.1 except for scenario 3. The missing proportion of Y is set to 50% except for scenario 1. The missing proportion of X is set to 50% except for scenario 2. The response of the _i_th subject in scenarios 0–4 is generated from the linear model

Yi=Xi1−Xi2+Xi3−Xi4+Xi5−Xi6−Xi7+Xi8+ϵi,

where ϵi∼N(0,3),i=1,…,n⁠.

A more comprehensive description of each scenario is detailed as follows.

3.2 Simulation results

We measure the variable selection performance for each method by sensitivity (SEN), specificity (SPE), and Gmeans. Figure 2 illustrates the results of S0. Note that it is not feasible to apply CC–LASSO since there are few complete observations under scenario 0. In this scenario, the SEN of MI–LASSO, MI–stacked, BI–SS, and PSR are similar to PWLS–MI, but the SPE of these methods are much lower than that of PWLS–MI. Hence we consider that PWLS–MI (with both LASSO and ALASSO) has good performance in variable selection compared with others.

Simulation results for MI–LASSO, MI–stacked, BI–SS, PSR, PWLS–MI (with LASSO), and PWLS–MI (with ALASSO) in S0. The bars from V1 to V8 are the selection frequency of important covariates under 1,000 replicates while the bars from V9 to V20 stand for the selection frequency of irrelevant variables.

Figure 2.

Simulation results for MI–LASSO, MI–stacked, BI–SS, PSR, PWLS–MI (with LASSO), and PWLS–MI (with ALASSO) in S0. The bars from V1 to V8 are the selection frequency of important covariates under 1,000 replicates while the bars from V9 to V20 stand for the selection frequency of irrelevant variables.

Tables 2 and 3 report the results for S1 and S2, respectively. The SEN for these methods under different missing proportions are almost 1.000, while the SPE of PWLS–MI (with both LASSO and ALASSO) is dramatically larger than other methods. Thus, PWLS–MI can eliminate more noise covariates than other methods do, although all the methods can select important predictors well. As the proportion of missing grows, it is reasonable that the Gmeans of all five methods decrease. When the proportion of missing is 70%, PWLS–MI still works well since it has the high Gmeans. Table 4 illustrates the specific numerical results for S3, S4, and S5. We find little difference between the results for weak or strong correlated covariates, which reflects that PWLS–MI has wider applicability. With the increase of the sample size, the performance of PWLS–MI improves. As the dimension increases, the performance of the PWLS–MI decreases.

Table 2.

The simulation results for CC–LASSO, MI–LASSO, PWLS–MI (with LASSO), PWLS–MI (with ALASSO), BI–SS, and PSR in S1

Missing Measurement CC–LASSO MI–LASSO PWLS–MI PWLS–MI BI–SS PSR
percentage (Y) (LASSO) (ALASSO)
SEN 0.888 0.919 1.000 1.000 0.931 0.926
30% SPE 0.701 0.843 0.973 0.990 0.894 0.862
Gmeans 0.789 0.880 0.986 0.995 0.912 0.894
SEN 0.932 0.902 1.000 1.000 0.933 0.909
50% SPE 0.603 0.837 0.913 0.923 0.851 0.837
Gmeans 0.750 0.869 0.956 0.961 0.891 0.872
SEN 0.932 0.951 1.000 1.000 0.932 0.825
70% SPE 0.611 0.716 0.758 0.776 0.627 0.718
Gmeans 0.755 0.825 0.871 0.881 0.764 0.815
Missing Measurement CC–LASSO MI–LASSO PWLS–MI PWLS–MI BI–SS PSR
percentage (Y) (LASSO) (ALASSO)
SEN 0.888 0.919 1.000 1.000 0.931 0.926
30% SPE 0.701 0.843 0.973 0.990 0.894 0.862
Gmeans 0.789 0.880 0.986 0.995 0.912 0.894
SEN 0.932 0.902 1.000 1.000 0.933 0.909
50% SPE 0.603 0.837 0.913 0.923 0.851 0.837
Gmeans 0.750 0.869 0.956 0.961 0.891 0.872
SEN 0.932 0.951 1.000 1.000 0.932 0.825
70% SPE 0.611 0.716 0.758 0.776 0.627 0.718
Gmeans 0.755 0.825 0.871 0.881 0.764 0.815

Table 2.

The simulation results for CC–LASSO, MI–LASSO, PWLS–MI (with LASSO), PWLS–MI (with ALASSO), BI–SS, and PSR in S1

Missing Measurement CC–LASSO MI–LASSO PWLS–MI PWLS–MI BI–SS PSR
percentage (Y) (LASSO) (ALASSO)
SEN 0.888 0.919 1.000 1.000 0.931 0.926
30% SPE 0.701 0.843 0.973 0.990 0.894 0.862
Gmeans 0.789 0.880 0.986 0.995 0.912 0.894
SEN 0.932 0.902 1.000 1.000 0.933 0.909
50% SPE 0.603 0.837 0.913 0.923 0.851 0.837
Gmeans 0.750 0.869 0.956 0.961 0.891 0.872
SEN 0.932 0.951 1.000 1.000 0.932 0.825
70% SPE 0.611 0.716 0.758 0.776 0.627 0.718
Gmeans 0.755 0.825 0.871 0.881 0.764 0.815
Missing Measurement CC–LASSO MI–LASSO PWLS–MI PWLS–MI BI–SS PSR
percentage (Y) (LASSO) (ALASSO)
SEN 0.888 0.919 1.000 1.000 0.931 0.926
30% SPE 0.701 0.843 0.973 0.990 0.894 0.862
Gmeans 0.789 0.880 0.986 0.995 0.912 0.894
SEN 0.932 0.902 1.000 1.000 0.933 0.909
50% SPE 0.603 0.837 0.913 0.923 0.851 0.837
Gmeans 0.750 0.869 0.956 0.961 0.891 0.872
SEN 0.932 0.951 1.000 1.000 0.932 0.825
70% SPE 0.611 0.716 0.758 0.776 0.627 0.718
Gmeans 0.755 0.825 0.871 0.881 0.764 0.815

Table 3.

The simulation results for MI–LASSO, PWLS–MI (with LASSO), PWLS–MI (with ALASSO), BI–SS, and PSR in S2

Missing Measurement MI–LASSO PWLS–MI PWLS–MI BI–SS PSR
percentage (X) (LASSO) (ALASSO)
SEN 0.933 1.000 1.000 0.906 0.906
30% SPE 0.766 0.819 0.838 0.870 0.704
Gmeans 0.846 0.905 0.916 0.888 0.799
SEN 0.993 1.000 0.987 0.995 0.937
50% SPE 0.726 0.778 0.872 0.757 0.621
Gmeans 0.849 0.882 0.928 0.868 0.763
SEN 0.930 0.996 0.992 0.953 0.882
70% SPE 0.734 0.740 0.816 0.664 0.556
Gmeans 0.816 0.855 0.900 0.795 0.701
Missing Measurement MI–LASSO PWLS–MI PWLS–MI BI–SS PSR
percentage (X) (LASSO) (ALASSO)
SEN 0.933 1.000 1.000 0.906 0.906
30% SPE 0.766 0.819 0.838 0.870 0.704
Gmeans 0.846 0.905 0.916 0.888 0.799
SEN 0.993 1.000 0.987 0.995 0.937
50% SPE 0.726 0.778 0.872 0.757 0.621
Gmeans 0.849 0.882 0.928 0.868 0.763
SEN 0.930 0.996 0.992 0.953 0.882
70% SPE 0.734 0.740 0.816 0.664 0.556
Gmeans 0.816 0.855 0.900 0.795 0.701

Table 3.

The simulation results for MI–LASSO, PWLS–MI (with LASSO), PWLS–MI (with ALASSO), BI–SS, and PSR in S2

Missing Measurement MI–LASSO PWLS–MI PWLS–MI BI–SS PSR
percentage (X) (LASSO) (ALASSO)
SEN 0.933 1.000 1.000 0.906 0.906
30% SPE 0.766 0.819 0.838 0.870 0.704
Gmeans 0.846 0.905 0.916 0.888 0.799
SEN 0.993 1.000 0.987 0.995 0.937
50% SPE 0.726 0.778 0.872 0.757 0.621
Gmeans 0.849 0.882 0.928 0.868 0.763
SEN 0.930 0.996 0.992 0.953 0.882
70% SPE 0.734 0.740 0.816 0.664 0.556
Gmeans 0.816 0.855 0.900 0.795 0.701
Missing Measurement MI–LASSO PWLS–MI PWLS–MI BI–SS PSR
percentage (X) (LASSO) (ALASSO)
SEN 0.933 1.000 1.000 0.906 0.906
30% SPE 0.766 0.819 0.838 0.870 0.704
Gmeans 0.846 0.905 0.916 0.888 0.799
SEN 0.993 1.000 0.987 0.995 0.937
50% SPE 0.726 0.778 0.872 0.757 0.621
Gmeans 0.849 0.882 0.928 0.868 0.763
SEN 0.930 0.996 0.992 0.953 0.882
70% SPE 0.734 0.740 0.816 0.664 0.556
Gmeans 0.816 0.855 0.900 0.795 0.701

Table 4.

The simulation results for PWLS–MI (with ALASSO) in S3–S5

Scenarios Settings Measurement
SEN SPE Gmeans
S3 φ=0.1 0.987 0.872 0.928
φ=0.5 0.975 0.861 0.916
φ=0.7 0.959 0.836 0.895
S4 n=200 0.987 0.872 0.928
n=600 1.000 0.921 0.959
S5 p=20 0.987 0.872 0.928
p=50 0.989 0.810 0.895
Scenarios Settings Measurement
SEN SPE Gmeans
S3 φ=0.1 0.987 0.872 0.928
φ=0.5 0.975 0.861 0.916
φ=0.7 0.959 0.836 0.895
S4 n=200 0.987 0.872 0.928
n=600 1.000 0.921 0.959
S5 p=20 0.987 0.872 0.928
p=50 0.989 0.810 0.895

Table 4.

The simulation results for PWLS–MI (with ALASSO) in S3–S5

Scenarios Settings Measurement
SEN SPE Gmeans
S3 φ=0.1 0.987 0.872 0.928
φ=0.5 0.975 0.861 0.916
φ=0.7 0.959 0.836 0.895
S4 n=200 0.987 0.872 0.928
n=600 1.000 0.921 0.959
S5 p=20 0.987 0.872 0.928
p=50 0.989 0.810 0.895
Scenarios Settings Measurement
SEN SPE Gmeans
S3 φ=0.1 0.987 0.872 0.928
φ=0.5 0.975 0.861 0.916
φ=0.7 0.959 0.836 0.895
S4 n=200 0.987 0.872 0.928
n=600 1.000 0.921 0.959
S5 p=20 0.987 0.872 0.928
p=50 0.989 0.810 0.895

4 Analysis of the A/H7N9 patients database from Zhejiang province

The proposed PWLS–MI approach is applied to the A/H7N9 patient database to identify the influencing factors for CRP in subjects infected with A/H7N9. We compare the performance of PWLS–MI with that of MI–LASSO while considering two candidate penalties from the framework, namely, ALASSO and LASSO. It is not feasible to present the results of CC–LASSO. Notice that there are only eight complete observations in such a database.

The A/H7N9 dataset from Zhejiang province is part of a national epidemiological survey collected by the Chinese Center for Disease Control and Prevention. It includes all laboratory-confirmed A/H7N9 patients in Zhejiang province since the first case of A/H7N9 infection in April 2013. Details on the data collection methods and variable definitions have been published previously (Martinez et al., 2019). A/H7N9 is a virus that can cause excessive production of cytokines; it can lead to acute pneumonia and ARDS. Previous studies (Wu et al., 2016) show that the CRP is an important indicator of the severity and prognosis of patients with an A/H7N9 infection. This data application aims to explore individual characteristics associated with CRP expression levels to assist in the early prognosis of disease severity.

In this investigation, we treat CRP expression levels as the response variable and consider 28 potential influencing factors, including demographic characteristics, exposure-related characteristics, clinical features, and laboratory indicators. Variable information and missing proportions are listed in Table 5, and further statistical analysis results are given in the supplementary material. Specifically, ‘Exposure’ refers to exposure to the poultry or related environment; ‘Area residential’ indicates urban and rural area; ‘Area exposure’ indicates urban and rural; ‘First medical unit’ refers to the places where initial medical treatment was provided to each subject, and is integrated into three levels, namely, first (village, community, and private), second (county and prefecture), and tertiary (provincial); ‘Age’ is categorized into five levels, namely, <45⁠, [45,55)⁠, [55,65)⁠, [65,75)⁠, and ≥75⁠; ‘White (blood cell) count’ is divided into <3.5⁠, [3.5,10.5]⁠, and >10.5⁠; ‘Highest temperature’ is dichotomized as fever (higher than 37.8 Celsius) versus normal (lower than 37.8 Celsius); ‘Lymphocyte count per cent’ has three levels, namely, low (⁠<0.20⁠), normal (⁠[0.20,0.39]⁠), and high (⁠>0.40⁠); ‘Neutrophil count per cent’ is grouped into four distinct ranges, namely, <0.70⁠, [0.70,0.79)⁠, [0.79,0.86)⁠, and ≥0.86⁠. Variable categorization cut-offs are chosen such that they are consistent with a previous investigation of the same cohort (Martinez et al., 2019). Our proposed approach with two selection penalties, ALASSO and LASSO, is applied to the A/H7N9 dataset. Figure 3 shows the results. Different variable selection results are noted between PWLS–MI approaches and MI–LASSO. For the PWLS–MI approaches with different types of selection penalties, the variable selected by both PWLS–MI (with ALASSO) and PWLS–MI (with LASSO) is ‘Lymphocyte count per cent’, while ‘Both lung infection’ and ‘Age’ are only selected by PWLS–MI (with ALASSO).

Selection results of PWLS–MI (with ALASSO), PWLS–MI (with LASSO), and MI–LASSO in A/H7N9 real data. The horizontal bars stand for the selection frequency of a specific method in 50 Jackknife resampling datasets. The vertical solid line at 0.8 is the threshold as 80%.

Figure 3.

Selection results of PWLS–MI (with ALASSO), PWLS–MI (with LASSO), and MI–LASSO in A/H7N9 real data. The horizontal bars stand for the selection frequency of a specific method in 50 Jackknife resampling datasets. The vertical solid line at 0.8 is the threshold as 80%.

Table 5.

The covariates in the real data of A/H7N9 can be divided into four categories

Variable Variable Missing Estimate
classification proportion
Age 0
<45 −1.463
45∼55 −1.127
55∼65 −0.325
65∼75 Reference
Demography >75 0.130
Sex 0
Area exposure 0
Area residential 0
First medical unit 0
Smoking 60.7%
Hypertension (HP) 4.9%
Underlying diseases (Underlying) 4.9%
Clinical Diabetes (DB) 6.9%
Pulmory 11.1%
Cardiovascular 11.5%
Highest temperature 4.6%
White count 6.9%
Neutrophil count per cent 19.7%
Pneumonia 21.0%
Both lung infection 27.9%
Laboratory no Reference
related yes 0.130
Lymphocyte count per cent 46.9%
low 0.410
normal Reference
high −0.778
Chronic drug use 48.2%
Visit Live Poultry Markets (LPM) 0
Live poultry market worker (LPMW) 0
Direct Contact with Poultry or Swine (LP) 0
Exposure Killed Live or Freshly Slaughtered Poultry (Kill) 0
related Bought Live or Freshly Slaughtered Poultry (Buy) 0
Raise poultry at home or around the house (Raise) 0
Exposure to poultry or related environment (Exposure) 0
Response C-reactive Protein 56.4%
Working matrix Correlation 0.382
Variable Variable Missing Estimate
classification proportion
Age 0
<45 −1.463
45∼55 −1.127
55∼65 −0.325
65∼75 Reference
Demography >75 0.130
Sex 0
Area exposure 0
Area residential 0
First medical unit 0
Smoking 60.7%
Hypertension (HP) 4.9%
Underlying diseases (Underlying) 4.9%
Clinical Diabetes (DB) 6.9%
Pulmory 11.1%
Cardiovascular 11.5%
Highest temperature 4.6%
White count 6.9%
Neutrophil count per cent 19.7%
Pneumonia 21.0%
Both lung infection 27.9%
Laboratory no Reference
related yes 0.130
Lymphocyte count per cent 46.9%
low 0.410
normal Reference
high −0.778
Chronic drug use 48.2%
Visit Live Poultry Markets (LPM) 0
Live poultry market worker (LPMW) 0
Direct Contact with Poultry or Swine (LP) 0
Exposure Killed Live or Freshly Slaughtered Poultry (Kill) 0
related Bought Live or Freshly Slaughtered Poultry (Buy) 0
Raise poultry at home or around the house (Raise) 0
Exposure to poultry or related environment (Exposure) 0
Response C-reactive Protein 56.4%
Working matrix Correlation 0.382

Note. Estimated effect of each selected variables by PWLS–MI (with ALASSO) are listed.

Table 5.

The covariates in the real data of A/H7N9 can be divided into four categories

Variable Variable Missing Estimate
classification proportion
Age 0
<45 −1.463
45∼55 −1.127
55∼65 −0.325
65∼75 Reference
Demography >75 0.130
Sex 0
Area exposure 0
Area residential 0
First medical unit 0
Smoking 60.7%
Hypertension (HP) 4.9%
Underlying diseases (Underlying) 4.9%
Clinical Diabetes (DB) 6.9%
Pulmory 11.1%
Cardiovascular 11.5%
Highest temperature 4.6%
White count 6.9%
Neutrophil count per cent 19.7%
Pneumonia 21.0%
Both lung infection 27.9%
Laboratory no Reference
related yes 0.130
Lymphocyte count per cent 46.9%
low 0.410
normal Reference
high −0.778
Chronic drug use 48.2%
Visit Live Poultry Markets (LPM) 0
Live poultry market worker (LPMW) 0
Direct Contact with Poultry or Swine (LP) 0
Exposure Killed Live or Freshly Slaughtered Poultry (Kill) 0
related Bought Live or Freshly Slaughtered Poultry (Buy) 0
Raise poultry at home or around the house (Raise) 0
Exposure to poultry or related environment (Exposure) 0
Response C-reactive Protein 56.4%
Working matrix Correlation 0.382
Variable Variable Missing Estimate
classification proportion
Age 0
<45 −1.463
45∼55 −1.127
55∼65 −0.325
65∼75 Reference
Demography >75 0.130
Sex 0
Area exposure 0
Area residential 0
First medical unit 0
Smoking 60.7%
Hypertension (HP) 4.9%
Underlying diseases (Underlying) 4.9%
Clinical Diabetes (DB) 6.9%
Pulmory 11.1%
Cardiovascular 11.5%
Highest temperature 4.6%
White count 6.9%
Neutrophil count per cent 19.7%
Pneumonia 21.0%
Both lung infection 27.9%
Laboratory no Reference
related yes 0.130
Lymphocyte count per cent 46.9%
low 0.410
normal Reference
high −0.778
Chronic drug use 48.2%
Visit Live Poultry Markets (LPM) 0
Live poultry market worker (LPMW) 0
Direct Contact with Poultry or Swine (LP) 0
Exposure Killed Live or Freshly Slaughtered Poultry (Kill) 0
related Bought Live or Freshly Slaughtered Poultry (Buy) 0
Raise poultry at home or around the house (Raise) 0
Exposure to poultry or related environment (Exposure) 0
Response C-reactive Protein 56.4%
Working matrix Correlation 0.382

Note. Estimated effect of each selected variables by PWLS–MI (with ALASSO) are listed.

For ‘Both lung infection’ selected by PWLS–MI (with ALASSO), Moayyedkazemi and Rahimirad (2018) report that lung infection is related to the severity of pneumonia, but they do not discuss the relationship with CRP. However, as suggested by Figure 3 in the supplementary material, a majority of our enrolled subjects experienced pneumonia, and Chalmers et al. (2008) show that CRP is an independent marker of the severity of community-acquired pneumonia (CAP). The difference in CRP between both lung-infected and non-both-lung-infected A/H7N9 patients suggested by our analysis establishes the association between CRP expression levels and lung infection severity. The results in the attached Tables also reveal the statistical significance of ‘Both lung infection’. For the variable ‘Age’, as shown in the appendix, the CRP levels increase with increasing age. Besides, there also exists a statistically significant difference among different levels of ‘Age’. This result is consistent with the current clinical knowledge since previous studies show that CRP levels increase with healthy and successful aging as well as in age-related diseases (Tang et al., 2017). The variable ‘Lymphocyte count per cent’ is selected by PWLS–MI, but not by MI–LASSO. Malhotra et al. (2015) report that the severity of pneumonia is related to lymphocytes. Likewise, our PWLS–MI selects ‘Lymphocyte count per cent’ as an influencing factor and the differences in CRP levels among H7N9 patients with low, normal, and high ‘Lymphocyte count per cent’, which, in turn, echoes the findings from Chalmers et al. (2008). In this specific case, PWLS–MI (with ALASSO) can identify additional important variables with clinical evidence, while MI–LASSO fails to do so.

In the data analysis, we use the ‘observed occurrence index (OOI)’ in Huang and Ma (2010) to measure variable selection stability. Specifically, we employ the Jackknife resampling method on the original data 50 times and obtain 50 subsets, each with 200 observations, as per similar procedures suggested in Huang and Ma (2010). For each subset, we perform an MI for the variable with missing information. We subsequently apply PWLS–MI (with ALASSO), PWLS–MI (with LASSO), and MI–LASSO to each subset and collect the variable selection results. We then measure the OOI of a variable by its selection frequency based on 50 selection experiments. Note that the higher selection frequency indicates a higher degree of variable selection stability. As Figure 3 shows, variables selected by PWLS–MI have higher OOI than those by MI–LASSO.

We also list the effect size estimates of the variables selected by PWLS–MI (with ALASSO) under the linear model in Table 5. ‘Both lung infection’ is positively associated with CRP. Additionally, patients with low and high ‘Lymphocyte count per cent’ have higher and lower CRP expression levels compared with those who have ‘Lymphocyte count per cent’ within the normal range, respectively. For ‘Age’, when patients aged between 65 and 75 years are selected as the reference group, all the younger age groups (⁠<45⁠, 45∼55⁠, 55∼65⁠) have lower CRP, while the oldest group (⁠>75⁠) shows increasing expression levels.

5 Discussion

In this paper, we proposed a framework for variable selection on multiply imputed data, which considers the correlation among imputed datasets from MI. We presented the specific mathematical framework under the linear model. To compare our proposed method with existing approaches for variable selection on multiply imputed data, we conducted extensive simulations with various complex settings. The numerical results suggest that our proposed method outperforms other methods in terms of selection and deletion accuracies. A case study on the A/H7N9 data illustrates the various selection results from different approaches and selection penalties in practice.

Many directions for further research remain. Firstly, we focused on the development of a novel variable selection framework for MI data under the linear model. In many other cases, binary responses or count outcomes are of research interest as well. The extension of our proposed framework to generalized linear models is, thus, warranted. Secondly, survival analysis has an important role to play in the prognosis study of cancers, and the problem of missing data also exists in survival analysis. It would be of interest to develop a variable selection framework for multiply imputed survival data.

Acknowledgments

The authors are grateful to the Editor, the Associate Editor, and two anonymous reviewers for their many insightful comments and suggestions. The authors thank Ms. Lin Li for her productive discussion.

Funding

Dr Yang Li’s work is supported by the National Natural Science Foundation of China (72271237) and the Platform of Public Health & Disease Control and Prevention, Major Innovation & Planning Interdisciplinary Platform for the ‘Double-First Class’ Initiative, Renmin University of China, and the MOE Project of Key Research Institute of Humanities and Social Sciences (22JJD910001).

Data availability

The data that support the findings of this study are available on request from the corresponding author (yeshen@uga.edu). The data are not publicly available due to privacy or ethical restrictions. R code used in this article can be found in the Supplementary material.

Supplementary material

Supplementary material are available at Journal of the Royal Statistical Society: Series C online.

Appendix A. Additional Details for QGCV

In this work, we develop a novel information criterion, termed the multiply mputed information criterion (MIC), by combining the weighted BIC and quasi-generalized cross-validation (QGCV). In this section, we would like to provide more details about the QGCV.

According to Fu (2005), the QGCV with the adaptive lasso penalty is defined as

where λ and γ are the tuning parameters of the penalty function, Pλ(β)=λ∑j=1pw^j|β^j|⁠, where λ is the tuning parameter, and w^j is the _j_th element of w^=1/|β~|γ⁠, where β~ is the pilot estimator which can be obtained through the ordinary least-square estimation and γ>0 is also a tuning parameter (Zou, 2006).

Specifically, for each term in equation (3),

WDev(λ,γ)=∑i∈I1ξiTR^−1ξi,N=∑i∈I1D2|R^|=n1D1+(D−1)ρ^,p(λ,γ)=p⋅‖β^(λ,γ)‖1‖β^(0)‖1,

where I1 represent the set of CC indices and with |I1|=n1⁠. following Fu (2005), we define ξi=(ξi(1),…,ξi(D))T as the deviance residuals, where ξi(k)=sign(yi(k)−μ^i(k))−2logL(yi(k),μ^i(k)) is calculated at each multiply imputed data k of subject i, where L(yi(k),μ^i(k)) is the density function or probability distribution for the observation yi(k)⁠, and μ^i(k) is the fitted value. R^−1 represents the estimation of the working correlation matrix, |R^| represents the determinant of the matrix R^⁠, and ρ^ represents the estimated ρ involved in R⁠. ‖β^(λ,γ)‖1 and ‖β^(0)‖1 represent the ℓ1 norm of the proposed estimator and the no–penalty estimator, respectively.

Moreover, note that the BIC is calculated by the samples with completely observed responses in the original dataset, while QGCV is calculated by the samples with missing responses in the imputed datasets.

Appendix B. Additional Simulation Results for MIC

In this section, we conduct the variable selection through MI–Stacked using the proposed MIC criterion. In this simulation, we follow the benchmark setting and measurements in Section 3.1 of the manuscript. In particular, we implement the original MI–Stacked method using cross-validation with deviance loss to select the tuning parameter. MI–stacked (MIC) represents the stacking method that the tuning parameter is selected according to the MIC criterion.

As seen in Table B1, numerical results show that the proposed method performs better than MI–stacked, which means taking the correlation among multiply imputed data sets is helpful for the variable selection. In addition, numerical results indicate that using the MIC criterion can also improve the performance of MI–stacked. Therefore, we conclude that both considering the correlation and using the MIC criterion can enhance the performance of the variable selection method based on multiple implementation.

Table B1.

The simulation results for PWLS–MI, MI–stacked, and MI–stacked (MIC)

Measurement PWLS–MI MI–stacked MI–stacked (MIC)
SEN 1.000 1.000 1.000
SPE 0.794 0.576 0.696
Gmeans 0.891 0.759 0.834
Measurement PWLS–MI MI–stacked MI–stacked (MIC)
SEN 1.000 1.000 1.000
SPE 0.794 0.576 0.696
Gmeans 0.891 0.759 0.834

Table B1.

The simulation results for PWLS–MI, MI–stacked, and MI–stacked (MIC)

Measurement PWLS–MI MI–stacked MI–stacked (MIC)
SEN 1.000 1.000 1.000
SPE 0.794 0.576 0.696
Gmeans 0.891 0.759 0.834
Measurement PWLS–MI MI–stacked MI–stacked (MIC)
SEN 1.000 1.000 1.000
SPE 0.794 0.576 0.696
Gmeans 0.891 0.759 0.834

References

Azur

M. J.

,

Stuart

E. A.

,

Frangakis

C.

, &

Leaf

P. J.

(

2011

).

Multiple imputation by chained equations: What is it and how does it work?

.

International Journal of Methods in Psychiatric Research

,

20

(

1

),

40

49

. https://doi.org/10.1002/mpr.329

Chalmers

J. D.

,

Singanayagam

A.

, &

Hill

A. T.

(

2008

).

C-reactive protein is an independent predictor of severity in community-acquired pneumonia

.

The American Journal of Medicine

,

121

(

3

),

219

225

. https://doi.org/10.1016/j.amjmed.2007.10.033

Chen

Q.

, &

Wang

S.

(

2013

).

Variable selection for multiply-imputed data with application to dioxin exposure study

.

Statistics in Medicine

,

32

(

21

),

3646

3659

. https://doi.org/10.1002/sim.5783

Fan

J.

, &

Li

R.

(

2001

).

Variable selection via nonconcave penalized likelihood and its oracle properties

.

Journal of the American statistical Association

,

96

(

456

),

1348

1360

. https://doi.org/10.1198/016214501753382273

Fang

S.

,

Wang

Y.

,

Sui

D.

,

Liu

H.

,

Ross

M. I.

,

Gershenwald

J. E.

,

Cormier

J. N.

,

Royal

R. E.

,

Lucci

A.

,

Schacherer

C. W.

,

Gardner

J. M.

,

Reveille

J. D.

,

Bassett

R. L.

,

Wang

L.-E.

,

Wei

Q.

,

Amos

C. I.

, &

Lee

J. E.

(

2015

).

C-reactive protein as a marker of melanoma progression

.

Journal of Clinical Oncology

,

33

(

12

),

1389

1396

. https://doi.org/10.1200/JCO.2014.58.0209

Heymans

M. W.

,

Buuren

S. v.

,

Knol

D. L.

,

Mechelen

W. v.

, &

Vet

H. C.

(

2007

).

Variable selection under multiple imputation using the bootstrap in a prognostic study

.

BMC Medical Research Methodology

,

7

(

1

),

33

. https://doi.org/10.1186/1471-2288-7-33

Ho

K. M.

,

Lee

K. Y.

,

Dobb

G. J.

, &

Webb

S. A.

(

2008

).

C-reactive protein concentration as a predictor of in-hospital mortality after icu discharge: A prospective cohort study

.

Intensive Care Medicine

,

34

(

3

),

481

487

. https://doi.org/10.1007/s00134-007-0928-0

Iuliano

A. D.

,

Jang

Y.

,

Jones

J.

,

Davis

C. T.

,

Wentworth

D. E.

,

Uyeki

T. M.

,

Roguski

K.

,

Thompson

M. G.

,

Gubareva

L.

,

Fry

A. M.

, &

Burns

E.

(

2017

).

Increase in human infections with avian influenza A (H7N9) virus during the fifth epidemic–china, october 2016–february 2017

.

MMWR. Morbidity and Mortality Weekly Report

,

66

(

9

),

254

255

. https://doi.org/10.15585/mmwr.mm6609e2

Johnson

B. A.

,

Lin

D.

, &

Zeng

D.

(

2008

).

Penalized estimating functions and variable selection in semiparametric regression models

.

Journal of the American Statistical Association

,

103

(

482

),

672

680

. https://doi.org/10.1198/016214508000000184

Li

Q.

,

Zhou

L.

,

Zhou

M.

,

Chen

Z.

,

Li

F.

,

Wu

H.

,

Xiang

N.

,

Chen

E.

,

Tang

F.

,

Wang

D.

, &

Meng

L.

(

2014

).

Epidemiology of human infections with avian influenza A (H7N9) virus in china

.

New England Journal of Medicine

,

370

(

6

),

520

532

. https://doi.org/10.1056/NEJMoa1304617

Little

R. J.

, &

Rubin

D. B.

(

2019

).

Statistical analysis with missing data

(Vol.

793

).

John Wiley & Sons

.

Malhotra

R.

,

Marcelli

D.

,

Gersdorff

G. von

,

Grassmann

A.

,

Schaller

M.

,

Bayh

I.

,

Scatizzi

L.

,

Etter

M.

,

Guinsburg

A.

,

Barth

C.

, &

Marelli

C

(

2015

).

Relationship of neutrophil-to-lymphocyte ratio and serum albumin levels with c-reactive protein in hemodialysis patients: Results from 2 international cohort studies

.

Nephron

,

130

(

4

),

263

270

. https://doi.org/10.1159/000437005

Martinez

L.

,

Cheng

W.

,

Wang

X.

,

Ling

F.

,

Mu

L.

,

Li

C.

,

Huo

X.

,

Ebell

M. H.

,

Huang

H.

,

Zhu

L.

, &

Li

C

(

2019

).

A risk classification model to predict mortality among laboratory-confirmed avian influenza A H7N9 patients: A population-based observational cohort study

.

The Journal of Infectious Diseases

,

220

(

11

),

1780

1789

. https://doi.org/10.1093/infdis/jiz328

Moayyedkazemi

A.

, &

Rahimirad

M. H.

(

2018

).

Evaluating serum c-reactive protein level in patients with chronic obstructive pulmonary disease and its correlation with disease severity

.

Biomedical Research and Therapy

,

5

(

11

),

2784

2788

. https://doi.org/10.15419/bmrat.v5i11.494

Shi

J.

,

Deng

G.

,

Kong

H.

,

Gu

C.

,

Ma

S.

,

Yin

X.

,

Zeng

X.

,

Cui

P.

,

Chen

Y.

,

Yang

H.

,

Wan

X

,

Wang

,

X.

,

Liu

,

L.

,

Chen

,

P.

,

Jiang

,

Y.

,

Liu

,

J.

,

Guan

,

Y.

,

Suzuki

,

Y.

,

Li

,

M.

, …

Chen

,

H.

(

2017

).

H7N9 virulent mutants detected in chickens in china pose an increased threat to humans

.

Cell Research

,

27

(

12

),

1409

1421

. https://doi.org/10.1038/cr.2017.129

Silvestre

J.

,

Povoa

P.

,

Coelho

L.

,

Almeida

E.

,

Moreira

P.

,

Fernandes

A.

,

Mealha

R.

, &

Sabino

H.

(

2009

).

Is c-reactive protein a good prognostic marker in septic patients?

.

Intensive Care Medicine

,

35

(

5

),

909

913

. https://doi.org/10.1007/s00134-009-1402-y

Wang

X.

,

Jiang

H.

,

Wu

P.

,

Uyeki

T. M.

,

Feng

L.

,

Lai

S.

,

Wang

L.

,

Huo

X.

,

Xu

K.

, &

Chen

E.

(

2017

).

Epidemiology of avian influenza A H7N9 virus in human beings across five epidemics in mainland china, 2013–17: An epidemiological study of laboratory-confirmed case series

.

Lancet Infectious Diseases

,

17

(

8

),

822

832

. https://doi.org/10.1016/S1473-3099(17)30323-7

Wood

A. M.

,

White

I. R.

, &

Royston

P.

(

2008

).

How should variable selection be performed with multiply imputed data?

.

Statistics in Medicine

,

27

(

17

),

3227

3246

. https://doi.org/10.1002/sim.3177

Wu

W.

,

Shi

D.

,

Fang

D.

,

Guo

F.

,

Guo

J.

,

Huang

F.

,

Chen

Y.

,

Lv

L.

, &

Li

L.

(

2016

).

A new perspective on c-reactive protein in H7N9 infections

.

International Journal of Infectious Diseases

,

44

(

1

),

31

36

. https://doi.org/10.1016/j.ijid.2016.01.009

Yu

H.

,

Cowling

B. J.

,

Feng

L.

,

Lau

E. H.

,

Liao

Q.

,

Tsang

T. K.

,

Peng

Z.

,

Wu

P.

,

Liu

F.

,

Fang

V. J.

, &

Zhang

H

(

2013

).

Human infection with avian influenza A H7N9 virus: An assessment of clinical severity

.

The Lancet

,

382

(

9887

),

138

145

. https://doi.org/10.1016/S0140-6736(13)61207-6

Zhou

L.

,

Ren

R.

,

Yang

L.

,

Bao

C.

,

Wu

J.

,

Wang

D.

,

Li

C.

,

Xiang

N.

,

Wang

Y.

,

Li

D.

, &

Sui

H

(

2017

).

Sudden increase in human infection with avian influenza A (H7N9) virus in china, september–december 2016

.

Western Pacific Surveillance and Response Journal: WPSAR

,

8

(

1

),

6

14

. https://doi.org/10.5365/wpsar.2017.8.1.001

Author notes

Conflict of interest: The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

© (RSS) Royal Statistical Society 2023. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com

Supplementary data

Citations

Views

Altmetric

Metrics

Total Views 423

297 Pageviews

126 PDF Downloads

Since 4/1/2023

Month: Total Views:
April 2023 52
May 2023 32
June 2023 14
July 2023 17
August 2023 18
September 2023 6
October 2023 16
November 2023 8
December 2023 16
January 2024 13
February 2024 20
March 2024 43
April 2024 8
May 2024 17
June 2024 36
July 2024 30
August 2024 40
September 2024 34
October 2024 3

×

Email alerts

Citing articles via

More from Oxford Academic