Abstract
We describe the Bedside Patient Rescue (BPR) project, the goal of which is risk prediction of adverse events for non-intensive care unit patients using ∼100 variables (vitals, lab results, assessments, etc.). There are several missing predictor values for most patients, which in the health sciences is the norm, rather than the exception. A Bayesian approach is presented that addresses many of the shortcomings of standard approaches to missing predictors: (i) treatment of the uncertainty due to imputation is straightforward in the Bayesian paradigm, (ii) the predictor distribution is flexibly modeled as an infinite normal mixture with latent variables to explicitly account for discrete predictors (i.e., as in multivariate probit regression models), and (iii) certain missing not at random situations can be handled effectively by allowing the indicator of missingness into the predictor distribution only to inform the distribution of the missing variables. The proposed approach also has the benefit of providing a distribution for the prediction, including the uncertainty inherent in the imputation. Therefore, we can ask questions such as: is it possible this individual is at high risk but we are missing too much information to know for sure? How much would we reduce the uncertainty in our risk prediction by obtaining a particular missing value? This approach is applied to the BPR problem, resulting in excellent predictive capability to identify deteriorating patients. Supplementary materials for this article, including a standardized description of the materials available for reproducing the work, are available as an online supplement.
Original language | English (US) |
---|---|
Pages (from-to) | 32-46 |
Number of pages | 15 |
Journal | Journal of the American Statistical Association |
Volume | 115 |
Issue number | 529 |
DOIs | 10.1080/01621459.2019.1604359 |
State | Published - Jan 2 2020 |
Keywords
- Continuous and categorical
- Dirichlet process
- Hierarchical Bayesian model
- Latent variable
- Missing data
- Multiple imputation
ASJC Scopus subject areas
- Statistics and Probability
- Statistics, Probability and Uncertainty
In: Journal of the American Statistical Association, Vol. 115, No. 529, 02.01.2020, p. 32-46.
Research output: Contribution to journal › Article › peer-review
TY - JOUR
T1 - Prediction and Inference With Missing Data in Patient Alert Systems
AU - Storlie, Curtis B.
AU - Therneau, Terry M.
AU - Carter, Rickey E.
AU - Chia, Nicholas
AU - Bergquist, John R.
AU - Huddleston, Jeanne M.
AU - Romero-Brufau, Santiago
N1 - Funding Information: The Bedside Patient Rescue (BPR) project at Mayo Clinic is an automated alert system to predict risk of deterioration of patients on general care floors. The primary response of interest was the time until one of the following events: cardiac arrest, transfer to the intensive care unit (ICU), or the patient requiring rapid response team intervention. Hospital patients typically show signs of deterioration up to days prior to adverse outcomes like cardiorespiratory arrest (CRA; Buist et al. 1999; Schein et al. 1990). The rate of in-hospital CRA requiring cardiopulmonary resuscitation is estimated to be 0.174 per bed per year in the US (Peberdy et al. 2003). After a cardiac arrest, survival to discharge is estimated to be as low as 18% (Peberdy et al. 2003); therefore, efforts to predict and prevent arrest could prove beneficial (Buist et al. 1999; Schein et al. 1990). There have been several proposed approaches for early warning systems (Kirkland et al. 2013; Griffiths and Kidney 2012; Prytherch et al. 2010; Smith et al. 2013); however, existing automated approaches have been hindered by low positive predictive values (Romero-Brufau et al. 2014).
The BPR project is unique in that it uses ∼100 predictors (vitals, labs, assessments, demographics, etc.) from ∼35,000 patients to create a risk score. The training data were extracted from patients at Mayo Clinic from 2010 to 2011. Many of the variables are time varying, and some are time-lagged vitals (e.g., min and max in the last 24 hr). However, due to engineering decisions made at the time of implementation, the entire time profiles for vitals, etc., are not available for the patients. Several of the candidate predictor variables are the result of clinician-selected multiplicative interactions and ratios (e.g., the shock index is heart rate over blood pressure).
The predictor variables used are described in more detail in Section 4, and the complete list of 98 variables collected and calculated on each patient is provided in the supplementary materials. The focus of this article is on the statistical modeling and analysis of the BPR data, and in particular the treatment of missing data.
Missing data problems are very common in practice, particularly in the health sciences. All patients in the BPR training dataset have a missing value for at least one of their predictors. The average number of missing values per patient is 25, many of these variables being lab tests such as albumin or troponin, which are informative risk factors. Thus, a simple approach of excluding cases or variables that have missing data is not practical. Another common approach is to create a “missing” indicator and include it in the regression. Similarly, tree-based algorithms like the gradient boosting machine (GBM; Friedman 2001) treat missing values as a separate (third) node in a split on a variable and work well for prediction. However, with these approaches interpretation becomes challenging, and they do not leverage the relationships among predictors. Regression-based imputation, for example, random forest imputation (Stekhoven and Bühlmann 2012), can result in good prediction in many cases; however, it assumes no uncertainty about the imputed values, making it difficult to assess uncertainty in predictions and inference.
Multiple imputation (MI), pioneered by Rubin (1976), and maximum likelihood (ML) of the observed data (e.g., Schafer 1997) are intuitively attractive concepts to admit this uncertainty. MI and ML methods typically make the commonly misunderstood assumption of missing at random (MAR; Rubin 1976). The MAR assumption implies that the likelihood of a missing value can depend on the value of the unobserved variable marginally, just not after accounting for all observed variables.
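The MAR definition above can be made concrete with a small simulation. The sketch below is an illustration only: the variable names `vital` and `lab`, the logistic missingness rule, and all numeric settings are assumptions chosen for the example and are not taken from the BPR data. Missingness of `lab` depends only on the always-observed `vital`, so the observed `lab` values are a biased sample marginally, yet an analysis that conditions on `vital` recovers the full-data mean.

```python
import math
import random
import statistics


def simulate_mar(n=20000, seed=0):
    """Simulate (vital, lab, observed); P(lab observed) depends on vital only."""
    rng = random.Random(seed)
    vital, lab, observed = [], [], []
    for _ in range(n):
        v = rng.gauss(0.0, 1.0)                    # always-observed predictor
        l = 0.8 * v + rng.gauss(0.0, 0.6)          # predictor that may be missing
        p_obs = 1.0 / (1.0 + math.exp(-2.0 * v))   # hypothetical MAR mechanism
        vital.append(v)
        lab.append(l)
        observed.append(rng.random() < p_obs)
    return vital, lab, observed


def complete_case_mean(lab, observed):
    """Mean of lab using only the patients whose lab value was observed."""
    return statistics.fmean(l for l, o in zip(lab, observed) if o)


def regression_adjusted_mean(vital, lab, observed):
    """Fit lab ~ vital on complete cases, then average predictions over everyone."""
    xs = [v for v, o in zip(vital, observed) if o]
    ys = [l for l, o in zip(lab, observed) if o]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return statistics.fmean(intercept + slope * v for v in vital)


vital, lab, observed = simulate_mar()
print("full-data mean of lab:   ", round(statistics.fmean(lab), 3))
print("complete-case mean:      ", round(complete_case_mean(lab, observed), 3))
print("regression-adjusted mean:", round(regression_adjusted_mean(vital, lab, observed), 3))
```

In this toy setting the complete-case mean is pulled well away from the full-data mean, while the regression-adjusted estimate stays close to it, which is the sense in which missingness may depend on the unobserved value marginally but not after conditioning on what is observed.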
Both MI and ML essentially aim to marginalize over the distribution of the missing data. The main caveat is that care must be taken to accurately represent the joint distribution of the complete data, that is, for all variables. The prominent approach is to assume a multivariate normal (MVN) distribution for the predictors and the assumed regression model for the response(s) conditional on them. However, predictors are commonly not Gaussian; often many are not even continuous and/or have hard limits. In particular, see the pairwise scatterplots of a few of the BPR predictors presented in Figure 1. Still, it is commonplace to treat categorical data as continuous, that is, MVN, and impute them anyhow. Such methods can work well for certain problems but are known to fail in others (Allison 2000; Horton, Lipsitz, and Parzen 2003; Bernaards, Belin, and Schafer 2007; Finch 2010).
Figure 1. Pairwise scatterplots of a few BPR predictors (a random sample of 2000 observations).
An alternative to the MVN and its shortcomings is to specify separate univariate models for each variable conditional on all other variables. This is called the chained equations or fully conditional approach (Raghunathan et al. 2001; Van Buuren et al. 2006). This approach allows flexibility for discrete variables; for example, categorical variables can be represented by a multinomial logistic model on the other variables. Variables are then imputed one at a time in an iterative fashion resembling a Gibbs sampler. This approach has shown practical success; however, an obstacle to more widespread use is that the full conditional models may not be compatible. That is, they may not correspond to any multivariate distribution (Raghunathan et al. 2001), which raises theoretical concerns and can lead to convergence issues (Li, Yu, and Rubin 2012; Si and Reiter 2013). Postulating a valid multivariate distribution is notoriously challenging in the presence of mixed variable types.
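The chained equations idea described above can be sketched for two continuous variables. This is a deliberately simplified illustration, not the procedure of any cited paper or package: the function names are hypothetical, each conditional model is a plain least-squares line, and the regression parameters are plugged in rather than drawn from a posterior, so the imputation uncertainty is understated relative to proper MI.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares intercept, slope, and residual SD for ys ~ xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sd = (sum(r * r for r in resid) / (len(xs) - 2)) ** 0.5
    return intercept, slope, sd


def chained_imputation(rows, n_iter=20, seed=0):
    """rows: list of [x1, x2] with None for missing; returns a completed copy."""
    rng = random.Random(seed)
    data = [list(r) for r in rows]
    missing = [[v is None for v in r] for r in rows]
    # Initialise each missing entry with the observed mean of its column.
    for j in range(2):
        col_mean = statistics.fmean(r[j] for r in rows if r[j] is not None)
        for r in data:
            if r[j] is None:
                r[j] = col_mean
    # Cycle through the variables, re-imputing one at a time.
    for _ in range(n_iter):
        for j in range(2):
            k = 1 - j  # index of the other variable
            # Fit variable j on variable k using rows where j was observed.
            pairs = [(r[k], r[j]) for r, m in zip(data, missing) if not m[j]]
            a, b, sd = fit_line([p[0] for p in pairs], [p[1] for p in pairs])
            # Draw new values for the missing entries of j, with residual noise.
            for r, m in zip(data, missing):
                if m[j]:
                    r[j] = a + b * r[k] + rng.gauss(0.0, sd)
    return data
```

Calling `chained_imputation(rows)` on a list such as `[[1.2, None], [None, 0.4], [0.3, 0.1], ...]` returns rows with every `None` replaced by a random draw, leaving observed values untouched. With only two Gaussian variables the conditionals are compatible; the compatibility concern raised in the text arises when different model families are mixed across many variables.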
One solution is the conditional Gaussian approach (Schafer 1997), where a log-linear model is specified for the discrete variables, and conditional on them, an MVN distribution is assumed for the continuous variables. However, even a modest number of categorical variables can lead to known difficulties in estimation (Horton and Kleinman 2007; Si and Reiter 2013). Another approach to specify a (valid) joint model is via specification of univariate sequential conditional distributions (e.g., Lipsitz and Ibrahim 1996). Xu, Daniels, and Winterstein (2016) extend this approach to use Bayesian additive regression trees as the model for the univariate conditionals. The drawback to this type of approach is that the specification is not invariant to the order of the conditioning. Thus, different orderings will lead to different joint distributions.
Locally independent mixtures of multinomial distributions (i.e., where covariates are independent within a mixture component) have previously been used for MI of categorical data (Vermunt et al. 2008; Gebregziabher and DeSantis 2010; Si and Reiter 2013). Recently, Murray and Reiter (2016) have extended this approach to handle mixed categorical and continuous predictors with a conditional Gaussian framework, melding mixtures of independent multinomials for the categorical variables and mixtures of MVN conditional on the categorical values for the continuous variables. However, with a large number of discrete variables, a very large number of mixture components will be needed to represent even simple dependencies between variables.
Dirichlet process models (DPMs) have been widely used for the flexible estimation of multivariate distributions. Wade et al. (2014) divided these works into two classes, (i) the joint approach which formulates an infinite mixture model for the collection of the predictors and the response y (e.g., Müller, Erkanli, and West 1996; Shahbaba and Neal 2009), and (ii) the conditional approach which formulates a mixture model for and a separate model for. As documented in Wade et al. (2014), the joint approach has difficulties as p becomes large, particularly when accurate estimation of is important. They proposed an enriched DPM, which allows a nested clustering structure to alleviate this issue. This model was also used in Roy et al. (2017) in the context of inference in the presence of missing data. However, all of these works operate under the assumption of local independence and will require a large number of components to represent the dependency among the covariates, particularly when p is large.
In this work, we extend the multivariate probit approach to modeling discrete variables (Lesaffre and Kaufmann 1992; Chib and Greenberg 1998) to allow for a multivariate representation of categorical and continuous variables, as suggested by Dunson (2000) and Gueorguieva and Agresti (2001). A joint model for the predictors is created by assuming an MVN model for, the collection of continuous predictors and (continuous) latent variables for categorical predictors. We then relax the MVN assumption with a DPM for the distribution of. A similar model is proposed in Antonio and David (2015). However, with a large number of covariates, DPMs with an MVN kernel struggle to produce many components due to the large number of parameters in each component (Storlie et al. 2017). To overcome the dimensionality issues, we use a variable selection approach (Raftery and Dean 2006; Kim, Tadesse, and Vannucci 2006; Storlie et al. 2017) to the DPM to allow the component parameters to vary on only a sparse subset of the xj.
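The latent-variable construction for mixed predictors can be illustrated with a minimal sketch. It covers only a single bivariate normal component with one continuous and one binary predictor; the Dirichlet process mixture and the variable selection step are not attempted here. The function names, the zero threshold, and the correlation values are assumptions made for the example.

```python
import random
import statistics


def sample_mixed(n, rho, seed=0):
    """Draw (continuous, binary) pairs from a latent bivariate normal.

    The continuous predictor is the first latent coordinate itself; the binary
    predictor is 1 when the second latent coordinate exceeds 0 (probit-style).
    """
    rng = random.Random(seed)
    sample = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        # z2 has unit variance and correlation rho with z1.
        z2 = rho * z1 + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
        sample.append((z1, 1 if z2 > 0.0 else 0))
    return sample


def group_means(sample):
    """Mean of the continuous predictor within each level of the binary one."""
    return {
        level: statistics.fmean(x for x, b in sample if b == level)
        for level in (0, 1)
    }


for rho in (0.0, 0.8):
    means = group_means(sample_mixed(20000, rho))
    print("rho =", rho, "group means:", {k: round(v, 3) for k, v in means.items()})
```

When the latent correlation is zero the two group means nearly coincide, and when it is large they separate, so a single covariance parameter on the latent scale carries the dependence between a discrete and a continuous predictor. That is what lets one joint (latent) normal model describe both variable types at once.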
Despite the flexibility and broad utility of such an approach, to the best of our knowledge, it has not been used to model mixed categorical and continuous data.
We also investigate the inclusion of a missingness indicator variable in the model for. This essentially amounts to formulating a selection model (Heckman 1976; Amemiya 1984), where here the selection probabilities (i.e., probabilities of observing the variables) depend on the rest of. These indicators are not part of the regression model for y; rather, they serve only to inform the distribution of the missing variables. Certain types of missing not at random (MNAR) situations can be handled effectively in this manner.
We use a Bayesian approach to estimate this model, which has similarities to both MI and ML approaches in that it aims to marginalize over the missing data and obtain the posterior distribution of the relevant parameters, conditional on only the observed data. This is typically accomplished by sampling the missing values inside of the Markov chain Monte Carlo (MCMC) algorithm in a Gibbs-type framework; conceptually, each MCMC iteration contains two steps: (i) complete the data conditional on the model parameters and (ii) update the parameters conditional on the complete data. One of the advantages of Bayesian analysis for this problem is that it provides natural measures of uncertainty for parameters via the posterior distribution. An important consequence of this is that risk prediction for a given patient is represented as a distribution, including the uncertainty due to the missing predictors. Therefore, we can ask questions such as, how much could we reduce the uncertainty in our risk prediction by obtaining a particular lab value?
The rest of the article is laid out as follows. Section 2 describes the proposed Bayesian missing data approach for mixed categorical and continuous variables.
Section 3 evaluates the performance of this approach against several other methods on a number of simulation cases. The approach is then applied in Section 4 to the problem for which it was designed, where a comprehensive analysis of the BPR problem is presented. Section 5 concludes the article. This article also has online supplementary materials containing details of the MCMC algorithm and descriptions of the variables in the BPR dataset.
Publisher Copyright: © 2019 American Statistical Association.
PY - 2020/1/2
Y1 - 2020/1/2
KW - Continuous and categorical
KW - Dirichlet process
KW - Hierarchical Bayesian model
KW - Latent variable
KW - Missing data
KW - Multiple imputation
UR - http://www.scopus.com/inward/record.url?scp=85067881880&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85067881880&partnerID=8YFLogxK
U2 - 10.1080/01621459.2019.1604359
DO - 10.1080/01621459.2019.1604359
M3 - Article
AN - SCOPUS:85067881880
SN - 0162-1459
VL - 115
SP - 32
EP - 46
JO - Journal of the American Statistical Association
JF - Journal of the American Statistical Association
IS - 529
ER -