Home Nephro Research Explainable artificial intelligence model to predict acute critical illness from electronic health records

Explainable artificial intelligence model to predict acute critical illness from electronic health records

Data description

In this study, we analyzed the secondary healthcare data of all residents of four Danish municipalities (Odder, Hedensted, Skanderborg, and Horsens) who were 18 years of age or older for the period of 2012–2017. The data contained information from the electronic health record (EHR), including biochemistry, medicine, microbiology, and procedure codes, and was extracted from the “CROSS-TRACKS” cohort, which embraces a mixed rural and urban multi-center population with four regional hospitals and one larger university hospital. Each hospital comprises multiple departmental units, such as emergency medicine, intensive care, and thoracic surgery. We included all 163,050 available inpatient admissions (45.9% male) during the study period and excluded only outpatient admissions. The included admissions were distributed across 66,288 unique residents. The prevalence for sepsis, AKI, and ALI among these admissions was 2.44%, 0.75%, and 1.68%, respectively (see Table 1). The CROSS-TRACKS cohort offers a combined dimensional model of the secondary healthcare data. Merging all data sets is possible via a unique personal identification number given to all Danish citizens and by which all information within any public institution is collected31.

The model parameters were limited to include 27 laboratory parameters and six vital signs (see Tables 2 and 3). The parameters were selected by trained specialists in emergency medicine (medical doctors) with the sole purpose of simplifying the model to enable a better discussion of the model explanations. While a deeper model with more parameters might lead to better performance, it would also have made the discussions between clinicians and software engineers difficult. Therefore, the scope of this article is not to obtain the best performance at all costs but to demonstrate how clinical tasks can be supported by a fully explainable deep learning approach.

Data preprocessing

In the data extracted from the CROSS-TRACKS cohort, each admission is represented as a time-ordered sequence of EHR events. Importantly, the time-stamped order of this data reflects the point in time at which the clinicians record each event during the admission. Each event comprises three elements: a time stamp; an event name, such as blood pressure; and a numerical value. The event sequence is partitioned in aggregated intervals of one hour, such that the observation window of 24 h is divided into 24 one-hour periods, and all the events occurring within the same one-hour period are grouped together by their average numerical value.

Gold standards

Via a classification process, each admission was classified as sepsis-positive, AKI-positive, ALI-positive, or negative (no critical illness). For sepsis classification, we followed the recent Sepsis-330,39 implementation by Moor et al.5, according to which both suspected infection and organ dysfunction are required to be present5,30,39. Suspected infection was defined by the co-occurrence of body fluid sampling and antibiotic administration. When a culture sample was obtained before antibiotics administration, the antibiotic had to be ordered within 72 h. If the antibiotic was administered first, then the culture sample had to follow within 24 h.

The degree of organ dysfunction is described by an acute increase in the SOFA40 score and an increase of more than or equal to two points is used in the criteria for sepsis39. To implement the organ dysfunction criterion, we used a 72-h window from 48 h before to 24 h after the time of suspected infection, as suggested by Singer et al.30 and Moor et al.5. The Sepsis-3 implementation is visualized in Fig. 5.

AKI classification was performed according to the KDIGO criteria31. KDIGO accepts three definitions of AKI: (1) an increase in serum creatinine of 0.3 mg/dl (26.5 μmol/l) within 48 h; (2) an increase in serum creatinine by 1.5 times the habitual creatinine level of a patient within the previous seven days; and (3) a urine output of <0.5 ml/kg/h over 6 h. Following the work of Tomašev et al.1, only the first two definitions were used to provide ground-truth labels for the onset of AKI as urine measurements were not available. The habitual creatinine level was computed as the mean creatinine level during the previous 365 days. We used binary encoding for AKI such that all three severity stages (KDIGO stages 1, 2, and 3) were encoded as positive AKI. For ALI classification, we considered the presence of either NIV or CPAP during the admission, because PaO2/FiO2 measurements were not available. The ALI onset was the first occurrence of either NIV or CPAP (see Fig. 5).

Prediction module

The AI-EWS model is designed as a variation of a convolutional neural network (CNN) called a temporal convolutional network (TCN). CNNs have dominated computer vision tasks for the last century and are also highly capable of performing sequential tasks, such as text analysis and machine translation41. A TCN23,24 models the joint probability distribution over sequences by decomposing the distribution over discrete time-steps (p_theta left( x right) = mathop {prod }nolimits_{t = 1}^T p_theta left( {left. {x_{t}} right|x_{1:t – 1}} right)), where (x = left{ {x_1,x_2, ldots ,x_{{T}}} right}) is a sequence, and the joint distribution is parameterized by the TCN parameter (theta). Thus, a TCN operates under the autoregressive premise that only past values affect the current or future values, e.g., if a patient will develop acute critical illness. Moreover, TCNs differ from “ordinary” CNNs by at least one property: the convolutions in TCNs are causal in the sense that a convolution filter at time t is only dependent on the inputs that are no later than (t), wherein the input subsequence is (x_1,x_2, ldots ,x_{{t}}). TCNs can take a sequence of any length as input and output a sequence of the same length, similar to RNNs22,28. The TCN achieves this by increasing the receptive field of the model with dilated convolutions instead of performing the traditional max pooling operation, as seen in most CNNs. Dilated convolutions achieve a larger receptive field with fewer parameters by having an exponential stride compared to the traditional linear stride. By increasing the receptive field, a temporal hierarchy comparable to multi-scale analysis from computer vision can be achieved42. Figure 6 schematizes the xAI-EWS model and the concept of dilated convolutions. At the time of prediction, the xAI-EWS model receives an input matrix of shape time-steps × features for each patient.

The data are processed by three temporal blocks, each including one-dimensional dilated causal convolutions (Conv1d) with 64 filters, ReLU activations43, layer44, and one-dimensional spatial dropout layers45. Dilation is increased between each temporal block, but keep it constant inside each temporal block (meaning that the second conv1d layer in each temporal block has a dilation = 1). The receptive field for this model can be calculated with (1 + mathop {sum }nolimits_{i = 1}^n left( {k – 1} right) ast left( {2^{i – 1} + 1} right)), where k is kernel size and n is the number of temporal blocks. We used a kernel size = 4 yielding a maximum receptive field of 31. Outputs from the third temporal block are pooled together across time-steps by a global average pooling operation46 to obtain a stabilizing effect for the final output of the model. The pooled output from each kernel in the dilated causal convolutions is flattened to a single vector that is used as input to a final dense layer followed by a softmax activation function. The output from the softmax activation is the probability of future sepsis, AKI, or ALI during admission.

Training and hyperparameters

The model was trained to optimize the cross-entropy loss using the Adam optimizer47 with mini-batches of the size of 200, a learning rate of 0.001, and a dropout rate of 10%. All weights were initialized with He Normal initialization48. The model was trained on a NVIDIA Tesla V100 GPU. Convergence was reached in ~30 min.

Explanation module

In simple models, such as linear regression models, the simple association between input and the prediction outcome is readily transparent and explainable. Consider the linear function fc that weights the input x by wc in order to assign a decision for class c:

$$f_cleft({mathbf{x}} right) = {mathbf{w}}_c^Tx = mathop {sum }limits_i w_{ic}x_i.$$

(1)

Here, each input feature xi of x contributes together with the trainable weight wic to the overall evaluation of fc through the quantity (w_{ic}x_i). The importance-weighted input, therefore, offers a simple explanation for a decision made by the linear model. In contrast, the complexity associated with the multi-layer non-linear nature of deep learning models counteracts with such simplicity in explanations.

Layer-wise relevance propagation (LRP)23,24,25,26,27 is an explanatory technique that applies to deep-learning models, including TCNs. Starting from the output (f_cleft({mathbf{x}} right)), LRP decomposes an explanation into simpler local updates, each recursively defining the contribution to the explanation (called relevance) for all activating neurons in the previous layer. The initial relevance score (R_j = f_cleft({mathbf{x}} right)) is hereby propagated backward through the network by local relevance updates (R_{i leftarrow j}) between connecting neuronsi and j, until the input layer is finally reached. In this process, all incoming relevance values to an intermediate node i are pooled, (R_i = mathop {sum }nolimits_j R_{i leftarrow j}), before its relevance is propagated to the next layer. Figure 7 illustrates the relevance propagation, which is similar to standard backpropagation of errors except that relevance values are propagated backward in the network instead. The conservation property23, one of the important defining properties in LRP, ensures that the total back-propagated relevance amounts to the extent to which the illness of interest is detected by the function (f_cleft({mathbf{x}} right)), which in this paper equals the logits that feed into the final transformation layer.

In this process, all incoming relevance values to an intermediate node i are pooled, (R_i = mathop {sum }nolimits_j R_{i leftarrow j}), before its relevance is propagated to the next layer. Figure 7 illustrates the relevance propagation, which is similar to standard backpropagation of errors except that relevance values are propagated backward in the network instead.

There are many variations of local backpropagation rules in the LRP framework. See, e.g., Montavon et al.49 for a collection of commonly used LRP rules. We have used propagation rules that can be interpreted as DTD23, which defines a sound theoretical framework behind most of the LRP variations. In DTD, a local backward propagation of relevance accounts for a non-linearity in the network model by a first-order Taylor approximation at some well-chosen root-point. Using the origin as root recovers the original LRP update rule from Bach et al.25. That is, the relevance Rj at neuron j propagates backward to neuron i as

$$R_{i leftarrow j} = frac{{w_{ij}a_i}}{{mathop {sum }nolimits_i w_{ij}a_i}}R_j$$

(2)

where (a_i) is the activation for neuron i. Notice that this local relevance-update rule is similar to the simple explanation for the linear model in Eq. (1), except that the normalization ensures that relevance is conserved across layers. It is the simplest local relevance propagation in the LRP framework, known as simply LRP, LRP-0, or the z-rule in DTD.

In general, the Taylor expansion that defines the local relevance propagation rule depends on the type of non-linearity and can, in addition, be engineered to enforce desirable properties based on root-point restrictions23. The network for the model considered in this work is composed of only ReLU activations and linear projections without a bias term. In this case, we can use a particularly engineered rule that will only distribute relevance along positive contributions through the layers in the network and therefore produces sparser (i.e., simpler) explanations. This rule is in the literature known as the z+-rule for DTD or LRP-(alpha _1beta _0), which again is a special case of LRP-(gamma)49. It is defined as

$$R_{i leftarrow j} = frac{{w_{ij}^ + a_i}}{{mathop {sum }nolimits_i w_{ij}^ + a_i}}R_j$$

(3)

where (w_{ij}^ + = w_{ij}) for positive weights and otherwise equals zero. Finally, at the input layer, we used the so-called DTD w2-rule

$$R_{i leftarrow j} = frac{{w_{ij}^2}}{{mathop {sum }nolimits_i w_{ij}^2}}R_j$$

(4)

as it is recommended49 for real valued input.

The DTD (and LRP) framework leaves flexibility to mix layer-specific rules in the network. As mentioned above, we have used the z+-rule in Eq. (3) for all intermediate layers, and in that way, we favor simpler explanations, with features that are explained as either relevant (positive) or irrelevant (zero) for a given prediction.

The AI-EWS explanation module allows two perspectives on model explanations: an individual and a population-based perspective. For the individual perspective, DTD can be used for all patients with a high probability of developing acute critical illness. The module will simply pinpoint which clinical parameters at a given point in time were relevant for the given prediction (Fig. 3). For the population-based perspective, relevance is back-propagated from the output neuron representing the positive classes (sepsis, AKI, and ALI) and is only considered for the patients with a positive ground truth label (sepsis, AKI, and ALI). The individual data points and back-propagated relevance scores for these patients were aggregated in two ways to enable global parameter importance estimation and local explanation summary34 (Fig. 4). For estimating global parameter importance, the mean relevance scores were computed for each clinical parameter. This computation enabled parameter-importance estimation comparable to standardized regression coefficients in multiple linear regression or feature importance measures in random forest50. The local explanation summary (Fig. 4b) presents all individual data points, colored by parameter value and displaced by the relevance. In the local explanation summary, the height of the data points shown for each parameter correlates with the number of data points at their associated level of relevance. The population-based perspective is simplified by ignoring potential temporal relevance variations and treating all data points at different times equally. The visual concepts of global parameter importance estimation and local explanation summary used in this paper are adopted from the shapley additive explanations (SHAP)34 library by Lundberg et al. The SHAP toolbox was not used to provide explanations. In this paper, DTD was implemented using the iNNvestigate51 library developed by Alber et al. iNNvestigate is a high-level library with an easy-to-use interface for many of the most-used explanation methods for neural networks.

Explaining predictions in other ways

Over the last decade, many other methods52,53,54,55,56,57,58,59,60,61,62,63 have been proposed to address the problem of attributing a value to each feature in order to explain the prediction from a complex model. Recent work62,63 have brought some theoretical understanding into fundamental properties that relates many of the methods. In general, the relevance attribution methods can be categorized into two basic categories62: (1) backpropagation-based methods that propagate the attribution of relevance backward through the network form the relevance of output and back to the input features, and (2) perturbation-based methods that rely on running multiple perturbations of the input forward through the network and measuring the consequence that a perturbation may have on the output.

With reference in the Gradient × Input method58, Ancona et al.62 defines a unification of many of the backpropagation-based attribution methods. In particular, they show that LRP (DTD with the z-rule), DeepLIFT (Rescale)59, and Integrated Gradients60 are all strongly related by a reformulation that expresses the attribution as the elementwise multiplication of a modified gradient with either the input (Gradient × Input, LRP) or with the difference between the input and a baseline (Integrated Gradients, DeepLIFT). The modification to the gradient is achieved by letting backpropagation implement a modified chain-rule

$$frac{{d^ ast a_j}}{{da_i}} = w_{ij}gleft( {z_j} right),$$

(5)

where (w_{ij}) is the standard gradient of the linear transformation (z_j = mathop {sum }nolimits_i w_{ij}a_i), and (g(z_j)) represents some unification of the gradient for the non-linear activation. For Gradient × Input, the unifying gradient function (g(z_j)) is the usual instant gradient, whereas it is some form of average gradient for the remaining three methods. We refer to Ancona et al. for the actual expressions of (g(z_j)). Interestingly, it turns out that all four methods would be equivalent in this paper’s setup, with only ReLU non-linearities, no additive bias in the linearities, and ({mathbf{x}} = {bf{0}}) as baseline.

Now, the DTD z+-rule, as we have used in all the intermediate layers, does not fit directly into the unified gradient framework, but it is a trivial exercise to show that the backpropagation rule in Eq. (5) can be modified as

$$frac{{d^ ast a_j}}{{da_i}} = w_{ij}^ + g^ + ( {z_j} ),$$

(6)

where (g^ + ( {z_j} ) = a_j/z_j^ +), (z_j^ + = mathop {sum }nolimits_i w_{ij}^ + a_i) accounts for the fact that the relevance method only distributes relevance along positive contributions.

Backpropagation-based methods are in general fast compared to perturbation-based methods, as the number of features grow62. The backpropagation-based methods require a fixed number of forward-prediction and backward-gradient passes to compute the attribution, whereas perturbation-based methods demand a non-linear number of forward passes in the number of features to properly account for the complex nature of a deep network.

On the other hand, perturbation-based methods may implement other desirable properties, such as the fairness constraints from cooperative game theory, when attributing an outcome of a prediction (the game) to the individual features (the players)53,61,63,64. In particular, Lundberg and Lee63 recently demonstrated that Shapley values65 uniquely defines the solution to these constraints within a large class of additive feature attribution methods, which includes LIME57, DeepLIFT, and LRP. Unfortunately, computing exact Shapley values is, in general, NP-hard66 and sampling approximations are therefore considered. By defining a specific kernel in the LIME setup, Lundberg and Lee introduced KernelSHAP that reduces the number of necessary samples by combining sampling and penalized linear regression, as it is done in LIME. The same paper further proposed DeepSHAP as a variant of DeepLIFT that computes a layer-wise composition of approximate Shapley values. Unfortunately, the chain rule does not hold in general for Shapley values64, and to that end, Ancona et al. presents a method based on uncertainty propagation that allows approximate Shapley value to be computed in polynomial time.

Baseline models

MEWS

The MEWS baseline model interprets raw MEWS scores as a prediction model for acute critical illness. MEWS was implemented as the Danish variant called “Early detection of critical illness” [TOKS: Tidlig opsporing af kritisk sygdom]. MEWS scores were calculated each time one of the model components was updated with a new measurement. Missing values were imputed with a standard carry-forward interpolation.

SOFA

This model interprets raw SOFA30,39,40 scores as a prediction model for acute critical illness. SOFA scores were calculated each time one of the model components was updated with a new measurement. Missing values were imputed with a standard carry-forward interpolation.

GB-Vital

This model is a replication of a well-known sepsis detection model3,6,9 from the literature, which has shown great results in a randomized study7. The complete description of the model can be found in the study from Mao et al.6. The model parameters are constructed by considering six vital-signs from the EHR: systolic blood pressure, diastolic blood pressure, heart rate, respiratory rate, peripheral capillary oxygen saturation, and temperature. For each of the six vital signs, five parameters are constructed to represent the average value for the current hour, the prior hour, and the hour prior to that hour, together with the trend value between two succeeding hours. Based on these 30 parameters (five parameters from each of the six vital-sign events), the GB-Vital model is constructed as a gradient-boosted classifier of decision trees67.

Evaluation

The xAI-EWS model was validated using five-fold cross-validation. Data were randomly divided into five portions of 20% each. For each fold four portions (80%) was used to fit the xAI-EWS model parameters during training. The remaining 20% was split into two portions of 10% each for validation and test. The validation data were used to perform an unbiased evaluation of a model fit during training, and the test data were used to provide an unbiased evaluation of the final model. All data for a single patient was assigned to either train, validation or test data. Figure 2 report performance from the test data. For each fold data were shifted such that a new portion was used for testing. The cross-validation scheme is illustrated in Supplementary Fig. 1. As comparative measures for the predictive performance, we used the AUROC and AUPRC. Regarding the explanations, the quality was assessed by manual inspection of trained specialists (medical doctors) in emergency medicine.

Ethics and information governance

The study was approved by The Danish Data Protection Agency [case number 1-16-02-541-15]. Additionally, the data used in this work were collected with the approval of the steering committee for CROSS-TRACKS. Only retrospective data were used for this research without the active involvement of patients or potential influence on their treatment. Therefore, under the current national legislature, no formal ethical approval was necessary.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

0 comment

Related Posts

This website uses cookies to improve your experience. We will assume you are ok with this, but you can opt-out if you wish. Accept Read More