### Data description

In this study, we analyzed the secondary healthcare data of all residents of four Danish municipalities (Odder, Hedensted, Skanderborg, and Horsens) who were 18 years of age or older for the period of 2012–2017. The data contained information from the electronic health record (EHR), including biochemistry, medicine, microbiology, and procedure codes, and was extracted from the “CROSS-TRACKS” cohort, which embraces a mixed rural and urban multi-center population with four regional hospitals and one larger university hospital. Each hospital comprises multiple departmental units, such as emergency medicine, intensive care, and thoracic surgery. We included all 163,050 available inpatient admissions (45.9% male) during the study period and excluded only outpatient admissions. The included admissions were distributed across 66,288 unique residents. The prevalence for sepsis, AKI, and ALI among these admissions was 2.44%, 0.75%, and 1.68%, respectively (see Table 1). The CROSS-TRACKS cohort offers a combined dimensional model of the secondary healthcare data. Merging all data sets is possible via a unique personal identification number given to all Danish citizens and by which all information within any public institution is collected^{31}.

The model parameters were limited to include 27 laboratory parameters and six vital signs (see Tables 2 and 3). The parameters were selected by trained specialists in emergency medicine (medical doctors) with the sole purpose of simplifying the model to enable a better discussion of the model explanations. While a deeper model with more parameters might lead to better performance, it would also have made the discussions between clinicians and software engineers difficult. Therefore, the scope of this article is not to obtain the best performance at all costs but to demonstrate how clinical tasks can be supported by a fully explainable deep learning approach.

### Data preprocessing

In the data extracted from the CROSS-TRACKS cohort, each admission is represented as a time-ordered sequence of EHR events. Importantly, the time-stamped order of this data reflects the point in time at which the clinicians record each event during the admission. Each event comprises three elements: a time stamp; an event name, such as blood pressure; and a numerical value. The event sequence is partitioned in aggregated intervals of one hour, such that the observation window of 24 h is divided into 24 one-hour periods, and all the events occurring within the same one-hour period are grouped together by their average numerical value.

### Gold standards

Via a classification process, each admission was classified as sepsis-positive, AKI-positive, ALI-positive, or negative (no critical illness). For sepsis classification, we followed the recent Sepsis-3^{30,39} implementation by Moor et al.^{5}, according to which both suspected infection and organ dysfunction are required to be present^{5,30,39}. Suspected infection was defined by the co-occurrence of body fluid sampling and antibiotic administration. When a culture sample was obtained before antibiotics administration, the antibiotic had to be ordered within 72 h. If the antibiotic was administered first, then the culture sample had to follow within 24 h.

The degree of organ dysfunction is described by an acute increase in the SOFA^{40} score and an increase of more than or equal to two points is used in the criteria for sepsis^{39}. To implement the organ dysfunction criterion, we used a 72-h window from 48 h before to 24 h after the time of suspected infection, as suggested by Singer et al.^{30} and Moor et al.^{5}. The Sepsis-3 implementation is visualized in Fig. 5.

AKI classification was performed according to the KDIGO criteria^{31}. KDIGO accepts three definitions of AKI: (1) an increase in serum creatinine of 0.3 mg/dl (26.5 μmol/l) within 48 h; (2) an increase in serum creatinine by 1.5 times the habitual creatinine level of a patient within the previous seven days; and (3) a urine output of <0.5 ml/kg/h over 6 h. Following the work of Tomašev et al.^{1}, only the first two definitions were used to provide ground-truth labels for the onset of AKI as urine measurements were not available. The habitual creatinine level was computed as the mean creatinine level during the previous 365 days. We used binary encoding for AKI such that all three severity stages (KDIGO stages 1, 2, and 3) were encoded as positive AKI. For ALI classification, we considered the presence of either NIV or CPAP during the admission, because PaO_{2}/FiO_{2} measurements were not available. The ALI onset was the first occurrence of either NIV or CPAP (see Fig. 5).

### Prediction module

The AI-EWS model is designed as a variation of a convolutional neural network (CNN) called a temporal convolutional network (TCN). CNNs have dominated computer vision tasks for the last century and are also highly capable of performing sequential tasks, such as text analysis and machine translation^{41}. A TCN^{23,24} models the joint probability distribution over sequences by decomposing the distribution over discrete time-steps (p_theta left( x right) = mathop {prod }nolimits_{t = 1}^T p_theta left( {left. {x_{t}} right|x_{1:t – 1}} right)), where (x = left{ {x_1,x_2, ldots ,x_{{T}}} right}) is a sequence, and the joint distribution is parameterized by the TCN parameter (theta). Thus, a TCN operates under the autoregressive premise that only past values affect the current or future values, e.g., if a patient will develop acute critical illness. Moreover, TCNs differ from “ordinary” CNNs by at least one property: the convolutions in TCNs are causal in the sense that a convolution filter at time *t* is only dependent on the inputs that are no later than (t), wherein the input subsequence is (x_1,x_2, ldots ,x_{{t}}). TCNs can take a sequence of any length as input and output a sequence of the same length, similar to RNNs^{22,28}. The TCN achieves this by increasing the receptive field of the model with dilated convolutions instead of performing the traditional max pooling operation, as seen in most CNNs. Dilated convolutions achieve a larger receptive field with fewer parameters by having an exponential stride compared to the traditional linear stride. By increasing the receptive field, a temporal hierarchy comparable to multi-scale analysis from computer vision can be achieved^{42}. Figure 6 schematizes the xAI-EWS model and the concept of dilated convolutions. At the time of prediction, the xAI-EWS model receives an input matrix of shape time-steps × features for each patient.

The data are processed by three temporal blocks, each including one-dimensional dilated causal convolutions (Conv1d) with 64 filters, ReLU activations^{43}, layer^{44}, and one-dimensional spatial dropout layers^{45}. Dilation is increased between each temporal block, but keep it constant inside each temporal block (meaning that the second conv1d layer in each temporal block has a dilation = 1). The receptive field for this model can be calculated with (1 + mathop {sum }nolimits_{i = 1}^n left( {k – 1} right) ast left( {2^{i – 1} + 1} right)), where *k* is kernel size and *n* is the number of temporal blocks. We used a kernel size = 4 yielding a maximum receptive field of 31. Outputs from the third temporal block are pooled together across time-steps by a global average pooling operation^{46} to obtain a stabilizing effect for the final output of the model. The pooled output from each kernel in the dilated causal convolutions is flattened to a single vector that is used as input to a final dense layer followed by a softmax activation function. The output from the softmax activation is the probability of future sepsis, AKI, or ALI during admission.

### Training and hyperparameters

The model was trained to optimize the cross-entropy loss using the Adam optimizer^{47} with mini-batches of the size of 200, a learning rate of 0.001, and a dropout rate of 10%. All weights were initialized with He Normal initialization^{48}. The model was trained on a NVIDIA Tesla V100 GPU. Convergence was reached in ~30 min.

### Explanation module

In simple models, such as linear regression models, the simple association between input and the prediction outcome is readily transparent and explainable. Consider the linear function *f*_{c} that weights the input **x** by w_{c} in order to assign a decision for class *c*:

$$f_cleft({mathbf{x}} right) = {mathbf{w}}_c^Tx = mathop {sum }limits_i w_{ic}x_i.$$

(1)

Here, each input feature *x*_{i} of **x** contributes together with the trainable weight *w*_{ic} to the overall evaluation of *f*_{c} through the quantity (w_{ic}x_i). The importance-weighted input, therefore, offers a simple explanation for a decision made by the linear model. In contrast, the complexity associated with the multi-layer non-linear nature of deep learning models counteracts with such simplicity in explanations.

Layer-wise relevance propagation (LRP)^{23,24,25,26,27} is an explanatory technique that applies to deep-learning models, including TCNs. Starting from the output (f_cleft({mathbf{x}} right)), LRP decomposes an explanation into simpler local updates, each recursively defining the contribution to the explanation (called relevance) for all activating neurons in the previous layer. The initial relevance score (R_j = f_cleft({mathbf{x}} right)) is hereby propagated backward through the network by local relevance updates (R_{i leftarrow j}) between connecting neurons*i* and *j*, until the input layer is finally reached. In this process, all incoming relevance values to an intermediate node *i* are pooled, (R_i = mathop {sum }nolimits_j R_{i leftarrow j}), before its relevance is propagated to the next layer. Figure 7 illustrates the relevance propagation, which is similar to standard backpropagation of errors except that relevance values are propagated backward in the network instead. The conservation property^{23}, one of the important defining properties in LRP, ensures that the total back-propagated relevance amounts to the extent to which the illness of interest is detected by the function (f_cleft({mathbf{x}} right)), which in this paper equals the logits that feed into the final transformation layer.

In this process, all incoming relevance values to an intermediate node *i* are pooled, (R_i = mathop {sum }nolimits_j R_{i leftarrow j}), before its relevance is propagated to the next layer. Figure 7 illustrates the relevance propagation, which is similar to standard backpropagation of errors except that relevance values are propagated backward in the network instead.

There are many variations of local backpropagation rules in the LRP framework. See, e.g., Montavon et al.^{49} for a collection of commonly used LRP rules. We have used propagation rules that can be interpreted as DTD^{23}, which defines a sound theoretical framework behind most of the LRP variations. In DTD, a local backward propagation of relevance accounts for a non-linearity in the network model by a first-order Taylor approximation at some well-chosen root-point. Using the origin as root recovers the original LRP update rule from Bach et al.^{25}. That is, the relevance *R*_{j} at neuron *j* propagates backward to neuron *i* as

$$R_{i leftarrow j} = frac{{w_{ij}a_i}}{{mathop {sum }nolimits_i w_{ij}a_i}}R_j$$

(2)

where (a_i) is the activation for neuron *i*. Notice that this local relevance-update rule is similar to the simple explanation for the linear model in Eq. (1), except that the normalization ensures that relevance is conserved across layers. It is the simplest local relevance propagation in the LRP framework, known as simply LRP, LRP-0, or the z-rule in DTD.

In general, the Taylor expansion that defines the local relevance propagation rule depends on the type of non-linearity and can, in addition, be engineered to enforce desirable properties based on root-point restrictions^{23}. The network for the model considered in this work is composed of only ReLU activations and linear projections without a bias term. In this case, we can use a particularly engineered rule that will only distribute relevance along positive contributions through the layers in the network and therefore produces sparser (i.e., simpler) explanations. This rule is in the literature known as the z^{+}-rule for DTD or LRP-(alpha _1beta _0), which again is a special case of LRP-(gamma)^{49}. It is defined as

$$R_{i leftarrow j} = frac{{w_{ij}^ + a_i}}{{mathop {sum }nolimits_i w_{ij}^ + a_i}}R_j$$

(3)

where (w_{ij}^ + = w_{ij}) for positive weights and otherwise equals zero. Finally, at the input layer, we used the so-called DTD w^{2}-rule

$$R_{i leftarrow j} = frac{{w_{ij}^2}}{{mathop {sum }nolimits_i w_{ij}^2}}R_j$$

(4)

as it is recommended^{49} for real valued input.

The DTD (and LRP) framework leaves flexibility to mix layer-specific rules in the network. As mentioned above, we have used the z^{+}-rule in Eq. (3) for all intermediate layers, and in that way, we favor simpler explanations, with features that are explained as either relevant (positive) or irrelevant (zero) for a given prediction.

The AI-EWS explanation module allows two perspectives on model explanations: an individual and a population-based perspective. For the individual perspective, DTD can be used for all patients with a high probability of developing acute critical illness. The module will simply pinpoint which clinical parameters at a given point in time were relevant for the given prediction (Fig. 3). For the population-based perspective, relevance is back-propagated from the output neuron representing the positive classes (sepsis, AKI, and ALI) and is only considered for the patients with a positive ground truth label (sepsis, AKI, and ALI). The individual data points and back-propagated relevance scores for these patients were aggregated in two ways to enable global parameter importance estimation and local explanation summary^{34} (Fig. 4). For estimating global parameter importance, the mean relevance scores were computed for each clinical parameter. This computation enabled parameter-importance estimation comparable to standardized regression coefficients in multiple linear regression or feature importance measures in random forest^{50}. The local explanation summary (Fig. 4b) presents all individual data points, colored by parameter value and displaced by the relevance. In the local explanation summary, the height of the data points shown for each parameter correlates with the number of data points at their associated level of relevance. The population-based perspective is simplified by ignoring potential temporal relevance variations and treating all data points at different times equally. The visual concepts of global parameter importance estimation and local explanation summary used in this paper are adopted from the shapley additive explanations (SHAP)^{34} library by Lundberg et al. The SHAP toolbox was not used to provide explanations. In this paper, DTD was implemented using the iNNvestigate^{51} library developed by Alber et al. iNNvestigate is a high-level library with an easy-to-use interface for many of the most-used explanation methods for neural networks.

### Explaining predictions in other ways

Over the last decade, many other methods^{52,53,54,55,56,57,58,59,60,61,62,63} have been proposed to address the problem of attributing a value to each feature in order to explain the prediction from a complex model. Recent work^{62,63} have brought some theoretical understanding into fundamental properties that relates many of the methods. In general, the relevance attribution methods can be categorized into two basic categories^{62}: (1) backpropagation-based methods that propagate the attribution of relevance backward through the network form the relevance of output and back to the input features, and (2) perturbation-based methods that rely on running multiple perturbations of the input forward through the network and measuring the consequence that a perturbation may have on the output.

With reference in the Gradient × Input method^{58}, Ancona et al.^{62} defines a unification of many of the backpropagation-based attribution methods. In particular, they show that LRP (DTD with the z-rule), DeepLIFT (Rescale)^{59}, and Integrated Gradients^{60} are all strongly related by a reformulation that expresses the attribution as the elementwise multiplication of a modified gradient with either the input (Gradient × Input, LRP) or with the difference between the input and a baseline (Integrated Gradients, DeepLIFT). The modification to the gradient is achieved by letting backpropagation implement a modified chain-rule

$$frac{{d^ ast a_j}}{{da_i}} = w_{ij}gleft( {z_j} right),$$

(5)

where (w_{ij}) is the standard gradient of the linear transformation (z_j = mathop {sum }nolimits_i w_{ij}a_i), and (g(z_j)) represents some unification of the gradient for the non-linear activation. For Gradient × Input, the unifying gradient function (g(z_j)) is the usual instant gradient, whereas it is some form of average gradient for the remaining three methods. We refer to Ancona et al. for the actual expressions of (g(z_j)). Interestingly, it turns out that all four methods would be equivalent in this paper’s setup, with only ReLU non-linearities, no additive bias in the linearities, and ({mathbf{x}} = {bf{0}}) as baseline.

Now, the DTD z^{+}-rule, as we have used in all the intermediate layers, does not fit directly into the unified gradient framework, but it is a trivial exercise to show that the backpropagation rule in Eq. (5) can be modified as

$$frac{{d^ ast a_j}}{{da_i}} = w_{ij}^ + g^ + ( {z_j} ),$$

(6)

where (g^ + ( {z_j} ) = a_j/z_j^ +), (z_j^ + = mathop {sum }nolimits_i w_{ij}^ + a_i) accounts for the fact that the relevance method only distributes relevance along positive contributions.

Backpropagation-based methods are in general fast compared to perturbation-based methods, as the number of features grow^{62}. The backpropagation-based methods require a fixed number of forward-prediction and backward-gradient passes to compute the attribution, whereas perturbation-based methods demand a non-linear number of forward passes in the number of features to properly account for the complex nature of a deep network.

On the other hand, perturbation-based methods may implement other desirable properties, such as the fairness constraints from cooperative game theory, when attributing an outcome of a prediction (the game) to the individual features (the players)^{53,61,63,64}. In particular, Lundberg and Lee^{63} recently demonstrated that Shapley values^{65} uniquely defines the solution to these constraints within a large class of additive feature attribution methods, which includes LIME^{57}, DeepLIFT, and LRP. Unfortunately, computing exact Shapley values is, in general, NP-hard^{66} and sampling approximations are therefore considered. By defining a specific kernel in the LIME setup, Lundberg and Lee introduced KernelSHAP that reduces the number of necessary samples by combining sampling and penalized linear regression, as it is done in LIME. The same paper further proposed DeepSHAP as a variant of DeepLIFT that computes a layer-wise composition of approximate Shapley values. Unfortunately, the chain rule does not hold in general for Shapley values^{64}, and to that end, Ancona et al. presents a method based on uncertainty propagation that allows approximate Shapley value to be computed in polynomial time.

### Baseline models

#### MEWS

The MEWS baseline model interprets raw MEWS scores as a prediction model for acute critical illness. MEWS was implemented as the Danish variant called “Early detection of critical illness” [TOKS: Tidlig opsporing af kritisk sygdom]. MEWS scores were calculated each time one of the model components was updated with a new measurement. Missing values were imputed with a standard carry-forward interpolation.

#### SOFA

This model interprets raw SOFA^{30,39,40} scores as a prediction model for acute critical illness. SOFA scores were calculated each time one of the model components was updated with a new measurement. Missing values were imputed with a standard carry-forward interpolation.

#### GB-Vital

This model is a replication of a well-known sepsis detection model^{3,6,9} from the literature, which has shown great results in a randomized study^{7}. The complete description of the model can be found in the study from Mao et al.^{6}. The model parameters are constructed by considering six vital-signs from the EHR: systolic blood pressure, diastolic blood pressure, heart rate, respiratory rate, peripheral capillary oxygen saturation, and temperature. For each of the six vital signs, five parameters are constructed to represent the average value for the current hour, the prior hour, and the hour prior to that hour, together with the trend value between two succeeding hours. Based on these 30 parameters (five parameters from each of the six vital-sign events), the GB-Vital model is constructed as a gradient-boosted classifier of decision trees^{67}.

### Evaluation

The xAI-EWS model was validated using five-fold cross-validation. Data were randomly divided into five portions of 20% each. For each fold four portions (80%) was used to fit the xAI-EWS model parameters during training. The remaining 20% was split into two portions of 10% each for validation and test. The validation data were used to perform an unbiased evaluation of a model fit during training, and the test data were used to provide an unbiased evaluation of the final model. All data for a single patient was assigned to either train, validation or test data. Figure 2 report performance from the test data. For each fold data were shifted such that a new portion was used for testing. The cross-validation scheme is illustrated in Supplementary Fig. 1. As comparative measures for the predictive performance, we used the AUROC and AUPRC. Regarding the explanations, the quality was assessed by manual inspection of trained specialists (medical doctors) in emergency medicine.

### Ethics and information governance

The study was approved by The Danish Data Protection Agency [case number 1-16-02-541-15]. Additionally, the data used in this work were collected with the approval of the steering committee for CROSS-TRACKS. Only retrospective data were used for this research without the active involvement of patients or potential influence on their treatment. Therefore, under the current national legislature, no formal ethical approval was necessary.

### Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.