EFFICIENT UNCERTAINTY QUANTIFICATION AND VARIANCE-BASED SENSITIVITY ANALYSIS IN EPIDEMIC MODELLING USING POLYNOMIAL CHAOS

Abstract. The use of epidemic modelling in connection with the spread of diseases plays an important role in understanding dynamics and providing forecasts for informed analysis and decision-making. In this regard, it is crucial to quantify the effects of uncertainty in the modelling and in model-based predictions in order to communicate results and limitations faithfully. We propose to perform efficient uncertainty quantification in compartmental epidemic models using the generalized Polynomial Chaos (gPC) framework. This framework uses a polynomial basis tailored to the underlying distribution of the parameter uncertainty to propagate it forward, via efficient sampling of a mathematical model, to quantify the effect on the output. By evaluating the model in a small number of selected points, gPC provides illuminating statistics and sensitivity analysis at a low computational cost. Through two case studies based on Danish data for the spread of Covid-19, we demonstrate the applicability of the technique. The test cases consider epidemic peak time estimation and the dynamics between superspreading and partial lockdown measures. The computational results show the efficiency and feasibility of uncertainty quantification techniques based on gPC, and highlight the relevance of computational uncertainty quantification in epidemic modelling.


Introduction
In this manuscript we demonstrate the application of techniques from uncertainty quantification (UQ) to epidemic modelling to provide insight and identify the parameters that have the strongest influence on the results of the predictive modelling. Such knowledge is crucial for mitigation strategies, restriction policies, and similar measures aiming to control or reduce the impact of the spread of diseases and secure public health. In part, it will also improve the ability to deal with uncertainty in predictive modelling.
Uncertainty quantification as an independent field grew out of problems in various other fields such as probability theory, dynamical systems and numerical simulations. Sampling-based techniques, such as Markov Chain Monte Carlo (MCMC) methods and bootstrapping [7,9], are commonly used in epidemic modelling, as seen in the studies [13,19,21] and in the work of the expert group providing the Covid-19 related modelling for the Danish government. We propose an alternative approach in generalized Polynomial Chaos [3,10,23,24], an efficient, general, non-iterative framework for UQ analysis using forward modelling where the uncertainties are parameterized. The outcome is estimates of the expected value and variance of possible solutions under the uncertainty.
Mathematical models are widely applied for modelling the spread of infectious diseases [12], and model predictions are used to inform political decisions on societal countermeasures. Models of varying complexity, flexibility, restrictions and assumptions exist; see [5] and [2] for deeper introductions to different kinds of models. If appropriately combined with data, the models provide insight into the behaviour of a disease. They may yield estimates for its duration, the peak infection and various other aspects.
In this paper we use extended versions of the SIR model; a compartmental epidemic model named after the division of a population into three compartments: (S)usceptible individuals, (I)nfectious individuals, and (R)ecovered individuals; see e.g. [5]. A SIR model is simple in nature and generally assumes a homogeneously mixed population across compartments. Yet the model is flexible and easily allows for extensions, e.g. by separation into age groups or local geographic regions. One should note that SIR models exhibit exponential growth in the modelling of infected people during the early stage of an epidemic, and hence they are sensitive to model parameters and best suited for short term predictions.
Model parameters are often estimated from various kinds of real-world data, and for that reason they are intrinsically affected by uncertainty. Mathematically, the uncertainty is parameterized by a probability distribution, and with this distribution in place, the parameter uncertainty can be propagated to the model output providing a distribution of possible outputs in place of a single prediction. Computational UQ computes various statistics from the output distribution, e.g. mean, variance, confidence intervals, etc. See Figure 1 for a conceptual illustration of UQ.
It can be quite challenging to explore the potentially complicated output distribution. Sampling methods such as Markov Chain Monte Carlo (MCMC) are flexible and commonly employed tools for the purpose. However, MCMC methods often require substantial sampling due to slow convergence. When the model evaluation is expensive, MCMC may not be feasible [15].
Generalized Polynomial Chaos (gPC) poses an efficient alternative non-sampling based method, which can provide very good estimates using significantly fewer model evaluations provided the dimension of the parameter space is sufficiently low. The drawback is that gPC suffers from the curse of dimensionality, i.e. when the parameter count (i.e. the dimension) grows, the computational requirements may grow much more. The method utilizes orthogonal polynomials and Gaussian quadrature to optimize the number of model evaluations necessary to compute statistics by means of an orthonormal expansion. gPC has also been used on Spanish Covid-19 data in [18].
When various statistics of the output distribution are in place, it is important to know which input parameters contributed most to the output uncertainty. In other words, do some parameters contribute significantly more to the uncertainty in the model output? Variance-based sensitivity analysis gives the answer in terms of the so-called Sobol indices [1,20]. Sobol indices have for example been applied in analysis of the British CovidSim model in [6].
The main novelty of our work is in Section 4, where we demonstrate the utility of gPC analysis and Sobol indices in epidemic models and apply them to Danish data from the early phases of Coronavirus SARS-CoV-2 (Covid-19). The versatility of the tools is illustrated in two different cases based on SIR models. Case 1 concerns the estimation of the timing and size of the peak of an epidemic. Case 2 provides a way for UQ modelling of superspreaders in a compartmental model inspired by [19].
This manuscript is structured as follows: Sections 2 and 3 provide the theoretical background. In Section 2, we give an introduction to gPC and illustrate how various basic statistics are directly computable from the expansion coefficients. Sobol indices are introduced in Section 3. We provide a short derivation of their formulation, and we relate Sobol indices to Polynomial Chaos by providing formulas for their computation in terms of the gPC expansion coefficients.
The computations included in this manuscript were done in Matlab, and the framework is available as a small toolbox on the DTU GitLab server. The methods used for computing the various quadratures have been ported to Matlab from NumPy [17].

Polynomial Chaos expansion
We will consider a model described by the input-output map f : Ω ⊆ X → Y, where X = R^n is the parameter space, Ω is a subset of X, and Y = R^m is the output space. The aim is to quantify the effect of uncertainty in the input X ∈ X on the output of the model f(X) = Y ∈ Y.
Polynomial Chaos decomposes f ∈ L²(Ω, Y, dµ) in a basis of orthonormal polynomials {φ_α} of increasing order given by the multi-index α = (α_1, …, α_n) ∈ N^n. We will use the notational convention that α = 0 when α_i = 0 for all 1 ≤ i ≤ n. Note that φ_0(x) = 1; the zeroth order polynomial is always constant. The decomposition is standard,

f(X) = Σ_{α ∈ N^n} ⟨f, φ_α⟩_µ φ_α(X).   (2.1)

Here ⟨f, φ_α⟩_µ are the coefficients of f computed via the L²(Ω, dµ) inner product. In practice the series is truncated and f(X) is approximated by a finite sum. When f is smooth the decay of the coefficients is exponential [4], and hence the error introduced by the approximation is small. For a number of common probability measures dµ the orthonormal polynomials are well-known and easy to generate. They also yield Gaussian quadratures with respect to these probability measures, which makes the computation of the involved integrals fast [4].
Consider as an example the standard normal distribution N(0, 1²), which up to a scaling constant has probability measure exp(−x²/2) dx. The (probabilists') Hermite polynomials form an orthogonal sequence with respect to this measure. The corresponding Gauss–Hermite quadrature is formed by picking a degree n_quad such that

∫ f(x) dµ(x) ≈ Σ_{i=1}^{n_quad} w_i f(ξ_i),

with ξ_i denoting the roots of the Hermite polynomial of the corresponding degree and w_i the associated weights; this quadrature is exact whenever f is a polynomial with deg(f) ≤ 2 n_quad − 1.
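As a minimal illustration (assuming NumPy is available; the paper's own toolbox is in Matlab), the quadrature rule can be sketched as follows. The raw `hermegauss` weights integrate against exp(−x²/2), so they are renormalized by √(2π) to act as probability weights for N(0, 1):

```python
import numpy as np

# Probabilists' Gauss-Hermite quadrature nodes and weights of degree 5.
nodes, weights = np.polynomial.hermite_e.hermegauss(5)
weights = weights / np.sqrt(2.0 * np.pi)  # normalize to probability weights

def expectation(f):
    """Approximate E[f(X)] for X ~ N(0, 1); exact for deg(f) <= 2*5 - 1 = 9."""
    return float(np.sum(weights * f(nodes)))

print(expectation(lambda x: x**2))  # second moment of N(0, 1), i.e. 1.0
```

Five evaluations of f thus reproduce moments exactly up to degree 9, in line with the 2 n_quad − 1 exactness stated above.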

Statistical properties
While stochastic phenomena come with expressive distributions, they are often well characterized by basic statistical properties like the mean, variance and covariance. Given a model f with parameters characterized by the random variable X, the resulting output Y = f (X) is a new random variable. In this section we will discuss how basic statistical properties of Y become directly computable from the coefficients in our polynomial expansion for f .
Assume that dµ is a probability measure on the parameter set Ω, and that {φ_α} is an orthonormal basis as above. Consider the random variable Y = f(X), where X ∼ dµ, i.e. it follows the distribution defined by dµ. Denote the coefficients by c_α = ⟨f, φ_α⟩_µ. Then it is easy to see that we immediately obtain the mean value in terms of the first coefficient,

E[Y] = c_0.

In a similar way the variance may be derived as

V[Y] = Σ_{α ≠ 0} c_α².

Consider now the random variables Y_1 = f_1(X) and Y_2 = f_2(X) with coefficients {c_{1,α}} and {c_{2,α}}; then a computation analogous to that for the variance yields the covariance as

Cov[Y_1, Y_2] = Σ_{α ≠ 0} c_{1,α} c_{2,α}.

For vector-valued outputs, where α(·) : N → N^n is some traversal of the multi-index space with α(0) = (0, 0, …, 0), the covariance matrix follows entrywise from the same formula.
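Continuing the quadrature sketch, the coefficients c_α can be approximated numerically, after which mean and variance drop out directly. The example function f(x) = x² + x and all numerical choices here are illustrative, not from the paper:

```python
import math
import numpy as np
from numpy.polynomial import hermite_e as H

n_quad, order = 8, 4
nodes, weights = H.hermegauss(n_quad)
weights = weights / math.sqrt(2.0 * math.pi)  # probability weights for N(0, 1)

def phi(k, x):
    """k-th orthonormal (probabilists') Hermite polynomial He_k / sqrt(k!)."""
    coef = [0.0] * k + [1.0]
    return H.hermeval(x, coef) / math.sqrt(math.factorial(k))

f = lambda x: x**2 + x
# Coefficients c_k = <f, phi_k> computed by Gauss-Hermite quadrature.
c = np.array([np.sum(weights * f(nodes) * phi(k, nodes)) for k in range(order + 1)])

mean = c[0]                 # E[Y] = c_0
var = np.sum(c[1:] ** 2)    # V[Y] = sum over alpha != 0 of c_alpha^2
print(mean, var)
```

For f(x) = x² + x with X ∼ N(0, 1) the exact values are E[Y] = 1 and V[Y] = 3, which the truncated expansion reproduces.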

Variance-based sensitivity analysis
The uncertainty of the output Y = f (X) is quantified in terms of its statistical properties like mean and variance, but it is natural to trace the uncertainty in Y back to the individual parameters in X. The quantification of individual input parameters' influence on the output variance is called variance-based sensitivity analysis. If such information is available, we would know the relative importance of the uncertainty in a single parameter on the overall output variance, and this knowledge would indicate in which parameters of X, if possible, we should aim at reducing the uncertainty.
The Sobol indices quantify the contribution to the variance of the output Y from each individual parameter and each combination of the parameters in X. Like the global statistical properties, the Sobol indices are computable from the polynomial expansion coefficients of f. We give a brief example here, then present the formulation of the Sobol indices, and follow up with the derivation in terms of the coefficients.
Consider the map f : Ω ⊆ X → Y, X = R^n, Y = R (the following derivation extends naturally to Y = R^m). Let dµ = ⊗_{i=1}^n dµ_i be a product probability measure on Ω ⊆ X and X = (X_1, …, X_n), X_i ∼ dµ_i. Our example is a simple case: let X_1 ∼ N(0, a²) and X_2 ∼ N(0, b²) be normally distributed. Say Y = X_1 + X_2; then the Sobol indices would be S_1, S_2 and S_12, corresponding to each non-empty combination of X_1 and X_2. Their values would be S_1 = a²/(a²+b²), S_2 = b²/(a²+b²) and S_12 = 0. In other words, if a > b, then it is better to decrease the uncertainty in X_1 rather than X_2. In contrast, if Y = X_1 X_2 then the Sobol indices are S_1 = S_2 = 0 and S_12 = 1, so decreasing the uncertainty in either parameter is equally beneficial.
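The additive example above can be checked numerically. The following sketch uses a plain Monte Carlo pick-freeze estimator (a standard sampling technique, not the gPC approach advocated here) for the first-order index S_1 of Y = X_1 + X_2:

```python
import random

random.seed(0)
a, b, M = 2.0, 1.0, 200_000
f = lambda x1, x2: x1 + x2

# Pick-freeze estimator: S_1 = Cov(f(X1, X2), f(X1, X2')) / V[f],
# where X2' is an independent copy of X2 (X1 is "frozen").
ya, yab = [], []
for _ in range(M):
    x1, x2, x2p = random.gauss(0, a), random.gauss(0, b), random.gauss(0, b)
    ya.append(f(x1, x2))
    yab.append(f(x1, x2p))

mean = sum(ya) / M
var = sum((y - mean) ** 2 for y in ya) / M
s1 = (sum(y1 * y2 for y1, y2 in zip(ya, yab)) / M - mean ** 2) / var
print(round(s1, 2))  # close to a^2 / (a^2 + b^2) = 0.8
```

Note the contrast in cost: the Monte Carlo estimate above needs hundreds of thousands of model evaluations, where the gPC route of Section 3.2 reads the same index off a handful of expansion coefficients.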

Sobol indices
To compute the Sobol indices we rely on a decomposition into marginalizations of f . We give a derivation of the Sobol indices based on the exposition in [20] to illustrate their computation.
The decomposition writes f as a sum over subsets u of the parameter index set U = {1, …, n},

f(X) = Σ_{u ⊆ U} f_u(X_u),   (3.1)

where the constant term is f_∅ = E[f(X)]. The remaining functions f_u, u ≠ ∅, are then recursively defined by

f_u(X_u) = E_{U∖u}[f(X)] − Σ_{v ⊊ u} f_v(X_v).   (3.2)

Here dµ = 0 outside the domain of f, i.e. outside Ω. Note that the sum is telescopic, each component corresponding to a set u subtracting its subset components again. Hence we may compute each f_u(X_u), u ⊆ U, starting from the smallest subsets and progressing hierarchically upward.
We consider now the variance of f(X) and apply the expansion (3.1) to obtain

V[f(X)] = Σ_{u ⊆ U, u ≠ ∅} V[f_u(X_u)].   (3.4)

Dividing by the left hand side in (3.4) we get

1 = Σ_{u ≠ ∅} V[f_u(X_u)] / V[f(X)],

which divides the variance into partitional contributions by parameter combinations. The partitions S_u = V[f_u(X_u)] / V[f(X)], u ⊆ {1, …, n}, are the Sobol indices.

Relation to polynomial chaos
The Sobol indices are efficiently computable from the gPC coefficients. The marginalizations of the distribution arise as restrictions to certain subsets of the coefficients.
Let c_α be the gPC coefficients of f. To compute the Sobol indices we wish to compute the terms V[f_u(X_u)]. Taking the variance on both sides of (3.2) we get

V[f_u(X_u)] = V[E_{U∖u}[f(X)]] − Σ_{v ⊊ u} V[f_v(X_v)].

Clearly, if we compute bottom up hierarchically using the partial ordering u ≤ v if u ⊆ v, we simply need to compute the marginalizations V[E_{U∖u}[f(X)]] and then subtract the formerly computed values.
Due to the marginalizations of f we will need to consider the marginal structure of our basis functions {φ_α} too. For a multi-index α ∈ N^n we shall use the tensor-product notation

φ_α(x) = Π_{i=1}^n ψ_{i,α_i}(x_i),

where {ψ_{i,j}}_j is the orthonormal polynomial basis for parameter X_i. With this we may derive

E_{U∖u}[f(X)] = Σ_{α : α_{U∖u} = 0} c_α φ_α(X_u),

since the product of mean values E[ψ_{i,α_i}(X_i)] over i ∈ U∖u is 0 unless α_i = 0 for all i ∈ U∖u; we write simply α_{U∖u} = 0. Taking the variance of the above and using the fact that V[φ_α(X)] = 1 for α ≠ 0 and zero otherwise, we get

V[E_{U∖u}[f(X)]] = Σ_{α ≠ 0 : α_{U∖u} = 0} c_α².

Visually, for just two parameters, the coefficients distribute themselves in the coefficient grid {c_{i,j}} as follows: the first column {c_{i,0}}, i ≥ 1, contributes to V[f_{1}], the first row {c_{0,j}}, j ≥ 1, contributes to V[f_{2}], and the remaining interior entries contribute to the joint term V[f_{1,2}]. Note that c_{0,0} is excluded, as it is the mean, which does not contribute to the variance.
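The grid bookkeeping above can be expressed compactly. This hypothetical two-parameter sketch recovers the Sobol indices of the earlier additive example Y = X_1 + X_2 directly from a coefficient grid c[i, j]:

```python
import numpy as np

# Hypothetical gPC coefficient grid c[i, j]; index i for X1, index j for X2.
# For Y = X1 + X2 with X1 ~ N(0, a^2), X2 ~ N(0, b^2), the orthonormal
# expansion has c[1, 0] = a and c[0, 1] = b; all other coefficients vanish.
a, b = 2.0, 1.0
c = np.zeros((3, 3))
c[1, 0], c[0, 1] = a, b

V = np.sum(c ** 2) - c[0, 0] ** 2     # total variance; mean term excluded
S1 = np.sum(c[1:, 0] ** 2) / V        # first column: indices with alpha_2 = 0
S2 = np.sum(c[0, 1:] ** 2) / V        # first row: indices with alpha_1 = 0
S12 = np.sum(c[1:, 1:] ** 2) / V      # interior: mixed indices
print(S1, S2, S12)                    # 0.8, 0.2, 0.0
```

The three slices of the grid partition all non-mean coefficients exactly once, so S1 + S2 + S12 = 1, as required by (3.4).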

Uncertainty quantification in modelling spread of diseases using Polynomial Chaos
In this section we present two cases to demonstrate the flexibility of the above techniques by applying them to compartmental models. The first case considers a SEIR model and computes distributions for the size and timing of the peak of the modelled epidemic.
For the second case a more extensive compartmental model is considered. Inspired by the agent based modelling of superspreaders discussed in [19] we construct a multi-compartment model and formulate a modelling approach for superspreaders leading to comparable results despite the differences in modelling assumptions. With this model we perform a UQ analysis on the coefficients modelling the government imposed restrictions.
In both cases the population size N is taken as 5.8 × 10 6 matching the size of the Danish population.

Case 1: Epidemic peak
For this simple case we consider a SEIR model, i.e. a model with the compartments (S)usceptible, (E)xposed, (I)nfectious and (R)ecovered/removed. The model is visualized in the diagram in Figure 2.
As an ODE system the model has the form

dS/dt = −β S I / N,   (4.1a)
dE/dt = β S I / N − σ E,   (4.1b)
dI/dt = σ E − γ I,   (4.1c)
dR/dt = γ I,   (4.1d)

where β, σ and γ are transition coefficients and N is the total size of the population. σ is the rate at which people progress from being exposed (incubating) to becoming infectious, and γ is the rate at which one recovers (or dies) from the disease. Their reciprocals are the average times an individual spends in the exposed and infectious compartments, respectively. β denotes the average rate of infection in the population. This quantity depends on the infectiousness of the virus and the social patterns of the population; e.g. a higher hygiene standard in the population would lead to a lower β. Note that as a consequence of the model, S + E + I + R = N and hence constant at all times.
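A forward-Euler discretization of the system is enough to illustrate the peak quantities used below; the parameter values here are purely illustrative and not the paper's fitted hyper-parameters:

```python
def seir_step(state, dt, beta, sigma, gamma, N):
    """One forward-Euler step of the SEIR system; a minimal sketch."""
    S, E, I, R = state
    new_inf = beta * S * I / N
    return (S - dt * new_inf,
            E + dt * (new_inf - sigma * E),
            I + dt * (sigma * E - gamma * I),
            R + dt * gamma * I)

# Illustrative parameters: R0 = beta/gamma = 2.8, 5-day incubation, 7-day infection.
N, beta, sigma, gamma = 5.8e6, 0.4, 1 / 5, 1 / 7
state = (N - 100.0, 0.0, 100.0, 0.0)
dt, t, t_peak, I_peak = 0.1, 0.0, 0.0, 0.0
for _ in range(int(300 / dt)):
    state = seir_step(state, dt, beta, sigma, gamma, N)
    t += dt
    if state[2] > I_peak:               # track the peak of the I compartment
        t_peak, I_peak = t, state[2]

print(t_peak, I_peak)                   # timing and size of the epidemic peak
assert abs(sum(state) - N) < 1.0        # S + E + I + R = N is conserved
```

The map (R_0, τ_inc, τ_inf) → (t_peak, I_peak) evaluated this way is exactly the forward model F of (4.2) on which the quadrature nodes are placed.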
In an epidemic the number of infected individuals will rise rapidly as each infected individual will infect several others. However, as the population becomes saturated with infected and recovered individuals (henceforth assumed immune to reinfection), the chance of a meeting between an infected and a susceptible will decrease. This effect is often referred to as herd immunity; see [8] for further reading on the concept. Hence, the epidemic is expected to peak at some time t_peak where the number of infectious individuals is at its highest.
We consider in this example each parameter β, σ and γ uncertain. The uncertainties are given as uncertainty in the reproduction number R_0 = β/γ, in the duration in the exposed compartment τ_inc = σ^(-1), and in the duration in the infectious compartment τ_inf = γ^(-1). As these are positive quantities we assume each is log-normally distributed. We thus consider the map

F : (R_0, τ_inc, τ_inf) → (t_peak, I_peak),   (4.2)

where I_peak := I(t_peak). As the log-normal distribution is simply a transformation of the normal distribution, it is a simple task to transform the quadrature nodes accordingly. We can thus apply the theory from the previous sections to propagate the uncertainty in the arguments of F to the output using only few evaluations. As the output quantities are known to be positive as well, we assume log-normal distributions for these too and fit them by the computed means and variances.

The results of the case can be seen in Figures 3, 4 and 5, where the hyper-parameters were chosen based on early reported numbers for Covid-19 [11,14,22] with some level of contact restriction assumed. However, we stress that for the purpose of demonstrating the technique the specific values are of less importance. In Figure 6 we show the samples explored in image space by one of the MCMC runs and the corresponding marginalized distributions, together with how the mean and standard deviations stabilize across the 10,000 iterations for all 8 runs. Figure 3 illustrates all samples in the image space; their colors and sizes have been scaled by the corresponding quadrature weights w_i, though a minimum size was fixed for visibility. Figure 4 shows the log-normal distributions for the peak time and the magnitude of the peak in infectious individuals. In Figure 5 the corresponding Sobol indices are illustrated for the selected values, showing that the variance of R_0 is the primary concern if we want to narrow down the peak time further.
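The moment-matching fit of a log-normal distribution mentioned above can be sketched as follows; the example mean and variance are placeholders, not the computed case results:

```python
import math

def lognormal_from_moments(mean, var):
    """Parameters (mu, sigma) of a log-normal matching the given mean/variance."""
    sigma2 = math.log(1.0 + var / mean**2)
    mu = math.log(mean) - 0.5 * sigma2
    return mu, math.sqrt(sigma2)

# Illustrative: fit a log-normal to a peak time with mean 50 and variance 25.
mu, sigma = lognormal_from_moments(50.0, 25.0)

# Round-trip check against the standard log-normal moment formulas.
m = math.exp(mu + sigma**2 / 2)
v = (math.exp(sigma**2) - 1) * math.exp(2 * mu + sigma**2)
print(m, v)  # recovers 50.0 and 25.0
```

Since mean and variance are directly available from the gPC coefficients, this closed-form fit needs no extra model evaluations.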
We observe that the 27 model evaluations provide information similar to that of the 1000 model evaluations, showing that this problem is handled well already at this low number of evaluations. Furthermore, we observe that many more iterations are needed by the MCMC method to reach similar levels of precision.

Case 2: Superspreaders
Superspreaders are infected individuals who during an epidemic are responsible for the infection of a significantly larger number of individuals than the observed average. Historical observations of diseases have shown that incidents with superspreaders play an important role in the development of epidemics [16].
Various aspects, both physiological and sociodynamic in nature, may cause an individual to become a superspreader. An individual exhaling an increased amount of pathogens relative to the norm could cause a significantly larger number of infections during regular social interactions compared to a "normal" infectious individual. But it could also simply be participation in a large-scale social event, for instance a party, concert or festival, where physical distancing may be very low and the number of contacts proportionally higher, resulting in mass infection.
Inspired by [19], we attempt in this case to replicate some of their results in a computationally fast way using a compartmental model. We employ the structure from their agent based model to construct the compartmental model depicted in the diagram in Figure 7. In the diagram we have the following compartments: first, as in the former model, susceptible and exposed. Then there are asymptomatic infectious I_1 and symptomatic infectious I_2; we note that this is a legacy structure from [19], where it is used mostly for bookkeeping, and neither there nor here is behaviour assumed to differ between the two compartments. We have a (W)ait compartment, which signifies a short period where the individual either is so sick that they have isolated themselves, so as not to infect anyone before admission to the hospital, or is in non-infecting recovery. There is a branch with (H)ospitalized and (C)ritical care, before all paths end in the recovered/removed compartment.
The parameter choices are taken as in [19], but we restate them for completeness in Table 1. The parameters z_1 and z_2 are computed from the hospitalization rates listed in Table 1 of the Supplementary material of [19], which come from Norwegian data. We present the data in Table 2, where d_i, h_i and κ_i are the data rows; from these quantities z_1 and z_2 are computed. (In Table 2, 0.1% was added to one entry since the numbers in the source table did not actually add up to 100%, and a hospitalization probability that was 0 in the source was changed to a small but strictly positive value, as it is known that some children do end up hospitalized.) The expression for the last parameter β(t) is given in (4.6) and (4.5); the modelling approach is covered in Section 4.2.1.

Figure 6. The density (A) and the marginalized densities (B) explored by one run of 10,000 samples of the MCMC method. In (C) we show the cumulative mean and standard deviation for t_peak, meaning the mean and standard deviation of all the samples up to the particular iteration.
Figure 7. Illustration of the expanded compartmental model taking into account superspreaders; based on [19].
Table 1. Superspreader model parameters [19]. Units are in [days].
Table 2. Population distribution and hospitalization probability data [19].

Modelling varying infectivity
We model superspreaders by assuming a distribution of infectivity amongst individuals in the population. Consider the normalized population [0, 1] and assign to each a ∈ [0, 1] an infectivity β(a). That is, if U is some ordered set of possible infectivities and ψ is a probability measure on U with cumulative distribution function Ψ, then β(·) = (1 − Ψ)^(-1)(·). We assume that the population is ordered by decreasing infectivity, i.e. a < a′ implies β(a) ≥ β(a′). Assuming a well-mixed population such that infection is equally likely to hit any individual a ∈ [0, 1], we may readily calculate the contribution to infection caused by the fraction p most infectious individuals, C_p. In a SIR model the number of people getting infected at a time t is commonly, as seen in (4.1a), of the form

β̄ S(t) I(t) / N,

where I(t) is the number of infected individuals, S(t) the number of susceptible individuals, and N the population size. Here β̄ is the average infectivity, β̄ = ∫₀¹ β(a) da. The p most infectious individuals are then contributing the fraction

C_p = (1/β̄) ∫₀ᵖ β(a) da.   (4.3)

This quantity informs the choice of probability measure ψ if one works under the scheme that superspreaders form some fraction p of the population and are responsible for infecting the fraction C_p of the population. Fixing these two quantities limits the admissible measures ψ. This way of modelling a variation in infectivity also admits fairly easy extensions to control scenarios where rules may change behavioural dynamics over time. We may for instance consider a time-dependent infectivity

β(a, t) = φ(β(a), a, t),

where φ(b, a, t) is some restriction function describing a change in behaviour over time. In practice, however, φ(b, a, t) ≡ φ(b, t) will typically be independent of a, as we cannot feasibly identify an individual as more infectious than another until after the fact, and so we have to make rules that are uniform for everyone.
A simple example could be a strict limitation on how many individuals anyone meets, which could be crudely modelled as

φ(b, a, t) = min(b, c(t)),   (4.4)

where c : R₊ → R₊ is some time-dependent upper bound. Of course, if φ(b, a, t) ≡ φ(a, t) is independent of b, we may impose any other kind of ordering on the population, e.g. by age, and apply hard restrictions based on that ordering. But then we lose all information from the infectivity β, which might be undesirable.
We take for our model β(a) as a simple piecewise constant function,

β(a) = sA for a ∈ [0, p],  β(a) = s for a ∈ (p, 1].   (4.5)

Here p is the assumed concentration of superspreaders; e.g. if we assume 10% are superspreaders, p = 0.1. sA is the infection rate for superspreaders and s the infection rate for the remaining population. It is an easy calculation that C_p from (4.3) is independent of s, so choosing an assumed pair (p, C_p) determines A, and choosing s then determines the mean rate β̄. We shall model a social restriction as a hard cap on the number of individuals any single person gets to interact with. We model this with a restriction function as in (4.4), i.e.

β(a, t) = min(β(a), c(t)).   (4.6)
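For the piecewise constant β(a), the pair (p, C_p) determines A in closed form: evaluating (4.3) gives C_p = pA / (pA + 1 − p), independent of s, which can be solved for A. A small sketch:

```python
def superspreader_amplification(p, C_p):
    """Amplification A such that the fraction p of most infectious individuals
    causes the fraction C_p of infections, for the piecewise-constant beta(a).
    Derived from C_p = p*A / (p*A + 1 - p), which is independent of s."""
    return C_p * (1 - p) / (p * (1 - C_p))

A = superspreader_amplification(0.1, 0.8)
print(A)  # 36.0: superspreaders are 36x as infectious as the rest

# Sanity check: recompute C_p from A.
p = 0.1
assert abs(p * A / (p * A + 1 - p) - 0.8) < 1e-12
```

So under the (p, C_p) = (0.1, 0.8) assumption used later in the fitting, each superspreader is 36 times as infectious as a non-superspreader.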

Fitting the model using real-world data
From an assumption about the prevalence of superspreaders we first fit β's scaling parameter s and an initial condition I_0 for the epidemic from hospital admissions by day (only the pre-lockdown part of the data set) and an assumption of an unmitigated growth rate of about 23% per day [19]. For the initial condition we fit a number I_0 and assume that S(0) = N − I_0, E(0) = I_0/2, I_1(0) = I_0/3 and I_2(0) = I_0/6. The remaining compartments start at 0. The problem is formulated as a weighted least-squares problem in (s, I_0), where the weights w_0^(-1) = 0.23² individuals²/day² and w_1^(-1) = ‖H_ssi‖² individuals² balance the widely different scales of the two terms, and H_ssi is the data set of newly admitted hospitalized patients by day. G computes the average initial daily growth rate and H computes the newly admitted hospitalizations from the model. By the subscript t < t_1 we mean only the part of the data corresponding to this constraint; t_1 = 16, before which H is independent of our restriction function. We chose α = 0.01.
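The growth-rate part of the fit can be illustrated with a simple log-linear least-squares regression on synthetic admission counts (the data and all values here are illustrative, not the Danish data set):

```python
import math

# Synthetic daily counts growing 23% per day, mimicking unmitigated growth.
days = list(range(14))
counts = [5.0 * 1.23 ** t for t in days]

# Least-squares fit of log(counts) = log(c0) + r*t; r is the daily growth rate.
logs = [math.log(c) for c in counts]
n = len(days)
tbar, ybar = sum(days) / n, sum(logs) / n
r = sum((t - tbar) * (y - ybar) for t, y in zip(days, logs)) / \
    sum((t - tbar) ** 2 for t in days)

growth = math.exp(r) - 1
print(growth)  # recovers the 23% daily growth
```

Regressing on the logarithm linearizes the exponential early phase, which is why a growth-rate target such as G(·) = 0.23 can be imposed with a simple quadratic penalty.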
Using the now determined quantities (s, I_0) we fit the three restriction levels in c(t) from the whole data set of hospital admissions by day. Fixing (t_1, t_2, t_3) = (16, 46, 86), we solve a corresponding least-squares problem in (c_1, c_2, c_3), where w_2 is some arbitrary large number so that the last term forms a soft constraint enforcing c_1 < c_2 < c_3. Assuming (p, C_p) = (1/10, 4/5), i.e. that only 10% contribute 80% of all infections, the above fitting schemes produced the fitted values; the fractional number of people among them is due to the continuous nature of the model. These results depended slightly on the choice of initial condition, but the differences were on the order of 10^(-3). The simulation using these fitted values may be viewed in Figure 8. We note that the difference between restriction levels c_2 and c_3 is almost insignificant. There are likely various reasons for this. In the agent based model of [19] a social structure is incorporated, which allows them to close particular sectors of society. In comparison, our compartmental model has no such structure and thus cannot model that only a particular part of society is closed. A possible explanation might be that the phase 2 reopening did not really affect the overall amount of contacts for people. This could also be due to a lack of data, as small variations in c_3 did not change the value of the optimization functional significantly.

Adding uncertainty
Assuming a level of uncertainty in the fitted restriction levels, we may compute confidence intervals for the model. In Figure 9 we assume normally distributed priors for the restriction levels with means c_i, i = 1, 2, 3, and relatively scaled standard deviations of 0.1 c_i, i = 1, 2, 3. Assuming the posteriors may be approximated reasonably by truncated normal distributions, 95% confidence intervals are visualized. We see that with uncertainty of this level on the parameters, the development is expected to keep declining.
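The 95% confidence intervals from a truncated normal approximation can be computed by rescaling the normal quantile function to the untruncated mass; a sketch using Python's standard library (the values are illustrative, not the fitted c_i):

```python
from statistics import NormalDist

def truncated_normal_ci(mu, sigma, level=0.95, lower=0.0):
    """Central confidence interval of a normal(mu, sigma) truncated below at `lower`."""
    Z = NormalDist(mu, sigma)
    F0 = Z.cdf(lower)  # probability mass cut off by the truncation
    quantile = lambda u: Z.inv_cdf(F0 + u * (1.0 - F0))
    tail = (1.0 - level) / 2.0
    return quantile(tail), quantile(1.0 - tail)

lo, hi = truncated_normal_ci(mu=10.0, sigma=1.0)
print(lo, hi)  # far from the truncation point this is close to mu +/- 1.96*sigma
```

Because the fitted restriction levels are several standard deviations above zero, the truncation barely moves the interval; it only matters when mass would otherwise fall below zero.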
The evolution of the Sobol indices over time is drawn up in Figure 10, illustrating the variance contributions from the parameters, which show, as expected, how c_1 is the most important initially but is gradually overtaken by the later restriction levels.

Conclusion
In this paper, we have summarized modern methods for efficient computational uncertainty quantification with the key elements of generalized Polynomial Chaos and the related Polynomial Chaos Expansions. Moreover, we have introduced the related variance-based sensitivity analysis in terms of Sobol indices with applications to the modelling of the spread of diseases. The techniques have been applied to epidemic models of SIR type based on official Danish Covid-19 data. Our results have demonstrated that the chosen tools are well suited in this setting and reproduce conclusions similar to those of the more computationally expensive model used e.g. in [19].
Taking uncertainty into account in predictions from epidemic models is an important means to assess possible scenarios and the influence of uncertainty on the predictions. Doing this in a transparent and efficient way should be a priority. We recommend that future studies using similar models consider the presented techniques as a tool for quantifying the uncertainty and sensitivity in their computations.

Figure 9. A superspreader case simulation using fitted data, taking uncertainty into account. Hospitalized (H), critical care (C) and newly hospitalized are shown with confidence intervals. The yellow dots are new hospital admissions by day from the Danish authorities.
Figure 10. The Sobol indices as an evolution over time in the superspreader case simulation using fitted data with added uncertainty. At each time the color distribution above determines the part of the variance contributed by each parameter; the corresponding parameters are listed in the legend box. The c_i c_j parts are the joint variance contributions of c_i and c_j; as is visible, the joint parts are mostly negligible here.