Calibrating with Multiple Criteria: A Demonstration of Dominance

Abstract: Pattern-oriented modelling (POM) is an approach to calibration or validation that assesses a model using multiple weak patterns. We extend the concept of POM, using dominance to objectively identify the best parameter candidates. The TELL ME agent-based model is used to demonstrate the approach. This model simulates personal decisions to adopt protective behaviour during an influenza epidemic. The model fit is assessed by the size and timing of maximum behaviour adoption, as well as the more usual criterion of minimising mean squared error between actual and estimated behaviour. The rigorous approach to calibration supported explicit trading off between these criteria, and ultimately demonstrated that there were significant flaws in the model structure.


Introduction
Agent-Based Models (ABMs) simulate "unique and autonomous entities that usually interact with each other and their environment locally" (Railsback & Grimm , p. ). Such models are therefore designed at the micro-scale, with rules to guide the actions of the simulated individuals based on their specific characteristics and situation. In contrast, much of the interesting behaviour of the model occurs at the macro-level.

This scale mismatch complicates model calibration. Parameters for those micro-scale rules may be unmeasurable, but the aggregated effect of the decisions is routinely collected in data about the operation of the system being modelled. With a large number of parameters, it may be relatively easy to obtain an apparently good fit overall that nevertheless hides structural invalidity or other problems. One way to make the calibration more robust is to assess model output against multiple criteria selected for their diversity, referred to as pattern-oriented modelling (Wiegand et al. ; Railsback & Grimm ). Doing so, however, introduces the problem of defining an overall 'best fit', since different sets of parameter values may generate model output that meets different criteria.

One approach is to establish an overall objective function that combines each of the criteria in some way. For example, the criteria could be weighted and the model calibrated to best fit the weighted combination. However, this approach introduces an arbitrary function to combine the criteria (such as additional parameters in the form of criteria weights), typically with only limited knowledge of what is being traded away. Another method uses stakeholders or other experts to assess the reasonableness of the model's behaviour (Moss ).
Categorical calibration or filtering (Wiegand et al. ; Railsback & Grimm ) uses acceptance thresholds for each criterion and retains all parameter sets that meet all the thresholds for further consideration. However, this is inefficient. If any threshold is set too high, a parameter set could be rejected that is an excellent fit on all other criteria. On the other hand, setting a lower threshold passes too many potential solutions to be easily compared.
Figure : Definition of dominance (two dimensions). Point D is dominated by point A because point A is better against all criteria than point D. That is, regardless of the relative importance of the two criteria, point A is always preferred over point D. Similarly, point E is dominated by both point B and point C. But point E is not dominated by point A; if criterion was much more important than criterion , it may be appropriate to select E for the small improvement in criterion at the expense of the loss in criterion . The shaded area indicates the parameter space that is dominated by any of the three points A, B or C. The Pareto efficient front is the set of points that are not dominated by any other, in this case any points on the dashed line specified by the points A, B and C. Along this line, improvements in one dimension can only be achieved at the expense of at least one other criterion; for example, moving from A to B improves criterion but worsens criterion . With more dimensions, the Pareto front is given by a piecewise hyperplane, but is also the set of points that appear on the front of any pair of dimensions, regardless of whether they are dominated in other pairs of dimensions.

This paper instead presents the dominance approach, which does not arbitrarily prioritise criteria or set subjective thresholds. Instead, dominance is used to identify all the parameter sets that are on the Pareto efficient frontier. These are the parameter sets that are objectively best, where an improvement in one criterion can only be made by reducing the fit for another criterion (see Figure ). While this approach is well established in operations research for multi-criteria decision making or optimisation (Müssel et al. ), it is less well known in social simulation (with some exceptions, such as Schmitt et al. ).
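The dominance relation can be illustrated with a minimal sketch. It assumes that every criterion is expressed as an error to be minimised, and the five points are hypothetical values chosen only to mirror the relationships between points A to E described in the figure caption; the function names are illustrative and not taken from the TELL ME code.

```python
from typing import Dict, List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b on every criterion and strictly
    better on at least one (assumption: lower values are better)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points: Dict[str, Sequence[float]]) -> List[str]:
    """Names of the points that are not dominated by any other point."""
    return [
        name
        for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    ]


# Hypothetical (criterion 1 error, criterion 2 error) pairs.
candidates = {
    "A": (1.0, 5.0),
    "B": (2.0, 3.0),
    "C": (4.0, 1.0),
    "D": (2.0, 6.0),  # dominated by A
    "E": (4.5, 3.5),  # dominated by B and C, but not by A
}
front = pareto_front(candidates)
print(front)
```

No weights or thresholds appear anywhere in the sketch, which is the property that motivates the approach.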
The method is described using a case study: calibrating the TELL ME model concerning protective behaviour in response to an influenza epidemic. This paper first presents the model structure and the parameters required to operationalise the links between attitude, behaviour and epidemic spread. The description focuses on the necessary background to understand the calibration process presented in the following sections. The approach to setting parameter values is then described, with the results of that process and conclusions following.

Case Study Description: TELL ME Model
The European funded TELL ME project concerned communication before, during and after an influenza pandemic. Ending in January , it was intended to assist health agencies to develop communication plans that encourage people to adopt appropriate behaviour to reduce influenza transmission. One project output was a prototype ABM, to explore the potential of such models to assist communication planning. The agents in that model represent people making decisions about protective behaviour (such as vaccination or hand hygiene) in light of personal attitudes, norms and epidemic risk.

The core of the TELL ME model is individual agents making decisions about whether to adopt behaviour to reduce their chance of becoming infected with influenza. Protective behaviour is adopted (or dropped) by an agent if the weighted average of attitude, subjective norms and perception of threat exceeds (or falls below) some threshold.
Each agent is attached to a patch (a location defined by a grid) overlaid on a map of the country in which the epidemic is being simulated. The epidemic is mathematically modelled by the patches; there is no transmission between individual agents. The infectivity at any patch is adjusted for the proportion of local agents who have adopted protective behaviour and the efficacy of that behaviour. In addition, the number of new infections in nearby patches is a key input to each agent's perception of threat. Thus, the agent protective behaviour decisions and the transmission of the epidemic are mutually dependent.
The operationalisation of this model design is described briefly below. This description focuses on those elements of the model that were calibrated using dominance. The behaviour of the agents is also affected by communication plans, which are input to the model as sets of messages. The communication elements were disabled for calibration purposes due to lack of data, and are therefore not described here.
In each patch or region (r), the value of the transition rate parameter from S to E (β) is reduced in accordance with the behaviour decisions taken by individuals at that patch and the efficacy (E) of the behaviour. The reduced infectivity rate (calculated with Equation ) is used in the transmission equations (Equation ), leading to a lower local incidence. To support a mix of behaviour (and hence different reductions in infectivity between patches), each patch is home to at least ten agents, with greater numbers in those patches that correspond to high population density real world locations.
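A rough sketch of how a patch-level infectivity reduction might feed a compartmental update is given below. The functional forms are assumptions made for illustration only: the reduction is taken to be proportional to the product of the proportion protecting and the efficacy, and the transmission step is a generic daily SEIR update on population proportions. Neither is claimed to reproduce the model's actual equations, and the names `latent_rate` and `gamma` are hypothetical.

```python
from typing import Tuple


def reduced_beta(beta: float, proportion_protecting: float, efficacy: float) -> float:
    """Assumed form: infectivity falls in proportion to the share of local
    agents protecting themselves multiplied by the behaviour's efficacy."""
    return beta * (1.0 - proportion_protecting * efficacy)


def seir_step(
    s: float, e: float, i: float, r: float,
    beta: float, latent_rate: float, gamma: float,
) -> Tuple[float, float, float, float]:
    """One daily step of a generic SEIR model on population proportions
    (an illustrative stand-in for the patch transmission equations)."""
    new_exposed = beta * s * i          # S -> E
    new_infectious = latent_rate * e    # E -> I
    new_recovered = gamma * i           # I -> R
    return (
        s - new_exposed,
        e + new_exposed - new_infectious,
        i + new_infectious - new_recovered,
        r + new_recovered,
    )


# Hypothetical patch: half the agents protect, behaviour is 50% effective.
beta_r = reduced_beta(beta=0.5, proportion_protecting=0.5, efficacy=0.5)
state = seir_step(0.99, 0.0, 0.01, 0.0, beta=beta_r, latent_rate=0.5, gamma=0.25)
print(beta_r, state)
```

With efficacy set to zero the function returns the unreduced rate, so behaviour has no influence on the epidemic in this sketch.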
To allow the epidemic to spread, a proportion of estimated new exposures for a region are actually created in neighbouring patches to simulate travel. This requires two additional parameters: the proportion of new infections created at other locations, and the split between neighbouring or longer distance patches.

Operationalising decisions about protective behaviour
The agents' behaviour decisions are based on three psychological models: the Theory of Planned Behavior (Ajzen ), Health Belief Model (Rosenstock ), and Protection Motivation Theory (Maddux & Rogers ). The key factors of attitude, norms and threat from these models were used as the inputs for agent behaviour. The agent compares the weighted average of the three inputs to a threshold (Equation ) for each type of behaviour (vaccination or other protective). If the value is higher, the agent adopts the non-vaccination behaviour or seeks vaccination, and non-vaccination behaviour ceases once the value falls below the relevant threshold. Vaccination cannot be dropped. Threat has the same value for both types of behaviour, but attitude, norms, weights and thresholds may be different.
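The decision rule for non-vaccination behaviour can be sketched as follows. This is an illustration rather than the model code: it assumes that all three inputs lie in the range 0 to 1 and that the threat weight is whatever remains after the attitude and norms weights are assigned, and the function and argument names are hypothetical.

```python
def behaviour_score(
    attitude: float, norm: float, threat: float,
    w_attitude: float, w_norm: float,
) -> float:
    """Weighted average of the three inputs.
    Assumption: the three weights sum to one, so threat takes the remainder."""
    w_threat = 1.0 - w_attitude - w_norm
    return w_attitude * attitude + w_norm * norm + w_threat * threat


def update_behaviour(currently_protecting: bool, score: float, threshold: float) -> bool:
    """Adopt above the threshold, drop below it, otherwise keep the current state."""
    if score > threshold:
        return True
    if score < threshold:
        return False
    return currently_protecting


# Hypothetical agent before any local cases have occurred (threat is zero).
score = behaviour_score(attitude=0.8, norm=0.5, threat=0.0, w_attitude=0.4, w_norm=0.3)
print(score, update_behaviour(False, score, threshold=0.45))
```

Vaccination would need a one-way version of `update_behaviour`, since that behaviour cannot be dropped once adopted.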
Attitude is operationalised as a value in the range [ , ], initially selected from a distribution that reflects the broad attitude range of the population. Subjective norms describe how a person believes family, friends and other personally significant people expect them to behave and the extent to which they feel compelled to conform. The norm is operationalised as the proportion of nearby agents who have adopted the behaviour.
Perceived threat (T_t) reflects both susceptibility and severity (Equation ). Following the method of Durham & Casman ( ), susceptibility is modelled with a discounted (δ) cumulative incidence time series. This means that perceived susceptibility will increase as the epidemic spreads but recent new cases (c_t) will impact more strongly than older cases. In contrast to the cited paper, only nearby cases are included in the time series for the TELL ME model, so perceived susceptibility will be higher for the simulated individuals that are close to the new cases than for those further away. Severity is included as a simple 'worry' multiplier (W), and can be interpreted as subjective severity relative to some reference epidemic.
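One way to read the discounted cumulative incidence is as a running total in which the previous total is multiplied by δ before the new cases are added. The sketch below uses that recursion, which is an assumption made for illustration and may differ in detail from the model's equation; the function name and the example series are hypothetical.

```python
from typing import List, Sequence


def perceived_threat(
    nearby_cases: Sequence[float], discount: float, worry: float = 1.0
) -> List[float]:
    """Threat series T_t = W * s_t, with the assumed recursion
    s_t = discount * s_(t-1) + c_t over nearby new cases c_t."""
    series = []
    susceptibility = 0.0
    for cases in nearby_cases:
        # Older cases are discounted once more each tick; new cases count in full.
        susceptibility = discount * susceptibility + cases
        series.append(worry * susceptibility)
    return series


# A single burst of nearby cases fades geometrically when delta = 0.5.
threat = perceived_threat([1.0, 0.0, 0.0], discount=0.5)
print(threat)
```

Under this reading a discount of zero makes threat depend only on the current tick, while a discount of one gives the undiscounted cumulative incidence.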

Calibration Process
From the model structure discussion, it is clear that there are many parameters to be determined. Some may be estimated directly from measurable values in the real world, such as population counts. Ideally, unmeasurable values should be calibrated to optimise some measure of goodness of fit between model results and real world data.
The first phase simplified the model to reduce the number of parameters influencing results. This was done by excluding the communication component and fixing protective behaviour to have no effect. Other values were fixed at values drawn from literature, specifically those that affected the distribution of attitudes and the transmission of the epidemic. The exclusion of some components and setting of other parameters to fixed estimates can be interpreted as a reduction in the dimensions of the parameter space, reducing the scope of the calibration task.
The second phase calibrated the parameters that are central to the model results: those that govern the agents' decisions to adopt or drop protective behaviour as an epidemic progresses (weights in Equation and discount in Equation ). This phase is where dominance was used, to assess parameter sets against three criteria: size and timing of maximum behaviour adoption, as well as the more usual criterion of minimising mean squared error between actual and estimated behaviour.
The model parameters are summarised in Table , together with how they were used in the calibration process.
While the TELL ME model included both vaccination and non-vaccination behaviour, only the latter is reported here because the process was identical. Non-vaccination behaviour was calibrated with various datasets collected during the H N epidemic in Hong Kong. The calibration process is described in more detail in the remainder of this section.

Dimension reduction: protective behaviour
Attitude distribution was based on a study of behaviour during the H N epidemic in Hong Kong (Cowling et al. ), which included four questions about hand hygiene: covering mouth when coughing or sneezing, washing hands, using liquid soap, and avoiding directly touching common objects such as door knobs. A triangular distribution over the interval [ , ] with mode of . was used to allocate attitude scores in the model as an approximation to these data.
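Allocating attitude scores from a triangular distribution is straightforward with the Python standard library. In the sketch below the interval and mode are hypothetical placeholders rather than the values used in the model, and the function name is illustrative.

```python
import random
from typing import List, Optional


def sample_attitudes(
    n: int, low: float, high: float, mode: float, seed: Optional[int] = None
) -> List[float]:
    """Draw n attitude scores from a triangular distribution."""
    rng = random.Random(seed)
    return [rng.triangular(low, high, mode) for _ in range(n)]


# Placeholder interval and mode, chosen only to show a skew towards positive attitudes.
attitudes = sample_attitudes(1000, low=0.0, high=1.0, mode=0.75, seed=1)
mean_attitude = sum(attitudes) / len(attitudes)
print(round(mean_attitude, 2))
```

The mean of a triangular distribution is (low + high + mode) / 3, which gives a quick check that the sampled population has the intended overall attitude.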
The efficacy of protective behaviour (E) was set to zero (ineffective) during calibration. That is, agents respond to the changing epidemic situation in their decision processes, but do not influence that epidemic. This ensures simulations using the same random seeds will generate an identical epidemic regardless of behaviour adoption, allowing simulated behaviour to respond to the relevant incidence levels.

Dimension reduction: epidemic transmission
Several parameters that influence epidemic spread were estimated from data. These are the various transition rates between epidemic states, the structure of the population in which the epidemic is occurring, and the mobility of that population. Calibration experiments were run with R_0 = 1.5 (the lowest value for which an epidemic could be reliably initiated), latency period of days (European Centre for Disease Prevention and Control ), and infectious period of days (Fielding et al. ).
The population at each patch was calculated from population densities taken from GIS datasets of projected population density for (obtained from Population Density Grid Future collection held by Center for International Earth Science Information Network - CIESIN - Columbia University & Centro Internacional de Agricultura Tropical - CIAT ). These densities were adjusted to match the raster resolution to the NetLogo patch size and then total population normalised to the forecast national population for (United Nations, Department of Economic and Social Affairs, Population Division ).
As epidemic processes (Equation ) occur independently within each patch, the model explicitly allocates a proportion of the new infections created by a patch to other patches to represent spreading of the epidemic due to travel. The proportion of new infections allocated to other patches was set at ., with . allocated to immediate neighbours and . allocated randomly to patches weighted by population counts. These values provide a qualitatively reasonable pattern of epidemic spread.

Dominance analysis of behaviour parameters
Four parameters are directly involved in agent adoption of protective behaviour: weights for attitude and norms, the discount applied for the cumulative incidence, and the threshold score for adoption (ω_A, ω_N, δ, and B in Equations and ). Briefly, multiple simulations were run while systematically varying these parameters to generate a behaviour adoption curve. That curve was assessed against empirical data on three criteria, and dominance analysis was used to identify the best fit candidates.
Broadly, the empirical behaviour data has an initial population proportion of approximately %, which rises to % and then falls below the starting level. This rise and fall was considered the key qualitative feature of the data and two aspects were included: timing and size of the bump. The three criteria to select the best fit parameter sets were:
• mean squared error between prediction and actual over all points in the data series (MSE);
• the difference in values between the maximum predicted adoption proportion and maximum actual adoption proportion (∆Max); and
• the number of ticks (days) between the timing of the maximum predicted adoption and maximum actual adoption (∆When).

Parameter                  Range
Attitude weight (ω_A)      . by . to .
Norms weight (ω_N)         . by . to .
Incidence discount (δ)     . by . to .
Behaviour threshold (B)    . by . to .
Table : Parameter values tested in the calibration process.
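The three criteria can be computed from a simulated adoption curve and a sparse set of observations, as in the sketch below. It assumes that both difference criteria are taken as absolute values, that the predicted maximum is searched over every tick rather than only the observed ones, and that the observations are keyed by tick; the data are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def fit_criteria(
    predicted: Sequence[float], actual: Dict[int, float]
) -> Tuple[float, float, int]:
    """Return (MSE, delta_max, delta_when) for one adoption curve.

    predicted: adoption proportion at every tick of the simulation.
    actual: observed adoption proportion, keyed by tick.
    """
    mse = sum((predicted[t] - v) ** 2 for t, v in actual.items()) / len(actual)
    delta_max = abs(max(predicted) - max(actual.values()))
    tick_of_predicted_max = max(range(len(predicted)), key=predicted.__getitem__)
    tick_of_actual_max = max(actual, key=actual.get)
    delta_when = abs(tick_of_predicted_max - tick_of_actual_max)
    return mse, delta_max, delta_when


# Hypothetical curve and observations.
predicted_curve = [0.60, 0.70, 0.80, 0.70, 0.60]
observed = {0: 0.60, 1: 0.75, 3: 0.65}
mse, delta_max, delta_when = fit_criteria(predicted_curve, observed)
print(mse, delta_max, delta_when)
```

Each parameter set then yields one triple of values, and those triples are the input to the dominance analysis.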
Experiments and dominance analysis were performed with the Sandtable Model Foundry (Sandtable ). This proprietary system was used to manage several aspects of the simulation in a single pass: sampling the parameter space, submitting the simulations in a distributed computing environment via the NetLogo API, comparing the result to the specified criteria, and calculating the dominance fronts. As each run takes several minutes, the sampling and distributed computing environment made it feasible to comprehensively explore the parameter space in a reasonable time and the within-system dominance calculation simplified analysis.

Simulations were run with parameter values selected from the ranges at Table , chosen so as to require a contribution by attitude (ω_A ≥ 0.2) to support heterogeneity of behaviour between agents on a single patch. Parameter combinations were excluded if they did not include contributions by all three influencing factors of attitude, norms (ω_N ≥ 0.1), and threat (ω_A + ω_N ≤ 0.9). The parameter space was sampled using the Latin Hypercube method, with combinations selected.
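A standard-library sketch of Latin Hypercube sampling followed by the exclusion rules is given below. It is not the Sandtable implementation: the parameter bounds are hypothetical, values are drawn from continuous ranges rather than a stepped grid, and the exclusions are applied by simple filtering after sampling, which means the retained subset no longer covers every stratum.

```python
import random
from typing import Dict, List, Optional, Tuple


def latin_hypercube(
    n_samples: int, bounds: Dict[str, Tuple[float, float]], seed: Optional[int] = None
) -> List[Dict[str, float]]:
    """One value per equal-width stratum for each parameter, with the
    strata paired at random across parameters."""
    rng = random.Random(seed)
    columns = {}
    for name, (low, high) in bounds.items():
        points = [
            low + (high - low) * (k + rng.random()) / n_samples
            for k in range(n_samples)
        ]
        rng.shuffle(points)  # decouple the strata of different parameters
        columns[name] = points
    return [{name: columns[name][i] for name in bounds} for i in range(n_samples)]


def admissible(p: Dict[str, float]) -> bool:
    """Exclusion rules: all three factors must contribute to the decision."""
    return (
        p["w_attitude"] >= 0.2
        and p["w_norm"] >= 0.1
        and p["w_attitude"] + p["w_norm"] <= 0.9
    )


# Hypothetical bounds; the tested ranges are those listed in the table.
bounds = {
    "w_attitude": (0.2, 0.9),
    "w_norm": (0.1, 0.7),
    "discount": (0.0, 1.0),
    "threshold": (0.0, 1.0),
}
samples = latin_hypercube(200, bounds, seed=42)
kept = [p for p in samples if admissible(p)]
print(len(samples), len(kept))
```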
Ten simulations were run for each parameter combination. Preliminary testing with repetitions indicated that simulations using the same parameters could generate epidemics that differ substantially on when they 'take off', but they had similar shapes once started, and hence similar behaviour adoption curves (not specifically shown, but visible in Figure ). Ten of the seeds were retained for use with the calibration simulations. These random seeds generated epidemics with known peaks regardless of the behaviour parameter combination, as the generated epidemic was not affected by protective behaviour (since efficacy is set to zero).

The behaviour curves from the simulations were centred on the timestep of the epidemic peak and averaged. The average curve was compared to the (centred) data points of the Hong Kong hand washing dataset (Cowling et al. , supplementary information) for calculation of the three fit criteria.
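Centring and averaging can be sketched as follows. The sketch assumes that the epidemic peak is the tick of maximum incidence and that the average is taken only over offsets present in every run; both choices and all names are illustrative, and the two runs are hypothetical.

```python
from typing import Dict, List, Sequence


def centre_on_peak(curve: Sequence[float], incidence: Sequence[float]) -> Dict[int, float]:
    """Re-index a behaviour curve so that offset 0 is the tick of peak incidence."""
    peak_tick = max(range(len(incidence)), key=incidence.__getitem__)
    return {tick - peak_tick: value for tick, value in enumerate(curve)}


def average_centred(curves: List[Dict[int, float]]) -> Dict[int, float]:
    """Average the curves over the offsets that every run covers."""
    common = set.intersection(*(set(c) for c in curves))
    return {k: sum(c[k] for c in curves) / len(curves) for k in sorted(common)}


# Two hypothetical runs whose epidemics take off one tick apart.
run_1 = centre_on_peak([0.6, 0.7, 0.8, 0.7], incidence=[1, 5, 9, 4])
run_2 = centre_on_peak([0.6, 0.8, 0.7, 0.6], incidence=[2, 9, 3, 1])
average = average_centred([run_1, run_2])
print(average)
```

Aligning on the peak removes the run-to-run variation in when the epidemic takes off, so the average reflects the shape of the behaviour response rather than its timing in simulation ticks.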
Parallel plot analysis was used as an exploratory tool. This is an interactive technique using parallel coordinates (Inselberg ; Chang ) to simultaneously show the full set of model parameters and the criteria metrics. That is, simulation runs can be filtered with specific values or ranges of one or more of the input parameters or difference from criteria.
Dominance analysis was used to identify the best fit candidate parameter sets. This technique assigns each parameter set to a dominance front (using the algorithm of Deb et al. ). The first front is the Pareto efficient frontier, where any improvement in the fit for one criterion would decrease the fit against at least one of the other criteria (Figure ). The second front would be the Pareto efficient frontier if all the first-front parameter sets were removed from the comparison, and so on for higher front values until all parameter sets are allocated a front number.
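The allocation of front numbers can be illustrated by repeatedly peeling off the non-dominated sets. This simple peeling loop is a stand-in with the same result as, but not the bookkeeping of, the faster sorting algorithm cited above. Numbering fronts from 1 and treating every criterion as an error to minimise are assumptions of the sketch, and the scores are hypothetical.

```python
from typing import Dict, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """a is no worse on every criterion and strictly better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def assign_fronts(scores: Dict[str, Sequence[float]]) -> Dict[str, int]:
    """Allocate each parameter set to a dominance front (1 = Pareto frontier)."""
    remaining = dict(scores)
    fronts: Dict[str, int] = {}
    level = 1
    while remaining:
        # Sets not dominated by anything still in the comparison.
        current = [
            k for k, p in remaining.items()
            if not any(dominates(q, p) for j, q in remaining.items() if j != k)
        ]
        for k in current:
            fronts[k] = level
            del remaining[k]
        level += 1
    return fronts


# Hypothetical (MSE, delta_max, delta_when) triples for four parameter sets.
scores = {
    "s1": (0.01, 0.05, 3),
    "s2": (0.02, 0.01, 5),
    "s3": (0.03, 0.06, 6),
    "s4": (0.04, 0.07, 7),
}
fronts = assign_fronts(scores)
print(fronts)
```

The loop always terminates because any finite set of candidates contains at least one member that nothing else dominates.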

Results
The parameter sets that are not dominated are those on the Pareto efficient frontier (the first front). These are described at Table with their performance against the three criteria. By definition, for all other parameter sets, there is at least one on the frontier that is a better fit on at least one criterion and at least as good a fit on all others. Thus, these are the objectively best candidates.
The choice between these for the best fit overall is subjective, trading performance in one criterion against performance in the others and also adding other factors not captured in the criteria. Two methods were used to assist with that choice: quantitative distance from best fit criteria and qualitative fit of behaviour curves.
The fit for all tested parameter sets is displayed at Figure , with the non-dominated (first front) candidates marked in red and labelled with the set number from Table . In subfigure (d) (the relevant section of (c) expanded), parameter sets and achieve a much closer maximum adoption compared to sets and , with only a small loss in the mean squared error. While the same analysis could have been performed by examining Table directly, the visualisation allows fast comparison, even with a larger number of criteria.
Table : Best fit parameter sets and their assessment.
These best fit candidates are also coloured red in the parallel coordinate analysis (see Figure ). This revealed that good fit parameter sets existed throughout the tested parameter space for the weights and discount, but that the threshold should not exceed . . The main benefit of this analysis, however, is interactive. For example, it can provide a visual method of pattern-oriented modelling filtering, by adjusting ranges on the criteria results and displaying the parameter values of the simulations that survive.
For the qualitative visualisation, fifty simulations were run using the NetLogo BehaviorSpace tool (Wilensky ) for each of the non-dominated parameter sets. The average adoption curve is shown in Figure . Only sets to display the appropriate pattern of behaviour, with approximately two thirds of the population adopting the behaviour before the start of the epidemic followed by an increase and then return to a similar level once the epidemic has passed. An inspection of Table shows that the mean squared error is similar for all six, but parameter sets and also have a good match in the estimated maximum adoption level, supporting the selection of either of these as the best fit.

Discussion
This paper describes a detailed calibration process using the prototype TELL ME model as a case study. The model is complicated, with many components and parameters to reflect policy makers' understanding of their planning environment. It is also complex, with model behaviour shaped by two types of interactions. Personal decisions about protective behaviour affect epidemic progress, which influences perceptions of threat and hence personal decisions. Behaviour decisions of agents are also directly influenced by the decisions of nearby agents, through their perception of norms.
The calibration process first reduced the dimensions of the parameter space by setting epidemic parameters, population density and attitude distribution to values drawn from the literature. Some other parameters were set to values that removed their influence in the model (notably behaviour efficacy and those associated with communication).
This reduced the parameters required to calibrate the model to only four: attitude weight, norms weight, incidence discount and adoption threshold. These parameters control the central process of the model: adoption of protective behaviour in response to an epidemic. With only limited empirical information about behaviour throughout an epidemic, we used pattern-oriented modelling and attempted to calibrate against three weak signals: timing of the behaviour peak (compared to the epidemic peak), maximum level of protective behaviour, and minimising the mean square difference between the simulation estimate and measured behaviour level.
Having three assessment criteria opens the question as to how to compare the runs where they have different rankings across criteria. The standard approach is to set acceptance thresholds for each criterion (Railsback & Grimm ) and then select from only those that pass all. However, this is inefficient: if thresholds are set low enough to pass simulations that are generally excellent but are slightly less fit on one criterion, then the thresholds also allow through any simulation that is slightly less fit on all criteria. Instead, we have used the concept of dominance to identify the objectively best parameter sets; for any excluded simulation, there is at least one member of the dominant candidates that is better on at least one criterion and no worse on all others. Additional criteria were used to choose between these objectively good candidates, determining what to give up in order to achieve the best overall fit.

There is little similarity in the non-dominated parameter sets. Very different parameters can achieve similar outcomes (for example, sets and ), and parameter values in the best fit sets covered a broad range of values. This reflects the interdependence between the parameters and emphasises the difficulties in calibrating the TELL ME model: it would not have been possible to identify these candidates by tuning parameters individually.
The rigorous calibration process was instrumental in detecting structural problems with the model. In particular, the prototype was unable to generate results with a behaviour peak earlier than the epidemic peak, in conflict with the empirical results for hand hygiene during the Hong Kong H N epidemic. A reasonable fit could have been achieved against a minimum mean squared error single criterion, but assessing against multiple criteria highlighted the timing weakness.
Further consideration of the model rules makes it clear that this is a structural or theoretical gap rather than a failure in calibration. As attitude, weights and the threshold are fixed, change in behaviour arises from changes in the norms or perceived threat. The attitude weight is instrumental in setting the proportion adopted in the absence of an epidemic, but plays no part in behaviour change as the attitudes of agents are constant. As the epidemic nears an agent, incidence increases near the agent, which also increases perceived threat and may trigger adoption. This may also trigger a cascade through the norms (proportion of visible agents who are protecting themselves) component. However, the threat component of the behaviour decision (Equation ) can only respond to an epidemic, not anticipate it, and the norms component can only accelerate adoption or delay abandoning it. Therefore, regardless of parameter values, the simulation is unable to generate a pattern with a behaviour curve peak before the epidemic peak.

Conclusion
Ultimately, the TELL ME ABM could not be calibrated adequately for policy assessment. That is, the best fit parameter set was used as the model default values, but the simulation did not produce realistic model behaviour. For the purposes of the TELL ME project, this outcome was disappointing but not unexpected. The ABM was a prototype intended to identify the extent to which such a model could be developed for planning purposes. The attempt highlighted both the limited empirical information about behaviour during an epidemic and the absence of information about the effect of communication. Relevant behavioural information must be collected if a full planning model is to be developed in the future.
In contrast, the use of dominance was successful in identifying candidate parameter sets that are objectively best against several competing criteria. Selection between these candidates was then relatively simple as only a limited number needed to be considered. Further, the rigorous process highlighted structural problems in the model as the desired timing of the behaviour peak could not be achieved while also achieving good performance in other criteria.

Notes
This research has received funding from the European Research Council under the European Union's Seventh Framework Programme (FP/ -), Grant Agreement number . The full project title is TELL ME: Transparent communication in Epidemics: Learning Lessons from experience, delivering effective Messages, providing Evidence, with details at http://tellmeproject.eu/.
The model and supporting documentation are available from several online locations. The EU project site links to the model code and users' guide at http://www.tellmeproject.eu/node/392, together with reports concerning the project. The model and users' guide are also lodged with OpenABM at https://www.openabm.org/model/4536/version/1. The model and users' guide are also available from the CRESS website at http://cress.soc.surrey.ac.uk/web/resources/models/tell-me-model, as is the working paper with the detailed technical information. The calibration simulation dataset is available on request from the first author.
Functionality similar to that of the Sandtable Model Foundry could be achieved within an open source environment by combining tools: one for the parameter space sampling and simulation management (such as OpenMOLE, MEME or the lhs and RNetLogo packages in R), and another to analyse the results and calculate the dominance fronts (such as the TunePareto package in R).

Figure : Average outcome over simulations for each of parameter sets. Subfigures (a), (b) and (c) display the outcome against different pairs of criteria, with subfigure (d) focussing on the best fit section of (c). Those on the Pareto efficient frontier are coloured red and numbered according to Table .
Figure : Interactive analysis of simulation experiments. The input parameter values appear in the left section of the screen, and the fit against each criterion on the right. Simulation runs can be highlighted in groups (such as all those on the Pareto efficient frontier as displayed) or individually to explore the effect of different combinations of parameter values.

Figure : The average of simulation runs for each of the non-dominated candidate parameter sets. The selected best fit parameter set (set ) is drawn in red. Empirical behaviour values (extracted from Cowling et al. , supplementary information) are shown with dots.

Table : TELL ME model parameter settings for calibration.
The multiplier in Equation was set at W = 1, establishing H N as the reference epidemic.
The basic reproductive ratio (denoted R_0) is related to the parameters in Equation with R_0 = β/γ (Diekmann & Heesterbeek ). R_0 for the H N epidemic was estimated as .-. (European Centre for Disease Prevention and Control ).