Metamodels for Evaluating, Calibrating and Applying Agent-Based Models: A Review

Abstract: The recent advancement of agent-based modeling is characterized by higher demands on the parameterization, evaluation and documentation of these computationally expensive models. Accordingly, there is also a growing demand for “easy to go” applications that simply mimic the input-output behavior of such models. Metamodels are being increasingly used for these tasks. In this paper, we provide an overview of common metamodel types and the purposes of their usage in an agent-based modeling context. To guide modelers in the selection and application of metamodels for their own needs, we further assessed their implementation effort and performance. We performed a literature search in January 2019 using four different databases. Five different terms paraphrasing metamodels (approximation, emulator, meta-model, metamodel and surrogate) were used to capture the whole range of relevant literature in all disciplines. All metamodel applications found were then categorized into specific metamodel types and rated by different junior and senior researchers from varying disciplines (including forest sciences, landscape ecology and economics) regarding the implementation effort and performance. Specifically, we captured the metamodel performance according to (i) the consideration of uncertainties, (ii) the suitability assessment provided by the authors for the particular purpose, and (iii) the number of evaluation criteria provided for suitability assessment. We selected 40 distinct metamodel applications from studies published in peer-reviewed journals from 2005 to 2019. These were used for the sensitivity analysis, calibration and upscaling of agent-based models, as well as to mimic their predictions for different scenarios. This review provides information about the most applicable metamodel types for each purpose and forms a first guidance for the implementation and validation of metamodels for agent-based models.


Introduction
Essentially, a metamodel (MM) is a model which describes the behaviour of an original model on a higher hierarchical level (Moorcroft et al.; Urban; Gore et al.). In the context of mechanistically detailed and therefore often computationally expensive agent-based models (ABM) or individual-based models (IBM), MMs provide an efficient way to facilitate profound model analysis and prediction of ABM behaviour over a wide range of parameter combinations.
The term MM originates from the Design of Experiments literature (Wang & Shan; Montgomery). It was originally developed to study the effects of a set of explanatory variables on a response variable. Therein, optimization via response surface MMs was the most widely performed application (Barton). The terms surrogate model (Dey et al.) and emulator (Conti & O'Hagan) can also be understood as MMs. Most commonly, they all treat a particular ABM as a white, grey or black box (Papadopoulos & Azar) and link the input and output values by aggregated functions (Barton; Friedman & Pressman; Friedman; Barton & Meckesheimer). As a result, MMs significantly reduce simulation costs in terms of computational time and allow easier communication and understanding of the behavior of simulation models (Kleijnen & Sargent; Mertens et al.). This review will not consider other related concepts of MMs such as the model framework of concepts (Goldspink).
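To make the idea of linking ABM inputs and outputs by an aggregated function concrete, the following minimal sketch fits a first-order regression MM to a toy simulator. Everything in it is an assumption made for illustration: `toy_abm` is a hypothetical stand-in for an expensive ABM, the design of 20 input values is arbitrary, and Python is used here although the tools discussed in this review are R-based.

```python
import random


def toy_abm(adoption_prob, n_agents=200, rng=None):
    """Hypothetical stand-in for an expensive ABM.

    Each agent adopts independently with probability `adoption_prob`;
    the model output is the number of adopters.
    """
    if rng is None:
        rng = random.Random()
    return sum(1 for _ in range(n_agents) if rng.random() < adoption_prob)


def fit_first_order(xs, ys):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (intercept, slope, r_squared).
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    return intercept, slope, 1.0 - ss_res / ss_tot


def build_metamodel(n_design_points=20, seed=1):
    """Run the toy ABM on a regular input grid and fit the regression MM."""
    rng = random.Random(seed)
    xs = [0.05 + 0.9 * i / (n_design_points - 1) for i in range(n_design_points)]
    ys = [toy_abm(x, rng=rng) for x in xs]
    intercept, slope, r_squared = fit_first_order(xs, ys)

    def metamodel(x):
        # Cheap replacement for a full ABM run at input x.
        return intercept + slope * x

    return metamodel, slope, r_squared


metamodel, slope, r_squared = build_metamodel()
print(f"slope={slope:.1f}, R^2={r_squared:.3f}, prediction at 0.5: {metamodel(0.5):.1f}")
```

Once fitted, `metamodel` can be evaluated at arbitrary inputs without rerunning the simulator, which is where the savings in computational time come from. The R squared value returned here is one of the evaluation criteria a source author could report for such an MM.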
The aim of this review is to condense available information about common MM types used for various tasks related to ABM analysis and applications, in order to guide modelers in choosing an appropriate MM type for their research problem. For detailed information on specific MMs and their applications, we advise consulting reviews or tutorials elsewhere, such as Barraquand. A methodology for rating MM quality and implementation effort in an ABM context was developed and applied to the reviewed publications by eight different raters with varying mathematical skills and scientific backgrounds. This was done to support readers in their selection and application of a metamodel in an ABM context.

Searching procedure
We conducted a literature survey in Open Access databases (see Table) on four days in January 2019 and considered only peer-reviewed papers. For each database used, we performed ten searches combining the terms agent-based model and individual-based model with each of the following keywords: approximation, emulator, metamodel, meta-model and surrogate. We did not limit the time frame of the results but took only a fixed maximum number of results per search into account, sorted by their relevance. Papers containing one or more of these keywords in their title, abstract, or keywords section were selected for review.
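The ten searches per database follow from crossing the two model terms with the five keywords. The short sketch below enumerates them. The quoting and the `AND` operator are assumptions, since the exact query syntax used in each database is not given here.

```python
from itertools import product

MODEL_TERMS = ["agent-based model", "individual-based model"]
KEYWORDS = ["approximation", "emulator", "metamodel", "meta-model", "surrogate"]
DATABASES = [
    "Academic Search Complete",
    "Web of Science Core Collection",
    "Google Scholar",
    "Scopus",
]


def build_queries(model_terms=MODEL_TERMS, keywords=KEYWORDS):
    """Combine every model term with every keyword (assumed AND syntax)."""
    return [f'"{model}" AND {keyword}' for model, keyword in product(model_terms, keywords)]


queries = build_queries()
for query in queries:
    print(query)
print(f"{len(queries)} searches per database, {len(queries) * len(DATABASES)} in total")
```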

Database    Website
Academic Search Complete    ebscohost.com/academic/academic-search-complete
Web of Science Core Collection    apps.webofknowledge.com
Google Scholar    scholar.google.de
Scopus    elsevier.com/solutions/scopus

Table: Complete list of all databases used for the review presented. The survey was conducted in January without limiting the years of publication.

Categorization of MMs and purpose of application
In contrast to Papadopoulos & Azar, we do not sub-classify MMs into white (reduced order), gray (both physical equations and stochastically estimated parameters) and black box (Machine Learning) surrogate models. Instead, we simply distinguish them according to how they describe the link between input and output variables, as deterministic (e.g. Differential Equation) or stochastic (e.g. Machine Learning) MMs. We thus assign, for example, a Partial Differential Equation used for upscaling (e.g. Moorcroft et al.) to the family of deterministic MMs, whereas Bayesian Emulators applied for calibration (e.g. Bijak et al.) are considered stochastic MMs.
The MMs were first subdivided into two main classes, namely deterministic and stochastic models, depending on whether they consider probability distributions linked to the input, output, or processes described by the ABM. The classes were further subdivided into six model families that comprise different MM types (Table). In this sense, all MM family names resemble so-called suitcase phrases and do not necessarily share all attributes or requirements of their namesake in a mathematical context. The names of the model types were directly extracted from the accepted papers without any adjustments. Appendix A provides complete information about the reviewed papers and the corresponding model families and types.
Table: MM classification derived from the accepted papers with MM applications in an ABM context. The differentiation between deterministic and stochastic models depends on whether probability distributions of input, output or processes described by the emulated ABM were taken into account. Model families represent so-called suitcase phrases, which are not necessarily mathematical definitions for all MM types included in the family.
We categorized the purpose of each MM exclusively based on the declaration of the particular authors (Table). Notably, we treat all parameter fitting as calibration, which thus covers calibration, parameterization and optimization, in accordance with Railsback & Grimm.

Assessment of MM quality and implementation effort
In the following paragraphs, we briefly describe how we rated the MM's quality and implementation effort. For more in-depth information on the procedure as well as for some examples of each rating criterion, see Appendix C. This guide was used to rate each MM application and to calculate the mean quality and implementation effort. The inter-rater reliability was calculated using the icc function of the R package irr (Gamer et al.). Alongside the choice of model, we used average as type (we want to use the mean ratings for each MM application) and agreement as definition, since we sought to evaluate the agreement among the raters.

Table (purposes of MM application), surviving entry for upscaling: Scale the model to a coarser spatial resolution (Cipriotti et al.) or from individuals to populations (Campillo & Champagnat).

Suitability Assessment by Source Authors (SuA): How did the authors state the suitability of the MM for the given purpose?
Number of Evaluation Criteria (NE): How many different criteria were provided by the authors for evaluating the MM suitability?
Table: Criteria applied for assessing the MM quality for the given purpose of emulating the ABM.
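For readers who want to see what the inter-rater reliability calculation involves, the following sketch computes an intraclass correlation coefficient for average ratings under the agreement definition. It is not the code used in this review, which relied on the icc function of the R package irr. The two-way layout is an assumption (in irr this would presumably correspond to `model = "twoway"`, `type = "agreement"`, `unit = "average"`), the example ratings are invented, and the verbal labels use the commonly cited Cicchetti cut-offs of 0.40, 0.60 and 0.75.

```python
def icc_agreement_average(ratings):
    """ICC for absolute agreement of the mean of k raters (two-way layout).

    `ratings` is a list of rows, one row per rated MM application,
    each row holding one numeric score per rater.
    """
    n = len(ratings)      # number of rated applications
    k = len(ratings[0])   # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (ms_cols - ms_error) / n)


def cicchetti_label(icc):
    """Verbal label for an ICC value (assumed Cicchetti cut-offs)."""
    if icc < 0.40:
        return "poor"
    if icc < 0.60:
        return "fair"
    if icc < 0.75:
        return "good"
    return "excellent"


# Invented ratings: 5 MM applications scored by 3 raters (1 = low, 3 = high).
example = [
    [3, 3, 2],
    [1, 1, 1],
    [2, 3, 2],
    [1, 2, 1],
    [3, 3, 3],
]
value = icc_agreement_average(example)
print(f"ICC = {value:.2f} ({cicchetti_label(value)})")
```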
The quality of each MM was rated on the basis of the assessment given by the respective source authors, using three different criteria (Table): Consideration of Uncertainty (CU), Suitability Assessment by Source Authors (SuA), and Number of Evaluation Criteria (NE). With the CU criterion, we evaluated how the authors considered uncertainties in the inputs and outputs of the respective MM family. In this criterion, the term no means that there was no explicit consideration of uncertainty given by the authors using the MM, while yes refers to those cases where they used at least some (quantitative) measures (e.g. error bars or R²). We assigned a high quality if the source authors had presented measures of uncertainty with a corresponding evaluation of such measures. The term suitability in SuA refers to the applicability of the given MM type (e.g., Approximate Bayesian Computation) to fulfill the particular purpose (e.g., calibration of an ABM). A good MM evaluation by the authors was regarded as medium if the assessment was based only on a qualitative statement (e.g., "The MM performed extremely well."). We adjudged suitability as good in those cases where the ABM emulation was quantitatively assessed with a positive result. The third criterion, NE, is self-explanatory. For example, a basic linear regression model provides two criteria for evaluating suitability (R squared for the goodness of fit and the p-value for evaluating the significance of the linear relationship between the input and output variables) and, thus, would receive a medium assessment for this specific criterion if the authors presented those criteria within their peer-reviewed research paper. Statements reporting a single success rate, for example the percentage probability with which the MM selected a parameter set that fitted all investigated outputs, or the percentage of cases in which a procedure was successful, "revealing its great potential to assess parameters difficult to measure in nature", were considered as SuA = good with NE = low.
The implementation effort of each MM family was assessed by the following three criteria (Table): Availability of Open Access Guiding Sources (AG), R Coverage (RC), and Out-of-the-Box Applicability (OA). Since we focus exclusively on the effort to implement MMs, computational cost was not considered. The AG criterion evaluates the effort of finding help or further information for applying the MM to one's own needs. If no sources could be found by performing a search in Google Scholar and Google.com using the MM type name as search query, the MM was assigned a high implementation effort, while multiple usable sources (e.g. a page on Wikipedia.org and a mathematical blog entry) were considered a medium implementation effort. A low effort was assigned if there was one source giving a comprehensive tutorial on implementing the respective MM. The RC criterion focused on the freely available statistical language R (R Core Team). If one dedicated package is available to implement the whole MM, it was rated with a low implementation effort. If multiple R packages were necessary, a medium effort was given. We assigned a high implementation effort if the entire MM had to be developed from scratch. The last criterion, OA, assessed whether an MM is immediately usable (which partly depends on the existing software). MMs were assigned a high implementation effort if the derivation of specific equations was required or some important assumptions had to be investigated for its use. Minor adjustments, for example adapting the linear model equation for the corresponding R function, were considered a medium effort, while the application of an unsupervised artificial neural network was considered a low implementation effort.
Using the average value of all raters of each criterion, we conducted an overall assessment of quality and implementation effort of each MM application. Mean ratings were then analyzed separately for quality and implementation effort using the five-level classification (low, low-medium, medium, medium-high and high) displayed in the table of overall levels. If, for example, an MM application received a high SuA, a high NE and a medium CU, a high overall MM quality was given. These overall assessments were used to generate a plot for each application aim (Table) depicting the MM quality as a function of the MM implementation effort. Within these plots, a bisecting line was drawn to visualize the 1:1 ratio of quality and implementation effort and to highlight favorable MMs scoring above this line and less favorable MMs staying below it.

Table: Criteria applied for assessing the MM implementation effort for the given application aims.

Amount of Scores in: High, Medium, Low    Overall MM Quality / Implementation Effort: Level
Levels, in row order: high, high, medium-high, medium-high, medium, medium, low-medium, low-medium, low, low
Table: The overall MM quality and implementation effort was calculated for each application according to the mean ratings of each of the three criteria for quality (CU, NE and SuA) and effort (AG, OA and RC).
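The aggregation of three criterion ratings into one overall level, and the comparison against the bisecting line, can be sketched as follows. The score-sum mapping is an assumption: it reproduces the worked example from the text (two high ratings and one medium rating give a high overall level) and the ten levels of the overall-level table when its rows are ordered by decreasing scores, but it is not taken from the authors' material. Treating quality as the vertical axis and effort as the horizontal axis, both ordered from low to high, is likewise an assumption.

```python
LEVELS = ["low", "low-medium", "medium", "medium-high", "high"]
SCORE = {"low": 1, "medium": 2, "high": 3}

# Assumed mapping from the sum of three scores (3..9) to the five levels.
SUM_TO_LEVEL = {
    9: "high",
    8: "high",
    7: "medium-high",
    6: "medium",
    5: "low-medium",
    4: "low",
    3: "low",
}


def overall_level(ratings):
    """Combine three ratings (each low, medium or high) into one overall level."""
    if len(ratings) != 3:
        raise ValueError("expected exactly three criterion ratings")
    return SUM_TO_LEVEL[sum(SCORE[r] for r in ratings)]


def position_to_bisector(quality_level, effort_level):
    """Locate an MM application relative to the 1:1 line of quality and effort."""
    q = LEVELS.index(quality_level)
    e = LEVELS.index(effort_level)
    if q > e:
        return "above"
    if q == e:
        return "on"
    return "below"


# Worked example from the text: high SuA, high NE, medium CU.
quality = overall_level(["high", "high", "medium"])
# Hypothetical effort ratings for AG, OA and RC.
effort = overall_level(["low", "medium", "low"])
print(quality, effort, position_to_bisector(quality, effort))
```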

Results and Discussion
Following the previously described selection criteria (see method section), peer-reviewed journal papers published from 2005 to 2019 (Figure) were accepted for the review (see Appendix B). From these, we extracted 40 different MM applications in an ABM context (see Appendix A).

Sensitivity analysis
For sensitivity analyses, Bayesian Emulators and Regressions achieved the highest MM quality at accessible implementation efforts (Figure). Half of the reviewed publications with a focus on Machine Learning scored above the bisecting line, indicating a broad MM usage, while the remaining applications were either on or below the bisecting line.
Overall, we found the implementation effort for the three MM families (Bayesian Emulators, Machine Learning and Regression) to be reasonable due to a predominantly high RC (R Coverage) and the broad AG (Availability of Open Access Guiding Sources) for these MMs. However, a shortcoming in the application of these three MM families for sensitivity analysis is their need for adjustments before they can be applied to another ABM: there was not a single MM type within those MM families that could be reused without any changes. The superior qualities of Bayesian Emulators and Regression MMs result from the moderate to good SuA (Suitability Assessment by Source Authors) in addition to their moderate to good CU (Consideration of Uncertainty). The Machine Learning MMs applied for sensitivity analysis never exceeded a moderate NE (Number of Evaluation Criteria), while their CU and SuA increased in the following order: Decision Tree Ensemble, Support Vector Regression, Symbolic Regression and Random Forest.

Calibration
For calibration, Bayesian Emulators, Machine Learning and Regression MMs seem to be the preferable MM families since they consistently stay on or above the bisecting line, indicating a beneficial ratio of MM quality to implementation effort (Figure). In contrast, Differential Equation and Ordinary Functional Equation MMs do not reach, let alone exceed, the bisecting line and therefore seem to be less favorable MM families for the purpose of calibrating ABMs.

The overall low-medium implementation efforts of the three best scoring MM families (Bayesian Emulator, Machine Learning and Regression) can be explained by their good to at least medium RC (R Coverage) as well as their good to moderate AG (Availability of Open Access Guiding Sources). Their OA (Out-of-the-Box Applicability) was never rated as low and always received medium or high assessments regarding their implementation efforts.
The high implementation efforts of Differential Equations and Ordinary Functional Equations are due to a considerably low OA because they have to be rebuilt entirely for every new ABM. Their AG and RC remain good to medium, emphasizing their broad usability.
The superior MM qualities of Bayesian Emulators are due to their high NE (Number of Evaluation Criteria) as well as in-depth CU (Consideration of Uncertainty). Only SuA (Suitability Assessment by Source Authors) was poor to medium, indicating that not every MM type of this family suited the task of calibration as well as the others. Machine Learning MMs always achieved a good SuA, while their CU and NE varied from medium to high.
The considerably poor qualities achieved by Differential Equations and Ordinary Functional Equations result from their low CU and NE. Nevertheless, the respective source authors assessed the suitability of these MMs qualitatively as good.

Prediction
To predict the behavior of ABMs, Bayesian Emulators and Machine Learning MMs seem to be the most favorable MM families since they consistently exceed the bisecting line marking the 1:1 ratio of MM quality and implementation effort (Figure). While the only Regression application for predicting ABMs achieves a low-medium MM quality as well as a low-medium implementation effort, signaling a trade-off between prediction and implementation, Differential Equations as well as Ordinary Functional Equations consistently remain below the bisecting line.
For predicting ABM behavior, Bayesian Emulators scored the best quality rating with varying implementation efforts. The low-medium effort of the Gaussian Process Emulator originates from a very good RC (R Coverage) as well as medium OA (Out-of-the-Box Applicability) and AG (Availability of Open Access Guiding Sources). The medium-high effort of the dynamic linear model Gaussian Process is due to worse OA, AG and RC. The latter two criteria should be considered critically, as we used the exact name presented here as a key phrase in our online research while looking for R packages and guiding sources. We would expect a lower implementation effort had we used a more flexible search term for this MM type.
The second best MM family for the prediction of ABMs is Machine Learning. Its considerably low implementation efforts are due to its broad RC and AG. OA varies around a medium ranking, with decision trees achieving the highest rating. The varying quality within this MM family is due to differing SuA (Suitability Assessment by Source Authors), while CU (Consideration of Uncertainty) is low overall and NE (Number of Evaluation Criteria) scores between low and medium. The highest quality is achieved by Random Forest owing to its comparably higher CU and NE.
The Regression MM applied for predicting ABMs is a First Order Regression, which received lower quality ratings while still being good at SuA. The implementation effort consists of a medium OA (the formula of the linear model has to be adapted for every ABM) and a moderate RC, which could be caused by using the whole and exact model name for our online research of R packages.
Figure: Results of the MM quality and implementation effort assessment for the application aim of prediction.
The overall high implementation efforts of Differential Equations (Compartment Ordinary Differential Equation) and Ordinary Functional Equations (Systems Dynamic Model), which score only low-medium to medium qualities, are due to their very low OA, since these MM families have to be rebuilt entirely for each ABM. Furthermore, their CU as well as their NE is low, which together with an only qualitatively good SuA adds up to medium qualities at best.

Upscaling
For upscaling ABMs, only the Markov Chain MM exceeded a neutral ratio of MM quality and implementation effort (Figure). The Differential Equation MM stayed below the bisecting line, making it a less favorable choice of MM for upscaling ABMs.
The Markov Chain MM reached a medium quality because of its considerably high SuA (Suitability Assessment by Source Authors) and low-medium CU (Consideration of Uncertainty) and NE (Number of Evaluation Criteria).
The implementation effort is dominated by its poor OA (Out-of-the-Box Applicability), meaning that many adjustments are required to adapt this kind of MM to another ABM. The only accepted Differential Equation (Partial Differential Equation) scored a low OA since a new equation has to be derived for every application to an ABM.
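As a rough illustration of how a transition-matrix MM from the Markov Chain family can scale up ABM output, the sketch below estimates transition probabilities from simulated state sequences and projects a state distribution forward in time. It is not the procedure of any reviewed paper: the state names, the trajectories and the rule that an unobserved state keeps its state are all hypothetical.

```python
from collections import Counter


def estimate_transition_matrix(trajectories, states):
    """Estimate P(next state | current state) from observed state sequences."""
    counts = {s: Counter() for s in states}
    for trajectory in trajectories:
        for current, nxt in zip(trajectory, trajectory[1:]):
            counts[current][nxt] += 1

    matrix = {}
    for s in states:
        total = sum(counts[s].values())
        if total == 0:
            # Assumption: a state never observed as a start keeps its state.
            matrix[s] = {t: 1.0 if t == s else 0.0 for t in states}
        else:
            matrix[s] = {t: counts[s][t] / total for t in states}
    return matrix


def project(distribution, matrix, steps):
    """Propagate a state distribution through the chain for `steps` steps."""
    states = list(matrix)
    for _ in range(steps):
        distribution = {
            t: sum(distribution[s] * matrix[s][t] for s in states) for t in states
        }
    return distribution


# Hypothetical ABM output: the state of two patches over four time steps.
STATES = ["bare", "vegetated"]
trajectories = [
    ["bare", "bare", "vegetated", "vegetated"],
    ["bare", "vegetated", "vegetated", "bare"],
]
matrix = estimate_transition_matrix(trajectories, STATES)
landscape = project({"bare": 1.0, "vegetated": 0.0}, matrix, steps=10)
print(matrix)
print(landscape)
```

The projection step replaces individual agent simulations by a single matrix operation per time step, which is what makes such an MM attractive for coarser spatial resolutions. The need to define states and estimate a new matrix for every ABM reflects the poor OA noted above.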

MM rating method and inter-rater reliability
The inter-rater reliability never fell below a fair level and even achieved an excellent evaluation for CU (Consideration of Uncertainty) and OA (Out-of-the-Box Applicability) (Table).
With eight raters and a sample size of 40 MM applications, the requirements suggested by Koo & Li are met and exceeded, emphasizing the robustness of the inter-rater reliability results and therewith the results of the MM rating. Nevertheless, the calculated fair intra-class correlation coefficients for SuA (Suitability Assessment by Source Authors), AG (Availability of Open Access Guiding Sources) and RC (R Coverage) (Table) indicate a necessity to further improve the clarity of the rating instructions for these criteria.

One reason for the stronger variation within the MM implementation effort criteria AG and RC lies in the diverse backgrounds of the raters who participated in the MM assessment. Since the individual knowledge of and experience with the corresponding MM types as well as the statistical software R differed (Appendix D), the assessment of the number of R packages needed to apply a given MM varied among raters.
Figure: Results of the MM quality and implementation effort assessment for the application aim of upscaling.
The merely fair agreement within the MM quality criterion SuA could be due to unclear instructions for cases in which the authors provided empirical proof for the suitability but never directly assessed it themselves qualitatively. In these cases, some raters gave a medium rating and others a high one. Additional divergences emerged when the source authors did not provide any assessment but some raters were able to identify a good or bad fit by themselves while investigating the provided plots. A more fine-grained analysis (e.g. a five-point or seven-point scale) might reveal a clustering around high, medium and low with some variation within each level.
Table: Calculated inter-rater reliability for the rating criteria with evaluation following Cicchetti.

Conclusions
Metamodelling is a promising approach to facilitate ABM calibration, sensitivity analysis, prediction and upscaling. We conducted a review of the MM types used for each of these purposes. Within the papers analysed, we identified 40 different MM applications. For each of them, we (PhD students and Postdocs with no to moderate mathematical background) assessed the performance quality and the implementation effort. The MM rating methodology applied in this paper was supported by the fair to excellent intra-class correlation coefficients obtained in the inter-rater reliability assessment.

Our goal was to support MM selection for the various needs of daily ABM problems by highlighting the currently most promising MM types, with an example each serving as a practical application guide:
• Prediction: Gaussian Processes from the Bayesian Emulator MM family provide the best quality while offering low-medium implementation effort. In contrast, Random Forest MMs (Machine Learning family) offer low-medium effort but only medium-high quality. An example of predicting new parameter combinations, similar to an inverted calibration, can be found in Peters et al.
• Upscaling: Transition Matrices from the Markov Chain MM family seem to be the most promising tool for scaling up ABMs. Note that we reviewed only two MMs for this application aim. The corresponding application can be found in Cipriotti et al.
This review was intended as a "first aid" for agent-based modelers who seek to improve the performance, optimization or analysis of their simulation model using a metamodel. Our motivation for this work arose from our day-to-day modeling tasks. Please note that the review presented here can only provide an initial overview, which is primarily meant to stimulate and guide readers in their own exploration of the wide field of metamodels. The examples presented here are not exhaustive, and the field of metamodeling itself is constantly and rapidly developing. In particular, the application of the potential offered by various methods of artificial intelligence (with the branches of machine learning and deep learning) is just beginning to emerge. We would therefore like to motivate our readers to stay abreast of new developments in applying metamodeling approaches to ABMs and, above all, to try out metamodels in their own ways.