No Free Lunch when Estimating Simulation Parameters

In this paper, we have estimated the parameters of 41 simulation models to find which of 9 estimation algorithms performs best. Unfortunately, no single algorithm was the best for all, or even most, of the models. Rather, five main results emerge from this research. First, each algorithm was the best estimator for at least one parameter. Second, the best estimation algorithm varied not only between models but even between parameters of the same model. Third, each estimation algorithm failed to estimate at least one identifiable parameter. Fourth, choosing the right algorithm improved estimation performance more than quadrupling the number of model runs did. Fifth, half of the agent-based models tested could not be fully identified. We therefore argue that the testing performed here should be carried out in other applied work, and to facilitate this we share the R package freelunch.


Introduction
Comparing estimation algorithms provides no winner. A mathematical model is a set of causal mechanisms connecting numerical variables. These mechanisms may depend on one or more parameters; some can be readily observed while many cannot. Because unobserved parameters affect the model output, it may be possible to identify their value by comparing the model output against what actually occurs in the data. Estimation is this process of identifying parameters by comparing model output to data. Many estimation algorithms for simulation models have emerged over the past ten years (for a general review see Hartig et al.; for an agent-based review see Thiele et al.; for agent-based models in economics see Fagiolo et al.; Platt).
There are three limitations to the current estimation literature. First, papers that introduce new estimation algorithms tend to showcase their performance on a few idiosyncratic examples, so that comparisons across methods remain difficult. Second, reviews that compare estimation algorithms tend to be small, focusing on a single field and only a few models and estimation algorithms. Third, existing reviews tend to mix two steps together: the processing of model outputs into useful summary statistics (or distance functions) and the actual algorithm used for estimation. The processing phase is very specific to each discipline, which makes it hard to apply lessons from one paper to agent-based models in another field.
Here, we have built a more thorough comparison of nine estimation algorithms across 41 simulation models (both agent-based and not). Our original objective was to pick the best estimation algorithm so that authors could default to it without worrying about the rest of the estimation literature. Instead, we established that there is no single best algorithm: both absolute and relative performance are context-dependent.
The best performing estimation algorithm changes not just between models but sometimes even between parameters of the same model. Worse, even though the best algorithm is context-dependent, choosing the right one still matters: on average, it improved point predictions more than quadrupling the number of model runs did.
We wanted to measure the quality of an estimation algorithm along two dimensions: point prediction "performance" and confidence interval "coverage." We measured point prediction performance as

performance = 1 − √( Σⱼ (θ̂ⱼ − θ*ⱼ)² / Σⱼ (θ̄ − θ*ⱼ)² )

where θ̂ is the estimated parameter, θ* is the real hidden parameter, θ̄ is the average parameter value in the training data and j is the row of the testing dataset we are estimating. In other words, performance measures how much more accurate (measured in root mean square error) estimation is compared to just guessing the average parameter value without performing any estimation. Performance ranges from 1 (perfectly estimated) to 0 (unidentified) to negative values (mis-identified). Without square roots this is equal to predictivity (Salle & Yıldızoğlu) and modelling efficiency (Stow et al.).
We defined coverage as in Raynal et al. as the percentage of times the real parameter falls within the prediction intervals suggested by the estimating algorithm. The best coverage matches the nominal level of the interval: intervals that cover too rarely are deceptively narrow, while intervals that cover too often are wider than they need to be.
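As a concrete illustration, here is a minimal R sketch of the two metrics for a single parameter, assuming we already have vectors of cross-validated estimates, the corresponding true values, the training-set mean and the interval bounds returned by the algorithm (all names are illustrative):

```r
# Point prediction performance: 1 minus the ratio of the estimator's RMSE to the
# RMSE of always guessing the training-set average parameter value.
performance <- function(estimated, truth, train_mean) {
  1 - sqrt(sum((estimated - truth)^2) / sum((train_mean - truth)^2))
}

# Coverage: share of test rows whose true parameter falls inside the stated
# prediction interval.
coverage <- function(lower, upper, truth) {
  mean(truth >= lower & truth <= upper)
}
```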

Models
We estimated the parameters of 41 separate simulation models, chosen either because they appeared as examples in at least one other estimation paper or because they were open-source agent-based models available on the COMSES model library (Rollins et al.). We can roughly categorize these models into four groups: simple, ill-posed, complicated and agent-based models. Table lists them all. In the Appendix, we provide a brief description of each. Simple simulation models have few parameters and summary statistics. They feature prominently in the ABC literature both as a teaching tool and to compare different algorithms. They are useful because they run quickly, but they may bias comparisons towards simpler estimation algorithms.
Ill-posed simulation models face clear identification issues: the inability to recover parameters given the information we have. There are many sources of identification issues, and each ill-posed model highlights one particular form. A good estimation algorithm facing an ill-posed problem should display two features. First, it should maximize the quality of the estimated parameters when the information is available but noisy (the lesser problem of "weak" identification). Second, the estimation algorithm should recognize when the model cannot be identified and return wide confidence intervals, signaling estimation uncertainty.
We split complicated simulation models into two sets: agent-based models and other complicated simulations. They face similar problems: they tend to be large and to involve many input parameters and summary statistics. From an estimation point of view there is no qualitative difference between the two, but in practice agent-based models tend to be slower and to produce more summary statistics.

Algorithms
We tested nine reference table algorithms to parametrize simulations: five are ABC (Approximate Bayesian Computation) and four are regression-only. We ignored search-based algorithms, such as synthetic likelihood (Wood; Fasiolo & Wood), ABC-MCMC (Hartig et al.) and Bayesian optimization (Snoek et al.). We also ignored regression-only algorithms that do not generate prediction intervals, such as the deep neural networks proposed in Creel and the elastic nets proposed in Carrella et al.
All reference table algorithms share a common estimation procedure. First, we ran the model "many" times and collected the random input parameters and the summary statistics of the model outputs into a "reference table." The estimation algorithm then produced a rule that generalizes the information in the reference table, mapping summary statistics back to the parameters that generated them. Finally, we plugged the real summary statistics vector S* observed from the data into this rule and obtained the estimated parameters.
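As a sketch of this procedure, the reference table can be built as below; run_model is a hypothetical stand-in for whatever simulation is being estimated (it takes a parameter vector and returns a vector of summary statistics), and the priors and number of runs are illustrative:

```r
n_runs <- 1000
# sample the unknown parameters from their priors
priors <- data.frame(theta1 = runif(n_runs, 0, 1),
                     theta2 = runif(n_runs, -5, 5))
# run the simulation once per row and collect its summary statistics
summaries <- t(apply(priors, 1, run_model))
reference_table <- cbind(priors, as.data.frame(summaries))
```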
The first algorithm we used was simple rejection ABC (Pritchard et al.; Beaumont et al.). Start by ranking all training observations by their Euclidean distance to the testing summary statistics, √(Σᵢ (Sᵢ(θ) − S*ᵢ)²). Then discard all training observations except the closest ones (a small acceptance percentage). The distribution of θ parameters from the accepted training observations becomes the posterior of the estimate θ*.
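A minimal hand-rolled version of this algorithm might look as follows (column names, the rescaling of statistics by their standard deviation and the acceptance percentage are all illustrative choices):

```r
rejection_abc <- function(reference_table, s_star, param_cols, stat_cols,
                          acceptance = 0.05) {
  stats <- as.matrix(reference_table[, stat_cols])
  sds <- apply(stats, 2, sd)
  # Euclidean distance of every training row to the observed statistics
  distance <- sqrt(rowSums(scale(stats, center = s_star, scale = sds)^2))
  # keep only the closest fraction of the reference table
  keep <- distance <= quantile(distance, acceptance)
  # the accepted parameters approximate the posterior of theta*
  reference_table[keep, param_cols]
}
```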
The second algorithm is the local-linear regression adjusted ABC (Beaumont et al.). Weigh all training observations by an Epanechnikov kernel with bandwidth equal to the Euclidean distance between the testing summary statistics S* and the furthest S(θ) we would have accepted using simple rejection. Then run a local-linear regression on the weighted training set to estimate θ* as the predicted E[θ|S(θ)], using the residuals of that regression to estimate its posterior distribution.
The third algorithm, neural network ABC, feeds the same weighted training set to a feed-forward neural network (Blum & François). The approach is similar to the local-linear regression described above, but the residuals are also weighted by a second regression (on the log squared residuals) to correct for heteroskedasticity.
These three algorithms are implemented in the abc package (Csilléry et al.) in R. We used the package default settings for its neural networks (an ensemble of single-hidden-layer networks, with weight decay randomly chosen for each network among 0.0001, 0.001 and 0.01).
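For reference, the three variants above correspond to the following calls to the abc package (a sketch; the tolerance and column selections are illustrative):

```r
library(abc)

posterior_rejection <- abc(target = s_star,
                           param = reference_table[, param_cols],
                           sumstat = reference_table[, stat_cols],
                           tol = 0.05, method = "rejection")

posterior_loclinear <- abc(target = s_star,
                           param = reference_table[, param_cols],
                           sumstat = reference_table[, stat_cols],
                           tol = 0.05, method = "loclinear")

posterior_neuralnet <- abc(target = s_star,
                           param = reference_table[, param_cols],
                           sumstat = reference_table[, stat_cols],
                           tol = 0.05, method = "neuralnet")

summary(posterior_loclinear)   # posterior summaries and credible intervals
```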
The fourth and fifth algorithms are semi-automatic ABC methods, which "pre-process" summary statistics before applying rejection ABC (Prangle et al.). More precisely, the original summary statistics S(θ) are fed into a set of linear regressions estimating rᵢ = E[θᵢ|S(θ)] (one for each parameter of the model) and the fitted values are used as summary statistics for simple rejection ABC. The rationale is that these regressions project the summary statistics into a space where rejection ABC performs better. We did this in two different ways here: by running first- or fourth-degree linear regressions in the pre-processing phase. This was done using the R package abctools (Nunes & Prangle) and its default parameters: using half of the training set to run the regression and the other half to run the rejection ABC.
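The pre-processing step can also be written out by hand; the sketch below shows the first-degree version under the same split, with lm() pilot regressions fitted on one half of the reference table and their fitted values used as the new summary statistics for the other half (names are illustrative):

```r
pilot_rows <- sample(nrow(reference_table), nrow(reference_table) / 2)
pilot <- reference_table[pilot_rows, ]
rest  <- reference_table[-pilot_rows, ]
target_df <- as.data.frame(t(s_star))
names(target_df) <- stat_cols

projected <- lapply(param_cols, function(p) {
  # pilot regression E[theta_i | S(theta)] on half of the reference table
  fit <- lm(reformulate(stat_cols, response = p), data = pilot)
  list(stats  = predict(fit, newdata = rest),
       target = predict(fit, newdata = target_df))
})
new_stats  <- sapply(projected, `[[`, "stats")   # one projected statistic per parameter
new_target <- sapply(projected, `[[`, "target")
# rejection ABC is then run on `rest`, using new_stats and new_target in place
# of the original summary statistics and of s_star
```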
A feature of all ABC methods is that they are local: they remove or weight training observations differently depending on S* (the "real" summary statistics). This means that during cross-validation we need to retrain each ABC for each row of the testing set.

Regression only
Estimating parameters by regression is a straightforward process. We build a separate regression r for each θ in the reference table, with the parameter as the dependent variable and the summary statistics S(θ) as the independent variables. We plug the real summary statistics S* into each regression and the predicted value is the estimated parameter θ*.
The simplest algorithm of this class is a linear regression of degree one. It is linear, its output is interpretable and it is fast to compute. This speed makes it possible to estimate the prediction interval of θ* by resampling bootstrap (Davison & Hinkley): many bootstrap data sets are produced and the same linear regression is run on each one. From each bootstrap regression i the prediction βᵢS(θ*) is collected and one standardized residual eᵢ is sampled (a residual divided by the square root of one minus the hat value associated with that residual). This produces a set of βᵢS(θ*) + eᵢ values. The prediction interval is then defined by the lower and upper percentiles of this set.
In practice, the bootstrapped predictions are distributed as r(S) + A + B, where r(S) is the regression prediction, A is a standard-error adjustment due to uncertainty about the estimated coefficients (in this case S(β̂ − βᵢ), where β̂ are the original OLS estimated parameters and βᵢ is a bootstrap estimate of the same) and B is an adjustment due to irreducible noise (in this case, a random sample of standardized residuals).
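A sketch of this resampling bootstrap for a single parameter, with an illustrative number of bootstrap replications and nominal level, could be:

```r
boot_interval <- function(reference_table, s_star, param, stat_cols,
                          n_boot = 200, level = 0.95) {
  form <- reformulate(stat_cols, response = param)
  target_df <- as.data.frame(t(s_star))
  names(target_df) <- stat_cols
  draws <- replicate(n_boot, {
    rows <- sample(nrow(reference_table), replace = TRUE)
    fit  <- lm(form, data = reference_table[rows, ])
    # standardized residual: residual / sqrt(1 - hat value)
    std_res <- residuals(fit) / sqrt(1 - hatvalues(fit))
    predict(fit, newdata = target_df) + sample(std_res, 1)
  })
  quantile(draws, c((1 - level) / 2, 1 - (1 - level) / 2))
}
```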
A more complex algorithm that is not linear but still additive is the generalized additive model (GAM), where we regress θ = Σᵢ sᵢ(Sᵢ(θ)), with sᵢ a smooth spline transformation (see Hastie et al.; also Wood & Augustin). To do so, we used the mgcv R package (Wood). The bootstrap prediction interval used for the linear regression was too computationally expensive to replicate with GAMs. Instead, prediction intervals are produced by assuming normal standard errors (generated by the regression itself) and by resampling residuals directly: we generate many draws of z(S(θ)) + ε, where z is normally distributed with standard deviation equal to the regression's standard error at S(θ) and ε is a randomly drawn residual of the original regression. The prediction interval for θ* is then defined by the lower and upper percentiles of the generated z(S(θ*)) + ε set.
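A minimal mgcv sketch of this procedure, assuming three summary statistics named S1, S2 and S3 and an illustrative number of draws and nominal level:

```r
library(mgcv)

# one smooth spline term per summary statistic
fit <- gam(theta ~ s(S1) + s(S2) + s(S3), data = reference_table)

# normal draws around the prediction at the observed statistics,
# plus resampled residuals of the original regression
pred  <- predict(fit, newdata = target_df, se.fit = TRUE)
draws <- rnorm(1000, mean = pred$fit, sd = pred$se.fit) +
  sample(residuals(fit), 1000, replace = TRUE)
quantile(draws, c(0.025, 0.975))   # prediction interval for theta*
```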
A completely non-parametric regression advocated in Raynal et al. is the random forest (Breiman). We implemented it in two ways. First, as a quantile random forest (Meinshausen), using the quantregForest R package (Meinshausen); prediction intervals for any simulation parameter θ* are the predicted lower and upper quantiles at S(θ*). Second, as a regression random forest, using the ranger and caret packages in R (Wright & Ziegler; Kuhn). For this method, we generate prediction intervals as in the GAM regressions: many draws of z(S(θ)) + ε are generated, where z is normally distributed with standard deviation equal to the infinitesimal jackknife standard error (Wager et al.) at S(θ) and ε is a resampled residual; the lower and upper percentiles of the z(S(θ*)) + ε set are then taken as our prediction interval.
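The quantile random forest variant reduces to a few lines with the quantregForest package (a sketch; the chosen quantiles and column names are illustrative):

```r
library(quantregForest)

qrf <- quantregForest(x = reference_table[, stat_cols],
                      y = reference_table[, param])
# predicted quantiles at the observed summary statistics form the interval
predict(qrf, newdata = target_df, what = c(0.025, 0.5, 0.975))
```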

Point performance
Table summarizes the performance of each algorithm across all identifiable estimation problems (here defined as those where at least one algorithm achieves performance above a minimum threshold). Estimation by random forests achieves the highest average performance and the lowest regret (the average distance between its performance and the highest performance in each simulation). Even so, regression random forests produce the best point predictions for only a minority of the identifiable parameters, with GAM regressions accounting for another large share of the best point predictions.
Local-linear regression and neural networks are also useful in the ABC context, achieving the highest performance for several parameters. Local-linear regressions face numerical issues when the number of summary statistics increases, and they were often unable to produce any estimate. The performance of neural network ABC could be further improved by adjusting its hyper-parameters, but this would quickly accrue high computational costs. The Appendix contains a table with the performance for all parameters generated by each algorithm.

It is important to note that the advantages of random forests become apparent only when estimating agent-based models. Simpler estimation algorithms perform just as well or better on smaller problems, as shown in Table . This implies that interactions between summary statistics, as well as the need to weigh them carefully, matter more when estimating agent-based models than simpler simulations, justifying the additional algorithmic complexity. Another important qualification is that random forests tend to perform relatively better on parameters whose maximum achievable performance is low. This is intuitive, because we expect non-linear regressions to function even when summary statistics are noisy and uninformative, but it also means that many of the parameters best estimated by random forests remain only barely identified (see Table ). Table lists the number of identification failures: parameters for which an algorithm's performance fell below the identification threshold even though at least one competing algorithm cleared it. In other words, we are tabulating cases where the parameter is identifiable but an estimation algorithm fails to identify it. Local-linear regression struggled with the "natural mean hierarchy" simulation. Linear regression failed to estimate the b parameter of the Lotka-Volterra models, the σ parameter of the normal distribution and the A parameter of the ecological traits model. Random forests failed to identify µ and δ in the RBC macroeconomic model. GAM regressions, in spite of often having been the best estimation algorithm, failed to identify several parameters, all in agent-based models (particularly in "Sugarscape" and "Wolf-Sheep predation").
Table also shows mis-identifications: cases where an algorithm estimated significantly worse than using no algorithm at all (performance well below zero). These are particularly dangerous failures, because the estimation algorithm "thinks" it has found a pattern that proves to be worse than assuming white noise. Mis-identification applies asymmetrically: it is more common for the more complicated approximate Bayesian computations and for simple regressions. Figure compares algorithms pairwise with respect to their performance. Even the "best" algorithm, the random forest, does better than GAMs only part of the time and performs worse than neural-network ABC for a sizeable share of the parameters. The humble first-degree linear regression (even after accounting for its identification failures and mis-identifications) wins more than half of the comparisons against any ABC except neural networks.
While there was no overall winner, this does not mean that the choice of algorithm is unimportant. On average, we improved point prediction performance more by choosing the right algorithm than by quadrupling the training data size. To prove this point, we focus on all parameters that were estimated with both the smaller and the larger number of training simulations. Across these parameters, switching from the worst to the best algorithm improved point prediction performance by more, on average, than quadrupling the number of training runs while keeping the same estimation algorithm did; switching from the median to the best algorithm also improved performance by more than the additional training runs.

Identification failures in agent-based models
The same approach used to rank estimation algorithms can be used to diagnose whether the model can be identified at all: if all algorithms fail to achieve cross-validation performance above a threshold, then the parameter cannot be identified.

Figure : Percentage of times one algorithm has higher performance than another, for all estimations where at least one algorithm achieves the minimum performance. A blue cell means the first algorithm performs generally better, a red cell means the second does.

We used two performance thresholds: "serious identification failures" were parameters where the best performance among all algorithms fell below a lower threshold, and "identification failures" where it fell below a higher one. One of the thresholds derives from the "scale" model, the canonical example of impossible identification, which sets the bar at that model's maximum performance in our tests. Ten agent-based models have at least one serious identification failure, and more still have at least one identification failure.

Table : For each agent-based model estimated, how many parameters failed to be identified (defined here as achieving best performance below the identification threshold) and how many seriously so (best performance below the lower threshold). For each agent-based model listed here, we only consider the estimation using the highest number of training simulations and summary statistics available.

In applied work, identification failures will be more common than the table suggests, for two reasons. First, we did not model lack of data, which often reduces the quality of summary statistics. For example, in the "Intra-Organizational Bandwagon" agent-based model we assumed the ability to monitor adoption thresholds for employees, something impossible in real-world applications. Second, the thresholds for failure are somewhat arbitrary, and in some applied work we may require higher performance for estimation to be useful.
Because identification failure has many causes, one needs to look at each model to diagnose its source. Sometimes, we failed to estimate multiple parameters because they governed the same behaviour in ways that made them hard to separate. For example, fertility in the Anasazi model depends on both a base fertility rate and a maximum fertile age, and it is hard to disentangle the two by just looking at aggregate population dynamics. Sometimes, we failed to estimate parameters because their original bounds are small enough that their effects are muted: in the COVID agent-based model the parameter controlling what percentage of the population wears N95 masks varies between 0 and 5%, and on its own this has no appreciable effect on the overall aggregate behaviour of the contagion. Sometimes a parameter was dominated by others: in the Ebola model the parameter describing the ability to trace cases cannot be identified because two other parameters (the effectiveness of the serum and the delay with which it is administered) matter far more to the overall aggregate dynamic. Sometimes parameters just did not have much of an impact on the model, as for example the overall standard deviation of catchability in the FishMob agent-based model.
Understanding the nature of identification failures helped us find ways to overcome them, or to judge whether they are a problem at all. Disentangling multiple parameters that govern the same behaviour can be done by collecting new kinds of data or by simplifying the model. Low performance for parameters with very narrow ranges is a signal of unreasonable accuracy requirements (or, alternatively, of low power of the data). The low performance of dominated parameters has valuable policy implications, since it shows some dynamics to be less important than others to control.
The main inconvenience with testing for identification is that we need to test the performance of many estimation algorithms before being confident of the diagnosis. When one algorithm achieves low performance, it may be the algorithm's fault rather than the model's. Only when all algorithms achieve low performance can we be more confident that the model is not identifiable. Given our threshold for identifiability, the smallest set of algorithms that finds all the identifiable parameters for all the examples in the paper has four members: random forest, GAM, neural network ABC and local-linear ABC. However, it is probable that with more examples we would expose at least one case where identification is achieved exclusively by another algorithm. In short, a good identification test requires us to test as many estimation algorithms as possible.

Coverage
To test the quality of the confidence intervals, Table shows how often the real testing parameter is within the confidence intervals estimated by each algorithm. Two methods that achieved low point prediction performance, rejection ABC and linear regression, achieve the best coverage rates for a large share of the parameters. This underscores how, even when point prediction is difficult, a good estimation algorithm will produce confidence intervals that contain the real parameters at the pre-specified confidence level. Linear regressions have the lowest mean and median coverage error, while regression-adjusted ABCs tend to produce confidence intervals that are too narrow.

Table : For each algorithm, the median and average coverage error: the absolute difference between the nominal confidence level and the proportion of parameters actually falling within the algorithm's stated confidence interval (out of sample). The lower the error, the more precise the algorithm. For each algorithm we also list the number of parameters for which the stated coverage was the most accurate out of sample compared to the other algorithms. Finally, we list how many times the algorithm produces deceptively small confidence intervals (that is, intervals that cover far fewer of the true parameters than stated).

When a parameter was unidentifiable, most algorithms returned very wide confidence intervals. This is a useful feature that warns the user about the poor point estimation quality. Observing confidence intervals is, however, not a good substitute for a full cross-validation test. This is because, as we show in Table , it is possible for the confidence intervals to be far too small, covering only a small fraction of the true parameters. In other words, sometimes estimation algorithms are mistakenly confident about their accuracy, and we need a cross-validation coverage test to verify this. Perversely, coverage errors are more likely with algorithms that achieved high estimation performance (random forests, GAMs and neural networks).

A simple identification testing protocol even if we are not interested in estimation
In this paper, we argued that cross-validation testing is useful for ranking estimation algorithms. There are, however, many models where estimation is not a major concern. Models focusing on theoretical exposition, illustration or description (to use the definitions of Edmonds et al.) are not concerned with the true value of any specific parameter. We argue, however, that testing ought to be done even under these circumstances, in order to notice identification issues.
Identification represents an important feature of the model: whether it can be fit to data at all. We claim that uncovering identification issues is of the same importance as sensitivity analysis, and the two procedures mutually reinforce one another. An unidentifiable parameter must be tested during sensitivity analysis over a large range, since data will never narrow its value down. Vice versa, finding during sensitivity analysis that a model output is insensitive to a parameter provides an important clue for explaining identification issues. Performing only sensitivity analysis and preferring models that are less sensitive creates a bias that leads us to prefer less identifiable models.
Ideally, we would like to test for identification without ever running new simulations. We can do this with reference-table methods by recycling the runs used for global sensitivity analysis (Ligmann-Zielinska et al.; Magliocca et al.; ten Broeke et al.). Given the results above, we suggest at least looking at the performance of random forests, GAMs, neural network ABC and local-linear regression ABC. If performance is below the identification threshold for all of them, we can be relatively confident that the parameter is not identifiable.
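A sketch of this check, reusing the performance() helper defined earlier: cv_estimates is assumed to be a named list with one vector of cross-validated point estimates per algorithm, and the threshold value is illustrative.

```r
is_identifiable <- function(cv_estimates, truth, train_mean, threshold = 0.1) {
  # performance of each algorithm on the same held-out rows
  scores <- sapply(cv_estimates, performance,
                   truth = truth, train_mean = train_mean)
  # the parameter counts as identifiable if at least one algorithm clears the bar
  any(scores >= threshold)
}
```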

We worry too much about efficiency and too little about testing
Many recent developments in the estimation of simulations have focused on efficient search (see the various algorithms featured in Cranmer et al.'s review of the "frontier" of simulation-based inference). While we welcome higher estimation efficiency, we think this objective is often pursued at the expense of testing efficiency.
The pursuit of estimation efficiency can be explained by the desire to build complicated models that run slowly but can still be fit to data. The more efficient the estimation algorithm, the bigger our model can get and still be estimated within a given computational budget. Here, however, we highlighted the trade-off between estimation and testing efficiency. Recognizing that faster estimation comes at the price of slower testing, there is again an incentive to focus on the kind of simpler agent-based models advocated, for example, in Miller & Page.
We are not claiming that search-based estimation algorithms should not be used. Given how context-dependent performance is, there must be many agent-based models that would be better estimated by search-based algorithms. Our argument is that even when using a search-based algorithm, we should first perform a cross-validation test using reference-table methods, take notice of the unidentified parameters, and discount the values that search-based algorithms assign to them.

Estimation testing is not validation
The idea of cross-validation testing as a way to judge estimation accuracy, as well as the older statistical convergence literature it tries to replace, is premised on a lie: the existence of the mythical "real" parameters. The "real" parameters are the parameters of the "real" model that generated the data we observe. In other words, during estimation and cross-validation we are forced to assume our model to be true when we know it to be false. This is what Gelman & Shalizi call "methodological fiction".

Gelman & Shalizi go on to stress that the question is not whether the model is true, but whether it is close enough to be useful. Establishing whether the model is "good enough" is the process of validation. The cross-validation tests we advocate here are not a validation tool but are instead part of the estimation procedure that precedes validation. Once we have an estimated model, we can test its validity through, for example, out-of-sample predictions or qualitative responses to shocks that experts find believable.

Conclusion
This paper provides two key results. First, if we are concerned primarily with the quality of point estimates, there is no substitute for trying multiple algorithms and ranking them by cross-validation; GAMs and random forests provide a good starting point. Second, identification failure is common in agent-based models but can be spotted with the same cross-validation tests.
We know of no agent-based model that used cross-validation to choose how to estimate its parameters (with the exception of the comparison between ABC-MCMC and simulated minimum distance in Grazzini et al.). The common approach seems to be to pick one estimation algorithm and apply it. We have shown here that this is sub-optimal: no estimation algorithm is a priori better than the others.
We should abandon the hope that a large enough literature survey will uncover the single best estimation algorithm to use. Testing estimation algorithms is computationally expensive and we would have preferred such a result. Unfortunately, we found here that performance is context-dependent and there is no alternative to testing methods on each agent-based model.
Papers proposing new estimation algorithms tend to showcase their approach against one or two examples. It would help the literature to have a larger, standardized set of experiments against which to gauge any newcomer. We hope this paper and its code repository are a first step. However, it may be impossible to find an estimation algorithm that is always best, so we should prioritize methods for which cross-validation can be done without having to run more simulations.

The no free lunch theorem (Wolpert & Macready) argues that, when averaging over the universe of all search problems, all optimization algorithms (including random search) perform equally well. A supervised learning version of the same argument (Wolpert) suggests that "on average" all learning algorithms and heuristics are equivalent. These are deeply theoretical results whose practical applications are limited: nobody has ever suggested abandoning cross-validation because of them, for example. However, some weak form of the theorem seems to hold empirically for simulation inference: for each estimation algorithm there is a simulation parameter for which it does best.

Simple simulations
We computed performance and coverage for all the experiments in this section by 5-fold cross-validation: keeping one fifth of the data out of sample, using the remaining portion to train our algorithms, and doing this five times, rotating each time the portion of data used for testing. We ran all the experiments in this section twice: once with a smaller total number of simulation runs and once with four times as many.
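A sketch of this 5-fold scheme over a reference table; estimate_fn is a hypothetical stand-in for any of the estimation algorithms described above (it takes a training table and the held-out summary statistics and returns point estimates):

```r
k <- 5
folds <- sample(rep(1:k, length.out = nrow(reference_table)))
cv_estimates <- lapply(1:k, function(fold) {
  train <- reference_table[folds != fold, ]
  test  <- reference_table[folds == fold, ]
  # estimate the held-out parameters from the held-out summary statistics
  estimate_fn(train, test[, stat_cols])
})
```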

g-and-k distribution: Karabatsos & Leisen used ABC to estimate the parameters of the g-and-k distribution (an extension of the normal distribution whose density function has no analytical expression). We replicated this here using the gk package in R (Prangle). We wanted to retrieve the parameters of the distribution A, B, g, k ∼ U[0, 10] given the deciles (10%, 20%, …, 90%) of a sample of observations from that distribution.
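One row of the corresponding reference table could be generated as follows (a sketch; the sample size is illustrative and rgk is the sampling function of the gk package):

```r
library(gk)

params <- runif(4, 0, 10)   # A, B, g, k ~ U[0, 10]
draws  <- rgk(10000, A = params[1], B = params[2], g = params[3], k = params[4])
deciles <- quantile(draws, probs = seq(0.1, 0.9, by = 0.1))   # summary statistics
```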
Normal: Sometimes sufficient summary statistics exist but the modeller may miss them and use others of lower quality. In this example, i.i.d. observations from the same normal distribution ∼ N(µ, σ²), with µ ∼ U(−5, 5) and σ ∼ U(1, 10), were used directly as summary statistics to retrieve the two distribution parameters µ and σ².
Moving Average(2): Creel used neural networks to recover the parameters of the MA(2) process with β₁ ∼ U(−2, 2) and β₂ ∼ U(−1, 1). We generated a time series and summarised it with the coefficients of an AR regression fitted to it.
Median and MAD: As a simple experiment, we sampled observations from a normal distribution with µ ∼ U(−5, 5) and σ ∼ U(0.1, 10) and collected as summary statistics their median and median absolute deviation, using them to retrieve the original distribution parameters. We ran this experiment twice, the second time adding two useless summary statistics S₃ ∼ N(3, 1) and S₄ ∼ N(100, 0.01).
µ-σ²: The abc package in R (Csilléry et al.) provides a simple dataset example connecting two observed statistics, "mean" and "variance", to the parameters µ and σ² that generated them. The posterior that connects the two derives from the Iris setosa observations (Anderson). We used the dataset as provided and log-transformed σ² when estimating.
Ecological Traits: The EasyABC R package (Jabot et al.) provides a replication of Jabot, a trait-based ecological simulator. Here, we fixed the number of individuals and the number of traits, leaving four free parameters with uniform priors. We wanted to estimate these with four summary statistics: the richness of the community S, the Shannon index H, and the mean and skewness of trait values in the community.

Ill-posed models
As with simple simulations, we tested all the experiments with 5-fold cross-validation and ran each twice: once with a smaller reference table and once with one four times as large.
Broken Line: we observed ten summary statistics S = (S₀, …, S₉) generated by a process governed by a single parameter β ∼ U(0, 2).

Hierarchical Normal Mean: Raynal et al. compared ABC to direct random forest estimation in a "toy" hierarchical normal mean model whose priors involve the inverse gamma distribution IG(·). We wanted to estimate θ₁, θ₂ given a sampled vector y, which is described by the following summary statistics: the mean, the variance and the median absolute deviation of the sample, all possible combinations of their products and sums, as well as pure-noise summary statistics ∼ U(0, 1).
Locally Identifiable: macroeconomics often deals with structural models that are only locally identifiable (see Fernández-Villaverde et al.). These are models where the true parameter is only present in the data for some of its possible values. Here, we used an example with a parameter drawn uniformly from [1, 5]; in each simulation we sampled a vector y and collected its mean and standard deviation as summary statistics.
Scale: a common source of identification failure in economics occurs "when two structural parameters enter the objective function only proportionally, making them separately unrecoverable" (Canova & Sala). In this example, two people of weight w₁, w₂ ∼ U[80, 150] step together on a scale whose reading S₁ = w₁ + w₂ + ε, with ε ∼ N(0, 1), is the only summary statistic we can use. This problem was locally identifiable to an extent: very low readings mean both people are light (and vice versa).
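The corresponding reference table is trivial to generate, which is also why this model reappears in the freelunch example later in the paper (a sketch; the number of rows is illustrative):

```r
n <- 5000
weight1 <- runif(n, 80, 150)
weight2 <- runif(n, 80, 150)
total   <- weight1 + weight2 + rnorm(n)   # scale reading with N(0, 1) noise
scale_experiment <- data.frame(total, weight1, weight2)
```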
Unidentifiable: in some cases the model parameters are simply unrecoverable, and we hope that our estimation algorithm does not tell us otherwise. In this example, the three summary statistics S₁, S₂, S₃ ∼ N(x, 1), with x ∼ U[0, 50], provide no information about the two parameters we were interested in: µ ∼ U(0, 50) and σ ∼ U(0, 25).
Partially Identifiable: Fernández-Villaverde et al. mention how partial identification can occur when a model is the real data generating process only conditional on some other unobserved parameter. This makes the model identifiable in some samples but not others. In our case, we used a slight modification of the original: we tried to retrieve the parameter θ ∼ U[1, 5] from the observed mean and standard deviation of a sampled vector y.

Complicated models
Birds: We ran this experiment twice, once with only two summary statistics (mean abundance and mean variation over the years) and once with a much larger set (comprising the average, last value, standard deviation, range and the coefficients of fitting an AR regression to the time series of abundance, variation, months spent foraging and average age within the bird population). This experiment is useful because in the original specification (with only the two summary statistics) the scout-prob parameter is unidentifiable. For each experiment we built a separate reference table of model runs.

Lotka-Volterra: the classic predator-prey model; Blum et al. in particular used this dataset to compare the quality of ABC dimensionality reduction schemes to better estimate the two parameters. This dataset was too big for cross-validation, so in this experiment we simply set aside a subset of observations as the testing dataset and used the rest for training. Here, we assumed a, b ∼ U(0, 10) (avoiding the negative values in the original paper). For each simulation, we sampled observations for predator and prey at times t = 1, 1.2, 2.4, 3.9, 5.7, 7.5, 9.6, 11.9, 14.5 (as in the original paper). We ran this experiment twice, once where the data was observed perfectly and once where we added noise ∼ N(0, 0.5) to each observation. In both experiments, we did not perform 5-fold cross-validation; rather, we generated one set of summary statistics for training and another set to test the parametrization.
Real Business Cycle: we wanted to parametrize the default Real Business Cycle model (a simple but outdated class of macroeconomic models) implemented in the gEcon R package (Klima et al.). It has six parameters (β, δ, η, µ, φ, σ) and we tried to parametrize them in two separate experiments. In the first, we used as summary statistics the lagged cross-correlation table between output Y, consumption C, investment I, interest rates r and employment L. In the second experiment, we followed Carrella et al., using as summary statistics (i) the coefficients of regressing Y on Y(t−1), I(t), I(t−1), (ii) the coefficients of regressing Y on Y(t−1), C(t), C(t−1), (iii) the coefficients of regressing Y on Y(t−1), r(t), r(t−1), (iv) the coefficients of regressing Y on Y(t−1), L(t), L(t−1), (v) the coefficients of regressing Y on C and r, (vi) the coefficients of fitting an AR regression on Y, and (vii) the (lower triangular) covariance matrix of Y, I, C, r, L. Each experiment had its own reference table of distinct observations.
Pathogen: another dataset used by Blum et al. to test dimensionality reduction methods for ABC, concerning the ability to predict pathogens' fitness changes due to antibiotic resistance (the original model and data are from Francis et al.). The model has four free parameters and a larger set of summary statistics. While the original dataset contains over a million separate observations, we only sampled a subset at random for training the algorithms and a further subset for testing.
Earthworm: van der Vaart et al. calibrated an agent-based model of earthworms with rejection ABC. The simplified version of the model contains several parameters and summary statistics. The original paper already carried out cross-validation, proving some parameters to be unidentifiable: the model contains a mixture of unidentified, weakly identified and well identified parameters. We used the runs from the original paper, setting some aside for out-of-sample testing and using the rest for training.

COMSES Agent-based models
Strictly speaking, agent-based models are just another kind of complicated simulation. They tend to be slow to run, contain many moving parts and interacting components, and they tend to produce many summary statistics, as they picture the evolution of a system along many dimensions. The agent-based models we describe here are all NetLogo models available in the COMSES computational model library (Rollins et al.), and we tried to sample across disciplines.
• Anasazi: This simulation follows Janssen's replication of the famous Kayenta Anasazi agent-based model (Axtell et al.). We varied four parameters: harvestAdjustment, a productivity variable, harvestVariance, the variance of the productivity, as well as Fertility, representing the fertility rate, and FertilityEndsAge, which represents the maximum fertile age for the population. The first three parameters had uniform priors ∈ [0.1, 0.9] while the last parameter was drawn uniformly over a range of ages. We only looked at one time series, the total number of households. We generated summary statistics on this time series by recording its value at regular intervals as well as its maximum, minimum, average, standard deviation and trend. We ran each simulation for a fixed number of steps.
• Macroeconomic model: an agent-based macroeconomic model in which we varied eight parameters: wages-shock-xi ∼ U[0.01, 0.5], controlling wage shocks due to vacancies, interest-shock-phi ∼ U[0.01, 0.5], controlling interest shocks due to contracting, price-shock-eta ∼ U[0.01, 0.5], a parameter exploring price setting, production-shock-rho ∼ U[0.01, 0.5], a parameter exploring production setting, v ∼ U[0.05, 1], the capital requirements coefficient, beta ∼ U[0.05, 1], the preference for smoothing consumption, dividends-delta ∼ U[0.01, 0.5], the fraction of profits returned as dividends, and size-replacing-firms ∼ U[0.05, 0.5], a parameter governing inertia in replacing bankrupt firms. We looked at nine time series: unemployment rate, inflation, net worth of firms, production, wealth of workers, leftover inventory, CPI index, real GDP and total household consumption. We turned these time series into summary statistics by recording their value at regular intervals as well as their maximum, minimum, average, standard deviation and trend. We ran each simulation for a fixed number of steps, with a spin-up period during which data was not collected.
• Intra-Organizational Bandwagon: Secchi & Gullekson simulate the adoption of innovation within an organization, depending on tolerance to bandwagons. We varied two variables: vicinity ∼ U[2, 50], representing the size of the space workers are aware of when considering jumping on a bandwagon, and K ∼ U[0.1, 0.9], representing the ease with which imitation occurs. We followed four time series: the number of adopters, the mean threshold for adoption, its standard deviation and the maximum threshold in the organization. We turned these time series into summary statistics by recording their value at regular intervals as well as their maximum, minimum, average, standard deviation and trend. We ran each simulation for a fixed number of steps.
• Governing the Commons: an example of a socio-ecological system from an introductory textbook (Janssen). A landscape of logistic-growth patches is harvested by agents who can imitate each other's threshold for action. We modified four parameters: discount ∼ U[0.9, 1], the discount rate in each agent's utility, costpunish ∼ U[0, 0.1], the percentage of wealth lost by agents monitoring others, costpunished ∼ U[0, 0.2], the percentage of wealth lost by agents caught having too much wealth, and percent-best-land ∼ U[5, 25], which determines the carrying capacity. We tracked time series including the average unharvested resource remaining, the average wealth of the turtles and the average threshold before harvesting. We turned these time series into summary statistics by recording their value at regular intervals as well as their maximum, minimum, average, standard deviation and trend. We ran each simulation for a fixed number of steps.
• COVID-Masks: Brearcliffe presents a simple epidemiological model where randomly moving agents progress through a COVID SIR model with only masks to slow down the spread. We modified four parameters: infectiousness ∼ U[80, 99], representing how easily the disease spreads on contact, and three parameters representing the availability of masks of different kinds: masks-n95 ∼ U[0, 5], masks-medical ∼ U[0, 30] and masks-homemade ∼ U[0, 65]. We tracked three time series: the number of exposed, recovered and infected agents. We turned these time series into summary statistics by recording their value at regular intervals as well as their maximum, minimum, average, standard deviation and trend. We ran each simulation for a fixed number of steps.
• Ebola Policy: Kurahashi replicated and enhanced the original smallpox agent-based model of Epstein et al. by adding public transportation and Ebola-specific treatment strategies. We modified three parameters: trace-delay ∼ U[1, 10], the days it takes to run an epidemiological trace for an infected individual, trace-rate ∼ U[0.3, 0.7], the probability of tracing each individual correctly, and serum-effect ∼ U[0, 1], which represents the ability of the serum to inoculate the patient. We tracked three time series: the number of infections, recoveries and deaths. We turned these time series into summary statistics by recording their value at regular intervals as well as their maximum, minimum, average, standard deviation and trend. We ran each simulation for a fixed number of steps.
• FishMob: a socio-economic model introduced in Lindkvist. FishMob involves four fishing areas and a diverse set of fleets moving and adapting to resource consumption. We varied five parameters: share.mobile ∼ U[0.01, 1], representing the percentage of fishers that can change port, profitmax-vs-satisficers ∼ U[0, 1], the percentage of fishers that maximize profits (the remaining population acting as satisficers), intr_growthrate ∼ U[0.1, 0.8], representing the average growth rate of the stock, globalcatchabilitySD ∼ U[0, 0.5], representing the standard deviation of catchability across regions, and min-viable-share-of-K ∼ U[0.05, 1], which represents a minimum viable share of the carrying capacity. We observed eight time series: the exploitation index and the total biomass in each of the four regions. We turned these time series into summary statistics by recording their value at regular intervals as well as their maximum, minimum, average, standard deviation and trend. We ran each simulation for a fixed number of steps.
• Ger Grouper: a model of human adaptation to the mutable environment of northern Mongolia (Clark & Crabtree). We varied four parameters: gerreproduce ∼ U[1, 5], which represents the per-step probability that a household with enough energy reproduces, patch-variability ∼ U[1, 100], the probability per time step of any grassland turning bare, ger-gain-from-food ∼ U[2, 20], which represents the harvest-to-energy transformation ratio, and grass-regrowth-time ∼ U[1, 10], which controls the regrowth of the resource. We observed five time series: the total population and the population of each of the four "lineages" (subpopulations that share the same cooperation rate). We turned these time series into summary statistics by recording their value at regular intervals as well as their maximum, minimum, average, standard deviation and trend. We ran each simulation for a fixed number of steps.
• Two-factor Theory: Iasiello et al. agentize the interaction between motivation and environment hygiene in a human resources management model. The model depends on four parameters, motivation-consistent-change-amount, tolerance-consistent-change-amount, hygiene-consistent-change-amount and potential-consistent-change-amount, all ∼ U[−1, 1], which govern the changes in motivation styles or hygiene factors whenever there is a mismatch between satisfaction and dissatisfaction. We monitored three time series: the number of dissatisfied workers, the number of workers that moved and the number of new workers. We turned these time series into summary statistics by recording their value at regular intervals as well as their maximum, minimum, average, standard deviation and trend. We ran each simulation for a fixed number of steps.
• Insulation Activity: a model of homeowners' adoption of energy efficiency improvements, responding to both economic and non-economic motivations (Friege et al.). We varied four parameters: radius ∼ U[0, 10], representing the spatial range over which homeowners compare themselves to their neighbours, av-att ∼ U[−1, 1], the overall trend in the general propensity to adopt insulation, weight-soc-ben ∼ U[0, 10], a scaling factor increasing the importance of positive information, and fin-con ∼ U[0, 10], an inertia factor slowing down insulation adoption due to financial constraints. We observed five time series: the average age of windows, the average age of roofs, total and average energy efficiency rates and average heating efficiency rates. We turned these time series into summary statistics by recording their value at regular intervals as well as their maximum, minimum, average, standard deviation and trend. We ran each simulation for a fixed number of steps.
• Peer Review Game: Bianchi et al. simulate a set of incentive and motivation structures that generate papers, citations and differing levels of scientific quality. In this paper, we focused on "collaborative and fair" decision making, where agents increase effort whenever they believe their paper deserved the fate it received (high quality led to acceptance, low quality led to rejection). We varied five parameters: effort-change ∼ U[0, 0.2], representing how much authors adapt whenever a paper is peer-reviewed, overestimation ∼ U[0, 0.5], representing the bias authors have in evaluating their own paper, top ∼ U[1, 40], the number of best papers that make up the "top quality" publication list, published-proportion ∼ U[0.01, 1], the percentage of authors that publish in a given time step, and researcher-time ∼ U[1, 200], the budget of time available to each agent. We tracked six time series: evaluation bias, productivity loss, publication quality, quality of top publications, the Gini coefficient for publications and reviewing expenses. We turned these time series into summary statistics by recording their value at regular intervals as well as their maximum, minimum, average, standard deviation and trend. We ran each simulation for a fixed number of steps.
• Office Moves: Dugger simulates the interaction between three kinds of employees (workers, shirkers and posers). We varied three parameters: %_workers ∼ U[2, 48], the percentage of employees that are workers, %_shirkers ∼ U[2, 48], the percentage of employees that are shirkers, and window ∼ U[2, 10], which represents the moving-average smoothing factor of the average performance that employees compare themselves to. We tracked four time series: the percentage of happy employees, the percentage of happy workers, the percentage of happy shirkers and average performance. We turned these time series into summary statistics by recording their value at regular intervals as well as their maximum, minimum, average, standard deviation and trend. We ran each simulation for a fixed number of steps.
• Multilevel Selection: Sotnik et al. model a commons problem where agents may cooperate and share some of the benefits of higher-value areas. We varied four parameters: initial-percent-of-contributors ∼ U[10, 80], representing the number of cooperating agents at the start of the simulation, resource-size ∼ U[0.1, 2], which represents the size of the resource shared by cooperating agents, multiplier-effect ∼ U[1, 5], which scales personal contribution when valued by the common, and social-pressure-vision ∼ U[1, 5], which governs how close agents need to be to pressure one another to contribute. We tracked five time series: the difference between between-group and within-group selection, the average payoff, the percentage of non-contributors, the percentage of people that were pressured and assortativity among contributors. We turned these time series into summary statistics by recording their value at regular intervals as well as their maximum, minimum, average, standard deviation and trend. We ran each simulation for a fixed number of steps.
• NIER: Boria's thesis deals with efficiency upgrades and the distribution of transition costs across income groups. We varied four parameters: %_EEv_upgrade_at_NR ∼ U[0.01, 0.5], the efficiency gain due to a retrofit, diminishing_returns_curve ∼ U[1, 10], the slope of a function reducing retrofit efficiency when applied to already efficient houses, %-Leaders ∼ U[1, 50], the percentage of agents that try to convince others to upgrade, and %-Stigma-avoiders ∼ U[1, 50], the percentage of the population that does not want to be in the minority. We tracked ten time series: the number of households belonging to each quartile of the energy efficiency distribution, both within and outside of the simulated district, as well as the mean and standard deviation of energy efficiency within the district. We turned these time series into summary statistics by their maximum, minimum, average, standard deviation and trend. We ran each simulation for a fixed number of steps.
• RiskNet: Will et al. present a model of risk-sharing and insurance decisions for stylized smallholders. We varied four parameters: shock_prob ∼ U[0.01, 0.5], the per-step probability of an adverse shock, covariate-shock-prob-hh ∼ U[0.5, 1], which correlates village-level shocks with households, shock-intensity ∼ U[0, 1], the percentage of the sum insured that is part of the insurance payout, and insurance-coverage ∼ U[0, 1], the initial insurance coverage rate. We tracked five time series: the Gini coefficient, the budget of all households, the budget of uninsured households, the fraction of active households and the fraction of uninsured active households. We turned these time series into summary statistics by recording their value at regular intervals as well as their maximum, minimum, average, standard deviation and trend. We ran each simulation for a fixed number of steps.
• Standing Ovation: a simple problem where spectators have to synchronously choose whether to clap standing or stay seated at the end of a performance. We varied three parameters: intrinsic-prob-standing ∼ U[0.1, 0.9], which represents the original probability of a spectator standing, noise ∼ U[0, 0.1], which represents the probability of a spectator randomly changing their stance, and cone-length ∼ U[1, 5], which represents the cone of vision the agent uses to imitate other spectators. We tracked two time series: the number of agents standing and the number of agents feeling awkward. We turned these time series into summary statistics by recording their value at regular intervals as well as their maximum, minimum, average, standard deviation and trend. We ran each simulation for a fixed number of steps.
• Schelling-Sakoda Extended: Flache & de Matos Fernandes implemented a two-population NetLogo version of the famous Schelling-Sakoda segregation model (Schelling; Sakoda; Hegselmann). We varied three parameters: density ∼ U[50, 90], representing the percentage of the area already occupied by a house, %-similar-wanted ∼ U[25, 75], representing the percentage of same-population neighbours an agent requires to avoid moving, and radiusNeighborhood ∼ U[1, 5], which determines the size of the neighbourhood agents look at when assessing similarity. We tracked seven time series: the percentage of unhappy agents, of unhappy red agents and of unhappy blue agents; the percentage of similar agents in any neighbourhood, and that same percentage computed separately for red and blue agents; and the percentage of "clustering" for red agents. We turned these time series into summary statistics by recording their value at regular intervals as well as their maximum, minimum, average, standard deviation and trend. We ran each simulation for a fixed number of steps.
• Sugarscape: Janssen replicates the famous Sugarscape model (Epstein & Axtell), restoring the trade functionality that the NetLogo model library version lacks. We varied five parameters: initial-population ∼ U[100, 500], the number of initial agents, pmut ∼ U[0, 0.2], the probability of mutating vision on reproduction, maximum-sugar-endowment ∼ U[6, 80] and maximum-spice-endowment ∼ U[6, 80], the amounts of initial resources available, and wealth-reproduction ∼ U[20, 100], the amount of resources needed to spawn a new agent. We tracked six time series: the number of agents, the Gini coefficient of wealth, the total amount of sugar and of spice that can be harvested, the total number of trades and the average price. We turned these time series into summary statistics by recording their value at regular intervals as well as their maximum, minimum, average, standard deviation and trend. We ran each simulation for a fixed number of steps.
• Food Supply Chain: van Voorn et al. present a model of a food supply network where higher efficiency is achieved only by lowering resilience to shocks. We varied five parameters: TotalProduction ∼ U[100, 200] and TotalDemand ∼ U[50, 200], the total available sources and sinks in the network, StockPersistence ∼ U[0.5, 0.99], which represents the spoilage rate per time step, base-preference ∼ U[0.01, 0.2], a normalizing factor in customer preferences for each trader, and exp-pref ∼ U[1, 100], which governs customers' inertia in changing suppliers. We tracked nine time series: the inventory of the three traders, total production, total consumption, and the maximum, minimum, mean and standard deviation of consumer health. We turned these time series into summary statistics by recording their value at regular intervals as well as their maximum, minimum, average, standard deviation and trend. We spun up the model, then ran it through a stationary phase, a shock and another stationary phase of equal length.

Full performance table

Introduction to freelunch package
The package freelunch can be installed from GitHub at https://github.com/CarrKnight/freelunch. The package is composed of a series of functions that help estimate models and check the performance of estimated models by cross-validation. The methods that estimate parameters all start with fit_* and they all share the same interface, requiring four arguments:

```r
estimates <- freelunch::fit_gam(
  training_runs = reference_table,
  target_runs = c(1.37, 1.11),
  parameter_colnames = c("paramone", "paramtwo"),
  summary_statistics_colnames = c("ssone", "sstwo"))
```

This estimates paramone within a narrow range while leaving paramtwo spread across its whole range.

An ill-posed example
The classic example of an under-identified model: two people with weights weight1 and weight2 stand together on a scale and we read only their combined weight, total.
Reading only total, we are asked to recover the two individual weights. We observe the total weight of 5,000 pairs of people. There is already an example of this in the package, so we just load it here:

```r
data("scale_experiment")
glimpse(scale_experiment)
#> Rows: 5,000
#> Columns: 3
#> $ total   <dbl> 259.8286, 244.7187, 213.4538, 248.0799, 213.7402, 278.4608,...
#> $ weight1 <dbl> 135.94215, 126.21180, 121.82348, 123.92908, 125.67809, 129....
#> $ weight2 <dbl> 123.24073, 118.69401, 92.68997, 123.20384, 87.20287, 149.87...
```

We can try to solve this problem with rejection ABC and random forests. We first perform a standard set of cross-validations:

```r
abc.cv <- cross_validate_rejection_abc(
  total_data = scale_experiment,
  parameter_colnames = c("weight1", "weight2"),
  summary_statistics_colnames = c("total"))

rf.cv <- cross_validate_random_forest(
  total_data = scale_experiment,
  parameter_colnames = c("weight1", "weight2"),
  summary_statistics_colnames = c("total"))
```

We then discover that rejection ABC performs better than random forests for both parameters. This makes sense, since the main advantage of random forests (weighing multiple conflicting summary statistics) is null here. While it is nice to know the general performance, a bit more analysis provides further insights. For example, here the performance of rejection ABC is not consistent across the parameter space. The individual weights are easier to guess when both people are either very light or very heavy; the estimation also works if both people on the scale are about the same size, since rejection ABC tends to default to averaging weights when in doubt. We can see this graphically by plotting the average RMSE over a grid of parameters:

```r
freelunch::plot_grid_rmse(abc.cv,
  parameter1 = "weight1", parameter2 = "weight2",
  intervals_first_parameter = 15, intervals_second_parameter = 15)
```

This is useful because it shows that the method is very poor at estimating individual weights when one person is heavy and the other is thin.
Eventually, if the applied work indicates that the "real life" observation we are trying to estimate has total = 200, we can estimate with ABC as follows:

```r
abc.estimate <- fit_rejection_abc(
  training_runs = scale_experiment,
  target_runs = c(200),
  parameter_colnames = c("weight1", "weight2"),
  summary_statistics_colnames = c("total"))
```

This guesses a weight of roughly half the total for both individuals, with a wide confidence interval around that value.