Towards the Right Ordering of the Sequence of Models for the Evolution of a Population Using Agent-Based Simulation

: Agent based modelling is nowadays widely used in transport and the social science. Forecasting population evolution and analysing the impact of hypothetical policies are often the main goal of these developments. Such models are based on sub-models defining the interactions of agents either with other agents or with their environment. Sometimes, several models represent phenomena arising at the same time in the real life. Hence, the question of the order in which these sub-models need to be applied is very relevant for simulation outcomes. This paper aims to analyse and quantify the impact of the change in the order of sub-models on an evolving population modelled using TransMob. This software simulates the evolution of the population of a metropolitan area in South East of Sydney (Australia). It includes five principal models: ageing, death, birth, marriage and divorce. Each possible order implies slightly different results mainly driven by how agents’ ageing is defined with respect to death. Furthermore, we present a calendar-based approach for the ordering that decreases the variability of final populations. Finally, guidelines are provided proposing general advices and recommendations for researchers designing discrete time agent-based models.


Introduction and Motivation
. Complex systems characterized by a large number of entities interacting with each others is a very attractive framework to model a large number of phenomena arising in our societies.Examples of such systems that can involve millions of agents include transportation, social interactions, the spread of contagious diseases and the evolution of populations.
. Agent-based models, or microsimulations, are tools that are now widely used to model and simulate such complex systems.The base unit of these models is the agent representing an entity of the population under scrutiny.As such, each agent is characterised by attributes and behavioural rules mimicking the real entity, and can interact with each other as well as with their environment.Even though the behavioural and interactions rules defined for each agent are typically simple, the resulting emerging behaviour of the system is o en non-linear and di icult to predict.
. Using agent-based model to simulate the evolution of a population consists of two major steps, each of them having its own set of challenges : . the generation of the synthetic population: the goal of this step is to generate a baseline population of agents which is statistically as similar as possible to the population of interest.The synthetic population generation has been extensively studied in the literature in the last two decades since the seminal work of Beckman et  ) for a review of existing approaches as well as their performances and drawbacks.
. the dynamic evolution of the population: in this step, the dynamic evolution of the baseline population of agents is simulated in order to forecast the future population.This is done by defining a set of models, rules and interactions for the agents.A large number of agent-based models aiming to reproduce the evolution of a population have been developed over the years, such as ILUTE (Miller et al. ), MOBLOC (Cornelis et al. ), VirtualBelgium (Barthélemy ) and its extension VirtualBelgium in Health (Dumont et al. b) and TransMob (Huynh et al. ).
. The second step usually involves many di erent models.For instance, we can have models to simulate ageing, births and deaths in the population, the evolution of the socio-professional status (i.e.student, retired, active, inactive) and the marital status (single, married, de-facto,...) of the individuals, their health etc.
. It is clear that the ordering in which such models are executed can have a significant impact on the final forecasted population as well as other factors such as the choice of the pseudo-random number generator, its seed and the quality of the data.Hence finding the ordering which allows to produce the most accurate results is a critical issue (Dumont et al. a).Despite its importance, to the best of our knowledge this problem has not yet been properly investigated in the literature.Indeed, the order is arbitrarily fixed in every application, without detailing why a particular order has been retained.This gap in the literature motivated this work aiming at providing reasons behind the selection of a particular order over others.

.
In order to achieve this goal, we will test every feasible order of the models implemented in TransMob, an agentbased model used to simulate the dynamics of a metropolitan area in South East of Sydney, with demographic evolution.The resulting populations will then be compared among them in order to characterize the impact of the ordering of the models.In addition, the sensitivity of TransMob to the seed of the random number generator used by the models will also be tested.Finally, we will propose a method to decrease the impact of the order by randomly assigning dates of births and deaths for every individuals.
. A preliminary analysis of the importance of the order for TransMob is described in (Dumont et al. a).The statistical analysis hereby presented improves previous research by describing the importance of the order.In addition, we present an original, calendar-based approach to attenuate the impact of ordering.

.
The remainder of the paper is organised as follows.Section gives a brief overview of TransMob, its agents and evolutionary models.In Section we investigate the impact of the both the ordering of the models and the seed on the simulated populations.We then present in Section a method to reduce the number of feasible orders, which also helps to lower the variability of the simulated populations.The performance of the new approach is investigated in Section .Concluding remarks and future perspectives are then discussed in Section .

TransMob
. This Section briefly introduces TransMob, an agent-based model for simulating the dynamics of a metropolitan area in South East of Sydney, Australia.This microsimulation integrates six major modules interacting with each other: synthetic population generation and evolution, perceived liveability, travel diary assignment, tra ic micro-simulator, residential location choice and travel mode choice.The interactions between those modules are described in (Huynh et al. ).
. Each simulated individual, or agent, is characterised by several attributes, including age, gender, household relationship, household type, identification of the synthetic household he/she belongs to, and the identification of the census collection district the synthetic household resides in.Complete details on the generation and the attributes of the synthetic population can be found in (Huynh et al. ).
. In this work, we will focus on the models responsible for the demographic evolution within the synthetic population module.TransMob evolves the synthetic population developed in (Huynh et al. ) with a time step of one year for a predefined time horizon, which is set to ten years in this work.A snapshot of the synthetic population is then generated every first of January.

.
The approach consists of five dynamical processes executed in this specific order: ageing, dying, giving births, divorcing and marrying.It is clear that out of these five processes, only ageing is deterministic (every individual age).On the other hand, the remaining processes are stochastic, i.e., they occur randomly depending on probabilities extracted from available data.Moreover, for death, divorces and marriages, the probability of these events is conditioned by age and gender, while the probability of giving birth is conditioned by the number of previous pregnancies and the age of the female agent.The overall procedure is illustrated in Figure .Depending on the event, the structure of the household can be updated.For additional information, these evolution algorithms are fully detailed in (Huynh et al. ).
. For each simulated year, a probability for each possible event is assigned to each synthetic agent.As any other stochastic simulation, these probabilities are then used to determine which events are triggered.As these simulations are not deterministic, several runs could result in slightly di erent final populations.To control this, a seed can be chosen for the random number generator used by TransMob.

Sensitivity of the Microsimulation
. Having introduced TransMob, we now consider di erent factors that can have a significant impact on the forecasted population.This section contains an overview of the e ect of both seed and order on the final simulated population.A preliminary analysis of the influence of these two factors is included in (Dumont et al. a).Our aim is to better investigate if the di erences in the simulation are due to the order and/or the seed.

.
As mentioned previously, the order of the models in TransMob was originally predefined (Huynh et al. ).Considering that the major aim of this paper is to analyse the impact of an order change, we will first focus on testing each feasible order.It should be noted that simulating birth before ageing implies a double generation of babies, since the initial baseline population already includes year old individuals.Therefore, only orders specifying ageing before birth are considered, resulting thus in feasible orders.

.
Our analysis also considers the sensitivity of the microsimulation with respect to the choice of the random seeds.Hence we will perform simulations using di erent seeds for each feasible order , resulting in , experiments simulating years.

.
Figure illustrates the average yearly population and the quantile interval IQ 95 defined by the .and .percentiles (i.e. containing 95% of the simulations).This graph supports the intuition that the di erence between several runs increases over the simulated years.We can see that for the last year IQ 95 = [212, 151; 214, 509] is narrow with the maximum relative deviation between the average and one extremity of the interval being 0.6% of the population.

Impact of the seed .
This subsection aims at determining if some random seeds influence the process in a specific direction.For example, one specific seed could systematically results in an older population.Using a statistical analysis based on the well-known ANOVA method (Chambers et al. ), Dumont et al. ( a) concluded the independence between the seed and the retained variables.We hereby confirm this result, see for instance Figure in Appendix A which illustrates that the seed does not influence the final results.

Impact of the order .
Does the order in which the procedures are applied influence the tendency of the results?The idea is now to determine if some orders result in a larger/younger population.A preliminary analysis indicates that the order significantly influences results (Dumont et al. a).To identify the di erences between the orders, two types of variables need to be introduced: . indicators of the final population: the number of men, women, as well as the number of individuals in each age class (less than years old, between and , and more than years old); . indicators of the order: the position of each process in the chosen order.For example, if we simulated ageing, then death, then marriage, then birth and finally divorce, the indicators of order are : position of ageing = ; position of death = ; ... .For the first set of indicators, the logarithm has been applied for each variable to reduce the impact of exceptionally large populations.By adding these transformations, our results significantly improved.  .
Two well-separated classes can be identified in the results of the simulation.The next step is to identify the discriminant factors for these two classes.We checked that each order over the seeds always lie inside the same class.Thus, only the order categorized simulations.The process to identify patterns for orders belonging to each class was successfully as shown in Figure .Indeed, the position of ageing relatively to the one of death is determinant.When ageing is before death, the simulation ranks into the second class (red) and in the opposite, death before ageing results in the first class (black).Intuitively, this can be explained by the fact that the death probability depends on age.Indeed when ageing, the probability to die increases.  .We can also notice two almost parallel lines for the combination of the total population and individuals less than years old.This means that by staying on the same class, any increase of the final population implies a constant increase in the number of less than years old persons.However, the two classes are well-separated.
This indicates that for two simulations producing the same population size, populations in the red class include more people who are less than years old.Similar conclusions apply when focusing on the number of individuals less than years old per gender, even if this is less prominent.For individuals between and years old, no clear distinction can be made between classes.In summary, the black class contains larger populations with more elderlies and less young people.Note that the average age per class confirms this as shown on Figure in Appendix B.
. In summary, the order significantly influences the final population.Indeed, performing ageing before death results in a smaller and younger population.
. At this level, the position of the other processes in the dynamical evolution loop does not significantly influence results.The following section proposes a method removing these two events from the possible orders, which enables the analysis of the impact of the order of the three remaining processes.

Reduction of the Number of Possible Orders: A Calendar-Based Approach
. Considering that the positions of death and ageing bias the simulation, we decided to propose an alternative method to reduce this impact.By proposing another way to consider ageing and death, the possible remaining orders involve only marriage, divorce and birth, reducing the feasible orders from to .

Method .
To avoid the high influence of the position of death and ageing, our proposition is to assign a specific date for these events for each synthetic individual.This means assigning a date of death and a birthday for ageing.This technique can be easily extended to other processes as dates could be assigned to every event.
. First, the model responsible for death is executed.For each person not dying during this simulated year, the remaining models stay unchanged.However, each individual dying in this year is assigned a date of death and he/she will remain in the population, possibly performing other actions if they arise before his/her death.If an event concerning this agent is planned to happen, we check that this arises before the death.For this, a date is randomly chosen for this event and it is considered only if prior to death.
. Secondly, a date of birth is also assigned to each individual.Figure illustrates the changes induced by adding this birthday.Colours represent the probabilities of occurrence of a specific event depending on age.We can see that the standard approach with ageing at the end (or at the beginning) considers the age at 1 st of January (or 31 th December) for the whole civil year, whereas the calendar-based approach adapts the probabilities for each individual at their birthday.Probabilities are thus adapted at the same moment for everybody with the standard approach while the calendar-based approach changes probabilities at a di erent moment for each person, depending on their birthday..

.
It can be noted that the computational cost is di erent from adopting a time step of one day.Indeed, a daily time step implies considering each process for each individual for each simulated day, whereas our approach still considers each process only once a year for each individual.
. The proposed methodology is possible only if we can establish the probability of the event occurring in the year depending on the age of the agent and its birthday.
. The naive approach consists in considering that the probability of an event occurring during a civil year can be refined using a convex combination of the probability of the event to happen at the present age and at the age + .For a person of age A at the beginning of the year and BD days from st of January to its birthday, the probability of an event E occurring during the year can be calculated by P (E) = P (E only before BD or E only a er BD) = P (E only before BD ) + P (E only a er BD) since "E only before BD" and "E only a er BD" are disjoints.For the sake of simplicity, the naive approach is to approximate "E only before/a er BD" with "E before/a er BD" without checking that the event is not happening in the other period of the year; this assumption gets exact in the limit of events arising only once per year.By making the assumption that the distribution of the event occurring each day of the year is known, P (E on day i ) .
In this example, we now make the simplifying but unrealistic assumption that E has the same likelihood to occur any day of the year.Thus, we have P (E on day i) = P (E|A) * 1 365 with P (E|A) the probability of the event for an individual during the whole year while he is of age A. Finally, the expression of the probability of an event during a civil year is given by: Intuitively, this splits the year into two di erent parts separated by the birthday, and each one having its own probability for E which depends on the age.It should be noted that this imposes to assume that the probability of the event is uniformly distributed through each day of the year once we fix the age of the agent.This could be improved by approximating the probability of each day using, for example, a spline or a regression.With this probability definition, ageing needs to be at the end of the process.
. It can also be noted that even if we assumed a uniform distribution for the dates, we could easily use any kind of distributions for each model (e.g. an empirical distribution if the data is available).
. When considering a uniform distribution of the dates for an event that can arise only once per year (each day has same probability), this can be seen as a sequence of Bernoulli experiments for each day that succeeds if the event happens.The formal analytical determination of this formula gives very similar results to the naive approach developed in this section.The detailed analysis of the formal development is in Appendix D.
. A schematic representation of the calendar-based approach is given in Figure . .Since we change the procedures for ageing and death, the only remaining possible events are marriage, birth and divorce, leaving only possible orders. .
The proposed approach of defining dates for events to avoid problems with possible orders is not limited to ageing and death in population evolution.It can be applied in all fields using these types of agent-based modelling and dates can be generated for each model.We focus here on dates for ageing and death to analyse the impact on the final population since we established above that these two processes strongly influence the size and age of the final population.In our case, divorces are proposed only to couples and marriage only to single individuals.Thus these models concern only a part of the population.Even if performing one a er the other can slightly modify the set of individuals going through the other model, their impact is limited.

Analysis of the new orders
.
Similar to the analysis presented in Section ., a classification of the indicators of the final population is also performed using the new improved method.Figure in Appendix C contains the elbow method to determine the number of classes showing an evident elbow at two classes.These two classes are reported on the PCA in Figure .The separation of the points is less obvious than for all previous orders.Indeed, no empty space divides the two set of dots.This seems to indicate that the final populations are more homogeneous than previously.However, it is worth analysing the influence of these classes on the final population.
Figure : PCA for the classification of the method with the dates.Each dot represents one simulation.The separation between the two classes is less evident using the calendar-based approach.
. Figure indicates less evident di erences between the classes than for the standard method.Nevertheless, the first class (black) has a smaller population composed of less individuals under years old.A slightly linear relation stands between the total population and the number of women, men and individuals less than years old.This means that the larger the total population is, the higher these indicators also are.Yet, the number of individuals older than years old does not follow this linear tendency.
. Identifying the patterns in the same classified orders is the next step.A decision tree highlighted the importance of the position of marriage regarding to the birth.Nevertheless, this pattern is less determinant than the one illustrated in Figure .
. The relation between marriage and birth is very important in the model, since only married women can give birth (see (Huynh et al. ) for more details on models and (Huynh et al. ) for the definition of "married" women, which also includes de facto relationships).

.
Table presents the number of simulations in each class for each order in more details.One can appreciate from the latter table that any given order can belong to both classes, even though there a clear tendency for one of the class.This indicates that the seed has now a larger impact than previously and supports the observation about the homogeneity of the final populations.

Comparison
. In this section we compare the performance of the calendar-based approach against the classical one.The main purpose of the new method is to reduce the variability of the final populations.For the comparison, the final population indicators a er years are computed for random seeds and for all feasible orders with and without the introduction of dates of birth and death.Table : Classification of orders with the addition of dates .
The homoscedasticity of the total population indicator over the two groups with and without dates is tested.Note that the group with calendar-based includes simulations ( orders and seeds), whereas the other group includes simulations ( orders and seeds).Due to the imbalance and small size of groups, a careful choice of the method to test homoscedasticity is required.(Parra-Frutos ) analysed di erent statistical tests and concluded that in unbalanced and small samples, the best ways to test homogeneity of variance includes the James test, the Welch test and the Alexander and Govern test.(Dag et al.
) incorporated these tests in a package for the R programming language (R Core Team ).The three tests allow to conclude the non homogeneity of the variances with a confidence level of . .The standard deviations are .for the classical simulations and for the simulations using dates.This indicates that the proposed method reduces the variance between runs.
.  .Results show that the calendar-based approach produces final populations similar to the ones in the first class in terms of total population.Performing again the tests suggested by Parra-Frutos ( ) to examine the homogeneity of variance, we found that for the three tests that variances inside the first class and the simulations using dates are not significantly di erent.

.
Here, it is interesting to compare the distributions.As the assumptions for the classical ANOVA test are not met, we use the non-parametric Kruskal-Wallis test.The p-value of .indicates that no distribution stochastically dominates the other.This confirmed that the relative di erence between the average of the two groups is only 0.03%.
. Considering that the size of the populations produced by the calendar-based approach and the first class of standard methods were statistically similar, we then looked at the structure of the di erent populations.Indeed, the population size could be equal while their age structure di ers.This is illustrated in .This explains why the calendar-based approach and the first class of standard methods generate populations of similar sizes.Therefore, it is possible to argue that the proposed calendar-based methodology is the most appropriate approach.

Conclusion and Discussion
. A er investigating all the feasible orderings of models and presenting a promising calendar-based approach, we believe that our work made two contributions to the field of agent-based models for demographic evolutions that could be extended to other agent-based models.
. First, this work showed the importance of the order of the models in agent-based modelling, a er having checked the stability against random seeds.For TransMob, including five major processes, i.e., ageing, death, birth, marriage and divorce, we highlighted significant di erences in the results of the simulation if death is performed a er or before ageing.
. Secondly, we proposed to assign dates to key events and redefine the probabilities depending on these dates.We found that this method decreased the variability of the simulations.Furthermore, this is not restricted to the evolution of synthetic populations.Indeed, for each process interfering with probabilities of other model, we can assign a date for this event (either from a uniform distribution amongst the days of the year or from a defined distribution if you for example have the prevalence of birth per day in the year).Thanks to this date, probabilities of dependent events can be adapted with a weighted linear combination of the probabilities before and a er the determinant event.
. The code associated to the calendar-based approach is available at the following address: https://github.com/smart-facility/calendar-based-microsim.
. The proposed method allows simultaneously to avoid any bias induced by choosing a specific order, reduce the variability of the results and approximate a daily time step with a reduced computational cost.
. This work allows us to propose certain guidelines for future agent-based models (with a discrete time step).Indeed, for one iteration of the evolution loop, we propose the following flow: . Processes implying to remove agents are evaluated to identify the agents that will disappear.However, these agents are not removed directly.Instead, a date of removal of the agent is determined.
. Processes changing agent characteristics influencing the probabilities of other processes are executed and dated.For individuals disappearing this iteration, we check if each event is before or a er the removal date.
. Remaining processes are launched with updated probabilities.For individuals disappearing this iteration, we check if each event is before or a er the removal date.
. Agents disappearing during this iteration are removed.
. It should be noted that this is a general proposition limiting the influence of the order.Unfortunately, some questions are still unsolved.For instance, if processes are interdependent or if we have several processes in third step, several orders are still possible.
. Obviously, the proposed approach has certain limitations.For instance, there is need for additional data if one wants to draw dates from a realistic distribution through the year and events are unpredictable.The practitioner should also be aware that all these agent-based simulations are always highly dependent to the type and quality of input data (garbage in -garbage out process).Finally, it should also be noted that this does not necessarily allow to extend the time horizon of good predictions.
. In future works, a more generalised analysis could be performed, considering additional (fictive) models with probabilities generated randomly with respect to constrains such as dependencies with agent characteristics and/or interdependencies with other modules and/or orders between modules.Any generalisation to agentbased models is a big challenge, since each case is di erent and generating a population kangaroos in Australia cannot be similar to simulating particles in an accelerator or examining demographic evolution.Isolating the probability for a specific day P (E i | A), we obtain :    .The di erence is noticeable for the birthday on the middle of the year, but the two methods stay really close to each other.For this reason, the naive approach can be considered, since it is easier and gives similar results.

P (E
Figure : Probabilities with the naive and formal approach when probability at first age is .and at age + , . . This corresponds to the probabilities of having a first child at age and respectively (le panel).And same probabilities when probability at first age is .and at age + , . to illustrate a more remoted (right panel).

Figure :
Figure : Transmob: Flowchart of the evolutionary models.

Figure :
Figure : Le panel: Average and range IQ 95 for simulated years, random seeds and the possible orders.We can see that IQ 95 is becoming larger through the years, even if staying relatively small.For instance, for the last year IQ 95 is[212, 151; 214, 509].Right panel: Evolution of the range of IQ 95 .The observation made in the le panel is confirmed as one can see that the range is increasing over the years.

.
To quantify these di erences, a classification is applied on the indicators of the final population.(Dumont et al. a) show two distinct classes.Hence, a k-means classification (Hartigan & Wong ) with k = 2 followed by a principal component analysis (Wold et al. ), or PCA for short, is executed to visualize the di erent classes.The graphs in and dimensions in Figure , confirm the two very distinguishable sets of points.Note that the three first components computed by the PCA already explain , % of the total variance.

Figure
Figure : PCA to illustrate the classification of the simulation for seeds and orders.Each dot represents one simulation.Two clearly separated classes can be identified.

Figure :
Figure : Patterns resulting from the relative order of death and ageing and the associated division into classes

Figure :
Figure : Graph of all pairs of final population indicators per class.Each dot represents a simulation.

Figure :
Figure : Illustration of the addition of a birthday.Each colour corresponds to an age conditioning the probabilities used by a given process.

Figure :
Figure : Flowchart of the new method.

Figure : First
Figure : Graph of all final population indicators per class for dates simulations.Each dot represents one simulation.First Model Second Model Third Model #Simulations in Class #Simulations in Class Marriage Divorce Birth Marriage Birth Divorce Divorce Marriage Birth Divorce Birth Marriage Birth Divorce Marriage Birth Marriage Divorce Figure : Uncertainty analysis of standard and proposed models.Evolution of the total population for the calendar-based and classical method.It can be observed that the calendar-based approach produces IQ 95 with smaller ranges than the standard method.The calendar-based approach tends to generate slightly larger populations.

Figure :
Figure : Uncertainty analysis of classified standard and proposed models.The calendar-based approach results in populations similar to the first class of standard simulations (ageing before death).
Figures and where death and birth evolutions are displayed.Unlike expectations induced by the Figure , Figure shows that the calendar-based approach produces a number of deaths in between the ones generated by the two classes of standard methods.This indicates that the proposed approach actually produces populations that are di erent from the ones belonging to the first class.The evolution of the numbers of births in Figure shows a di erent picture.Indeed, the calendar-based approach generates a larger number of births compared to the others two methods.

Figure :
Figure : Evolution of the number of deaths per method.

Figure :
Figure : Evolution of the number of births per method.

( 1 −
P (E i | A)) 365 = 1 − P (E | A) 1 − P (E i | A) = 365 1 − P (E | A) P (E i | A) = 1 − 365 1 − P (E | A)As the P (E j | A + 1) can be obtained in a similar way, we now have all the elements to be able to generate the probabilities P (E | A, BD) required by the model.It should be noted that results mimic the ones of the naive method as we can see in Figures and .

Figure
Figure : Uncertainty analysis of standard and proposed models using exact probabilities.Evolution of the total population for the calendar-based and classical method.It can be observed that the calendar-based approach produces IQ 95 with smaller ranges than the standard method.The calendar-based approach tends to generate larger populations.

:
Figure : Uncertainty analysis of standard and proposed models using exact probabilities.Evolution of the total population for the calendar-based and classical method.It can be observed that the calendar-based approach produces IQ 95 with smaller ranges than the standard method.The calendar-based approach tends to generate larger populations.

Figure :
Figure : Uncertainty analysis of standard and proposed models using exact probabilities.The calendar-based approach results in populations similar to the first class of standard simulations (ageing before death.)