Application Independent Heuristic Data Merging Methodology for Sample-Free Agent Population Synthesis

This work proposes a novel application independent framework for specifying heuristics and a household structure construction process for sample-free population synthesis. The framework decouples the heuristics from the algorithm by defining a set of generic constructs to specify heuristics on relationships and household structures. The algorithm uses Iterative Proportional Fitting, Monte Carlo sampling and combinatorial optimisation to synthesise the population. The decoupled nature of the system allows it to be used in different applications relatively easily by changing the heuristics. Using two case studies, we demonstrate that this is a robust technique capable of producing synthetic agent populations highly consistent with the input data distributions. Apart from contributing to synthetic population reconstruction, this work will form one of the building blocks for integrating independently developed models to build complex new agent based models.


Introduction
Agent Based Simulations (ABSs) have evolved from simple early applications such as Schelling's segregation models (Schelling) to very complex decision support systems like MATSim (Raney & Nagel) that model thousands of intelligent agents. The strength of ABSs is their ability to model phenomena that emerge through agent interactions, which cannot be represented in alternative methods like mathematical modelling. Advances in computational power and modelling techniques have enabled constructing large scale ABSs from the ground up and also by combining existing modules (Dahmann et al.; Singh & Padgham).
In any agent based model, especially in social simulation, obtaining a synthetic population that accurately represents the underlying real population is very important. The synthetic population has to conform to the observed person and household distributions and also have realistic household structures and person relationships. The research in synthetic population reconstruction can be grouped into methods that use disaggregated sample data (microdata) (Beckman et al.; Williamson et al.) and methods that use application specific heuristics (Barthelemy & Toint). The sample data based approaches are unsuitable for constructing synthetic populations in many applications, due to privacy related restrictions and data unavailability. The current heuristic based approaches are also restricted in general applicability due to the tight coupling between the population specific heuristics and the algorithm. This makes them very hard to extend with new properties even for the same population and, in modular agent based models, limits the ability to extend an existing composition of models by adding new ones.
The work presented in this paper is part of a project that models the evolution of housing choices in Melbourne, Australia with respect to the changing household family structures and transport needs. The agent based model for this project can be developed in two ways: developing a model ground up in the traditional fashion or combining existing agent based models as modules to build a new model. The latter is particularly interesting given the availability of previously developed models, for instance, housing market models, such as Ettema ( ), .
The above methods assume the availability of a disaggregated data sample, which is not the case in many applications either because of privacy concerns or data unavailability. The latter is the case with merging agent populations from different simulation models because there is no common agent sample that represents all component simulations (Wickramasinghe et al.). Alternative methods proposed in the literature circumvent the need for sample data by employing heuristics to infer person relationships and household structures (Huynh et al.). Additionally, Ye et al. show that populations can be synthesised without a sample given that joint marginal distributions with sufficiently overlapping characteristics (attributes) are available as inputs at the same aggregation level (either person or household level).
Theoretically, a heuristic approach can be devised to generate all the household configurations, by taking combinations of person and household types, and use samples from them to generate a synthetic population considering person and household level marginal distributions. In practice, however, as the number of person and household types increases, the number of household combinations grows exponentially, making the approach computationally infeasible. Our initial experiments of this approach with person types and household types (Case study ) failed to complete even after hours on a super computer with TB RAM and × . GHz cores. Gargiulo et al. have also reached similar conclusions in their experiments for Auvergne, France.
Intelligent heuristic search techniques generally start with a pool of empty households adhering to the household level marginal distribution and a pool of persons adhering to the person level marginal distribution. Then household instances and suitable persons for them are sampled from the corresponding pools without replacement according to population heuristics describing person relationships and household structures (Barthelemy & Toint; Huynh et al.). Another method is forming households by selecting persons probabilistically from marginal distributions based on heuristics derived from household compositions (Gargiulo et al.). A major limitation of these population construction procedures is that they are interleaved with the heuristics, which are application specific and thus not easily transferable to a different application. The solution then is to develop a new population synthesis process for every new application. This becomes even more cumbersome when integrating different ABMs. We address this problem by proposing a generic heuristic framework for synthetic population construction.
The remainder of the paper is organised as follows. The next section formally describes the proposed framework and the population construction procedure. Then we discuss two case studies and their results. The first of the two demonstrates merging agent populations from two agent based simulations from the literature. The second case study discusses constructing merged populations using the Australian census data and compares the result with a population generated using the IPU approach. The paper concludes with a discussion on the proposed approach.

Methodology
In this work, we use IPF to merge the input data distributions and propose using an abstract seed of 1s (indicating cells that can have agents) and 0s (indicating cells that cannot have agents) instead of an actual disaggregated data sample. This merged distribution is used to obtain a conditional probability distribution, which is used for constructing group structures with Monte Carlo sampling. The constructs specified in the framework ensure that group structures are legal, and all the relationships are represented according to the heuristics. The outcome of this process is an initial estimate of the population. The estimate is improved using a hill climbing approach to produce the final population. Due to the dependency on IPF, the proposed approach expects marginal distributions to be converted to the same aggregation level. Here we propose getting the number of persons in different household types by multiplying the number of households by the household size. The proposed heuristics specification framework, however, does not have this limitation.
A significant part of the proposed framework is framing the heuristics based population construction problem in a manner that allows specifying a series of generic formalisms capable of recording heuristics from different applications. The benefits of this approach are twofold: (a) it provides a unified interface for recording heuristics and (b) it allows designing a generic population construction algorithm that depends on the specified formalisms instead of the particular application heuristics. In this section, we first discuss the formalisms of the framework and then present the group construction process.
The data we obtain from different sources are binned distributions, each referring to some aspect of the synthetic population, for instance, the distribution of the number of persons by gender and the distribution of the number of persons by household size. We call them characteristics of the population, and there can be any number of characteristics in a population.
Definition. Given a synthetic population has $n$ different characteristics, the set of all the characteristics ($C$) is represented as $C = \{C_1, C_2, \dots, C_i, \dots, C_n\}$, where $C_i$ is the $i$-th characteristic of the population, $1 \le i \le n$ and $i, n \in \mathbb{Z}^+$. $\mathbb{Z}^+$ is the set of positive integers.
Each characteristic consists of multiple categories, or bins. For example, male and female are the two categories of the gender distribution and 1-5, 6-10, ... are the categories of the age distribution. There can also be joint distributions, for example, a distribution of the number of persons by age and gender. Males aged 12-20 and females aged 21-25 are two example categories from such a joint distribution.
Definition. Given $C_i$ represents the $i$-th characteristic, the relationship between $C_i$ and its categories is represented as $C_i = \{c^i_1, c^i_2, \dots, c^i_{|C_i|}\}$, where $C_i \in C$ and $|C_i|$ is the total number of categories of characteristic $C_i$.

The framework divides characteristics into agent level characteristics and group level characteristics. The agent level characteristics capture concepts represented in agent entities, such as the distribution of the number of persons by gender, where gender is a concept represented in a person (an agent entity). Group level characteristics capture concepts represented in group structures, such as the distribution of the number of persons by household size, where household size is a concept represented in a household (a group structure). Note that this division depends on the concept captured in the characteristic, or distribution, not on the counting unit of the distribution. For example, the distribution of the number of persons by household size is a group level characteristic even though the counting unit is the number of persons. The set of agent level characteristics and the set of group level characteristics are defined as mutually exclusive sets.
The proposed framework defines an agent type as a tuple of categories each describing a property of an agent's state that relates to one of the agent level characteristics. There is a category in an agent type for each agent level characteristic represented in the population. An example of an agent type is (Male, Married, age 26-30). Male is a category from gender distribution characteristic, Married is a category from marital status characteristic and age 26-30 is from age characteristic. The set of all agent types in the population can be obtained by forming combinations by taking one category from each agent level characteristic.
Definition. The set of all agent types ($A$) is given by the Cartesian product of the categories of all the agent level characteristics in the population: $A = C_1 \times C_2 \times \dots \times C_d$, given $\{C_1, C_2, \dots, C_d\}$ is the set of agent level characteristics with $d < n$ and $d, n \in \mathbb{Z}^+$.
In this document we use $a$ to represent an agent type in a succinct format, i.e. $a \in A$ and $a = (c^1, \dots, c^d)$.
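As an illustration of the definition above, the set of agent types can be enumerated directly as a Cartesian product. The following is a minimal Python sketch; the characteristic names and categories are hypothetical and are not taken from the case studies.

```python
from itertools import product

# Hypothetical agent level characteristics and their categories.
agent_characteristics = {
    "gender": ["Male", "Female"],
    "marital_status": ["Married", "Single"],
    "age": ["18-25", "26-30"],
}


def agent_types(characteristics):
    """Return every agent type as a tuple holding one category per characteristic."""
    return list(product(*characteristics.values()))


A = agent_types(agent_characteristics)
print(len(A))  # 2 * 2 * 2 = 8 agent types
print(A[0])    # ('Male', 'Married', '18-25')
```

The number of agent types is the product of the category counts, which is why the set grows quickly as characteristics are added.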

Links
A link is a representation of a relationship between two agent entities or persons. Understanding the links that an agent can form with other agents is important for reconstructing group structures. In this framework we propose a set of structured constructs for recording heuristics on links that agents can form depending on the agent's type. The idea here is enabling the formulation of an algorithm that forms groups based on constructs proposed in the framework rather than application specific heuristics. This allows using the proposed system on different applications only by changing the heuristics, whereas current heuristic approaches in the literature would require implementing a system from scratch.
We define a link as a labelled and directed edge between two agent nodes (as in graph theory). For example, the relationship between a mother ($w_1$) and a child ($o_1$) is represented as "$w_1$ is the mother of $o_1$" from the mother's point of view. The same conceptual relationship from the child's point of view can be described as "$o_1$ is the child of $w_1$". Here parent of and child of are the two links. In terms of terminology, the agent that forms the link is called the reference agent and the agent that the link is formed with is called the target agent. Formally,

Definition.
A link is an ordered triple of a reference agent instance ($\alpha_r$), a link ($\lambda$) and a target agent instance ($\alpha_t$): $(\alpha_r, \lambda, \alpha_t)$.

There are different types of links with different properties in a population. Here we are interested in the number of links, of a given type, that an agent can form. For example, a person can have up to two child of relationships, one with a mother and the other with a father. However, a young child is assumed to have at least one child of relationship with a mother or a father. In these situations, the framework proposes defining different link types for all the variations. For example, we can define two link types: the AdultChildOf relationship, where a minimum of zero and a maximum of two relationship instances are expected, and the YoungChildOf relationship, where a minimum of one and a maximum of two relationship instances are expected. The properties of a given link type are its name, the minimum number of link instances that an agent must have and the maximum number of link instances that an agent can have.

Definition.
A link type is an ordered triple of a link name (link_name), a minimum number of links (min) that an agent must form from the link type and a maximum number of links (max) that an agent can form from the link type, represented in the following format.
$l = (\mathit{link\_name}, \mathit{min}, \mathit{max})$
The set of all link types in the population is represented by $L$ ($l \in L$).

Knowing the links that an agent can form with other agents is a major part of reconstructing realistic group structures. Traditional approaches predominantly depend on group structures in sample data to generate realistic groups in the synthesised population (Ye et al.; Namazi-Rad et al.). However, when sample data is not available we have to rely on population heuristics (Gargiulo et al.; Huynh et al.). The central idea in the proposed framework is constructing groups based on heuristics on the types of relationships that an agent of a given type can form.
A link rule is a construct proposed in the framework to record relationship heuristics between agent entities in a generic manner. A link rule consists of a reference agent type ($a_r$), a link type that agents of the reference agent type can form and a set of target agent types ($B$), from which an agent is selected when forming a link of the given link type. Multiple link types of the same reference agent type are represented by specifying multiple link rules. For instance, we may have to specify three link rules for the (married, male, age 26-30) agent type, for the married to, parent of and child of relationships. If an agent type does not form some relationships, they are left undefined in the link rules. For example, the married to relationship of a single person is undefined. The proposed algorithm does not form links that are not defined in the link rules.
Definition. Link rules ($R_L$) applied on a population are a set of ordered triples, each with a reference agent type ($a_r$), a link type ($l$) and a set of target agent types ($B$): $(a_r, l, B) \in R_L$, with $a_r \in A$, $l \in L$ and $B \subseteq A$.
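One possible way to record link types and link rules as data is sketched below in Python. The class names, field names, age categories and the (1, 1) bounds on MarriedTo are assumptions made for illustration; only the AdultChildOf bounds come from the example given earlier in the text.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

AgentType = Tuple[str, ...]


@dataclass(frozen=True)
class LinkType:
    """l = (link_name, min, max)."""
    name: str
    min_links: int
    max_links: int


@dataclass(frozen=True)
class LinkRule:
    """(a_r, l, B): reference agent type, link type, set of target agent types."""
    reference: AgentType
    link_type: LinkType
    targets: FrozenSet[AgentType]


MARRIED_TO = LinkType("MarriedTo", 1, 1)          # bounds assumed for illustration
ADULT_CHILD_OF = LinkType("AdultChildOf", 0, 2)   # bounds from the text's example

husband: AgentType = ("Married", "Male", "26-30")
link_rules: List[LinkRule] = [
    # Hypothetical heuristic: partner in the same age category or the one below.
    LinkRule(husband, MARRIED_TO, frozenset({
        ("Married", "Female", "26-30"),
        ("Married", "Female", "21-25"),
    })),
    LinkRule(husband, ADULT_CHILD_OF, frozenset({
        ("Married", "Male", "51-55"),
        ("Married", "Female", "51-55"),
    })),
]


def rules_for(agent_type: AgentType, rules: List[LinkRule]) -> List[LinkRule]:
    """Return the link rules defined for a reference agent type."""
    return [rule for rule in rules if rule.reference == agent_type]


def required_link_names(agent_type: AgentType, rules: List[LinkRule]) -> List[str]:
    """Names of link types the agent type must form at least once."""
    return [r.link_type.name for r in rules_for(agent_type, rules)
            if r.link_type.min_links > 0]
```

An agent type with no entry in `link_rules` simply has no links formed for it, which mirrors the treatment of undefined relationships described above.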
The link rules described here are flexible enough to describe links between any two entities. They can be human relationships, the adjacency of plant types in a forest or any other type of relationship. For example, given that the partner of a married male must be in the same age category or one below, the marital link rule for (

Group
A group is a coherent entity formed with a subset of agent instances in the population filtered based on agent properties and links between agents. In the simplest form, groups can be constructed by selecting agents based on some property, for example, the set of all male agents. A relatively more complex group is a couple family household with two children, which is constructed based on both agent properties and links between the pairs. The group can have a male adult and a female adult who are married and two female children, whose parents are the two adults.

Definition.
A group ($\gamma$) is an ordered tuple of a set of member agent instances ($\Omega$) and a set of link triples ($\Lambda$) describing all the links between member pairs: $\gamma = (\Omega, \Lambda)$. The links between member agent instances are represented as $(\alpha_r, \lambda, \alpha_t) \in \Lambda$, where $\lambda$ is the link formed by agent instance $\alpha_r$ with agent instance $\alpha_t$ and $\alpha_r, \alpha_t \in \Omega$.

Group type
Group type is a categorisation of groups based on agent composition, agent types and/or relationships among agents in a group. An example of a group type is eight member households, which describes households of eight persons. Another example is couple family household with children, which categorises households based on agent types and agent relationships. In the proposed framework a group type is represented as a tuple of categories selected from group level characteristics. The set of all group types in a population can be obtained by forming combinations by taking one category from each group level characteristic.
Definition. The set of all group types ($G$) in the population is given by the Cartesian product of all the group level characteristics: $G = C_{d+1} \times C_{d+2} \times \dots \times C_n$, given $\{C_{d+1}, C_{d+2}, \dots, C_n\}$ is the set of group level characteristics with $d < n$ and $d, n \in \mathbb{Z}^+$.
As agent level characteristics and group level characteristics are defined as mutually exclusive sets, the set of agent types and the set of group types in a population are also mutually exclusive.
A group rule determines the group type of a group based on the composition of agents and agent links. One way of achieving this is defining a group template that represents the expected agents and the links composition of the group type, and matching a given group instance to the template. If the group's composition matches the template, we can determine that the group's type is the one represented by the template. However, this approach becomes expensive because in most real-world populations there are multiple group templates for a given group type. For instance, though the basic expected composition of a couple family with children group type is one child living with two parents, it is normal to have more than one child in a family, which requires defining multiple templates depending on the number of children in a family. The number of templates increases even more when considering the different agent types that can be in a family. For instance, as age categories of parents and children vary across different families there need to be templates for all the different combinations. Defining all the templates in a population is extremely difficult because the number of templates grows exponentially as more categories and characteristics are introduced to the population.
Instead of defining complete group templates with all the categories of agent and group level characteristics in the population, we propose determining the group type based on important features that only relate to categories of group level characteristics. A feature is a unique instantiated combination of agents and links that represents a category of a group level characteristic in a group instance. For example, if we take couple with children family as a category of the distribution of family compositions characteristic, the feature representing it is two Married persons with marital relationships between them and one Child with parental relationships with the two parents. This can also be extended to determine the type of a group by combining multiple categories. For example, the 3 person, couple family household group type is identified based on two features because there are two categories in the group type. The first feature is having three members in the group and the second is two of them being married persons (the two belong to the Married agent type and there is a marital relationship between them). The mapping between a feature and a group category can be represented with the bijective heuristic function below. This specifies that there is a heuristic to map a category of a group level characteristic to a unique agents and links composition in a group, and the same heuristic can also be used to map an agents and links composition to the corresponding category.
Definition. There is a bijective heuristic function that maps each group level category to a feature: $h_k : c^k \leftrightarrow f_k$, where $h_k$ is a heuristic function, $f_k$ is the feature of category $c^k$ ($c^k \in C_k$) and $C_k \in \{C_{d+1}, C_{d+2}, \dots, C_n\}$.
Based on the definition of a group type as a tuple of categories, each selected from a group level characteristic, and the above definition of features, we can derive that for each group type there is a unique tuple of features. Each of these mappings is called a group rule. A population consists of multiple such group rules, mapping each group type to a unique feature tuple and vice versa.
Definition. Group rules ($R_G$) in a population are represented as a bijective function that maps the set of group types ($G$) to the set of feature tuples ($F$) and vice versa: $R_G : G \leftrightarrow F$.
We further define a function (Q) that uses the above defined group rules to determine the group type of a given group instance. The function identifies the features in the group and returns the corresponding group type. If the group has none of the defined features, its group type is undefined.
Definition. The function ($Q$) takes a group ($\gamma$) and the group rules ($R_G$) as input and returns the group type ($g$) of the input group. $Q : \gamma, R_G \to g$
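A minimal Python sketch of how $Q$ could be evaluated is given below. It assumes that each feature is encoded as a predicate over a group and that the group rules are stored as one category-to-feature mapping per group level characteristic. The categories, the predicates and the "other family" category are hypothetical. In this sketch the group type is treated as undefined (None) whenever any characteristic has no uniquely matching feature.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

Agent = Tuple[str, Tuple[str, ...]]      # (agent id, agent type)
Link = Tuple[str, str, str]              # (reference id, link name, target id)
Group = Tuple[List[Agent], Set[Link]]    # (members, links)
Feature = Callable[[Group], bool]


def has_size(n: int) -> Feature:
    """Feature for a household size category: the group has exactly n members."""
    return lambda group: len(group[0]) == n


def has_married_couple(group: Group) -> bool:
    """Feature for a couple family: at least one MarriedTo link exists."""
    return any(name == "MarriedTo" for _, name, _ in group[1])


def has_no_married_couple(group: Group) -> bool:
    return not has_married_couple(group)


# One mapping (category -> feature) per group level characteristic.
GROUP_FEATURES: List[Dict[str, Feature]] = [
    {"2 persons": has_size(2), "3 persons": has_size(3)},
    {"couple family": has_married_couple, "other family": has_no_married_couple},
]


def Q(group: Group, group_features: List[Dict[str, Feature]]) -> Optional[Tuple[str, ...]]:
    """Return the group type of a group, or None if it is undefined."""
    group_type = []
    for characteristic in group_features:
        matched = [cat for cat, feature in characteristic.items() if feature(group)]
        if len(matched) != 1:
            return None
        group_type.append(matched[0])
    return tuple(group_type)


family: Group = (
    [("m1", ("Married", "Male")), ("w2", ("Married", "Female")), ("o1", ("Single", "Female"))],
    {("m1", "MarriedTo", "w2"), ("w2", "MarriedTo", "m1")},
)
print(Q(family, GROUP_FEATURES))  # ('3 persons', 'couple family')
```

Because the features depend only on group level categories, no template is needed for each combination of member agent types.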

Link Conditions
In this section, we discuss dependent links that may need to be formed as a result of forming another link. For example, in a human population, a marital relationship between two persons ($m_1$ and $w_2$) can be formed by marking that $m_1$ is married to $w_2$ based on link rules. When doing that we also have to mark that $w_2$ is married to $m_1$ to make the state of the family complete. Forming this second married to link is a condition of the first married to link. Similar link conditions can be observed when forming housing complexes by grouping housing units, for example, marking two housing units adjacent to each other when adding them to a housing complex.

There can be even more complex link conditions. Consider that we are probing for other relationships that agent $m_1$ can form in the above example and we have determined that $m_1$ can have a child ($o_2$). Apart from the parent of and child of relationships between agents $m_1$ and $o_2$, $o_2$ also has to form a child of relationship with $w_2$ to maintain consistency of the family structure. Here, forming the latter link is conditioned by the existing ($m_1$, married to, $w_2$) link.
Link conditions provide a mechanism to maintain the consistency of a grouped population by forming the dependent links. In the proposed framework, we capture link conditions as a series of user defined transformation functions applied on a group. The structure of a link condition transformation function is as follows.

Definition.
A link condition is a user defined heuristic transformation function ($\Phi$) that transforms the group's state by forming dependent links in response to a newly formed link between two agents ($\alpha_r$ and $\alpha_t$) in the group ($\gamma$). $\Phi : \gamma, (\alpha_r, \lambda, \alpha_t) \to \gamma'$, where $\gamma = (\Omega, \Lambda)$, $\Lambda$ represents the existing links of member agents ($\Omega$) in the group ($\gamma$) and the triple $(\alpha_r, \lambda, \alpha_t)$ represents the new link ($\lambda$) formed by $\alpha_r$ with new agent $\alpha_t$. $\gamma'$ is the transformed state of group $\gamma$ after forming dependent links between its agent pairs.
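The two examples discussed in this section can be sketched as transformation functions in Python. The function names, the link names (MarriedTo, ParentOf, ChildOf) and the `form_link` helper are hypothetical; the sketch only illustrates the shape of a link condition, under the assumption that a group is stored as a list of agent identifiers and a set of link triples.

```python
from typing import Callable, List, Set, Tuple

Link = Tuple[str, str, str]             # (reference agent, link name, target agent)
Group = Tuple[List[str], Set[Link]]     # (member agents, links)
LinkCondition = Callable[[Group, Link], Group]


def married_to_condition(group: Group, new_link: Link) -> Group:
    """Mirror a MarriedTo link so that both partners record the marriage."""
    members, links = group
    ref, name, target = new_link
    links = set(links)
    if name == "MarriedTo":
        links.add((target, "MarriedTo", ref))
    return (list(members), links)


def parent_of_condition(group: Group, new_link: Link) -> Group:
    """On a ParentOf link, make the child a ChildOf the parent and the parent's spouse."""
    members, links = group
    ref, name, target = new_link
    links = set(links)
    if name == "ParentOf":
        links.add((target, "ChildOf", ref))
        for r, n, spouse in list(links):
            if r == ref and n == "MarriedTo":
                links.add((target, "ChildOf", spouse))
    return (list(members), links)


def form_link(group: Group, new_link: Link, conditions: List[LinkCondition]) -> Group:
    """Add the new link (and its target agent), then apply every link condition."""
    members, links = group
    _, _, target = new_link
    members = list(members)
    if target not in members:
        members.append(target)
    group = (members, set(links) | {new_link})
    for condition in conditions:
        group = condition(group, new_link)
    return group


conditions = [married_to_condition, parent_of_condition]
household: Group = (["m1"], set())
household = form_link(household, ("m1", "MarriedTo", "w2"), conditions)
household = form_link(household, ("m1", "ParentOf", "o2"), conditions)
```

After the two calls, the household holds the reciprocal marriage link and the child of links from `o2` to both `m1` and `w2`.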

Constructing the population
The proposed population construction framework consists of two phases. The first phase is merging distributions extracted from data sources using IPF, and the second is constructing groups based on the merged distribution. The figure provides an overview of the proposed methodology. The first part of the figure is the standard IPF distribution merging process. The IPF procedure takes two marginal distributions converted to the same aggregation level, annotated with $C_1$ and $C_2$, and a seed matrix as inputs and generates a joint distribution. Although only two distributions are shown here, a multidimensional implementation of IPF can be used if there are more. The grouping process takes a joint distribution from IPF and a set of

Obtaining the marginal distributions
The first step is to identify the important population characteristics that need to be included in the merged population. For example, assume that three distributions of the number of persons by age, sex and relationship status from a data source on agent level population characteristics and a distribution of the number of persons by household size from a data source on household level characteristics are given. The objective is to construct a population by merging these distributions in a manner that preserves the structural properties of all the four input distributions. If multiple characteristics are chosen from the same data source it is recommended to query them as a joint distribution to minimise errors introduced during processing. If the distributions are frequencies of agents we convert them to probability distributions in preparation to run IPF on them. The reason for this is converting all the distributions to a common scale.
Definition. The probability distribution of agents in all categories of characteristic $C_i$ is given by $P(C_i) : c^i \to \mathbb{R}_{[0,1]}$, where $C_i$ is the $i$-th characteristic of a population of $n$ characteristics ($i = 1, \dots, n$; $i, n \in \mathbb{Z}^+$), $c^i$ is a category under characteristic $C_i$, $P(C_i)$ is the probability distribution of characteristic $C_i$ and $\mathbb{R}_{[0,1]}$ is the set of real values between 0 and 1.
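The conversion to a common scale can be sketched in Python as follows. The household counts are invented for illustration. The first function applies the conversion proposed earlier for household level data (multiplying the number of households by household size to count persons), and the second normalises frequencies into a probability distribution.

```python
def persons_by_household_size(households_by_size):
    """Convert household counts keyed by household size into person counts."""
    return {size: count * size for size, count in households_by_size.items()}


def to_probabilities(counts):
    """Normalise a frequency distribution so that its values sum to 1."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("the distribution must contain at least one agent")
    return {category: value / total for category, value in counts.items()}


# Hypothetical input: number of households of size 1, 2 and 3.
households = {1: 100, 2: 150, 3: 50}
persons = persons_by_household_size(households)   # {1: 100, 2: 300, 3: 150}
p_household_size = to_probabilities(persons)
```

Both distributions passed to IPF are then expressed as probabilities over persons, regardless of whether the source data counted persons or households.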

Merging agent distributions
We employ IPF to obtain a joint probability distribution by merging the probability distributions derived from data sources. IPF relies on a disaggregated data sample of the target population, which is used as the seed (initial estimate of cell values). However, obtaining a disaggregated data sample on the population represented by marginal distributions is difficult due to data availability constraints. We propose circumventing this problem by indicating cells that can logically contain agents with 1s and cells that cannot contain agents with 0s. This is a deterministic assignment made based on domain knowledge. This abstract disaggregated data sample is represented by the seed matrix in the figure. The data distributions are expected to be at the same aggregation level, though they may capture a hierarchy of concepts. For instance, when merging a person level distribution (e.g. age distribution) and a household level distribution (e.g. household size distribution), we expect the number of persons to be the counting unit of both distributions. We denote the joint probability distribution constructed with IPF as $I$.

Given $\Pi$ as the set of $n$ probability distributions representing $n$ characteristics obtained from data sources, the $n$-dimensional joint probability distribution ($I$) is obtained by merging $\Pi$ using IPF: $I = \mathrm{IPF}(s, \Pi)$. Here, $s$ is the $n$-dimensional seed matrix, $\Pi$ is the ordered tuple of $n$ probability distributions ($P(C_i) \in \Pi$) and $I$ is the $n$-dimensional joint probability distribution produced by the IPF procedure. Furthermore, the $i$-th dimension of $I$ corresponds to characteristic $C_i$.
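To make the role of the abstract seed concrete, the following is a minimal two-dimensional IPF sketch in pure Python. It is a simplification of the $n$-dimensional procedure described above, and the categories, marginal values, tolerance and iteration limit are assumptions chosen for illustration. Cells seeded with 0 stay at 0 because IPF only rescales existing cell values.

```python
def ipf_2d(seed, row_target, col_target, max_iter=1000, tol=1e-10):
    """Fit a 2-D table to row and column marginals, starting from a seed matrix."""
    table = [[float(v) for v in row] for row in seed]
    n_rows, n_cols = len(table), len(table[0])
    for _ in range(max_iter):
        # Scale each row so that it sums to its target marginal.
        for i in range(n_rows):
            row_sum = sum(table[i])
            if row_sum > 0:
                factor = row_target[i] / row_sum
                table[i] = [v * factor for v in table[i]]
        # Scale each column so that it sums to its target marginal.
        for j in range(n_cols):
            col_sum = sum(table[i][j] for i in range(n_rows))
            if col_sum > 0:
                factor = col_target[j] / col_sum
                for i in range(n_rows):
                    table[i][j] *= factor
        # Columns are exact after the column step, so check the rows.
        row_error = max(abs(sum(table[i]) - row_target[i]) for i in range(n_rows))
        if row_error < tol:
            break
    return table


# Rows: age (child, adult). Columns: household size (1, 2, 3 persons).
# Hypothetical domain knowledge: a child cannot live in a 1 person household.
seed = [
    [0, 1, 1],
    [1, 1, 1],
]
p_age = [0.3, 0.7]
p_household_size = [0.2, 0.3, 0.5]

I = ipf_2d(seed, p_age, p_household_size)
```

In this example the (child, 1 person) cell of `I` remains exactly 0, while the other cells are adjusted until both marginal distributions are reproduced.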
The joint probability distribution resulting from IPF is the representation of the merged population at the lowest aggregation level. We can get the merged agent population without group structures if we multiply the joint probability distribution ($I$) by the expected total number of agents and then instantiate the number of agents given in each cell with the corresponding agent properties. If the population size is unknown, a suitable number has to be chosen based on domain knowledge. It has to be a reasonably large number to sufficiently capture important structural characteristics in the joint distribution. If the size of the target population is given as $N$, the population distribution by the number of agents ($S$) is given by $S = N \times I$.
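Scaling the joint distribution by $N$ generally yields fractional cell values, so some integerisation step is needed before agents can be instantiated. The text does not prescribe one; the Python sketch below uses largest remainder rounding purely as an assumption, so that the counts still sum to $N$. The cell labels and probabilities are hypothetical.

```python
import math


def expected_counts(joint, n_agents):
    """Scale joint probabilities to integer counts that sum to n_agents.

    Uses largest remainder rounding, which is an assumption of this sketch
    and not a step specified in the text.
    """
    cells = list(joint)
    raw = {cell: joint[cell] * n_agents for cell in cells}
    counts = {cell: math.floor(raw[cell]) for cell in cells}
    shortfall = n_agents - sum(counts.values())
    # Hand out the remaining agents to the cells with the largest remainders.
    by_remainder = sorted(cells, key=lambda c: raw[c] - counts[c], reverse=True)
    for cell in by_remainder[:shortfall]:
        counts[cell] += 1
    return counts


joint = {
    ("Male", "1 person"): 0.125,
    ("Male", "2 persons"): 0.375,
    ("Female", "1 person"): 0.25,
    ("Female", "2 persons"): 0.25,
}
S = expected_counts(joint, 10)
```

Other integerisation schemes could be substituted here without affecting the rest of the process.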
Although the above process allows instantiating all the agents with the correct properties, it does not produce the group (social) structures in the population. For example, we can obtain all the agents in four member households using the above process, however, it will not give us information on the composition of household structures. So, there needs to be a mechanism to reconstruct group structures in the population.

Groups construction
The grouping process consists of two parts. The first part constructs an initial estimate of the group structures in the population using a process based on Monte Carlo sampling, as described in the group construction algorithm. The second part is improving the solution with hill climbing optimisation. The approach proposed in this work is a heuristic algorithm that progresses using specified rules and observed distributions.
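One of the inputs to the algorithm is the conditional probability distribution giving the probability of observing an agent of a certain agent type given its group type. It is obtained from the joint probability distribution $I$ by normalising over the agent types within the given group type: $P(A = a \mid G = g) = \frac{I(a, g)}{\sum_{a' \in A} I(a', g)}$, where $I(a, g)$ is the cell of $I$ identified by the categories of agent type $a$ and group type $g$.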
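The conditional distribution can be computed from the joint distribution by selecting the cells of one group type and renormalising them. The Python sketch below assumes the joint distribution is stored as a dictionary keyed by (agent type, group type); the labels and probabilities are hypothetical.

```python
def conditional_agent_distribution(joint, group_type):
    """Return P(A | G = group_type) from a joint distribution over (agent type, group type)."""
    column = {a: p for (a, g), p in joint.items() if g == group_type}
    total = sum(column.values())
    if total == 0:
        return {}  # the group type has no probability mass
    return {a: p / total for a, p in column.items()}


joint = {
    ("child", "1 person"): 0.0,
    ("adult", "1 person"): 0.25,
    ("child", "2 persons"): 0.25,
    ("adult", "2 persons"): 0.5,
}
p_given_one = conditional_agent_distribution(joint, "1 person")
p_given_two = conditional_agent_distribution(joint, "2 persons")
```

Agent types with zero probability in a group type, such as a child in a 1 person household here, keep zero probability in the conditional distribution and are therefore never sampled for that group type.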
Here we first describe the algorithm at a high level before going into specific details. The algorithm iterates over all the group types, constructing the expected number of groups under the selected group type in each iteration. A group is constructed in two phases. In the first phase, an agent is selected for the group and all its compulsory links are formed by adding suitable new agents to the group. For example, the marital relationship of a married person is a compulsory link. More specifically, compulsory links are given by the minimum number of a link type in an agent type's link rules. If new agents were added to the group during the process, their compulsory links are formed as well. This is continued until all compulsory links are formed for all of the agents in the group. After that we check the group type using the function $Q$ defined earlier.
If the group has the expected group type it is added to the population, and we start forming a new group from the beginning. If not, the algorithm starts the second phase, where agents' non-compulsory links are formed (lines -). This phase iterates over the current agents in the group, in the order they appear, forming their non-compulsory links until the expected group type is achieved. We ensure that no link rule violations occur when adding new agents. Once the expected group type is achieved, the group is added to the population, if not it is discarded.
http://jasss.soc.surrey.ac.uk/ / / .html Doi: . /jasss.

[Algorithm: Add new agent to group]

The inputs to the algorithm are the link rules, the group rules, the set of group types in the population, the conditional probability distribution ($P(A|G = g)$), the expected population distribution ($S$) and the maximum number of iterations allowed when forming groups ($Itr_{max}$). The output of the algorithm is the final population with group structures ($\Gamma$). The following functions and formalisms are used in the algorithm.
• |.| -size of any set or tuple • type(α) -type of agent α • min(l) -minimum required link instances for link type l • max(l) -maximum allowed link instances for link type l • R L (r = type(α)) -returns the link types and corresponding target agent types that agent α can form links with according to the link rules The algorithm starts by initialising the output population to an empty set. Then we select the first group type (g) from the set of all group types and start forming group instances (line ). In line we get the expected number of groups of the selected group type from S. Initially the number of current groups (z) is set to . The algorithm keeps forming groups of current group type until the required number of groups are formed or it exceeds the maximum number of allowed iterations (line ). The group being constructed is represented by γ and initially empty. The first agent of the group is selected using a Monte Carlo sampling technique based on the P (A|G = g) distribution of di erent agent types appearing in a group of the selected group type g (line ).
Here the P(A|G = g) distribution ensures that agent types not in the selected group type are not added to the group. Line adds the new agent to the group and line initialises index j to to indicate the first member of the group. The next step is forming the minimum required links of the agents in the group. In line , we select the agent represented by the j-th index in the group γ to form its required links. In the first iteration j refers to the new agent. The unformed required links of an agent can be found by taking the link types, in the agent's link rules, for which the minimum required number is larger than the number of links the agent has already formed (line ). Agents to form the missing links are selected using Monte Carlo sampling and added to the group (lines -). Link conditions are applied whenever an agent is added to the group (algorithm ). This process is iterated for all the agents in the group until all the required links are formed.
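The search for unformed required links described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm listing: the `LINK_RULES` dictionary, its agent type names and the function `missing_required_links` are hypothetical, and link rules are assumed to store min(l), max(l) and the allowed target agent types per link type.

```python
from collections import Counter

# Hypothetical link rules: agent type -> {link type: (min(l), max(l), target agent types)}.
LINK_RULES = {
    "married_male": {
        "MarriedTo": (1, 1, ["married_female"]),
        "FatherOf": (0, 3, ["child"]),
    },
    "child": {
        "ChildOfFather": (0, 1, ["married_male"]),
    },
}


def missing_required_links(agent_type, existing_links):
    """Return {link type: instances still needed to reach min(l)} for one agent.

    existing_links is the list of link types the agent has already formed.
    """
    formed = Counter(existing_links)
    missing = {}
    for link_type, (min_l, _max_l, _targets) in LINK_RULES[agent_type].items():
        # A link type is unformed when its minimum exceeds the links formed so far.
        if min_l > formed[link_type]:
            missing[link_type] = min_l - formed[link_type]
    return missing
```

For a newly added married male with no links, the sketch reports one missing `MarriedTo` link; once that link exists, nothing remains and the agent only has optional links left.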
If the group has fulfilled the requirements of the expected group type after forming all the required links, it can be added to the population (line ). Otherwise, the algorithm adds the missing number of agents by forming optional links of the existing agents until the group reaches the expected size. The expected size is assumed to be one of the group level characteristics (line ). In line , we check whether the selected agent has formed the maximum allowed links of each link type it can form according to the link rules, in order to identify optional links. Given that agents have formed all their required links during the first phase, any remaining link is considered an optional link. This computation is the same as finding the required links as described above, except that it uses max(l), which returns the maximum number of links allowed for link type l. The output of the computation (optionals) is a set of ordered pairs, each consisting of a link type (l) and a set of target agent types (B) with which α can form type l links. Though the same target agent type can be present in multiple pairs in optionals, these occurrences are considered different because the links they form are different. If the optionals set is not empty, we obtain all the different link type (l) and target agent type (b) pairs by taking, for each pair in optionals, the Cartesian product of its link type and its set of target agent types. The set of (l, b) pairs is represented by linkpairs (line ). If the optionals set is empty, we select the next agent in the group (line ) and start forming its non-compulsory links. The link type and the target agent type are selected using Monte Carlo sampling according to the P(linkpairs|G = g) conditional probability distribution (line ). The calculation given below shows how to obtain the conditional probability distribution of the linkpairs given g as the group type:

Once the required number of agents are added to the group, we check whether the group matches the expected group type using function Q and add it to the population if it is valid (lines and ). If the group does not match the expected group type it is discarded. This process is continued until all the groups are formed. At the end of this process, we have an initial estimate of the whole population.
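The optional-link step can be illustrated with a short sketch. It assumes, as a simplification that may differ from the paper's own calculation, that each (l, b) pair is weighted by P(A = b|G = g) and that the weights are normalised over the candidate pairs. The function names and the rule layout are hypothetical.

```python
import random
from itertools import product


def optional_link_pairs(rules, formed_counts):
    """Build linkpairs for one agent.

    rules: {link type: (min(l), max(l), [target agent types])} (hypothetical layout).
    formed_counts: {link type: number of links already formed}.
    """
    pairs = []
    for link_type, (_min_l, max_l, targets) in rules.items():
        # A link type is still optional while fewer than max(l) links exist.
        if formed_counts.get(link_type, 0) < max_l:
            pairs.extend(product([link_type], targets))
    return pairs


def sample_link_pair(pairs, p_agent_given_group, rng=None):
    """Draw one (l, b) pair with weight proportional to P(A = b | G = g).

    The weighting is an assumption made for this sketch. Returns None when
    no candidate pair has a positive probability under the group type.
    """
    rng = rng or random.Random()
    weights = [p_agent_given_group.get(b, 0.0) for _l, b in pairs]
    if sum(weights) == 0:
        return None
    return rng.choices(pairs, weights=weights, k=1)[0]
```

Pairs whose target agent type has zero probability under the group type are never drawn, which mirrors the requirement that agent types absent from a group type are not added to it.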
The initial estimate of the population is improved using a standard hill climbing optimisation. The objective function used for hill climbing is based on the root mean squared error (RMSE). To calculate the error we define S′ as a function giving the total number of agents in the current reconstructed population, analogous to the function S given earlier, which represents the expected population. The RMSE calculation is given below. When proposing a change to the current estimate, we randomly select a group from the current population and construct a new group of the same type. The construction process for the new group follows the same logic as explained from line - but, instead of Monte Carlo sampling, we perform random sampling. In each case agent types are selected only when P(A|G = g) > 0 and P(linkpairs|G = g) > 0 to avoid adding agent types that do not appear under a group type. If swapping the new group with the old group improves the RMSE we accept the change. This is continued until an RMSE of 0 is achieved or the maximum number of iterations is exhausted.
At the end of the process, we are given an agent population that is structurally similar to the input marginal distributions.
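The hill climbing refinement can be sketched as below. The sketch assumes the usual RMSE form over agent types, that is, the square root of the mean squared difference between S and S′. The paper's exact objective may differ, and `propose_group` is a hypothetical callback standing in for the random-sampling group construction.

```python
import math
import random
from collections import Counter


def rmse(expected, population):
    """RMSE between expected agent counts (S) and counts in the population (S')."""
    current = Counter(agent for group in population for agent in group)
    keys = set(expected) | set(current)
    squared = sum((expected.get(k, 0) - current.get(k, 0)) ** 2 for k in keys)
    return math.sqrt(squared / (len(keys) or 1))


def hill_climb(population, expected, propose_group, max_iter, rng=None):
    """Replace a randomly chosen group with a proposal only if the RMSE drops."""
    rng = rng or random.Random()
    population = list(population)
    best = rmse(expected, population)
    for _ in range(max_iter):
        if best == 0:
            break
        i = rng.randrange(len(population))
        # propose_group must return a new group of the same type as the old one.
        candidate = population[:i] + [propose_group(population[i], rng)] + population[i + 1:]
        error = rmse(expected, candidate)
        if error < best:
            population, best = candidate, error
    return population, best
```

Because a swap is accepted only when it lowers the error, the returned RMSE can never exceed that of the initial estimate. Like any hill climbing, it may stop at a local optimum.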
The population represents properties of person and household instances using the categories given in the input marginal distributions. In some situations, we have to assign persons and households specific values within a category. Age is such an example, as marginal distributions represent the population with age groups in most cases. We propose assigning an exact year to a person's age using a suitable method considering relevant heuristics. For example, we may assign an age based on the number of persons by age (year) distribution of the population considering parent-child and marital partner age constraints.
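One possible way to assign exact ages is rejection sampling within the age categories. The sketch below is illustrative only: it draws ages uniformly, whereas a by-year age distribution could be substituted, and the function name and the `min_gap` parameter are hypothetical.

```python
import random


def assign_parent_child_ages(parent_range, child_range, min_gap, rng=None, max_tries=1000):
    """Draw exact ages inside the two age categories, respecting a minimum age gap.

    parent_range and child_range are inclusive (low, high) year bounds.
    Returns (parent_age, child_age), or None if no valid pair is found.
    """
    rng = rng or random.Random()
    for _ in range(max_tries):
        parent_age = rng.randint(*parent_range)
        child_age = rng.randint(*child_range)
        # Accept the draw only when the parent-child constraint holds.
        if parent_age - child_age >= min_gap:
            return parent_age, child_age
    return None
```

A marital partner age constraint could be handled in the same way by adding a second acceptance test before returning.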
Forming groups with subgroups
There are two ways to construct complex group structures that consist of multiple subgroups like multifamily households.
The first method is applying the proposed system iteratively on the population. Here, the process starts by constructing the subgroups of the lowest group aggregation level using the agent entities. After that, subsequent iterations use the subgroups of the previous iteration as the agent entities to construct the groups of the next higher aggregation level. This approach requires specifying a new set of link rules, link conditions, group rules and marginal distributions representing agent and group entities for each iteration. In reference to a population of multifamily households, this method uses persons to form the families in the first iteration, and the family instances to form the households in the second iteration.
The second method is defining group rules at the highest group aggregation level while considering the composition of subgroups, so that the algorithm forms the complete groups in one iteration while adhering to subgroup compositions. In this method, link rules and conditions are specified at the agent level and group rules are specified at the highest group aggregation level. For example, in a multifamily population, link rules and link conditions specify relationships between persons, and group rules specify household structures considering the families in them. This method is more suitable if data on subgroups are incomplete, for example, when data on families in a household are incomplete or unavailable, as this approach only requires marginal distributions at the agent level (e.g. person level) and the highest group aggregation level (e.g. household level). The second case study described below is an example of this approach.

Wedding Doughnut (WD) (Silverman et al. ) and Linked Lives (LL) (Noble et al. ) are two agent based simulations modelling the UK population to evaluate social care needs. In this case study, we investigate constructing a common population by merging WD and LL populations. Similar work is also presented in (Wickramasinghe et al. ); however, they use a different algorithm and do not propose a framework for recording population heuristics. The purpose of this exercise is validating the proposed framework's applicability in constructing merged populations for integrated agent based simulations.
The highlights of the WD model are its familial relationship representation and demographic processes. WD's marital partnership formation model is based on the social affinity of the agents. The spatial representation of the model is a toroidal space depicting agents' social networks. The agents choose their partners based on their social affinity, and partnership formation results in two agents moving to a location between their original locations. Its demographic process uses the Lee-Carter model and is guided by statistical data. The objective of LL is evaluating social care demand and supply amid changes in household structures. LL uses the Gompertz-Makeham mortality model and a flat reproductive probability for all -females. An abstract geographical representation of the UK is used for the spatial representation. Agents can move from house to house within this space at different stages of life. Marital partnerships are formed between randomly selected persons.
A significant part of integrating ABMs is constructing an initial agent population consistent with all integrated models (Wickramasinghe et al. ). Here we apply the proposed methodology to construct a merged population for WD and LL. Based on the above analysis, we decided that the merged initial population should be consistent with the WD population's distribution of age, gender and marital status, and the LL population's distribution of household sizes. Table gives the categories represented in the joint distribution extracted from WD.
The joint distribution of the person level (agent level) characteristics obtained from WD consists of person types: age categories with year gaps × two relationship status categories × two categories based on gender. The categories under these characteristics are represented in Table . An example of a person type in the population is (Male, Married, 28-31), which is succinctly represented as (X ,M ,O ), using the category labels in Table . The only group level characteristic used in this population is the distribution of household sizes. Household types range from one member households (H ) to member households (H ).

Male (X )
Female (

The first step of the population construction process is obtaining the conditional probability distribution of agent types in a given group type by merging the distributions obtained from the two models using IPF. To obtain the data distributions, we executed the WD and LL simulations independently and evolved them to the year , so that the two populations conceptually represent the same UK population from a temporal point of view. At the year , there were around persons in LL and in WD. The person level joint probability distribution was obtained from WD and the household level joint probability distribution from LL, by counting the number of persons that belong to different person and household types. The two distributions were then merged using IPF. The seed was constructed in the manner described in the Methodology section, where impossible cells were set to and possible cells were set to . An example of an impossible cell is a (Male, Single, 0-3) child living alone in a one person household. The IPF output is a matrix of proportions, with each cell representing the proportion of persons with a given combination of properties in the target population. This is the distribution presented by I in Equation in the Methodology section. This distribution is converted to conditional probability distributions using Equation , taking household sizes as group types (g) and combinations of age, gender and marital status as agent types (a). To obtain the number of persons under each category, the matrix is multiplied by the target population size, persons in this case (Equation ).
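The merging step can be illustrated with a small two dimensional IPF sketch. It is a toy example under stated assumptions: the seed uses 0 for impossible cells and 1 for possible cells, the marginals are made-up proportions, and the function names are hypothetical. Rows stand for agent types and columns for group types.

```python
def ipf(seed, row_targets, col_targets, iterations=100, tol=1e-9):
    """Fit a 2D seed matrix to row and column marginals; zero cells stay zero."""
    table = [row[:] for row in seed]
    for _ in range(iterations):
        # Scale each row to match the agent type (person level) marginal.
        for i, target in enumerate(row_targets):
            total = sum(table[i])
            if total > 0:
                table[i] = [value * target / total for value in table[i]]
        # Scale each column to match the group type (household level) marginal.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in table)
            if total > 0:
                for row in table:
                    row[j] *= target / total
        # Stop once the row marginals are also reproduced after column scaling.
        if all(abs(sum(row) - t) < tol for row, t in zip(table, row_targets)):
            break
    return table


def conditional_given_group(table):
    """Column-normalise the fitted table to obtain P(A = a | G = g)."""
    n_cols = len(table[0])
    col_totals = [sum(row[j] for row in table) for j in range(n_cols)]
    return [
        [row[j] / col_totals[j] if col_totals[j] else 0.0 for j in range(n_cols)]
        for row in table
    ]
```

With rows (child, adult), columns (one person, two person households), the seed `[[0, 1], [1, 1]]` and marginals `[0.4, 0.6]` and `[0.3, 0.7]`, the impossible cell stays at zero and all children are placed in two person households. Multiplying the fitted proportions by a target population size would then give person counts.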

Population heuristics
The population heuristics govern the relationships and household structures formed by the algorithm. Given the nature of the WD and LL models, we assume that all the households are family households and that there are no unrelated individuals in them, except in one member households. Although the data extracted about households do not identify multifamily households, we have to allow them in the merged population to be able to create large households such as person households. The following is the list of heuristics applicable to this case study, which are later represented as link rules and group rules.
1. A person can have only one marital partnership.
2. A person can have up to children.
3. A person can have only one father and one mother.
4. A person over years old can live alone.
5. Only persons aged years or more can form marital partnerships.
6. A male can only have a marital partnership with a female from the same age category (group) or a category up to years younger.
7. A female can only have a marital partnership with a male from the same age category (group) or a category up to years older.
8. A child must be at least years younger than the parent.
9. Only persons aged years or more can have children.
10. All people in the household are related.
11. The children of a person are also considered children of the person's partner.
12. Household types are decided by households' sizes.

Specifying link rules
The first step of defining link rules is identifying link types in the population. The link types in the population capture familial relationships among agents, as given by heuristics to in the above list. These link types are sufficient in this case study because WD and LL only represent familial relationships. Table gives all the link types used in the link rules. A None relationship is a representation of a non-existing relationship. This is used in the proposed population construction methodology for single persons living without any relatives in one member households.

Link type        Description
MarriedTo        Marital relationship formed by a married person
MotherOf         Maternal relationships a person forms with another person
FatherOf         Paternal relationships a person forms with another person
ChildOfFather    A person can have a father who is living in the same household
ChildOfMother    A person can have a mother who is living in the same household
None             Indicates empty relationship - for single persons living alone
Table : WD and LL link types
Link rules for the population can be derived based on heuristics to . For example, the relationship types of the (Male, Married, 28-31) person type are MarriedTo, FatherOf, ChildOfFather and ChildOfMother.
When we consider the marital relationships of this person type, the marital partner needs to come from one of the (Female, Married, 16-19) to (Female, Married, 28-31) person types. The FatherOf relationships of the above person type can be formed with a person from any gender and marital status category, but in a younger age category with at least a years gap. These and the remaining two link rules of the (Male, Married, 28-31) person type are given in Table .

Reference agent type    Link type    Target agent type

As manually writing all the link rules is a cumbersome task, we automated the process using the statements shown in Table . The table gives

Here the marital and gender categories of the target agent types are constant because the partner must belong to the Female (X ) and Married (M ) categories. However, the age categories from which a partner can be selected change with the reference agent type's age, because the heuristic specifies that the female partner of a male must be in the same or in a younger age category with no more than a year gap. We represent this by taking married females from Oω to O age categories, given male reference agent type's age category is Oω (where = ω−3). Additionally, we have also included a special condition to limit the youngest age category of a female partner to 16-19 (O ), to avoid selecting person types younger than years old for marital partnerships. The same approach was used to construct the other link rule statements in Table .

Specifying group rules
The group types in the merged WD and LL population are the household types observed in the LL population distribution, which represents the number of persons in a household. Generating templates for all these household types is computationally expensive because person type combinations for a household grow exponentially with the household size. For example, there are 44 person types (combinations of person categories) that can form realistic one person households. More specifically, there are $\binom{2}{1}$ ways to select one from two genders, $\binom{1}{1}$ ways to select single persons (because married persons cannot live alone) and $\binom{22}{1}$ ways to select one age category from the 22 age categories that are over years old. This amounts to $\binom{2}{1} \times \binom{1}{1} \times \binom{22}{1} = 44$. When we consider two person households, there can be households of married couples or single parent households, which produce a total of household configurations. There are even more combinations for three member households. This computation is not feasible with resources available to most researchers. Our attempts to generate all the templates in this manner failed even with a system of TB RAM and . GHz cores.
In this work, we avoid having to define a large number of household templates by taking the number of persons in the household as the unique feature that maps a given household to its type, as in definition . Here defining group rules is relatively simple because there is only one group level characteristic, the household size distribution. The function Q_wdll that determines the type of a given household is represented below and its logic is given in algorithm . Here, η is the household instance.
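The size-based group rule can be sketched in a few lines. The label format `"H<n>"` and both function names are assumptions made for illustration. The sketch only mirrors the idea that a household's type is determined by its number of members.

```python
def q_wdll(household):
    """Map a household instance (a collection of persons) to a size-based type label."""
    return f"H{len(household)}"


def matches_expected_type(household, expected_type):
    """Check whether a constructed household has the expected group type."""
    return q_wdll(household) == expected_type
```

Under this rule a household of three persons maps to `"H3"` whatever its person types are, which is what removes the need to enumerate household templates.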
We generated pairs of WD and LL populations with different random seed values (for the random number generator) and used their marginal distributions to construct merged population instances. The above specified link rules, link conditions, group rules and IPF seed matrix were used for merging all the population instances. Finally, for age, persons are assigned random years within their age category considering parent-child and marital partner age gap constraints.
To evaluate the results we compared each merged population's joint distribution of the number of persons by gender, relationship status and age against the corresponding marginal distribution obtained from WD, and the distribution of household sizes against the corresponding LL household sizes distribution. When performing the tests we removed impossible person categories (e.g. male, married, age 0-3) from both the expected and observed distributions.

The goodness of fit of each constructed population was evaluated using the Freeman-Tukey goodness of fit test (Freeman & Tukey ). The test statistic is given by

with O and E as the observed and the expected distributions, respectively. The FT test statistic is a representation of the error between the two distributions and it follows a χ² distribution with the degrees of freedom equal to one less than the number of categories in the compared distributions. In general terms, the p-value is the probability of observing an error at least as large as the one represented by the FT statistic if the null hypothesis (H_0) is true. The null hypothesis of the test is that the two distributions are similar and the alternate hypothesis (H_a) is that they are different. If the p-value is less than a significance level ( . ) the null hypothesis is rejected, that is, the observed distribution is deemed not a good fit to the expected distribution. On the other hand, high p-values indicate that, when the two distributions are assumed to be similar, there is a high probability of having errors as extreme as those observed. We further compare the synthesised population to census distributions using graphical representations later in this section to complement the statistical analysis results.
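As an illustration of the test, the sketch below computes the Freeman-Tukey statistic in its commonly stated form, FT = 4 Σ (√O_i − √E_i)². This form is an assumption here, and the function name is hypothetical.

```python
import math


def freeman_tukey(observed, expected):
    """Freeman-Tukey statistic, assuming the form 4 * sum((sqrt(O) - sqrt(E))^2).

    Returns the statistic and the degrees of freedom (categories minus one).
    """
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    statistic = 4.0 * sum(
        (math.sqrt(o) - math.sqrt(e)) ** 2 for o, e in zip(observed, expected)
    )
    return statistic, len(observed) - 1
```

A p-value could then be obtained from a χ² survival function with the returned degrees of freedom, for example `scipy.stats.chi2.sf(statistic, dof)` if SciPy is available.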
We further performed a power analysis on the test to explore the probability of detecting an effect in the reconstructed population when such an effect is actually present. For the test, we used . as the significance level, as sample size (the population size) and . effect size. The effect size was selected according to general guidelines provided by Cohen ( ) for social sciences where . , . and . were proposed for small, medium and large effect sizes respectively. Table gives the results of the power analysis. It shows that both tests will correctly reject the null hypothesis with a probability higher than the widely accepted . level when the significance level (the probability of incorrectly rejecting a true null hypothesis) is set to . . The probability of type II error in the LL comparison test is only . and in the WD comparison test, the probability is . .

Table : WD and LL Freeman-Tukey test results (columns: Lowest p-value, Mean, SD; rows: WD, LL)

Table: power analysis results (columns: Degrees of freedom, Power; rows: WD, LL)

Table . Here, we show only some of the labels due to space limitations. According to the graph, errors are relatively small in the reconstructed population and its structure is very similar to the input distribution, which supports the claim that populations synthesised using the algorithm are consistent with input distributions. Figures b, c and d show the structural consistency of person level characteristics individually. The characteristics used here are the distributions of the number of persons by gender, marital status, and age. Figure e compares the distribution of the number of households by sizes in the input and in the synthesised population. It is evident from the graphs that the populations constructed with the proposed methodology are structurally consistent with their input distributions.
In this case study, we demonstrated how the proposed framework can be used to construct an initial consistent agent population for integrated ABMs. While the algorithm produces promising results consistently, one of the reasons for observed mismatches is round-off errors. Another factor is discrepancies in the two input distributions because they come from agent populations of different simulations, though we assume they conceptually represent the same population. Here we expect the algorithm to produce the best possible solution it can with the available data.

Case Study : Population Construction Using Census Data
In this case study, we explore constructing a synthetic population by merging aggregated data from the Australian census. The Australian Statistical Geography Standard (ASGS), developed by the Australian Bureau of Statistics, defines a hierarchical system that divides the country into smaller geographical areas. Statistical Area (SA ) is the third smallest area defined in the system, and in most cases these areas correspond to officially gazetted state suburbs and localities. To construct the population we collected individual level and household level information of all SA s that fall under the Greater Melbourne area in the state of Victoria. The proposed sample free technique is more desirable here, as any population generated with microdata samples would not be freely usable due to privacy restrictions.

Person level data
Individual level information collected under each SA includes joint distributions for the number of persons by age, gender and the person's relationship in a household. There are age categories, relationship status categories and gender categories. Table gives the full list of categories under each characteristic. The labels prefixed with X, M and O are used in this document to refer to the corresponding categories. The married category includes persons in a registered marriage or in a de facto partnership. The children category includes persons categorised as dependent students aged -, dependent children under years and non-dependent children over years. Relative is short for other related individuals, which encompasses individuals who live in the same household with a family but are not part of a family nucleus, and persons related through relationships other than marital or parent/child relationships, for example, siblings. The relationship status of a person is decided in the prioritised order of marital, then parent/child, then relative relationships. Lone persons are persons living alone, and group household persons are members of households consisting only of non-related individuals, such as tenants. Additionally, though the Married category in the census includes persons in homosexual partnerships, for simplicity, we assume all married persons are in heterosexual partnerships. Also, parent/child relationships are treated from a social/legal perspective rather than from a biological perspective. A complete description of the relationship types, family types and other special terms can be found on the Australian Bureau of Statistics website.
Here, a person type can be obtained by taking combinations of categories from each characteristic, for example (Male, Married, age 25-39) -(X ,M ,O ).

Sex
Male (

Household level information extracted for each SA includes the joint distribution of the number of persons by household size and family-household composition. Table gives the categories under the two characteristics. Family household composition categories include the number of family units in the household and the type of the primary family unit. For example, Two or more family household: Couple family with no children refers to households with two or more family units in which the primary family is a couple family with no children. A household type in this case study is represented by a combination of household size categories and family-household composition categories, for example, (4 person, One family household: One parent family) - (H ,U ).
A person belongs to only one family in the Australian census family categorisations, and a household can have multiple families. A person's family nucleus is determined in the prioritised order of the person's marital relationship, parent/child relationships and other relationships. Any un-grouped children are added to the same family as their parents. In the case of multi-generation parent/child relationships, ties are broken by prioritising the younger generation's relationship, and the unmarried grandparent is categorised as a relative of the younger family. The same applies to older single parents of a married couple. The primary family of a multi family household is selected in the prioritised order of couple family with children, one parent family and finally, couple family with no children and other family with equal priority.
Though the census identifies up to three family units in a household, the categories used here do not distinguish between households consisting of two and three families. They also do not distinguish among six to eight person households, though the census does. The categories shown are selected to match the categories in the disaggregated data samples available to us, in the interest of doing a more reasonable comparison with the sample data dependent IPU based population synthesiser proposed by Ye et al. ( ). It is noteworthy that the sample data can contain households of three families, and seven or eight persons though not categorised

(a) Family household composition characteristic

Categories                                                          Code
One family household: Couple family with no children                U
One family household: Couple family with children                   U
One family household: One parent family                             U
One family household: Other family                                  U
Two or more family household: Couple family with no children        U
Two or more family household: Couple family with children           U
Two or more family household: One parent family                     U
Two or more family household: Other family                          U
Lone person household                                               U
Group household                                                     U

)'s method to generate those households. These households are categorised under Two or more family households and six or more persons households accordingly. To generate a similar population with the proposed method, group rules are specified to categorise households with two and three family units as Two or more family households and households with six to eight persons as 6 or more persons households. If one needs fully descriptive family household composition categories, there are no restrictions on specifying suitable group rules.

We construct households in a single run by grouping persons directly, considering both household and family compositions at once. Here, link rules specify persons' relationships and group rules specify composite household and family structures. The alternative is constructing households in two iterations, one to form families using persons and the other to form households using families. This is not suitable here because the marginal distributions only describe primary families, so there is not enough information to construct all the families in the population in the first iteration. The approach used here also has the advantage of creating inter-family relationships of all the members in the household while assigning them to the correct family, whereas the other approach would create families as standalone units with only the intra-family relationships of members, unless inter-family relationships are explicitly created as an ex post step.
Prior to constructing the population, the census data need to be cleaned to minimise data inconsistencies. In the Australian context, census data inconsistencies are caused either by limitations in the data collection process or by errors deliberately introduced to protect privacy. The following is the list of heuristic adjustments made to the population to minimise the inconsistencies. It is important to note that the data set is not descriptive enough to remove the inconsistencies completely. One such example is the inability to know the exact number of required married persons due to the lack of information on secondary and tertiary family units in multifamily households.
1. If the number of group household persons in the person level data does not match the number of persons expected according to the household level data, update the person level group households distribution proportionally while preserving the sex and age distribution.
2. Proportionally update the number of lone persons in the person level distribution to match the persons required to form lone person households, if they are different.
3. If the numbers of married males and married females are different, proportionally increase the one with fewer persons to match the other.
4. If there are not enough married males and females to form all primary family units that contain married couples, proportionally increase males and females in the person level distribution.
5. If the number of lone parent persons is less than the number of lone parent family units in households, increase the lone parent persons proportionally to match the required number of persons.
6. There must be enough children to form all couple family with children and lone parent family units with at least one child in each. If not, increase the number of children proportionally.
7. If there are not enough relative persons to form all primary other-family units, increase the number of relative persons proportionally.
8. If the total number of persons is less than the number of persons required by households, increase the persons proportionally. If there are more persons, no changes are made due to the possibility of information loss.
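The proportional adjustments in the list above can be illustrated with a small helper. The sketch below is an assumption-laden example rather than the cleaning procedure used in the case study: it uses largest-remainder rounding to keep integer counts, and the function names and age labels are hypothetical.

```python
def scale_to_total(counts, target_total):
    """Scale category counts proportionally so that they sum to target_total.

    Largest-remainder rounding (an assumption of this sketch) keeps the
    results as integers while preserving the category proportions.
    """
    current = sum(counts.values())
    if current == 0:
        return dict(counts)
    raw = {key: value * target_total / current for key, value in counts.items()}
    result = {key: int(value) for key, value in raw.items()}
    shortfall = target_total - sum(result.values())
    # Hand the remaining units to the categories with the largest remainders.
    by_remainder = sorted(raw, key=lambda key: raw[key] - result[key], reverse=True)
    for key in by_remainder[:shortfall]:
        result[key] += 1
    return result


def balance_married(males, females):
    """Increase the smaller of the two married distributions to match the larger."""
    male_total, female_total = sum(males.values()), sum(females.values())
    if male_total < female_total:
        return scale_to_total(males, female_total), dict(females)
    if female_total < male_total:
        return dict(males), scale_to_total(females, male_total)
    return dict(males), dict(females)
```

The same `scale_to_total` helper could serve the other adjustments, for instance raising the number of children or relative persons to the count required by the household level data.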

Identifying population heuristics
The population construction starts by encoding the population heuristics using the constructs specified in the proposed framework. The following is the list of heuristics and assumptions describing important aspects of the population. They describe relationships between persons in the same household. The relationships across households are not represented in the available data, and we do not intend to represent them in the synthesised population. Here, an independent child is either a lone parent or a married person who is a child of another older lone parent or married person. A dependent child is a person in the children category.
. A person can have a marital relationship with only one person.
. Only married or lone parent persons can have parental relationships.
. A lone parent must have at least one and a maximum of seven dependent children.
. A lone parent can have up to three independent children.
. A married person can have up to six dependent children.
. A married person can have up to three independent children.
. Married and lone parent persons can form none or up to three relationships with persons from "relative" category.
. A dependent child must have at least one parent in the same household.
. The only relationship that a group household person has with other members of the household is the relationship of living in the same household.
. Lone persons have no relationships.
. Persons in "relative" category cannot form any marital or parental relationships.
. Relatives can form one "relative of the family" relationship with a married or lone parent person.
. Relatives can form none or up to seven relationships with other "relative" type persons.
. By definition, persons who fall under the "children" category do not form marital partnerships and are not parents.
. Marital partnerships are assumed to be heterosexual relationships.
. A person must be over years to form marital partnerships.
. A married male can have a marital relationship with a married female from the same age category or one below, and a married female can have marital relationship with a married male from the same age category or one above.
. A child is assumed to be to years younger than the parent.
. A group-household person can only be in a group household.
. The primary family is the family unit with the largest number of dependent children.
. Couple family with no children: must include two married persons. There can also be up to three relatives in the household. The three relatives have no relationship between them.
. Couple family with children: must include two married persons and at least one person from the children category. The family can include up to three relatives who have no relationship between them.
. One parent family: consists of a lone parent and at least one person from the children category. The family can include up to three relatives who have no relationship between them.
. Only the primary family unit can have relatives.
. "Other-family" members consist of relative persons, and the relationships between them cannot be categorised as either marital or parental.
. Lone person household consists of an adult person living alone.
. Group household only consists of persons from group-householder agent type.
. When a person forms a relationship with another person, the second person also forms a relationship with the first person. Both relationships are different points of view of the same conceptual relationship.
. The relationships in the population are not limited to biological ones. If a married person is also a parent of a child, the married person's spouse is assumed to be the other parent of the child.
. All persons in an other-family are related to each other.
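Several of the heuristics above are bounds on how many relationships of a kind a person may hold, and these can be written down as data. The sketch below is a hedged illustration: the bounds are those stated in the list, but the category labels, the relationship labels and the `violations` helper are hypothetical names introduced here, not constructs of the proposed framework.

```python
# (person category, relationship) -> (minimum, maximum) number of such links.
# The bounds come from the heuristics list; all labels are illustrative.
LIMITS = {
    ("LoneParent", "dependent_child"): (1, 7),
    ("LoneParent", "independent_child"): (0, 3),
    ("LoneParent", "relative"): (0, 3),
    ("Married", "spouse"): (0, 1),  # at most one marital partner
    ("Married", "dependent_child"): (0, 6),
    ("Married", "independent_child"): (0, 3),
    ("Married", "relative"): (0, 3),
    ("Relative", "family_head"): (0, 1),
    ("Relative", "other_relative"): (0, 7),
}

def violations(category, link_counts):
    """Return the relationships whose counts break the bounds for a category.

    Pairs missing from LIMITS are treated as not allowed, i.e. bounds (0, 0),
    which covers cases such as lone persons having no relationships.
    """
    relationships = {rel for (cat, rel) in LIMITS if cat == category}
    relationships |= set(link_counts)
    broken = []
    for rel in sorted(relationships):
        low, high = LIMITS.get((category, rel), (0, 0))
        if not low <= link_counts.get(rel, 0) <= high:
            broken.append(rel)
    return broken
```

A lone parent with no dependent children, for instance, is reported as violating the `dependent_child` bound.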

Specifying link types .
When defining link types we consider the individual types that form the links and the types of households and families they form.

Specifying link rules .
Link rules determine the relationships that each person can form with other persons based on the person's type. In the above heuristics list, items from to describe the concerns that need to be encoded as link rules. As in the previous case study, an example link rule specifies that a (Male, Relative, 85+) person type can have RelativeOther relationships with any Male (X ) or Female (X ) Relative (M ) person type from any age category (O , ..., O ). We can specify similar link rules for all the agent types in the population.

Table shows how link rules can be generated for the first five link types given in Table . The link rules specified in the first row of Table refer to marital partnerships that can be formed by male married persons of all ages. According to the heuristics, a person must be at least years old to form marital partnerships. Because of that, we only select married males in the ( -) to ( +) age categories. This is specified by the first statement in row . The second statement in the same row gives the link rules for the male person types selected in line . The link rule specifies that the females chosen for the marital partnership must be from the same age category or one category below. If the male's age category is -(O ), then that person can only form a marital partnership with a female from the same age category, because persons of age -cannot form marital partnerships. The second row of the [...] The statements in [...]

Table : Census population link types
[...]: The relationship of a lone parent with their children. A lone parent must have at least one child.
ParentOfIndependentChild: The parental relationship between a parent and a person who is married or a lone parent.
DependentChildOf: The relationship that a dependent child forms with the parents. A dependent child must have a parent.
IndependentChildOf: The relationship that a married or a lone parent person forms with the parent.
GroupHouseholdOf: The relationship of persons living in a group household. The maximum group household size is .
FamilyRelative: The relationship formed by a married or a lone parent person with relatives who belong to the same family unit.
FamilyRelativeInverse: The relationship formed by a relative with a married or lone parent person in the primary family.
RelativeOther: The relationship that persons in the family type "other family" can have with each other. We allow the "other family" family type to have up to members.
None: Indicator for a null relationship, e.g. a lone person in a member household.

[...] a population. The items -in the above heuristics list relate to link conditions. Table gives the two types of link conditions applicable to the census population. Here, α r , α t and α r indicate the agents that form the relationship. The first part shows the link conditions that need to apply because of the bidirectional nature of human relationships. New link is the newly formed link and inverse link is the link formed because of the new link. The second part of the table shows the relationships that need to be formed depending on the existing relationships of a participating agent. Existing link gives the already formed relationship and dependent link gives the relationship that needs to be formed when forming the new link.
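The two link rules discussed above and the inverse-link condition can be sketched in a few lines. This is a minimal illustration under stated assumptions: the age bins in `AGE_BINS` are hypothetical placeholders that start at the youngest category allowed to marry, the pairings in `INVERSE` are inferred from the link type descriptions rather than copied from the link conditions table, and the function names are introduced here for illustration only.

```python
from itertools import product

SEXES = ["Male", "Female"]
# Hypothetical age bins, youngest first; not the census age categories.
AGE_BINS = ["25-39", "40-54", "55-69", "70-84", "85+"]

def relative_other_rule():
    """Person types a Relative may hold RelativeOther links with:
    any sex, Relative category, any age bin."""
    return [(sex, "Relative", age) for sex, age in product(SEXES, AGE_BINS)]

def marital_rule(male_age):
    """Female partner types for a married male: same age bin or one below."""
    i = AGE_BINS.index(male_age)
    allowed = AGE_BINS[max(i - 1, 0): i + 1]
    return [("Female", "Married", age) for age in allowed]

# Inverse link types (pairings assumed from the link type descriptions).
INVERSE = {
    "FamilyRelative": "FamilyRelativeInverse",
    "FamilyRelativeInverse": "FamilyRelative",
    "ParentOfIndependentChild": "IndependentChildOf",
    "IndependentChildOf": "ParentOfIndependentChild",
    "RelativeOther": "RelativeOther",
    "GroupHouseholdOf": "GroupHouseholdOf",
}

def add_link(links, a, b, link_type):
    """Record the new link and the inverse link it implies."""
    links.append((a, link_type, b))
    links.append((b, INVERSE[link_type], a))
```

Keeping the inverse mapping as data, separate from the rule generators, reflects the decoupling the framework aims for: changing the population heuristics changes the tables, not the linking logic.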

Specifying group rules .
Group rules determine the type of a given household. In this exercise, the features used to determine household type are the household size and the household composition. The latter feature considers the number of family units, the primary family's type and whether the household's members are related individuals. The family units in the population are: couple family with children, couple family without children, lone parent family, and other family. Group households and lone persons are non-family units observed in households. Heuristics on household composition are given in lines - in the above heuristics list.
The group rules function (R^census_G), formulated as per definition , determines household type considering the two group level characteristics in the data: household size and family household composition (number of family units and primary family unit type). Given that Hi represents a household size category, the heuristic function for mapping features to household sizes is the same as in the previous case study (i ∈ Z+ [1,8]).

Table : Generating census population link rules

The features of the categories in the family household composition characteristic consist of different agent compositions and relationships. For ease of representation, here we loosely define family units to include group household persons and lone persons, in addition to the usual families. If the members of a household belong to the group household person type, the household is considered a group household and all the members are considered to be part of one family unit. For lone person households, the person alone is considered a lone person family. Apart from the above, the following agent compositions are considered basic family units: a married couple, a lone parent with a child, and two relatives with relative other relationships. The number of family units can be identified by counting these agent compositions. The family type of the first person added to the household is considered the primary family type. If the first person's family nucleus consists of a married couple that has at least one child, the family type is couple with children family; if there are no children, it is a couple only family. If there is a lone parent in the family nucleus, it is considered a lone parent family. If there are relatives with relative other relationships, the family type is other family. The two heuristic functions below are formulated according to definition . Here U_1 refers to the One family household: Couple family with no children category and U_10 refers to the Group household category.

h_9 : U_1 ↔ 1 family unit, primary family has two married persons but no children
h_22 : U_10 ↔ Household members belong to group household person type

The group rules set for the census population can be represented in the following manner: [...] where {.., H6} is the set of different household size categories and U = {U_1, U_2, ..., U_10} is the set of family household composition categories. Algorithm below gives the logic encoded into the proposed function Q^census to determine the household type of a given household η in the census population. Note that the logic for extracting different features depends on the application. Here, the household type depends on the number of persons (line ), the number of families (line ) and the composition of the primary family (line ) in the household. The household type (group type) is determined using these features based on the group rules set (line ). The number of families and the composition of the primary family are combined into one feature, in reference to the Family Household Composition characteristic in the input data (Table ). (Hi, U_β) is the categories tuple, i.e. the household type (group type), determined based on the group rules set R^census_G.
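The feature extraction described above (household size, number of family units, primary family type) can be sketched as follows. This is not the algorithm referred to in the text: the data layout (each family unit as a list of person category labels, primary family first), the category labels, the returned descriptive tuple and the top-coding of household size at `max_size_category` are all assumptions made for illustration.

```python
def primary_family_type(family):
    """Classify a family unit given as a list of person category labels."""
    if family and all(c == "GroupHousehold" for c in family):
        return "Group household"
    if family == ["LonePerson"]:
        return "Lone person"
    if family.count("Married") == 2:
        return "Couple with children" if "Child" in family else "Couple only"
    if "LoneParent" in family:
        return "Lone parent family"
    if family.count("Relative") >= 2:
        return "Other family"
    raise ValueError(f"not a legal family unit: {family}")

def household_type(families, max_size_category=6):
    """Return (size category, (number of family units, primary family type)).

    families: list of family units, with the primary family first.
    Household sizes above max_size_category are top-coded (an assumption).
    """
    size = sum(len(f) for f in families)
    size_category = f"H{min(size, max_size_category)}"
    composition = (len(families), primary_family_type(families[0]))
    return size_category, composition
```

A real group rules set would then map the returned composition onto the U categories; that lookup is omitted here because the full category list is not reproduced in this section.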

The population of an SA is constructed by running the program with the above specified link rules, group rules, link conditions, marginal distributions from census data and the seed. To perform the IPF step, which requires [...]

(a) Inverse links
To assign a year to a person's age property, we selected a year within the person's age category according to the number of persons by age (year) distribution of each SA . Age assignment further considered the relevant population heuristics related to a person's age.
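The age assignment step can be illustrated with a weighted draw. The sketch below assumes the persons-by-age distribution is available as a dictionary from single year of age to person count; the function name `assign_age` and the uniform fallback for empty distributions are assumptions, and the additional age-related heuristics mentioned above are not modelled.

```python
import random

def assign_age(age_range, persons_by_age, rng):
    """Draw a single year of age inside an age category.

    age_range: (low, high) inclusive bounds of the person's age category.
    persons_by_age: mapping from year of age to observed person count.
    rng: a random.Random instance, so runs are reproducible from a seed.
    """
    low, high = age_range
    years = list(range(low, high + 1))
    weights = [persons_by_age.get(year, 0) for year in years]
    if sum(weights) == 0:
        # No observations in this category: fall back to a uniform draw.
        return rng.choice(years)
    return rng.choices(years, weights=weights, k=1)[0]
```

Passing an explicit `random.Random(seed)` keeps the draw consistent with the seeded runs used elsewhere in the evaluation.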
As the method is guaranteed to generate only legal household configurations according to the specified rules, the main focus of this section is the structural aspects of the synthesised population. First, we illustrate that the synthesised populations are similar to the input distributions, and then we show that our method generates statistically better populations than the IPU based method described by Ye et al. ( ).
We selected SA s from the Darebin and Banyule local government areas and, for each person and household distribution pair, constructed different synthetic populations with varying random seed values. The distribution of persons of a constructed population was obtained by counting the number of persons that fall under each person type, and the household distribution by counting the number of households under each household category. The goodness of fit of each synthesised population was evaluated using the FT test. Categories that represent impossible person types (e.g. (Male, Married, age 0-15)) and household types (e.g. 1 person, 2 families: couple family with children) are not included in the tests. So the number of person categories used for the tests reduced from to and household type categories from to .
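For reference, the Freeman-Tukey statistic in its commonly used form, FT = 4 Σ (√O_i − √E_i)², can be computed as in the sketch below. The function name and the list-based interface are assumptions for illustration; the p-value would then be obtained from a chi-squared distribution with the appropriate degrees of freedom, which is left to a statistics library.

```python
from math import sqrt

def freeman_tukey(observed, expected):
    """Freeman-Tukey statistic: 4 * sum((sqrt(O_i) - sqrt(E_i))^2).

    observed: category counts in the synthesised population.
    expected: category counts in the input (census) distribution.
    Impossible categories should be dropped from both lists beforehand.
    """
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return 4.0 * sum((sqrt(o) - sqrt(e)) ** 2 for o, e in zip(observed, expected))
```

Unlike Pearson's chi-squared statistic, this form does not divide by the expected count, so categories with an expected count of zero do not cause a division by zero.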
Table presents the FT test outcomes for each SA : the number of population instances out of for which H0 was rejected, the highest observed p-value, the lowest observed p-value, the mean of the p-values and their standard deviation (SD). The degrees of freedom for the person level FT tests is and for the household level is . None of the population instances in the table were deemed inconsistent; in fact, the p-values of all the population instances shown in the table are over . , which is very promising. The very small standard deviations indicate that the algorithm's results are consistent and that multiple runs are not required to obtain the best result in most cases. Below, we further compare the synthesised population to the census distributions using q-q plots.

The post hoc power analysis showed high power in all the individual level FT tests. We used a . significance level, degrees of freedom (because there are agent types) and a small effect size of . according to the guidelines proposed by Cohen ( ). All the tests have very high power in this exercise because of the large population sizes (sample size). The lowest power was observed in the tests conducted on the Ivanhoe East - Eaglemont SA , which is . (population size = ). This indicates a . probability of type II error. Additionally, SA s resulted in a power of because of their relatively large population size. For the power analysis of tests on household level distributions, we used the same significance level and effect size, but degrees of freedom, corresponding to the household types. The lowest power was again observed in tests conducted on the Ivanhoe East - Eaglemont SA , which is . , a . probability of type II error, again attributed to the relatively small population. The highest power of . was observed for Preston.
For the evaluation, we obtained % microdata samples of the SA areas covering Greater Melbourne, which fully encompass the SA areas, under strict obligations not to share any disaggregated data. An SA consists of multiple non-overlapping SA s. The person and household marginal distributions were the same as before. To generate SA populations with the IPU based method, we used the whole microdata sample of the corresponding SA area and selected the best out of runs, as proposed by Ye et al. ( ). The number of IPU iterations was also increased to , to allow the algorithm to achieve the . goodness of fit level. For our algorithm, generating one population instance per SA was deemed sufficient, as there are minimal variations across different runs according to the [...]

Table a shows that . % ( out of ) of the population instances generated with our algorithm had p-values over . at household level and % ( ) at person level, but with the IPU based method only ( %) SA s produced p-values above . at household level and ( . %) at person level. It is also important to note that more than half ( ) of the IPU generated populations are inconsistent at household level, while none are with our algorithm. At person level only ( . %) SA s produced different populations with our algorithm, but the IPU based method produced ( %).
Brighton (Vic.) and Eltham are the two SA s for which our algorithm produced significantly different synthesised populations at person level. Their p-values are . and . . The census distributions of all SA s have discrepancies between the number of persons required to form the households and the number in the persons distribution. In these two SA s the discrepancies are, respectively, and persons, comparatively large numbers but only about a . % error considering each SA 's population size. However, the algorithm has successfully handled much larger errors in other SA s. Another common observation is that the Married and Children categories are the highest contributors to errors in the synthesised populations. Further investigations are required to understand the exact reasons why these two SA s fail to produce satisfactory results.

In Table b we further analyse how well the two algorithms have preserved the individual characteristics of each joint marginal distribution obtained from census data. Results show that our method preserves the distributions of individual characteristics comparatively better than the IPU based method. The Sex characteristic shows weaker results in both methods, because it has only two categories, which makes errors prominent when tested.
In general, the IPU based method's results reported here are weaker than those reported by Ye et al. ( ). Apart from the obvious difference between the two populations, our experiments differ from theirs because: a) zero household and person categories are not removed when synthesising the populations, b) a different statistical test is used and c) only impossible categories are removed from the statistical evaluation. It is also noteworthy that they have reported detailed results for only two blockgroup areas, though the population was constructed for a much larger area.
The current prototype implementation of the algorithm takes about five minutes on average for an SA on a computer with a Core i -. GHz processor and GB RAM. This is slower than the minutes that the IPU based method takes for the whole population. While there is room to improve the efficiency, the current implementation is still usable given the high accuracy rate and the fact that the synthetic population has to be generated only once for any simulation. The source code is available on Github .

Discussion
The paper proposed an application independent heuristic methodology for reconstructing agent populations with social structures without depending on disaggregated data samples. The methodology consists of a generic framework for specifying application heuristics and an algorithm that uses the framework constructs to synthesise the population. The main constructs specified in the framework are link rules, group rules and link conditions. The group construction process takes binned data distributions from different sources and produces the population by forming group structures according to the heuristics specified through the three framework constructs. The main steps of the process are merging the data distributions with IPF using an abstract seed, constructing an initial estimate of the group structures in the population based on Monte Carlo sampling according to the input distributions, and improving the initial estimate using a combinatorial optimisation technique. Currently, the approach requires converting both distributions to the same aggregation level due to its dependency on IPF. The heuristics specification framework, however, does not have to be at the same aggregation level.
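The first step, IPF with an abstract seed, can be illustrated on a two-dimensional table. This is a textbook-style sketch rather than the implementation used in this work: the function name `ipf`, the stopping rule and the toy marginals are assumptions. The seed holds 1 for possible cells and 0 for impossible ones, so impossible cells stay at zero throughout.

```python
def ipf(seed, row_targets, col_targets, iterations=100, tol=1e-9):
    """Fit a 2D table to row and column marginals by alternate scaling.

    seed: matrix of 1s (possible cells) and 0s (impossible cells).
    Zero cells remain zero because every update is multiplicative.
    """
    table = [[float(v) for v in row] for row in seed]
    for _ in range(iterations):
        # Scale each row to its target total.
        for i, target in enumerate(row_targets):
            total = sum(table[i])
            if total > 0:
                table[i] = [v * target / total for v in table[i]]
        # Scale each column to its target total.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in table)
            if total > 0:
                for row in table:
                    row[j] *= target / total
        # Columns are now exact; stop once the rows also agree.
        error = max(abs(sum(table[i]) - t) for i, t in enumerate(row_targets))
        if error < tol:
            break
    return table

# Toy example: the cell in the second row, first column is impossible.
fitted = ipf([[1, 1], [0, 1]], row_targets=[30, 20], col_targets=[10, 40])
```

In this toy case the only table consistent with both marginals and the structural zero is [[10, 20], [0, 20]], and the procedure converges to it.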
We demonstrated the versatility of the proposed approach by applying it to two case studies. The first one caters to the interest in building integrated agent based models (Singh & Padgham ; Wickramasinghe et al. ), by merging two populations from different ABMs to obtain a consistent merged population. The second case study constructs a synthetic population using Australian census data. These show that the proposed method can be used in different applications simply by changing heuristics via the framework, without developing a completely new program. Freeman-Tukey goodness of fit test results indicate highly consistent results, outperforming the state-of-the-art IPU (Ye et al. ) based approach.
The main contribution of this work is the generic heuristics specification framework. It allows generating different populations by changing the population heuristics without changing the underlying population construction algorithm, whereas existing sample-free heuristic population synthesis algorithms would require completely re-writing the population synthesis logic. Though similar work is also discussed in (Wickramasinghe et al. ), it does not elaborate the generic heuristic framework, and its algorithm differs from the one we have presented here.
Using IPF to merge data distributions is a common approach in synthetic population construction. However, in this work, we approximate the seed to a matrix of 1s and 0s indicating possible and impossible cells. Lovelace et al. ( ) experimentally show that the initial weights in the matrix have no significant influence on the final IPF outcome after iterations. While it can be argued that the difference between the abstract seed we use and the correct hypothetical seed is more extreme than the seed errors introduced in Lovelace et al.'s experiments,