Population Synthesis Based on Joint Distribution Inference Without Disaggregate Samples

Synthetic population is a fundamental input to dynamic micro-simulation in social applications. Based on a review of the current major approaches, this paper presents a new sample-free synthesis method that infers the joint distribution of the total target population. Convergence of the multivariate Iterative Proportional Fitting used in our method is also proved theoretically. The method, together with the other major ones, is applied to generate a nationwide synthetic population database of China using its overall cross-classification tables as well as a sample from the census. Marginal and partial joint distribution consistencies of each database are compared and evaluated quantitatively. The results show that sample-based methods perform better on marginal indicators, while sample-free ones match partial distributions more precisely. Among the five methods, our proposed method significantly reduces the computational cost of generating a large-scale synthetic population. An open source implementation of the population synthesizer based on C# used in this research is available at https://github.com/PeijunYe/PopulationSynthesis.git.


Introduction
Computer-based micro-simulation has been increasingly applied to research on transportation, land use, demographic evolution, and other complex social systems involving human participation (Zhao et al. ), (Geard et al. ), (Ye et al. ). This trend is attributed to several advantages of the approach. First, traditional mathematical models, especially statistical ones, tend to build a homogeneous model for a specific human group and then replicate it in large numbers to examine travel behavior. Such an approach naturally sacrifices the heterogeneity of individuals in its scenarios. In this respect, micro-simulation is a preferred choice because it can model the differences between individuals, where differences matter, and the many possible combinations of the considered characteristics. Second, a challenge faced by social scientists is that certain types of controlled experimentation are difficult, or sometimes impossible, to implement. Hypotheses regarding the responses of individuals to specific policies or social events cannot be easily tested. Micro-simulation has thus become an alternative approach for policy evaluation and the analysis of related social problems.

The foundation of micro-simulation used in social applications is composed of an initial baseline population with the attributes of interest (such as residential location and car ownership) and the subsequent behavioral models of its members. Various social phenomena at the systemic level emerge from the updates of individuals. Being able to generate a synthetic population is therefore of great importance for this type of micro-simulation. A realistic population provides the simulation with a reasonable and reliable basis, keeping its results valid and credible. Research on population synthesis generally concerns two aspects: methodology and representative synthetic databases. Four types of methods are regarded as the mainstream in current methodology: synthetic reconstruction (Wilson & Pownall ), (Deming & Stephan ), combinatory optimization (Williamson et al. ), sample free fitting (Gargiulo et al. ), (Barthelemy & Toint ), and the Markov Chain Monte Carlo (MCMC) simulation based method (Farooq et al. ). The former two were proposed decades ago, and both have been applied in several different projects, such as SimBritain and the USA population database from the Research Triangle Institute (RTI), to construct regional or nationwide population databases (Ballas et al. ), (Ballas et al. ), (Wheaton et al. ). The latter two are more recent. Up to now, they have only been used in a few population scenarios (VirtualBelgium and PRIMA) and thus need further testing and validation (Barthelemy ), (EU ).
Current synthetic populations are mostly built for western countries, all of which are supported by relatively rich data. For example, some countries, like Switzerland, have opened their entire census datasets, which provides researchers with many details. Others, like the USA and UK, often publish a small proportion of individual sample records for public use. However, many developing countries, like China, usually lack such micro-level data. In addition, their population sizes and demographic features are significantly different. As a developing country, China has the largest population in the world, which makes the synthetic process more complicated. Specific characteristics should be considered in population synthesis as well.
For example, apart from the identity card, the current social management system of China grants each citizen a household registration. This unique regional attribute dramatically influences the distribution and migration of people. Traditionally, people gathered in their own registered locations. The current fast urbanization has changed this pattern rapidly, causing large groups of people to dwell in a city other than the one where they are registered. These factors have brought new challenges to current methodologies.
According to whether individual samples are essential, the existing methods for generating synthetic populations can be categorized as sample-based methods and sample-free ones. The sample-based methods (see, e.g. Beckman et al. ( ), Guo & Bhat ( ) and Auld & Mohammadian ( )) require disaggregate samples as their initial populations and subsequently expand them to the desired scale by introducing overall constraints. For instance, combinatory optimization needs an individual dataset to supply the candidates for its selection. Without such micro-level samples, it cannot complete the population synthesis. Evidently, this type of method relies on the availability and quality of samples. In many countries, detailed individual information such as census data or a small proportion of sample is inaccessible to the general public. Even if a sample is available, it may not include all types of individuals, so the generated synthetic population will also lack the missing groups (this is called the zero element problem (Beckman et al. )). To overcome this deficiency, methods that do not rely on samples have been proposed in recent years. We refer to these methods as sample-free methods in this paper. In contrast to sample-based methods, sample-free methods can generate a synthetic population without samples. For instance, although the transfer probabilities in MCMC can be derived from a sample, they are generally calculated directly from the total population constraints. In other words, MCMC does not depend on micro-level samples. However, existing sample-free methods mostly operate on individuals directly, which results in high computational costs when the scale of the synthetic population is large. These factors motivated us to develop a new, efficient method that is suitable for large-scale population generation. Starting from marginals and partial joint distributions, the developed method directly estimates the distribution of the total target population without a sample.

The contribution of this paper is threefold: (1) to develop a new, efficient synthetic method that does not rely on disaggregate samples; (2) to give a theoretical proof of the convergence of multivariate Iterative Proportional Fitting; and (3) to test and compare the new method and other existing ones via a large-scale population synthesis (specifically, the nationwide synthetic population of China). Earlier comparative studies have mostly investigated the sample-based methods and concentrated on relatively small-area micro-populations, or have involved only one sample-free method. In this paper, the population scale is greatly expanded (over one billion) and our emphasis is put particularly on the sample-free methods. An open source implementation of the population synthesizer based on C#, as well as the statistical input and evaluation data used in this research, is available at https://github.com/PeijunYe/PopulationSynthesis.git.

The remainder of this paper is organized as follows. The background section reviews the four major methods that will be tested in the subsequent population synthesis and briefly summarizes their advantages and deficiencies. The section on joint distribution inference elaborates the new population synthesis method. To make the method more suitable for general use, we avoid relying on sample data, which is usually not available. The convergence proof of the multivariate iterative proportional fitting used in that section is given in the appendix. The data section introduces the data source that serves as both the input data and the evaluation criterion, together with our evaluation methods and indicators. Detailed results are presented in the Results section. Finally, the last section concludes the paper with some additional discussion and possible future directions.

Background of Four Major Methods
The primary objective of population synthesis can be summarized as generating an individual dataset in full compliance with the statistical characteristics of various input data. In other words, the synthetic process must generate a population list, sometimes with its corresponding instances, that conforms to the aggregate indicators. This sort of population is regarded as one of the "best possible" estimates of the actual one. It retains particular demographic properties while omitting actual personal details, and can thus be treated as an alternative to micro data acquisition for the demonstration and prediction of social phenomena.
Traditionally, the synthetic process starts with collecting various data on the target population to acquire essential information. Each individual record contains some or all of the studied attributes, such as age group, gender, and geographic location. A census typically serves as the primary data source, since it contains the most comprehensive features of the target population. However, complete census data is usually not accessible to the general public due to national security concerns and individual privacy protection. The only input data available to researchers is often a set of statistical indicators published by the Bureau of Statistics and a small proportion of detailed sample with some privacy-related attributes omitted (in worse cases, even the detailed sample is missing). These two types of input information are called aggregate and disaggregate data respectively. Though faced with limited data, scholars have still developed a series of population synthesis methods. To our knowledge, four major methods dominate this field. According to their essential input data type, they are categorized into two classes, sample-based methods and sample-free methods, which are briefly reviewed below.
Iterative proportional fitting synthetic reconstruction
The Synthetic Reconstruction method, published by Wilson & Pownall ( ), is the first population synthesis approach and the most extensively used one. The central task of this method consists of two steps: estimating the joint distribution of the target population and realizing the individual data set. The synthetic process can thus be separated into two phases, called "Fitting" and "Allocation" respectively. Usually, the Iterative Proportional Fitting (IPF) procedure is adopted to estimate the joint distribution in the former phase (Deming & Stephan ), (Beckman et al. ), so this method is also referred to as Iterative Proportional Fitting Synthetic Reconstruction (IPFSR). In the "Fitting" phase, the IPF procedure requires both a set of disaggregate samples (the seed) and statistical indicators (the marginals) covering all the studied attributes. IPFSR is therefore a sample-based method. The basic hypothesis behind this approach is straightforward: the association structure of the joint distribution in the sample is assumed to be consistent with that of the target population. Therefore, it is only required to fit the frequency under each attribute combination to the marginal constraints. When the number of attributes is small, the joint distribution is represented as a Contingency Table (CT). In a higher dimensional case, it is represented as $f(X_1 = x_1, X_2 = x_2, \cdots, X_n = x_n)$, a distribution function where $n$ is the number of attributes and $x_i$ is the value of the $i$-th attribute. Initial frequencies are the individual counts from the input sample. During the $k$-th iteration, the frequency under each attribute combination is rescaled in turn against the marginal of each attribute, where $N_i$ denotes the marginal of the $i$-th attribute.
The convergence of the two-dimensional IPF procedure is proved by Pukelsheim ( ), where a necessary condition is given: all marginals must sum to the same total. To move forward, a proof of the multivariate case is presented in Appendix A. Among all the tables that satisfy the marginal constraints, the result that IPF yields most resembles the initial sample (Ireland & Kullback ), (Little & Wu ). Once the joint distribution of the target population has been estimated, the "Allocation" phase is much easier: the synthetic population can simply be drawn via the Monte Carlo method.
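The fitting step described above can be sketched in a few lines. The following is a minimal illustration of the standard IPF rescaling, not the authors' C# implementation; the function name `ipf`, the dictionary-based table layout, and the stopping tolerance are assumptions made for this example.

```python
def ipf(seed, marginals, tol=1e-9, max_iter=1000):
    """Fit an n-dimensional frequency table to one-dimensional marginals.

    seed      -- dict mapping attribute-value tuples to seed frequencies
    marginals -- list of dicts; marginals[i][v] is the target total N_i
                 for value v of the i-th attribute (hypothetical layout)
    """
    table = dict(seed)
    for _ in range(max_iter):
        max_change = 0.0
        for i, target in enumerate(marginals):
            # Current totals of the table along attribute i.
            totals = {}
            for cell, freq in table.items():
                totals[cell[i]] = totals.get(cell[i], 0.0) + freq
            # Rescale every cell so that attribute i matches its marginal.
            for cell, freq in table.items():
                if totals[cell[i]] > 0:
                    new = freq * target[cell[i]] / totals[cell[i]]
                    max_change = max(max_change, abs(new - freq))
                    table[cell] = new
        if max_change < tol:  # no cell moved noticeably in a full sweep
            break
    return table


seed = {("m", "y"): 1.0, ("m", "o"): 2.0, ("f", "y"): 3.0, ("f", "o"): 4.0}
targets = [{"m": 40.0, "f": 60.0}, {"y": 50.0, "o": 50.0}]
fitted = ipf(seed, targets)
```

Note that a cell with a zero seed frequency stays at zero however many sweeps are run, which is the zero element problem mentioned in the introduction.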

Objectively, IPFSR is straightforward and has a solid theoretical foundation. It also has many derivatives (Auld & Mohammadian ), (Muller & Axhausen ), (Pritchard & Miller ). However, one major problem that has emerged from practice is that the method relies excessively on the sample. This has limited its application, because disaggregate input is often unavailable. Even if a sample of the target population is acquired, the result depends greatly on its quality. In the worst case, if the sample does not include any individual with a specific attribute combination, the synthetic population will not contain this type of person either. To tackle this "zero element" problem, one simple approach is to replace the zero frequencies with a small positive value. Yet this introduces additional arbitrary bias into the association structure (Guo & Bhat ).

Combinatory optimization
Combinatory optimization (CO) is another sample-based method (Williamson et al. ). The CO synthetic process is iterative: starting from an initial population set chosen randomly from the overall sample, an assessment is conducted after randomly replacing one of the selected individuals with a new one from the overall sample. If the replacement improves the fitness of the population set to the distribution table, the two individuals are swapped; otherwise the swap is not carried out. This process is repeated many times, with the aim of gradually improving the fitness of the selected population set. Given a convergence error, the final result approximates the statistical constraints, and the synthetic population is obtained. The basic idea behind CO is somewhat similar to a genetic algorithm without crossover and mutation. Some important contributions to this method come from the National Center for Social and Economic Modeling (NATSEM) at the University of Canberra (Williams ), (Hellwig & Lloyd ), (Melhuish et al. ).
The CO method has several derivatives and is more suitable for generating a synthetic population for a small area from a larger sample (Abraham et al. ), (Ma & Srinivasan ), (Huynh et al. ). The reason is that, when the scale of the target population is large, the fitness variation produced by a single swap is overwhelmed by computational truncation. To avoid this, the target population is usually divided into several smaller parts, each of which is generated in turn. In addition, as a sample-based method, CO depends on the input sample just as IPFSR does. Groups omitted from the sample will not be included in the final population either.
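The swap-and-assess loop described above can be illustrated as follows. This is a minimal sketch under stated assumptions: fitness is taken to be the total absolute error against one-dimensional marginal counts, and the names `total_error` and `combinatorial_optimization` are hypothetical rather than taken from any published implementation.

```python
import random


def total_error(selection, sample, targets):
    """Total absolute deviation of the selected set from each target table."""
    error = 0
    for attr, target in targets.items():
        counts = {}
        for idx in selection:
            value = sample[idx][attr]
            counts[value] = counts.get(value, 0) + 1
        values = set(target) | set(counts)
        error += sum(abs(counts.get(v, 0) - target.get(v, 0)) for v in values)
    return error


def combinatorial_optimization(sample, targets, size, max_swaps=10000, seed=0):
    """Hill-climbing selection of `size` sample records against the targets."""
    rng = random.Random(seed)
    # Initial population set chosen randomly (with replacement) from the sample.
    selection = [rng.randrange(len(sample)) for _ in range(size)]
    error = total_error(selection, sample, targets)
    for _ in range(max_swaps):
        if error == 0:
            break
        pos = rng.randrange(size)
        old = selection[pos]
        selection[pos] = rng.randrange(len(sample))  # tentative replacement
        new_error = total_error(selection, sample, targets)
        if new_error < error:
            error = new_error        # keep the swap: fitness improved
        else:
            selection[pos] = old     # otherwise undo it
    return [sample[i] for i in selection], error


sample = [
    {"gender": "m", "age": "y"}, {"gender": "m", "age": "o"},
    {"gender": "f", "age": "y"}, {"gender": "f", "age": "o"},
]
targets = {"gender": {"m": 6, "f": 4}, "age": {"y": 5, "o": 5}}
population, error = combinatorial_optimization(sample, targets, size=10)
```

Because a swap changes the error by at most a few units, the relative change shrinks as `size` grows, which is one way to read the truncation issue noted above for large target populations.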

Sample free fitting
In most countries, a disaggregate sample is difficult to acquire. Thus a type of emerging synthetic technique called the sample-free method has been proposed in recent years (Gargiulo et al. ), (Barthelemy & Toint ). This method only adopts marginals and/or conditionals of partial attributes from various data sources as its input, which gives more flexibility in terms of data requirements. In the synthetic process, an individual pool at the scale of the target population is first generated according to the most disaggregate data source. The missing attributes are then initialized by drawing randomly from their value sets. Once all the characteristics of interest are defined, the initial individual pool has been constructed. In an ideal case, this individual set should satisfy all the conditionals and marginals, which are determined by the unique joint distribution of the target population. However, this does not always happen, because conditionals and marginals from different data sources may conflict. When they are not consistent, attribute shifts need to be performed on some individuals. The shifts apply to discrete attributes and are allowed between two contiguous modalities only.

Objectively, sample free fitting (SFF) relaxes the restrictions on data sources. Yet it is time consuming and memory expensive, because it generates an individual pool. Specifically, when a large population is generated, every individual may be operated on during attribute shifts, which costs a great deal of time in database I/O operations. Compared with updating the distribution directly, this approach is much more complicated.
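To make the attribute shift concrete, the sketch below moves individuals between contiguous modalities of one ordered attribute until its counts match a target marginal. It is a deliberately simplified illustration, assuming a single attribute and a target whose total equals the pool size; the function name `shift_to_marginal` is hypothetical, and the published SFF procedure handles several constraints at once.

```python
def shift_to_marginal(pool, attr, modalities, target):
    """Shift individuals between adjacent modalities until counts match target.

    pool       -- list of dicts, one per individual (modified in place)
    modalities -- ordered list of the attribute's values
    target     -- dict value -> required count (must sum to len(pool))
    """
    counts = {m: 0 for m in modalities}
    for person in pool:
        counts[person[attr]] += 1
    # A few left-to-right sweeps are enough for mass to travel end to end.
    for _ in range(len(modalities)):
        if counts == target:
            break
        for j in range(len(modalities) - 1):
            here, nxt = modalities[j], modalities[j + 1]
            diff = counts[here] - target[here]
            if diff > 0:      # surplus here: push it to the next modality
                src, dst, needed = here, nxt, diff
            else:             # deficit here: pull from the next modality
                src, dst, needed = nxt, here, -diff
            moved = 0
            for person in pool:   # every shift touches individual records
                if moved == needed:
                    break
                if person[attr] == src:
                    person[attr] = dst
                    moved += 1
            counts[src] -= moved
            counts[dst] += moved
    return pool


pool = [{"age": "old"} for _ in range(10)]
modalities = ["young", "middle", "old"]
target = {"young": 5, "middle": 5, "old": 0}
shift_to_marginal(pool, "age", modalities, target)
```

Even in this reduced form, each correction scans the individual pool, which illustrates why operating on individuals is costly at large scale.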

Markov Chain Monte Carlo simulation
Markov Chain Monte Carlo (MCMC) simulation is a stochastic sampling technique for estimating an overall distribution when the actual joint distribution is hard to access. Its theoretical foundation is that, if the stationary distribution of a Markov chain is a multivariate probability distribution, a sequence of observations from that distribution can be approximately obtained through the chain. When applied to population synthesis, the method first constructs a Markov chain from the conditional transfer probabilities of each attribute of interest (Farooq et al. ), (Casati et al. ). Samples are then extracted from the chain at a particular interval, which is called Gibbs sampling. After the Gibbs sampler has run for many iterations and reached a stationary state, this process is regarded as drawing individuals from the actual population. Thus the individual data set obtained can be used directly as the synthetic population. Although the conditional transfer probabilities may be calculated from a disaggregate sample, it is not essential to do so. In practice, they are usually constructed from the partial views given by various data sources. MCMC simulation is therefore another sample-free synthetic method.
MCMC simulation is able to deal with both discrete and continuous attributes, which extends the scope of the studied characteristics. However, when conditionals from different data sources are inconsistent, Gibbs sampling may never reach a unique stationary state, which prevents a valid population from being drawn. In addition, the state transfer of the Markov chain during discrete sampling may be computationally expensive. As an emerging approach, this method needs further testing and validation.
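A Gibbs sampler of the kind described above can be sketched as follows. This is a minimal illustration with two binary attributes; the conditional tables, the burn-in length and the thinning interval are made-up values for the example, and `gibbs_sample` is a hypothetical name.

```python
import random


def gibbs_sample(conditionals, initial, n_samples, burn_in=1000, thin=10, seed=0):
    """Draw individuals by resampling each attribute from its conditional.

    conditionals -- dict attr -> function(state) returning {value: probability}
    """
    rng = random.Random(seed)
    state = dict(initial)
    samples = []
    total_steps = burn_in + n_samples * thin
    for step in range(1, total_steps + 1):
        for attr, cond in conditionals.items():
            dist = cond(state)
            values = list(dist)
            weights = [dist[v] for v in values]
            state[attr] = rng.choices(values, weights=weights)[0]
        # Keep one state every `thin` steps once the burn-in has passed.
        if step > burn_in and (step - burn_in) % thin == 0:
            samples.append(dict(state))
    return samples


# Illustrative conditionals, consistent with the joint distribution
# P(m,y)=0.1, P(m,o)=0.4, P(f,y)=0.4, P(f,o)=0.1.
def gender_given_age(state):
    return {"m": 0.2, "f": 0.8} if state["age"] == "y" else {"m": 0.8, "f": 0.2}


def age_given_gender(state):
    return {"y": 0.2, "o": 0.8} if state["gender"] == "m" else {"y": 0.8, "o": 0.2}


conditionals = {"gender": gender_given_age, "age": age_given_gender}
samples = gibbs_sample(conditionals, {"gender": "m", "age": "y"}, n_samples=2000)
```

Each retained individual costs `thin` full sweeps over the attributes, which is the time cost of discrete sampling mentioned above.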

Population Synthesis Based on Joint Distribution Inference
As indicated in the previous section, IPFSR treats the association in the disaggregate sample as that of the target population, while CO does not preserve the complete association. Both of them ignore sample deviation, and thus the generated population will not contain the types of individuals that are not included in the sample. Sample free fitting operates directly on individuals, which is computationally expensive. When generating a large-scale population, MCMC simulation first constructs an individual pool to represent the joint distribution and then draws the synthetic population directly from the pool. When the combinations of variables are complicated, however, the pool expands quickly in order to cover each potential type of individual, and discrete sampling then takes a long time. Consequently, it is desirable to develop a new, efficient synthetic method with the following highlights:
1. Infer the joint distribution of the target population directly. This avoids redundant operations on individuals when adjusting each attribute.
2. Use marginals and partial joint distributions as much as possible. This maximizes the use of the overall target population information and minimizes sample deviations.
3. Not necessarily require disaggregate samples. This makes the approach applicable to general cases even if samples are not available.
4. Have a solid statistical and mathematical basis.
Starting from these objectives, our new method is composed of two steps: independence test and association inference. The accompanying figure shows the main steps of the method, in which the association inference is the most crucial.

Independence test
The independence test aims to validate whether two variables are independent of each other. According to probability theory, two random variables $X$ and $Y$ are called independent if and only if

$$F(x, y) = F_X(x) \cdot F_Y(y), \quad \forall (x, y) \in D$$

where $D$ is the field of probability definition, $F$ is the joint cumulative distribution function, and $F_X$ and $F_Y$ are the marginal distribution functions. Suppose the studied individual attributes are $(X_1, X_2, \cdots, X_n)$. For convenience, only discrete cases are considered here, because continuous variables can be split into several intervals and converted into discrete ones. Consequently, the definition of independence can be written in the form of the probability mass function

$$f(x, y) = f_X(x) \cdot f_Y(y), \quad \forall (x, y) \in D$$

At the beginning of the synthesis, an independence test is conducted between every two attributes. Chi-square testing is usually used for this task. For any two attributes $X_i$ and $X_j$, let $X_i$ take the values $x_i^{(1)}, \cdots, x_i^{(R)}$ and $X_j$ take the values $x_j^{(1)}, \cdots, x_j^{(S)}$. Let the null hypothesis be $H_0$: $X_i$ and $X_j$ are independent, and the alternative hypothesis be $H_1$: $X_i$ and $X_j$ are associated. The Chi-square value can be computed by

$$\chi^2 = \sum_{r=1}^{R} \sum_{s=1}^{S} \frac{\left(A_{rs} - A_{r\cdot} A_{\cdot s} / A\right)^2}{A_{r\cdot} A_{\cdot s} / A}$$

where $A_{rs}$ stands for the number of individuals with $X_i = x_i^{(r)}$ and $X_j = x_j^{(s)}$, $A_{r\cdot} = \sum_s A_{rs}$, $A_{\cdot s} = \sum_r A_{rs}$, and $A = \sum_r \sum_s A_{rs}$. The corresponding degree of freedom is $(R - 1) \cdot (S - 1)$. According to hypothesis testing theory, given a significance level, if the Chi-square value is in the acceptance region of the $\chi^2((R - 1) \cdot (S - 1))$ distribution, $X_i$ and $X_j$ are deemed to be independent; otherwise, they are associated.
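The test above can be sketched as follows. This is a minimal illustration that assumes a cross-classification table with no empty row or column; the function names are hypothetical, and the critical value is passed in by the caller (3.841 is the standard 5% critical value for one degree of freedom).

```python
def chi_square_statistic(table):
    """Pearson Chi-square statistic for an R x S table of counts A_rs.

    Returns (chi2, degrees_of_freedom). Assumes every row and column
    total is positive, so that all expected counts are non-zero.
    """
    R, S = len(table), len(table[0])
    row = [sum(table[r]) for r in range(R)]
    col = [sum(table[r][s] for r in range(R)) for s in range(S)]
    total = sum(row)
    chi2 = 0.0
    for r in range(R):
        for s in range(S):
            expected = row[r] * col[s] / total  # count expected under H0
            chi2 += (table[r][s] - expected) ** 2 / expected
    return chi2, (R - 1) * (S - 1)


def is_independent(table, critical_value):
    """Accept H0 (independence) when the statistic does not exceed the critical value."""
    chi2, _ = chi_square_statistic(table)
    return chi2 <= critical_value


proportional = [[10, 20], [20, 40]]   # rows are proportional: independent
associated = [[30, 10], [10, 30]]     # strong diagonal: associated
```

With the 5% critical value for one degree of freedom, `is_independent(proportional, 3.841)` accepts independence and `is_independent(associated, 3.841)` rejects it.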

If the studied attributes are partitioned into two sets and it is validated that every $X_i \in \{X_1, X_2, \cdots, X_m\}$ is independent of every $X_j \in \{X_{m+1}, X_{m+2}, \cdots, X_n\}$, the joint distribution can be written as

$$f(x_1, \cdots, x_n) = f(x_1, \cdots, x_m) \cdot f(x_{m+1}, \cdots, x_n)$$

In other words, the joint distribution can be computed directly from partial distributions. Specifically, if one set degenerates into a single variable, which means $X_i$ is independent of every $X_j \in \{X_1, X_2, \cdots, X_n\} \setminus \{X_i\}$, then

$$f(x_1, \cdots, x_n) = f_{X_i}(x_i) \cdot f(x_1, \cdots, x_{i-1}, x_{i+1}, \cdots, x_n)$$

Consequently, the joint distribution can be represented as the product of a marginal and a partial distribution. This reduces the dimension of the problem and focuses the study on the last partial probability term only.
Without loss of generality, assume the last $(n - m)$ attributes are independent, and rewrite the equation above as

$$f(x_1, \cdots, x_n) = f(x_1, \cdots, x_m) \cdot \prod_{i=m+1}^{n} f_{X_i}(x_i)$$

The remaining task is to estimate $f(x_1, \cdots, x_m)$. Consider first three variables for which the partial views $f(x_1, x_2)$ and $f(x_1, x_3)$ are given and $f(x_1, x_2, x_3)$ is to be estimated. In theory, this problem has infinitely many solutions without any further information. Thus the basic idea is to construct one particular joint distribution that conforms to both of the partial views. The problem is divided into two cases.
1. $f(x_2, x_3)$ is known. The joint distribution can be written as

$$f(x_1, x_2, x_3) = f_{X_1}(x_1) \cdot f(x_2, x_3 \mid x_1)$$

where $f_{X_1}(x_1)$ can easily be calculated from $f(x_1, x_2)$ or $f(x_1, x_3)$. Since $x_2$ and $x_3$ are not independent, their associations should be estimated. This task can be completed via the IPF procedure. The Contingency Table involves two variables, $(x_2, x_3)$, and its initial seed is set according to $f(x_2, x_3)$. For each given $x_1$, the IPF procedure is conducted and the corresponding marginals are computed as follows:

$$f(x_2 \mid x_1) = \frac{f(x_1, x_2)}{f_{X_1}(x_1)}, \qquad f(x_3 \mid x_1) = \frac{f(x_1, x_3)}{f_{X_1}(x_1)}$$

The result that IPF yields is denoted as $\tilde{f}(x_2, x_3)$, and its marginals conform to the two conditional distributions above. That is

$$\sum_{x_3} \tilde{f}(x_2, x_3) = f(x_2 \mid x_1), \qquad \sum_{x_2} \tilde{f}(x_2, x_3) = f(x_3 \mid x_1)$$

The joint distribution is estimated by

$$f(x_1, x_2, x_3) = f_{X_1}(x_1) \cdot \tilde{f}(x_2, x_3)$$

Comparing the two expressions for the joint distribution, it is easy to see that the conditional probability $f(x_2, x_3 \mid x_1)$ is approximated by $\tilde{f}(x_2, x_3)$. The reason for this operation is that $\tilde{f}(x_2, x_3)$ not only retains the associations between $x_2$ and $x_3$ but also satisfies the marginal constraints. It should be noted that the input $f(x_2, x_3)$ may be acquired from a more complicated partial distribution by summing over the other, unconsidered dimensions. This is a more general case, which is discussed later in this section.
2. $f(x_2, x_3)$ is unknown. In this case, none of the distributions contains associations between $x_2$ and $x_3$, which prevents us from estimating their joint distribution. Thus their association is simply treated as the product of their conditionals. That is

$$f(x_2, x_3 \mid x_1) = f(x_2 \mid x_1) \cdot f(x_3 \mid x_1)$$

Clearly, the operations in the two cases above extend the joint distribution. When this extension is repeated until all the variables of interest are included, the ultimate distribution is inferred.
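The two cases can be illustrated with a small sketch. It is a simplified reading of the procedure rather than the authors' implementation: the function names `fit_two_way` and `infer_joint`, the dictionary layout, and the fixed number of IPF sweeps are assumptions made for this example.

```python
def fit_two_way(seed, row_target, col_target, iters=200):
    """Two-dimensional IPF: rescale seed[(a, b)] to the given marginals."""
    table = dict(seed)
    for _ in range(iters):
        for axis, target in ((0, row_target), (1, col_target)):
            totals = {}
            for cell, value in table.items():
                totals[cell[axis]] = totals.get(cell[axis], 0.0) + value
            for cell in table:
                if totals[cell[axis]] > 0:
                    table[cell] *= target[cell[axis]] / totals[cell[axis]]
    return table


def infer_joint(f12, f13, f23=None):
    """Infer f(x1, x2, x3) from f(x1, x2), f(x1, x3) and, if known, f(x2, x3)."""
    f1 = {}
    for (x1, _), p in f12.items():
        f1[x1] = f1.get(x1, 0.0) + p          # marginal of x1
    x2_values = sorted({x2 for _, x2 in f12})
    x3_values = sorted({x3 for _, x3 in f13})
    joint = {}
    for x1, p1 in f1.items():
        if p1 == 0:
            continue
        cond2 = {x2: f12.get((x1, x2), 0.0) / p1 for x2 in x2_values}
        cond3 = {x3: f13.get((x1, x3), 0.0) / p1 for x3 in x3_values}
        if f23 is not None:
            # Case 1: keep the association in f(x2, x3), fit the conditionals.
            cond23 = fit_two_way(f23, cond2, cond3)
        else:
            # Case 2: no association information, use the product.
            cond23 = {(x2, x3): cond2[x2] * cond3[x3]
                      for x2 in x2_values for x3 in x3_values}
        for (x2, x3), p in cond23.items():
            joint[(x1, x2, x3)] = p1 * p
    return joint
```

In both cases the inferred table reproduces $f(x_1, x_2)$ and $f(x_1, x_3)$ when summed over the remaining variable; only the first case carries over the association between $x_2$ and $x_3$.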
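Now let us move to a general case. Write $X_{a \cdot b}$ for $(X_a, \cdots, X_b)$, and suppose the start distribution is $f(X_{1 \cdot k}, X_{(k+1) \cdot (k+s)})$ while the partial distribution to be merged is $f(X_{1 \cdot k}, X_{(k+s+1) \cdot (k+m)})$. If $f(X_{(k+1) \cdot (k+s)}, X_{(k+s+1) \cdot (k+m)})$ is known, then the IPF procedure is adopted to estimate the associations that conform to the conditional marginals

$$f(X_{(k+1) \cdot (k+s)} \mid X_{1 \cdot k}) = \frac{f(X_{1 \cdot k}, X_{(k+1) \cdot (k+s)})}{f(X_{1 \cdot k})}, \qquad f(X_{(k+s+1) \cdot (k+m)} \mid X_{1 \cdot k}) = \frac{f(X_{1 \cdot k}, X_{(k+s+1) \cdot (k+m)})}{f(X_{1 \cdot k})}$$

In order to construct consistent marginals, it is required to exclude the unconcerned dimensions by summing over these variables. That is

$$f(X_{1 \cdot k}) = \sum_{X_{(k+1) \cdot (k+s)}} f(X_{1 \cdot k}, X_{(k+1) \cdot (k+s)})$$

Then the joint distribution is

$$f(X_{1 \cdot (k+m)}) = f(X_{1 \cdot k}) \cdot \tilde{f}(X_{(k+1) \cdot (k+s)}, X_{(k+s+1) \cdot (k+m)})$$

where $\tilde{f}(X_{(k+1) \cdot (k+s)}, X_{(k+s+1) \cdot (k+m)})$ is the result of the multivariate IPF procedure. If the partial distribution $f(X_{(k+1) \cdot (k+s)}, X_{(k+s+1) \cdot (k+m)})$ is not available, the joint distribution can only be treated as the product of the marginals above:

$$f(X_{1 \cdot (k+m)}) = f(X_{1 \cdot k}) \cdot f(X_{(k+1) \cdot (k+s)} \mid X_{1 \cdot k}) \cdot f(X_{(k+s+1) \cdot (k+m)} \mid X_{1 \cdot k})$$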
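Summing out the unconcerned dimensions and forming conditional marginals, as the general case requires, can be sketched as follows. The helper names `marginalize` and `conditional`, and the representation of a distribution as a dictionary keyed by value tuples, are assumptions made for this illustration.

```python
def marginalize(dist, names, keep):
    """Sum a distribution over every variable that is not listed in `keep`.

    dist  -- dict mapping value tuples (ordered as in `names`) to probabilities
    names -- list of variable names, one per tuple position
    keep  -- list of variable names to retain, in the desired order
    """
    idx = [names.index(n) for n in keep]
    out = {}
    for cell, p in dist.items():
        key = tuple(cell[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out


def conditional(dist, names, given):
    """Return f(rest | given) as {given_values: {rest_values: probability}}."""
    g_idx = [names.index(n) for n in given]
    r_idx = [i for i in range(len(names)) if i not in g_idx]
    base = marginalize(dist, names, given)
    out = {}
    for cell, p in dist.items():
        g = tuple(cell[i] for i in g_idx)
        r = tuple(cell[i] for i in r_idx)
        if base[g] > 0:
            out.setdefault(g, {})[r] = p / base[g]
    return out


names = ["a", "b", "c"]
dist = {("x", 0, "p"): 0.1, ("x", 1, "p"): 0.2,
        ("y", 0, "q"): 0.3, ("y", 1, "p"): 0.4}
f_a = marginalize(dist, names, ["a"])
f_bc_given_a = conditional(dist, names, ["a"])
```

The conditional tables produced this way are the marginal constraints passed to IPF, one fit per combination of the shared variables.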
As in the simpler case above, the partial view $f(X_{(k+1) \cdot (k+s)}, X_{(k+s+1) \cdot (k+m)})$ used as the initial seed of IPF may be calculated from other, more complicated distributions, or even from sample data if it is accessible. The convergence of multivariate IPF in a general case is proved in Appendix A.
Results from these two kinds of questionnaires are the original individual records. They are usually confidential and are further summarized over a few variables to obtain marginal frequencies. For instance, if the original records concern (Gender × Residential Province × Age × Educational Level), the processed marginal frequencies may contain partial attributes such as (Gender × Residential Province) or Educational Level. This operation is referred to as data aggregation. Some, but not all, marginal frequencies are published on the website of the National Bureau of Statistics (NBS) in the form of cross-classification tables (NBS a), (NBS ). Appendix B gives three examples of them. This sort of public-use data is the fundamental information for population synthesis. Though these tables do not reveal the complete joint distribution of all the attributes, they present its partial views in various dimensions.
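Data aggregation of this kind amounts to counting records over a subset of attributes. The sketch below uses made-up records and a hypothetical `aggregate` helper purely for illustration; it does not reproduce any NBS table.

```python
from collections import Counter


def aggregate(records, attributes):
    """Count individual records by the chosen attributes only."""
    return Counter(tuple(rec[a] for a in attributes) for rec in records)


# Made-up individual records with four attributes.
records = [
    {"gender": "m", "province": "A", "age": 30, "education": "college"},
    {"gender": "f", "province": "A", "age": 41, "education": "primary"},
    {"gender": "m", "province": "B", "age": 25, "education": "college"},
    {"gender": "m", "province": "A", "age": 62, "education": "primary"},
]
by_gender_province = aggregate(records, ["gender", "province"])
by_education = aggregate(records, ["education"])
```

Each resulting table is a partial view of the same underlying joint distribution, which is what the synthesis methods take as input.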
In our study, we use the census data to generate a synthetic population. The overall cross-classification tables concern a number of characteristics, including population scale, gender, residential province, ethnic group, age, educational level, household scale and structure, information on deceased people, migration, housing condition, etc. According to the census evaluation, a small share of the population was omitted from the survey, which may slightly change the joint distribution. In addition, the census result does not contain information on military forces and organizations. These two groups of people are also ignored in the synthetic process. In other words, our target population consists of those investigated in the overall data set. Since the objective of our study is to generate a nationwide synthetic population, and the overall cross-classification tables cannot provide attributes as detailed as those in the sample and the Long Table, we concentrate on the basic individual attributes listed in the accompanying table. There, city refers to a municipality whose population density exceeds a specified threshold; town, outside the city, refers to the borough where the town government is located; and rural refers to the area other than city and town. The corresponding partial distributions given by the cross-classification tables are listed in a further table. Note that the Table Codes are the ones used by the NBS.

Sample
While total statistical characteristics can be easily acquired, the original results taken directly from the questionnaires are strictly protected by the government. Nevertheless, a small sample from the census has been collected. Its records account for only a small fraction of the total population. These records all come from the Long Table, and each of them gives detailed information on a particular individual. The attributes provided by the sample can be categorized into two types. One is the basic household information, such as household type (family or corporate), number of members, housing area, number of rooms, etc., while the other is detailed personal information.

In the RSSZm indicator, $O_{ki}$ is the generated count for the $i$-th cell of the $k$-th tabulation; $E_{ki}$ is the given (known) count for the $i$-th cell of the $k$-th tabulation; $N_k$ is the total count of tabulation $k$; and $C_k$ is the $\chi^2$ critical value at the chosen significance level for tabulation $k$ (where the degrees of freedom are treated as $n - 1$ for a table with $n$ cells).

Results
The five methods, IPFSR, CO, SFF, MCMC and the proposed method (referred to as JDI), are used to generate a synthetic population of China. This section presents the results and analyzes each synthetic population. Before the evaluation of the generated synthetic populations, the independence test results are presented according to the Chi-square test described earlier. Note that our proposed JDI method is a sample-free method and aims at providing a more efficient way than the two existing sample-free methods for large-scale population synthesis. Thus, besides comparing the different methods in terms of accuracy, their computational costs (in terms of execution time) are also measured and compared. Our goal is to show that the new method can generate a synthetic population with the same level of accuracy as the existing sample-free methods while being much more computationally efficient. Meanwhile, since our data includes samples, it is possible to generate synthetic populations using the sample-based methods. The results from these sample-based methods are also presented, and the comparison of all these different methods (sample-based and sample-free) for large-scale population synthesis is considered one of the contributions of this paper.
In our study, all the tables of partial distributions are used as the inputs of the sample-free methods. Each of the five methods generates a synthetic population at the scale of the total Chinese population. A proportion of each synthetic population matching the Long Table scale is randomly extracted for quantitative evaluation. In order to reduce the impact of randomness, each experiment is repeated and the averaged results are reported in this section. The evaluation is composed of two parts. First, since Registration Type and Registration Province are missing from the statistical results of the Long Table, the marginal consistency of the remaining attributes (Gender, Age Interval, Ethnic Group, Residential Province, Educational Level, Residence Type) is investigated. Second, partial joint distributions of specific attributes are also given in the Long Table results. They can be used to calculate the RSSZm value in a more detailed way. The RSSZm indicator has been introduced in the previous section, and the partial joint distributions adopted as our reference criterion are shown in the corresponding table.

Marginal consistency
The population sizes derived from the Long Table are compared with those of the synthetic populations for each attribute. For the educational level, the census only investigates individuals above a minimum age, so our evaluation also focuses on this part of the population. The marginal mean absolute errors (MAE) and root mean squared errors (RMSE) are reported in the corresponding table. Generally, the CO method performs better than the others. All of the methods show a relatively worse result for ethnic group, especially CO. The large MAE and RMSE deviations are also caused by the lack of specific types of individuals in the sample. The three sample-free methods have similar accuracy in marginal consistency.
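For reference, the two marginal error measures can be computed as in the sketch below. It uses the standard definitions of MAE and RMSE over the cells of one marginal; whether the paper applies them to raw counts or to percentages is not stated here, so the inputs are an assumption, as are the function names and the example numbers.

```python
import math


def mae(generated, reference):
    """Mean absolute error between two marginals given as {category: value}."""
    keys = set(generated) | set(reference)
    return sum(abs(generated.get(k, 0) - reference.get(k, 0)) for k in keys) / len(keys)


def rmse(generated, reference):
    """Root mean squared error between two marginals given as {category: value}."""
    keys = set(generated) | set(reference)
    squared = sum((generated.get(k, 0) - reference.get(k, 0)) ** 2 for k in keys)
    return math.sqrt(squared / len(keys))


# Made-up marginal counts for one attribute.
generated = {"a": 10, "b": 20}
reference = {"a": 12, "b": 16}
```

A category that is missing from the generated population counts as zero, so groups absent from a sample contribute their full reference size to both errors.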

Partial joint distribution consistency
Table gives the RSSZm results, which have been introduced before. The main deviation comes from the Gender × Ethnic Group distribution, and the three sample-free methods generate better population databases than the sample-based ones. The reason is that the sample-free methods take the associations among individual attributes, as reflected by partial joint distributions, as their inputs, and these associations are derived from the whole target population. Thus the sample-free methods are able to directly manipulate total population associations rather than those of the sample, which most likely carries deviations from the sampling process. The results also show that SFF and JDI have relatively smaller RSSZm values.
MAE and RMSE of partial joint distributions are also calculated in Table . It can be seen that CO and MCMC get the largest MAE and RMSE (most come from the Gender × Ethnic Group and Gender × Residence Type × Educational Level distributions), and IPFSR and SFF perform the best. Among the three sample-free methods, SFF is a little better. In summary, the population databases synthesized by the sample-based methods, especially CO, perform better on marginal indicators, while the sample-free methods generate populations that match partial joint distributions more precisely. SFF and JDI have similar accuracy among the sample-free methods, and both are slightly more accurate than MCMC.

Computational performance
Synthesizing a nationwide population database of China is a large project. Consequently, computational performance is another important metric that should not be ignored. To achieve a meaningful comparison of computational cost, we implemented all five methods in the same programming environment and ran them on the same computer. More specifically, we implemented the five methods in C# on the .NET Framework and ran the programs on an Intel Core i -CPU with GB RAM. The execution time is divided into two parts: distribution computing (if any) and population realization. The averaged results of the five methods are listed in Table .
It is clear that among the five methods, our proposed JDI method requires the least execution time.
Although CO has relatively accurate results as mentioned above, it is also the most computationally expensive method. During its computation, the total target population is partitioned into parts, which are generated one by one for convergence. The MCMC method has the second worst performance. The main reason is that MCMC builds its joint distribution by constructing an individual pool via discrete sampling from the Markov chain. According to MCMC theory, after the Gibbs sampler reaches its stationary state, successive sampling should be avoided in order to prevent correlations between two adjacently generated individuals. To this end, synthetic individuals are usually drawn at a particular interval rather than from each chain update. In our study, this interval is set to iterations, and the size of the individual pool is set to million. Therefore, MCMC requires a large amount of computation. The SFF method is a little better than MCMC. Its cost is mainly due to frequent updates of the individual pool, which cause substantial I/O operations on the database. Compared with RAM access, database manipulation is much slower. IPFSR is much faster than the above three. It directly fits the joint distribution of the population and draws individuals at a time. This approach, however, usually suffers from rapid growth in calculation as the number of investigated attributes increases. In our scenario, the theoretical number of attribute combinations is about 1.4 × 10^8 (including some zero cells in the joint distribution). Fitting these combinations is undoubtedly complicated and sharply increases the computational time. On the other hand, Monte Carlo realization of individual records needs conditional probabilities rather than the joint distribution itself. Thus, in contrast with IPFSR, our JDI method directly calculates conditional distributions after each variable expansion. To a certain extent, this reduces the computational complexity, since the attribute combinations are much fewer at the beginning of its computation. It should be pointed out that the access to the database in all five methods uses a single-threaded pattern.

A small area test
To further validate our proposed method, numerical experiments for a smaller area are also conducted.
We choose the population of Qingdao, a medium-sized coastal city, as our target population. The total scale of the target population is about . million, much smaller than that of the whole nation. Similarly, our available data source contains the regional overall cross-classification tables and the relevant Long Tables. But the attributes and their input distributions revealed by this regional data are not as sufficient as those of the nationwide data. Specifically, the Long Tables only give the Gender × District frequencies of individuals whose age is beyond . Like the nationwide scenario, the experiment is repeated times and a small proportion of each result is stochastically sampled according to the Long Table population scale. The sampled populations are evaluated respectively, and the averaged final result is shown in Figure . As can be seen, the largest error is -.% and most of them are below %. The three RSSZm values are . (JDI). This indicates that the latter two perform much better in general. As RSSZm measures the deviation between real and synthetic data, the results also show that the proposed algorithm is robust at the district level.

Conclusions and Discussions
The paper reviewed the four existing major population synthesis methods (synthetic reconstruction, combinatorial optimization, sample-free fitting, and MCMC simulation) and then presented a new sample-free method based on joint distribution inference. These methods are applied to synthesize a nationwide population database of China by using its cross-classification tables and a .% sample from the census. The methods are evaluated and compared quantitatively by measuring marginal and partial joint distribution consistencies and computational cost. Our results indicate that the two sample-based methods, IPFSR and CO, perform better on marginal indicators, whereas the sample-free ones, especially SFF and JDI, give relatively small errors on partial distributions. Moreover, the proposed JDI achieves the best computational performance among the five methods.
Cross-classification tables, the sample and the Long Tables used in this paper are all from the census data, where data inconsistency does not arise explicitly. As explained in Section IV, however, various data sources can be used as the inputs to generate a synthetic population. Adopting multiple input data sources may suffer from inconsistency among them. There are several mechanisms to determine the initial distribution. For example, a confidence evaluation of the two data sources can be conducted in advance, and the more trustworthy one is used as the starting point. Alternatively, if m = n (the variables in v but not in u can be folded to satisfy this assumption), a weighted average of the two distributions can be used as the initial distribution, where the weights α, β ∈ [0, 1], α + β = 1, are the degrees of confidence. Once the initial distribution is determined, the other variables are expanded analogously by investigating the marginal and partial distributions from both data sources. The full joint distribution will finally be obtained since each attribute is covered by at least one distribution.
In essence, this approach estimates a unique initial distribution as a benchmark, and retains the associations among attributes from multiple sources rather than a single one.
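The confidence-weighted initial distribution can be sketched as follows. This is only an illustration under the assumption that both sources describe the same cells (the m = n case after folding); the function name `blend_distributions`, the cell labels and all probabilities are hypothetical.

```python
def blend_distributions(f_u, f_v, alpha, beta):
    """Confidence-weighted average of two distributions over the same cells.

    f_u, f_v: dicts mapping a cell (tuple of attribute values) to a probability.
    alpha, beta: degrees of confidence, each in [0, 1], with alpha + beta = 1.
    """
    if not (0.0 <= alpha <= 1.0 and 0.0 <= beta <= 1.0):
        raise ValueError("alpha and beta must lie in [0, 1]")
    if abs(alpha + beta - 1.0) > 1e-12:
        raise ValueError("alpha + beta must equal 1")
    if set(f_u) != set(f_v):
        raise ValueError("both sources must cover the same cells (m = n)")
    return {cell: alpha * f_u[cell] + beta * f_v[cell] for cell in f_u}

# Hypothetical Gender x Residence Type distributions from two sources.
source_u = {("male", "urban"): 0.30, ("male", "rural"): 0.20,
            ("female", "urban"): 0.25, ("female", "rural"): 0.25}
source_v = {("male", "urban"): 0.26, ("male", "rural"): 0.24,
            ("female", "urban"): 0.24, ("female", "rural"): 0.26}
initial = blend_distributions(source_u, source_v, alpha=0.75, beta=0.25)
```

Since both inputs sum to one and the weights sum to one, the blended table is again a valid distribution, so it can serve directly as a starting point for variable expansion.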
As shown in Section II and Appendix D, CO usually partitions the target population into a set of smaller identical parts, and generates each of them to form the final result. Accordingly, the fitness criterion for each small part is proportionally scaled down from the overall statistical marginals. Such an operation does not introduce bias into the CO method. Here we give a brief analysis. Let Pop_N be the total target population (unknown), S_N be the overall statistical marginals used as the criterion, and k be the partition number (an integer). Then for the i-th part, we have Pop_i = Pop_N / k and S_i = S_N / k. If we initially set a common convergence relative error ε_i = ε for every part, then when the iteration of each part stops, the error of the total target population equals the one for each part, regardless of how many parts are used. Thus the operation is equivalent to directly computing the overall population.
In practice, the partition number is determined by the sample size (to guarantee that each small part can be directly generated) and the computer memory (to guarantee that each part can be stored).
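The partition argument can be written out as a short derivation. It is a sketch under the assumption that the relative error of a part is measured as the absolute deviation from its scaled criterion divided by that criterion; this definition is an assumption, and only the symbols Pop_N, S_N, Pop_i, S_i, k and ε come from the text.

```latex
\varepsilon_i
  = \frac{\lvert Pop_i - S_i \rvert}{S_i}
  = \frac{\lvert Pop_N / k - S_N / k \rvert}{S_N / k}
  = \frac{\lvert Pop_N - S_N \rvert}{S_N}
  = \varepsilon ,
  \qquad i = 1, \dots, k .
```

The factor 1/k cancels in numerator and denominator, so under this definition the relative error of each part coincides with that of the whole population for any partition number k.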
For sample-based methods, the bias of the input sample will inevitably be retained in the final synthetic population. This can be avoided to some extent by sample-free techniques, which treat the total population features as their starting points. However, sample-free methods cannot benefit from the main advantage of a disaggregate sample, namely the associations among all the investigated attributes. Like the others, the JDI proposed in this paper uses partial joint distributions instead. But for those variables whose association with others cannot be estimated from known partial views (for instance, when only one marginal distribution involves a particular attribute, so that the associations between this attribute and other variables cannot be well estimated), a sample is usually a direct and beneficial supplement.
Several possible directions can be pursued in future research; two of them are put forward here. Firstly, census data obviously play a very important role in population synthesis, and even more so when using sample-free methods. However, input data of the desired quality are not always available, since the census results may contain errors introduced in the survey and statistical process. Therefore, other sorts of data, such as family and household data, need to be introduced to reduce such errors. Yet how to resolve the conflict between individual-level and household-level constraints requires further study. Secondly, large-scale population synthesis and the subsequent micro-simulation are very time-consuming and computationally expensive. This low efficiency usually reduces the effectiveness of computer-aided decision making. Two possible approaches may contribute to the solution of this problem. One is expanding the computational resources by introducing parallel and distributed computing. It involves massive data and their synchronization in a distributed pattern, and thus deserves careful research. The other concerns the optimization of algorithms and simulation models. Building an extremely detailed agent model in the subsequent micro-simulation is infeasible and unnecessary. Thus, the principle and framework for establishing appropriate agent behavioral models under available computational resources also need to be carefully considered.
Define L error as Then the L error monotonically decreases during its iteration.
The general case is discussed below.
Then the L error monotonically decreases during its iteration.
Proof. The L errors are . . .
As in the -dimensional case, the first item of L^(1) has the same sign) On the other hand, This is because each has the same sign given x1. Thus Similarly, there is Thus L^(1)(k) ≥ L^(1)(k + 1). The other inequalities L^(i)(k) ≥ L^(i)(k + 1) (i = 2, · · · , n) can be proved analogously.
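The monotonic behaviour stated above can be checked numerically in the two-dimensional case. The sketch below is a plain textbook Iterative Proportional Fitting loop, not the multivariate procedure of this paper, and it assumes that the error is the L1 distance between the fitted and the target marginals (the definition is not reproduced in this section). The seed table and targets are hypothetical.

```python
def l1_error(table, row_targets, col_targets):
    """L1 distance between the table's marginals and the target marginals."""
    row_err = sum(abs(sum(row) - t) for row, t in zip(table, row_targets))
    col_err = sum(abs(sum(row[j] for row in table) - t)
                  for j, t in enumerate(col_targets))
    return row_err + col_err

def ipf_2d(seed, row_targets, col_targets, iterations=20):
    """Fit a positive 2-D table to row and column targets.

    Returns the fitted table and the L1 error recorded after every
    half-step (one row scaling or one column scaling).
    """
    table = [row[:] for row in seed]
    errors = []
    for _ in range(iterations):
        # Row step: scale each row so that it sums to its target.
        for i, target in enumerate(row_targets):
            s = sum(table[i])
            table[i] = [v * target / s for v in table[i]]
        errors.append(l1_error(table, row_targets, col_targets))
        # Column step: scale each column so that it sums to its target.
        for j, target in enumerate(col_targets):
            s = sum(row[j] for row in table)
            for row in table:
                row[j] *= target / s
        errors.append(l1_error(table, row_targets, col_targets))
    return table, errors

# Hypothetical 2 x 2 seed with consistent targets (both sum to 100).
fitted, errors = ipf_2d([[1.0, 2.0], [3.0, 4.0]], [40.0, 60.0], [30.0, 70.0])
```

In this example the recorded errors form a non-increasing sequence, in line with the statement that the error does not grow from one iteration to the next.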

Figure : Flow Chart of Population Synthesis

Figure : Comparison of Gender (Frequencies and Relative Errors)

The age interval marginal comparison is shown in Figure . Similarly, the two sample-based methods match the Long Table data more accurately, while the three sample-free ones show minor deviations. The proposed JDI method performs similarly to the other sample-free methods. The trend of the lines also indicates two "baby booms" in Chinese history.
· · · , xm) by its marginals and partial joint distributions. Suppose that the information related to each variable is contained in the known partial and marginal distributions (this is always satisfied). For a particular distribution, the disaggregate level is defined as the number of variables it contains. The partial distribution with the highest disaggregate level should be selected as the starting point.
For example, if there are two distributions, Residence Type × Residential Province × Ethnic Group × Gender (disaggregate level: 4) and Age Interval × Ethnic Group × Gender (disaggregate level: 3), the former should be the starting point. If two partial joint distributions have the same disaggregate level, it is preferable to select the one that contains more attribute values. This is because more direct details from partial views lead to a more accurate estimation. For example, when considering Gender × Residence Type × Residential Province × Ethnic Group ( values) and Gender × Residence Type × Residential Province × Age Interval ( values), it is preferable to choose the former. Once the starting point is determined, it needs to be expanded by investigating other partial distributions. In the following, our discussion begins with a simple example to illustrate the method and then proceeds to the general case.
Suppose the start distribution is f(x1, x2). Consider another distribution f(x1, x3), where x1 is the common variable. Our objective is to estimate f(x1, x2, x3).
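The rule for choosing the starting distribution described above can be sketched as a simple ranking. Two points are assumptions: "more attribute values" is interpreted here as the number of cells of the partial distribution, and the domain sizes are illustrative placeholders rather than the actual census category counts. The function name `pick_start_distribution` is hypothetical.

```python
def pick_start_distribution(candidates, domain_sizes):
    """Pick the partial distribution to start from.

    candidates: list of tuples of attribute names, one tuple per distribution.
    domain_sizes: dict mapping an attribute name to its number of values.
    Ranking: highest disaggregate level (number of variables) first; ties are
    broken by the number of cells (an assumed reading of "attribute values").
    """
    def rank(attrs):
        cells = 1
        for name in attrs:
            cells *= domain_sizes[name]
        return (len(attrs), cells)
    return max(candidates, key=rank)

# Placeholder domain sizes, for illustration only.
sizes = {"Gender": 2, "Residence Type": 3, "Residential Province": 31,
         "Ethnic Group": 58, "Age Interval": 21}

four_way = ("Residence Type", "Residential Province", "Ethnic Group", "Gender")
three_way = ("Age Interval", "Ethnic Group", "Gender")
start = pick_start_distribution([three_way, four_way], sizes)

with_ethnic = ("Gender", "Residence Type", "Residential Province", "Ethnic Group")
with_age = ("Gender", "Residence Type", "Residential Province", "Age Interval")
tie_break = pick_start_distribution([with_age, with_ethnic], sizes)
```

With these placeholder sizes, the four-variable table wins the first comparison, and the table containing Ethnic Group wins the tie, matching the two examples in the text.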

Table : Individual Attributes Considered for Nationwide Population

Like many other countries, China conducts its national census of the entire population every years. Between two adjacent censuses, a % population sample survey is conducted. In this paper, only the national census is considered. The two most recent censuses were in and . The whole target population is investigated through questionnaires under the guidance of census takers. Two kinds of questionnaires are applied in the census. One is called the Short Table, which involves several basic characteristics, whereas the other is the Long Table, which not only contains all the content of the short one but also includes additional detailed features such as migration pattern, educational level, economic status, marriage and family, procreation, and housing condition. The Long Table is filled in by individuals stochastically selected in advance (about .% according to the official sampling rule), while the Short Table is completed by the rest.

Table : Partial Distributions Used in Population Synthesis

the other is the personal detailed attributes that involve employment status, occupation, marital status, number of children, etc. Individual attributes contained in the sample and the Long Table are illustrated in Appendix C and can also be found on the website (NBS b). In contrast with the Short Table, the Long Table includes additional details, which reflect the structure and composition of the target population much more concretely. People surveyed by this type of table are stochastically determined by NBS and each provincial government. Their scale accounts for .% of the total population. Their statistical results, in the form of cross-classification tables, can also be accessed from the official website. Since no other data are available in this study, these tabulations are treated as our criterion to evaluate each synthetic population. The indicator adopted in our evaluation is the modified overall Relative Sum of Squared Z-scores (RSSZm) (Huang & Williamson). For a given similarly scaled subset of the population being compared to the corresponding tabulations of the Long Table, that is: Table are identical to the sample. It should be pointed out that the statistical results from the Long Table do not provide information about Registration Province and Registration Type. Thus our evaluation is only established on the remaining 6 characteristics.

Table : Independence Test Results. The first number is the χ² value (×10^4); the two in parentheses are the degrees of freedom and the p-value (significance level α = 0.1)

Table : Partial Distributions Used in Population Evaluation

Table shows the Chi-square values between every two attributes. These results all come from cross-classification tables of the total population. Note that the first line in each box is the Chi-square value in units of 10^4, and the two items in the second line of each box are the degrees of freedom and the p-value (significance level α = 0.1), respectively. As can be seen, no attribute is independent of the others. This leads us to infer associations among all variables.
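The quantities reported in the independence test table can be illustrated with a minimal sketch of the standard Pearson chi-square statistic for a two-way cross-classification table. It is assumed here that the paper's test equation has this standard form; the counts below are hypothetical, and the p-value is omitted because it requires the chi-square distribution function from a statistics library.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a 2-D table.

    table: list of rows of observed counts (all row and column sums positive).
    Expected counts assume independence: row_sum * col_sum / total.
    """
    row_sums = [sum(row) for row in table]
    col_sums = [sum(row[j] for row in table) for j in range(len(table[0]))]
    total = sum(row_sums)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof

# Hypothetical 2 x 2 cross-classification (e.g. Gender x Residence Type).
chi2, dof = chi_square_statistic([[10, 20], [30, 40]])
```

A statistic that is large relative to its degrees of freedom yields a small p-value, in which case independence between the two attributes is rejected at the chosen significance level.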

Table and
The minor difference stems from the stochastic sampling according to .%. Figure gives the statistical result of the gender marginal.
As can be seen, all errors are below , which accounts for .% (male) and .% (female). The sample-

Figure gives the comparison of ethnic groups, among which the major one, the Han group, accounts for over % and is shown by the line with .%, .%, .%, .%, .% relative errors. Individual numbers of the other minor ethnic groups are summarized and shown by the bar figure. Their relative errors are all below %. The unrecognized ethnic group and foreigners are calculated in Table . It is important to note that IPFSR and CO do not contain individuals from the unrecognized group. The reason is that our input sample does not include this type of person. This "zero element" problem cannot be solved by the sample-based methods. Thus, under this condition, the limitations of the sample-based methods are apparent.
The populations of the provinces of China are compared in Figure . Absolute deviations between the sampled synthetic populations and the Long Table data are represented by different colors in the map. Note that the numerical range represented by each color in subfigure (b) is different from the others. As can be seen,

Table : Numbers of Unrecognized Ethnic Group and Foreigner

Table and five synthetic results are drawn in Figure . As can be seen, each educated group percentage of these methods is nearly the same. Quantitatively, the average relative errors are .% (IPFSR), .% (CO), .% (SFF), .% (MCMC) and .% (JDI). It seems that the sample can elevate the accuracy of this marginal indicator.
The last marginal attribute is residence type, the results of which are presented in Figure . From the figure, about two thirds of the people are living in rural areas in , and each residence type has a similar percentage. It shows that the five methods all have good results when measured by this marginal.

Figure : Comparison of Residence Type (Frequencies and Relative Errors)

Table : RSSZm Values of Partial Joint Distributions

Table : Two Types of Tables used as the Inputs and Evaluation Criteria

Multi-threaded techniques are expected to accelerate the programs, which currently access the database in a single-threaded pattern, but more attention needs to be paid to data synchronization.
Table shows the attributes considered. The distributions of the Short and Long Tables are listed in Table . Since the disaggregate sample cannot distinguish among different districts, the two sample-based methods are not able to generate a synthetic population at this granularity. Here we only focus on the sample-free methods.
Related work has been conducted by many scholars (see, e.g., Anderson et al. ( ) and Casati et al. (

Synthetic population dataset.
Divide the studied area into several regions (in this paper, we set parts). The region number is denoted as RegNum.
for each region do
    Extract a random sample from D with the scale of PopSize/RegNum as the initial population Pop
    Calculate the fitness F of the initial dataset Pop
    Swap two random individuals from Pop and D respectively
    if F(before swap) > F(after swap) then
        let Pop = Pop(before swap)
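The swap-and-revert loop of the listing above can be sketched for a single region as follows. This is a simplified hill-climbing illustration, not the implementation used in this study: the fitness is assumed to be the negative total absolute error against one marginal (so that higher is better, consistent with reverting when F decreases), individuals are drawn from D with replacement, and the stopping rule is a fixed number of steps. All names and numbers are hypothetical.

```python
import random

def fitness(pop, targets, attr):
    """Negative total absolute error between the marginal counts of pop
    and the target counts (assumed fitness; higher is better)."""
    counts = {}
    for person in pop:
        counts[person[attr]] = counts.get(person[attr], 0) + 1
    return -sum(abs(counts.get(value, 0) - t) for value, t in targets.items())

def combinatorial_optimization(sample, targets, attr, pop_size,
                               steps=2000, seed=0):
    """Hill-climbing swap search for one region.

    sample: disaggregate sample D (list of dicts).
    targets: marginal counts scaled down to this region.
    """
    rng = random.Random(seed)
    # Initial population: random draw from D (with replacement, an assumption).
    pop = [rng.choice(sample) for _ in range(pop_size)]
    best = fitness(pop, targets, attr)
    for _ in range(steps):
        i = rng.randrange(pop_size)
        before = pop[i]
        pop[i] = rng.choice(sample)      # swap one individual with one from D
        after = fitness(pop, targets, attr)
        if best > after:
            pop[i] = before              # fitness dropped: keep Pop before swap
        else:
            best = after
    return pop, best

# Hypothetical sample and regional gender targets.
D = [{"gender": "male"}, {"gender": "female"}]
region_pop, region_fit = combinatorial_optimization(
    D, {"male": 6, "female": 4}, "gender", pop_size=10)
```

Because a swap is kept only when the fitness does not decrease, the recorded fitness never gets worse; running the same routine once per region and concatenating the results corresponds to the outer loop of the listing.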