Population Synthesis with Quasirandom Integer Sampling

Established methods for synthesising a population from geographically aggregated data are robust and well understood. However, most rely on the potentially detrimental process of integerisation if a whole, individual-level population is required, e.g. for use in agent-based modelling (ABM). This paper describes and investigates the use of quasirandom sequences to sample populations from known marginal constraints whilst preserving those marginal distributions. We call this technique Quasirandom Integer Without-replacement Sampling (QIWS) and show that the statistical properties of quasirandomly sampled populations are superior to those of pseudorandomly sampled ones, in that they tend to yield entropies much closer to populations generated using the entropy-maximising iterative proportional fitting (IPF) algorithm. The implementation is extremely efficient, easily outperforming common IPF implementations. It is freely available as an open source R package called humanleague. Finally, we suggest how the current limitations of the implementation can be overcome, providing a direction for future work.


Introduction
Iterative Proportional Fitting (IPF) is a popular and well-established technique for generating synthetic populations from marginal data. Compared with other methods of population synthesis, IPF is relatively fast, simple and easy to implement, as showcased in a practical introduction to the subject by Lovelace & Dumont. Furthermore, a number of software packages aimed at facilitating IPF for spatial microsimulation and other applications have been published, such as Barthelemy & Suesse and Jones et al.
However, the method has various limitations from the perspective of social simulation via agent-based models: IPF generates fractional weights instead of integer populations (an issue that can be tackled by integerisation); handles empty cells poorly (Lovelace et al.); and requires a representative individual-level 'seed' population. These issues and the mathematics underlying them are explored in Založnik.
Integerisation is only a partial solution to the issue of fractional weights. Whilst algorithms for integerisation such as 'truncate replicate sample' maintain the correct overall population, the adjustments can create a mismatch between the original and simulated marginal data (Lovelace & Ballas). Using IPF-generated probabilities as a basis for a sampling algorithm is an alternative solution, but again care must be taken to correctly match the marginal distributions and generate a realistic population.
In this paper we investigate sampling techniques where a synthetic population is built exclusively in the integer domain. The research reported in this paper forms part of, and is partly motivated by, a larger body of work in which there is a requirement to generate individual simulated agents for an entire city. In the larger project, the ultimate goal is to impute journey patterns to each member of the simulated population using app-based mobility traces, in order to generate travel demand schedules for an entire urban population in pseudo-real time.
The motivation behind the development of this method came from the fact that IPF, in general, does not generate integral populations in each state. If the resulting population is used as a basis for Agent-Based Modelling (ABM), then some form of integerisation process must be applied to the synthesised population. Whilst this process preserves the overall population, it may not exactly preserve all of the marginal distributions. Although the integerisation process may not introduce significant errors, the authors considered whether there was an alternative method of synthesis that would directly result in an integral population that exactly matched the marginal distributions.
We could not find any references in the literature on the use of quasirandom sequences for the purpose of population synthesis, and so decided to investigate the suitability of sampling algorithms using such sequences.
Working exclusively in the integer domain requires that marginal frequencies are integral. Often, these are expressed as probabilities rather than frequencies. We propose that integerising each marginal at this stage is preferable to integerising the final population. Mathematically, integerisation in a single dimension is straightforward, and a demonstrably optimal (i.e. closest) solution can always be reached.
We first show that using without-replacement sampling guarantees that the randomly sampled population will exactly match the marginal data. We then compare the statistical properties of populations sampled using pseudo- and quasirandom numbers, showing that the quasirandom sampling technique provides far better degeneracy, achieving solutions closer to the maximal-entropy solution that IPF converges to.

Theory
In this section we first quantitatively describe the core concepts that comprise the sampling technique. We then formally state the problem that the technique is used to solve, and finally define the statistical metrics that we use for assessing and comparing the results.

Quasirandom numbers
Quasirandom numbers, often referred to as low-discrepancy sequences, are preferable to pseudorandom numbers in many applications, despite not having some of the (appearance of) randomness that good pseudorandom generators possess. In this work we focus on the Sobol quasirandom sequence (Sobol'), with subsequent implementation and refinement by Bratley & Fox and Joe & Kuo.
To compare and contrast quasirandom and pseudorandom numbers, we use the MT19937 variant of the Mersenne Twister pseudorandom generator (Matsumoto & Nishimura).
Figure qualitatively illustrates the difference between pseudo- and quasirandom sampling. Each plot contains uniformly distributed points in two dimensions. The quasirandom samples fill the sample space far more evenly and display an obvious lack of randomness, clearly showing a lack of independence between samples. In contrast, the pseudorandom samples show no discernible pattern, with clusters and gaps present in the sampling domain, suggesting independence between variates.

Sobol sequences, as with other quasirandom sequences, have an inherent dimensionality and a relatively short period: the sequence is exhausted once every integer in the sample space has been sampled. Successive samples evenly fill the sample space, and thus lack independence. Conversely, a good pseudorandom generator has no discernible dependence between variates and often has a much longer period, allowing for very large samples to be taken.
For applications such as numerical integration, Sobol sequences converge at a rate of ≈ 1/N (more accurately (ln N)^D / N; Press et al.), compared to ≈ 1/√N for a pseudorandom generator, and thus require far fewer samples to achieve the same level of accuracy. This makes quasirandom sampling particularly suitable for Monte Carlo integration of high-dimensional functions.
Figure illustrates the superior convergence of quasirandom numbers. The histograms bin samples from each generator, and clearly the quasirandom sequence is far closer to the true distribution. Furthermore, consider the following simple example: we wish to calculate the expected value of the sum of two fair and independent dice, sampling 100 times. Using pseudorandom sampling, with a standard error inversely proportional to the square root of the number of samples, we would expect to get an estimate of 7 ± 10%. However, using quasirandom sampling, since the standard error is inversely proportional to the number of samples, the estimate will be 7 ± 1%. To achieve the same level of accuracy with pseudorandom numbers would require 10,000 samples.

Quasirandom sequences are not seeded like pseudorandom generators. To avoid repetition, and for better degeneracy, it is recommended to discard a number of variates on initialisation (Joe & Kuo), and on subsequent sampling to continue the sequence from its previous position.
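The convergence behaviour described above can be demonstrated with a minimal sketch. Rather than a full Sobol implementation, it uses the base-2 Van der Corput sequence (the radical inverse of the sample index, which is also the first dimension of the Sobol sequence) as an illustrative low-discrepancy generator; the function name `van_der_corput` is ours, not from the humanleague package.

```python
import random

def van_der_corput(n: int) -> float:
    """Radical inverse of n in base 2: a simple 1-D low-discrepancy sequence."""
    x, denom = 0.0, 1.0
    while n:
        denom *= 2.0
        n, bit = divmod(n, 2)
        x += bit / denom
    return x

# Estimate the mean of U(0,1) (true value 0.5) with N samples of each kind.
N = 256
quasi = [van_der_corput(i) for i in range(1, N + 1)]
random.seed(42)
pseudo = [random.random() for _ in range(N)]

quasi_err = abs(sum(quasi) / N - 0.5)
pseudo_err = abs(sum(pseudo) / N - 0.5)
print(f"quasirandom error:  {quasi_err:.5f}")
print(f"pseudorandom error: {pseudo_err:.5f}")
```

The quasirandom estimate converges at roughly 1/N, so its error here is on the order of 1/256, whereas the pseudorandom error is typically an order of magnitude larger at this sample size.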

When using quasirandom sequences it is extremely important to ensure that every sample is used: discarding a subset of the samples will potentially destroy the statistical properties of the entire sequence. For this reason quasirandom sequences must not be combined with rejection sampling methods, which are often used to convert the 'raw' uniformly-distributed variates into e.g. normally-distributed ones.
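A rejection-free alternative, consistent with the requirement above, is the inverse-transform method: every uniform variate in the sequence maps to exactly one output variate. A minimal sketch using the Python standard library's `statistics.NormalDist` (the uniform inputs here are a simple evenly spaced stand-in for a quasirandom stream):

```python
from statistics import NormalDist

# Map uniform variates to normal ones via the inverse CDF, so that every
# sample in the sequence is used and none is rejected.
norm = NormalDist(mu=0.0, sigma=1.0)
uniforms = [i / 16 for i in range(1, 16)]      # stand-in for a quasirandom stream
normals = [norm.inv_cdf(u) for u in uniforms]  # one output per input, none discarded

print(len(normals), normals[7])  # the median input 0.5 maps to 0.0
```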

Sampling without replacement
Given a discretely distributed population P of individuals, each of which can be categorised into one of n states with integral frequencies {f_1, f_2, ... f_n}, a random sample i ∈ {1...n} has probability p_i = f_i / Σ_j f_j. Once a sample i is taken, the distribution is adjusted to {f_1, f_2, ... f_i − 1, ... f_n}, with resultant impact on p. Once f_i has reached zero, no further samples can take the value i since p_i = 0.
Since f_i cannot be oversampled, all of the other states f_j (j ≠ i) cannot be collectively undersampled. By extension, since none of the states other than f_j can be oversampled, f_j cannot be undersampled. Thus each state can neither be under- nor oversampled: the distribution must be matched exactly.
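The one-dimensional argument above can be sketched directly: a sampler that decrements the sampled bin must, by construction, reproduce the marginal exactly. This is an illustrative sketch (the function name `sample_without_replacement` is ours), not the package implementation.

```python
import random

def sample_without_replacement(freqs, rng=random.random):
    """Draw from a discrete distribution, decrementing the sampled bin,
    until every bin is exhausted.  Returns the sequence of sampled states."""
    f = list(freqs)
    remaining = sum(f)
    out = []
    while remaining:
        u = rng() * remaining          # uniform over the remaining mass
        i, acc = 0, f[0]               # inverse-CDF lookup over depleted freqs
        while u >= acc:
            i += 1
            acc += f[i]
        f[i] -= 1
        remaining -= 1
        out.append(i)
    return out

random.seed(1)
marginal = [3, 5, 2]
states = sample_without_replacement(marginal)
counts = [states.count(i) for i in range(len(marginal))]
print(counts)  # [3, 5, 2]: the marginal is matched exactly, by construction
```

Whatever the underlying random stream, the counts equal the input marginal; only the order of the samples varies.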
Extending this to two dimensions is straightforward: given marginal frequency vectors f and g, each summing to P and (for argument's sake) of length n, take a random two-dimensional sample (i, j) ∈ {1...n}² with probability p_ij = f_i g_j / P², and subsequently decrement the sampled entries f_i and g_j in each marginal vector. This process can be repeated up to P times, when both marginal frequency vectors will be exhausted. The process extends easily to higher dimensions.

Integerisation of probabilities
Given a vector of discrete probabilities p_i for each state in a single category, and an overall population P, the nearest integer frequencies f_i can be obtained using the following algorithm: firstly obtain a lower bound for the integer frequencies by computing f_i = ⌊p_i P⌋, where g_i = p_i P − f_i represents the difference between the unconstrained frequency and the integerised frequency, and X = P − Σ_i f_i is the shortfall in the integerised population. To remove the shortfall we simply increment the f_i corresponding to the largest X values of g_i. This ensures that Σ_i f_i = P, matching the population, and that the total discrepancy g_i is minimised. In many cases there may be multiple optimal solutions.
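The algorithm above can be sketched in a few lines. This is an illustrative version (the function name `integerise` is ours, not the package's `prob2IntFreq`): floor each scaled probability, then hand the shortfall to the states with the largest residuals.

```python
from math import floor

def integerise(probs, population):
    """Nearest integer frequencies to probs * population, preserving the total."""
    raw = [p * population for p in probs]
    freqs = [floor(r) for r in raw]                  # lower bound f_i
    residuals = [r - f for r, f in zip(raw, freqs)]  # g_i
    shortfall = population - sum(freqs)              # X
    # give the shortfall to the states with the largest residuals
    for i in sorted(range(len(probs)), key=lambda i: -residuals[i])[:shortfall]:
        freqs[i] += 1
    return freqs

print(integerise([0.3, 0.3, 0.2, 0.2], 11))  # [4, 3, 2, 2]
```

With ties in the residuals (as here, where two states share the largest residual) multiple optimal solutions exist; this sketch breaks ties by index order.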

Problem statement
Population synthesis (Orcutt) refers to the (re)generation of individual-level data from aggregate (marginal) data. For example, UK census data provides aggregate data across numerous socioeconomic categories at various geographical resolutions, but it expressly does not provide data at an individual level, for privacy reasons among others. Only the most basic form of microsynthesis is described here: that where only the marginal data governs the final population, and no exogenous data is supplied. This corresponds to the use of IPF with a unity seed.
A mathematical statement of the problem is as follows: we wish to generate a population P of individuals which can be characterised into D separate categories, the i-th of which can occupy l_i discrete states. Each category is represented by a known frequency vector m_i of length l_i giving the marginal occurrences of a state within a single category. Thus we have S = l_1 × l_2 × ... × l_D possible overall states, and the aim is to populate a contingency table T in D dimensions such that each element of m_i is the sum of all elements in T for a fixed value of the index k_i. It is perhaps clearer to express this in visual form; see Figure. We define this as the marginal constraint.
Each marginal sum and the sum of the elements of the contingency table must equal the population P: Σ_j m_i[j] = Σ_k T_k = P for every category i. We define this as the population constraint. Finally, the integral constraint restricts the elements to the natural numbers: T_k ∈ ℕ. The probability of a given state k = (k_1, ..., k_D) being occupied is thus the product of the marginal probabilities: p_k = Π_i (m_i[k_i] / P), for which a two-dimensional example is illustrated for clarity in Figure.
The degrees of freedom d of the system (required in order to calculate the statistical significance of the population) is given by d = S − 1 − Σ_i (l_i − 1). In the general case there are not enough constraints to determine a unique solution; hence there is a need to resort to iterative or optimisation-type solutions, such as IPF, simulated annealing, likelihood estimation, chi-squared fitting, or least-squares fitting.
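As a quick check of this bookkeeping, a minimal helper (assuming the standard contingency-table formula d = S − 1 − Σ_i(l_i − 1), which reduces to the familiar (r − 1)(c − 1) in two dimensions; the function name `dof` is ours):

```python
from math import prod

def dof(marginal_lengths):
    """Degrees of freedom of a D-dimensional contingency table whose
    one-dimensional marginal sums are fixed: d = S - 1 - sum(l_i - 1)."""
    S = prod(marginal_lengths)  # number of possible states
    return S - 1 - sum(l - 1 for l in marginal_lengths)

print(dof([3, 3]))    # 3x3 table with both marginals fixed -> 4
print(dof([10, 10]))  # a 10x10 two-dimensional case -> 81
```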

Degeneracy, Entropy and Statistical Likelihood
In this work we use the term degeneracy to mean the number of possible different ways a given overall population P can be sampled into a D-dimensional table T containing S possible states k, with occupancy probability p_k, the system having d degrees of freedom. In this sense, the term degeneracy is closely related to (information-theoretic) entropy: a high-degeneracy state can be considered a high-entropy state, and a low-degeneracy state a low-entropy one.
Assuming that the marginal distributions are uncorrelated, the higher the degeneracy/entropy of the population, the more statistically likely the population is. We measure the statistical likelihood of populations using a χ² statistic, χ² = Σ_k (T_k − P p_k)² / (P p_k), from which we can estimate a p-value representing the statistical significance of the synthetic population: p = 1 − F(χ²; d), where F is the cumulative χ² distribution function and d the degrees of freedom of the system.
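These two metrics can be sketched as follows. The survival function 1 − F(χ²; d) has a simple closed form for even d, which is all this illustrative sketch supports (the function names are ours; a production implementation would use a full incomplete-gamma routine):

```python
from math import exp

def chi_squared(observed, expected):
    """Pearson chi-squared statistic over the flattened table."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi2_sf(x, d):
    """Survival function 1 - F(x; d) of the chi-squared distribution,
    using the closed form valid only for even d (our simplifying assumption)."""
    assert d % 2 == 0, "this sketch requires even degrees of freedom"
    term, total = 1.0, 1.0
    for k in range(1, d // 2):
        term *= (x / 2) / k
        total += term
    return exp(-x / 2) * total

# Example: chi-squared of 9 on 4 degrees of freedom (a 3x3 table with
# both marginals fixed) gives a low p-value.
print(round(chi2_sf(9.0, 4), 3))  # 0.061
```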
The purpose in this context is to use these metrics to demonstrate the ability of the sampling method described here to generate populations with a high p-value, i.e. high degeneracy, high entropy, high likelihood and low statistical significance.

The Algorithm
This section describes the Quasirandom Integer Without-replacement Sampling (QIWS) algorithm. A commentary on the process by which we arrived at this algorithm follows in the Discussion.
The inputs required for the algorithm to operate are simply a number of integer vectors defining the marginals. The elements of each vector must sum to the same value, satisfying the population constraint defined earlier.
Given a valid input of D marginal vectors m_i, the process then proceeds as follows:

1. Create a D-dimensional discrete without-replacement distribution using the marginals m_i.
2. Sample P random variates from this distribution to create a population T. Specifically, we sample a value of k and increment the value of T_k, repeating until the distribution is fully depleted. Constructing the problem in this way ensures that all constraints are automatically met, as explained in the previous section.
3. Compute occupation probabilities p_k for each state k.
4. Compute a χ² statistic and a p-value, which represents the statistical significance of the synthetic population (in this case we prefer higher p-values, i.e. statistically insignificant populations).
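A two-dimensional sketch of these steps follows. It is illustrative only: in place of a Sobol generator it uses the Halton sequence (radical inverses in bases 2 and 3) as a stand-in low-discrepancy source, and the function names `halton`, `pick` and `qiws_2d` are ours, not the package's.

```python
def halton(n: int, base: int) -> float:
    """Radical inverse of n in the given base (the Halton sequence)."""
    x, denom = 0.0, 1.0
    while n:
        denom *= base
        n, digit = divmod(n, base)
        x += digit / denom
    return x

def pick(u, freqs):
    """Inverse-CDF lookup of a uniform variate u over integer frequencies."""
    target = u * sum(freqs)
    acc = 0
    for i, f in enumerate(freqs):
        acc += f
        if target < acc:
            return i
    return len(freqs) - 1

def qiws_2d(m0, m1):
    """Quasirandom integer without-replacement sampling of a 2-D table."""
    f, g = list(m0), list(m1)
    table = [[0] * len(g) for _ in f]
    for n in range(1, sum(m0) + 1):
        i = pick(halton(n, 2), f)   # one low-discrepancy coordinate per dimension
        j = pick(halton(n, 3), g)
        table[i][j] += 1
        f[i] -= 1                   # deplete the marginals so that no state
        g[j] -= 1                   # can be over- or undersampled
    return table

t = qiws_2d([4, 3, 3], [5, 5])
print(t)
print([sum(row) for row in t])        # row sums equal the first marginal
print([sum(c) for c in zip(*t)])      # column sums equal the second marginal
```

Because the marginals are depleted as sampling proceeds, the output table satisfies the marginal, population and integral constraints by construction, whatever the underlying sequence.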

The implementation
The algorithm is implemented in an R package called humanleague. A description of the technical implementation details can be found in the Appendix. The package is open source under a GPL licence and available at https://github.com/CatchDat/humanleague.

In order to use the microsynthesis function, the input data requirements are a number of discrete integer marginal distributions, i.e. counts of population in each state of some category. If any marginal data is expressed as probabilities, a function is provided to convert these to integer frequencies. For reference, we also provide a function for direct generation of Sobol sequences.

Usage examples
prob2IntFreq: this function converts an array of discrete probabilities p_i into integral frequencies, given an overall population P. It also returns the mean square error between the integer frequencies and their unconstrained values p_i P.

synthPop: this is the function that performs the microsynthesis. The function supports dimensionalities up to a fixed limit, although this limit is arbitrary and could be increased if necessary. Input is simply a list of integer vectors representing the marginals; the marginal vectors must all sum to the population P (to satisfy the population constraint). The output, which includes the D-dimensional population array, is broadly compatible with the established mipfp (Barthelemy & Suesse) R package.

sobolSequence: this function simply produces a Sobol sequence of the requested dimension and length, optionally skipping part of the sequence (for example, the first ten values of the six-dimensional sequence, without skipping).

Statistical properties
Using IPF with a unity seed is equivalent to populating each state proportionally to the probability of occupancy of the state. This solution maximises entropy (Lovelace et al.). Sampling algorithms generally cannot achieve the same entropy, since they only populate states with integer populations.
We assume that high-entropy populations are preferable for most applications, and our aim in this section is to show that quasirandom sampling is much better at achieving high-entropy populations than pseudorandom sampling.
In order to compare pseudorandomly and quasirandomly sampled populations, we considered a two-dimensional test case in which each marginal had ten equally probable states. We varied the total population, resulting in a range of population densities.

Discussion
Evolution of the algorithm

Firstly, it was demonstrated that naive pseudorandom sampling could not reliably generate populations matching the marginals. Since successive samples are ostensibly independent, there is no mechanism to prevent under- or oversampling of any particular state.
The initial algorithm simply sampled the marginal distributions (with replacement), using quasirandom numbers, to build a population. For "toy" problems this method often worked, but for more complex problems the algorithm often failed to exactly match all the marginals.

Table: Relative computation time of the algorithms (averaged over runs), using the first case as a baseline.
In general, the resulting population mismatched the marginals only slightly. A correction algorithm was implemented which, applied at most once for each dimension of the population, adjusted the states by the error in the marginal for that dimension.
It was also demonstrated that a correction step as described above could not be applied to a pseudorandomly generated population, because the marginal errors were almost always too large to be able to apply a correction: such corrections would typically result in a negative population in one or more states.

Furthermore, it was shown that even using quasirandom sampling, it was not always guaranteed that a population could be corrected using the method described above without a negative population in one or more states. The sparser the population (i.e. the lower the ratio of the total population P to the number of possible states S), the more often this would occur. Often the solution was simply to discard and resample the population until an exact or correctable population was generated. This raised a concern about the reliability (and efficiency) of the algorithm, and further testing did reveal cases where the number of resamples required was unacceptably large, even exhausting the period of the generator.
This flaw drove the authors to develop an alternative formulation of the algorithm, and it transpired that without-replacement sampling would guarantee that the population matched all marginals by design (as explained in the Theory section), eliminating the need for the correction step entirely. In fact, this method is guaranteed to work regardless of the choice of underlying random number generation.
The next phase of the work was to analyse generated populations statistically, focussing on the statistical likelihood of the resulting populations in order to determine whether quasirandom sampling offers any additional benefit. The results from this work, presented in the previous section, are discussed below.

Randomness in context
To put pseudorandom versus quasirandom sampling into context, and to explain at least intuitively why we observe different distributions of p-values (see the tables of p-value histograms), firstly consider a 'random' number generator that simply always returns zero. If we used this as the basis for without-replacement sampling of a marginal distribution {30, 30, 30}, we would get a sequence consisting of thirty ones, thirty twos and thirty threes as the samples exhausted each bin and moved to the next. In two dimensions, this would result in the population matrix

[ 30  0  0 ]
[  0 30  0 ]
[  0  0 30 ]

suggesting that the marginals are strongly correlated. This particular result has a low degeneracy (there are relatively few different ways of arriving at this result) and a p-value of effectively zero, implying that it is vanishingly unlikely that this population resulted by chance from independent marginal distributions.
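This degenerate case can be reproduced with a small sketch (the function names `pick` and `sample_2d` are ours): feeding an always-zero "generator" into a two-dimensional without-replacement sampler yields exactly the diagonal table described above.

```python
def pick(u, freqs):
    """Inverse-CDF lookup of u in [0,1) over the current frequencies."""
    target = u * sum(freqs)
    acc = 0
    for i, f in enumerate(freqs):
        acc += f
        if target < acc:
            return i
    return len(freqs) - 1

def sample_2d(m0, m1, rng):
    """2-D without-replacement sampling driven by an arbitrary uniform source."""
    f, g = list(m0), list(m1)
    table = [[0] * len(g) for _ in f]
    for _ in range(sum(m0)):
        i, j = pick(rng(), f), pick(rng(), g)
        table[i][j] += 1
        f[i] -= 1
        g[j] -= 1
    return table

# A 'generator' that always returns zero exhausts each bin in turn,
# producing the perfectly correlated diagonal population.
t = sample_2d([30, 30, 30], [30, 30, 30], lambda: 0.0)
print(t)  # [[30, 0, 0], [0, 30, 0], [0, 0, 30]]
```

Note that even this pathological generator satisfies the marginal constraints; only the entropy of the result suffers.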
Simplifying Equation for this example, and assuming a variation X in the populations of each state (from their expected value), we can thus estimate χ² as

χ² ≈ N X² / (P/N) = N² X² / P

where N is the number of states and P the overall population (respectively 9 and 90 in this example).

Now, repeating the sampling process using a good-quality pseudorandom number generator, given that convergence is inversely proportional to the square root of the number of samples, we would typically expect a population matrix more like

[ 10±3 10±3 10±3 ]
[ 10±3 10±3 10±3 ]
[ 10±3 10±3 10±3 ]

where 3 is the closest integer to X, and X = √(P/N) = √10. Thus, it would not be unreasonable to expect (from the equation above) χ² ≈ N² X² / P, which in this example is 9, corresponding to a low p-value of around 0.06.
Repeating again, this time using a quasirandom sequence, and bearing in mind that convergence is now linear, a typical population matrix would be

[ 10±1 10±1 10±1 ]
[ 10±1 10±1 10±1 ]
[ 10±1 10±1 10±1 ]

This time the variation X = (P/N)/(P/N) = 1, and thus χ² ≈ N²/P, which in this example is 0.9, an order of magnitude lower, with a corresponding p-value of around 0.92.
This result, whilst not mathematically rigorous (the analysis is complicated by the use of without-replacement sampling), is in line with the sampled uniform distributions shown in Figure. We believe that this example goes some way to explaining the p-value histograms presented with the results.
Furthermore, the figures seem to suggest that whilst pseudorandomly sampled populations converge to an even distribution of p-values (i.e. there is no bias towards high-entropy populations) as the population density increases, quasirandomly sampled populations strongly converge to populations with a unity p-value. The latter is clearly preferable when the marginals are meant to be independent, confirming the null hypothesis.
For low population densities, there is evidence of 'quantisation' of the p-value distribution; this is thought to arise from the fact that for sparse populations there are fewer possible configurations of the population. Why there is an apparent bias towards particular p-values for pseudorandomly sampled sparse populations is unclear to us.

Statistical likelihood
The minimum χ² value possible from a sampled integral population is generally greater than zero, simply because the expected population in each state may not be integral. Thus a p-value of exactly unity is not always possible.
As explained in the previous section, QIWS has a strong bias towards high p-values but may nevertheless produce statistically unlikely populations. The χ² and p-value that are supplied along with the population should be checked against some suitable threshold, resampling as necessary.
QIWS performance in terms of computational efficiency is far superior to the IPF implementation we compared it to. One reason for this is that QIWS is not iterative, but the following points should also be noted:

• the comparison is somewhat unfair, since our implementation is compiled C and C++ code whereas the Ipfp implementation is interpreted R code;
• the performance of population synthesis may not be a significant factor in most workflows, and thus speed may be of marginal advantage. However, in large-scale ensemble simulation applications, the performance of QIWS may be an advantage.

Application / range of use
For agent-based modelling (ABM) applications, QIWS has the advantage that it only generates integer populations. Thus integerisation, which may alter the population in such a way that the marginals are no longer met, is not required.
An alternative to integerisation is sampling. Such techniques have been documented, although none mention the use of quasirandom sampling. For example, in the doubly-constrained model described in Lenormand et al., a fractional population generated with IPF is used as a joint distribution from which to randomly sample individuals. In Gargiulo et al., without-replacement sampling is also used, with the joint distribution dynamically updated after each sample.
IPF can be used for microsimulation in cases where some marginal data is not fully specified. For example, some categorical constraint data may only be available for a subset of the whole population. By using this data as a seed population, rather than as a marginal constraint, the incomplete data is scaled to, and smoothed over, the overall population. This type of microsimulation is described and discussed in the "CakeMap" example in Lovelace & Dumont.
The authors believe that the current implementation of QIWS could be extended to deal with problems such as "CakeMap", or more generally ones where some exogenous data must be taken into account. The approach would be to construct a dynamic joint distribution from marginal and exogenous data, updating it as each sample is taken.
With given marginals and seed population, IPF will always result in the same overall population. QIWS differs in this respect in that the same inputs can be used to generate multiple different populations, all of which meet the marginal constraints. These could potentially be used in an ensemble simulation to test the sensitivity of a model to different synthetic populations generated from the same marginal data.
QIWS may potentially be of use in the process of anonymisation, by microsynthesising aggregate data from a known population. This is outside the scope of this work but, since full marginal data will likely be available, its use in this context may be worth investigating.

Conclusion
This work forms part of a larger project in which a city-scale population microsynthesis is used as the basis for an agent-based model of commute patterns, using census data combined with a crowdsourced dataset. The QIWS algorithm and its software implementation in an R package were developed for this purpose, eliminating the need for integerising the synthesised population. This microsimulation involved categorical variables (i.e. marginals) representing home and work locations (at MSOA resolution), mode of transport, gender, age group, and economic activity type, with the crowdsourced data overlaid on the larger census population.
Our analysis shows that quasirandom sampling results in more statistically likely populations than pseudorandom sampling, as illustrated in the tables of p-value histograms. The statistical likelihood/significance of the sampled population may vary, and thus metrics are provided to guide the user in determining the suitability of the generated population.
In situations where integral populations with high statistical likelihood are desirable, QIWS presents an advantage over established techniques, either by eliminating the need for integerisation or by its use of quasirandom sequences to create high-entropy sampled populations.
In its current implementation, QIWS is more limited in scope than IPF in that it is unable to deal with cases where some marginal data is not explicitly specified, but is instead expressed in the seed population used to initialise the IPF iteration.
The authors are working on extending the technique to allow for exogenous data to be specified alongside the marginals, akin to the use of a seed in IPF. The approach is to separate the sampling into two connected joint distributions: one from which the samples are taken (containing exogenous data), and one from which samples are removed (in order to preserve the marginals). Whilst the algorithm has been demonstrated to work, further testing is required, and will specifically focus on cases where IPF does not perform well, such as the prevalence of a high proportion of "empty cells".
In applications where performance is a key factor, we have shown that our implementation of QIWS comfortably outperforms a popular R implementation of IPF (Barthelemy & Suesse). For large-scale ensemble modelling, this could be advantageous.
We have established that QIWS can be used for microsynthesis given aggregate data. Although outside the scope of this work, we believe that the algorithm could equally be applied to an anonymisation process whereby individual-level data is first aggregated and then synthesised, for example Nowok et al.
We envisage QIWS as an evolving technique that complements rather than supplants established methods, and publicise the technique as such. With ongoing development of the algorithm we hope to extend it to deal with more complex microsynthesis problems.