The Coconut Model with Heterogeneous Strategies and Learning

In this paper, we develop an agent-based version of the Diamond search equilibrium model, also known as the Coconut Model. In this model, agents are faced with production decisions that have to be evaluated based on their expectations about the future utility of the produced entity, which in turn depends on the global production level via a trading mechanism. While the original dynamical systems formulation assumes an infinite number of homogeneously adapting agents obeying strong rationality conditions, the agent-based setting makes it possible to discuss the effects of heterogeneous and adaptive expectations and enables the analysis of non-equilibrium trajectories. Starting from a baseline implementation that matches the asymptotic behavior of the original model, we show how agent heterogeneity can be accounted for in the aggregate dynamical equations. We then show that when agents adapt their strategies by a simple temporal difference learning scheme, the system converges to one of the fixed points of the original system. Systematic simulations reveal that this is the only stable equilibrium solution.


Introduction
Aggregate dynamical equations derived from individual-based models are usually exact only if the micro-level dynamics satisfy certain rather restrictive symmetry conditions brought about by homogeneity of the agent properties and interaction rules (Banisch ). Micro-level heterogeneity or complex interaction structures often lead to correlations in the dynamics that a macro-level description may not account for, in which case it is impossible to describe the system evolution exactly by a closed set of macroscopic equations. This makes it necessary to understand the consequences of heterogeneous agent properties for macroscopic formulations of the system dynamics. In this paper, we present results of this effort for the Diamond search equilibrium model (Diamond ; Diamond & Fudenberg ), also known as the Coconut Model, which was introduced by the Nobel laureate Peter Diamond as a model of an economy with trade frictions.
Imagine an island with N agents that like to eat coconuts. They search for palm trees and harvest a nut if the tree is not too tall, meaning that its height does not exceed an individual threshold cost (c tree < c i ). However, in order to consume the nut and derive utility y from this consumption, agents have to find a trading partner, that is, another agent with a nut. Therefore, agents have to base their harvest decision now (by setting c i ) on their expectation of finding a trading partner in the future. More precisely, agents are faced with production decisions that depend on their expectations about the future utility of the produced entity, which in turn depends on the global production level via a trading mechanism. For this reason, the Coconut Model is useful not only for incorporating heterogeneity, but also for analysing adaptive agents that, rationally or not, form expectations about the future system state in order to evaluate their decision options.
In the original work (Diamond ; Diamond & Fudenberg ), this problem of inter-temporal optimization was formulated using dynamic programming principles, and in particular the Bellman equation. This allowed them to derive a differential equation (DE) describing the evolution of the cost threshold along an optimality path (where the individual thresholds are all equal, c i = c), which was coupled to a second DE describing the evolution of the share of agents holding a coconut.
The TD approach used here differs mildly from these models but fits well with the abstract specification of adaptive behaviour proposed in Holland & Miller ( ). In our case, agents learn the values associated with having and with not having a coconut, in the form of expected future rewards, and use these values to determine their cost threshold c i . This means that agents are forward-looking in trying to anticipate their potential future gains. While checking genetic algorithms or strategy-switching methods in the context of the Coconut Model is an interesting issue for future research, in this first paper we would like to derive an agent-based version of the model that is as closely related to the original model as possible. The motivation behind this is well captured by a quote from Holland & Miller ( ) (p. ): "As a minimal requirement, wherever the new approach overlaps classical theory, it must include verified results of that theory in a way reminiscent of the way in which the formalism of general relativity includes the powerful results of classical physics."
In our opinion, this relates to the tradition in ABM research of verifying and validating computational models by replication experiments (e.g. Axtell et al. ; Hales et al. ; Grimm et al. ; Wilensky & Rand ). The main idea is that the same conceptual model implemented on different machines, and possibly using different programming languages, should always yield the same behaviour. Our point of view is that developing scientific standards for ABMs should not be restricted to comparing different computer implementations of the same conceptual model; it should also aim at aligning or comparing the results of an ABM implementation with analytical formulations of the same processes, at least whenever such descriptions are available.
This is the case for the Coconut Model: with its rich and sometimes intricate behavior on the one hand and the availability of a set of analytical results on the other, the model is well-suited as a testbed for this kind of model comparison. In particular, when it comes to extending a model that is formulated for an idealized homogeneous population so as to incorporate heterogeneity at the agent level, we should make sure that it matches the theoretical results obtained for the idealized case.
Therefore, the main objective of this paper is to derive an agent-based version of the Coconut Model as originally conceived by (Diamond ; Diamond & Fudenberg ), where the model dynamics were derived for an idealized infinite and homogeneous population. In Section , we will see that this implementation does not lead to the fixed point(s) of the original system. In order to align the ABM so that it yields the right fixed point behaviour, we have at least two different options, which in turn lead to slight differences in the dynamical behaviour that are not obvious from the DE description.

Once we have such an aligned model, it is possible to study the effects that result from deviating from the idealized setting of homogeneous strategies (see Section ). Heterogeneity may arise from the interaction structure, from the information available to agents, as well as from heterogeneous agent strategies due to learning in finite populations, which will be our main point here. In particular, we will show that heterogeneity can be accounted for in a macroscopic model formulation by the correction term introduced in Olbrich et al. ( ).
Section presents agents that use TD learning to learn the optimal strategy. As the learning scheme used in this paper can in fact be derived from the Bellman equation used to set up the original model, an agent population that adapts according to this method should converge to the same equilibrium solution in a procedural way. Moreover, by implementing rationality as a process (Simon ), this approach describes the route to optimality and makes it possible to analyse questions related to equilibrium selection and stability. Before proceeding, however, let us describe the original model more carefully.

Description of the Original Model
Consider an island populated by N agents. On the island there are many palm trees, and agents wish to consume the coconuts that grow on them. The probability that an agent finds a coco tree is denoted by f, and harvesting a nut bears a cost c tree (the metaphor is the height of the tree) described by a cumulative distribution G(c) giving the probability that the cost of a tree satisfies c tree < c. In what follows we consider the costs of trees to be uniformly distributed in the interval [c min , c max ], such that G(c) = (c − c min )/(c max − c min ). Agents cannot store coconuts, so only agents without a nut may climb and harvest a new one. On encountering a tree (with probability f ), an agent without a coconut climbs if the cost c tree of harvesting the tree does not exceed a strategic cost threshold c i (referred to as strategy) above which the agent assumes that harvesting the coconut would not be profitable. In other words, the probability with which an agent without a nut will harvest a tree is given by f G(c i ). In the original model, the agent strategy is endogenously determined (and then written as c t i ) as described below.
In what follows, we denote the state of an agent i by the tuple (s i , c i ), where s i ∈ {0, 1} encodes whether agent i holds a coconut or not and c i is the strategy of the agent. We define the macroscopic quantities e = Σ i s i and ε = e/N, the first being the number of coconuts in the population and the second the ratio of agents having a coconut. The time evolution in the limit of an infinite population can then be written as the differential equation ε̇ = f G(c*)(1 − ε) − ε², where the first term corresponds to the rate at which agents harvest a coconut and the second to trading, with ε² being the probability of randomly choosing two agents with a coconut (Diamond ). Here c* corresponds to the agents' optimal strategy; in the original works that inspired our model, this is homogeneous across the population but endogenously defined (time-dependent). G(c*) is the fraction of trees that will be harvested at this cost threshold.
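The mean-field dynamics ε̇ = f G(c)(1 − ε) − ε² can be integrated numerically in a few lines. The following sketch is our own illustration (not the authors' code), using the parameter values f = 0.8, c min = 0.3, c max = 0.5 that appear in the experiments later in the paper:

```python
def G(c, c_min=0.3, c_max=0.5):
    """Uniform-cost CDF: probability that a tree's cost lies below c."""
    return min(1.0, max(0.0, (c - c_min) / (c_max - c_min)))

def integrate_eps(c, f=0.8, eps0=0.0, dt=0.01, steps=10_000):
    """Euler integration of d(eps)/dt = f*G(c)*(1 - eps) - eps**2."""
    eps = eps0
    for _ in range(steps):
        eps += dt * (f * G(c) * (1.0 - eps) - eps ** 2)
    return eps

eps_star = integrate_eps(c=0.4)  # long-run coconut level for a fixed strategy c
```

For c = 0.4 the trajectory settles at the positive root of f G(c)(1 − ε) = ε², i.e. ε ≈ 0.463 with these parameters.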
A crucial ingredient of the Coconut Model is that agents are not allowed to directly consume the coconut they harvested. They rather have to search for a trading partner, that is, for another agent who also has a coconut. The idea behind this is that agents have to find buyers for the goods they produce. If an agent who possesses a nut encounters another agent with a nut, both are supposed to consume instantaneously and each derives a reward of y from this consumption. In effect, this means that the expected value of climbing a tree depends on the total number of coconuts in the population or, more precisely, on the time agents have to wait until a trading partner is found.
Rational agents are assumed to maximize their expected future utility V i (t) = E[∫ t ∞ e −γ(τ −t) r i (τ ) dτ ], where r i (τ ) corresponds to the cost of climbing or, respectively, the utility y from consumption of agent i at time τ, and γ to the discount factor. A fully rational agent has to find the strategy c * i that maximizes its expected future reward, and given that agents cannot consume their coconut instantaneously, this reward depends on expectations about trading chances. This can be formulated as a dynamic programming problem with dV i (t)/dt = −r i (t) + γV i (t). Considering that there are two states (namely, s i = 0 or s i = 1), there is a value associated with having (V i (s i = 1, t) := V t i (1)) and with not having (V i (s i = 0, t) := V t i (0)) a coconut at time t. As a rational agent accepts any opportunity that increases expected utility, a necessary condition for an optimal strategy is c * i = V t i (1) − V t i (0). Following this reasoning and assuming homogeneous strategies c * i = c*, Diamond derives another DE that describes the evolution of the optimal strategy. The model ( )-( ) is interesting due to its rather rich solution structure and because many macroeconomic models incorporating inter-temporal optimization have a similar form. First, the model may give rise to a market-failure equilibrium, which arises as a self-fulfilling prophecy where everyone believes that nobody else invests (climbs the coco trees). It has multiple equilibria, and it has been shown in Diamond & Fudenberg ( ) that the model can also exhibit cyclic equilibria; this has been interpreted as an abstract model of endogenous business cycles. A complete stability analysis of the system ( ) and ( ) has been presented by Lux ( ).
The formulation of the model as a two-dimensional system of DEs ( )-( ) (Diamond ; Diamond & Fudenberg ; Lux ) assumes an infinite and homogeneously adapting population. While many aspects of the DE system have been understood by using corresponding analytical tools, some other important aspects of general theoretical interest cannot be addressed within this setting. This includes equilibrium selection but also stability and out-of-equilibrium perturbations, bounded or procedural rationality (Simon ) and learning, and finally the influence of micro-level heterogeneity.
In order to address the first of these points, equilibrium selection, Aoki & Shirai ( ) have re-examined the model with finite populations. They related equilibrium selection to finite-size stochastic fluctuations and showed that transitions between different basins of attraction occur with positive probability. The benefit of implementing Diamond's model as an ABM is that we can eventually relax the other assumptions as well.

Homogeneous Strategies
As argued in the introduction, when reimplementing the model as an ABM, we should make sure that the model incorporates the solution(s) of the original model, in the sense that reimplementing the original idealizations leads to the same behaviour. Therefore, let us first look at the dynamics of the model when all agents adopt the same strategy c. Setting ε̇ = 0 characterizes the equilibrium point ε* for a given c. If we implement the coconut model with homogeneous strategies c in the form of an ABM, we can expect that the average level of coconuts in the population approaches and fluctuates around the point ε*(c).

Model Alignment
Intuitive Implementation
Intuitively, the Diamond model as set up in the original paper Diamond ( ) could be implemented in the following way (we refer to this «intuitive» implementation as IM): ( ) Initialization: assign to each agent i an initial state s i and a strategy c i . ( ) Iteration loop: (a) random choice of an agent i; (b) if s i = 0, agent i harvests a nut with probability f G(c i ); (c) if s i = 1, a second agent j is chosen at random and both consume if s j = 1.
Some short comments on this procedure are in order. For the initialization step ( ), we usually define a desired level of coconuts in the initial population (ε 0 ), and a random initial state adhering to this level is obtained by letting each agent have a coconut with probability ε 0 . The initialization of the strategies c i will be according to the different scenarios described in the next section, but in general the strategies will lie in the interval [c min , c max ].
In this section, the strategies are homogeneous and do not change over time, so that all agents have identical strategies at all times. Point ( a) in the iteration process means that at each step we randomly choose one agent from the population. Note that this means that within N iteration steps some agents may be chosen more than once whereas others might not be chosen at all. For point ( b), the climbing decision with probability f G(c i ) is evaluated by drawing two random numbers, one for the rate of coco trees f and another for the cost of the tree, which is uniformly distributed in the interval c tree ∈ [c min , c max ] for all the experiments we perform throughout this paper. An agent climbs the tree if c tree ≤ c i . Finally, if agent i has a nut to trade ( c), a second agent j is randomly chosen from the remaining agent set, which means that there are no specific trading relations between the agents that direct the search for trading partners.
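The iteration step just described can be sketched in a few lines. This is an illustrative reimplementation under the stated rules (function and variable names are ours), with f = 0.8 and tree costs uniform on [c min , c max ]:

```python
import random

def im_step(state, strategies, f=0.8, c_min=0.3, c_max=0.5):
    """One iteration of the intuitive scheme (IM): pick one agent at random;
    an agent without a nut may climb, an agent with a nut may trade."""
    N = len(state)
    i = random.randrange(N)
    if state[i] == 0:
        # climbing: a tree is found with probability f, its cost is uniform
        if random.random() < f:
            c_tree = random.uniform(c_min, c_max)
            if c_tree <= strategies[i]:
                state[i] = 1
    else:
        # trading: choose a second agent j != i uniformly at random
        j = random.randrange(N - 1)
        if j >= i:
            j += 1
        if state[j] == 1:
            state[i] = state[j] = 0  # both trade and consume (reward y each)
```

Iterating this step with homogeneous strategies c i = c and averaging the coconut level reproduces the "star points" comparison described below.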
In this section, we compare the outcome of the simulation model as specified above with homogeneous strategies c i = c to the behaviour predicted by the DE ( ). We look at a small system of only N = 100 agents in order to see how well the theoretical results obtained for infinite populations approximate the small-scale system. For the comparison, the system is initialized with ε 0 = 0 (no coconuts in the population) and run for a number of steps to reach the stationary regime. Then further update steps are performed during which we measure the mean level of coconuts in the population. We run a series of such experiments for different strategy values c between c min and c max . The results are shown by the star points in Fig. .
In addition, we show the strategy-dependent solution curve ε*(c) ( ) obtained from the original DE (orange curve). It is clear that for the intuitive implementation the average coconut level is considerably below the DE solution. This is due to the fact that in a single time step, on average, two agents may trade but only one agent may climb a coco tree and harvest a nut, which means that the rate −ε² actually underestimates the decrease of coconuts by a factor of two. Consequently, we should replace this term by −2ε² in order to obtain a DE corresponding to the behaviour of the ABM, namely ε̇ = f G(c)(1 − ε) − 2ε², which is also shown (green curve) in Fig. . Note that the adjusted DE ( ) does not change the behaviour of the system ( )-( ) in a qualitative way. Note also that instead of multiplying −ε² by two, we could rescale f to f /2, so that the dynamical structure of the model is not really affected. For the experiments in Section we will stick to the intuitive model implementation (IM) as described above and use ( ) for the comparison with the DE description.
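Since the corrected drift f G(c)(1 − ε) − 2ε² is quadratic in ε, the fixed point ε*(c) matching the IM behaviour has a closed form. The sketch below (function name ours, parameters as in the experiments above) solves the quadratic 2ε² + f G(c) ε − f G(c) = 0 directly:

```python
from math import sqrt

def eps_star(c, f=0.8, c_min=0.3, c_max=0.5):
    """Fixed point eps*(c) of the ABM-aligned drift f*G(c)*(1-eps) - 2*eps**2."""
    g = f * min(1.0, max(0.0, (c - c_min) / (c_max - c_min)))
    # positive root of 2*eps**2 + g*eps - g = 0
    return (-g + sqrt(g * g + 8.0 * g)) / 4.0
```

Evaluating this for c between c min and c max traces out the green curve against which the simulated star points are compared.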
Two Ways of Aligning the ABM

In order to obtain an ABM that matches the original DE description, we have at least two different options. We could, first, allow two agents to climb at each time step by changing the iteration scheme to: ( ) Iteration loop: (a) random choice of an agent pair (i, j) with probability ω(i, j) = 1/N (N − 1). Let us refer to this scheme as aligned model one (AM1). Alternatively, we could decide that an agent i with a nut (s t i = 1) trades that nut with a probability proportional to the number e t of coconuts in the population, so that at most one coconut is cleared in a single time step; we shall call this scheme AM2. In a sense, this mechanism does not really comply with what we usually do in an ABM. However, it is a way to reproduce the fixed point solution ( ) of the original DE, which is shown in Fig. as well. Indeed, the two aligned schemes match the DE solution very well. AM2 will be used in Section , where we incorporate learning dynamics into the model.

Finite-Size Markov Chain Formulation
In addition to the DE and the ABM formulations, we consider here a third description in the form of a Markov chain (MC). For homogeneous populations one can show that an MC description accounting for the transitions between the different numbers of coconuts in the population provides an exact formulation of a finite-size ABM with N agents (Banisch ). In the finite-size case, the possible coconut levels are given by the finite set {0/N, 1/N, . . . , N/N}, which we shall write as Y = {0, 1, . . . , e, . . . , N} to simplify notation. We denote by e the (absolute) number of agents with a coconut (i.e., e = Σ i s i ).
For the intuitive scheme IM, the transition probabilities are given by P (e + 1|e) = f G(c)(1 − e/N ), P (e − 2|e) = e(e − 1)/N (N − 1), P (e|e) = 1 − P (e + 1|e) − P (e − 2|e), and zero elsewhere. Note that trading decreases the number of coconuts by two, whereas only one coconut can be harvested in a single transition, as implemented by the intuitive model. If, instead, we choose P (e − 1|e) = e(e − 1)/N (N − 1) and P (e − 2|e) = 0, we obtain an MC representation that is aligned with the AM2 scheme, in which only one agent consumes a nut on trade.
An MC representation can be useful because it provides a good understanding of the finite-size fluctuations around the equilibrium point. In Fig. , we compare the stationary statistics of the intuitive model to the stationary vector of ( ) for the homogeneous strategy c = 0.4. Note, however, that the peak in the MC stationary vector and the solution of the DE ( ) do not match precisely. This is due to the fact that the trading probabilities (with ε² = e²/N²) in the mean-field formulation do not explicitly exclude self-trading, whereas self-trading is excluded in the MC formulation as well as in the ABM (where the respective probability is e(e − 1)/N (N − 1)). While the difference between the two probabilities is not significant if N is large, there is a notable effect for N = 100.
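For readers who want to reproduce such stationary vectors, the finite chain on {0, . . . , N} can be built and solved numerically. The sketch below is our own illustrative code for the IM transition rates described above (one agent may harvest, two agents may trade and consume):

```python
import numpy as np

def stationary_vector(N=100, c=0.4, f=0.8, c_min=0.3, c_max=0.5):
    """Stationary distribution of the coconut count e for the IM scheme."""
    g = f * (c - c_min) / (c_max - c_min)
    P = np.zeros((N + 1, N + 1))
    for e in range(N + 1):
        up = (1.0 - e / N) * g if e < N else 0.0          # one agent climbs
        down = e * (e - 1) / (N * (N - 1)) if e >= 2 else 0.0  # a pair trades
        if e < N:
            P[e, e + 1] = up
        if e >= 2:
            P[e, e - 2] = down
        P[e, e] = 1.0 - up - down
    # stationary vector = left eigenvector of P for eigenvalue 1
    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmax(np.real(vals))])
    return np.abs(pi) / np.abs(pi).sum()
```

The mean of this distribution sits close to the IM fixed point of the corrected DE, while its width quantifies the finite-size fluctuations around it.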

Heterogeneous Strategies
We now relax the assumption of homogeneous strategies by assigning to each agent a random individual strategy c i drawn from a certain probability distribution. As before, the strategies are fixed for the entire simulation. The four different scenarios considered in the second part of this section differ with respect to the distribution from which the individual strategies are drawn (in general, from the interval [c min , c max ]). Before looking at these specific scenarios, however, we derive a correction term that accounts for the effect of strategy heterogeneity.
Namely, if strategies are different, we can expect that those agents in the population with a lower c i will climb less often and will therefore hold a coconut (s i = 1) less often. That is, there is a correlation between the agent strategy c i and the probability Pr(s i = 1) that an agent has a coconut.
To account for this, consider that the rate P (e + 1|e) that some agent of the population climbs from one time step to the next is given by P (e + 1|e) = (f /N ) Σ i (1 − s i ) G(c i ). The homogeneous case ( ) is recovered because G(c i ) = G(c) is equal for all agents and can be taken out of the sum. For heterogeneous strategies this is not possible, but we can arrive at a similar expression by formulating P (e + 1|e) in terms of expected values (denoted as ⟨·⟩): P (e + 1|e) = f [(1 − ⟨s i ⟩)⟨G(c i )⟩ − σ[s i , G(c i )]], where σ[s i , G(c i )] is the covariance between agent states s i and individual climbing probabilities G(c i ). With ⟨G(c i )⟩ = G(⟨c i ⟩) and ⟨s i ⟩ = e/N (the former being true due to the linearity of G and the latter by definition), this corresponds to the original term ( ) except for the additional covariance term σ[s i , G(c i )].
Note that the covariance depends on the individual agent states s i and climbing probabilities G(c i ), so that the transition rate is no longer a pure function of the macroscopic average state.
This correction term accounting for the correlation between agent strategy and state should also be included in the infinite-size formulation ( ). This is accomplished in Olbrich et al. ( ). Note that the covariance term σ(t) in Eq. ( ) is written here as a time-dependent parameter because its computation depends on the actual microscopic agent configuration at time t. In other words, this is no longer a closed description in the macroscopic variable ε, as it is coupled to the evolution of the heterogeneity in the system contained in the covariance term.
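The decomposition of the climbing rate into a mean-field part and a covariance correction can be verified numerically on any microscopic configuration. In the following sketch (names are ours), the exact rate and the corrected mean-field expression agree to machine precision:

```python
import numpy as np

def climb_rate(s, G_vals, f=0.8):
    """Per-step climbing probability of a heterogeneous configuration, computed
    exactly and via the mean-field decomposition with covariance correction."""
    s = np.asarray(s, dtype=float)
    G_vals = np.asarray(G_vals, dtype=float)
    exact = f * np.mean((1.0 - s) * G_vals)            # (f/N) sum_i (1 - s_i) G(c_i)
    eps = s.mean()
    sigma = np.mean(s * G_vals) - eps * G_vals.mean()  # cov(s_i, G(c_i))
    corrected = f * ((1.0 - eps) * G_vals.mean() - sigma)
    return exact, corrected, sigma
```

Expanding the product shows why the two expressions are identical: f ⟨(1 − s)G⟩ = f (⟨G⟩ − ⟨s⟩⟨G⟩ − σ) = f ((1 − ⟨s⟩)⟨G⟩ − σ).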

Time Evolution of the Heterogeneity Term
One way to address this issue is to derive an additional differential equation that describes the time evolution of σ and is coupled to ( ). To this purpose, let us denote the probability for an agent i to be in state s i = 1 as Pr(s i = 1) := p i . The evolution of this probability is described by ṗ i = f G(c i )(1 − p i ) − p i ε, which is very similar to the macroscopic equation ( ), with the first term the probability of climbing and the second that of trading. The correction σ can then be written in terms of the p i and the individual climbing probabilities G(c i ), where ⟨G(c i )⟩ is the mean climbing probability. Differentiating σ with respect to time and substituting ( ) and ( ) yields an equation containing σ 2 , a new higher-order covariance term between the second power G²(c i ) and the state probability p i .

More generally, let us denote the mean over the k-th powers of G(c i ) as ⟨G^k⟩ and write σ k for the covariance between G^k(c i ) and the state probabilities p i . When we express the dynamical equation for the correction term and then for the higher moment that emerges at each step, we arrive at an infinite hierarchy of equations. Therefore, we see that a relatively simple form of heterogeneity, namely a fixed heterogeneous cost threshold, leads to a rather complicated system when we aim at a closed description in terms of macroscopic or average entities. On the other hand, if G(c i ) is strictly below one, the higher moments σ k tend to zero as k increases, which makes it possible, under certain conditions, to close the system. However, this goes beyond the scope of this paper and will be addressed in future work.

Estimation of Correction Term from Simulations
Another way to illustrate the usefulness of the correction term derived for the heterogeneous model is to estimate σ from simulations. To do so, we run the ABM with a certain strategy distribution and compute σ(t) = cov(s t i , G(c i )) at each step. We then compute the time average over the simulation steps, which we denote by σ̄, and replace the correction terms in ( ) and ( ) by this average heterogeneity term. The respective transition probability for the Markovian description, and likewise the fixed point solution obtained from ( ), then use σ̄ in place of σ(t), and we can compute the corrected stationary vector on that basis. These expressions are then confronted with simulations initialized according to different distributions of individual strategies. Namely, we consider four cases:

1. strategies uniformly distributed within [c min , c max ];
2. two different strategies distributed at equal proportion over the population;
3. the probability of a strategy c i decreases linearly from c min to c max and reaches zero at c max ;
4. the probability of a strategy c i decreases according to a Γ-distribution with shape 1 and scale 1/5.

Note that the first two cases are chosen such that the mean climbing probability is ⟨G⟩ = 0.5, whereas lower thresholds, and therefore a lower average climbing probability, are implemented in the latter two. While there are clear deviations for the descriptions without correction (the homogeneous case), the stationary statistics of the model are well matched after the covariance correction is applied. This shows that very effective macroscopic formulations of heterogeneous agent systems may be possible by including correction terms that efficiently condense the actual micro-level heterogeneity in the system.

Adaptive Strategies and Learning
Having gained an understanding of how to deal with heterogeneous strategies in the finite-size Coconut Model, we now turn to an adaptive mechanism by which the strategies are endogenously set by the agents. As in Section , we follow in this implementation the conception of Diamond ( ) as closely as possible. This requires, first, that the threshold c i trades off the cost of climbing against the expected future gain of earning a coconut. In other words, agents have to compare the value (or expected performance, if you wish) of having a coconut, V t i (1), with the value V t i (0) of staying without a nut. If the difference between the expected gain from harvesting at time t and that of not harvesting (V t i (1) − V t i (0)) is larger than the cost of the tree c tree , agents can expect a positive reward from harvesting a nut now. Therefore, following Diamond ( ), it is reasonable to set c t i = V t i (1) − V t i (0). Now, how do agents arrive at reliable estimates of V t i (1) and V t i (0)? We propose that they do so by a simple temporal difference (TD) learning scheme, which has been designed to solve dynamic programming problems as posed in the original model. Note that for single-agent Markov decision processes, temporal difference schemes are proven to converge to the optimal value functions (Sutton & Barto ). In the Coconut Model with agents updated sequentially, it is reasonable to hypothesize that we arrive at accurate estimates of V t i (1) and V t i (0) as well.
However, it is also important to note that the decision problem as posed in Diamond ( ) is not the only possible formulation. Indeed, the model assumes that agents condition their action only on their own current state, neglecting previous trends and information about other agents, the consideration of which might lead to a richer set of solutions. In our case agents do not learn this dependence explicitly, which means that they will only learn optimal stationary strategies. The consideration of more complex (and possibly heterogeneous) information sets certainly points to interesting extensions of the model. However, we think it is useful to first understand the basic model and relate it to the available theoretical results, as this will also be needed to understand the additional contributions of model extensions.

Learning the Value Functions by Temporal Differences
The learning algorithm we propose is a very simple value TD scheme. Agents use their own reward signal r t i to update the values V t i (1) and V t i (0) independently of what other agents do. In each iteration, only the value associated with the current state of the chosen agent is updated.
Note that the discount factor γ as defined for the time-continuous DE system is rescaled as γ r = exp(−γ/N ) for the discrete-time setting, in order to account for the finite simulation with asynchronous update in which only one out of N agents is updated in each time step. The iterative update of the value functions then follows the standard TD rule, where V t i (1) (respectively V t i (0)) is updated only if agent i has been in state 1 (respectively 0) in the preceding time step.
The idea behind this scheme, and TD learning more generally, is that the error between subsequent estimates of the values is reduced as learning proceeds, which implies convergence to the true values. The form in which we implement it here is probably the simplest one; it does not involve update propagation using eligibility traces, which are usually integrated to speed up the learning process (Sutton & Barto ). In other words, agents update only the value associated with their current state s t i . While this simplifies the mathematical description (the evolution depends only on the current state), we think it is also plausible as an agent decision heuristic.
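A minimal sketch of the update rule might look as follows. This is an assumed TD(0) form consistent with the description above; the function names and the example usage are ours, not taken from the original implementation:

```python
import math

def td_update(V, s_prev, s_new, reward, alpha=0.05, gamma=0.1, N=100):
    """TD(0) update of one agent's value table V = [V(0), V(1)]: only the value
    of the preceding state s_prev is adjusted towards reward + gamma_r * V(s_new).
    gamma_r = exp(-gamma / N) rescales the continuous-time discount rate for the
    asynchronous setting in which one out of N agents is updated per step."""
    gamma_r = math.exp(-gamma / N)
    V[s_prev] += alpha * (reward + gamma_r * V[s_new] - V[s_prev])

def strategy(V):
    """Cost threshold implied by the current value estimates: c_i = V(1) - V(0)."""
    return V[1] - V[0]
```

An agent that just climbed would, for instance, call td_update with s_prev = 0, s_new = 1 and a negative reward equal to the climbing cost; an agent that just traded would use the consumption utility y as reward.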
All in all, the model implementation combines the iteration scheme of the previous sections with the TD value update described above.
Note that for trading we adopt the mechanism introduced as AM2 in Section . If not stated otherwise, the simulation experiments that follow have been performed with the following parameters. The interval from which the tree costs are drawn is given by c max = 0.5 and c min = 0.3. A strategy c i larger than c max hence means that the agent accepts any tree, and c i < c min that no tree is accepted at all. The rate of tree encounter is f = 0.8 and the utility of coconuts is y = 0.6. We continue to consider a relatively small system of agents, and the learning rate is α = 0.05. The parameter on which much of the analysis concentrates is the discount rate γ, with small values indicating farsighted agents, whereas larger values discount future observations more strongly. The system is initialized (if not stated otherwise) with ε 0 = 0.5, V 0 i (1) = y and V 0 i (0) = 0 for all agents, such that c 0 i = y > c max .
The first part of this paper (exogenously fixed strategies) has shown that the ABM reproduces well the fixed point curve obtained for the coconut dynamics by setting ε̇ = 0. Here we want to find out whether the simulation with learning also recovers the optimal strategy curve ċ = 0. Note that the last value γ = 0.3 is so large that the ε- and c-curves do not intersect, so that there is actually no fixed point solution.
In order to check these curves in the simulations, we fix the expected probability of finding a trading partner at b(ε) = ε fix . Independent of the actual level of coconuts in the population, an agent finds a trading partner with that probability, consumes the coconut and derives a reward of y. In Fig. , for each ε fix = 0.0, 0.05, 0.1, . . . , 1.0 the ABM has been run a single time for a fixed number of steps, and the last system configuration (namely, the c i at the final step) is used to compute the mean strategy c, which is then plotted against ε fix .
The model generally matches the theoretical behaviour, especially when γ is small (farsighted agents). However, for γ = 0.2 and γ = 0.3 we observe noticeable differences between the simulations and the fixed point curve of the theoretical model. Note that the number of coconuts (which we fix for the trading step) actually also affects the probability with which an agent is chosen to climb, and that the actual level of coconuts in the simulation is generally different from ε fix . This might explain the deviations observed in Fig. . Setting up the experiment so that the level of coconuts is constant at ε fix is, however, not straightforward, because an additional artificial state-switching mechanism would have to be included that has no counterpart in the actual model.
However, Fig. indicates that agents which adapt according to TD learning align with the theoretical results in converging to the same (optimal) strategy for a given coconut level. The next logical step is to compare the overall behaviour of the ABM with learning to the theory.

Overall Fixed Point Behaviour with Learning
We first checked the overall convergence behaviour of the ABM as a function of γ and compared it to the fixed point solution of Diamond ( ); see also Lux ( ). There are two interesting questions here:
1. What happens as we reach the bifurcation value γ > γ*, at which the two fixed point curves (for the coconut level and for the strategy) cease to intersect?
2. In the parameter space where they intersect, which of the two solutions is actually realized by the ABM with TD learning?
In the simulations, the bifurcation takes place at slightly lower values of γ. This is probably related to the deviations observed in Fig. . In fact, further experiments revealed that the learning rate α governing the fluctuations of the value estimates plays a decisive role (the larger α, the smaller the bifurcation point). The larger α is, the more likely a perturbation of the values of an agent i pushes c_i < c_min, meaning that agent i does not climb any longer. Besides this small deviation, however, Fig. shows that on the whole the ABM reproduces the theoretical results with considerable accuracy.
Regarding the second question - that is, equilibrium selection - it seems that the only stable solution of the simulated dynamics is the upper fixed point, sometimes referred to as the «optimistic» solution. We will confirm this below by providing numerical arguments for the instability of the lower fixed point through a series of simulation experiments.

Instability of the Lower Fixed Point
Previous experiments indicated that the lower fixed point derived in the original system is generally unstable under the learning dynamics. In this section, we present some further results to confirm this observation by initializing the model at the lower fixed point. We concentrate again on the parameterization used in the previous sections with f = 0.8, y = 0.6, c_min = 0.3, c_max = 0.5, climbing costs uniformly distributed in [c_min, c_max], and stick to a discount rate γ = 0.1. As shown in Fig. , the respective «pessimistic» equilibrium solution is given by a coconut level of approximately 0.102 and c* ≈ 0.303, just slightly above c_min. However, for a true fixed point initialization of the simulation model, we have to use the respective values at the fixed point, V*(1) and V*(0), to initialize V^0(1) and V^0(0). They can be obtained by solving the three-dimensional system from which Eq. ( ) has been derived (Diamond ). Fig. shows simulations of the ABM with TD learning for various initial conditions close to the low fixed point (shown by the dashed dark line).
Each curve in the plot is an average over simulation runs. We first look at the bold curve corresponding to the trajectory starting at the lower fixed point (see indication in the figure). It is clearly repelled from the low fixed point in the direction of the «optimistic» solution; however, it fails to reach the upper state. The figure also shows trajectories that are initialized with slightly higher V^0(1), leading to a slight increase of the initial strategy c^0. The larger this initial deviation, the closer the trajectories converge to the expected «optimistic» strategy c* ≈ 0.44, but for a deviation smaller than 0.005 the asymptotic behaviour of the learning dynamics does not converge to this value. Interestingly, simulations initialized slightly below, with an initial deviation of V^0(1) = V*(1) − 0.001, also evolve in the direction of the upper solution, however settling at a still smaller final strategy value.
While this shows that the lower fixed point (or at least a point very close to it) is repelling, this effect of convergence to almost arbitrary states in between the two expected fixed points seems somewhat surprising at first glance. One possible explanation is that TD learning schemes may converge to suboptimal solutions if agents do not sufficiently explore the space of possibilities. In order to check whether this is the reason for the unexpected convergence behaviour when starting with the low initial values, we integrated a form of exploration by adding a small amount of noise to the strategies c_i each time after the agents computed their new values. This can be seen as if agents are not completely perfect in determining the values and the respective value difference V(1) − V(0), or just as well that they believe to be able to form estimates only up to a given finite precision. In the simulations shown below, a random value uniformly distributed between −0.0015 and +0.0015 has been added.
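The noise injection can be sketched as follows (a minimal stand-in; the amplitude ±0.0015 is taken from the text, while the function name and list representation are our own):

```python
import random

EPS = 0.0015  # agents' value estimates are only precise up to this amount

def perturb_strategies(strategies, rng=random.Random(42)):
    """Add uniform noise in [-EPS, +EPS] to each strategy after the
    value update, as a minimal form of exploration."""
    return [c + rng.uniform(-EPS, EPS) for c in strategies]

# Example: perturb a population sitting exactly at the pessimistic strategy
noisy = perturb_strategies([0.303] * 5)
```

Calling this once per time step, after the value update, implements the exploration used in the simulations below.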
Fig. shows the effect of this «exploration» by comparing the time evolution of the model, starting exactly in the theoretical low fixed point, for different system sizes. In comparison with Fig. , where only N = 100 has been considered and the initial deviation from the fixed point has been varied across realizations, we now observe an S-shaped curve that approaches the upper fixed point and stabilizes there with high accuracy. This is observed for all N.
As the size of the system increases, the initial period in which the system stays close to the lower fixed point increases. However, as shown on the right of Fig. , the differences between the learning curves of systems of different size vanish when time is rescaled by the number of agents, such that one time step accounts for N individual updates. This provides further evidence for the instability of the lower fixed point and shows that it is not merely a finite size effect but inherent in the agent system with procedural rationality based on TD learning.
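The rescaling can be expressed as a simple mapping of the raw step index; the convention that one rescaled time unit equals N individual updates is taken from the text, and the example values are illustrative only:

```python
def rescale_time(steps, n_agents):
    """Map raw step indices to rescaled time, where one unit of
    rescaled time corresponds to N individual agent updates."""
    return [t / n_agents for t in steps]

# Example: curves recorded every N steps for N = 100 and N = 400
# fall onto the same rescaled time axis.
t100 = rescale_time(range(0, 1001, 100), 100)
t400 = rescale_time(range(0, 4001, 400), 400)
```

Plotting the learning curves against this rescaled axis is what makes the curves of different system sizes collapse onto each other.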
To summarize the analysis performed in this section, we computationally constructed the phase diagrams of the dynamics with learning. To this end, we performed a suite of systematic computations with samples of initial conditions in the plane spanned by the initial coconut level in [0, 1] and c^0 ∈ [c_min, c_max]. We computed 26 × 26 samples, where c^0 was determined by setting V^0(0) = 0 and V^0(1) = c^0 homogeneously for the entire population. The initial level of coconuts was set randomly, such that the probability for each agent to have a coconut in the beginning equals the chosen initial level. For each initial combination, we ran the simulation for a fixed number of steps and compared the initial point with the respective outcome. Fig. shows our results for a discount rate of γ = 0.1 (l.h.s.) and γ = 0.2 (r.h.s.).
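The sampling grid for the phase diagram can be sketched as follows (our own reconstruction of the 26 × 26 design described above; function and variable names are not from the paper):

```python
C_MIN, C_MAX = 0.3, 0.5
K = 26  # samples per axis

def sample_grid(k=K):
    """Build the 26 x 26 grid of initial conditions: pairs of
    (initial coconut level in [0, 1], initial strategy c0 in
    [c_min, c_max]). For each point, the agent values would be set
    homogeneously to V0(0) = 0 and V0(1) = c0."""
    levels = [i / (k - 1) for i in range(k)]
    strategies = [C_MIN + (C_MAX - C_MIN) * i / (k - 1) for i in range(k)]
    return [(lvl, c0) for lvl in levels for c0 in strategies]

grid = sample_grid()
```

Running the simulation from each of these 676 initial points and recording the displacement to the final point yields the vector field shown in the phase diagrams.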
The vector field indicates convergence to a state close to the upper fixed point for most of the initial conditions. For γ = 0.1, this is true even for very small initial strategies c^0 < 0.303. However, note that this point is very close to c_min and that the sampling does not resolve the region around the low fixed point well enough. For γ = 0.2, where the strategy value in the lower fixed point increases to c* ≈ 0.316, the dynamics around that point become visible. In this case, we observe that initial strategies below this value lead to convergence to c < c_min, that is, to the situation in which agents do not climb any longer (and therefore the coconut level drops to zero). However, if the initial level of coconuts is sufficiently high, the system is capable of reaching the stable upper solution, because there is at least one instant of learning that having a nut is profitable (an increase of V(1)) for agents initially endowed with a nut.
Finally, a close-up view of this region is provided in Fig. for γ = 0.2. It shows that the lower fixed point acts as a saddle under the learning dynamics. As indicated above, the exact fixed point values of the coconut level and of c* are slightly different for the DE system and the learning agents model, which may be attributed to small differences between the models, such as the explicit exclusion of self-trading (see Section ) or the discrete learning rate α (this section).

Outline and Conclusions
This paper makes four contributions. First, it develops a theory-aligned agent-based version of Diamond's coconut model (Diamond ). In the model, agents have to make investment decisions to produce some good and find buyers for that good.
Step by step, we analysed the effects of the single ingredients of that model - from homogeneous to heterogeneous to adaptive strategies - and related them to the qualitative results obtained from the original dynamical systems description. We computationally verified that the overall behaviour of the ABM with adaptive strategies aligns, with considerable accuracy, with the results of the original model. The main outcome of this exercise is the availability of an abstract baseline model for search equilibrium which allows the analysis of more realistic behavioural assumptions, such as trade networks, heterogeneous information sets and different forms of bounded rationality, while containing the idealized solution as a limiting case.
Secondly, this work provides insight into the effects of micro-level heterogeneity on the macroscopic dynamics and shows how heterogeneous agents can be taken into account in aggregate descriptions. We derived a heterogeneity correction term that condenses the heterogeneity present in the system and showed how this term should be coupled to the mean-field equation. These mathematical arguments show that a full characterization of the system with heterogeneity leads to an infinite-dimensional system of differential equations, the analysis of which will be addressed in the future. In this paper, we have provided support for the suitability of the heterogeneity term by simulation experiments with four different strategy distributions. We envision that the heterogeneity correction may be useful for other models as well, such as opinion dynamics with heterogeneous agent susceptibilities.
The third contribution is the introduction of temporal difference (TD) learning as a way to address problems that involve inter-temporal optimization in an agent-based setting. The coconut model serves this purpose so well because the strategy equation in the original paper is based on dynamic programming principles that are also at the root of this branch of reinforcement learning. Due to this common foundation, we arrive at an adaptive mechanism for endogenous strategy evolution that converges to one of the theoretical equilibria but provides, in addition, a means to understand how (and if) this equilibrium is reached from an out-of-equilibrium situation. Such a characterization of the model dynamics is not possible in the original formulation.
Our fourth contribution is providing new insights into equilibrium selection and the stability of equilibria in the coconut model. Under learning dynamics, only the upper «optimistic» solution with a high coconut level (high productivity) is realized. Furthermore, convergence to this equilibrium takes place for a great proportion of out-of-equilibrium states. In fact, the phase diagrams presented at the end of the previous section show that in a system with farsighted agents (γ = 0.1) the market failure equilibrium (no production, no trade) is reached only if agents are exceedingly pessimistic. If agents are less farsighted (γ = 0.2), this turning point increases slightly and makes market failure probable if the production level (f G(c_i^t)) is currently low for some reason.
However, we do not want to make general claims about the absence of cyclic equilibria in the artificial search and barter economy that the coconut model exemplifies. It is possible - even likely - that richer behaviour is obtained when agents learn not only based on their own state but take into account information about the global state of the system, trends, or the strategies of others. This paper has been a necessary first step to address such questions in the future.