The Explanation of Social Conventions by Melioration Learning

In line with previous research, the evolution of social conventions is explored by n-way coordination games. A convention is said to be established if the decisions of all actors become synchronised over time. In contrast to the earlier studies, an empirically well-grounded process of reinforcement learning is used as the behavioural assumption. The model is called melioration learning. It is shown by agent-based simulations that melioration enables the actors to establish a convention. Besides the payoffs of the coordination game, the network structure of interactions affects the actors' ability to coordinate their choices and the speed of convergence. The results of melioration learning are compared to predictions of the Roth-Erev model.


Introduction
Social conventions play decisive roles in everyday life. These rules of conduct assist in social interactions by prescribing the choice of one particular alternative if several are available. Examples are the rule of right- or left-hand driving in a country, the way of greeting among members of a cultural group, or the usage of the same software in a company. In each case, multiple alternatives are feasible, but the agreement on one alternative is advantageous. Therefore, conventions differ from other social norms by being self-preserving once they are established. Compliance is sought by everyone because of an automatic punishment after deviation.
A more complicated issue in the explanation of social conventions is their initiation. Schelling ( ) addressed this difficulty in his study of social conflict. Due to limited communication and perception, the initial agreement on a behaviour can be problematic although the common behaviour is in everyone's best interest (Schelling , p. ). In any situation without central authority, the actors must coordinate their choices, and the outcome depends on the available information and the actors' way of decision-making. Hence, the main question in the study of social conventions concerns their establishment.
Table : A sample coordination game
Following the work of Schelling, the evolution of social conventions has usually been modelled by n-way coordination games (e.g., Young ). These games refer to the sequential play of two-person coordination games with multiple partners. A sample two-person coordination game is shown by Table . The two actors, who are denoted by x and y, must decide between alternatives A and B. Given a pair of decisions, the table defines the rewards. Nothing is gained if the actors choose different alternatives. If both actors take the same alternative A or B, they receive a reward of 10 or 6, respectively.
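The sample game can be written down in a few lines. The following is a minimal sketch that encodes the rewards stated above (10 for a joint choice of A, 6 for a joint choice of B, 0 otherwise) and enumerates the pure Nash equilibria; the names PAYOFFS, ACTIONS and pure_nash_equilibria are illustrative and not taken from the paper.

```python
# Rewards of the sample game: (choice of x, choice of y) -> (reward of x, reward of y).
PAYOFFS = {
    ("A", "A"): (10, 10),
    ("A", "B"): (0, 0),
    ("B", "A"): (0, 0),
    ("B", "B"): (6, 6),
}
ACTIONS = ("A", "B")


def pure_nash_equilibria(payoffs, actions):
    """Return all action pairs from which neither actor gains by deviating alone."""
    equilibria = []
    for ax in actions:
        for ay in actions:
            rx, ry = payoffs[(ax, ay)]
            # x cannot improve by switching while y keeps ay, and vice versa.
            x_is_best = all(payoffs[(alt, ay)][0] <= rx for alt in actions)
            y_is_best = all(payoffs[(ax, alt)][1] <= ry for alt in actions)
            if x_is_best and y_is_best:
                equilibria.append((ax, ay))
    return equilibria


print(pure_nash_equilibria(PAYOFFS, ACTIONS))  # [('A', 'A'), ('B', 'B')]
```

Both coordinated outcomes are equilibria, which is why the game alone does not determine which convention emerges.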
In other words, the network structure had no effect on the outcome of the n-way game. This result was due to random mistakes, which the actors made with strictly positive probability. Without mistakes, the network structure affects the outcome (Buskens & Snijders ), and risk-dominated equilibria also occur. Furthermore, it is possible that two different conventions coexist in some networks (see also Berninghaus & Schwalbe ).
Overall, the results suggest that the particular combination of behavioural assumptions and network structure is relevant. In regard to behavioural assumptions, the models of all previously mentioned studies can be characterised as "myopic best reply" (Berninghaus & Schwalbe , p. ). Given this model, actors learn about the partners' behaviour from past interactions and use this knowledge to choose a best action given the reward structure of the game. This means that information about the situation and past actions of the partners was presumed to be available in all former studies.
While these are reasonable assumptions in most situations, this paper asks about the theoretical implications of dropping them: Is it possible to explain the emergence of conventions if actors are aware of neither the payoff structure nor the choices of other actors? The contribution of this paper is, thus, mainly theoretical, but with relevance to future empirical research. For instance, if the results differ from previous studies, conclusions can be drawn from empirical macro-level observations to micro-level assumptions.
A behavioural model in which an actor's decision is based only on her own previous actions and rewards is called completely uncoupled (e.g., Babichenko ). Despite this limiting setting, some learning models still ensure the convergence of behaviour to Nash equilibria (Foster & Young ; Germano & Lugosi ; Young ; Babichenko ). For example, Pradelski & Young ( ) introduced a model of completely uncoupled learning that yields welfare-maximising Nash equilibria in two-person coordination games.

However, the behavioural model of Pradelski & Young ( ) was designed to converge to equilibria. The assumptions were not justified by empirical observations or psychological experiments. In contrast, most psychological models of learning were developed to represent the development of human behaviour as realistically as possible (e.g., Staddon ). Some popular instances of realistic models implement a form of learning known as operant conditioning or reinforcement learning (Sutton & Barto ; Staddon & Cerutti ). These models are completely uncoupled but do not necessarily converge to an equilibrium in interactive situations.

In this paper, a simple and empirically grounded model of reinforcement learning is used to analyse the behaviour in n-way coordination games. Following past research, this model is called melioration learning. The details are given in the next section. Afterwards, it is shown that actors who learn by melioration are able to coordinate their decisions in n-way coordination games and, hence, to establish a convention. The long-term outcome is a risk-dominant equilibrium of the two-person stage game if one exists. The results are compared to the predictions of another, well-known model of reinforcement learning: the Roth-Erev model (Roth & Erev ). While the outcomes are qualitatively similar, the models differ in their speed of convergence, especially in regard to the effects of the network structure.

The model
Unlike previous formal representations of melioration (Brenner & Witt ; Sakai et al. ; Loewenstein ), this paper uses a model that is perfectly consistent with the ideas of Vaughan & Herrnstein ( ) and builds on an algorithm of decision-making that is called ε-greedy strategy (Sutton & Barto , p. ). This strategy takes a parameter ε ∈ (0, 1), which is called exploration rate and specifies the probability of an alternative being chosen uniformly at random. With probability 1 − ε, an alternative with the currently highest value is selected. If multiple alternatives have the highest value, one of them is chosen randomly. In melioration learning, the value of an alternative is the average of the corresponding past rewards.
Figure : The situation of sequential decision-making: After every choice X_t from a set of choice alternatives E, a reward R_t ∈ [0, ∞) is obtained.
Figure illustrates the decision-making process, which takes place along discrete time steps t ∈ N. Given a finite set of choice alternatives E, actions are emitted by the choice of an element X_t ∈ E from the set of alternatives. After every decision, a non-negative reward R_t ∈ [0, ∞) is received from the environment and processed by the actor. The information processing is very simple and specified by algorithm .
Algorithm : The melioration learning algorithm
Require: exploration rate ε ∈ (0, 1), set of alternatives E
    t ← 0
    if ε > random number between 0 and 1 (uniformly distributed) then
        choose a random action X_t ← e ∈ E using a uniform distribution
    else
        choose action X_t ← e such that e ∈ arg max_{j∈E} V_t(j) (uniformly at random if multiple candidates)
In algorithm , an actor is assumed to maintain a set of values {V_t(j)}_{j∈E} that are iteratively updated. Initially, all values are set to zero. A set of frequencies {K_t(j)}_{j∈E} keeps track of the number of choices of each alternative. The reward realisation y = R_t is used to modify the value of the chosen alternative such that it gives the average of all past rewards of e.
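The choice and update steps described above can be sketched in Python. This is an illustration under stated assumptions rather than the paper's implementation: the class name MeliorationLearner and its methods are hypothetical, and the update step is written as an incremental average because the text only states that the value equals the average of all past rewards of the chosen alternative.

```python
import random


class MeliorationLearner:
    """Epsilon-greedy choice over running averages of past rewards (illustrative)."""

    def __init__(self, alternatives, epsilon, rng=None):
        self.alternatives = list(alternatives)
        self.epsilon = epsilon
        self.rng = rng or random.Random()
        self.values = {e: 0.0 for e in self.alternatives}  # V_t(j), initially zero
        self.counts = {e: 0 for e in self.alternatives}    # K_t(j), number of choices

    def choose(self):
        # With probability epsilon: explore uniformly at random.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.alternatives)
        # Otherwise: pick an alternative with the highest value, ties broken randomly.
        best = max(self.values.values())
        candidates = [e for e in self.alternatives if self.values[e] == best]
        return self.rng.choice(candidates)

    def update(self, chosen, reward):
        # Incremental form of the average of all rewards received for `chosen`.
        self.counts[chosen] += 1
        self.values[chosen] += (reward - self.values[chosen]) / self.counts[chosen]


learner = MeliorationLearner(["A", "B"], epsilon=0.1, rng=random.Random(42))
for reward in (10, 0):
    learner.update("A", reward)
print(learner.values)  # {'A': 5.0, 'B': 0.0}
```

The incremental form avoids storing the reward history: only the current average and the count per alternative are kept, which matches the two sets of quantities the actor is assumed to maintain.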
Comparison with other models of learning
In economics, Brenner ( ) distinguished two types of learning: reinforcement learning and belief learning (a similar categorisation is found in Camerer , ch. ). Melioration is categorised as reinforcement learning because it is less cognitively demanding than belief learning models. The differences are elaborated in the following. Additionally, a comparison with other models of reinforcement learning is given.

Belief learning models
In algorithm , an actor learns the value of an alternative, which constitutes a belief about the environment. Since the actor responds to these beliefs in an optimal way, melioration learning can be seen as a rudimentary form of belief learning. However, melioration learning is different from most belief learning models. In the latter, the formation of beliefs generally exceeds the level of actions. Instead, the values of actions are externally given, and beliefs about the reinforcement mechanism or the behaviour of other actors are acquired. For example, in a two-person game-theoretic situation, the actors may know the structure of the game and learn the strategy of the opponent. A general belief learning algorithm for this situation is given by the following pseudocode (Shoham & Leyton-Brown , p. ):
    Initialize beliefs about the opponent's strategy
    repeat:
        Play a best response to the beliefs
        Observe the opponent's actual choice and update beliefs accordingly
One example of belief learning is fictitious play: "in fictitious play, an agent believes that his opponent is playing the mixed strategy given by the empirical distribution of the opponent's previous actions" (Shoham & Leyton-Brown , p. ). In other words, the actor remembers the decisions of the opponent, forms the corresponding relative frequencies, and chooses an action with the highest expected reward assuming that the relative frequencies resemble the opponent's probabilities of choice. Fictitious play differs from melioration learning because the latter ignores the behaviour of the opponent and the expected future reward of an action. Instead, it focuses on the average rewards of past actions, and no mental model of the situation is built.
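To make the contrast concrete, one step of fictitious play might look as follows. The sketch assumes that the actor knows her own reward for every pair of actions and has observed the opponent's past choices; the function name, the uniform belief before any observation, and the tie-breaking by order of the actions are illustrative assumptions.

```python
from collections import Counter


def fictitious_play_choice(own_reward, actions, opponent_history):
    """Best response to the empirical distribution of the opponent's past actions."""
    counts = Counter(opponent_history)
    total = sum(counts.values())
    if total == 0:
        # Assumption: uniform beliefs before anything has been observed.
        beliefs = {a: 1 / len(actions) for a in actions}
    else:
        beliefs = {a: counts[a] / total for a in actions}
    # Expected reward of each own action against the believed mixed strategy.
    expected = {
        own: sum(beliefs[opp] * own_reward[(own, opp)] for opp in actions)
        for own in actions
    }
    return max(actions, key=lambda a: expected[a])


# Own rewards in the sample coordination game: (own action, opponent's action) -> reward.
OWN_REWARD = {("A", "A"): 10, ("A", "B"): 0, ("B", "A"): 0, ("B", "B"): 6}

print(fictitious_play_choice(OWN_REWARD, ("A", "B"), ["B", "B", "A"]))  # B
```

Unlike the melioration learner, this rule needs both the payoff structure and the opponent's history as inputs, which is exactly the information that melioration learning does without.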

Reinforcement learning models
Unlike belief learning, reinforcement learning is a simple idea about behavioural change. It can be summarised by Thorndike's law of effect: "pleasure stamps in, pain stamps out". More specifically, behaviour that is followed by a positive experience is likely to reoccur, but, if provoking negative reactions, it diminishes over time. Two examples of reinforcement learning are the Bush-Mosteller and the Roth-Erev model.

The Bush-Mosteller model (Bush & Mosteller ) states that a probability of choice changes linearly in the level of satisfaction. More specifically, an actor chooses an element e ∈ E at time t ∈ N with probability q_e(t) ∈ [0, 1]. After receiving a reward y_t ∈ R for choosing e, the probability q_e(t) is updated as a function of the level of satisfaction σ(y_t).
The dynamics of Bush-Mosteller learning differ from the dynamics of melioration. This is seen when comparing the probabilities of choosing an action. According to algorithm , the probability of choosing e at time t follows from the ε-greedy rule: with probability ε, the choice is uniform over E, and otherwise it is uniform over the alternatives with the highest value. When comparing equations ( ) and ( ), behaviour that follows the Bush-Mosteller model changes more gradually than melioration behaviour. Furthermore, the dynamics of equation ( ) depend on the level of satisfaction σ(y_t). In the past, this function was implemented by comparing the actual outcome to an aspiration level (e.g., Macy & Flache ). This aspiration level is a key factor and significantly affects the long-term behaviour (Macy ; Macy & Flache ; Bendor et al. ).
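The choice probability of melioration learning can be written out from the ε-greedy rule alone. The following expression is a reconstruction from the description of the algorithm, not a quotation of the paper's equation; the symbol M_t for the set of currently best alternatives is introduced here for illustration.

```latex
\Pr(X_t = e) =
\begin{cases}
  \dfrac{1-\varepsilon}{|M_t|} + \dfrac{\varepsilon}{|E|} & \text{if } e \in M_t,\\[1ex]
  \dfrac{\varepsilon}{|E|} & \text{otherwise,}
\end{cases}
\qquad M_t = \arg\max_{j \in E} V_t(j).
```

With two alternatives, ε = 0.1 and a unique best alternative, this gives 0.95 for the best alternative and 0.05 for the other. A single reward that changes the ranking of the values flips these probabilities at once, which is the sense in which melioration behaviour changes less gradually than Bush-Mosteller behaviour.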
The Roth-Erev model describes another form of reinforcement learning and is widely known in economics. Algorithm specifies its basic version (Roth & Erev , p. ). Instead of average values, the actor holds a set of accumulated values {P_t(e)}_{e∈E}, which are called propensities. At each time step, an alternative e ∈ E is chosen with probability P_t(e) / Σ_{j∈E} P_t(j). The parameter ε maintains a level of exploration.
Algorithm : The Roth-Erev learning algorithm
Require: exploration rate ε ∈ (0, 1), set of alternatives E
    t ← 0
    initialise P_1(e) ← 1, for all e ∈ E
    repeat
        t ← t + 1
        choose action X_t ← e ∈ E randomly using the probabilities P_t(e) / Σ_{j∈E} P_t(j)
In the following analysis, the outcomes of melioration learning are compared to the predictions of the Roth-Erev model. In contrast to other learning processes, Roth-Erev is very similar to melioration. Both models take a "mechanistic perspective on learning", which means that "people are assumed to learn according to fixed mechanisms or routines" (Brenner , p. ). Additionally, simple versions with only one parameter (the exploration rate) exist. Other models of reinforcement learning, such as Bush-Mosteller, require additional assumptions or the specification of further parameters.
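A propensity-based learner of this kind can be sketched as follows. The initialisation and the proportional choice rule follow the description above. The update step is an assumption: it uses one common variant of Roth-Erev learning in which the chosen alternative receives the share 1 − ε of the reward and the remaining share ε is spread evenly over the other alternatives. The class and method names are hypothetical.

```python
import random


class RothErevLearner:
    """Choice proportional to accumulated propensities (illustrative variant)."""

    def __init__(self, alternatives, epsilon, rng=None):
        self.alternatives = list(alternatives)
        self.epsilon = epsilon
        self.rng = rng or random.Random()
        self.propensities = {e: 1.0 for e in self.alternatives}  # P_1(e) = 1

    def choice_probabilities(self):
        total = sum(self.propensities.values())
        return {e: p / total for e, p in self.propensities.items()}

    def choose(self):
        weights = [self.propensities[e] for e in self.alternatives]
        return self.rng.choices(self.alternatives, weights=weights, k=1)[0]

    def update(self, chosen, reward):
        # Assumed update rule: rewards must be non-negative so that the
        # propensities remain valid weights.
        others = [e for e in self.alternatives if e != chosen]
        if not others:
            self.propensities[chosen] += reward
            return
        self.propensities[chosen] += (1 - self.epsilon) * reward
        for e in others:
            self.propensities[e] += self.epsilon * reward / len(others)


learner = RothErevLearner(["A", "B"], epsilon=0.1, rng=random.Random(1))
learner.update("A", 10)
print(learner.choice_probabilities())
```

Because propensities accumulate rather than average, a single reward shifts the choice probabilities less and less as the totals grow, which is consistent with the slower convergence reported for Roth-Erev in the simulations.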

Bush-Mosteller and Roth-Erev are just two of many forms of reinforcement learning. Other models are, for example, developed and analysed by computer scientists in a field called RL (Sutton & Barto ). While these models differ from the ones in economics (Izquierdo & Izquierdo ), most of them are completely uncoupled as defined above. Moreover, melioration learning, as given by algorithm , constitutes a relatively trivial instance of an RL method that is called Q-learning (Watkins ). However, unlike the general version of Q-learning, melioration neglects any possible consequences of present actions on future rewards.

As pointed out at the beginning, melioration learning accounts for empirical observations in situations of repeated choice (see also Sakai et al. , p. ). However, generally, there is "tremendous heterogeneity in reports on human operant learning" (Shteingart & Loewenstein , p. ). In particular, melioration seems too simple to accurately represent the complexity of human decision-making (e.g., Barto et al. , p. ), and more sophisticated models of learning have been suggested (e.g., Sutton & Barto ; Sakai et al. ). Nevertheless, it may serve as a valid micro-level model in the study of social phenomena.

Analysis
Given that melioration learning is implemented as an instance of the ε-greedy algorithm with Q-learning, results from previous research can be adopted. On the one hand, algorithm converges to optimal behaviour under certain assumptions of stationarity (Watkins & Dayan ). These situations include Markov decision processes (Bellman ) and, therefore, many non-social settings. Besides stationarity, convergence also requires that the exploration rate decreases sufficiently slowly towards zero, e.g. if a time-dependent exploration rate ε_t := ε / (1 + Σ_{j∈E} K_t(j)) is used instead of ε in algorithm (Jaakkola et al. ).
On the other hand, convergence is not guaranteed if multiple persons interact and reinforcements are contingent upon the decisions of everyone (Nowé et al. , p. ). While equilibria are reached in some two-person games (Sandholm & Crites ; Claus & Boutilier ; Gomes & Kowalczyk ), the behaviour fails to converge in general (Wunder et al. ). Moreover, there is no work about the convergence of Q-learning (and, thus, melioration learning) in situations with more than two actors. Because of the complexity of these situations, the convergence of any learning process is difficult to derive analytically.
In particular, a Markov chain (MC) analysis of the model (e.g., Izquierdo et al. ; Banisch ) is impeded by the adjustment of the values {V_t(j)}_{j∈E} as historical averages. In order to obtain a time-homogeneous Markov chain, each state must contain the sets {V_t(j)}_{j∈E} and {K_t(j)}_{j∈E} of all actors. The resulting chain is not irreducible because the frequencies K_t(j) cannot decrease with time. Only a time-inhomogeneous MC may be irreducible. However, either approach precludes the application of standard techniques. Fortunately, computer simulations can still be employed to analyse the model and derive hypotheses for particular situations.
In the following simulations, algorithm (melioration learning) and algorithm (Roth-Erev) are applied to n-way coordination games. In both cases, the exploration rate is set to ε = 0.1 and kept constant during the whole simulation. This strictly positive rate allows a trade-off between the exploitation of the currently best action and the exploration of alternatives. Because of the finite nature of every simulation run, a continuously decreasing exploration rate would actually hinder the appearance of stable results. The actors would react too slowly to changes in the environment.
If these games are repeatedly played by the same two persons, melioration learning as well as Roth-Erev predict a pure Nash equilibrium. In Figure . More specifically, the frequency of (B, B) increases with b and is higher in game II than game I. The first effect is due to the larger rewards for choosing alternative B. The second effect occurs because, in both learning models, the attachment of values (V_t(·) or P_t(·)) to the alternatives takes place irrespectively of the choice of the other actor. Since ε > 0, the outcomes (A, B) and (B, A) also emerge occasionally. This implies that the value of action B is slightly higher in game II.
The following simulations were run with groups of 50 actors, each of whom interacted with multiple partners. A network specified the structure of interactions. While the vertices of the network represent the actors, an edge exists between two vertices if the corresponding actors repeatedly take part in the same coordination game. The actors do not distinguish between the partners. Only one set of values is maintained, and the partner is not taken into consideration when choosing between the alternatives. This means that, given the games of Table , all members of a connected component of the network should agree on a single alternative in order to avoid the inferior outcomes (A, B) and (B, A).
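The setting can be illustrated by a small, self-contained simulation on a polygon. The sketch makes several assumptions that are not taken from the paper: the rewards are those of the sample coordination game (10 and 6), every edge is played once per round, each actor makes a fresh ε-greedy choice for every game, and all function names are hypothetical. In line with the description above, each actor keeps a single set of values regardless of the partner.

```python
import random

# Own reward in the sample coordination game: (own action, partner's action) -> reward.
REWARD = {("A", "A"): 10, ("B", "B"): 6, ("A", "B"): 0, ("B", "A"): 0}


def ring_neighbours(n):
    """Polygon with n >= 3 vertices: actor i is linked to i - 1 and i + 1 (modulo n)."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def choose(values, epsilon, rng):
    """Epsilon-greedy choice over one actor's average rewards."""
    if rng.random() < epsilon:
        return rng.choice(sorted(values))
    best = max(values.values())
    return rng.choice(sorted(a for a, v in values.items() if v == best))


def simulate(n=10, rounds=200, epsilon=0.1, seed=1):
    """Return each actor's currently best alternative after the given rounds."""
    rng = random.Random(seed)
    neighbours = ring_neighbours(n)
    values = [{"A": 0.0, "B": 0.0} for _ in range(n)]  # one set of values per actor
    counts = [{"A": 0, "B": 0} for _ in range(n)]
    for _ in range(rounds):
        for i in range(n):
            for j in neighbours[i]:
                if i < j:  # play every edge exactly once per round
                    ai = choose(values[i], epsilon, rng)
                    aj = choose(values[j], epsilon, rng)
                    for actor, own, other in ((i, ai, aj), (j, aj, ai)):
                        counts[actor][own] += 1
                        reward = REWARD[(own, other)]
                        values[actor][own] += (reward - values[actor][own]) / counts[actor][own]
    return [max(v, key=v.get) for v in values]


print(simulate())
```

A convention in the sense of the text corresponds to a returned list in which all entries are equal. Whether and how fast this happens depends on the seed, the number of rounds and the scheduling assumption, so the sketch should not be read as a replication of the reported results.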

Figure : Examples of the small-world network with 10 vertices and different parameter settings
In particular, the small-world (β-)model of Watts ( , p. ) is used to specify the structure of interactions. This model has two parameters: the average number of neighbours d ∈ {2, 4, 6, . . . } and the probability of rewiring β ∈ [0, 1]. While the small-world model reproduces only some properties that are found in real networks, it covers two important ones: high clustering and low distances. If β = 0, clustering and distance are maximal. The network resembles a one-dimensional lattice in which each actor has exactly d neighbours (see Figure ). With an increasing β, more and more edges are rewired from a close neighbour to a random actor of the network. In case of β = 1, the average distance is minimal and no clustering remains. Networks with high levels of clustering but still low distances are found for small but strictly positive values of β.
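A network of this kind can be generated with a few lines of Python. The sketch follows the familiar Watts-Strogatz style of construction (a ring lattice whose edges are rewired with probability β); it is an approximation for illustration and may differ in detail from the β-model used in the simulations. The function name small_world is hypothetical.

```python
import random


def small_world(n, d, beta, rng=None):
    """Ring lattice with d neighbours per vertex; each edge is rewired with probability beta."""
    rng = rng or random.Random()
    if d % 2 != 0 or not 0 < d < n:
        raise ValueError("d must be even and satisfy 0 < d < n")
    adjacency = {v: set() for v in range(n)}
    # Lattice: connect every vertex to its d / 2 nearest neighbours on each side.
    for v in range(n):
        for step in range(1, d // 2 + 1):
            w = (v + step) % n
            adjacency[v].add(w)
            adjacency[w].add(v)
    # Rewiring: replace the edge (v, w) by an edge from v to a random non-neighbour.
    for step in range(1, d // 2 + 1):
        for v in range(n):
            w = (v + step) % n
            if w in adjacency[v] and rng.random() < beta:
                candidates = [u for u in range(n) if u != v and u not in adjacency[v]]
                if candidates:
                    u = rng.choice(candidates)
                    adjacency[v].discard(w)
                    adjacency[w].discard(v)
                    adjacency[v].add(u)
                    adjacency[u].add(v)
    return adjacency


polygon = small_world(10, 2, 0.0)
print(polygon[0])  # {1, 9}
```

Rewiring keeps the number of edges constant, so d remains the average number of neighbours, but individual vertices may end up with more or fewer partners and the network may become disconnected for small d.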
The small-world model is well suited to study the effects of restrictive network structures. If β = 0, interactions are limited to rigid clusters. In large networks, this hinders the establishment of a convention because a high number of rounds is required to coordinate the actions between distant parts of the network. If β increases, interactions also take place with distant regions. This may accelerate the agreement on a convention.
In case of melioration learning, groups that play game I with b < 8 or game II with b > 4 are able to coordinate their decisions within the first 1 000 rounds. With further rounds of the simulations, all groups eventually establish a convention as long as there is a risk-dominant equilibrium (game I with b ≠ 10 and game II with b ≠ 4). This is shown in the appendix (Figure ). The results of the simulations with Roth-Erev seem similar, but the convergence takes place substantially more slowly. Nevertheless, the simulations confirm the result of Young ( , p. ): the groups establish a convention by coordinating their members' choices to a risk-dominant equilibrium. This holds true even if the risk-dominant equilibrium is inefficient (game II with 4 < b < 10).
In situations without risk-dominant equilibrium, both alternatives persist. Figure shows nine of the 1 000 groups that played game I with b = 10. Different colours indicate different choices at the 1 000th round of the simulations (without exploration). The actors are partitioned into clusters, which are stable over time. Actors on the edge of a cluster have no incentive to change behaviour, for they receive a reward of ten from one of the partners and zero reward from the other one. Switching to the other alternative would not change this pattern, unless exactly one of the two partners switches as well.
The difficulty of establishing a convention in games without risk-dominant outcome can be traced back to the restrictive structure of polygons (small-world networks with d = 2 and β = 0). First, the convergence to a single alternative is made possible by adding more connections to the network. Figure shows this effect for the melioration learning model and game I with b = 10. The relative frequencies of alternative A are measured at the 1 000th round of the simulations and for each of the 1 000 groups separately. The histograms picture the frequencies of groups with a particular relative frequency. The plots indicate that a higher number of network partners d enables a larger fraction of groups to choose a single alternative. If d = 20, approximately half of the groups can already coordinate their choices within 1 000 rounds. All groups achieve a convention in complete or nearly complete networks (d = 40 or d = 50). While in half of the groups, everyone chooses A, in the other half, a convention of selecting alternative B emerges.
In case of melioration learning, the frequencies of the two outer intervals increase with d, which means that more and more groups agree upon a common alternative. Since the relative frequencies are measured at the 1 000th round of the simulations, a comparison with Figure reveals that a high number of interaction partners (given by d) either accelerates the establishment of conventions (game I with b = 8 and game II with b = 2) or makes it possible in the first place (game I with b = 10 and game II with b = 4). In simulations of the Roth-Erev model, the number of contacts d increases the frequency of conventions only in game I and at a slower rate. In game II, Roth-Erev is incapable of quickly coordinating a group.
In summary, a denser or a more random structure supports the establishment of a convention. In game I with b = 8 and game II with b = 2, a convergence to the risk-dominant outcome (A, A) is seen.
In the games without risk-dominance relation, the results differ. While the groups are equally divided among the two efficient outcomes in game I with b = 10, the actors settle on the inefficient outcome (B, B) in game II with b = 4.

Conclusion
With melioration learning, a simple and empirically grounded model of reinforcement learning was shown to explain the emergence of conventions. In contrast to previous research on this subject (e.g., Young ; Berninghaus & Schwalbe ; Buskens & Snijders ), this study shows that conventions emerge even if the actors are aware of neither the payoff structure nor the decisions of other actors. However, melioration learning should not be seen as more general than the previous models. It applies to different settings. Since humans can be assumed to take various information into account, the previous models might be more appropriate in situations in which information on payoffs and other actors is available. For other settings, melioration learning should be used.
In some aspects, melioration learning is actually similar to the behavioural model of Young ( ). The actors are myopic, take past occurrences into account, and make random mistakes. However, unlike the earlier model, less strict assumptions about available information and the actors' cognitive skills are required. Although they must be able to observe their payoffs and to aggregate them to average values, no advanced reasoning about the given situation is necessary. Moreover, apart from the exploration rate, an alternative with the highest average value is selected with certainty. No further assumptions about probabilities of choice or stochastically independent decisions (cf. Roth-Erev) are needed.
The computer simulations revealed that the outcomes of melioration are largely in line with the results of Young ( ). In the long run, a convention is established by converging to a risk-dominant Nash equilibrium of the stage game. Given the particular settings of the simulations, the final outcome is independent of the network structure. However, the network structure is relevant in two other respects. First, it affects the speed of convergence in games with risk-dominant Nash equilibrium. Second, it impedes or enables the establishment of conventions in games without risk-dominance relationship.
While games without risk-dominance relationship have not been considered by Young ( ), Buskens & Snijders ( , p. ) stated that, in these situations (corresponding to RISK = 0.5), "there are no effects of network characteristics whatsoever". For example, in game II with b = 4, the model of Buskens & Snijders ( ) predicts "an average percentage of actors playing [B] of 50% at the end of the simulation runs". However, in the simulations with melioration learning, this percentage depends on the network structure, and may be close to 100% if the randomness parameter β or the average number of partners d is high (Figures and ). Hence, the effects of network structure differ between the model of Buskens & Snijders ( ) and melioration learning.
In two-person games, even the risk-dominated Nash equilibrium emerges with high frequency. The risk-dominant outcome prevails only if interactions take place with multiple partners and in large groups. The same effect was seen in simulations with the Roth-Erev model. Generally, the results of melioration and Roth-Erev correspond to each other. However, Roth-Erev converges considerably more slowly than melioration learning to a stable state with convention. Furthermore, the effect of network structures is less pronounced and partly missing in simulations with Roth-Erev.
Currently, there are no empirical confirmations of the predictions of the simulations. On the contrary, experimental studies yielded a likely convergence to the efficient (payoff-dominant) outcome, even if it is risk-dominated (e.g., Frey et al. ). Additionally, network effects on the outcome have been observed (Berninghaus et al. ; Cassar ). In these experiments, the subjects knew the payoff structure of the game, and information about the decisions of other actors was available. Therefore, melioration learning is inadequate in situations in which this kind of information is provided. However, in other situations, melioration learning might be a valid model of individual behaviour. Empirical studies that corroborate this hypothesis are still missing.

Appendix A: Further results and sensitivity analysis
(Figure ). A group's ability to coordinate its members' choices still depends on the reward b. Only in case of Roth-Erev and game II, the establishment of a convention is further impeded by high levels of exploration (ε = 0.2).
According to Figure , the results are also not altered by a smaller or greater network size, which is denoted by n. This is in line with a statement of Young ( , pp. -): the speed of convergence to a risk-dominant equilibrium is independent of the number of vertices if the network is close knit to a certain degree. Since the networks of the simulations are polygons, this condition is satisfied (Young , p. ).
Finally, the effect of the rewiring parameter β was tested for robustness by altering the second parameter d. In Figure , only results from simulations with melioration learning are included. On the one hand, the establishment of a convention is not facilitated by β if d = 2. In case of small d and β > 0, a network is often disconnected, which hinders the coordination. On the other hand, the relationship between the randomness parameter β and the distribution of choices is stronger in networks with a high average number of partners (d = 20). This corresponds to the result that a large number of connections or a high level of randomness supports the establishment of a convention.
The NetLogo-file of the simulations can be found at https://github.com/JZschache/NetLogo-games/blob/master/models/n-way-games.nlogo
The next section deals with the installation of both extensions. Afterwards, the usage and architecture of the ql-extension is comprehensively described. It is the core of the simulations, for it implements melioration learning and handles the parallelisation of the simulations. In the last section, a short introduction to the games-extension is given. It facilitates the definition of two-person games in NetLogo.

Installation
First, install NetLogo (tested with NetLogo . . ). Second, create a directory named ql in the extensions subdirectory of the NetLogo installation (see also http://ccl.northwestern.edu/netlogo/docs/extensions.html). Third, download all files from the repository and move them to the newly created directory. For example:
    git clone https://github.com/JZschache/NetLogo-ql.git
    mv NetLogo-ql/extensions/ql path-to-netlogo/extensions
Ensure that enable-parallel-mode is set to true.

The ql-extension
The ql-extension enables a parallelised simulation of agents who make decisions by melioration learning. In order to explain the usage of this extension, listing contains some parts of n-way-games.nlogo.
Listing : Some parts of n-way-games.nlogo
The ql-extension is able to parallelise the simulation and utilise multiple cores by deploying the Akka framework (Akka . . , http://akka.io). Akka handles the difficulties of data sharing and synchronisation by a message-passing architecture. More concretely, it requires the implementation of "Akka actors" that run independently and share data by sending messages to each other.
Simulations are parallelised in two ways. First, the NetLogo threads are not used for the ql-extension, which means that the latter runs independently of the former. Second, the learning and decision-making of agents take place simultaneously because the ql-extension runs on multiple threads.
Nevertheless, many parts of the simulation are executed by NetLogo, which does not parallelise naturally. This is a major bottleneck of the simulations. The ql-extension must wait for NetLogo to finish its calculations. The ql-extension solves this problem by the operation of multiple concurrently running instances of NetLogo. This feature is enabled by setting enable-parallel-mode to true (application.conf).
For a better understanding of the parallel mode, the architecture of the ql-extension is illustrated by the class diagram of Figure . It clarifies the connection between the extension and the NetLogo package org.nlogo. It also shows how concurrency is implemented by "Akka actors". First, each NetLogo agent (a turtle or a patch) is linked to an "Akka actor". This is realised by the QLAgent class, which constitutes the counterpart of a NetLogo agent in the ql-extension. It is characterised by an exploration rate, a list of QValues, and a decision-making algorithm (e.g. "epsilon-greedy"). A QValue instance is created for each alternative and specifies its current value. The decision-making algorithm returns an element of a list of alternatives (a list of integers). It uses the exploration rate and the QValues.
Agents are grouped together by the class NLGroup. This is a subclass of org.nlogo.api.ExtensionObject, which makes it accessible within NetLogo code. It consists of NetLogo agents and the corresponding QLAgents. Objects of this class are created by the command ql:create-group.
The main "Akka actor" of the extension is the NetLogoSupervisor. There is only one instance of this class. The NetLogoSupervisor has multiple tasks. For example, it supervises all NLGroups and continuously triggers the choices of agents. The speed of the repeated trigger is regulated by the corresponding slider of the NetLogo interface. When triggering the choice of agents, a list of NLGroups is forwarded to the NetLogoHeadlessRouter. Depending on the number of NetLogoHeadlessActors, the router splits this list into multiple parts. Afterwards, the NetLogoHeadlessActors handle the choices of the agents, and the NetLogoSupervisor is free to do other things.
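How the router divides the list is not specified here. A minimal sketch, assuming that the list is cut into contiguous parts of nearly equal size (one part per headless actor), could look as follows. The function name and the splitting rule are assumptions made for illustration only.

```python
def split_into_parts(groups, n_workers):
    """Split `groups` into `n_workers` contiguous parts of near-equal size.

    The first `len(groups) % n_workers` parts receive one extra element.
    """
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    base, extra = divmod(len(groups), n_workers)
    parts, start = [], 0
    for i in range(n_workers):
        size = base + (1 if i < extra else 0)
        parts.append(groups[start:start + size])
        start += size
    return parts
```

Each part can then be processed independently, which is what allows the supervisor to continue with other tasks while the rewards are calculated.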
When initialising the NetLogoSupervisor by ql:init, several headless workspaces of NetLogo are started in the background. Headless means that no graphical user interface is deployed. The number of headless workspaces is specified in the configuration file (application.conf). A separate NetLogoHeadlessActor controls each headless NetLogo instance. This actor continuously receives a list of NLGroups.
The headless NetLogo workspaces and the NetLogoHeadlessActors were added to the ql-extension in order to improve the performance. Their only task is to repeatedly calculate the rewards of a group of agents. The performance of repeatedly calling a function is optimised by compiling this function only once. This is problematic because the NetLogo extensions API does not currently support the passing of arguments to a compiled function (see https://github.com/NetLogo/NetLogo/issues/413). A solution was mentioned by Seth Tisue in the corresponding discussion (https://groups.google.com/forum/#!msg/netlogo-devel/8oDmCRERDlQ/0IDZm015eNwJ) and is implemented in the ql-extension.
Figure : Class diagram of the ql-extension
This solution requires that each NetLogoHeadlessActor is identified by a unique number. This number is forwarded to the reward function when it is called by the NetLogoHeadlessActor. The reward function calls ql:get-group-list with the identifying number and receives a list of NLGroupChoices. Besides the agents, an NLGroupChoice also contains a list of the agents' decisions. The agents and their choices are accessed by the reporters ql:get-agents and ql:get-decisions. The rewards are set to an NLGroupChoice by ql:set-rewards.
The reward function can also be used to update the (NetLogo) agents directly, e.g. by moving the agents within the NetLogo world or by setting variables. Since the agents are passed from the main NetLogo instance, the changes take effect in this instance as well. Finally, the reward function must return a new list of NLGroupChoices that correspond to the received list but with the rewards set.
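The data flow of the reward function can be mimicked outside NetLogo. The following Python sketch is not the extension's code: the class GroupChoice, the helper set_rewards, and the all-or-nothing coordination payoff are hypothetical stand-ins for NLGroupChoice, ql:set-rewards, and a model-specific reward rule. It only shows that rewards are attached by index and that a new list is returned.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class GroupChoice:
    """Hypothetical stand-in for an NLGroupChoice."""
    agents: Tuple[str, ...]
    decisions: Tuple[int, ...]
    rewards: Optional[Tuple[float, ...]] = None


def set_rewards(choice, rewards):
    """Return a copy of `choice` with rewards set; index i belongs to agent i."""
    if len(rewards) != len(choice.agents):
        raise ValueError("one reward per agent is required")
    return replace(choice, rewards=tuple(rewards))


def coordination_rewards(decisions):
    """Assumed payoff rule: 1.0 for everyone if all decisions match, else 0.0."""
    coordinated = len(set(decisions)) == 1
    return [1.0 if coordinated else 0.0 for _ in decisions]


def reward_function(group_choices):
    """Return a new list that corresponds to the input, with rewards set."""
    return [set_rewards(c, coordination_rewards(c.decisions))
            for c in group_choices]
```

Returning copies instead of modifying the received objects mirrors the requirement that the reward function returns a new list of group choices.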
The following list summarises the usage of the main commands of the ql-extension:
• ql:init initialises the ql-extension by specifying a turtleset or a patchset.
• ql:create-group creates a group from a list of pairs. Each pair is a list of two elements: first, an agent and, second, a list of integers (the alternatives). An object of type NLGroup is returned.
• ql:set-group-structure takes a list of objects of type NLGroup as parameter. It sets a static group structure.
• ql:start and ql:stop start and stop the simulation, respectively.
• ql:get-group-list can only be called from the reward function and must forward the headless-id. It returns a list of objects of type NLGroupChoice.
• ql:get-agents returns the list of NetLogo agents (turtles or patches) that are held by an NLGroupChoice.
• ql:get-decisions returns the list of decisions that are held by an NLGroupChoice. The indices of the decisions correspond to the indices of the agents that are held by the NLGroupChoice such that the decision at index i belongs to the agent at index i.
• ql:set-rewards sets a list of rewards for the decisions that are held by an NLGroupChoice. It returns a copy of the NLGroupChoice with the rewards attribute set. The indices of the rewards must correspond to the indices of the agents that are held by the NLGroupChoice such that the reward at index i belongs to the agent at index i.

The games-extension
The games-extension provides a convenient way to define normal-form game-theoretic situations. Optimal points and Nash equilibria are calculated and returned to NetLogo in a well-arranged form. A two-person game can be defined manually or by a predefined name. The first way is demonstrated with the help of Figure . In Figure , two NetLogo input fields named means-x and means-y are shown. Each field contains the mean rewards of player x or player y, respectively, given the choices of both players. Player x is the row-player in both fields. In order to create a game from the two input fields, two game-matrices must be created by the reporter games:matrix-from-row-list and joined together by games:two-persons-game (see function set-game in n-way-games.nlogo).
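To make the role of the two matrices concrete, the following Python sketch finds the pure-strategy Nash equilibria of a two-person game given as two lists of rows. It is an illustration only: the function name is hypothetical, mixed equilibria are ignored, and no claim is made about how the games-extension itself performs the calculation.

```python
def pure_nash_equilibria(means_x, means_y):
    """Return all (row, column) pairs that are pure-strategy Nash equilibria.

    means_x[i][j] and means_y[i][j] are the mean rewards of players x and y
    when x (the row player) chooses i and y (the column player) chooses j.
    """
    n_rows, n_cols = len(means_x), len(means_x[0])
    equilibria = []
    for i in range(n_rows):
        for j in range(n_cols):
            # x cannot gain by switching rows while y keeps column j
            x_best = all(means_x[i][j] >= means_x[k][j] for k in range(n_rows))
            # y cannot gain by switching columns while x keeps row i
            y_best = all(means_y[i][j] >= means_y[i][l] for l in range(n_cols))
            if x_best and y_best:
                equilibria.append((i, j))
    return equilibria
```

For a symmetric two-alternative coordination game with rewards of 1 on the diagonal and 0 elsewhere, both diagonal cells are returned, which corresponds to the two possible conventions.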
The second way of creating a two-person game requires only a name and, occasionally, the numbers of alternatives for both players:
    let game games:two-persons-gamut-game game-name n-alt-x n-alt-y
The reporter games:two-persons-gamut-game is based on the Gamut library (http://gamut.stanford.edu). Gamut makes available over thirty games that are commonly found in the economic literature. The games-extension currently supports the following parameters as name of a game:
• "BattleOfTheSexes"
• "Chicken"
• "CollaborationGame"
• "CoordinationGame"
JASSS, ( ) , http://jasss.soc.surrey.ac.uk/ / / .html Doi: . /jasss.