©Copyright JASSS

JASSS logo ----

Keiki Takadama, Tetsuro Kawai and Yuhsuke Koyama (2008)

Micro- and Macro-Level Validation in Agent-Based Simulation: Reproduction of Human-Like Behaviors and Thinking in a Sequential Bargaining Game

Journal of Artificial Societies and Social Simulation vol. 11, no. 2 9
<http://jasss.soc.surrey.ac.uk/11/2/9.html>

For information about citing this article, click here

Received: 04-Aug-2007    Accepted: 13-Mar-2008    Published: 31-Mar-2008

PDF version


* Abstract

This paper addresses both micro- and macro-level validation in agent-based simulation (ABS) to explore validated agents that can reproduce not only human-like behaviors externally but also human-like thinking internally. For this purpose, we employ the sequential bargaining game, which can investigate a change in humans' behaviors and thinking longer than the ultimatum game (i.e., one-time bargaining game), and compare simulation results of Q-learning agents employing any type of the three types of action selections (i.e., the ε-greedy, roulette, and Boltzmann distribution selections) in the game. Intensive simulations have revealed the following implications: (1) Q-learning agents with any type of three action selections can reproduce human-like behaviors but not human-like thinking, which means that they are validated from the macro-level viewpoint but not from the micro-level viewpoint; and (2) Q-learning agents employing Boltzmann distribution selection with changing the random parameter can reproduce both human-like behaviors and thinking, which means that they are validated from both micro- and macro-level viewpoints.

Keywords:
Micro- and Macro-Level Validation, Agent-Based Simulation, Agent Modeling, Sequential Bargaining Game, Reinforcement Learning

* Introduction

1.1
The validation of computational models and simulation results is a critical issue in agent-based simulation (ABS) (Axelrod 1997; Moss and Davidsson 2001) due to the fact that simulation results are very sensitive to how agents are modeled. To overcome this problem, several validation approaches have been proposed for social simulations, which are roughly categorized as follows (Carley and Gasser 1999): (1) theoretical verification that determines whether the model is an adequate conceptualization of the real world on the basis of a set of situation experts; (2) external validation that determines whether the results from the virtual experiments match the results from the real world; and (3) cross-model validation that determines whether the results from one computational model map onto, and/or extend, the results of another model. All these approaches contribute to improving the validation of computational models and simulation results. It should be noted, however, that these approaches typically validate complex social phenomena from the macro-level viewpoint (e.g., organizational performance caused by interactions among individual agents). This is simply because such macro-level phenomena are usually of primary interest. However, Gilbert claimed that "to validate a model completely, it is necessary to confirm that both the macro-level relationships are as expected and the micro-level behaviours are adequate representations of the actors' activity" (Gilbert 2004). From this viewpoint, few studies conducted both the micro- and macro-level validation in agent-based simulation.

1.2
Toward a complete validation of computational models and simulation results, this paper aims at addressing both micro- and macro-level validation in agent-based simulation. For this purpose, this paper starts by comparing several simulation results of different agents in the same model with subject experiment results. The point of this approach is to compare different agents, which differs from the general model-to-model approach (Hales et al. 2003) that compares different models (e.g., a comparison between culture models (Axelrod 1997) and Sugarscape (Epstein and Axtell 1996) in the work of Axtell et al. (1996)). We employ such an approach for the following reasons (Takadama et al. 2003): (1) it is difficult to fairly compare different computational models under the same evaluation criteria, since they are developed according to their own purpose; (2) common parts in different computational models are very few in number, which makes it difficult to replicate either computational model with the other; and (3) simulation results are sensitive to even a small modification in a model, which makes it difficult to find the key elements or factors that make simulation results sensitive in a model (i.e., there are several candidates that affect simulation results, which makes it hard to find the most important candidates). Since these difficulties prevent a comparison of computational models and their fair comparisons, we start by comparing the results of ABSs whose agents differ only in one element as the first step toward our goal. An example of such elements includes learning mechanisms applied to agents. In this paper, different kinds of action selection mechanisms in learning mechanisms (i.e., the &epsilon-greedy, roulette, and Boltzmann distribution selections, which are all described in Section 3) are employed for agent modeling.

1.3
To address this issue, this paper explores agent modeling that is validated from both the micro- and macro-level viewpoints by comparing several simulation results of different agents with subject experiment results. Precisely, we conduct the simulation of agents with different action selection mechanisms and compare these simulation results with subject experiment results to conduct both micro- and macro-level validation.

1.4
This paper is organized as follows. Section 2 explains an example (i.e., the bargaining game) employed in this paper and an implementation of agents is described in Section 3. Section 4 presents computer simulations and Section 5 discusses the validity of agents from both the micro- and macro-level viewpoints. Finally, our conclusions are given in Section 6.

* Bargaining Game

Why the bargaining game?

2.1
In order to address both micro- and macro-level validation in agent-based simulation as described in the previous section, we focus on bargaining theory (Muthoo 1999; 2000) in game theory (Osborne and Rubinstein 1994) and employ a bargaining game (Rubinstein 1982) where two or more players try to reach a mutually beneficial agreement through negotiations. This game has been proposed for investigating when and what kinds of offers from an individual player can be accepted by the other players. We selected this domain for meeting our goal because it can investigate the change in both the payoff of human players from the macro-level viewpoint and the thinking of human players from the micro-level viewpoint. In particular, the sequential bargaining game is employed because (1) both viewpoints can be investigated longer than in the ultimatum game (i.e., one-time bargaining game) and (2) sequential negotiations are naturally conducted in general human society instead of one-time negotiation (i.e., it is a rare case when a negotiation process ends after only one negotiation).

What is the bargaining game?

2.2
To understand the bargaining game, let us give an example from Rubinstein's work (1982), which illustrated a typical situation using the following scenario: two players, P1 and P2 , have to reach an agreement on the partition of a "pie." For this purpose, they alternate offers describing possible divisions of the pie, such as "P1 receives x and P2 receives 1-x at time t," where x is any value in the interval [0,1]. When a player receives an offer, the player decides whether to accept it or not. If the player accepts the offer, the negotiation process ends, and each player receives the share of the pie determined by the concluded contract. If the player decides not to accept the offer, on the other hand, the player makes a counter-offer, and all of the above steps are repeated until a solution is reached or the process is aborted for some external reason (e.g., the number of negotiation processes is finite). If the negotiation process is aborted, neither player can receive any share of the pie.

2.3
Here, we consider the finite-horizon situation, where the maximum number of steps (MAX_STEP) in the game is fixed and all players know this information as common knowledge. In the case where MAX_STEP=1 (also known as the ultimatum game), player P1 makes the only offer and P2 can accept or refuse it. If P2 refuses the offer, both players receive nothing. Since a rational player operates the notion that "anything is better than nothing," a rational P1 tends to keep most of the pie to herself by offering only a minimum share to P2 . Since there are no further steps to be played in the game, a rational P2 inevitably accepts the tiny offer.

2.4
By applying a backward induction reasoning to the situation above, it is possible to perform a simulation for MAX_STEP>1. For the same reason seen in the ultimatum game, the player who can make the last offer is better positioned to receive the larger share by offering a minimum offer (Ståhl 1972). This is because both players know the maximum number of steps in the game as common knowledge, and therefore the player who can make the last offer can acquire a larger share with the same behavior as in the ultimatum game at the last negotiation[1]. From this feature of the game, the last offer is granted to the player who does not make the first offer if MAX_STEP is even, since each player is allowed to make at most MAX_STEP/2 offers. On the other hand, the last offer is granted to the same player who makes the first offer if MAX_STEP is odd.

2.5
After this section, we use the terms "payoff" and "agent" instead of the terms "share" and "player" for their more general meanings in the bargaining game.

* Modeling Agents

Why reinforcement learning agents?

3.1
For the bargaining game, we employ reinforcement learning agents (Sutton and Barto 1998) because a lot of research has shown that reinforcement learning agents have a high reproduction capability of human-like behaviors (Roth and Erev 1995; Erev and Roth 1998; Iwasaki et al. 2005; Ogawa et al. 2005). For example, Roth and Erev compared simulation results of simple reinforcement learning agents with results of subject experiments in several examples (Roth and Erev 1995; Erev and Roth 1998) revealing that (1) computer simulation using simple reinforcement learning agents can better explain the result of subject experiments than economic theory; and (2) the former approach has greater potential of predicting results than the latter approach. In related work, Ogawa and their colleagues compared simulation results with subject experiment results in monopolistic intermediary games (Spulber 1999), which more real-world complexity than examples addressed in Roth and Erev's works (Roth and Erev 1995; Erev and Roth 1998), and revealed that simple reinforcement learning agents can reproduce the subject experiment results more precisely than the best response agents and random agents (Iwasaki et al. 2005; Ogawa et al. 2005).

3.2
Since these results are validated from the macro-level viewpoint which means that they are not sufficient for micro-level validation, this paper investigates the reproduction capability of reinforcement learning agents in terms of both micro- and macro-level validation in order to explore validated agents through comparisons of them. The reinforcement learning agents, specifically, Q-learning agents (Watkins and Dayan 1992) in computer science literature are employed among other famous agents like Roth's learning agents (Roth and Erev 1995; Erev and Roth 1998) in social science literature. This is because our previous research revealed that Q-learning agents can learn consistent behaviors and acquire sequential negotiation in the sequential bargaining game, while Roth's agents cannot (Roth's agents work well in one-time negotiation) (Takadama et al. 2006).

3.3
Furthermore, an employment of reinforcement learning agents including Q-learning agents is useful for model-to-model comparisons in terms of transferability of agents to other domains. This is because the agent's model is very simple which makes it easy to replicate for further analysis, in comparison with conventional models which are generally very complex, ad hoc, and created for their own purpose.

An implementation of agents

3.4
This section explains an implementation of reinforcement learning agents in the framework of the sequential bargaining game as follows.

Figure
Figure 1. Reinforcement learning agents

Q(s,a)=Q(s,a)+&alpha[r+&gamma maxa'&isin A(s') Q(s',a')-Q(s,a)].   ....   (1)

Figure
Table 1. Variables in Q-learning

3.5
For the above mechanism, Q-learning mechanism estimates the expected rewards by using the next Q-values, which strengthens the sequential state and action pairs that contribute to acquiring the reward. This is done by updating Q(s,a) to be close to r+maxa'&isin A(s') Q(s',a'). Precisely, Q(s,a) is close to maxa'&isin A(s') Q(s',a') until the final negotiation because r is set to 0 due to the fact that the reward is not obtained until the bargaining game is completed, while Q(s,a) is close to r at the final negotiation because r is set by the acquired reward calculated from the payoff and maxa'&isin A(s') Q(s',a') is set to 0 which indicates that there is no further negotiation.

3.6
For the action selection mechanisms that determine the acceptance of an offer or counter-offer, the following methods are employed.

3.7
As a concrete negotiation process, agents proceed as follows. Defining {offer,offered}iA{1,2} as the ith offer value (action) or offered value (state) of agent A1 or A2 , A1 starts by selecting one Q-value from the line S(Start) (i.e., one Q-value from {Q01 , ..., Q09 }[3] in the line S), and makes the first offer, offer1A1, according to the selected Q-value (for example, A1 makes an offer of 10% if it selects Q01). Here, we count one step when either agent makes an offer. Then, A2 selects one Q-value from the line offered1A2 (= offer1A1) (i.e., one Q-value from {QV0, ..., QV9 }, where V= offered1A2 (= offer1A1)). A2 accepts the offer if QV0 (i.e., the acceptance (A)) is selected; otherwise, it makes a counter-offer, offer2A2, according to the selected Q-value in the same way as A1. This cycle is continued until either agent accepts the offer of the other agent or the negotiation is over (i.e., the maximum number of steps (MAX_STEP) is exceeded by deciding to make a counter-offer instead of acceptance at the last negotiation step).

3.8
To understand this situation, let us consider the simple example where MAX_STEP=6 as shown in Figure 2. Following this example, A1 starts to make an offer of 10%(= offer1A1) to A2 by selecting Q01 from the line S(Start). However, A2 does not accept the first offer because it determines to make a 10%(= offer2A2) counter-offer by selecting Q11 from the line 10%(= offered1A2, corresponding to A1's offer). Then, in this example, A1 makes a 90%(= offer3A1) counter-offer by selecting Q19 from the line 10%(= offered2A1), A2 makes a 90%(= offer4A2) counter-offer by selecting Q99 from the line 90%(= offered3A2), A1 makes a 10%(= offer5A1) counter-offer by selecting Q91 from the line 90%(= offered4A1), and A2 makes a 10%(= offer6A2) counter-offer by selecting Q11 from the line 10%(= offered5A2). Finally, A1 accepts the 6th offer from A2 by selecting Q10 from the line 10%, which results in A(acceptance). But, if A1 makes a counter-offer instead of accepting the 6th offer from A2 at the last negotiation step (which means to exceed the maximum number of steps), both agents can no longer receive any payoff, i.e., they receive 0 payoff.

Figure
Figure 2. Example of a negotiation process

3.9
Here, we count one iteration when the negotiation process ends or fails. In each iteration, Q-learning agents update the worth pairs of state and action in order to acquire a large payoff.

* Simulation

Simulation cases

4.1
The following simulations were conducted in the sequential bargaining game as comparative simulations shown in Table 2.

Figure
Table 2. Simulation cases

Evaluation criteria and parameter setting

4.2
In each simulation, (a) the payoff for two agents and (b) the negotiation process size are investigated. Here, the negotiation process size is the number of steps until an offer is accepted or MAX_STEP if no offer is accepted. All simulations are conducted for up to 10,000,000 iterations, which is enough for the agents to learn appropriate behaviors, and the results show the moving average of 10,000 iterations, which are all averaged over 10 runs. As for the parameter setting, the variables are set as follows.

4.3
Note that (1) preliminary examinations found that the tendency of the results does not drastically change according to the above parameter setting. We have confirmed, in particular, that the results do not drastically change when varying the sensitive parameter &epsilon and T around 0.25 and 0.5, respectively; (2) we have confirmed in case 2 that &epsilon(=0.9) in the &epsilon-greedy selection and T(=1000) in the Boltzmann distribution selection show mostly the same high randomness in the action selection; and (3) ChangeRate in case 2 is set as 0.000001 to reduce the randomness of agents' behaviors around the end of simulations (i.e., 10,000,000 iterations).

4.4
Finally, all simulations were implemented by Java with standard libraries and conducted in Windows XP OS with Pentium 4 (2.60GHz) Processor[4].

Simulation results

4.5
Figures 3 and 4 show the simulation results of Q-learning agents with the constant random parameter and those with changing the random parameter, respectively. The upper, middle, and lower figures in Figure 3 correspond to cases 1-a, 1-b, and 1-c, respectively, while the upper and lower figures in Figure 4 correspond to cases 2-a and 2-c, respectively. The left and right figures in all cases indicate the payoff and negotiation process size, respectively. The vertical axis in these figures indicates these two criteria, while the horizontal axis indicates the iterations. In the payoff figure, in particular, the red and skyblue lines indicate the payoff of agents 1 and 2, respectively. Finally, all results are averaged from 10 runs at 10,000,000 iterations. The variances across the 10 runs of all simulation results are less than 0.3 in both payoffs and the negotiation process size, which is enough to be small, i.e., the simulation results across the 10 runs are consistent.

Figure
Figure 3. Simulation results of Q-learning agents (Case 1):
Average values over 10 runs through 10,000,000 iterations

Figure
Figure 4. Simulation results of Q-learning agents (Case 2):
Average values over 10 runs through 10,000,000 iterations

4.6
These results suggest that (1) in case 1, the payoff of Q-learning agents with the constant random parameter converges within 40% and 60% in any type of the three action selections (i.e., the &epsilon-greedy, roulette, and Boltzmann distribution selections), while the negotiation of those agents is more than two-time negotiations in any type of the three action selections, which means that the agents acquire the sequential negotiation; and (2) in case 2, the payoff of the Q-learning agents with changing the random parameter mostly converges at 10% and 90% in the &epsilon-greedy selection and converges within 40% and 60% in Boltzmann distribution selection, while the negotiation process size of those agents tends to increase as a whole in the &epsilon-greedy selection and shows the increasing and decreasing trend in Boltzmann distribution selection.

* Discussion

Subject experiment result

5.1
Before discussing the simulation results of Q-learning agents, this section briefly describes the subject experiment result found in Kawai et al. (2005). Figure 5 shows this result indicating the payoff of two human players in the left figures and the negotiation process size in the right figures. The vertical and horizontal axes have the same meaning as in Figures 3 and 4. In the payoff figure, in particular, the red and skyblue lines indicate the payoff of human players 1 and 2, respectively. Note that (1) all values in this figure are averaged from 10 cases through 20 iterations, where 10 cases were done by 10 pairs created from 20 human players. The variances across the 10 cases of all experiment results are less than 0.9 in both payoffs and the negotiation process size, which is a little larger than that in simulation results but the experiment results across the 10 cases are mostly consistent; (2) all human players are not well aware of the bargaining game and can make an offer or counter-offer either of 10%, 20%, ..., 90% or accept an offer (which is the same as simulations); (3) human players negotiated not face-to-face but by Windows Messenger in order not to know each other and to avoid the influence of facial emotion. For this purpose, human players participated in the bargaining game in separated rooms; and (4) human players received one actual payoff from 1 iteration (game) selected from among 20 iterations (games). By informing human players of this reward decision rule before starting the bargaining game, they were motivated to concentrate on every game because they did not know which game would be selected for determining the payment.

Figure
Figure 5. Subject experiment results in (Kawai 2005):
Average values over 10 experiments through 20 iterations

5.2
The result shows that (1) the payoff of two human players mostly converges within 40% and 60%, which indicates that human players never accept the tiny offer, unlike the rational players analyzed in Section 2; and (2) the negotiation process size increases a little bit around the first several iterations, decreases and converges around two around the last several iterations, which indicates that (2-i) human players acquire sequential negotiations; and (2-ii) the increasing and decreasing trend occurs in the subject experiment. This result also differs from the theoretical analysis done by (Rubinstein 1982), which indicates that the rational players (calculate and) offer the optimal payoff (i.e., the minimum payoff) and accept the offer right away without any further negotiation.

5.3
To analyze the reason why we obtained the above results, we conducted a questionnaire survey of the human players. The results are summarized as follows: (1) human players that can make the last offer come to be aware of their advantage of having a chance to acquire a larger payoff; (2) human players search for a mutually agreeable payoff not by one-time negotiation but by sequential negotiations; and (3) the increasing and decreasing trend emerges because of the following reasons: (3-i) the negotiation process size increases around the first several iterations because both players do not know their strategies each other which promotes them to explore possibilities of obtaining a larger payoff by competing with each other which requires further negotiations (i.e., a larger negotiation process size is required to explore a larger payoff) and (3-ii) the negotiation process size decreases around the last several iterations because both players find a mutually agreeable payoff by knowing their strategies each other which decreases the motivation of human players to negotiate again (i.e., a few negotiation process size is enough to determine their payoffs). These results suggest that the trend change of the negotiation process size (i.e., the increasing trend to the decreasing trend) represents the change in thinking of human players (i.e., thinking for exploring a larger payoff to thinking for a mutually agreeable payoff).

Case 1: Q-learning agents with the constant random parameter

5.4
Regarding Q-learning agents with the constant random parameter, Figure 3 shows that (1) the payoff of two agents converges within 40% and 60% (one of them is rather close to 50% and 50%) in any type of the three action selections (i.e., the &epsilon-greedy, roulette, and Boltzmann distribution selections); and (2) the negotiation process size of those agents is more than two in any type of the three action selections, which means that the agents acquire the sequential negotiation. We obtain these results for the following reasons:

5.5
The above analysis suggests that Q-learning agents can acquire the similar results of the subject experiment described in Section 5.1 from both the payoff viewpoint (i.e., an acquisition of the payoff within 40% and 60%) and the negotiation process size viewpoint (i.e., an acquisition of sequential negotiations). This derives the implication that Q-learning agents with any type of action selections are validated from the macro-level viewpoint.

Case 2: Q-learning agents with changing the random parameter

5.6
It should be noted here, however, that the validation described in the previous section is not accurate because Q-learning agents cannot reproduce the increasing and decreasing trend found in the subject experiment from the precise negotiation process size viewpoint. This suggests that the macro-level validation is not enough to explore validated agents. More importantly, the only macro-level validation may derive incorrect implications. To overcome this problem, we should conduct the micro-level validation in addition to the macro-level validation as Gilbert claimed (Gilbert 2004).

5.7
For this purpose, we focus on the thinking of human players from the micro-level validation and consider why the increasing and decreasing trend is occurred in the negotiation process size. Concretely, the change in thinking of human players is investigated from the viewpoint of the negotiation process size because the trend change of the negotiation process size represents the change in thinking of human players which is revealed from the questionnaire survey to human players conducted in the subject experiment in Section 5.1. Repeating the analysis of the subject experiment, such a trend emerges by players competing with each other to obtain a larger payoff around the first several iterations which promotes further negotiations (i.e., the negotiation process size increases), and by finding a mutually agreeable payoff around the last several iterations which decreases the motivation of human players to negotiate again (i.e., the negotiation process size decreases). Considering these characteristics of human players, we introduce the randomness decreasing parameter in equations (4) and (5) described in Section 4.1. This parameter decreases the randomness of the action selection of human players as the iterations increase, which have the following functions: (1) the high randomness of the action selection in the first several iterations corresponds to the stage where players try to explore a larger payoff by competing with each other; while (2) the low randomness of the action selection in the last several iterations corresponds to the stage where players make a mutually agreeable payoff with a small number of negotiations.

5.8
By using Q-learning agents with the above randomness decreasing parameter, we conducted the simulation and acquired Figure 4 showing that (1) the payoff of agents employing the &epsilon-greedy selection converges mostly at the maximum and minimum payoffs, while that of agents employing Boltzmann distribution selection converges within 40% and 60%; and (2) the negotiation process size of agents employing the &epsilon-greedy selection increases as a whole tendency although it sometimes vibrates, while that of agents employing the Boltzmann distribution selection shows the increasing and decreasing trend. We obtain these results for the following reasons.

Figure
Table 3. Q-table in Q-learning agents

5.9
The above analysis suggests that Q-learning agents employing Boltzmann distribution selection with changing the random parameter can acquire the results similar to the subject experiment described in Section 5.1 from both the payoff viewpoint (i.e., an acquisition of the payoff within 40% and 60%) and the negotiation process size viewpoint (i.e., an acquisition of both sequential negotiations and the increasing and decreasing trend). This is because the agents employing Boltzmann distribution selection with changing the random parameter become to select their actions considering the past experience, while the agents employing the &epsilon-greedy selection with changing the random parameter become to select the best action (which is not usual human behaviors). It goes without saying that the agents employing the roulette section cannot reproduce the increasing and decreasing trend in the negotiation process size due to a lack of the random change parameter, while the agents employing both &epsilon-greedy and Boltzmann distribution selections with changing the random parameter can reproduce such trend change. This derives the implication that Q-learning agents employing Boltzmann distribution selection with changing the random parameter are validated in the sequential bargaining game from both the micro- and macro-level viewpoints. This result stresses the importance of both the micro- and macro-level validation in an exploration of validated agents.

Validity of Q-learning agents

5.10
The above analysis derives the following implications: (1) the macro-level validation is not sufficient to explore validated agents, which suggests that the micro-level validation should be conducted in addition to the macro-level validation; and (2) Q-learning agents employing Boltzmann distribution selection with changing the random parameter are validated in the sequential bargaining game from both the micro- and macro-level viewpoints. In order to further strengthen the validity of the above Q-learning agents, this section discusses it from the following aspects.

Model comparison vs. agent comparison

5.11
In general, the model-to-model approach (Hales et al. 2003) compares different models to investigate the validity of "computational models" and "simulation results." This also contributes to promoting transfer of knowledge on "models" by clarifying the limits of the applicability of their "models". In comparison with this approach, our approach compares different agents in the same model to validate "agents" and "simulation results" by comparison with subject experiment results. This also contributes to promoting transfer of knowledge on "agents" by clarifying the limits of applicability of their "agents." From this analysis, we find that (1) both approaches pursue the same goal and (2) the only difference is to focus on model-level design (i.e., the framework of the model) or agent-level design (i.e., the framework of the agent). This viewpoint suggests that our approach can be regarded as one of the model-to-model approaches.

* Conclusions

6.1
This paper addressed both micro- and macro-level validation in agent-based simulation (ABS) to explore validated agents that can reproduce not only human-like behaviors externally but also human-like thinking internally. For this purpose, we employed the sequential bargaining game for the long investigation of a change in humans' behaviors and thinking and compared simulation results of Q-learning agents employing any type of the three types of action selections (i.e., the &epsilon-greed, roulette, and Boltzmann distribution selections) in the game. Intensive simulations have revealed the following implications: (1) Q-learning agents with any type of the three action selections can acquire sequential negotiation, but they cannot show the increasing and decreasing trend found in subject experiments. This indicates that Q-learning agents can reproduce human-like behaviors but not human-like thinking, which means that they are validated from the macro-level viewpoint but not from the micro-level viewpoint; and (2) Q-learning agents employing Boltzmann distribution selection with changing the random parameter cannot only acquire sequential negotiation but also show the increasing and decreasing trend in the game. This indicates that the Q-learning agents can reproduce both human-like behaviors and thinking, which means that they are validated from both micro- and macro-level viewpoints.

6.2
What should be noted here is that these results have only been obtained from one example, i.e., the sequential bargaining game. Therefore, further careful qualifications and justifications, such as analyses of results using other learning mechanisms and action selections or in other domains, are needed to generalize our results. Such important directions must be pursued in the near future in addition to the following future research: (1) an exploration of other ChangeRage settings; (2) modeling agents that produce human-like behaviors in the short iterations (such as 20 iterations as subject experimental results); (3) simulation with more than two agents; (4) an analysis of the case where humans play the game with agents like in (Bosse and Jonker 2005); and (5) investigation of the influence of the discount factor (Rubinstein 1982) in the bargaining game.

* Acknowledgements

The research reported here was supported in part by a Grant-in-Aid for Scientific Research (Young Scientists (B), 19700133) of the Ministry of Education, Culture, Sports, Science and Technology (MEXT), Japan. The authors wish to thank Juliette Rouchier for introducing Gilbert's interesting paper and also thank anonymous reviewers for useful, significant, and constructive comments and suggestions.


* Notes

1 In detail, Rubinstein concluded that the solution for the bargaining game is unique, i.e., it reaches the perfect equilibrium partition (P.E.P) under the assumptions that (1) the discount factors are common knowledge to the players and (2) the number of stages (or steps) to be played is infinite (Rubinstein 1982). In other words, in an exchange between rational players, the first offerer should (calculate and) offer the P.E.P; the responder (the opponent players) should then accept the offer right away, making an instantaneous deal with no need of further interaction. Concretely, assuming that players P1 and P2 are penalized with discount factors &delta1 and &delta2 , respectively, and P1 is granted the first offer, the composition of the P.E.P contract is that player P1 receives a share of the pie that returns her a utility of U_1 = (1-&delta2 ) / (1-&delta1&delta2 ), whereas player P2 gets a share that returns him a utility of U_2 = &delta2(1-&delta1 ) / (1-&delta1&delta2 ). For values of &delta close to 0, the finite-horizon alternating-offers bargaining games give a great advantage to the player making the last offer. In this research, we employ the finite-horizon alternating-offers bargaining game in the case where &delta1 = &delta2 = 0.
2 In the context of reinforcement learning, worth is called "value." We select the term "worth" instead of "value" because the term "value" is used as a numerical number represented in the state and action.
3 At the first negotiation, one Q-value is selected from {Q01 , ..., Q09 }, not from {Q00 , Q01 , ..., Q09 }. This is because the role of the first agent is to make the first offer and not to accept any offer (by selecting Q00) due to the fact that a negotiation has not started yet.
4 Source code can be downloaded from http://www.cas.hc.uec.ac.jp/bargaining-game/index.html.
5 In other words, the Q-learning agents get into the local minimum solution.

* References

AXELROD, R. M. (1997), The Complexity of Cooperation: Agent-Based Models of Competition and Collaboration, Princeton University Press.

AXTELL, R., Axelrod, R., Epstein J., and Cohen, M. D. (1996), "Aligning Simulation Models: A Case Study and Results," Computational and Mathematical Organization Theory (CMOT), Vol. 1, No. 1, pp. 123-141.

BINMORE, K. G. (1998), Game Theory and the Social Contract: Just Playing, Volume 2, The MIT Press.

Bosse, T. and Jonker, C. M. (2005), "Human vs. Computer Behaviour in Multi-Issue Negotiation," First International Workshop on Rational, Robust, and Secure Negotiations in Multi-Agent Systems (RRS'05), IEEE Computer Society Press, pp. 11-24.

CARLEY, K. M. and Gasser, L. (1999), "Computational and Organization Theory," in Weiss, G. (Ed.), Multiagent Systems - Modern Approach to Distributed Artificial Intelligence -, The MIT Press, pp. 299-330.

EREV, I. and Roth, A. E. (1998), "Predicting How People Play Games: Reinforcement Learning in Experimental Games with Unique, Mixed Strategy Equilibria," The American Economic Review, Vol. 88, No. 4, pp. 848-881.

EPSTEIN J. M. and Axtell R. (1996), Growing Artificial Societies, MIT Press.

FRIEDMAN, D. and Sunder, S. (1994), Experimental Methods: A Primer for Economists, Cambridge University Press.

GILBERT, N. (2004), "Open problems in using agent-based models in industrial and labor dynamics," In R. Leombruni and M. Richiardi (Eds.), Industry and Labor Dynamics: the agent-based computational approach, World Scientific, pp. 401-405.

HALES, D., Rouchier J., and Edmonds, B. (2003), "Model-to-Model Analysis," Journal of Artificial Societies and Social Simulation (JASSS), Vol. 6, No. 4, 5 http://jasss.soc.surrey.ac.uk/6/4/5.html.

IWASAKI, A., Ogawa, K., Yokoo, M., and Oda, S. (2005), "Reinforcement Learning on Monopolistic Intermediary Games: Subject Experiments and Simulation," The Fourth International Workshop on Agent-based Approaches in Economic and Social Complex Systems (AESCS'05), pp. 117-128.

KAGEL, J. H. and Roth, A. E. (1995), Handbook of Experimental Economics Princeton University Press.

KAWAI et al. (2005) "Modeling Sequential Bargaining Game Agents Towards Human-like Behaviors: Comparing Experimental and Simulation Results," The First World Congress of the International Federation for Systems Research (IFSR'05), pp. 164-166.

MOSS, S. and Davidsson, P. (2001), Multi-Agent-Based Simulation, Lecture Notes in Artificial Intelligence, Vol. 1979, Springer-Verlag.

MUTHOO, A. (1999), Bargaining Theory with Applications , Cambridge University Press. MUTHOO, A. (2000), "A Non-Technical Introduction to Bargaining Theory," World Economics, pp. 145-166.

NYDEGGER, R. V. and Owen, G. (1974), "Two-Person Bargaining: An Experimental Test of the Nash Axioms," International Journal of Game Theory, Vol. 3, No. 4, pp. 239-249.

OGAWA, K., Iwasaki, A., Oda, S., and Yokoo, M. (2005) "Analysis on the Price-Formation-Process of Monopolistic Broker: Replication of Subject-Experiment by Computer-Experiment," The 2005 JAFEE (Japan Association for Evolutionary Economics) Annual Meeting (in Japanese).

OSBORNE, M. J. and Rubinstein, A. (1994), A Course in Game Theory, MIT Press.

ROTH, A. E., Prasnikar, V., Okuno-Fujiwara, M., and Zamir, S. (1991), "Bargaining and Market Behavior in Jerusalem, Ljubljana, Pittsburgh, and Tokyo: An Experimental Study," American Economic Review, Vol. 81, No. 5, pp. 1068-1094.

ROTH, A. E. and Erev, I. (1995) "Learning in Extensive-Form Games: Experimental Data and Simple Dynamic Models in the Intermediate Term," Games and Economic Behavior, Vol. 8, No. 1, pp. 164-212.

RUBINSTEIN, A. (1982), "Perfect Equilibrium in a Bargaining Model," Econometrica, Vol. 50, No. 1, pp. 97-109.

SCHELLING, T. C. (1960), The Strategy of Conflict, Harvard University Press.

STÅHL, I. (1972), Bargaining Theory, Economics Research Institute at the Stockholm School of Economics.

SPULBER, D. F. (1999), Market Microstructure - Intermediaries and the theory of the firm- , Cambridge University Press.

SUTTON, R. S. and Barto, A. G. (1998), Reinforcement Learning: An Introduction, The MIT Press.

TAKADAMA et al. (2003), "Cross-Element Validation in Multiagent-based Simulation: Switching Learning Mechanisms in Agents," Journal of Artificial Societies and Social Simulation (JASSS), Vol. 6, No. 4, 6. http://jasss.soc.surrey.ac.uk/6/4/6.html

TAKADAMA et al. (2006), "Can Agents Acquire Human-like Behaviors in a Sequential Bargaining Game? - Comparison of Roth's and Q-learning agents -," The Seventh International Workshop on Multi-Agent-Based Simulation (MABS'06), pp. 153-166.

WATKINS, C. J. C. H. and Dayan, P. (1992), "Technical Note: Q-Learning," Machine Learning, Vol. 8, pp. 55-68.

----

ButtonReturn to Contents of this issue

© Copyright Journal of Artificial Societies and Social Simulation, [2008]