A Game Theory Model for Manipulation Based on Machiavellianism: Moral and Ethical Behavior

This paper presents a new game theory approach for modeling manipulation behavior based on Machiavellianism (social conduct and intelligence theory). The Machiavellian game conceptualizes Machiavellianism through three concepts: views, tactics and immorality. To model the Machiavellian views and tactics we employ a Stackelberg/Nash game theory approach. To represent the concept of immorality, we consider that rational Machiavellian players employ a combination of deontological and utilitarian moral rules, as well as moral heuristics. We implement the immorality concept with a reinforcement learning approach, a computational mechanism whose principle of error-driven adjustment of cost/reward predictions contributes to the players' acquisition of moral (immoral) behavior. The reinforcement learning algorithm is based on an actor-critic approach that evaluates the new state of the system and determines whether the costs/rewards are better or worse than expected, supported by the Machiavellian game theory solution. The result of the model is the manipulation equilibrium point. We provide the details needed to implement the extraproximal method in an efficient and numerically stable way. Finally, we present a numerical example that validates the effectiveness of the manipulation model.


Introduction

Brief review

Game theory is a mathematical system for analyzing and predicting how humans behave in strategic situations. It uses three distinct concepts to make precise predictions of how people will, or should, interact strategically: strategic thinking, best-reply, and mutual consistency (equilibrium). Strategic thinking assumes that all players form beliefs based on an analysis of what other players may possibly do. Best-reply chooses the best response given those beliefs. Mutual consistency adjusts best responses and beliefs until they reach an equilibrium. In this sense, there is an understandable interest in the development and implementation of social interaction models that try to understand, predict, manipulate and control the behavior of people, organizations, governments, companies, etc. (see, for instance, Schindler; Xianyu; Wijermans et al.). Anyone who embarks on this kind of interaction model must consider the behavioral observations of the fifteenth-century philosopher and politician Niccolò Machiavelli (Wilson et al.).
Machiavelli's primary contribution was his painfully honest observations about human nature (Machiavelli). He distinguishes the natural laws that govern how effective leaders exercise power over human resources and creates a new moral system, deeply rooted in Roman virtue (and vice). He develops his proposal against the conceptions of the Judeo-Christian self-contained moral systems. His ethical system works both as a limit to human possibilities and as the source of human virtue. Machiavelli says that human nature is aggressive and only in some measure able to be manipulated. In this sense, he observes that under competitive conditions human beings pursue their main goals with increasing levels of ruthlessness (Machiavelli).
as a product of behavior, including personal attributes, with the possibility of affecting others through interaction, and the environment structure. Dawkins proposed that, in terms of selfishness, altruism, cooperation, manipulation, lie and truth, there genetically exists a selfishness and manipulation gene. Dawkins & Krebs classified manipulation as a natural-selection state benefiting individuals able to manipulate others' behavior. Vleeming denoted a personality dimension in which people can be classified in terms of being more or less manipulated in different interpersonal situations. Wilson et al. defined Machiavellianism as a social strategy behavior involving the manipulation of others to obtain personal benefits, frequently against others' interests. They clarify that anybody is able to manipulate others to different degrees, and they also explain that selfishness and manipulation are behaviors widely studied in evolutionary biology. Hellriegel et al. defined Machiavellianism as a personal style of behavior towards others, characterized by: the use of astuteness, tricks and opportunism in interpersonal relationships; cynicism towards other persons' nature; and lack of concern with respect to conventional morals. Christie & Geis proposed three factors to evaluate high or low Machiavellianism: tactics, morality and views. Tactics are concerned with planned actions (or recommendations) to confront specific situations with the purpose of obtaining planned benefits at the expense of others. Morality is related to behavior that can be associated with some degree of "badness" with respect to social conventions. Views involve the idea that the world consists of manipulators and manipulated. In this sense we introduce the following definition.

Immorality is an un-arrangement of customs. One of the best-known concepts is the immorality described by Nietzsche. Therefore, we consider the factor of morality proposed by Christie & Geis to be inappropriate, because in the evaluation of that factor immorality is considered the opposite of a "conventional moral."

Several methods have been proposed, from different angles and application domains, for solving the problem of moral and ethical decision making (Bales; Cervantes et al.; Dehghani et al.; Indurkhya & Misztal-Radecka; Wallach et al.; Hartog & Belschak). However, there is still a lack of a fundamental mathematical decision model and of a rigorous cognitive process for decision-making. In this paper, we propose a game theory solution combined with a reinforcement learning approach to acquire manipulation behavior.

Main results
This paper presents a new game theory approach for modeling manipulation behavior based on the Machiavellian social conduct theory. The Machiavellian game forms a system that allows us to analyze and predict how Machiavellian players behave in strategic situations, combining the following three fundamental features of game theory: a) formation of beliefs based on an analysis of what other players might do (strategic thinking); b) choosing a best-reply given those beliefs (optimization); and c) adjustment of best-reply and beliefs until they are mutually consistent (equilibrium). The assumption of mutual consistency is justified by introducing a learning process. As a result, the Machiavellian equilibrium is the consequence of strategic thinking, optimization, and equilibration (learning process).

A Machiavellian player conceptualizes the manipulation social conduct considering three concepts: views, tactics and immorality. For modeling the Machiavellian Views and Tactics we employ a Stackelberg/Nash game theory approach. For representing Machiavellian immorality we introduce utilitarian and deontological moral theories and we review psychological findings regarding moral decision making (Tobler et al.). An advantage of the Machiavellian social conduct theory is that, whereas other moral theories provide standards for how we should act, they do not describe how moral judgments and decisions are achieved in practice. Next, we establish a relationship between moral behavior and game theory (economic theories) and we suggest a reinforcement learning approach which provides evidence of how people acquire Machiavellian immoral behavior, considering its principle of error-driven adjustment of cost/reward predictions.

In summary, this paper makes the following contributions:
• We represent the Machiavellian Views naturally as a Stackelberg game where the hierarchical organization consists of manipulators (leaders) and manipulated (followers) players. By definition of the Stackelberg game, the leaders have commitment power.
• The manipulators and manipulated players are themselves in a (non-cooperative) Nash game.
• The Machiavellian Tactics correspond to the solution of the Stackelberg/Nash game.
• We employ the extraproximal method for solving the Machiavellian game (Antipin; Trejo et al.) (see the appendix).
• We suggest a reinforcement learning approach for modeling the Machiavellian immorality based on an actor-critic architecture.
• For representing the Machiavellian immorality, we consider that rational players employ deontological and utilitarian morals (depending on the case).
• We suggest the immorality reinforcement learning rules needed for modeling immorality.
• We restrict our approach to finite, ergodic and controllable Markov chains.
• The result of the model is the manipulation equilibrium point.
• Finally, we present a numerical example that validates the effectiveness of the proposed Machiavellian social conduct and intelligence approach.
Organization of the paper

The rest of the paper is organized as follows. The next section presents the mathematical background needed for the rest of the paper. The section on the Machiavellianism architecture describes the general Machiavellianism social conduct architecture. The section on Machiavellian Views and Tactics proposes a Machiavellian Stackelberg game theory model for representing the Machiavellian Views and Tactics. The section on Machiavellian Immorality describes a reinforcement learning approach for modeling the Machiavellian immorality. The Numerical example section presents simulated experiments for the Machiavellian social conduct theory. Finally, the last section presents some conclusions and future work. To make the paper more accessible, the long details needed to implement the extraproximal method are placed in the appendix.

Preliminaries
Let $S$ be a finite set, called the state space, consisting of a finite set of states $s_{(1)}, \ldots, s_{(N)}$, $N \in \mathbb{N}$. A stationary Markov chain (Clempner & Poznyak) is a sequence of $S$-valued random variables $s(n)$, $n \in \mathbb{N}$. The Markov chain can be represented by a complete graph whose nodes are the states, where each edge $(s_{(i)}, s_{(j)}) \in S^2$ is labeled by the transition probability. The matrix $\Pi = (\pi_{(ij)})_{(s_{(i)}, s_{(j)}) \in S} \in [0, 1]^{N \times N}$ determines the evolution of the chain: for each $k \in \mathbb{N}$, the power $\Pi^k$ has in each entry $(s_{(i)}, s_{(j)})$ the probability of going from state $s_{(i)}$ to state $s_{(j)}$ in exactly $k$ steps.
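The role of the matrix power can be checked numerically. The following is a minimal sketch, assuming a hypothetical 3-state chain whose entries are illustrative only and do not come from the paper; `k_step_matrix` is a helper name introduced here.

```python
import numpy as np

# Hypothetical 3-state transition matrix (each row sums to 1); illustrative values only.
Pi = np.array([
    [0.5, 0.5, 0.0],
    [0.1, 0.6, 0.3],
    [0.2, 0.0, 0.8],
])

def k_step_matrix(Pi, k):
    """Return Pi**k, whose (i, j) entry is the probability of moving
    from state i to state j in exactly k steps."""
    return np.linalg.matrix_power(Pi, k)

Pi2 = k_step_matrix(Pi, 2)
# Entry (0, 2): the only two-step route from state 0 to state 2 passes through state 1.
print(Pi2[0, 2])
```

Every power of a row-stochastic matrix is again row-stochastic, so the rows of `Pi2` still sum to one.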
Definition. A controllable Markov chain (Poznyak et al.) is a 4-tuple $MC = \{S, A, \mathbb{K}, \Pi\}$ where:
• $S$ is a finite set of states, $S \subset \mathbb{N}$, endowed with the discrete topology;
• $A$ is the set of actions, which is a metric space. For each $s \in S$, $A(s) \subset A$ is the non-empty set of admissible actions at state $s \in S$. Without loss of generality we may take $A = \cup_{s \in S} A(s)$;
• $\mathbb{K} = \{(s, a) \mid s \in S, a \in A(s)\}$ is the set of admissible state-action pairs, which is a measurable subset of $S \times A$;
• $\Pi = [\pi_{(ij|k)}]$ is a stationary controlled transition matrix, where $\pi_{(ij|k)} \equiv P(s(n+1) = s_{(j)} \mid s(n) = s_{(i)}, a(n) = a_{(k)})$ represents the probability associated with the transition from state $s_{(i)}$ to state $s_{(j)}$ under an action $a_{(k)} \in A(s_{(i)})$, $k = 1, \ldots, M$.

Definition. A Markov Decision Process is a pair $MDP = \{MC, J\}$ where:
• $MC$ is a controllable Markov chain;
• $J : S \times \mathbb{K} \to \mathbb{R}$ is a cost function, associating to each state a real value.

The strategy (policy) $d_{(k|i)}(n) \equiv P(a(n) = a_{(k)} \mid s(n) = s_{(i)})$ represents the probability measure associated with the occurrence of an action $a(n)$ from state $s(n) = s_{(i)}$.
The elements of the transition matrix for the controllable Markov chain can be expressed in terms of the controlled transition probabilities and the strategy as

$$P(s(n+1) = s_{(j)} \mid s(n) = s_{(i)}) = \sum_{k=1}^{M} \pi_{(ij|k)} \, d_{(k|i)}(n).$$

Let us denote the collection $\{d_{(k|i)}(n)\}$ by $D_n$. A policy $\{d^{loc}_n\}_{n \geq 0}$ is said to be locally optimal if, for each $n \geq 0$, it maximizes the conditional mathematical expectation of the utility function $J(s_{n+1})$ under the condition that the history of the process is fixed and cannot be changed hereafter, i.e., it realizes the "one-step ahead" conditional optimization rule, where $J(s_{n+1})$ is the utility function at the state $s_{n+1}$.
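The two ideas in this paragraph, mixing the controlled transitions with a randomized strategy and choosing an action one step ahead, can be sketched as follows. This is an illustration under stated assumptions: the arrays `pi`, `d` and `J` hold made-up values for a 2-state, 2-action chain, the function names are introduced here, and the one-step rule is restricted to pure actions for simplicity.

```python
import numpy as np

def induced_transition_matrix(pi, d):
    """Mix controlled transitions pi[i, j, k] with a strategy d[i, k].

    Returns P with P[i, j] = sum_k pi[i, j, k] * d[i, k].
    """
    return np.einsum("ijk,ik->ij", pi, d)

def one_step_ahead_actions(pi, J):
    """For each state i, pick the action k maximizing the expected
    next-state utility sum_j pi[i, j, k] * J[j]."""
    expected = np.einsum("ijk,j->ik", pi, J)
    return expected.argmax(axis=1)

# Hypothetical data: 2 states, 2 actions (illustrative values only).
pi = np.stack([
    np.array([[0.9, 0.1], [0.2, 0.8]]),  # transitions under action 0
    np.array([[0.3, 0.7], [0.6, 0.4]]),  # transitions under action 1
], axis=2)
d = np.array([[0.5, 0.5], [1.0, 0.0]])   # randomized strategy d[i, k]
J = np.array([1.0, 0.0])                 # utility of landing in each state

P = induced_transition_matrix(pi, d)
best = one_step_ahead_actions(pi, J)
print(P)
print(best)
```

Because each row of `d` is a probability vector, `P` is again a row-stochastic matrix.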
The dynamics of the Stackelberg game for Markov chains is described as follows. The game consists of $\iota = \overline{1, M+N}$ players and begins at the initial state $s^{\iota}(0)$ which (as well as the states further realized by the process) is assumed to be completely measurable. Each player $\iota$ is allowed to randomize, with distribution $d^{\iota}_{(k|i)}(n)$, over the pure action choices $a^{\iota}_{(k)} \in A^{\iota}(s^{\iota}_{(i)})$, $i = \overline{1, N_{\iota}}$ and $k = \overline{1, M_{\iota}}$. The leaders correspond to $l = \overline{1, N}$ and the followers to $m = \overline{1, M}$. At each fixed strategy of the leaders $d^{l}_{(k_l|i_l)}(n)$ the followers make the strategy selection $d^{m}_{(k_m|i_m)}(n)$ trying to realize a Nash equilibrium. Below we will consider only stationary strategies $d^{\iota}_{(k|i)}(n) = d^{\iota}_{(k|i)}$. In the ergodic case, when all Markov chains are ergodic for any stationary strategy $d^{\iota}_{(k|i)}$, the distributions $P^{\iota}(s^{\iota}(n+1) = s_{(j_{\iota})})$ converge exponentially fast to their limits $P^{\iota}(s = s_{(i)})$, which satisfy the corresponding stationarity equation. The cost function of each player, depending on the states and actions of all the other players, is given by the values $W^{\iota}_{(i_1, k_1; \ldots; i_{N+M}, k_{N+M})}$, from which the "average cost function" $J^{\iota}$ in the stationary regime is obtained.
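The convergence to a limiting distribution in the ergodic case can be observed directly. The sketch below assumes a single player and a hypothetical 2-state transition matrix `P` already induced by some stationary strategy; the function name and tolerance are choices made here, not taken from the paper.

```python
import numpy as np

def stationary_distribution(P, tol=1e-12, max_iter=10_000):
    """Iterate p <- p P from the uniform distribution until the change
    falls below tol; for an ergodic chain this converges geometrically."""
    p = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(max_iter):
        p_next = p @ P
        if np.abs(p_next - p).max() < tol:
            return p_next
        p = p_next
    return p

# Hypothetical transition matrix induced by a stationary strategy.
P = np.array([[0.6, 0.4],
              [0.2, 0.8]])

p_star = stationary_distribution(P)
print(p_star)
```

The limit satisfies `p_star @ P == p_star` up to the tolerance, which is the single-player analogue of the stationarity condition mentioned above.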

Notice that by Equation ( ) it follows that
.

Let us introduce the variables
for the leaders l where col is the column operator, and let us introduce the variables for the m followers. .
We consider a Stackelberg game where the leaders and the followers do not cooperate. Then, the definition of the equilibrium point is as follows.

A Stackelberg/Nash equilibrium for the followers is a strategy $v^{*} = (v^{1*}, \ldots, v^{M*})$, given the strategy $u = (u^{1}, \ldots, u^{N})$ of the leaders, that satisfies the equilibrium inequality for any $v^{m} \in V$ and $u \in U$.

Machiavellianism architecture
This paper presents a new game theory model to represent Machiavellianism (social conduct and intelligence theory). The Machiavellian social conduct is concerned with manipulating others for personal gain, even against the others' self-interest, and the Machiavellian intelligence is the capacity of an individual to engage successfully with social groups (Byrne & Whiten). A Machiavellian agent is one who conceptualizes Machiavellianism considering three concepts: views, tactics and immorality. Views involve the idea that the world consists of manipulators and manipulated (restricting the model to two types of agents). Tactics are concerned with the use of manipulation (Machiavelli's) strategies. Immorality is related to the natural disposition not to be restricted to a conventional moral (the Machiavellian agent's behavior is rationally bounded by the immorality). In order to acquire power (manipulate), survive or sustain a particular position, Machiavellian agents make use of Machiavellian intelligence, applying different selfish manipulation strategies, which include seeking to control the changes taking place in the environment.
The figure shows the schematic structure of the Machiavellian actor-critic algorithm involving a reinforcement learning process. The learning agent has been split into two separate entities: the actor (game theory) and the critic (value function). The actor coincides with a Stackelberg/Nash game (Stackelberg) that implements the concepts of Views and Tactics of Machiavellianism. The critic conceptualizes the Machiavellian immorality, represented by a value function process. The actor is responsible for computing a control strategy $d^{\iota *}$ for each player $\iota$, given the current state $\hat{i}_{\iota}$. The critic is responsible for evaluating the quality of the current strategy by adapting the value function estimate. After a number of strategy evaluation steps by the critic, the actor is updated by using information from the critic. The architecture is described as follows.
The structure of the game corresponds to a Stackelberg approach in response to the Machiavellian Views, which involve the idea that the world consists of manipulators and manipulated players. The dynamics of the Machiavellian game is as follows: the manipulator players consider the best-reply of the manipulated players, and then select the strategy that optimizes their utility, anticipating the response of the manipulated players. Subsequently, the manipulated players observe the strategy played by the manipulator players and select their best-reply strategy. The manipulators and manipulated players are themselves in a (non-cooperative) Nash game. Formally, the Stackelberg model is solved to find the subgame perfect Nash equilibrium, i.e., the strategy that best serves each player given the strategies of the other players, and that entails every player playing a Nash equilibrium in every subgame. The Machiavellian Tactics are concerned with the use of manipulation strategies. In this sense, the equilibrium point of the game represents the strategies needed to achieve specific power situations. In this model, the manipulators have commitment power, which gives them a significant advantage over the manipulated players.
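The leader-anticipates-follower dynamics can be illustrated on a toy instance. The sketch below is a deliberate simplification and not the method of the paper: it assumes one manipulator and one manipulated player, pure strategies only, and hypothetical cost matrices that both players minimize, and it solves the game by enumeration instead of the extraproximal method.

```python
import numpy as np

def stackelberg_pure(leader_cost, follower_cost):
    """Solve a tiny leader/follower game in pure strategies.

    leader_cost[a, b] and follower_cost[a, b] are the costs when the
    leader plays a and the follower plays b. The follower best-replies
    to each a; the leader picks a anticipating that reply.
    """
    follower_reply = follower_cost.argmin(axis=1)      # b*(a) for every a
    rows = np.arange(leader_cost.shape[0])
    leader_values = leader_cost[rows, follower_reply]  # leader cost at (a, b*(a))
    a = int(leader_values.argmin())
    return a, int(follower_reply[a])

# Hypothetical cost matrices (illustrative values only).
leader_cost = np.array([[2.0, 4.0],
                        [1.0, 3.0]])
follower_cost = np.array([[1.0, 0.0],
                          [0.0, 2.0]])

a_star, b_star = stackelberg_pure(leader_cost, follower_cost)
print(a_star, b_star)
```

The commitment advantage shows up in the order of evaluation: the follower's reply is computed for every leader action before the leader chooses.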
For representing the concept of immorality we employ a reinforcement learning approach, a computational mechanism whose principle of error-driven adjustment of cost/reward predictions contributes to the players' acquisition of moral/immoral behavior. The reinforcement learning algorithm is based on an actor-critic approach responsible for evaluating the new state of the system and determining whether the costs/rewards are better or worse than expected, supported by the Machiavellian game theory solution. We view the actor-critic algorithm as a stochastic game algorithm on the parameter space of the actor. The game is solved employing the extraproximal method (see the appendix). The functional of the game is viewed as a regularized Lagrange function whose solution is given by a stochastic gradient algorithm.
The responsibility of the actor is to compute the strategy solution $d^{\iota *}_{(\hat{i}_{\iota}|k_{\iota})}$ of the Machiavellian game. In the actor, the action selection follows this strategy solution: a) given a fixed $\hat{i}_{\iota}$, each player randomly chooses an action $a^{\iota}(t) = a_{(\hat{k}_{\iota})}$ (for the estimated value $\hat{k}$) from the vector $d^{\iota *}_{(\hat{i}_{\iota}|k_{\iota})}$; b) then, players employ the transition matrix $\Pi^{\iota} = [\pi^{\iota}_{(i_{\iota} j_{\iota}|k_{\iota})}]$ to randomly choose the consecutive state $s^{\iota}(t+1) = s^{\iota}_{(\hat{j}_{\iota})}$ (for the estimated value $\hat{j}$) from the vector $\pi^{\iota}_{(\hat{i}_{\iota} j_{\iota}|\hat{k}_{\iota})}$ (for fixed $\hat{i}_{\iota}$ and $\hat{k}_{\iota}$); c) as soon as $s^{\iota}(t)$, $a^{\iota}(t)$ and $s^{\iota}(t+1)$ are selected, they are sent to the critic.
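Steps a) to c) amount to two categorical draws per player. A minimal single-player sketch follows, assuming the strategy is stored as `d[i, k]` and the transitions as `pi[i, j, k]`; these array layouts, the function name `actor_step` and the numerical values are assumptions made for illustration.

```python
import numpy as np

def actor_step(rng, d, pi, i):
    """One actor step from state i.

    a) draw action k from the strategy row d[i, :],
    b) draw next state j from the transition row pi[i, :, k],
    c) return the triple (i, k, j) that would be sent to the critic.
    """
    k = int(rng.choice(d.shape[1], p=d[i]))
    j = int(rng.choice(pi.shape[1], p=pi[i, :, k]))
    return i, k, j

# Hypothetical strategy and controlled transitions (2 states, 2 actions).
d = np.array([[0.5, 0.5], [1.0, 0.0]])
pi = np.stack([
    np.array([[0.9, 0.1], [0.2, 0.8]]),  # transitions under action 0
    np.array([[0.3, 0.7], [0.6, 0.4]]),  # transitions under action 1
], axis=2)

rng = np.random.default_rng(0)
triple = actor_step(rng, d, pi, i=0)
print(triple)
```

In the multi-player game each player would run this step with its own strategy and transition matrix.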
The role of the critic is to evaluate the current strategy prescribed by the actor and compute an approximation of the projection of $\hat{\pi}^{\iota}_{(\hat{i},\hat{j}|\hat{k})}(t)$ and $\hat{J}^{\iota}_{(\hat{i},\hat{j}|\hat{k})}(t)$. The actor uses this approximation to update its strategies in the direction of performance improvement of the Machiavellian game. In the critic, we consider that rational Machiavellian players employ a combination of deontological and utilitarian moral rules, as well as moral heuristics, for representing the concept of immoral decision-making in the Machiavellianism social conduct theory. Most of the time a Machiavellian player follows a deontological moral. But the disposition not to become attached to a conventional moral is represented by the utilitarian moral which, in our case, promotes immoral values. The utilitarian moral states that the moral quality of actions is determined by their consequences: the goal is to maximize the utility of all individuals in the society. On the other hand, the deontological moral states that certain things are morally valuable in themselves. The moral quality of actions arises from the fact that the action is done to protect such moral values: consequences and outcomes of actions are secondary. Our approach represents a Machiavellian player's disposition not to become attached to a conventional moral as a utilitarian moral (immorality). This combination of utilitarian and deontological morals suggests a specific method for moral decision making.
The estimated values are updated employing the moral learning rules for computing $\hat{\pi}^{\iota}_{(\hat{i},\hat{j}|\hat{k})}(t)$ and $\hat{J}^{\iota}_{(\hat{i},\hat{j}|\hat{k})}(t)$. When the player's performance is compared to that of a player which acts optimally from the beginning, the difference in performance gives rise to the notion of regret. Then, the critic produces a reinforcement feedback for the actor by observing the consequences of the selected action. The critic takes a decision considering a TD-error $e$; in our case we consider the mean squared error $e(\hat{\Pi}^{\iota}(t-1) - \hat{\Pi}^{\iota}(t)) > 0$, which determines whether the costs/rewards are better or worse than expected with the preceding action. The TD-error $e$ corresponds to the mean squared error of an estimator, which in this case measures the difference between the estimator and what is estimated (the difference occurs because of randomness). The TD-error $e$ in the reinforcement learning process is employed to evaluate the preceding action: if the error is positive, the tendency to select this action should be strengthened; otherwise, it should be lessened. The value-minimizing/maximizing action at each state is taken if the actor-critic learning rule for $\hat{\pi}^{\iota}_{(\hat{i},\hat{j}|\hat{k})}$ ensures convergence.
We note some minor differences with the common usage of reinforcement learning, because the control strategy needs to change as the actor computes the game. This need not present any problems, as long as the actor parameters are updated on a slower time scale. The Machiavellian architecture suggests that the cost/reward model may have benefits not only for individual survival, by computing the Stackelberg/Nash game, but also for the computation of the moral value of actions, by employing a reinforcement learning approach.

Machiavellian Views and Tactics
The Machiavellian game

Let us consider a Stackelberg game (Trejo et al.) with $N$ manipulators whose strategies are denoted by $u^{l} \in U^{l}$, $l = \overline{1, N}$, where $U$ is a convex and compact set. Denote by $u = (u^{1}, \ldots, u^{N}) \in U$ the joint strategy of the players, and by $u^{\hat{l}}$ the strategy of the rest of the players adjoint to $u^{l}$, that is, the joint strategy of all manipulators other than $l$.

Remark. Considering the Machiavellian Views (see the Brief review subsection), we represent manipulators and manipulated players in a Stackelberg game (Trejo et al.): this is a multiplayer model involving a Nash game for the manipulators and a Nash game for the manipulated players, restricted by a Stackelberg game defined as follows.

Definition. A game with $N$ manipulators and $M$ manipulated players is said to be a Machiavellian Stackelberg/Nash game if

$$G(u, \hat{u}(u)|v) := \sum_{l=1}^{N} \left[ \varphi_{l}\left(\bar{u}^{l}, u^{\hat{l}}|v\right) - \varphi_{l}\left(u^{l}, u^{\hat{l}}|v\right) \right]$$

satisfies the properties:
• for the manipulator players

Remark. The strategy $(u^{*}, v^{*})$ is a solution of the Machiavellian game that corresponds to the Machiavellian Tactics needed to achieve specific power situations (see the Brief review subsection).

Let R = N l=1 N l M l , then U adm (U admissible) is defined as follows

The regularized Lagrange principle

Machiavellian Immorality
The design of the Machiavellian immorality module involves the learning rule for estimating $\pi^{\iota}_{(i,j|k)}$ and the learning rule for estimating $J^{\iota}_{(i,j|k)}$. Both rules are defined as follows (Sánchez et al.).

The learning rule for estimating $\pi^{\iota}_{(i,j|k)}$ is defined in terms of frequency counts, where $f^{\iota}_{(i|k)}(t)$ (the frequency of observed experiences) is the total number of times that player $\iota$ evolves from state $i$ applying action $k$ in the RL process, and $f^{\iota}_{(i,j|k)}(t)$ is the total number of times that player $\iota$ evolves from state $i$ to state $j$ applying action $k$ in the RL process. Initially, $f_{(\hat{i},\hat{j}|\hat{k})}(0) = 0$ and $f_{(\hat{i}|\hat{k})}(0) = 0$.

Remark
The normalization of this process is obtained by projection onto the simplex. Given that $\sum_{j} \hat{\pi}^{\iota}_{(\hat{i} j|\hat{k})} = 1$, at each step the complete row $\hat{i}$ is projected onto the simplex ($P : \hat{\pi}^{\iota}_{(\hat{i}|\hat{k})}(t) \to S^{N_{\iota}}$).
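The remark does not say which projection is used. As a sketch, assuming the Euclidean projection is intended, a standard sort-based routine can renormalize an estimated row; the function name `project_to_simplex` is introduced here for illustration.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of a vector onto the probability simplex
    {x : x >= 0, sum(x) = 1}, using the standard sort-based method."""
    v = np.asarray(v, dtype=float)
    n = v.size
    u = np.sort(v)[::-1]                 # entries in decreasing order
    css = np.cumsum(u) - 1.0             # cumulative sums minus the target mass
    ks = np.arange(1, n + 1)
    rho = np.nonzero(u - css / ks > 0)[0][-1]
    theta = css[rho] / (rho + 1)         # shift that makes the result sum to 1
    return np.maximum(v - theta, 0.0)

# A perturbed row estimate that no longer sums to one.
row = project_to_simplex([1.2, 0.3])
print(row)
```

A row that already lies on the simplex is returned unchanged, so applying the projection at every step is harmless when the estimate is already normalized.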
The learning rule for estimating $J^{\iota}_{(i,j|k)}$ follows the same pattern: for the cost/reward model, we keep a running average of the rewards observed upon taking each action in each state. The accumulated quantity is the sum over all immediate costs/rewards received after executing action $a$ in state $i$ and stepping to state $j$, incremented by $\Delta J$ multiplied by a random value $\eta$, $-1 \leq \eta \leq 1$. The parameter $\eta$ is a learning rate that can depend on the salience properties of both the unconditional and the conditional stimuli being associated.

Remark
The learning rules of the architecture are computed considering the maximum likelihood model, where $\frac{0}{0} := 0$.
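A count-based version of these estimators can be sketched as follows. It assumes the plain maximum-likelihood form (the ratio of the two frequency counts for the transitions, and the average observed cost for the cost model) with the convention 0/0 := 0; the class name is hypothetical, and the random perturbation by $\eta$ described above is left out for clarity.

```python
import numpy as np

class CriticEstimates:
    """Frequency-count estimates of transitions and costs for one player."""

    def __init__(self, n_states, n_actions):
        self.f_ik = np.zeros((n_states, n_actions))             # visits of (i, k)
        self.f_ijk = np.zeros((n_states, n_states, n_actions))  # visits of (i, j, k)
        self.cost_sum = np.zeros((n_states, n_states, n_actions))

    def observe(self, i, k, j, cost):
        """Record one transition i -> j under action k with its cost."""
        self.f_ik[i, k] += 1
        self.f_ijk[i, j, k] += 1
        self.cost_sum[i, j, k] += cost

    def pi_hat(self):
        """Estimated transitions f_ijk / f_ik, with 0/0 := 0."""
        denom = self.f_ik[:, None, :]
        return np.divide(self.f_ijk, denom,
                         out=np.zeros_like(self.f_ijk), where=denom > 0)

    def J_hat(self):
        """Average observed cost per (i, j, k), with 0/0 := 0."""
        return np.divide(self.cost_sum, self.f_ijk,
                         out=np.zeros_like(self.cost_sum), where=self.f_ijk > 0)

est = CriticEstimates(n_states=2, n_actions=2)
est.observe(0, 0, 1, cost=2.0)
est.observe(0, 0, 1, cost=4.0)
est.observe(0, 0, 0, cost=1.0)
print(est.pi_hat()[0, :, 0], est.J_hat()[0, 1, 0])
```

Rows for state-action pairs that have never been visited stay at zero, which is exactly where the 0/0 := 0 convention applies.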
The design of the adaptive module for the actor-critic architecture consists of the following learning rules. The definition involves the variable $t_{0}$, which is the time required to compute the corresponding matrices $\hat{\pi}^{\iota}_{(\hat{i},\hat{j}|\hat{k})}(t)$ and $\hat{J}^{\iota}_{(\hat{i},\hat{j}|\hat{k})}(t)$.

In the learning rule for estimating $\pi^{\iota}_{(i,j|k)}$, the estimation of the transition matrix $\Pi$ for the random variables $s_{i}$, $s_{j}$ and $a_{k}$ is considered only if the estimated error $e$ decreases with respect to the predecessor estimated transition matrix $\hat{\Pi}(t-1)$ of $\hat{\Pi}(t)$, i.e., $e(\hat{\Pi}(t-1) - \hat{\Pi}(t)) > 0$.

A corresponding learning rule is used for estimating $J^{\iota}_{(\hat{i},\hat{j}|\hat{k})}$. We thus have an RL process for representing the concept of Machiavellian immorality in which, if there are changes in the system, the players are able to learn and adapt to the environment. We can estimate the random variables corresponding to the entry $(\hat{i},\hat{j},\hat{k})$ of both the transition and the cost/reward matrices. The principle of error-driven adjustment of cost/reward predictions given by the estimated error $e$ contributes to the players' acquisition of moral (immoral) behavior. The RL process converges because of the ergodicity restriction imposed on the Markov game.

Numerical example
This example shows a proof-of-concept experimental result obtained using the proposed model. The environment has four players (two manipulators and two manipulated), four states and two actions. The dynamics of the process can be observed by studying the time-varying estimated cost functions $\hat{J}_{ij|k}$, the estimated transition matrices $\hat{\pi}_{ij|k}$, the estimated strategies $\hat{d}_{i|k}$ and the error $e$.

For the purpose of implementation, the learning process can be conceptualized in two phases, where each phase corresponds to learning a single module. At the beginning of the first phase, the players start exploring by executing actions according to a softmax action selection rule based on a Boltzmann distribution. After ten thousand algorithm iterations, the estimated transition matrices and cost functions of all the players begin to stabilize. In this sense, the error decreases and falls below a given threshold. In the second phase, the players begin to explore by executing the actions based on the solution of the game solver. After one thousand algorithm iterations,
The estimated values for $\hat{J}_{ij|k}$ are as follows:

Conclusions and future work

In this paper we developed a manipulation model where the manipulator players can anticipate the predicted response of the manipulated players. The proposed Machiavellian game theory approach established a system that analyzes and predicts how players behave in strategic situations, combining strategic thinking, best-reply, and mutual consistency (equilibrium). We justified the mutual consistency used to predict how players are likely to behave by introducing a reinforcement learning process. In fact, the equilibrium is the result of strategic thinking, optimization, and equilibration (or learning).

In our proposed model the manipulator players play a Machiavellian strategy first, and the manipulated players move sequentially in terms of the actions proposed by the manipulators. In the dynamics of the model, the manipulator players consider the best-reply of the manipulated players, and then select the Machiavellian strategy that optimizes their utility, anticipating the responses of the manipulated players. Subsequently, the manipulated players observe the Machiavellian strategy played by the manipulator players and select their best-reply strategy. The Machiavellian behavior described above coincides with a Stackelberg game. Strategies are selected according to the Machiavellian social conduct theory.

We introduced the utilitarian (immoral behavior) and deontological moral theories for modeling the Machiavellian immorality, and we reviewed psychological findings regarding moral decision making. An advantage of Machiavellianism is that, whereas other moral theories provide standards for how we should act, they do not describe how moral judgments and decisions are achieved in practice. We established a relationship between morals and game theory (economic theories), and we suggested that reinforcement learning theory, with its principle of error-driven adjustment of cost/reward predictions, provides evidence of how players acquire moral (immoral) behavior.

The introduction of reinforcement learning theory suggested, on the one hand, a mechanism by which the utilitarian moral value may guide the Machiavellian behavior and, on the other hand, a way for the deontological moral to obtain a value for certain moral principles or acts. We noted some minor differences with the common usage of classical reinforcement learning, because the control strategy needs to change as the actor computes the game. This need not pose any problems, as long as the actor parameters are updated on a slower time scale. The main advantage of our approach is that it preserves the convergence properties of the extraproximal method. The result of the model is the manipulation equilibrium point. The validity of the proposed method was demonstrated both theoretically and by a simulated numerical example.
A number of challenges are left to be addressed in future work. The obvious next step is to incorporate our model into an implemented explanatory system for testing it with real-life scenarios. An interesting technical challenge is that of finding an analytical solution for the manipulation equilibrium point. An additional interesting problem is to associate the manipulation equilibrium point proposed in this paper with the bargaining problem, where players should cooperate when non-cooperation leads to Pareto-inefficient results. It will also be interesting to examine the problem of the existence of a manipulation equilibrium point under conditions of moral hazard (when one player takes more risks because someone else bears the cost of those risks). These extensions should lead to a more complete account of complex manipulation understanding.

Appendix: Extraproximal
The individual cost-functions of the leaders are defined as follows: