Generating Synthetic Bitcoin Transactions and Predicting Market Price Movement Via Inverse Reinforcement Learning and Agent-Based Modeling

In this paper, we present a novel method to predict Bitcoin price movement utilizing inverse reinforcement learning (IRL) and agent-based modeling (ABM). Our approach consists of predicting the price through reproducing synthetic yet realistic behaviors of rational agents in a simulated market, instead of estimating relationships between the price and price-related factors. IRL provides a systematic way to find the behavioral rules of each agent from blockchain data by framing the trading behavior estimation as a problem of recovering motivations from observed behavior and generating rules consistent with these motivations. Once the rules are recovered, an agent-based model creates hypothetical interactions between the recovered behavioral rules, discovering equilibrium prices as emergent features through matching the supply and demand of Bitcoin. One distinct aspect of our approach with ABM is that while conventional approaches manually design individual rules, our agents' rules are channeled from IRL. Our experimental results show that the proposed method can predict short-term market price while outlining overall market trend.


Introduction
Agent-based models (ABMs) use a bottom-up approach to discover complex, aggregate-level properties. These properties emerge from individual agent behaviors and interactions within their environment. Thus, once individual behavioral rules of Bitcoin trading are formulated, an ABM can generate an emerging aggregate phenomenon, such as market price, from the possibly non-linear interactions of those rules. Unfortunately, discovering these rules can be challenging and may require deep insight and domain knowledge.
We utilize inverse reinforcement learning (IRL) as a method for obtaining individual rules for an ABM directly from data. Conventional statistical analysis of trading behaviors or market behaviors mainly evaluates the impact of independent actions on the target events. In contrast, we use Markov decision processes (MDPs) to model the long-term process of trading Bitcoin, and then estimate a generalizable trading rule for each agent, assuming that an action has long-term consequences and that each agent tries to maximize its long-term rewards.
With the combination of IRL and ABM, initially proposed by Lee et al. ( ), we first generate synthetic but realistic Bitcoin trading rules and then find the daily prices resulting from their interactions. Through an experiment, we show that the method provides rich but concise market behavioral rules for agents while predicting the aggregate-level market price. The contribution of this paper is to demonstrate that the combination of individually learned rules and macro-level simulations can provide a new option for market prediction and policy experiments.
In Section we review the background of cryptocurrency, IRL, and related research. In Section , we present the details of the proposed method. Section presents an experiment and results using the proposed method on nine -month periods of the Bitcoin market. In Section we discuss the challenges of the method, and we conclude the paper by highlighting areas of future research in Section .

Background and Related Work
Cryptocurrency and blockchain
Blockchain first came into existence in as a result of the seminal paper, "Bitcoin: A Peer-to-Peer Electronic Cash System" published under the pseudonym "Satoshi Nakamoto" (Nakamoto ). The proposition of the paper was simple: to create a distributed public ledger in which everybody could participate and interact to maintain a record of transactions, removing the need for any third-party intermediary. Essentially, the paper proposed a secure and distributed way of transacting currency without a bank or central body to govern or dictate the flow of assets.
The Bitcoin blockchain is secured by cryptography, and as a result, it is nearly impossible to steal or take an individual's Bitcoins. Bitcoin secures itself through a network of "miners" who search for "blocks" of Bitcoin using a large array of computers, competitively validating the integrity of transactions and in turn being rewarded with newly minted Bitcoin. All of Bitcoin's transaction history, including every transaction made since its genesis block, is available on the public blockchain and will continue to exist for as long as the network remains operational and at least one copy of it exists. A Bitcoin transaction consists of input addresses, input amounts, output addresses, and output amounts. The transactions form blocks, and the verified blocks are then linked into a continuously growing chain.
The most noticeable characteristic of Bitcoin at the moment is its price fluctuations. The price skyrocketed from $ /BTC to $ , /BTC in one year, while experiencing large amounts of volatility throughout the period. However, nothing is clear about what causes this high volatility since it is almost completely anonymous and traded without borders. As Bitcoin continues to generate more interest around the world as a result of its price phenomenon, it will become more important to make sense of its chaotic price movements.
Research on cryptocurrency market
One of the major approaches to cryptocurrency price prediction involves user-sentiment monitoring through comment analysis of online cryptocurrency communities. The comment analysis study ultimately concluded that more qualitative selection criteria are needed to build a prediction model (Kim et al. ). In another vein, Kristoufek ( ) utilized Wavelet Coherence Analysis to determine that while Bitcoin has standard financial asset characteristics, there exist speculative properties that determine its price. Also, new time-series analysis frameworks have been proposed using cryptocurrency data. Amjad & Shah ( ) framed the prediction as a ternary-state classification problem, and Jang & Lee ( ) utilized Bayesian neural networks.
ABM has also been used in cryptocurrency research to model complex systems that consist of heterogeneous, autonomous agents who interact with each other and the environment. Cocco & Marchesi ( ) presented an agent-based artificial market model of the Bitcoin mining process and the Bitcoin transactions. The main result of the authors' model is that it effectively reproduces the unit-root property of the price series, the fat tail phenomenon, the volatility clustering of the price returns, the generation of Bitcoins, the hashing capability, the power consumption, and the hardware and electricity expenses incurred by miners. The authors were able to demonstrate that an artificial financial market model can reproduce the stylized facts of the Bitcoin market.
Most of the machine learning-based methods, including the aforementioned studies, have focused on identifying relationships between Bitcoin price and price-related factors without considering the most unambiguous actors of the system: Bitcoin users and traders. We believe that starting from the fundamental actors of the system can reveal more realistic emergent features and provide a more expandable model than previous methods that are restricted to only macro phenomena. At the same time, unlike most ABM models, our approach generates agent behaviors directly from transaction data, which are publicly available in the cryptocurrency market.
Our contribution is that we propose a new groundwork for market simulation while there currently exists no outstanding research that presents a prediction benchmark in this young, highly volatile market.

Inverse reinforcement learning
Inverse reinforcement learning (IRL) (Russell ) is a problem within the reinforcement learning framework based on Markov decision processes (MDPs), a common probabilistic model of sequential decisions and rewards. Formally, an MDP is represented as a five-tuple $(S, A, P_{\cdot}(\cdot, \cdot), R(\cdot, \cdot), \gamma)$, where $S$ is the set of discrete states, $A$ is the set of discrete actions, $P_a(s, s')$ is the probability of moving from state $s$ to $s'$ after taking action $a$, $R(s, a)$ is the scalar reward for taking action $a$ in $s$, and $\gamma$ is the reward discount factor. An IRL problem assumes that the reward $R$ is unknown and tries to find a reward that explains observed trajectories of $(s, a)$ pairs and to determine the associated optimal policy.
Specifically, a policy $\pi : S \times A \to [0, 1]$ is defined as a set of probabilities of choosing action $a \in A$ at state $s \in S$, and the cumulative reward along a trajectory is $v = \sum_{t=1}^{T} \gamma^t r_t$, where $t$ denotes the time step along the trajectory. For each state $s \in S$, the expectation of the cumulative reward under the policy $\pi$ can be computed as follows:

$$V^{\pi}(s) = \mathbb{E}\left[\sum_{t=1}^{T} \gamma^t r_t \,\middle|\, s_1 = s, \pi\right]$$

The optimal policy for an MDP is a policy $\pi^*$ that satisfies $V^{\pi^*}(s) \ge V^{\pi}(s)$ for all states $s$ and all policies $\pi$.
Among the IRL algorithms, we utilize the OptV algorithm because of its speed and efficiency. Speed is critical for our method since we model hundreds of IRL agents to generate aggregated behaviors in an ABM simulated market. In the OptV formulation, the optimal policy $\pi^*(s)$ is parameterized by the desirability function $z(s) = \exp(-v(s))$, where $v(s)$ is the optimal value function. The inference is then made by maximum likelihood, that is, by minimizing the negative log-likelihood $L$ of the observed trajectories, where $n$ refers to a discrete step in a trajectory. Writing $L$ in terms of $v$ reduces the IRL problem to unconstrained convex optimization of an easily computed function. We apply the L-BFGS method with backtracking line search to solve the optimization problem.
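As a hedged sketch of how the likelihood is usually written in the linearly solvable MDP setting on which OptV is built: the symbols $\bar{p}(s' \mid s)$ (passive transition dynamics) and $s_n, s'_n$ (the observed state and its successor at step $n$) are notation assumed here for illustration and are not taken from the text above.

```latex
% Optimal policy expressed through the desirability function z(s) = exp(-v(s))
\pi^*(s' \mid s) = \frac{\bar{p}(s' \mid s)\, z(s')}{\sum_{s''} \bar{p}(s'' \mid s)\, z(s'')}

% Negative log-likelihood of the observed transitions
L = -\sum_{n} \log \pi^*(s'_n \mid s_n)

% The same quantity written in terms of v (terms independent of v collected in "const")
L[v] = \sum_{n} v(s'_n) + \sum_{n} \log \sum_{s'} \bar{p}(s' \mid s_n) \exp(-v(s')) + \text{const}
```

Under these assumptions, $L[v]$ is a linear term plus log-sum-exp terms in $v$, which is why the problem is convex and unconstrained.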

Proposed Method
Our method begins with Bitcoin blockchain data for individual users' transactions and ends with price estimation generated by an ABM. The method is summarized graphically in Figure below.
Figure : Method diagram illustrating the progression from observational data to aggregate behaviors generated by an ABM.
We try to recover the behaviors of each agent from its sequential transactions using an IRL algorithm. We then build an ABM that represents a hypothetical Bitcoin market. Since the market has well-defined interactions, and the supply and demand of the market can be inferred directly from the behavioral rules recovered from the IRL, the ABM is expected to play out over time to predict the future equilibrium prices, at least for a short period of time.
Specifically, the method consists of four steps (numbered in the method diagram):
1. Identify major agents by address linking
2. Model IRL for major agents' behavioral rules
3. Construct ABM for Bitcoin supply/demand
4. Predict equilibrium prices for future dates

Identify major agents by address linking
As of Aug. , there are million addresses on the Bitcoin blockchain. Since typical users have multiple Bitcoin addresses and the transactions are anonymous, it is necessary to group addresses that are controlled by a single entity. This process is called address linking, which is itself an ongoing topic in cryptocurrency research. On this front, we use a method implemented by Kalodner et al. ( ). The heuristics used in this implementation include (1) input addresses used in the same transaction are controlled by the same entity (except for CoinJoin transactions), and (2) change addresses are not reused.
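The first heuristic can be illustrated with a union-find structure. This is a minimal sketch, not the implementation by Kalodner et al.: the function name `link_addresses` and the input format (one list of input addresses per transaction) are assumptions made for illustration, and neither the CoinJoin exception nor the change-address heuristic is handled.

```python
from collections import defaultdict


def link_addresses(transactions):
    """Cluster addresses with the common-input heuristic.

    `transactions` is an iterable of lists; each list holds the input
    addresses of one transaction (a hypothetical, simplified format).
    Returns a list of sets, one set of addresses per inferred entity.
    """
    parent = {}

    def find(addr):
        # Register unseen addresses as their own root, then walk to the root.
        parent.setdefault(addr, addr)
        while parent[addr] != addr:
            parent[addr] = parent[parent[addr]]  # path halving
            addr = parent[addr]
        return addr

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for inputs in transactions:
        if not inputs:
            continue
        first = inputs[0]
        find(first)  # make sure single-input transactions are registered
        for other in inputs[1:]:
            union(first, other)  # co-spent inputs belong to one entity

    clusters = defaultdict(set)
    for addr in parent:
        clusters[find(addr)].add(addr)
    return list(clusters.values())
```

For example, `link_addresses([["a", "b"], ["b", "c"], ["d"]])` groups `a`, `b`, and `c` into one entity because `b` links the first two transactions, and leaves `d` on its own.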

There are million entities (clusters of addresses) identified by the method stated above. Since our model treats each entity as an agent, it is not feasible to run an IRL algorithm for all entities because of their large number. We analyzed the market transaction volume, which reveals that a very small number of agents account for a dominant portion of the whole market. For example, only , agents make up % of the whole market transaction amounts during Jan. -Jul. . When counting only agents who had more than transactions during the period, only , agents make up % of the counted transaction volume. The shade in the plot below marks the agents that account for % of market transactions, to give a sense of the number of major agents.
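A minimal sketch of this kind of selection, assuming a hypothetical mapping `volume_by_agent` from entity identifier to transaction volume and a target `share` of total volume; both names, and the greedy top-down rule, are illustrative assumptions rather than the procedure used in the paper.

```python
def select_major_agents(volume_by_agent, share):
    """Return the largest agents, in descending order of volume, until their
    cumulative volume reaches `share` (a fraction in [0, 1]) of the total."""
    total = sum(volume_by_agent.values())
    ranked = sorted(volume_by_agent.items(), key=lambda kv: kv[1], reverse=True)
    selected = []
    cumulative = 0.0
    for agent, volume in ranked:
        # Stop as soon as the agents picked so far cover the requested share.
        if total == 0 or cumulative / total >= share:
            break
        selected.append(agent)
        cumulative += volume
    return selected
```

With volumes `{"a": 70, "b": 20, "c": 5, "d": 5}` and `share=0.8`, the function returns `["a", "b"]`: two of four agents already cover the requested share, which mirrors the concentration described above.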

Model IRL for major agents' behavioral rules
An IRL algorithm solves for the reward function (R) given MDP\R, which consists of states, actions, and transition probabilities. With R recovered, we can solve the completed MDP and find an optimal behavioral policy with value iteration. We assume that each agent has its own MDP\R since trading volume is considerably different from agent to agent. Daily transactions of each agent are fed into an IRL algorithm as an observed trajectory of (s, a) pairs. The results of our IRL approach are the (presumably) optimal actions for each state in a given MDP.
A state is defined by four variables: BTC price (BP), difference between BTC price and moving average of the price (PM), BTC value possessed (VP), and BTC value realized (VR). All values are measured as a daily value based on Coordinated Universal Time (UTC). An agent action (A) is the net amount of BTC spent on a day. A negative action value means that the agent has a net receiving amount on that day. These variables must be discretized to define discrete states for IRL. In our formulation, the reward discount factor (γ) of the IRL represents a preference for immediate rewards considering market uncertainty and internal rate of return (IRR) of Bitcoin users. The γ is set to . throughout the method.
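Once a reward function is available, the value-iteration step mentioned above can be sketched as follows. This is a generic textbook version under assumed data structures (`P[a][s][s2]` for transition probabilities and `R[s][a]` for rewards); it is not the authors' code, and it requires a discount factor below 1 to terminate.

```python
def value_iteration(states, actions, P, R, gamma, tol=1e-8):
    """Solve a finite MDP by value iteration.

    P[a][s][s2] is the probability of moving from s to s2 under action a,
    R[s][a] is the reward for taking a in s, and gamma < 1 is the discount.
    Returns the value function and a greedy (optimal) policy.
    """
    def q_value(s, a, V):
        # One-step lookahead: immediate reward plus discounted expected value.
        return R[s][a] + gamma * sum(P[a][s][s2] * V[s2] for s2 in states)

    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(q_value(s, a, V) for a in actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break

    policy = {s: max(actions, key=lambda a, s=s: q_value(s, a, V)) for s in states}
    return V, policy
```

In a toy market with states `low` and `high`, where holding moves the price to `high` and selling at `high` pays a reward of 1 but moves the price to `low`, the recovered policy is to hold at `low` and sell at `high`.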
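As an illustration of the PM state variable, the following sketch computes the gap between the daily price and its m-day simple moving average. Whether the window includes the current day is an assumption made here, and the function name is hypothetical.

```python
def moving_average_gap(prices, m):
    """Return PM_t = BP_t - mean(BP_{t-m+1}, ..., BP_t) for every day t
    that has a full m-day window (the window is assumed to include day t)."""
    gaps = []
    for t in range(m - 1, len(prices)):
        window = prices[t - m + 1 : t + 1]
        gaps.append(prices[t] - sum(window) / m)
    return gaps
```

With `m=1` the gap is always zero, because the moving average coincides with the price itself; larger windows produce a gap that reflects the recent trend.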
Transition probabilities are projected by assuming that the daily BTC price change ($D$) follows a normal distribution. The probability of any combination of changes in the state variables can be calculated from the probability distribution of the price difference, since all four state variables can be expressed in terms of the daily BTC price change $D_t$, where $t$ is the index of the day and $m$ is the length of the moving average period. For example, the probability density of changing from $BP_1$ to $BP_2$, controlling for the other variables, is given by the normal pdf $f(D_2 \mid \mu, \sigma)$ once $\mu$ and $\sigma$ have been estimated from historical price data.
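A minimal sketch of this projection for the price variable alone, assuming discretized price bins given by `bin_edges` and a list of historical daily changes; both names are hypothetical. Instead of evaluating the pdf at a point, the sketch integrates the normal distribution over each bin with the cdf and renormalizes, which is one possible way (an assumption here) to turn the density into discrete transition probabilities.

```python
from statistics import NormalDist, mean, stdev


def price_transition_row(current_price, bin_edges, daily_changes):
    """Probability of tomorrow's price landing in each bin, given today's price.

    Assumes the daily change D is Normal(mu, sigma), with mu and sigma
    estimated from `daily_changes`. Mass outside the grid is renormalized away.
    """
    dist = NormalDist(mean(daily_changes), stdev(daily_changes))
    probs = []
    for low, high in zip(bin_edges[:-1], bin_edges[1:]):
        # P(low <= current_price + D < high)
        probs.append(dist.cdf(high - current_price) - dist.cdf(low - current_price))
    total = sum(probs)
    return [p / total for p in probs]
```

For a price of 100, bins `[70, 90, 110, 130]`, and changes with mean 0, the middle bin receives most of the probability and the two outer bins receive equal shares.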

Construct ABM for bitcoin supply and demand
In order to represent market interactions, we construct a hypothetical order book. We assume that at a given price level, each agent has its own level of receiving/spending amount that maximizes its long-term goal. The order book provides an efficient way to find the equilibrium price among the participants. Specifically, it is assumed that agents whose action probability recovered by IRL is greater than . are the ones that participate in the market (if two actions have equal probability, the average of the two actions is used). All participants are assumed to send either a buy order or a sell order with a specific amount. Then, hypothetical transactions are executed at the current price level. When receiving orders or spending orders are left after the executions, the price goes up or down, respectively. If only a small number of market participants are left, it is assumed that an equilibrium price for the day has been reached. The following figure illustrates how an equilibrium price is found on a given day.

Aggregated Transaction Policies
The following algorithm delineates the procedure on a given day.
First, we set all agents in the simulated market to active. While more than a predetermined number of agents remain active, we update each agent's BTC moving average, value possessed, and value realized according to the base price, determine its current state and associated action, and set that action for the agent.
Second, based on each agent's action as a spender or receiver, we match spenders with receivers until the number of active agents falls below the threshold, at which point the market is considered to have reached equilibrium.
Lastly, if there is a disequilibrium between spenders and receivers, we adjust the price accordingly.
Algorithm : Procedure for finding equilibrium price on a day
input : Behavioral policy (π) for each agent and BTC price of previous day (p_prev)
output: Daily equilibrium price (p_curr)
p_curr ← p_prev
set all agents as active
while there are more than n active agents do
    foreach agent i do
        update BTC MA, BTC value possessed, and BTC value realized according to p_curr
        determine current state (s_i) and associated action (π_i(s_i))
        set action A_i ← π_i(s_i)
    end
    Spenders ← agents with A > 0
    Receivers ← agents with A < 0
    set agents with A = 0 as inactive
    if Spenders ≠ ∅ and Receivers ≠ ∅ then
        randomly pair up spending agent and receiving agent
        foreach pair of spender j and receiver k do

Once an equilibrium price has been found on a given day, it serves as the next day's base price, on which agents determine their states and associated policies. Thus, the simulation can play out for any number of days, and the series of equilibrium prices generated in the simulation is taken as the predicted prices. While the prediction period is unbounded, the simulation may diverge from the real market trends after several rounds, because the cumulative BTC value possessed and BTC value realized by each simulated agent can differ from those of the real entity even when the predicted price is similar to the real market price.
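The daily procedure can be sketched in Python under several loudly stated assumptions, since the inner loop of the algorithm is not fully specified above. Each agent is reduced to a hypothetical function `policy(price)` that returns its net spending (positive to spend, negative to receive); state updates are folded into that function; in a matched pair, the agent whose order is fully filled leaves the market; and the price moves by a fixed relative `step` when only one side remains. The names `find_equilibrium_price`, `min_active`, `step`, and `max_rounds` are illustrative, and the pairing is uniform rather than volume-weighted.

```python
import random


def find_equilibrium_price(policies, prev_price, min_active=0,
                           step=0.01, max_rounds=10_000, seed=0):
    """Sketch of a daily equilibrium search.

    `policies` maps an agent id to a function price -> net spending amount.
    The step size and the round limit are assumptions, not source values.
    """
    rng = random.Random(seed)
    price = prev_price
    active = set(policies)
    for _ in range(max_rounds):
        if len(active) <= min_active:
            break  # few enough agents remain: treat the price as equilibrium
        actions = {i: policies[i](price) for i in active}
        active = {i for i in active if actions[i] != 0}  # A = 0 -> inactive
        spenders = sorted(i for i in active if actions[i] > 0)
        receivers = sorted(i for i in active if actions[i] < 0)
        if spenders and receivers:
            rng.shuffle(spenders)
            rng.shuffle(receivers)
            for j, k in zip(spenders, receivers):
                spend, receive = actions[j], -actions[k]
                # The trade executes at the current price; whoever is fully
                # filled leaves the market (both leave if amounts are equal).
                if spend <= receive:
                    active.discard(j)
                if receive <= spend:
                    active.discard(k)
        elif receivers:
            price *= 1 + step  # unmet demand pushes the price up
        elif spenders:
            price *= 1 - step  # unmet supply pushes the price down
        else:
            break
    return price
```

With two agents that only want to receive while the price is below 110, the price climbs in 1% steps from 100 until both withdraw; with one spender and one receiver of equal size, the trade clears immediately and the price stays at the previous day's level.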

Experiment and Results
In order to test diverse market conditions, we performed experiments with the proposed method on training data containing different -month periods of real Bitcoin transactions ranging from Sep. to Jul. . Each experiment consists of simulations starting with the same conditions of the last day of training periods. The purpose of the simulation is to obtain percentages of correct directional prediction. A set of simulations gives consistent results when testing multiple trials. The experiments generate equilibrium prices for days, and these prices are compared with the real market prices within the same period.

Dataset
Bitcoin blockchain data was aggregated by summing transaction amounts to generate the daily net spent amount of each agent. Training datasets were created for nine -month periods starting from September . We used the blockchain data of July and before to exclude the effect of the hard fork that took place in August . We selected major users that take up % of the total market transaction volume. Users who had fewer than transactions were excluded from training. Both state and action spaces are uniformly discretized for MDP\R by dividing the state space into , states: for BTC price, for -day moving average gap, for BTC value possessed (VP), and for BTC value realized (VR). The last price of the training period is set in the middle of the BTC price range to prevent any bias in either direction. Relative to the last training price, the BTC price range runs from $-\max(\text{maximum BTC price} - \text{last training price},\ \text{last training price} - \text{minimum BTC price})$ to $+\max(\text{maximum BTC price} - \text{last training price},\ \text{last training price} - \text{minimum BTC price})$.
The transition probabilities are projected individually based on this BTC price range with discretized VP and VR. The action space is divided into net spending levels, ranging from $-\frac{|\text{minimum net spending}| + |\text{maximum net spending}|}{2}$ to $\frac{|\text{minimum net spending}| + |\text{maximum net spending}|}{2}$.
The maximum and minimum values were measured during the training period. The following table summarizes the information of training datasets.
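The two ranges can be sketched as follows. The function names, the reading of the price range as centered on the last training price, and the choice to return bin edges for prices and evenly spaced levels for actions are assumptions for illustration; the actual numbers of bins and levels are not reproduced here.

```python
def symmetric_price_edges(prices, last_price, n_bins):
    """Uniform bin edges centered on the last training price.

    The half-width is max(max price - last price, last price - min price),
    so the last training price sits in the middle of the range.
    """
    half_width = max(max(prices) - last_price, last_price - min(prices))
    low = last_price - half_width
    width = 2 * half_width / n_bins
    return [low + i * width for i in range(n_bins + 1)]


def action_levels(net_spending, n_levels):
    """Evenly spaced net-spending levels on [-b, b], where
    b = (|min net spending| + |max net spending|) / 2."""
    bound = (abs(min(net_spending)) + abs(max(net_spending))) / 2
    step = 2 * bound / (n_levels - 1)
    return [-bound + i * step for i in range(n_levels)]


def to_bin(value, edges):
    """Index of the uniform bin containing `value`, clamped to the grid."""
    index = int((value - edges[0]) / (edges[1] - edges[0]))
    return min(max(index, 0), len(edges) - 2)
```

For training prices between 90 and 130 with a last price of 120, the half-width is 30, so three bins span 90 to 150 and the last price falls in the middle bin.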

Simulation and validation
Even though we have the exact same behavioral rules for our agents, the market price is determined by how supply and demand are paired. If the market always matches the biggest supply and the biggest demand, the result should be deterministic. However, for a more realistic simulation, we use random matching where the probability of execution is proportional to the volume. We performed simulations for each of nine periods. The simulation ignored actions of less than BTC, which is usually less than . % of total daily transaction volume. The procedure for finding the equilibrium price stops when there are fewer than active agents left, to prevent the price from wavering because of the last few agents on a day. Such parameters are unavoidable when modeling the market with a significantly smaller number of agents than the real market has.
Figure below shows the -day prediction result of the simulations. Because the predicted prices tend to precede the real market prices, the real prices of more days are displayed for comparison. Since there is no comparable method that utilizes individual data in addition to time-series data, predictions from a univariate ARIMA model, which is the model most frequently used for time-series-only data, are presented as a baseline. The price data are clearly non-stationary, and thus we take a first difference of the data for the ARIMA model. The first differencing is compatible with the previous assumption for the transition probability projection. AR and MA orders are chosen with the smallest AIC for each experiment.
Table shows the percentages of correctly predicted direction (up/down) compared to the price of the last training day. The first half of the predictive period tends to be less correct since it does not capture blips of the market prices. However, the prediction rate of the following half-period is almost % with the maximum of . % on the th day. The prediction rates start to drop after days, suggesting that the simulation develops its own direction rather than reflecting the market trend.
The percentage does not represent the accuracy of the prediction. Sometimes, the disparity between the real price and the predicted price can be large even though the predicted direction is correct.
Table : Percentages of correct directional prediction for the proposed method out of simulations and ARIMA out of experiments. The shade emphasizes prediction periods with % or more correct prediction rate.
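The directional metric can be sketched as follows, assuming a hypothetical layout in which `simulated` holds one predicted price series per simulation and `real` holds the real prices for the same days. The function name and data layout are illustrative assumptions.

```python
def correct_direction_share(base_price, simulated, real):
    """For each prediction day, return the fraction of simulations whose
    predicted price lies on the same side of `base_price` (the price of
    the last training day) as the real price."""
    def direction(price):
        # +1 for up, -1 for down, 0 for unchanged relative to the base price
        return (price > base_price) - (price < base_price)

    shares = []
    for day, real_price in enumerate(real):
        hits = sum(direction(run[day]) == direction(real_price) for run in simulated)
        shares.append(hits / len(simulated))
    return shares
```

A run predicting 101 when the real price is 105 counts as correct for a base price of 100, even though it is off by 4, which is why this share says nothing about the size of the price error.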

Sensitivity analysis
There are not many model parameters since the behavioral rules of agents are generated by IRL. Two arbitrarily chosen parameters in the previous section were (1) the minimum transaction amount that decides whether an agent is active and (2) the minimum number of agents that decides when the equilibrium price is reached.
We now test these parameters to make sure that they do not affect the result. As shown in Figure below, the minimum transaction amount has a negligible effect on the result. When the minimum number of agents in the market is set to (around % of total market participants), the simulation seems to stop running before reaching the equilibrium. When the minimum number of agents is too small, however, the equilibrium price tends to be influenced by the last few remaining agents. It is worth noting that there is an important parameter that is latent in the IRL state variables: the moving average day (m). Since a moving average (MA) smooths out the price over m days, it provides a linear trend of the price. We found that the individual behaviors, and thus the subsequent results, were sensitive to m. When m is small, the MA does not provide much additional information since it acts like the price itself. When m is large, it lags the price and does not represent timely decisions of individual users. Since we target -day, short and mid-term predictions, we decided to be an appropriate value for m.

Discussion
Considering that the market price is the result of interactions of market participants, the reasoning behind our method is straightforward compared to other traditional methods that try to find price movement regularity or secondary correlation between the price and external information. Since our method first builds the market itself with individual participants, we can even trace back the causal chain from the market price to the individuals. Moreover, the use of IRL provides a systematic way to generalize market participants' behaviors.
Since a state-based market model has multitudinous combinations of state variables, it is necessary to project observed behaviors to unobserved states. IRL is an efficient tool for generating behavioral rules in unseen situations (generalization) or even in different model dynamics (transferability). The experimental results indicate that the proposed method could be an effective, new prediction method in the Bitcoin market.
However, several limitations remain. First, our simulations cannot model a large enough number of agents to reproduce real-world scenarios more exactly, because running them would require too much computational power. We could not perform sensitivity analysis with the number of agents for the same reason. For more accurate simulations, there should be either a faster IRL algorithm or a method to select influential agents. Currently, we only exclude agents with fewer than transactions, since fewer than transactions can hardly constitute the sequential decisions that are necessary for the IRL method.
A second limitation is that cryptocurrency markets are continually active on a global scale, thus discretizing a day based on UTC is arbitrary. This arbitrariness may result in poor IRL models for some agents and, as a result, a less accurate ABM model. If we can identify each agent's time zone, we will be able to assign individual time zones. Using a smaller time scale could be another way to circumvent this limitation. Another obstacle we face using this method is that testing multiple parameters in the ABM is not easy since almost all parameters are embedded in the IRL models. This could result in neglecting some important variables in the prediction model.

Conclusion and Future Work
In this paper, we proposed a method for generating synthetic Bitcoin transactions and predicting market prices. We were able to predict short-term Bitcoin price movements by utilizing the motivation-based approach to recover not only exhibited behavioral rules but also unobserved rules rooted in the agents' motivations. Our results showed a greater than % directional predictive accuracy on average after a prediction period of six days. From day to day we encountered our strongest predictive accuracy, demonstrating the model's strength in predicting short and mid-term price movements in the Bitcoin market.

Our result does not imply that the proposed method outperforms other prediction techniques. The baseline experiment is far from the best effort, and the directional prediction rate is not a fair metric since it cannot measure comparative accuracy and precision of the result. Thus, a follow-up study is necessary in order to show our model's comparative performance. Since it is not fair to simply compare the results from different datasets and different time frames, the combination of algorithm and dataset should be considered, such as a comparison between IRL+ABM with individual data and other machine learning methods with macroeconomic data.
Once the supply and demand are constructed with ABM, it can be easily expanded to other market behaviors such as price spread, volatility, and trading volume. This expansion can be considered as potential future research, as well as comprehensive validations, since almost all market phenomena can be explained by supply and demand behaviors. Another idea would be understanding market cycles in bear vs. bull trends and how our agents behave during these scenarios. Comparing market movement before and after the Segwit hard fork would be an interesting topic as well.
On the methodology side, we believe that the combination method of IRL and ABM is applicable to much more diverse domains, possibly all areas where ABM is used. By recovering individual rules from data with IRL, this approach can systematically and even automatically build an ABM. In addition, imitation learning is not limited to IRL, and there are a number of techniques that can model behavioral rules from data. One promising future work is applying generative adversarial networks (GANs) in combination with ABM for constructing a predictive model.

JASSS, ( ) , http://jasss.soc.surrey.ac.uk/ / / .html Doi: . /jasss.