Dynamic Pricing Strategies for Perishable Products in a Competitive Multi-Agent Retailer Market

Abstract: Due to fierce competition in the marketplace for perishable products, retailers must use pricing strategies to attract customers. Traditional pricing strategies adjust product prices according to retailers' current situations (e.g., a cost-plus pricing strategy, a value-based pricing strategy and an inventory-sensitive pricing strategy). However, many retailers lack both a perception of customer preferences and an understanding of the competitive environment. This paper explores a Q-learning pricing mechanism for perishable products that considers uncertain demand and customer preferences in a competitive multi-agent retailer market (a model-free environment). In the proposed simulation model, agents imitate the behavior of consumers and retailers. Four potential influencing factors (competition, customer preferences, uncertain demand and perishable characteristics) are built into the pricing decisions. All retailer agents adjust their product prices over a finite sales horizon to maximize expected revenue. One retailer agent adjusts its price according to the Q-learning mechanism, while the others adopt traditional pricing strategies. Shortages are allowed but backlogs are not. The simulation results show that the dynamic pricing strategy based on the Q-learning mechanism can be used for pricing perishable products in a competitive environment, as it produces more revenue for retailers. Further, the paper investigates how the optimal pricing strategy is influenced by customer preferences, customer demand, retailer pricing parameters and the learning parameters of Q-learning. Based on our results, we provide pricing implications for retailers pursuing higher revenues.


Introduction
In order to attract customers, many retailers are offering a large number of perishable products. This allows them to compete against the more traditional channels that specialize in these items (Adenso-Díaz et al. ). Li et al. ( b) reported that more than % of sales in the US grocery retail industry corresponded to food and beverages, and % of those sales were of products with a limited shelf life. This means that more than % of this sales channel deals in perishable units. Perishable products often require careful handling, and their limited shelf life requires the implementation of some sort of strategy that manages the spoilage of outdated units. According to Chung & Li ( ), % of consumers frequently check the expiry dates of perishable products when buying. They prefer to select the fresher units, which provide a higher perception of quality when the units have different expiry dates but the same price (Chung & Li ). It is clear that adjusting prices according to product characteristics, instead of adopting a fixed price across the entire shelf life, may increase sales. Therefore, retailers can dynamically change their prices to balance supply and demand based on information such as inventory shelf life and the elasticity of demand.
Pricing strategies have been extensively studied in the perishable-goods industry (e.g., the cost-plus pricing strategy, the value-based pricing strategy and the inventory-sensitive pricing strategy) (Li et al. ; Chang et al. ). Related research has shown that factors such as cost, uncertain demand, competition and perishable characteristics play a vital role in price setting (Sung & Lee ; Shankar & Bolton ; Soni & Patel ). Customer preferences (e.g., quality of perishable products, distance to retailers and price of perishable products) have also been recognized as an important factor in modeling people's economic behavior that is significant to retailers' pricing strategies (Feldmann & Hamm ). These characteristics mean that the pricing problem of perishable products presents a complex, large-scale stochastic modeling challenge. Because such complex models are computationally intractable, they are generally too difficult to implement in real time. A multi-agent simulation model is considered a good instrument for modeling the social interaction of actors (Lee et al. ). Chang et al. ( ) proposed an agent-based simulation model to develop a best-practice dynamic pricing strategy for retailers. In this model, retailers adjust their pricing strategies according to their current situations. However, their model lacks the ability to perceive customer preferences or to analyze the competitive environment, and it may therefore lead to an incorrect, locally optimal pricing policy. To solve this problem, a simulation model that can learn customer preferences implicitly and optimize pricing strategies for perishable products should be created. The Q-learning algorithm is known for its ability to produce nearly optimal solutions to problems that involve a dynamic environment with a large state space (Dogan & Güner ). Applications of Q-learning in the context of expert systems include real-time rescheduling (Li et al. a), inventory control in supply chains (Jiang & Sheng ), and dynamic pricing policies (Tesauro & Kephart ). Rana & Oliveira ( ) used reinforcement learning to model the optimal pricing of perishable interdependent products under stochastic demand. However, they did not apply Q-learning to the pricing strategies of different retailers that explicitly model customer preferences in a competitive environment. Moreover, there is a lack of model-free environments (e.g., multi-agent systems) in which many influencing factors are implicitly incorporated into the pricing decisions.

Literature Review
Pricing strategies have been widely studied (Zhou ). Different non-linear benefit/demand functions have been proposed based on the price elasticity of demand, the effect of incentives, and the penalties of DR programs on customer responses (Schweppe et al. ; Yusta et al. ). Yousefi et al. ( ) proposed dynamic price elasticities, which comprise different clusters of customers with divergent load profiles and energy-use habits (linear, potential, logarithmic, and exponential representations of the demand vs. price function). Much dynamic pricing research has been related to replenishment (Maihami & Karimi ), procurement (Gümüş et al. ), inventory (Gong et al. ), or uncertain demand (Wen et al. ). Despite the potential benefits of dynamic pricing, many sellers still adopt a static pricing policy due to the complexity of frequent reoptimizations, the negative perception of excessive price adjustments, and the lack of flexibility caused by existing business constraints. Chen et al. ( ) studied a standard dynamic pricing problem in which the seller (a monopolist) possessed a finite amount of inventory and attempted to sell products during a finite selling season; they developed a family of pricing heuristics to address these challenges. Ibrahim & Atiya ( ) considered dynamic pricing for use in cases of continuous replenishment. They derived an analytical solution to the pricing problem in the form of a simple-to-solve ordinary differential equation; the method did not rely on computationally demanding dynamic programming solutions.
Reinforcement learning, a model-free, non-parametric method, is known for its ability to propose near-optimal solutions to problems that involve a dynamic environment with a large state space (Dogan & Güner ). It originated in the areas of cybernetics, psychology, neuroscience, and computer science, and it has attracted increasing interest in artificial intelligence and machine learning (dos Santos et al. ; Oliveira ). Over the past decade, reinforcement learning has become increasingly popular for representing supply chain problems in competitive settings. Kwon et al. ( ) developed a case-based myopic RL algorithm for the dynamic inventory control problem of a supply chain with a large state space. Jiang & Sheng ( ) proposed a similar case-based RL algorithm for dynamic inventory control of a multi-agent supply-chain system. From the perspective of dynamic pricing policies, Li et al. ( a) studied joint pricing, lead-time and scheduling decisions using reinforcement learning in make-to-order manufacturing systems. Q-learning is one of the reinforcement learning models that has been studied extensively by researchers (Li et al. ). Q-learning is a well-known anticipatory learning approach for agents that seek to learn how to act optimally in controlled Markovian domains. Much research has extended the learning model, for example Even-Dar & Mansour ( ) and Akchurina ( ). In the dynamic pricing area, Tesauro & Kephart ( ) studied simultaneous Q-learning by analyzing two competing seller agents in different, moderately realistic economic models. Collins & Thomas ( ) explored the use of reinforcement learning as a means to solve a simple dynamic airline pricing game (SARSA, Q-learning, and Monte-Carlo learning were compared).

Many factors affect pricing strategies. Zhao & Zheng ( ) and Elmaghraby & Keskinocak ( ) showed that prices should rise if there is an increase in perceived product value. Moreover, cost, demand, competition and customer preference play a vital role in price setting (Sung & Lee ; Shankar & Bolton ; Feldmann & Hamm ). Deterioration characteristics are also a factor in a dynamic pricing problem (MacDonald & Rasmussen ; Tsao & Sheen ). Traditional mathematical methods cannot adequately describe the complexities of the competitive market. Due to its modeling power, multi-agent simulation has received much attention from researchers who investigate collective market dynamics (Kim et al. ; Lee et al. ). Arslan et al. ( ) presented an agent-based model intended to shed light on the potential destabilizing effects of bank pricing behavior. Rana & Oliveira ( ) proposed a methodology to optimize revenue in a model-free environment in which demand is learned and pricing decisions are updated in real time (Monte-Carlo simulation). However, there are few papers about the dynamic pricing problem of perishable products that consider competition, customer preference and stochastic demand by means of the Q-learning algorithm.
This paper uses the Q-learning algorithm to model optimal pricing for perishable products, considering uncertain demand and customer preferences (distance and price) in a competitive multi-agent retailer market (a model-free environment). Many potential influencing factors are built into the pricing decisions of the multi-agent models. The remainder of this paper is organized as follows: Section proposes a multi-agent model of the virtual competitive market and the use of Q-learning for dynamic pricing policies. Section describes three scenarios used to test the availability and sensitivity of the Q-learning approach. Future directions and concluding remarks end the paper in Section .

Model
This paper proposes a model to simulate dynamic pricing strategies for perishable products in a competitive market. First, each retailer agent imitates a retailer's sales and replenishment behavior. A retailer agent adjusts its pricing strategy based on its current situation and the competitive environment, and it issues replenishment orders from a supplier based on its current inventory. In the simulation, Q-learning, a kind of reinforcement learning, is used to construct a dynamic pricing strategy with no advance information about customer behavior. Then, each customer agent imitates a customer's purchase choice process. The customers in the market are assumed to have different consumption preferences, in line with recent product pricing studies (Feldmann & Hamm ; Chang et al. ). Customer agent demand is stochastic, and customer agents continually update their position values. The simulation considers a market situation in which multiple retailers compete with each other. The notation used in this paper is listed in Section . A multi-agent model for perishable products in a competitive, model-free environment is proposed in Sections - . The Q-learning algorithm for dynamic pricing is considered in Sections - , and the other dynamic pricing policies are modeled in Section . The model can be accessed at the following URL: https://www.comses.net/codebases/5887/releases/1.0.0/.

Notation
The main notation used in this paper is shown in Table . Other notation is explained in the sections in which it first appears.

V_1 : The set of retailer agents
V_2 : The set of customer agents
v_1j : The jth retailer agent in set V_1
e^t_ij : 1 if customer i buys products from retailer j at time t, 0 otherwise
E : An adjacency matrix consisting of the elements e_ij
S^t_v1j : The state of retailer agent v_1j at time t
f^t_v1j : The fitness of retailer agent v_1j at time t
I^t_v1j : The stock level of retailer agent v_1j at time t
b^t_v1j : The behaviors of retailer agent v_1j at time t
Q : The order quantity of retailer agents
 : The income of retailer agent v_1j at time t
C_1 : The order cost of a retailer agent
C_2 : The unit purchasing cost of a retailer agent
C_3 : The unit inventory cost of a retailer agent
ψ : The quantity deterioration rate of perishable products
r(t) : The residual value of perishable products at time t
α : The value deterioration rate of perishable products
S^t_v2i : The state of customer agent v_2i at time t
ξ : The customers' acceptable value coefficient
s_ki : The set of kth preference values for customer agent v_2i
re_v1j : The reward gained by retailer agent v_1j after each state transition
s_kij : The kth preference value of customer agent v_2i
 : The learning rate of the Q-learning algorithm
η : The discount rate of the Q-learning algorithm
a_v1j : The action of retailer agent v_1j
δ_v2i : The preference of customer agent v_2i
T : The theoretical sales cycle
f_in : The inventory-price coefficient
x^t_v1j : The price state of retailer agent v_1j at time t
Table : Summary of the notation in this paper.
Next, the multi-agent model and the Q-learning algorithm for perishable products are formulated.

The virtual competitive market environment
The competitive market consists of several members (retailers) who compete for customers. Each member of the market (retailer or customer) is simulated by an agent in NetLogo. The relationship between retailers and customers is modeled by an adjacency matrix E, whose elements are e^t_ij. The relationships among suppliers, retailers and customers are shown in Figure , in which a solid arrow means e_ij = 1, while a dotted arrow means e_ij = 0. e_ij should satisfy the following constraint:
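The constraint equation itself was not reproduced here, but since each customer agent chooses exactly one retailer per period, every row of E contains a single 1. A minimal sketch, with hypothetical market sizes (not the paper's):

```python
import numpy as np

# Hypothetical sizes: M customers, N retailers (illustrative values only).
M, N = 6, 4
rng = np.random.default_rng(0)

# Each customer i selects exactly one retailer j in a period, so each row
# of the adjacency matrix E has a single entry equal to 1.
E = np.zeros((M, N), dtype=int)
E[np.arange(M), rng.integers(0, N, size=M)] = 1

row_sums = E.sum(axis=1)   # every row sums to 1
```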

The state of retailer agents
One action of v_1j represents selling at time t, and b(4)^t_v1j is the action used to calculate costs, profit and inventory at time t. Figure shows the dynamic processes of retailer agents. If the inventory of a retailer agent falls below the restocking threshold, the agent orders Q perishable products from a supplier; otherwise it issues no orders. The order policy is the same for each retailer agent. f^t_v1j can be calculated as follows: where C_1, C_2 and C_3 are the order cost, unit purchasing cost and unit inventory cost, respectively.
Equation means that the retailer's demand is determined by its inventory at the tth cycle. If the inventory is less than customer demand, only part of the overall customer demand will be satisfied. Similar to D^t_v1j, the value of I^t_v1j is determined by the inventory at the (t − 1)th cycle. If the final inventory of retailer agent v_1j at the (t − 1)th cycle is less than zero, an order will be generated, and the inventory at the tth cycle will equal Q. Equation shows this relationship. The initial stocks of all retailers are assumed to be the same (Chang et al. ), as Equation shows.
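The inventory and replenishment rule described above can be sketched as follows; the order quantity Q and the restocking threshold are illustrative values, not the paper's calibrated settings. Shortages are allowed but not backlogged, so unmet demand is simply lost:

```python
# Assumed parameter values (illustrative, not the paper's settings).
Q = 50.0          # order quantity
THRESHOLD = 5.0   # restocking threshold

def next_cycle(prev_inventory, demand):
    """Return (satisfied demand, inventory at the start of the next cycle)."""
    satisfied = min(prev_inventory, demand)   # partial fulfilment if short
    remaining = prev_inventory - satisfied
    if remaining < THRESHOLD:                 # order generated, restock to Q
        return satisfied, Q
    return satisfied, remaining

sold, inv = next_cycle(8.0, 12.0)             # demand exceeds stock
```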
In these models, shortages are allowed but cannot be backlogged. Moreover, the value deterioration rate of perishable products should be considered. Different types of perishable products have different rates of deterioration based on their characteristics. Blackburn & Scudder ( ) and Chang et al. ( ) described the diminishing value of perishable products at time t in the form of an exponential function, as shown in Equation . Both assumed that perishable goods with the same criteria have similar qualities.
When t = 0, r(t) = 1 and the perishable products are completely fresh; as t → ∞, r(t) → 0 and the perishable products are useless. If r(t) < ξ, set FO^t_v1j = 1, where ξ is the customers' acceptable value coefficient. Figure : The dynamic processes in a competitive market.

State of customer agents
The state of customer agents can be described as S^t_v2i = (D_v2i, e^t_ij, δ_v2i). The demand of each customer follows a normal distribution, D_v2i ∼ N(u, σ²). It is assumed that each customer's requirement is independent of the retailer's pricing policy. e^t_ij is a binary variable used to describe customer agents' choices of retailer. The retailer choices of all customer agents at time t can be described via the adjacency matrix E, as shown by Equation . δ_v2i = {δ(1)_v2i, δ(2)_v2i}. δ(1)_v2i is a K × N set, in which each element is one preference value (δ(1)_v2i = {s_1i1, s_1i2, . . . , s_1iN, s_2i1, s_2i2, . . . , s_2iN, . . . , s_KiN}). The proportion of each preference is defined as the set δ(2)_v2i (δ(2)_v2i = {γ_1i, γ_2i, γ_3i, . . . , γ_Ki}), with Σ_{k=1}^{K} γ_ki = 1. Customer agents' actions are shown in Figure . In this paper, two kinds of preferences are considered. The first preference is the distance between customer and retailer, which determines the convenience of consumption (Chang et al. ). Price is the other customer preference; compared to quality-oriented customers, price-oriented customers are sensitive to price (Lee et al. ). Based on this, customers can be divided into three categories (see Table ). For example, a price-sensitive customer prefers a retailer with a lower price. The retailer-choosing process can be described as follows:
Step : For each k, calculate the preference values s_ki1, s_ki2, s_ki3, . . . , s_kiN of customer agent v_2i (k denotes the kth preference of customers).
Step : Normalize the kth preference value of customer agent v_2i according to the synthesizing evaluation function of customer preferences: y_kij = (s_kij − min{s_ki}) / (max{s_ki} − min{s_ki}). y_kij is the normalized kth preference value from customer agent v_2i to retailer agent v_1j, with y_kij ∈ [0, 1].
Step : For each retailer agent v_1j, calculate the total preference value of customer agent v_2i: δ = Σ_{k=1}^{2} (γ_ki × y_kij). The customer agent v_2i chooses the retailer agent with the maximum δ.
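The three steps above can be sketched for one customer as follows. The raw scores are illustrative, and it is assumed here that each raw score is oriented so that larger is better for the customer (e.g. closeness rather than distance):

```python
import numpy as np

# Rows: preference k (k = 1 distance-related, k = 2 price-related);
# columns: the N retailers. Values are illustrative.
s = np.array([[3.0, 1.0, 2.0, 4.0],
              [6.5, 7.5, 7.0, 6.0]])
gamma = np.array([0.7, 0.3])        # preference proportions, sum to 1

# Step 2: min-max normalization, so y_kj lies in [0, 1].
mins = s.min(axis=1, keepdims=True)
maxs = s.max(axis=1, keepdims=True)
y = (s - mins) / (maxs - mins)

# Step 3: total preference delta_j = sum_k gamma_k * y_kj, then argmax.
delta = gamma @ y
chosen = int(np.argmax(delta))      # index of the chosen retailer
```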

Customer category    Distance       Price
Customer             Sensitive      Insensitive
Customer             Insensitive    Sensitive
Customer             Equal          Equal
Table : Customer categories.

Q-learning algorithm for dynamic pricing policy
The dynamic pricing process of retailer agent v_1j is a Markov decision process (MDP) with transition function T_v1j: x^t_v1j is the state of agent v_1j at period t, a^t_v1j is the action of agent v_1j in period t, and x^{t+1}_v1j is the state of agent v_1j at period t + 1. X_v1j is the set of states x^t_v1j, and (X_v1j, A_v1j, T_v1j, re_v1j) is an MDP, where re_v1j is the mapping function from X_v1j × A_v1j to R that defines the reward gained by v_1j after each state transition. The action policy π is a mapping from states to distributions over actions, π : X_v1j → Π(A_v1j), where Π(A_v1j) is a probability distribution over actions. The problem is then to find π based on re_v1j. Figure shows the interaction of agent and environment in a Q-learning algorithm. Figure : The interaction of agent and environment in Q-learning.

Q-learning reward function
The Q-learning algorithm can be viewed as a sampled, asynchronous method for estimating the optimal Q-function of the MDP (X_v1j, A_v1j, T_v1j, R_v1j). The reward r^t_v1j at period t is determined by customer demand, product price and the retailer's total cost. All customer demand can be described by the set D^t = {D^t_v21, D^t_v22, . . . , D^t_v2M}.
The customer decision matrix is E^t_v1j. Then, the reward function can be calculated by: The Q-function Q(x_v1j, a_v1j) defines the expected sum of the discounted reward attained by executing action a_v1j (a_v1j ∈ A_v1j) in state x_v1j (x_v1j ∈ X_v1j). The Q-function is updated using the agent's experience. The learning process is described below.
Agents observe the current state and select an action from A_v1j. A Boltzmann soft-max distribution is adopted to select the actions. For each price state, A_v1j = {a^0_v1j, a^1_v1j, a^2_v1j}, where a^0_v1j means a price decrease, a^1_v1j means no operation on the price, and a^2_v1j means a price increase. The probability calculation of an action is similar to that described by Li et al. ( ).
e_u: the tendency of agents to explore unknown actions. A_u: the set of unexplored actions from the current state. A_w: the set of actions explored (at least once) from the current state.
The probability P(a^m_v1j | x_v1j) can be calculated as follows: if one of the actions belongs to A_w and the others belong to A_u, then ; if only one action belongs to A_u, then . After selecting an action, the agent observes the state at period t + 1 and receives a reward from the system. The corresponding Q value for state x_v1j and action a_v1j is updated according to the following formula: where the learning rate (between 0 and 1) is the weight of the new information used in updating Q.
The discount rate η (0 ≤ η < 1) represents the importance of the value of future states in assessing the current state. At each simulation period, retailer agents update their Q-tables using the Q-learning algorithm and learn the optimal cognitive map of how actions influence goals.
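A minimal sketch of the selection-and-update cycle: a Boltzmann soft-max over the Q-values of the three price actions, followed by the standard Q-learning update Q(x, a) ← Q(x, a) + lr·(r + η·max_a' Q(x', a') − Q(x, a)). The temperature tau is an assumption, the paper's exploration bonus for unexplored actions (e_u, A_u, A_w) is simplified away, and lr and η are illustrative values:

```python
import math

ACTIONS = ("decrease", "keep", "increase")
LR, ETA, TAU = 0.5, 0.4, 1.0   # learning rate, discount rate, temperature

def boltzmann_probs(q_row):
    # Soft-max over Q-values: higher-valued actions are chosen more often.
    weights = [math.exp(q / TAU) for q in q_row]
    total = sum(weights)
    return [w / total for w in weights]

def q_update(Q, x, a, reward, x_next):
    # Standard Q-learning backup toward reward + discounted best next value.
    best_next = max(Q[x_next].values())
    Q[x][a] += LR * (reward + ETA * best_next - Q[x][a])

# One illustrative transition between two price states.
Q = {"low": {a: 0.0 for a in ACTIONS}, "high": {a: 0.0 for a in ACTIONS}}
probs = boltzmann_probs(list(Q["low"].values()))  # uniform while Q is zero
q_update(Q, "low", "increase", reward=2.0, x_next="high")
```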
In order to prevent premature convergence, a random parameter is introduced to determine the step size of price decreases or increases.
Here x_v1j,lower and x_v1j,upper denote the lower and upper price limits of retailer agent v_1j. Equation is a piecewise function, and its meaning can be described as follows: if a retailer agent with the Q-learning algorithm chooses the action to decrease its price, the decrease step size is a random value in [0, (x_v1j − x_v1j,lower)]; if it chooses the action to increase its price, the increase step size is a random value in [0, (x_v1j,upper − x_v1j)]; in any other situation, the step size is zero.
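The piecewise step-size rule can be sketched directly; the price limits below are illustrative, not the paper's grape-market values:

```python
import random

X_LOWER, X_UPPER = 5.0, 10.0   # price limits (assumed values)

def price_step(action, x):
    if action == "decrease":
        return -random.uniform(0.0, x - X_LOWER)   # step in [0, x - x_lower]
    if action == "increase":
        return random.uniform(0.0, X_UPPER - x)    # step in [0, x_upper - x]
    return 0.0                                     # otherwise: no change

new_price = 7.0 + price_step("increase", 7.0)      # stays within the limits
```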
In this model, retailer agents adjust their price values based on market judgments. Figure shows the basic graphical model of this paper; dotted arrows indicate that an action is chosen depending on the state. Figure : Basic graphical model of this paper.
A complete learning process occurs from the initial state to the terminal state. This is considered one cycle. The Q-learning process is shown in Table . The Q-learning algorithm contains two steps: action determination based on the current Q-value and evaluation of a new action via the reward function. This cycle continues until the Q-value converges.
Initialize Q(x_v1j, a_v1j) arbitrarily and π to the policy to be evaluated
Repeat (for each cycle):
    Initialize x_v1j
    Repeat (for each step of the cycle):
        Choose a_v1j from x_v1j using policy π derived from Q
        Take action a_v1j
        Update Q(x_v1j, a_v1j) as above
    Until x_v1j is terminal
Table : Q-learning algorithm.

Other pricing policies
In real-market situations, there are many pricing policies. Three of these pricing policies are considered (Chang et al. ) and compared with the Q-learning algorithm proposed in this article.
Cost-plus pricing strategy. A product's price is set based on unit cost (ordering cost, inventory cost and other factors) plus a degree of profit. Once the price is calculated, it remains constant throughout the entire sales cycle. x(t) is calculated by Equation , where λ is the target profit coefficient. Value-based pricing strategy. This pricing policy is suited to customers who value high product quality. The value of a perishable product tends to decrease during the sales cycle, so the price becomes lower as time passes. The price calculation formula is shown in Equations and , where m and β are the freshness impact factors, θ is the basic price for retailers and T is the theoretical sales cycle.
x_v12(t) = m · e^(−βt) + θ (β > 0, θ > 0, 0 < t < T). Inventory-sensitive pricing strategy. A larger inventory will typically cause lower prices, so that the retailer may sell products more quickly and reduce outdating. The price can be calculated by Equation , where I_cu is the current inventory, x_v13 is the basic price for retailers, x_min and x_max are the lower and upper price limits respectively, and f_in is the inventory-price coefficient. Standard inventory I_st is equal to
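The three traditional strategies can be sketched together. The value-based formula is the one given above; the cost-plus markup form and the linear inventory-price adjustment are plausible readings of the text (the exact equations were not reproduced), and all numeric parameters are illustrative, not the paper's calibrated settings:

```python
import math

X_MIN, X_MAX = 5.0, 10.0   # price limits (assumed values)

def cost_plus_price(unit_cost, lam):
    # Fixed for the whole sales cycle: unit cost plus target profit lam.
    return unit_cost * (1.0 + lam)

def value_based_price(t, m=4.0, beta=0.1, theta=4.0):
    # x(t) = m * exp(-beta * t) + theta: the price falls as freshness decays.
    return m * math.exp(-beta * t) + theta

def inventory_sensitive_price(base, current_inv, standard_inv, f_in=0.05):
    # Larger inventory pushes the price down, clipped to [X_MIN, X_MAX].
    raw = base - f_in * (current_inv - standard_inv)
    return min(X_MAX, max(X_MIN, raw))
```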

Runs of the Model
A virtual market is established on the NetLogo platform. Different categories of retailers and customers are distinguished by different colors. The general model parameters agree with the conditions of the Nanjing Jinxianghe (China) vegetable market. Market information was obtained from http://nw.nanjing.gov.cn/ (Chang et al. ) (shown in the Appendix). This experiment takes the selling of grapes as an example. The price of grapes in the market is between and Yuan. To compute a reasonable price for the retailer agents, x_v1j,lower is set to Yuan and x_v1j,upper is set to Yuan, with x_min = x_v1j,lower and x_max = x_v1j,upper. There are four retailer agents and hundreds of customer agents. Customer demand is assumed to be normally distributed according to D_v2i ∼ N(3, 1). To balance freshness and ordering frequency, the threshold quantity for restocking perishable products is set to kg. Other parameters are set based on previous research (Chang et al. ; Li et al. ).
Four scenarios are run for this dynamic-price competitive market to quantify the performance of the dynamic pricing strategies. The four experiments are (i) sensitivity experiments on the learning rate and discount rate, (ii) sensitivity experiments on customer demand, (iii) sensitivity experiments on customer preferences, and (iv) sensitivity experiments on retailer pricing behavior. In these scenarios, a retailer agent with the Q-learning algorithm chooses the optimal action to maximize the sum of its fitness. The results of these experiments are shown in Sections - . The parametric settings for the experiments are presented in Table ; the values of several parameters refer to Chang et al. ( ). These parameters are sufficient to verify the effectiveness of the simulation model, and other markets can utilize the model as well. The NetLogo interface of the competitive market is shown in Figure . In the simulation, retailer agent v_11 runs the cost-plus pricing strategy, retailer v_12 runs the value-based pricing strategy, retailer v_13 runs the inventory-sensitive pricing strategy, and retailer v_14 runs the Q-learning algorithm. The quantity of each kind of customer agent is changed according to the different situations. Figure : NetLogo interface of the competitive market.

Simulation results
.
Take an arbitrary set of parameters, e.g., x_v11 = 7.5, m = 4, θ = 4, x_v13 = 6.5, a learning rate of 0.5 and η = 0.4, based on market information from Nanjing, China. These parameters keep the average prices of the retailer agents at the same level, which eliminates any advantage a strategy could gain purely from a more favorable initial price. The positions of the customer agents are set randomly. The quantities of customer , customer and customer are assumed to be . In this situation, the effectiveness of the Q-learning algorithm is verified by comparing the profits that the retailer agents gain. Figures show that at the start of the simulation f^t_v1j < 0. Then, the profit of the four retailer agents increases with the increase in trading volume, but the fluctuation of the marginal revenue is severe. The Q-learning algorithm performs better after steps, and the price war tends to stabilize after steps.

Sensitivity experiments
In this section, we estimate the optimal pricing strategy for retailers under various market conditions. The economic behavior of customer agents and retailer agents is observed. The experiments simulate four scenarios, and the sensitivity results are shown in Sections - .

Sensitivity experiments on the learning and discount rates
This section conducts two sets of sensitivity experiments with respect to the learning rate and the discount rate η. The parameters are set as Table shows. The quantities of customer , customer and customer are assumed to be . Other parameters are set as x_v1j = 7.5, m = 4, θ = 4, x_v13 = 6.5. Wide parameter ranges are chosen for the sensitivity analysis. This section changes one parameter at a time, keeping the rest at the values shown in Table . If the learning rate is the variable, η is set to 0.4; if η is the variable, the learning rate is set to 0.5. For each parameter, the experiment is simulated over periods. Figure shows the average profit of the four retailer agents with different discount and learning rates. The results suggest that the profits of retailer agents are sensitive to η (see Figure (a)) and to the learning rate (see Figure (b)). The dynamic pricing strategy using the Q-learning algorithm can help retailers maintain a competitive edge except when η = 0.9. The second-best pricing strategy is the inventory-sensitive pricing strategy. The fixed pricing strategy and the value-based pricing strategy result in a lack of competitive advantage in this kind of market. Moreover, increasing the discount rate of the Q-learning algorithm reduces the intensity of competition in the market (the total profit of retailer agents increases), while the learning rate has the opposite effect. This is an interesting phenomenon, and it means that the retailer agent with a learning mechanism can lead market competition.

Table : The ranges of the learning rate and discount rate (parameter, minimum, maximum, sensitivity step). Figure : The average profit of the four retailer agents with different learning or discount rates.

Sensitivity experiments on market demand
This section conducts a sensitivity analysis of the influence of market demand, to observe how retailer agents with different dynamic pricing strategies share market returns. Market demand is mainly affected by the number of customers. In this scenario, the parameters are set as x_v1j = 7.5, m = 4, θ = 4, x_v13 = 6.5, a learning rate of 0.5, and η = 0.4. The variables are the quantities of the different customer categories. For easy comparison, the seven occasions in Table are considered. Figure shows the average profit of the retailer agents and describes the competition results clearly. The retailer agent with Q-learning maintains a competitive advantage, while the other dynamic pricing strategies only work well in certain situations. For example, the inventory-sensitive pricing strategy wins the competition in occasion but behaves poorly in occasion . Interestingly, the total profit of all retailer agents increases with customer demand, except in occasion , in which the underpricing behaviors of retailer agents v_12 and v_13 disrupted the market.

Sensitivity experiments on customer preference
This section conducts a sensitivity analysis of customers' preferences to observe how customer behavior affects the profit of retailer agents, as in Tables and . The parameters are set as x_v1j = 7.5, m = 4, θ = 4, x_v13 = 6.5, a learning rate of 0.5, and η = 0.4, and the number of each kind of customer is set at . First, the preference of the first kind of customer is considered (occasion ). The variable is the customer's preference degree for price and distance. Table shows the sensitivity results, i.e., the average profit of the four retailer agents in this situation. The results reveal that the dynamic pricing strategy with the Q-learning algorithm performs well no matter how much customers' preferences change. Interestingly, the price insensitivity of the first kind of customer can reduce competition among retailer agents: retailer agents share more market revenue in this occasion. Then, the preference of the second kind of customer is considered (occasion ). The sensitivity results are shown in Table . In this situation, the dynamic pricing strategy proposed in this paper has a clear competitive edge as well. However, there is no obvious relationship between customer preference and market revenue. Table : The average profit of the four retailer agents in occasion .

Sensitivity experiments on retailer price
This section conducts a sensitivity analysis of retailer prices to observe how retailers' pricing behaviors affect the profit of the retailer agents. The number of customers is set to . There are four scenarios, and the sensitivity results are shown in Figure . The variables in the four scenarios are, respectively, the fixed price of retailer agent v11, the freshness impact factor m of retailer agent v12, and the basic price x_v13 and the inventory-sensitive coefficient f_in of retailer agent v13. The paper changes one parameter at a time, keeping the rest at the values shown in Table . Figure  shows the average profit of the retailer agents in scenario . The profit of each retailer agent is sensitive to the prices of the others. In scenarios  and , the dynamic pricing strategy with the Q-learning algorithm is superior to the other pricing strategies. However, the competitive edge of this pricing strategy decreases in scenarios  and . The sensitivity results also reveal that a high-price strategy does not always bring benefits to retailers in a perishable product market. Profits in this type of market depend on customer preference and the pricing strategies of the other retailers.

Figure : The average profit of the retailer agents in the different scenarios.
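The three traditional strategies that the Q-learning agent competes against (fixed price for v11, freshness-based pricing for v12, inventory-sensitive pricing for v13) can be summarized as simple decision rules. The functional forms below are illustrative assumptions for a minimal sketch, not the paper's exact formulas; only the parameter names (base price, freshness impact factor m, inventory-sensitive coefficient f_in) are taken from the experiments above.

```python
def fixed_price(base_price):
    """Retailer v11 (sketch): a constant price over the whole sales horizon."""
    return base_price

def freshness_price(base_price, m, remaining_life, shelf_life):
    """Retailer v12 (assumed form): price declines as freshness declines,
    with m the freshness impact factor varied in the sensitivity tests."""
    return base_price * (remaining_life / shelf_life) ** (1.0 / m)

def inventory_price(base_price, f_in, inventory, capacity):
    """Retailer v13 (assumed form): price is marked down as stock piles up,
    with f_in the inventory-sensitive coefficient."""
    return base_price * (1.0 - f_in * inventory / capacity)
```

Under these rules, varying one parameter at a time (as in the four scenarios) shifts only that retailer's price trajectory, which is what makes the profit of each agent sensitive to the others' pricing parameters.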

Conclusions
Retailers with perishable products struggle in a competitive market environment. This paper researches the pricing strategies of retailers with perishable products, taking customer preferences into account. First, it demonstrates how agent-based modeling helps to inform pricing strategies in a competitive environment. The trading behaviors between retailers and customers are simulated in the model. Traditional retailers in the market adopt pricing strategies based on product freshness, inventory, cost and other factors. Customers in the market have different preferences regarding price and distance from certain retailers. Second, this paper presents a dynamic pricing model, using the Q-learning algorithm, to solve pricing problems in a virtual competitive market. The algorithm allows retailer agents to adjust their prices based on observations of their own experience. The optimal pricing strategy is measured by the final profit in different situations.
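The Q-learning price-adjustment loop summarized above can be sketched as follows. The state encoding (remaining shelf life, inventory level), the candidate price set, and the epsilon-greedy exploration rule are illustrative assumptions for this sketch, not the paper's exact formulation; alpha and gamma are the learning rate and discount rate examined in the sensitivity experiments.

```python
import random

# Hypothetical sketch of a retailer agent's Q-learning price update.
# A state is a (remaining shelf life, inventory level) pair; an action
# is one of a small set of discrete candidate prices.

ACTIONS = [5.0, 6.0, 7.0, 8.0]   # candidate prices (illustrative values)
alpha, gamma, epsilon = 0.5, 0.4, 0.1

Q = {}  # maps (state, action) -> estimated long-run revenue

def choose_price(state):
    """Epsilon-greedy selection over the candidate prices."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q.get((state, a), 0.0))

def update(state, price, reward, next_state):
    """Standard Q-learning update after observing one sales period:
    the reward is the period's realized revenue."""
    best_next = max(Q.get((next_state, a), 0.0) for a in ACTIONS)
    old = Q.get((state, price), 0.0)
    Q[(state, price)] = old + alpha * (reward + gamma * best_next - old)
```

Each sales period the agent posts `choose_price(state)`, observes its revenue, and calls `update`, so the price policy improves from experience without a model of the competitors, consistent with the model-free setting described above.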
The proposed simulation model was applied to the Nanjing Jinxianghe (China) vegetable market. Sensitivity experiments are presented to observe how certain factors, such as the learning rate and discount rate of Q-learning, customer demand, customer preferences and the basic prices of retailers, affect the pricing strategy. The optimal strategy is calculated under different market conditions. According to our findings, the Q-learning algorithm can be used as an effective dynamic pricing approach under most competitive conditions. However, this strategy is not always optimal in every market. If a retailer adjusts its basic price parameter, the competitive environment changes, and so does the optimal pricing strategy. For example, an inventory-sensitive pricing strategy may produce higher profits when retailer agent v11 adjusts its fixed price or retailer agent v12 adjusts its freshness impact factor.
The proposed simulation should be suitable for a wide range of market conditions and may be implemented by adjusting the parameters in the model. It is beneficial to research into the pricing problem of perishable products, and our work can be extended in several directions. For example, retailers in a market may have various perishable products to sell, and these products may influence each other. In future research, a more complex learning mechanism should be constructed to improve the retailer agents' learning ability. These are the key issues for future research on dynamic pricing strategies for perishable products.

Appendix: The market information from Nanjing, China
The price information for grapes in Nanjing, China is shown in Figures  and .