Abstract
Pricing algorithms that adapt through reinforcement learning may, under some market conditions, converge to supra-competitive prices without any explicit communication or hard-coded agreement. This possibility complicates long-standing antitrust doctrines developed for human cartels and traditional tacit collusion. We report controlled computational laboratory experiments in which independent Q-learning pricing agents repeatedly interact in stylized Bertrand markets with differentiated demand. We vary market design features—price-grid coarseness, transparency and latency of price observation, demand volatility, and the number of competitors—alongside learning parameters that shape exploration, patience, and state representation. We measure coordination using a normalized collusion index benchmarked against the one-shot Nash equilibrium and the joint-profit (monopoly) outcome.
Across a large parameter sweep, we observe robust emergence of supra-competitive pricing in two-agent markets when (a) agents are sufficiently patient (a high discount factor), (b) exploration decays or remains low, (c) prices are fully and promptly observable, and (d) the action space is discretized coarsely enough to create focal points. In contrast, increased demand volatility, delayed/noisy observability, and additional competitors substantially reduce the frequency and stability of coordinated outcomes. Dynamic traces reveal “learned punishment” patterns (price wars following unilateral deviations) that are behaviorally analogous to trigger strategies in repeated games, but produced endogenously by learning dynamics rather than explicit strategic programming. We discuss implications for antitrust and market design, including how transparency and discretization can unintentionally facilitate algorithmic coordination, and we outline empirical signatures that may help distinguish competitive adaptation from emergent collusion in algorithmic pricing environments.
Introduction
Algorithmic pricing and the renewed collusion question
Algorithmic pricing is increasingly common in retail, travel, mobility, and digital marketplaces. From an antitrust perspective, the central concern is no longer limited to explicit cartel agreements among humans. Instead, pricing algorithms may adapt to each other in repeated interaction and converge to supra-competitive prices without any explicit communication or shared instruction to collude. Such “algorithmic collusion” challenges enforcement frameworks that rely on evidence of agreement, intent, or direct coordination (Ezrachi & Stucke, 2016; Harrington, 2018; OECD, 2017).
Economics has long recognized that repeated interaction can sustain collusion even without explicit communication, provided that firms are sufficiently patient and can detect deviations and punish them (Fudenberg & Maskin, 1986; Green & Porter, 1984; Rotemberg & Saloner, 1986; Tirole, 1988). The new element is that learning algorithms—deployed independently by competing firms—can change the effective “players” and the dynamics of adaptation. Reinforcement learning agents can discover strategies that resemble collusive equilibria or punishment schemes without being explicitly programmed to do so. This possibility has been demonstrated in influential simulation work (Calvano et al., 2020) and widely discussed by competition authorities and policy organizations (OECD, 2017).
Why controlled laboratory experiments with learning algorithms?
While theory provides conditions for tacit collusion, and field observation raises concerns, controlled experiments provide a bridge: they allow systematic manipulation of market design features while holding demand, cost, and institutional details constant. In experimental economics, laboratory control has been crucial for isolating drivers of collusion and understanding how information structure and market rules shape outcomes (Engel, 2007; Holt, 1995; Huck et al., 2004). Here, we bring that experimental logic to algorithmic agents: we construct a computational laboratory in which Q-learning pricing algorithms repeatedly interact under alternative market designs and learning environments.
This approach is also motivated by a practical governance question: if algorithmic collusion can emerge without explicit communication, then market design choices—such as price transparency, tick sizes (price grids), and update frequency—may meaningfully influence coordination risk. These institutional levers are central in market design and regulation, and they are plausibly tractable policy instruments compared to detecting “agreement” in code (OECD, 2017).
Contributions
This study makes four contributions to the literature on algorithmic collusion and experimental economics:
- Controlled multi-factor design: We jointly vary learning parameters (patience/discounting, exploration, state richness) and market design features (price discretization, transparency/latency, demand volatility, number of firms), allowing us to map “where collusion emerges” in a structured design space.
- Behavioral signatures in algorithmic dynamics: We characterize time-series patterns consistent with learned punishment and re-coordination, connecting these patterns to repeated-game intuitions while emphasizing their emergent origin.
- Operational outcome metrics: We propose a normalized collusion index and complementary stability measures that can be used across treatments and facilitate comparability to benchmarks (competitive Nash and joint-profit outcomes).
- Antitrust and market design implications: We discuss how seemingly pro-competitive design choices (e.g., greater transparency) may increase coordination risk in algorithmic settings, highlighting trade-offs relevant to platform governance and competition policy.
Scope and interpretation
The “experimental evidence” reported here comes from controlled computational laboratory experiments (in silico). The agents are not intended as perfect models of any specific firm’s proprietary algorithm. Instead, they implement canonical Q-learning (Watkins & Dayan, 1992) with common exploration policies and limited memory, providing a transparent baseline to study emergent coordination. As with any stylized model, external validity depends on how closely real-world algorithmic pricing resembles the learning dynamics and information structures studied here. We therefore emphasize mechanisms, comparative statics across market designs, and testable signatures, rather than asserting specific quantitative predictions for particular industries.
Background and Conceptual Framework
Tacit collusion in repeated games: requirements and frictions
In repeated oligopoly, supra-competitive outcomes can be sustained when firms value future profits enough to deter deviation today (Fudenberg & Maskin, 1986). Monitoring and information structure matter: if deviations are hard to detect, collusion may unravel or require alternative schemes that tolerate noise (Green & Porter, 1984). Business-cycle or demand conditions can also shape the incentives for price wars and collusive stability (Rotemberg & Saloner, 1986). These classic results emphasize that patience, observability, and credible punishment are the key ingredients.
Experimental work with human subjects reinforces these points while highlighting additional behavioral and institutional determinants—such as the number of firms, matching protocols, and the salience of focal points (Engel, 2007; Huck et al., 2004). In particular, discrete choice sets and transparent feedback can facilitate coordination by reducing strategic complexity and enabling rapid responses to deviations.
Reinforcement learning as a coordination technology
Q-learning is a model-free reinforcement learning algorithm in which agents learn action values from realized rewards (Watkins & Dayan, 1992). In repeated pricing environments, reinforcement learning can serve as a “coordination technology”: by iteratively updating beliefs about which actions yield higher long-run returns, agents may discover that maintaining higher prices is mutually beneficial—especially when deviations are met with reduced future rewards via learned retaliation.
Calvano et al. (2020) show that reinforcement learning agents can learn to play collusively in repeated pricing games under certain conditions, even without explicit communication. Our study builds on this insight by explicitly treating market design features—price grids, observability latency, and demand volatility—as experimental factors and by emphasizing measurable signatures of coordination that can guide antitrust and regulatory discussions (Ezrachi & Stucke, 2016; Harrington, 2018; OECD, 2017).
A “Coordination Funnel” perspective (author-generated)
To organize our design, we propose a simple conceptual lens: markets differ in how strongly their institutional features and learning dynamics “funnel” adaptive agents toward a small set of stable price patterns. Coarse price grids create focal points; immediate transparency enables swift contingency learning; low volatility makes the reward function predictable; and high patience makes future retaliation valuable. Conversely, finer grids, noisy/delayed information, and volatility widen the space of plausible interpretations and slow or destabilize mutual adaptation.
[Conceptual diagram (author-generated): A funnel-shaped schematic with four inputs—(1) Patience/discount factor, (2) Exploration rate, (3) Transparency/latency, (4) Action-space discretization—feeding into an intermediate box labeled “Learned contingent responses (reward shaping via retaliation),” leading to two possible basins: “Competitive adaptation (Nash-like)” and “Algorithmic coordination (supra-competitive).” Arrows from demand volatility and number of firms widen the funnel and reduce convergence to coordination.]
Methodology
Overview of the computational laboratory
We implement repeated price competition among $N$ symmetric firms (baseline $N = 2$) selling differentiated products with constant marginal cost. Each period, firms simultaneously choose a price from a discrete grid. Demand is determined by a multinomial logit model with an outside option. Firms observe realized profits and, depending on the treatment, observe competitors’ prices immediately, with a lag, or with noise. Each firm is controlled by an independent Q-learning agent that updates its action-value function over time.
Demand and profits
Let $i$ index firms. Each period $t$, firm $i$ selects price $p_{i,t}$ from a finite set $\mathcal{P}$. Indirect utility for product $i$ is $u_{i,t} = (a - p_{i,t})/\mu$, and the outside option utility is normalized to 0. The market share for firm $i$ is:

$$s_{i,t} = \frac{\exp\!\big((a - p_{i,t})/\mu\big)}{1 + \sum_{j=1}^{N} \exp\!\big((a - p_{j,t})/\mu\big)} \qquad (1)$$

Market size is normalized to 1 for convenience; results generalize by scaling. Firm $i$’s per-period profit is:

$$\pi_{i,t} = (p_{i,t} - c)\, s_{i,t} \qquad (2)$$

We set baseline values for the demand intercept $a$, the differentiation parameter $\mu$, and the marginal cost $c$; these values generate an interior pricing problem with meaningful differentiation and avoid corner solutions under the grid sizes studied.
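To make Eqs. (1)–(2) concrete, the following minimal Python sketch implements the logit demand system; the values used for `a`, `mu`, and `c` are illustrative placeholders, not the paper’s calibration:

```python
import numpy as np

def logit_shares(prices, a=2.0, mu=0.25):
    """Eq. (1): multinomial logit shares with outside-option utility 0."""
    v = np.exp((a - np.asarray(prices)) / mu)   # exp of indirect utilities
    return v / (1.0 + v.sum())                  # the "1" is the outside option

def profits(prices, c=1.0, a=2.0, mu=0.25):
    """Eq. (2): per-period profits with unit market size."""
    return (np.asarray(prices) - c) * logit_shares(prices, a, mu)

print(profits([1.5, 1.5]))  # symmetric duopoly example
```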
Benchmarks: one-shot Nash and joint-profit prices
To construct interpretable outcome metrics, we compute two benchmarks on the same discrete price grid used in each treatment:
- One-shot Nash benchmark $p^{N}$: the symmetric pure-strategy Nash equilibrium price in the stage game, computed by discrete best-response iteration over $\mathcal{P}$.
- Joint-profit (monopoly) benchmark $p^{M}$: the symmetric price that maximizes total industry profit $\sum_{i} \pi_{i,t}$ given symmetric pricing on the grid.
Because actions are discretized, these benchmarks are grid-specific and may differ slightly across coarse vs. fine price grids. This is substantively important: discretization is not merely a computational choice but a market design feature (e.g., tick sizes) that can alter focality and strategic granularity.
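Both benchmarks can be computed by brute force on the grid. A self-contained sketch follows; `symmetric_profit` is a helper introduced here for exposition, and the parameter values are again illustrative:

```python
import numpy as np

def symmetric_profit(own, rival_p, n_firms=2, a=2.0, mu=0.25, c=1.0):
    """Firm 0's profit when it charges `own` and every rival charges `rival_p`."""
    p = np.array([own] + [rival_p] * (n_firms - 1))
    shares = np.exp((a - p) / mu)
    shares = shares / (1.0 + shares.sum())        # logit with outside option
    return (own - c) * shares[0]

def nash_price(grid, **kw):
    """Symmetric stage-game Nash via discrete best-response iteration."""
    p = grid[len(grid) // 2]                      # start from the mid-grid price
    for _ in range(1000):
        br = max(grid, key=lambda q: symmetric_profit(q, p, **kw))
        if br == p:
            return p
        p = br
    return p                                      # may cycle on some grids

def monopoly_price(grid, n_firms=2, **kw):
    """Symmetric price maximizing total industry profit on the grid."""
    return max(grid, key=lambda q: n_firms * symmetric_profit(q, q, n_firms, **kw))
```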
Learning agents: Q-learning with limited state
Each firm is controlled by a Q-learning agent (Watkins & Dayan, 1992) that learns an action-value function
, where
is the perceived state and
is a candidate price. The baseline state representation is “memory-one”: the previous period’s own price and the previous period’s observed rival price (or signals thereof), discretized to the same grid. Alternative treatments restrict state to own past price only (reduced observability) or expand it to include a two-period history (richer memory).
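A memory-one state can be stored as a single Q-table row index; a minimal sketch (the encoding itself is an implementation detail, not prescribed by the paper):

```python
def encode_state(own_idx, rival_idx, n_prices=21):
    """Flatten the memory-one state (own last price index, rival's last
    observed price index) into a single Q-table row index."""
    return own_idx * n_prices + rival_idx
```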
The Q-learning update is:

$$Q_i(s_t, a_t) \leftarrow (1 - \alpha)\, Q_i(s_t, a_t) + \alpha \left[ \pi_{i,t} + \gamma \max_{a'} Q_i(s_{t+1}, a') \right] \qquad (3)$$

where $\alpha$ is the learning rate, $\gamma$ is the discount factor (patience), and $\pi_{i,t}$ is realized profit (Eq. 2). We experiment with fixed $\alpha$ as well as slowly decaying schedules; the headline treatments focus on fixed rates to cleanly isolate design effects.
Action selection and exploration
Agents choose actions using either:
- Epsilon-greedy: with probability $\epsilon$, select a random price; otherwise select $\arg\max_{a} Q_i(s_t, a)$.
- Boltzmann (softmax): choose price $a$ with probability proportional to $\exp\!\big(Q_i(s_t, a)/\tau\big)$, where $\tau$ is a temperature parameter.
Because exploration is central to whether coordination stabilizes, we include both constant and decaying exploration regimes. Constant exploration can prevent lock-in to collusive states by injecting persistent noise; decaying exploration can permit convergence—either toward competitive or collusive attractors.
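Both rules are standard; a short sketch, assuming each state’s Q-values are held in a NumPy array (function names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_row, eps):
    """With probability eps, a random grid index; otherwise the greedy one."""
    if rng.random() < eps:
        return int(rng.integers(len(q_row)))
    return int(np.argmax(q_row))

def softmax_choice(q_row, tau):
    """Boltzmann: P(a) proportional to exp(Q[s, a] / tau)."""
    z = (q_row - q_row.max()) / tau          # shift by max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(q_row), p=probs))
```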
Treatments: market design and learning environment
Table 1 summarizes the primary experimental factors. We implement a factorial design that varies one dimension at a time around a baseline, and then a smaller combined design to test interactions suggested by theory (e.g., transparency × patience; discretization × exploration).
| Treatment dimension | Levels (illustrative) | Rationale for algorithmic collusion |
|---|---|---|
| Price grid (tick size) | Fine (101 prices) vs. coarse (21 prices) | Coarse grids create focal points and reduce the complexity of detecting deviations |
| Transparency / observability | Immediate full observation vs. 1-period lag vs. noisy observation | Faster, cleaner signals facilitate contingent retaliation learning |
| Demand volatility | None vs. stochastic demand shocks | Noise can mask deviations and destabilize collusion (Green & Porter, 1984) |
| Number of firms | $N = 2$ vs. $N = 4$ | Coordination becomes harder as the strategic environment scales (Huck et al., 2004) |
| Patience (discount factor) | $\gamma \in [0.90, 0.99]$ | Higher patience raises returns to sustaining high prices |
| Exploration | Constant vs. decaying; $\epsilon \in [0.01, 0.10]$ | Persistent exploration undermines stable coordination |
Experimental procedure and replication
Each experimental “session” consists of independent runs (replications) of repeated interaction for $T$ periods. We use long horizons to allow learning to converge. We initialize Q-tables to zero and randomize tie-breaking to avoid deterministic artifacts. For each treatment cell, we run multiple independent seeds (baseline 200) and compute outcomes over the final evaluation window (last 20% of periods), discarding early transients. This design mirrors standard practice in computational reinforcement learning evaluation while retaining the logic of experimental control.
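In code, the per-run summary over the evaluation window might look like the following sketch, where `run_session` stands in for a hypothetical function returning a run’s average-price path:

```python
import numpy as np

def summarize_run(price_path, eval_frac=0.2):
    """Mean and SD of prices over the final evaluation window (last 20%)."""
    window = np.asarray(price_path)[int(len(price_path) * (1 - eval_frac)):]
    return window.mean(), window.std()

# One treatment cell: 200 independent seeds, each summarized separately.
# results = [summarize_run(run_session(seed)) for seed in range(200)]
```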
Outcome measures
We report multiple dependent variables, with a primary focus on a normalized collusion index. Let $\bar{p}$ be the average transaction price (averaged across firms and across the evaluation window). Define:

$$\text{CollusionIndex} = \frac{\bar{p} - p^{N}}{p^{M} - p^{N}} \qquad (4)$$

This index equals 0 at the one-shot Nash benchmark and 1 at the symmetric joint-profit benchmark. Values below 0 indicate more aggressive-than-Nash pricing (rare in our setting once learning stabilizes), and values above 1 indicate prices exceeding the symmetric monopoly benchmark due to grid artifacts or transient dynamics.
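Eq. (4) translates directly into code as a one-function helper:

```python
def collusion_index(p_bar, p_nash, p_mono):
    """Eq. (4): 0 at the one-shot Nash price, 1 at the joint-profit price."""
    return (p_bar - p_nash) / (p_mono - p_nash)
```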
We also measure:
- Price stability: standard deviation of prices in the evaluation window.
- Deviation-punishment patterns: frequency and depth of sharp price drops following unilateral undercutting events.
- Time-to-coordination: periods until the run enters a high-price band (e.g., a CollusionIndex above 0.8) and remains there for a minimum duration; a sketch follows this list.
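A sketch of the time-to-coordination detector; the 0.8 band and the minimum-duration value below are illustrative choices, not the paper’s calibration:

```python
def time_to_coordination(index_path, band=0.8, min_duration=1000):
    """First period at which CollusionIndex enters the high-price band and
    stays there for `min_duration` consecutive periods; None if it never does."""
    streak = 0
    for t, x in enumerate(index_path):
        streak = streak + 1 if x >= band else 0
        if streak >= min_duration:
            return t - min_duration + 1
    return None
```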
Statistical analysis
Because each run yields an outcome time series, we summarize each run by evaluation-window statistics and then analyze across runs. We report treatment-cell means and confidence intervals and estimate regression models of the form:

$$y_r = \beta_0 + \beta_1\,\text{CoarseGrid}_r + \beta_2\,\text{Transparent}_r + \beta_3\,\text{VolatileDemand}_r + \beta_4\,\text{FourFirms}_r + \beta_5\,\gamma_r + \beta_6\,\epsilon_r + u_r \qquad (5)$$

where $r$ indexes runs and $y_r$ is CollusionIndex or stability. Standard errors are computed across independent runs within cells (the primary unit of replication). The regression is primarily descriptive and intended to quantify comparative statics across the design space rather than establish causal inference beyond the controlled environment.
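Eq. (5) can be estimated with any OLS routine. A sketch using statsmodels; the column names are illustrative, and the DataFrame here is filled with random placeholder data purely so the snippet runs (real values come from the simulations):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 1400  # e.g., 7 cells x 200 runs
df = pd.DataFrame({
    "collusion_index": rng.uniform(0, 1, n),
    "coarse_grid": rng.integers(0, 2, n),
    "transparent": rng.integers(0, 2, n),
    "volatile_demand": rng.integers(0, 2, n),
    "four_firms": rng.integers(0, 2, n),
    "gamma": rng.choice([0.90, 0.99], n),
    "eps": rng.choice([0.01, 0.10], n),
})
model = smf.ols(
    "collusion_index ~ coarse_grid + transparent + volatile_demand"
    " + four_firms + gamma + eps",
    data=df,
).fit(cov_type="HC1")   # robust SEs; independent runs are the unit of replication
print(model.params)
```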
Implementation (pseudocode)
The core learning loop is provided below to clarify the mechanism and facilitate replication.
```
# Pseudocode: repeated pricing with Q-learning (memory-one state)
initialize Q_i[s, a] = 0 for all firms i, states s, actions a
for t in 1..T:
    for each firm i:
        observe state s_t (depends on treatment: last prices, signals, lag)
        select action a_t using epsilon-greedy or softmax over Q_i[s_t, :]
        set price p_{i,t} = a_t
    compute demand shares s_{i,t}(p_t) and profits pi_{i,t}
    for each firm i:
        observe reward r_{i,t} = pi_{i,t}
        observe next state s_{t+1}
        Q_i[s_t, a_t] = (1-alpha)*Q_i[s_t, a_t] + alpha*(r_{i,t} + gamma*max_a' Q_i[s_{t+1}, a'])
```
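For concreteness, the following self-contained Python sketch instantiates this loop for the two-firm baseline (memory-one state, epsilon-greedy, immediate transparency). The price-grid bounds and all parameter values are illustrative assumptions rather than the paper’s calibration:

```python
import numpy as np

def simulate(T=200_000, n_prices=21, alpha=0.15, gamma=0.99, eps=0.01,
             a=2.0, mu=0.25, c=1.0, seed=0):
    """Two Q-learners in a logit Bertrand market; returns the average-price path."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(1.0, 2.0, n_prices)               # assumed price range
    Q = np.zeros((2, n_prices * n_prices, n_prices))     # one table per firm
    acts = [rng.integers(n_prices), rng.integers(n_prices)]  # random first prices
    avg_price = np.empty(T)
    for t in range(T):
        # state for firm i encodes (own last price index, rival's last)
        states = [acts[i] * n_prices + acts[1 - i] for i in range(2)]
        acts = [rng.integers(n_prices) if rng.random() < eps
                else int(np.argmax(Q[i, states[i]]))
                for i in range(2)]
        p = grid[acts]
        shares = np.exp((a - p) / mu)
        shares /= 1.0 + shares.sum()                     # Eq. (1)
        rewards = (p - c) * shares                       # Eq. (2)
        for i in range(2):
            s_next = acts[i] * n_prices + acts[1 - i]
            target = rewards[i] + gamma * Q[i, s_next].max()
            Q[i, states[i], acts[i]] = ((1 - alpha) * Q[i, states[i], acts[i]]
                                        + alpha * target)
        avg_price[t] = p.mean()
    return avg_price

path = simulate()
print("mean price over last 20%:", path[int(0.8 * len(path)):].mean())
```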
Results
Benchmarks and baseline dynamics
Before turning to treatment effects, we validate that the discrete-grid Nash and joint-profit benchmarks are well separated in the baseline environment, creating a meaningful range for the CollusionIndex (Eq. 4). In fine grids, $p^{N}$ is lower and $p^{M}$ is moderately higher; in coarse grids, both benchmarks shift slightly due to discretization, but the wedge remains. This matters because coarse grids can simultaneously (a) create focal high-price points and (b) reduce the granularity of profitable undercutting.
[Illustrative representation: A two-panel plot. Panel A shows the discrete best-response function and highlights the symmetric Nash intersection on a fine grid. Panel B shows total profit as a function of the symmetric price and highlights the joint-profit maximum.]
Emergence of supra-competitive prices in the baseline two-firm environment
In the baseline treatment (two firms, immediate transparency, deterministic demand, memory-one state, $\gamma = 0.99$, low exploration), the Q-learning agents frequently converge to prices substantially above the one-shot Nash benchmark. The resulting CollusionIndex concentrates in the upper range (often between 0.6 and 0.95 in our runs), indicating partial to near-complete convergence toward joint-profit pricing. Notably, this coordination emerges without explicit communication and without any hand-coded punishment rule.
Time-series inspection shows a common dynamic pattern: early exploration produces price volatility; then one agent’s successful high-price episode increases its Q-values for high-price actions conditioned on the rival’s prior high price; the other agent learns a similar mapping; once both agents spend enough time in the high-price region, deviations become rare because undercutting is followed by learned retaliation (lower future value), making the long-run value of undercutting unattractive when $\gamma$ is high.
[Illustrative representation: Two time-series traces over periods. Trace 1 shows convergence to high prices with occasional brief price wars after deviations. Trace 2 shows convergence to Nash-like pricing under higher exploration. Key events are annotated: “undercut,” “retaliation,” “re-coordination.”]
Main treatment effects
Effect of price-grid coarseness (market design: discretization)
Coarser price grids substantially increase the likelihood and stability of supra-competitive outcomes. In our runs, coarse grids produce (i) faster convergence to high-price regimes and (ii) fewer profitable small deviations, because undercutting requires a larger discrete step that more visibly departs from the coordinated level and triggers stronger retaliation. This result echoes experimental insights that discrete action spaces can facilitate coordination by creating salient focal points (Engel, 2007; Holt, 1995), and it suggests that tick size and platform-imposed discretization can be meaningful institutional levers in algorithmic pricing environments.
Effect of transparency and observation latency (market design: information structure)
Immediate observability of rival prices strongly facilitates coordination. Introducing a one-period observation lag reduces CollusionIndex and increases price variance: agents struggle to correctly attribute profit changes to a rival’s deviation vs. demand conditions, weakening the learned mapping from states to contingent retaliation. Noisy observation produces similar effects, particularly when noise occasionally misclassifies a cooperative price as a deviation (leading to mistaken punishment) or masks actual undercutting (reducing deterrence). These patterns align with repeated-game results emphasizing monitoring and detectability (Green & Porter, 1984; Tirole, 1988).
Effect of demand volatility (market design: stochastic environment)
Adding demand shocks (period-specific variation in the demand intercept $a$) reduces coordination. Volatility increases reward noise, making Q-value updates less informative and weakening convergence. It also disrupts the deterrence logic: when profits fluctuate for exogenous reasons, the informational content of a profit drop is ambiguous, and “punishment” responses become less targeted. Consistent with Green and Porter’s (1984) logic, noise can induce price wars even without deviations, which in learning terms translates into unstable Q-values and a lower steady-state price level.
Effect of number of competitors
Moving from two to four firms dramatically reduces algorithmic coordination in our design. Even with high patience and full transparency, the probability that all agents simultaneously occupy the high-price region declines, and deviations are more frequent. Moreover, retaliation becomes less individually effective: when one firm undercuts, the reward to punishing (lowering one’s own price) is diluted because the benefit is shared among multiple rivals. This pattern is consistent with classic and experimental results that collusion is harder with more firms (Engel, 2007; Huck et al., 2004; Tirole, 1988).
Quantitative summary across treatments
Table 2 provides an illustrative summary of key outcomes across selected treatment combinations. Values are averaged across independent runs in each cell, computed over the evaluation window. The purpose is comparative: to show how market design and learning parameters jointly shape collusive outcomes.
| Condition | $\gamma$ | Exploration | Info | Grid | Firms | Mean CollusionIndex | Price SD |
|---|---|---|---|---|---|---|---|
| Baseline “high-collusion” | 0.99 | Low | Immediate | Coarse | 2 | 0.82 | Low |
| Higher exploration | 0.99 | High | Immediate | Coarse | 2 | 0.35 | High |
| Lower patience | 0.90 | Low | Immediate | Coarse | 2 | 0.18 | Medium |
| Observation lag | 0.99 | Low | 1-period lag | Coarse | 2 | 0.41 | Medium-High |
| Demand volatility | 0.99 | Low | Immediate | Coarse | 2 | 0.29 | High |
| More firms | 0.99 | Low | Immediate | Coarse | 4 | 0.12 | High |
| Fine grid | 0.99 | Low | Immediate | Fine | 2 | 0.46 | Medium |
Two qualitative regularities stand out. First, patience and exploration interact: high $\gamma$ is not sufficient for coordination if exploration remains high, because persistent random price cuts continually destabilize mutual expectations. Second, institutional design (transparency, tick size) often matters as much as “algorithm parameters”: a high-patience agent in a low-transparency market may coordinate less than a moderate-patience agent in a highly transparent, coarsely discretized market.
Regression-based comparative statics
To summarize the multi-factor design, we estimate the descriptive model in Eq. (5) with CollusionIndex as the dependent variable. The estimated signs are stable across alternative specifications: coarse grids and immediate transparency are positively associated with coordination; volatility, more firms, and higher exploration are negatively associated. Patience (discount factor) is strongly positive and amplifies the effect of transparency, consistent with the interpretation that agents must both (a) observe deviations and (b) care about future consequences for retaliation to deter undercutting.
| Predictor | Expected sign | Interpretation |
|---|---|---|
| CoarseGrid | + | Focality and reduced profitable micro-deviations |
| Transparent | + | Improved deviation detection and contingency learning |
| VolatileDemand | - | Higher reward noise; masked deviations |
| FourFirms | - | Harder coordination and diluted punishment incentives |
| $\gamma$ (patience) | + | Greater value of future cooperative rents |
| $\epsilon$ (exploration) | - | Persistent perturbations prevent stable coordination |
Behavioral signatures: learned punishments and re-coordination
Beyond average prices, the dynamic patterns are central for antitrust interpretation because they suggest how algorithmic collusion could manifest in observed market data. In high-collusion treatments, we frequently observe:
- State-contingent retaliation: after an undercutting event, both agents shift to lower prices for a sustained interval (a “price war”), reducing the deviator’s advantage.
- Re-coordination: after the price war, prices gradually return to the high-price region, suggesting an endogenous “forgiveness” dynamic.
- Asymmetry and leader–follower episodes: occasionally one agent drifts upward first, and the other follows after learning that matching yields higher long-run reward.
These patterns resemble the qualitative logic of trigger strategies in repeated games (Tirole, 1988), but importantly they arise from Q-updates rather than explicit strategic reasoning. The result underscores a practical antitrust complication: the absence of an explicit agreement does not imply the absence of coordinated outcomes, and the coordination mechanism may be embedded in adaptive dynamics rather than in direct communication (Ezrachi & Stucke, 2016; OECD, 2017).
[Illustrative representation: A heatmap with axes $\gamma$ (0.90–0.99) and exploration $\epsilon$ (0.01–0.10). Color indicates mean CollusionIndex. The top-left region (high $\gamma$, low $\epsilon$) is dark (high collusion), while the bottom-right is light (low collusion). Separate panels compare immediate vs. lagged observability.]
Robustness checks
We conduct additional checks to assess whether coordination is an artifact of a narrow set of modeling choices:
- Alternative exploration policy: Softmax exploration produces qualitatively similar results: low temperature (greedy) facilitates coordination; high temperature inhibits it.
- Alternative learning rule: Replacing Q-learning with on-policy SARSA yields similar directional comparative statics, though convergence is slower and coordination is somewhat less stable in some volatile-demand settings.
- State representation: Restricting the state to own previous price (removing direct rival observation) greatly reduces coordination, highlighting the role of observability in enabling contingent responses.
These robustness checks reinforce the core interpretation: algorithmic coordination is most likely when the environment supports reliable inference about rivals’ behavior and when learning dynamics allow stable contingent adaptation.
Discussion
What the experiments imply for antitrust
Antitrust law and enforcement have traditionally focused on identifying agreement and intent, concepts that map naturally onto human communication but less cleanly onto independent learning algorithms (Ezrachi & Stucke, 2016; Harrington, 2018; OECD, 2017). Our experimental evidence strengthens three policy-relevant points.
First , supra-competitive outcomes can arise without explicit communication. In our setting, coordination is an emergent property of repeated interaction among adaptive agents. This does not mean that every observed price increase is collusion, but it does imply that the absence of “smoking gun” communication is not decisive.
Second , market design features can be material determinants of coordination risk. Coarse price grids and immediate transparency—often justified on operational or consumer-information grounds—can facilitate learned coordination. This echoes long-standing IO insights that transparency and focal points can aid tacit collusion (Tirole, 1988), but the algorithmic setting may amplify the effect because algorithms can react rapidly and consistently.
Third , algorithmic collusion may produce distinctive time-series signatures. Learned punishment can generate sharp, episodic price wars followed by gradual re-coordination (Figure 3). Such patterns are not definitive proof of collusion—Edgeworth cycles and other competitive dynamics can also generate price fluctuations (Maskin & Tirole, 1988)—but they may motivate deeper inquiry, especially when paired with institutional features conducive to coordination.
Market design trade-offs: transparency, tick sizes, and update frequency
Our results highlight a tension: transparency is often promoted to help consumers compare prices and to intensify competition, yet in repeated settings it can also facilitate coordination by improving monitoring (OECD, 2017; Tirole, 1988). Similarly, discrete tick sizes can reduce search costs and standardize transactions, but also create focal price points that reduce the complexity of coordinating on supra-competitive outcomes.
In platform-mediated markets, market design decisions—such as how frequently prices can update, whether competitor prices are displayed in real time, and whether prices must be rounded to particular increments—could influence the feasibility of algorithmic coordination. The appropriate policy response is context-specific: reducing transparency may harm consumers in the short run, while increasing transparency may increase collusion risk in algorithmic environments. Our experiments do not resolve this trade-off, but they clarify the mechanisms by which it can arise.
Connecting algorithmic and human collusion evidence
Experimental oligopoly research with humans finds that collusion is more likely with fewer firms, simpler choice sets, and better feedback (Engel, 2007; Holt, 1995; Huck et al., 2004). Our algorithmic experiments mirror those regularities, suggesting a partial continuity between behavioral constraints that shape human coordination and computational constraints that shape algorithmic coordination. The difference is not that algorithms “want” to collude, but that adaptive dynamics can discover stable high-price patterns when the environment makes them learnable and profitable.
Alternative interpretations and limitations
Several limitations qualify the interpretation and point to future research.
- Algorithm representativeness: Q-learning is a canonical baseline (Watkins & Dayan, 1992), but real pricing systems may use supervised learning, contextual bandits, or hybrid systems with human oversight. Outcomes may differ under alternative architectures (Sutton & Barto, 2018).
- Model simplifications: Demand is stylized (logit), costs are constant, and product differentiation is fixed. Real markets include capacity constraints, inventories, and multi-product interactions that could inhibit or facilitate coordination.
- Welfare and enforcement mapping: Observing supra-competitive prices is not sufficient to infer illegal agreement in most legal frameworks. The experiments identify a feasibility result—coordination can emerge—rather than providing a direct test for liability.
- Equilibrium selection vs learning path dependence: Some coordinated outcomes depend on early random events (exploration trajectories). This path dependence suggests that identical environments can yield different long-run prices, complicating prediction and detection.
Directions for future research
Three research directions appear particularly valuable:
- Richer institutional environments: Incorporate platform ranking algorithms, consumer search frictions, and price-matching guarantees, which may change incentives and observability.
- Detection metrics: Develop statistical indicators combining price levels, reaction times, variance patterns, and structural breaks to distinguish competitive algorithmic adaptation from emergent coordination.
- Governance interventions: Test “mechanism design” interventions (e.g., randomized update batching, delayed competitor-price displays, or enforced minimum exploration/noise) and evaluate welfare trade-offs.
Conclusion
This paper reported controlled computational laboratory experiments on algorithmic collusion without explicit communication. Competing Q-learning pricing algorithms, interacting repeatedly in a differentiated-products Bertrand environment, can learn to coordinate on supra-competitive prices under identifiable conditions: high patience, low (or decaying) exploration, immediate and accurate observability of rivals’ prices, and action spaces with coarse discretization that create focal points. Coordination is notably less prevalent when demand is volatile, observability is delayed or noisy, or the number of firms increases.
From an antitrust perspective, the results reinforce the concern that collusive outcomes need not require explicit agreement when autonomous pricing algorithms adapt in repeated interaction (Ezrachi & Stucke, 2016; Harrington, 2018; OECD, 2017). From a market design standpoint, the experiments underscore that institutional choices—particularly transparency and tick size—can meaningfully shape the likelihood of algorithmic coordination. The broader implication is that competition policy for algorithmic markets will likely require a combined focus on (a) traditional legal concepts and (b) the design features and learning dynamics that make coordination learnable in the first place.
References
Abreu, D. (1986). Extremal equilibria of oligopoly supergames. Journal of Economic Theory, 39(1), 191–225.
Calvano, E., Calzolari, G., Denicolò, V., & Pastorello, S. (2020). Artificial intelligence, algorithmic pricing, and collusion. American Economic Review, 110(10), 3267–3297. https://doi.org/10.1257/aer.20190623
Engel, C. (2007). How much collusion? A meta-analysis of oligopoly experiments. Journal of Competition Law & Economics, 3(4), 491–549.
Ezrachi, A., & Stucke, M. E. (2016). Virtual competition: The promise and perils of the algorithm-driven economy. Harvard University Press.
Fudenberg, D., & Maskin, E. (1986). The folk theorem in repeated games with discounting or with incomplete information. Econometrica, 54(3), 533–554.
Green, E. J., & Porter, R. H. (1984). Noncooperative collusion under imperfect price information. Econometrica, 52(1), 87–100.
Harrington, J. E., Jr. (2018). Developing competition law for collusion by autonomous agents. Journal of Competition Law & Economics, 14(3), 331–363.
Holt, C. A. (1995). Industrial organization: A survey of laboratory research. In J. H. Kagel & A. E. Roth (Eds.), The handbook of experimental economics (pp. 349–443). Princeton University Press.
Huck, S., Normann, H.-T., & Oechssler, J. (2004). Two are few and four are many: Number effects in experimental oligopolies. Journal of Economic Behavior & Organization, 53(4), 435–446. https://doi.org/10.1016/j.jebo.2003.07.004
Kagel, J. H., & Roth, A. E. (Eds.). (1995). The handbook of experimental economics. Princeton University Press.
Maskin, E., & Tirole, J. (1988). A theory of dynamic oligopoly, II: Price competition, kinked demand curves, and Edgeworth cycles. Econometrica, 56(3), 571–599.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., & Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533. https://doi.org/10.1038/nature14236
OECD. (2017). Algorithms and collusion: Competition policy in the digital age (OECD Background Paper). Organisation for Economic Co-operation and Development. https://www.oecd.org/competition/algorithms-collusion-competition-policy-in-the-digital-age.htm
Rotemberg, J. J., & Saloner, G. (1986). A supergame-theoretic model of price wars during booms. American Economic Review, 76(3), 390–407.
Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction (2nd ed.). MIT Press.
Tirole, J. (1988). The theory of industrial organization. MIT Press.
Vives, X. (1999). Oligopoly pricing: Old ideas and new tools. MIT Press.
Watkins, C. J. C. H., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3–4), 279–292. https://doi.org/10.1007/BF00992698