European roulette, with its well-known house edge of approximately 2.7%, is mathematically unwinnable for any player in the long run. Despite this inherent disadvantage, the motivation for building a deep reinforcement learning (RL) agent to play the game lies not in financial gain but in probing the limits of RL techniques in an environment saturated with noise. The project asks what modern deep learning methods can achieve when the optimal strategy is simply to abstain from playing. In effect, the experiment is a stress test for RL algorithms, examining their capacity to find meaningful patterns in what is essentially white noise.
The action space in European roulette is surprisingly rich, comprising 47 discrete choices. These range from 37 straight bets on individual numbers (0-36) offering a high-risk, high-reward payout of 35:1, to various outside bets such as colors, parity, high/low ranges, and dozens, which provide lower risk but also lower payout ratios. An important inclusion is the "PASS" action, which represents the decision not to place a bet and often emerges as the most prudent choice given the house edge. The agent must weigh not only which bet to place but also the implicit risk profile associated with each action.
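To make the layout concrete, here is a minimal sketch of how those 47 actions might be enumerated in Python. The (kind, arg) tuple encoding, the names, and the ordering are illustrative assumptions of this sketch, not the project's actual scheme:

```python
# Illustrative enumeration of the 47-action space described above. The
# (kind, arg) tuple encoding and the ordering are assumptions of this
# sketch, not the project's actual scheme.

STRAIGHT_BETS = [("STRAIGHT", n) for n in range(37)]       # numbers 0-36, pay 35:1
OUTSIDE_BETS = [
    ("RED", None), ("BLACK", None),                        # colors, pay 1:1
    ("EVEN", None), ("ODD", None),                         # parity, pay 1:1
    ("LOW", None), ("HIGH", None),                         # 1-18 / 19-36, pay 1:1
    ("DOZEN", 1), ("DOZEN", 2), ("DOZEN", 3),              # dozens, pay 2:1
]
ACTIONS = STRAIGHT_BETS + OUTSIDE_BETS + [("PASS", None)]  # 37 + 9 + 1 = 47
assert len(ACTIONS) == 47
```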
The state representation integrates two components: a history buffer of the last 20 spins and a gain ratio reflecting the current bankroll relative to the starting amount. Although roulette outcomes are independent and identically distributed, the history buffer gives sequence models room to attempt pattern detection, even though any pattern they find is statistically spurious. Crucially, the gain ratio lets the agent condition its strategy on performance, behaving differently when it is ahead versus when it is significantly behind. This financial context supports a more nuanced policy that adapts to bankroll fluctuations.
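A minimal sketch of how these two components might be assembled, assuming the spins are kept as integer indices (to be consumed by the embedding layer described below) and that short histories are zero-padded; both details are assumptions, since the text specifies only the 20-spin buffer and the gain ratio:

```python
import numpy as np

HISTORY_LEN = 20  # buffer of the last 20 spins

def build_state(spin_history, bankroll, starting_bankroll, pad_value=0):
    """Return the two state components: recent spin indices and the gain ratio.

    Padding short histories with `pad_value` is an assumption of this sketch;
    the spins stay as integers so an embedding layer can consume them.
    """
    recent = list(spin_history)[-HISTORY_LEN:]
    recent = [pad_value] * (HISTORY_LEN - len(recent)) + recent
    spins = np.asarray(recent, dtype=np.int64)
    gain_ratio = np.float32(bankroll / starting_bankroll)
    return spins, gain_ratio
```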
The reward system is deterministic and follows roulette payout rules directly. Positive rewards are sparse and outweighed by negative returns, creating a harsh learning environment. A win on a straight bet pays +35 units, while even-money bets such as red or black yield +1 unit. Losses uniformly incur a -1 penalty, and the PASS action yields zero reward. Notably, every bet carries the same expected value of -1/37, roughly -0.027 units per unit staked, so PASS's zero reward strictly dominates in expectation. This sparse and mostly negative reward landscape is a classic RL challenge, demanding robust exploration and training stability.
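Written out, the reward function follows directly from the payout table. This sketch reuses the hypothetical (kind, arg) action tuples from the enumeration above; the 2:1 dozen payout is the standard one but is an assumption here, since the text specifies only the 35:1 and 1:1 cases:

```python
def reward(action, outcome):
    """Deterministic payout-based reward for one spin, as described above.

    `action` is a hypothetical (kind, arg) pair from the ACTIONS sketch;
    `outcome` is the winning pocket, 0-36.
    """
    kind, arg = action
    if kind == "PASS":
        return 0
    RED = {1, 3, 5, 7, 9, 12, 14, 16, 18, 19, 21, 23, 25, 27, 30, 32, 34, 36}
    won = (
        (kind == "STRAIGHT" and outcome == arg)
        or (kind == "RED" and outcome in RED)
        or (kind == "BLACK" and outcome != 0 and outcome not in RED)
        or (kind == "EVEN" and outcome != 0 and outcome % 2 == 0)
        or (kind == "ODD" and outcome % 2 == 1)
        or (kind == "LOW" and 1 <= outcome <= 18)
        or (kind == "HIGH" and 19 <= outcome <= 36)
        or (kind == "DOZEN" and outcome != 0 and (outcome - 1) // 12 + 1 == arg)
    )
    if not won:
        return -1          # losses uniformly incur a -1 penalty
    if kind == "STRAIGHT":
        return 35          # straight bets pay 35:1
    if kind == "DOZEN":
        return 2           # standard 2:1 dozen payout (assumed)
    return 1               # even-money outside bets
```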
The agent's architecture prioritizes stability and training efficiency. Although Long Short-Term Memory (LSTM) networks initially seemed appropriate given the sequential input, a feed-forward design with batch normalization (BatchNorm) layers proved more effective: BatchNorm stabilizes activations during training, smoothing the gradient landscape and accelerating convergence. The network first embeds each spin result into a 64-dimensional vector to capture latent relationships such as wheel-neighbor proximity, then flattens the embeddings and passes them through BatchNorm-equipped dense layers. A separate sub-network processes the gain ratio, and the two feature sets are concatenated before final dense layers output Q-values for all 47 actions.
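A minimal PyTorch sketch of that architecture follows. Only the 64-dimensional embedding, the BatchNorm-equipped dense layers, the separate gain-ratio branch, and the 47-action output come from the description; the hidden-layer widths are assumptions:

```python
import torch
import torch.nn as nn

class RouletteDQN(nn.Module):
    """Sketch of the described architecture: spin embeddings feed BatchNorm
    dense layers, a separate branch handles the gain ratio, and the merged
    features produce Q-values. Hidden widths (128/32/64) are assumptions."""

    def __init__(self, history_len=20, num_pockets=37, num_actions=47):
        super().__init__()
        self.embed = nn.Embedding(num_pockets, 64)    # 64-dim spin embedding
        self.history_net = nn.Sequential(
            nn.Linear(history_len * 64, 128),
            nn.BatchNorm1d(128),
            nn.ReLU(),
        )
        self.gain_net = nn.Sequential(                # separate gain-ratio branch
            nn.Linear(1, 32),
            nn.ReLU(),
        )
        self.head = nn.Sequential(                    # joint layers -> Q-values
            nn.Linear(128 + 32, 64),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Linear(64, num_actions),
        )

    def forward(self, spins, gain_ratio):
        # spins: (batch, history_len) int64; gain_ratio: (batch, 1) float32
        h = self.embed(spins).flatten(start_dim=1)    # embed, then flatten
        h = self.history_net(h)
        g = self.gain_net(gain_ratio)
        return self.head(torch.cat([h, g], dim=1))
```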
A key design choice is employing Double DQN to mitigate the overestimation bias inherent in standard DQN. By using the online network to select the next action and the target network to evaluate it, Double DQN decouples selection from evaluation and curbs the optimistic Q-value estimates that can mislead the agent into overvaluing losing bets. This mechanism is especially important in roulette, where overestimation can mask the reality that PASS is frequently the optimal action.
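In code, the decoupling is compact: the online network selects the next action and the target network evaluates it. A minimal sketch of the target computation, assuming a standard discount factor of 0.99:

```python
import torch

def double_dqn_targets(online_net, target_net, batch, gamma=0.99):
    """Double DQN target: the online network picks the next action, the
    target network scores it. gamma=0.99 is an assumed discount factor."""
    spins, gains, rewards, next_spins, next_gains, done = batch
    with torch.no_grad():
        # Action selection with the online network...
        next_actions = online_net(next_spins, next_gains).argmax(dim=1, keepdim=True)
        # ...evaluation with the target network.
        next_q = target_net(next_spins, next_gains).gather(1, next_actions).squeeze(1)
    return rewards + gamma * (1.0 - done) * next_q
```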
While LSTM networks may not benefit the main agent due to the lack of temporal dependencies in roulette spins, they remain useful as predictive models within the system. This dual approach leverages different architectures for their strengths: BatchNorm stabilizes Q-learning, while LSTMs attempt sequence prediction in the noisy environment. The exploration of these architectures provides valuable insights into RL training dynamics when faced with environments dominated by randomness and sparse rewards.
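As an illustration of that second role, a next-spin predictor could look like the sketch below. The hidden size is an assumption; over truly i.i.d. spins its accuracy should converge to chance (1/37), which is precisely what makes it a useful probe of the environment's randomness:

```python
import torch.nn as nn

class SpinPredictor(nn.Module):
    """Auxiliary LSTM that tries to predict the next spin from the history
    buffer. Hidden size is an assumption of this sketch; on i.i.d. spins
    its accuracy should settle at chance level (1/37)."""

    def __init__(self, num_pockets=37, embed_dim=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(num_pockets, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, num_pockets)   # logits over the next outcome

    def forward(self, spins):
        h, _ = self.lstm(self.embed(spins))         # spins: (batch, seq) int64
        return self.out(h[:, -1])                   # predict from the last step
```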