Model-Based Reinforcement Learning

Model-Based Reinforcement Learning: From AlphaZero to AlphaEvolve

by Ethan Kolasky

This article is the second in a series adapted from talks on reinforcement learning that I gave to the engineering team at Alaris Security. The first talk covered policy gradient methods and how they can be applied to language models. This one explores early breakthroughs in reinforcement learning using model-based methods and how those techniques remain surprisingly powerful and applicable to today's problems.

I'll walk through three key papers—AlphaZero, MuZero, and AlphaEvolve—to show how model-based RL can produce superhuman performance on constrained problems and how it can be combined in powerful ways with representation models. I'll then explore how areas that seem to require human-level reasoning, such as engineering and theoretical mathematics, can be tackled by these techniques.


Fundamental Concepts

Reinforcement learning is the discipline of developing AI systems that can improve based on their performance in realistic environments.

Main Challenges

While this definition is seemingly very simple, in practice it is difficult to implement. Unlike other fields in AI, such as supervised learning or generative modeling, reinforcement learning doesn't train on a fixed set of ground-truth data. Rather, the data itself is generated by the model being trained. This leads to two main challenges in reinforcement learning:

  1. Assigning credit:
    Take a chess-playing agent as an example. Even if the agent wins the game, it can be hard to know which moves led to the victory. Do you train on all the moves? If so, you're likely training on some bad moves that actually reduced the agent's chance of winning.
  2. Generating training data:
    During the early stages of training, the model will perform poorly, which leads to training data that consists mostly of bad examples. Even if this challenge is overcome, during the later stages of training you may encounter the opposite problem: the agent might start "exploiting" a somewhat successful strategy at the expense of "exploring" and discovering even better strategies.

One way to overcome these challenges is to use a model of the environment. This technique, model-based RL, is the focus of this article. The model exploits the known structure of the environment, which makes it easier to assign credit, generate training data, and perform explicit planning.

What is planning in reinforcement learning?

Planning is the process of searching through possible actions and future states of the environment to arrive at the optimal action for a given scenario.

The papers discussed here (with the exception of AlphaEvolve) use Monte Carlo Tree Search (MCTS) for planning. MCTS is a planning algorithm that builds a search tree over many possible futures: it runs a loop of selecting an action, predicting the possible next states, and, from each next state, selecting a new action. By exploring enough branches of the tree, it can estimate the best action to take from a given state.

How to Represent the Model?

Model-based RL requires, as the name suggests, a model of the environment. The Markov Decision Process (MDP) is the commonly used mathematical formulation for these models. It consists of the following elements:

  • States S: the set of all possible situations the agent can encounter
  • Actions A: the set of all possible actions the agent can take
  • Transition function P(s' | s, a): the probability of reaching state s' after taking action a in state s
  • Reward function R(s, a, s'): the immediate reward received when transitioning from state s to s' via action a

The agent interacts with the environment by generating trajectories: sequences of states, actions, and rewards (s_0, a_0, r_1, s_1, a_1, r_2, …). The goal of the agent is to maximize the return, the sum of rewards over the trajectory with a discount factor γ ∈ [0, 1) applied to longer-term rewards:

    G_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … = Σ_{k≥0} γ^k r_{t+k+1}

A policy π(a | s) defines the agent's behavior by specifying which action a to take in a given state s. In RL, "policy" and "model" are often used somewhat interchangeably, since the model's job is to produce a policy for a given state.
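
As a tiny illustration of the return, here is a sketch in Python; the rewards and discount factor are made-up example values, not taken from any of the papers:

    # Sketch: computing the discounted return of one trajectory.
    # The rewards and the discount factor below are made-up example values.
    def discounted_return(rewards, gamma=0.99):
        """Sum of rewards, where rewards further in the future count for less."""
        return sum((gamma ** t) * r for t, r in enumerate(rewards))

    rewards = [0.0, 0.0, 1.0, 0.0, 5.0]   # rewards collected along one episode
    print(discounted_return(rewards))     # ≈ 5.78: later rewards are discounted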


AlphaZero

Many people in AI already know about AlphaZero; if that's you, feel free to skip ahead to the MuZero or AlphaEvolve sections for more recent work.

Monte Carlo Search with Heuristics

Monte Carlo search converges to optimal solutions given infinite time. However, most interesting environments are "intractable"—the state space is far too large to search exhaustively.

Humans overcome this through heuristics that prune the search space. Consider chess: a human player doesn't evaluate every legal move. Instead, they apply learned principles like controlling the center, developing pieces early, protecting the king through castling, avoiding moving the same piece multiple times in the opening, and creating threats that force opponent responses. These heuristics dramatically reduce the branching factor during search.

Humans also use heuristics to limit search depth. Rather than thinking through every move until checkmate, they search until reaching a position they can evaluate. The point values assigned to pieces (pawn=1, knight=3, rook=5, queen=9) exemplify this. Pieces don't actually have points—only checkmate matters—but these values provide heuristics for evaluating positions without searching to the game's end.
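
As a toy example of such a depth-limiting heuristic, a material-count evaluation might look like the following sketch; real engines use far richer evaluation features, and the piece lists here are illustrative:

    # Sketch: a crude "material count" heuristic for evaluating a chess position
    # without searching all the way to checkmate. Conventional point values per piece.
    PIECE_VALUES = {"pawn": 1, "knight": 3, "bishop": 3, "rook": 5, "queen": 9}

    def material_score(my_pieces, opponent_pieces):
        """Positive if we are ahead in material, negative if we are behind."""
        mine = sum(PIECE_VALUES[p] for p in my_pieces)
        theirs = sum(PIECE_VALUES[p] for p in opponent_pieces)
        return mine - theirs

    # Example: we have a knight where the opponent only has a pawn.
    print(material_score(["queen", "rook", "knight"], ["queen", "rook", "pawn"]))  # 2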

How AlphaZero Works

AlphaZero trains a neural network to provide both heuristics: a policy to narrow the set of moves worth exploring, and a value function to evaluate positions without searching to the end.

The model is a convolutional neural network—the same architecture used for image recognition—applied to board positions.

AlphaZero uses MCTS with this learned model for both training and inference.
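
To make the architecture concrete, here is a heavily simplified sketch of a policy-value network in PyTorch. The plane count, layer sizes, and move count are placeholders, and the real network is a much deeper residual CNN:

    import torch.nn as nn

    class PolicyValueNet(nn.Module):
        """Simplified sketch: one convolutional trunk, a policy head, and a value head."""
        def __init__(self, board_planes=17, board_size=8, num_moves=4672):
            super().__init__()
            self.trunk = nn.Sequential(
                nn.Conv2d(board_planes, 64, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
                nn.Flatten(),
            )
            flat = 64 * board_size * board_size
            self.policy_head = nn.Linear(flat, num_moves)                    # logits over moves
            self.value_head = nn.Sequential(nn.Linear(flat, 1), nn.Tanh())   # value in [-1, 1]

        def forward(self, board):
            h = self.trunk(board)
            return self.policy_head(h), self.value_head(h)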

Planning

AlphaZero's planning expands the search tree by adding nodes (board states) and propagating value estimates upward. Each iteration follows four steps:

  1. Selection: Choose an action at the current node using a formula that balances exploitation (choosing high-value moves) with exploration (trying uncertain moves)
  2. Expansion: Add the resulting state as a new node in the tree
  3. Evaluation: Use the policy-value network to evaluate the new node, obtaining both a value estimate and a probability distribution over next moves
  4. Backpropagation: Propagate the value up the tree, updating estimates for all ancestor nodes

Figure: Monte Carlo Tree Search. MCTS builds a search tree by iteratively selecting, expanding, evaluating, and backpropagating through nodes.

The selection step uses the UCB (Upper Confidence Bound) formula, which balances the value estimate with an exploration bonus based on the policy prior and visit counts.
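
Here is a sketch of one such simulation, with the PUCT-style selection rule spelled out. The names evaluate_fn (standing in for the policy-value network) and step_fn (standing in for the game rules) are my own, and the details are simplified relative to the paper:

    import math

    class Node:
        """One board state in the search tree."""
        def __init__(self, prior):
            self.prior = prior        # P(s, a): the policy network's prior for this move
            self.visits = 0           # N(s, a)
            self.value_sum = 0.0      # W(s, a)
            self.children = {}        # action -> Node

        def q(self):
            return self.value_sum / self.visits if self.visits else 0.0

    def select_child(node, c_puct=1.5):
        """Selection: value estimate plus an exploration bonus from prior and visit counts."""
        total_visits = sum(child.visits for child in node.children.values())
        def score(child):
            return child.q() + c_puct * child.prior * math.sqrt(total_visits + 1) / (1 + child.visits)
        return max(node.children.items(), key=lambda item: score(item[1]))

    def simulate(root, state, evaluate_fn, step_fn):
        """One MCTS simulation: select, expand, evaluate, backpropagate."""
        node, path = root, [root]
        while node.children:                      # 1. Selection down the tree
            action, node = select_child(node)
            state = step_fn(state, action)
            path.append(node)
        priors, value = evaluate_fn(state)        # 2./3. Expand the leaf and evaluate it
        for action, p in priors.items():          #       with the policy-value network
            node.children[action] = Node(prior=p)
        for n in reversed(path):                  # 4. Backpropagation
            n.visits += 1
            n.value_sum += value
            value = -value                        # flip perspective between the two players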

Training

AlphaZero uses self-play to generate training data. The model plays against itself using MCTS, then trains on the resulting games. The loss function includes three components:

  • Value loss: measures the difference between the network's value prediction and the actual game outcome
  • Policy loss: measures the difference between the network's policy and the improved policy found through MCTS (MCTS acts as a "policy improvement operator"—the search process produces a stronger policy than the network alone)
  • Regularization: prevents overfitting by penalizing large weights
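
In code, the combined objective looks roughly like this sketch (PyTorch; in practice the L2 term is usually applied as weight decay in the optimizer rather than written out explicitly):

    import torch.nn.functional as F

    def alphazero_loss(value_pred, policy_logits, game_outcome, mcts_policy):
        """Sketch of the AlphaZero training loss for a batch of positions.
        game_outcome: +1 / 0 / -1 from the finished self-play games.
        mcts_policy:  visit-count distribution produced by the search."""
        value_loss = F.mse_loss(value_pred.squeeze(-1), game_outcome)
        policy_loss = -(mcts_policy * F.log_softmax(policy_logits, dim=-1)).sum(dim=-1).mean()
        return value_loss + policy_loss   # plus L2 regularization via optimizer weight_decay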

Game Theory and Monte Carlo Search

A natural question: can an opponent exploit the value estimates produced by MCTS?

Consider naive Monte Carlo search using random rollouts. The value estimate is the average outcome across random trajectories. An opponent could exploit this by choosing moves where most random continuations are terrible but one specific continuation wins. The average looks bad, but the opponent can force the winning line.

AlphaZero avoids this pitfall because both players use MCTS during self-play. The value estimates assume both players play optimally according to the current model. This creates a form of Nash equilibrium—neither player can improve by unilaterally changing strategy. As the model improves through training, this equilibrium converges toward true optimal play.


MuZero

AlphaZero operates in two-player games with finite states and actions, deterministic transitions, and perfect information. The real world doesn't follow these rules. States are high-dimensional and partially observed. Transitions are stochastic. The state space may not even have a natural representation.

How can we extend AlphaZero's planning approach to robotics, language, and other real-world domains?

How MuZero Works

MuZero enables model-based RL in high-dimensional environments by learning a compact representation of the environment, then planning in that representation space.

Rather than learning a model of raw observations (which might be high-resolution images), MuZero learns a latent state space that captures only decision-relevant information. This mirrors human planning—we don't mentally simulate every photon and molecule, but instead think in terms of abstract concepts.

For example, when planning a chess move, we don't visualize the exact wood grain of the pieces or the precise RGB values of the board. We think abstractly: "If I move my knight here, it attacks the queen and controls the center."

MuZero uses three learned functions:

  • Representation function: encodes observation history into a latent state
  • Dynamics function: predicts the next latent state and reward
  • Prediction function: predicts policy and value from a latent state

Figure: MuZero Architecture. MuZero learns a latent dynamics model, enabling planning in high-dimensional environments without explicit knowledge of the rules.

Planning proceeds as follows:

  1. Encode current observations into a latent state using the representation function
  2. Perform MCTS in latent space:
    • At each node, use the prediction function to get policy and value
    • Select actions using the same UCB formula as AlphaZero, with the predicted policy serving as the prior
    • Simulate the action using the dynamics function to get the next latent state
    • Backpropagate values up the tree
  3. Execute the action with highest visit count in the real environment
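
The sketch below shows how the three functions fit together. The function bodies are stubs standing in for learned networks, the names are mine rather than from the paper, and the tree search itself is the same MCTS loop as in the AlphaZero section, just operating on latent states:

    # Schematic sketch of MuZero's three learned functions (stubs, not real networks).
    def representation(observation_history):
        """h: encode raw observations into an initial latent state."""
        ...

    def dynamics(latent_state, action):
        """g: predict (next_latent_state, reward) without knowing the real rules."""
        ...

    def prediction(latent_state):
        """f: predict (policy, value) for a latent state."""
        ...

    def imagined_rollout(root_latent, choose_action, depth=5):
        """Walk one imagined path entirely in latent space."""
        latent, trajectory = root_latent, []
        for _ in range(depth):
            policy, value = prediction(latent)         # heuristics at this node
            action = choose_action(policy)             # e.g. PUCT selection as before
            latent, reward = dynamics(latent, action)  # step the learned model
            trajectory.append((action, reward, value))
        return trajectory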

Training

MuZero trains all three functions jointly using self-play. For each position encountered during play, it unrolls the dynamics model for several steps and trains the predicted rewards, values, and policies at each step to match what actually happened (and the policies produced by MCTS).
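
Building on the stubbed functions from the planning sketch above, the unrolled objective looks schematically like this; the loss terms and target bookkeeping are simplified stand-ins, not the exact formulation from the paper:

    def policy_loss(predicted_policy, mcts_policy):
        """Cross-entropy between the predicted policy and the MCTS visit distribution (stub)."""
        ...

    def muzero_loss(observation_history, actions, targets, unroll_steps=5):
        """Sketch: unroll the learned model and sum prediction errors at each step.
        targets[k] = (observed_reward, value_target, mcts_policy) for unroll step k."""
        latent = representation(observation_history)
        total = 0.0
        for k in range(unroll_steps):
            policy, value = prediction(latent)             # f: heuristics at step k
            latent, reward = dynamics(latent, actions[k])  # g: imagine the next step
            observed_reward, value_target, mcts_policy = targets[k]
            total += (reward - observed_reward) ** 2       # reward prediction error
            total += (value - value_target) ** 2           # value prediction error
            total += policy_loss(policy, mcts_policy)      # match the search policy
        return total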

Planning with World Models

MuZero demonstrates a powerful insight: learned representations of the world become far more capable when combined with explicit planning techniques. The model doesn't just predict what might happen—it actively searches through possibilities to find better solutions.

This insight is especially relevant in the current era of foundational models. Large language models for text and vision-language-action models for robotics already learn rich representations of their domains. Yet current agentic AI systems have a significant limitation: they don't perform explicit planning. They generate responses or actions directly from their learned representations without systematically exploring alternatives.

MuZero points toward a more powerful paradigm: pairing foundational models with the kind of search and planning techniques that made AlphaZero successful. By combining learned world models with explicit planning, we can build AI systems that don't just react based on learned patterns, but actively reason about consequences and search for better solutions.


AlphaEvolve

Model-based RL and planning techniques have proven surprisingly applicable beyond games. MCTS has been successfully applied to robotics, chip design, and even language model reasoning.

Engineering represents another domain with exploitable structure. While it seems uniquely human, engineering follows patterns that models can learn and leverage.

The Structure of Engineering

Engineering proceeds through phases of refinement, solving problems at progressively finer levels of detail.

Consider aerospace engineering. Vehicle design begins with conceptual design: a small team determines high-level parameters like dimensions, airfoil selection, and wingspan. Once these are established, the project moves to detailed design with a larger team. High-level specifications become detailed CAD models, revealing numerous sub-problems to solve.

Each phase is essentially a search over design parameters, guided by engineering intuition to avoid exploring obviously poor designs.

Connecting the Dots

This parallels AlphaZero's approach: combining intuition (the policy network) with systematic search (MCTS) to navigate large solution spaces.

AlphaEvolve applies this principle to engineering by using LLMs as the "intuition" component. Crucially, it recognizes that LLMs alone aren't intelligent enough to solve sophisticated problems directly. However, LLMs combine moderate intelligence with extreme speed. While a genius human might intuit the right answer or narrow the search space effectively, an LLM can explore a wider space more quickly by generating many candidates rapidly.

The Verifiability Requirement

This approach only works for problems with verifiable solutions. The system must be able to check whether a proposed solution is correct or measure its quality. This includes mathematical proofs (verifiable by proof checkers), code optimization (verifiable by testing correctness and measuring performance), and engineering designs (verifiable by simulation or physical testing).

Without verifiability, there's no way to distinguish good solutions from plausible-sounding nonsense.

How AlphaEvolve Works

AlphaEvolve combines LLM generation with evolutionary search. The system maintains a population of candidate solutions and iteratively improves them:

Figure: AlphaEvolve. AlphaEvolve uses LLMs as guided mutators within an evolutionary search framework to solve verifiable problems.

  1. Initialization: The LLM generates an initial population of diverse solutions based on the problem description
  2. Evaluation: Each solution is tested/verified to measure its quality
  3. Selection: High-performing solutions are selected as parents
  4. Mutation: The LLM generates variations of parent solutions, using the problem description and parent code as context
  5. Iteration: Steps 2–4 repeat, with the population gradually improving
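
A minimal sketch of such a loop follows; llm_generate, llm_mutate, and evaluate are hypothetical stand-ins for an LLM call and a problem-specific verifier or scorer, not the AlphaEvolve API:

    import random

    # Sketch of an LLM-guided evolutionary loop. The callables passed in are
    # hypothetical stand-ins; this is not the AlphaEvolve implementation.
    def evolve(problem_description, llm_generate, llm_mutate, evaluate,
               population_size=20, generations=100):
        # 1. Initialization: ask the LLM for a diverse starting population.
        population = [llm_generate(problem_description) for _ in range(population_size)]
        for _ in range(generations):
            # 2. Evaluation: score every candidate with the verifier.
            scored = [(evaluate(candidate), candidate) for candidate in population]
            scored.sort(key=lambda pair: pair[0], reverse=True)
            # 3. Selection: keep the best candidates as parents.
            parents = [candidate for _, candidate in scored[:population_size // 4]]
            # 4. Mutation: the LLM proposes targeted variations of the parents.
            children = [llm_mutate(problem_description, random.choice(parents))
                        for _ in range(population_size - len(parents))]
            # 5. Iteration: the next generation mixes parents and children.
            population = parents + children
        return max(population, key=evaluate)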

The LLM serves two roles: as a generator creating novel solutions and variations, and as a guided mutator making targeted improvements based on what has worked before.

This differs from pure evolutionary algorithms (which use random mutations) and pure LLM generation (which lacks iterative refinement). The combination leverages the LLM's understanding of code and mathematical structure while using evolutionary search to overcome the LLM's limitations in complex reasoning.

AlphaEvolve has achieved state-of-the-art results in GPU kernel optimization (generating CUDA kernels that outperform hand-written code), data center optimization (improving cooling efficiency and power distribution), and theoretical mathematics (discovering novel constructions in combinatorics and graph theory).


Conclusion

The progression from AlphaZero to MuZero to AlphaEvolve reveals a consistent theme: combining learned intuition with systematic search produces results neither approach achieves alone.

AlphaZero showed that neural networks can learn the heuristics needed for effective planning in perfect-information games. MuZero extended this by learning compact representations of complex environments, enabling planning in domains where the raw state space is intractable. AlphaEvolve demonstrated that these principles transfer to open-ended engineering and mathematics problems, using LLMs as the intuition component and evolutionary search as the exploration mechanism.

The key insight across all three systems is the same: moderate intelligence combined with systematic search often outperforms raw intelligence alone. This suggests a promising direction for AI development—not relying solely on ever-larger models, but combining learned representations with principled search and planning techniques.

For problems with verifiable solutions, this paradigm offers a path to superhuman performance. The challenge ahead lies in extending these methods to domains where verification is difficult or impossible, and in scaling these approaches to tackle increasingly complex real-world problems.