
by Ethan Kolasky
This article is the second in a series of talks on reinforcement learning that I gave to the engineering team at Alaris Security. The first talk covered policy gradient methods and how they can be applied to language models. This talk explores early breakthroughs in reinforcement learning using model-based methods and how those techniques remain surprisingly powerful and applicable to today's problems.
I'll walk through three key papers—AlphaZero, MuZero, and AlphaEvolve—to show how model-based RL can produce superhuman performance on constrained problems and how it can be combined in powerful ways with representation models. I'll then explore how areas that seem to require human-level reasoning, such as engineering and theoretical mathematics, can be tackled by these techniques.
Reinforcement learning is the discipline of developing AI systems that can improve based on their performance in realistic environments.
While this definition sounds simple, it is difficult to implement in practice. Unlike other fields in AI, such as supervised learning or generative modeling, reinforcement learning doesn't train on a fixed set of ground-truth data. Rather, the data itself is generated by the model being trained. This leads to two main challenges in reinforcement learning:
1. Credit assignment: when rewards arrive only at the end of a long sequence of actions, it is hard to determine which actions actually earned them.
2. Data generation: the agent must produce its own training data through interaction, and collecting enough useful experience is slow and expensive, especially early in training when the policy is still poor.
One method of overcoming these challenges is to use a model of the environment. This technique is called model-based RL and is the focus of this article. The model captures the known structure of the environment, which makes it easier to assign rewards, generate training data, and perform explicit planning.
What is planning in reinforcement learning?
Planning is the process of searching through possible actions and future states of the environment to arrive at the optimal action for a given scenario.
The papers discussed (with the exception of AlphaEvolve) use Monte Carlo Tree Search (MCTS) for planning. This is a specific planning algorithm that builds a search tree of many possible futures. It runs a loop of selecting an action, predicting possible next states, and for each next state, selecting a new action. By recording enough branches on the tree it can estimate the best action from a given state.
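To make that loop concrete, here is a minimal sketch of the MCTS skeleton in Python. The node structure, the random-rollout evaluation, and the `env` interface (`legal_actions`, `step`, `is_terminal`, `rollout_value`) are illustrative assumptions, not taken from any of the papers discussed.

```python
import math
import random

class Node:
    """One state in the search tree."""
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = {}      # action -> Node
        self.visits = 0
        self.total_value = 0.0

def mcts(root_state, env, num_iterations=1000, c=1.4):
    """Generic MCTS with random rollouts (no learned model)."""
    root = Node(root_state)
    for _ in range(num_iterations):
        node = root
        # 1. Selection: descend the tree with a UCB rule until reaching a leaf.
        while node.children:
            node = max(
                node.children.values(),
                key=lambda n: (n.total_value / (n.visits + 1e-8))
                + c * math.sqrt(math.log(node.visits + 1) / (n.visits + 1e-8)),
            )
        # 2. Expansion: add children for each legal action from the leaf.
        if not env.is_terminal(node.state):
            for action in env.legal_actions(node.state):
                node.children[action] = Node(env.step(node.state, action), parent=node)
            node = random.choice(list(node.children.values()))
        # 3. Evaluation: estimate the leaf's value (here via a random rollout).
        value = env.rollout_value(node.state)
        # 4. Backpropagation: update statistics along the path back to the root.
        while node is not None:
            node.visits += 1
            node.total_value += value
            node = node.parent
    # Act greedily with respect to visit counts at the root.
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]
```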
Model-based RL requires, as the name suggests, a model of the environment. The Markov Decision Process (MDP) is the commonly used mathematical formulation for these models. It consists of the following elements:
- A set of states $\mathcal{S}$ the environment can be in
- A set of actions $\mathcal{A}$ the agent can take
- A transition function $P(s' \mid s, a)$ giving the probability of reaching state $s'$ after taking action $a$ in state $s$
- A reward function $R(s, a)$ specifying the reward received for taking action $a$ in state $s$
- A discount factor $\gamma \in [0, 1]$ that weights future rewards
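As a concrete (and entirely made-up) example, here is a tiny two-state MDP expressed directly as Python data; the state names, action names, and numbers are illustrative only.

```python
from dataclasses import dataclass

@dataclass
class MDP:
    states: list
    actions: list
    transitions: dict   # (state, action) -> {next_state: probability}
    rewards: dict       # (state, action) -> reward
    gamma: float        # discount factor

# A toy two-state MDP: stay put for a small reward, or gamble on moving.
toy = MDP(
    states=["A", "B"],
    actions=["stay", "move"],
    transitions={
        ("A", "stay"): {"A": 1.0},
        ("A", "move"): {"B": 0.8, "A": 0.2},
        ("B", "stay"): {"B": 1.0},
        ("B", "move"): {"A": 1.0},
    },
    rewards={
        ("A", "stay"): 0.1,
        ("A", "move"): 0.0,
        ("B", "stay"): 1.0,
        ("B", "move"): 0.0,
    },
    gamma=0.99,
)
```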
The agent interacts with the environment by generating trajectories: sequences of states, actions, and rewards. The goal of the agent is to maximize the return, which is the sum of rewards over the trajectory, with a discount term that down-weights longer-term rewards:

$$G_t = \sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1}$$
A policy defines the agent's behavior by specifying the next action, $a_t$, to take in a given state, $s_t$. In RL the policy and the AI model are used somewhat interchangeably, with the model producing a policy $\pi(a_t \mid s_t)$ for a given state.
Many people in AI already know about AlphaZero; if that's you, feel free to skip to the MuZero or AlphaEvolve sections for more cutting-edge research.
Monte Carlo search converges to optimal solutions given infinite time. However, most interesting environments are "intractable"—the state space is far too large to search exhaustively.
Humans overcome this through heuristics that prune the search space. Consider chess: a human player doesn't evaluate every legal move. Instead, they apply learned principles like controlling the center, developing pieces early, protecting the king through castling, avoiding moving the same piece multiple times in the opening, and creating threats that force opponent responses. These heuristics dramatically reduce the branching factor during search.
Humans also use heuristics to limit search depth. Rather than thinking through every move until checkmate, they search until reaching a position they can evaluate. The point values assigned to pieces (pawn=1, knight=3, rook=5, queen=9) exemplify this. Pieces don't actually have points—only checkmate matters—but these values provide heuristics for evaluating positions without searching to the game's end.
AlphaZero trains a neural network to provide both heuristics: a policy to narrow the set of moves worth exploring, and a value function to evaluate positions without searching to the end.
The model is a convolutional neural network—the same architecture used for image recognition—applied to board positions.
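For intuition, here is a rough PyTorch sketch of the kind of two-headed network AlphaZero uses; the layer sizes, number of input planes, and action count are placeholders, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class PolicyValueNet(nn.Module):
    """Convolutional trunk with separate policy and value heads.

    Assumes the board is encoded as `in_planes` feature planes of size
    board_size x board_size; all sizes here are illustrative.
    """
    def __init__(self, in_planes=17, board_size=8, num_actions=4672, channels=64):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_planes, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        flat = channels * board_size * board_size
        self.policy_head = nn.Linear(flat, num_actions)  # logits over moves
        self.value_head = nn.Linear(flat, 1)             # expected outcome in [-1, 1]

    def forward(self, board_planes):
        x = self.trunk(board_planes).flatten(start_dim=1)
        policy_logits = self.policy_head(x)
        value = torch.tanh(self.value_head(x))
        return policy_logits, value
```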
AlphaZero uses MCTS with this learned model for both training and inference.
AlphaZero's planning expands the search tree by adding nodes (board states) and propagating value estimates upward. Each iteration follows four steps:
1. Selection: starting from the root, descend the existing tree by repeatedly choosing the most promising action.
2. Expansion: when the search reaches a position not yet in the tree, add it as a new node.
3. Evaluation: instead of random rollouts, the neural network evaluates the new position, returning a policy prior and a value estimate.
4. Backpropagation: the value estimate is propagated back up the path to the root, updating each node's statistics.
MCTS builds a search tree by iteratively selecting, expanding, evaluating, and backpropagating through nodes.
The selection step uses the UCB (Upper Confidence Bound) formula, which balances the value estimate with an exploration bonus based on the policy prior and visit counts.
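Concretely, the selection rule used by AlphaZero (often called PUCT) picks, at each node, the action maximizing a value term plus an exploration bonus, where $Q$ is the mean action value, $P$ the policy prior, $N$ the visit counts, and $c_{\text{puct}}$ a tunable exploration constant:

$$a = \arg\max_{a'} \left[\, Q(s, a') + c_{\text{puct}} \, P(s, a') \, \frac{\sqrt{\sum_b N(s, b)}}{1 + N(s, a')} \,\right]$$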
AlphaZero uses self-play to generate training data. The model plays against itself using MCTS, then trains on the resulting games. The loss function includes three components:
1. A policy loss: cross-entropy between the network's policy and the MCTS visit distribution, so the network learns to predict the moves that search prefers.
2. A value loss: mean squared error between the network's value estimate and the actual game outcome.
3. A regularization term on the network weights to prevent overfitting.
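Written out, the loss from the AlphaZero paper takes the form below, where $z$ is the game outcome, $v$ the predicted value, $\boldsymbol{\pi}$ the MCTS visit distribution, $\mathbf{p}$ the network's policy, and $c$ the regularization weight:

$$\ell = (z - v)^2 - \boldsymbol{\pi}^\top \log \mathbf{p} + c \lVert \theta \rVert^2$$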
A natural question: can an opponent exploit the value estimates produced by MCTS?
Consider naive Monte Carlo search using random rollouts. The value estimate is the average outcome across random trajectories. An opponent could exploit this by choosing moves where most random continuations are terrible but one specific continuation wins. The average looks bad, but the opponent can force the winning line.
AlphaZero avoids this pitfall because both players use MCTS during self-play. The value estimates assume both players play optimally according to the current model. This creates a form of Nash equilibrium—neither player can improve by unilaterally changing strategy. As the model improves through training, this equilibrium converges toward true optimal play.
AlphaZero operates in two-player games with finite states and actions, deterministic transitions, and perfect information. The real world doesn't follow these rules. States are high-dimensional and partially observed. Transitions are stochastic. The state space may not even have a natural representation.
How can we extend AlphaZero's planning approach to robotics, language, and other real-world domains?
MuZero enables model-based RL in high-dimensional environments by learning a compact representation of the environment, then planning in that representation space.
Rather than learning a model of raw observations (which might be high-resolution images), MuZero learns a latent state space that captures only decision-relevant information. This mirrors human planning—we don't mentally simulate every photon and molecule, but instead think in terms of abstract concepts.
For example, when planning a chess move, we don't visualize the exact wood grain of the pieces or the precise RGB values of the board. We think abstractly: "If I move my knight here, it attacks the queen and controls the center."
MuZero uses three learned functions:
1. A representation function, which encodes the raw observation into a latent state.
2. A dynamics function, which takes a latent state and an action and predicts the next latent state and the immediate reward.
3. A prediction function, which maps a latent state to a policy and a value estimate, just as AlphaZero's network does for board positions.
MuZero learns a latent dynamics model, enabling planning in high-dimensional environments without explicit knowledge of the rules.
Planning proceeds as follows:
1. Encode the current observation into a latent state using the representation function.
2. Run MCTS entirely in latent space: the dynamics function simulates transitions between latent states, and the prediction function supplies the policy priors and value estimates that guide the search.
3. Select the action to take in the real environment based on the visit counts at the root, exactly as in AlphaZero.
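Here is a minimal sketch of how the three functions fit together during one simulated rollout in latent space; the class, function names, and network internals are placeholders rather than MuZero's actual architecture.

```python
import torch
import torch.nn as nn

class MuZeroNets(nn.Module):
    """Toy stand-ins for MuZero's representation, dynamics, and prediction functions."""
    def __init__(self, obs_dim=64, latent_dim=32, num_actions=4):
        super().__init__()
        self.num_actions = num_actions
        # h: observation -> latent state
        self.representation = nn.Linear(obs_dim, latent_dim)
        # g: (latent state, action) -> (next latent state, reward)
        self.dynamics = nn.Linear(latent_dim + num_actions, latent_dim + 1)
        # f: latent state -> (policy logits, value)
        self.prediction = nn.Linear(latent_dim, num_actions + 1)

    def simulate(self, observation, actions):
        """Unroll a sequence of actions entirely in latent space."""
        latent = torch.relu(self.representation(observation))
        outputs = []
        for action in actions:
            one_hot = torch.nn.functional.one_hot(
                torch.tensor(action), self.num_actions
            ).float()
            g_out = self.dynamics(torch.cat([latent, one_hot]))
            latent, reward = torch.relu(g_out[:-1]), g_out[-1]
            f_out = self.prediction(latent)
            policy_logits, value = f_out[:-1], f_out[-1]
            outputs.append((reward, policy_logits, value))
        return outputs

# Usage: plan over a hypothetical action sequence without touching the real environment.
nets = MuZeroNets()
obs = torch.randn(64)
rollout = nets.simulate(obs, actions=[0, 2, 1])
```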
MuZero trains all three functions jointly using self-play. For each state encountered during play, it unrolls $K$ steps using the dynamics model and compares predictions to actual outcomes: the predicted policy is trained against the MCTS visit distribution, the predicted value against the observed return, and the predicted reward against the real reward received:

$$\ell_t(\theta) = \sum_{k=0}^{K} \Big[\, \ell^{p}\big(\pi_{t+k},\, p_t^{k}\big) + \ell^{v}\big(z_{t+k},\, v_t^{k}\big) + \ell^{r}\big(u_{t+k},\, r_t^{k}\big) \,\Big] + c \lVert \theta \rVert^2$$
MuZero demonstrates a powerful insight: learned representations of the world become far more capable when combined with explicit planning techniques. The model doesn't just predict what might happen—it actively searches through possibilities to find better solutions.
This insight is especially relevant in the current era of foundational models. Large language models for text and vision-language-action models for robotics already learn rich representations of their domains. Yet current agentic AI systems have a significant limitation: they don't perform explicit planning. They generate responses or actions directly from their learned representations without systematically exploring alternatives.
MuZero points toward a more powerful paradigm: pairing foundational models with the kind of search and planning techniques that made AlphaZero successful. By combining learned world models with explicit planning, we can build AI systems that don't just react based on learned patterns, but actively reason about consequences and search for better solutions.
Model-based RL and planning techniques have proven surprisingly applicable beyond games. MCTS has been successfully applied to robotics, chip design, and even language model reasoning.
Engineering represents another domain with exploitable structure. While it seems uniquely human, engineering follows patterns that models can learn and leverage.
Engineering proceeds through phases of refinement, solving problems at progressively finer levels of detail.
Consider aerospace engineering. Vehicle design begins with conceptual design: a small team determines high-level parameters like dimensions, airfoil selection, and wingspan. Once these are established, the project moves to detailed design with a larger team. High-level specifications become detailed CAD models, revealing numerous sub-problems to solve.
Each phase is essentially a search over design parameters, guided by engineering intuition to avoid exploring obviously poor designs.
This parallels AlphaZero's approach: combining intuition (the policy network) with systematic search (MCTS) to navigate large solution spaces.
AlphaEvolve applies this principle to engineering by using LLMs as the "intuition" component. Crucially, it recognizes that LLMs alone aren't intelligent enough to solve sophisticated problems directly. However, LLMs combine moderate intelligence with extreme speed. While a genius human might intuit the right answer or narrow the search space effectively, an LLM can explore a wider space more quickly by generating many candidates rapidly.
This approach only works for problems with verifiable solutions. The system must be able to check whether a proposed solution is correct or measure its quality. This includes mathematical proofs (verifiable by proof checkers), code optimization (verifiable by testing correctness and measuring performance), and engineering designs (verifiable by simulation or physical testing).
Without verifiability, there's no way to distinguish good solutions from plausible-sounding nonsense.
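For example, a verifier for candidate sorting code might simply test correctness on random inputs and time the result; everything in the sketch below (function name, scoring scheme) is illustrative.

```python
import random
import time

def evaluate_candidate(sort_fn, trials=100, size=1000):
    """Score a candidate sorting function: correctness is mandatory, speed is the fitness."""
    total_time = 0.0
    for _ in range(trials):
        data = [random.random() for _ in range(size)]
        start = time.perf_counter()
        result = sort_fn(list(data))
        total_time += time.perf_counter() - start
        if result != sorted(data):
            return float("-inf")  # incorrect solutions are rejected outright
    return -total_time / trials   # higher score = faster average runtime

# Usage: score the built-in sorted() as a baseline candidate.
print(evaluate_candidate(sorted))
```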
AlphaEvolve combines LLM generation with evolutionary search. The system maintains a population of candidate solutions and iteratively improves them:
1. Sample promising candidates (and their evaluation results) from the population.
2. Prompt the LLM with these candidates to propose new or modified solutions.
3. Score each proposal with the automated evaluator.
4. Add the best-scoring proposals back into the population, and repeat.
AlphaEvolve uses LLMs as guided mutators within an evolutionary search framework to solve verifiable problems.
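A heavily simplified sketch of this loop, assuming a hypothetical `llm_propose` call and an `evaluate` scorer like the one above; this illustrates the evolutionary structure, not AlphaEvolve's actual implementation.

```python
import random

def evolutionary_search(seed_candidates, llm_propose, evaluate,
                        generations=50, population_size=20):
    """LLM-guided evolutionary loop: the LLM mutates parents, the evaluator selects survivors.

    `llm_propose(parents)` is a hypothetical function that prompts an LLM with a few
    scored parent solutions and returns a list of new candidates (e.g. code strings).
    """
    # Population holds (score, candidate) pairs.
    population = [(evaluate(c), c) for c in seed_candidates]
    for _ in range(generations):
        # Select a few high-scoring parents, plus one random pick for diversity.
        parents = sorted(population, reverse=True, key=lambda p: p[0])[:3]
        parents.append(random.choice(population))
        # The LLM acts as a guided mutator over the parent solutions.
        children = llm_propose([c for _, c in parents])
        # Verify and score every child before it can enter the population.
        population.extend((evaluate(c), c) for c in children)
        # Keep only the fittest candidates.
        population = sorted(population, reverse=True, key=lambda p: p[0])[:population_size]
    return max(population, key=lambda p: p[0])
```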
The LLM serves two roles: as a generator creating novel solutions and variations, and as a guided mutator making targeted improvements based on what has worked before.
This differs from pure evolutionary algorithms (which use random mutations) and pure LLM generation (which lacks iterative refinement). The combination leverages the LLM's understanding of code and mathematical structure while using evolutionary search to overcome the LLM's limitations in complex reasoning.
AlphaEvolve has achieved state-of-the-art results in GPU kernel optimization (generating CUDA kernels that outperform hand-written code), data center optimization (improving cooling efficiency and power distribution), and theoretical mathematics (discovering novel constructions in combinatorics and graph theory).
The progression from AlphaZero to MuZero to AlphaEvolve reveals a consistent theme: combining learned intuition with systematic search produces results neither approach achieves alone.
AlphaZero showed that neural networks can learn the heuristics needed for effective planning in perfect-information games. MuZero extended this by learning compact representations of complex environments, enabling planning in domains where the raw state space is intractable. AlphaEvolve demonstrated that these principles transfer to open-ended engineering and mathematics problems, using LLMs as the intuition component and evolutionary search as the exploration mechanism.
The key insight across all three systems is the same: moderate intelligence combined with systematic search often outperforms raw intelligence alone. This suggests a promising direction for AI development—not relying solely on ever-larger models, but combining learned representations with principled search and planning techniques.
For problems with verifiable solutions, this paradigm offers a path to superhuman performance. The challenge ahead lies in extending these methods to domains where verification is difficult or impossible, and in scaling these approaches to tackle increasingly complex real-world problems.