The Gain Function Hiding in Plain Sight
Modern ML has a "looks confident while being wrong" problem that control theorists solved decades ago. Diffusion models, flow matching, selective SSMs, and test-time compute are all secretly rediscovering the same primitive — the gain function — from different directions. None of them cite Yang/Mehta/Meyn 2013.
Photo Credit: Rob Grzywinski
What does a diffusion model's score function have to do with a Mars rover's attitude estimator? More than either community seems to know.

If you spend enough time reading recent ML papers, you start to notice the same shape, over and over. A little function that nudges things toward consistency with what you observe. A score. A flow. A gain. Different notation, different conferences, different journals. But underneath it's the same object doing the same job.

Four ML communities are independently rediscovering the same primitive, in parallel, without talking to each other.

This would be a fun curiosity except there's a sixty-year-old field called filtering theory that has been doing exactly this since before transistors got cheap. It runs the gyroscope in your phone, every airplane autopilot ever shipped, GPS, every Mars rover, most weather forecasting, and a surprising amount of modern statistics. It is taught in roughly zero ML curricula.

So this is the post where we try to fix that. Or at least nudge it.

We're going to do a quick tour: what filtering even is, what a "particle" is (no, not the physics kind), why a Gaussian sometimes lies to you with great confidence, and then the punchline: a twelve-year-old algorithm called the Feedback Particle Filter that we think is hiding, unattributed, inside half the ML you already know. We'll motivate it, we'll show it working, and we'll make the case that it's the right primitive for an embarrassing amount of what you already do for a living.
Before any of that — take a look at the visualization below. Three filters racing against a truth that keeps bouncing between two values. Watch the orange one in particular. Notice how confident it looks while it's being wrong. Notice when the green one wakes up.
Sixty Years of Results. Zero Mentions in the PyTorch Docs
Before we say another word about particles or gain functions or anything else that's about to feel like jargon, we want to put something on the table.

There's a whole branch of applied math called filtering theory, and it has been quietly running the world since Norbert Wiener was thinking about anti-aircraft fire control in the 1940s. Rudolf Kálmán refined the core ideas in 1960. The lineage runs through Stratonovich, Zakai, Beneš: people whose names appear in footnotes ML folks never read because the footnotes are in control theory journals. The field has been producing provably optimal or near-optimal solutions to inference problems for longer than most ML researchers have been alive.

There is something going on in the world (call it X) that you cannot observe directly. X could be the true position of an aircraft, the temperature at a point in the ocean where there's no buoy, the latent "meaning" an LLM is tracking mid-generation, the actual state of a patient's disease. X evolves through time. Meanwhile, you have access to a related signal (call it Y) that is a noisy, incomplete, possibly garbled function of X. Squashed. Scrambled. Delayed. The question filtering theory asks is: given everything you've seen of Y so far, what is your best belief about X right now?
That object, $p(X_t \mid Y_{0:t})$, the conditional distribution of X given the history of Y, is called the posterior or the filtering distribution. The whole game is to track it as it moves and changes, in real time, as new Y arrives.

If this sounds like Bayesian inference with a time axis bolted on, that's because it is. Welcome to 1960.

Two algorithm families dominated the first several decades.

Kalman filters (1960, Kálmán). If the world is linear-Gaussian (X evolves linearly with Gaussian noise, Y is a linear function of X with Gaussian noise), then the posterior is exactly a Gaussian at every step, and you can update it in closed form with an embarrassingly simple set of equations. No sampling. No approximation. Exact optimal inference, dirt cheap. The overwhelming majority of deployed estimation engineering still runs on this or a close relative of it. You have almost certainly been inside a vehicle whose attitude estimate was computed this way.

The catch: the world is often not linear-Gaussian.

Particle filters (1993, Gordon/Salmond/Smith). When the world doesn't cooperate, you give up on the closed form and represent your belief as a cloud of samples. N particles, each one a hypothesis: "what if X is at this value right now?" Propagate them forward. Weight them by how well they explain the new observation. Resample. Repeat. It's Monte Carlo applied to a moving distribution, and it will converge to the right answer as N gets large. It's provably correct. It's also provably expensive, and (as we'll see in the visualization) it has a failure mode that should look uncomfortably familiar to anyone who has watched a model ensemble collapse.

Those two families got most of the attention for most of filtering theory's history.

Then in 2013, Tao Yang, Prashant Mehta, and Sean Meyn published a paper introducing the Feedback Particle Filter. Same theoretical endpoint as the classical particle filter. Same asymptotic guarantees. But the mechanism is fundamentally different: instead of weighting and killing particles, you steer them. Every particle gets a small nudge toward consistency with the observation, computed via a function called the gain that depends on the current shape of the ensemble. No resampling. No diversity collapse. The gain function is mathematically the same object as the score function in diffusion models, the conditional velocity in flow matching, and the input-dependent gating in selective SSMs like Mamba.

The ML community has been rediscovering this wheel. We're just naming it.
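For readers who have never seen the closed-form Kalman update mentioned above, here is a minimal scalar sketch. The model (a random walk observed through additive noise) and the parameter names `a`, `q`, `h`, `r` are assumptions chosen for illustration, not anything taken from the visualization.

```python
import random

def kalman_step(mean, var, y, a=1.0, q=0.01, h=1.0, r=0.25):
    """One predict/update cycle of a scalar Kalman filter.

    Assumed model: x_next = a*x + N(0, q),  y = h*x + N(0, r).
    """
    # Predict: push the Gaussian belief through the linear dynamics.
    mean_pred = a * mean
    var_pred = a * a * var + q
    # Update: the gain weighs prediction uncertainty against sensor noise.
    gain = var_pred * h / (h * h * var_pred + r)
    mean_new = mean_pred + gain * (y - h * mean_pred)
    var_new = (1.0 - gain * h) * var_pred
    return mean_new, var_new

def run_demo(steps=200, seed=0):
    rng = random.Random(seed)
    x, mean, var = 1.0, 0.0, 1.0
    for _ in range(steps):
        x = x + rng.gauss(0.0, 0.1)   # hidden state: random walk, q = 0.1**2
        y = x + rng.gauss(0.0, 0.5)   # noisy reading, r = 0.5**2
        mean, var = kalman_step(mean, var, y)
    return x, mean, var

x_true, est, est_var = run_demo()
print(f"truth={x_true:.3f} estimate={est:.3f} variance={est_var:.4f}")
```

The whole belief is two numbers, and the gain is a single scalar that is the same for every possible value of X. That is what makes it cheap, and it is also what the later sections push against.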
What's a Particle?
A particle is a guess.

That's it. A single number, or a vector, or a configuration. One hypothesis about what the hidden state X might be right now. If you have 200 particles, you have 200 simultaneous guesses, and together they describe a probability distribution.

One thing to get solid before moving on: the cloud of dots is not a set of models. The particles don't have trainable weights. They aren't networks. They aren't ensembles in the ML sense. Each particle is just a value of X, a single point in the space of "what could be true right now." The whole swarm, taken together, is the belief.

This is a genuinely different object from anything in the standard ML toolkit, and that's worth sitting with for a second. Most ML uncertainty methods (deep ensembles, MC dropout, Bayesian NNs) give you multiple models, each of which produces a distribution over outputs. Particles are not models. They're samples from the posterior directly. The distribution isn't something you compute from the particles; the particles are the distribution.

(A useful framing, if you want one: a particle is what a sample from an MCMC chain looks like when the target distribution is moving and the sample has to keep up in real time.)

Now, why does this matter enough to write about? Because of what happens when your belief is genuinely two-humped.
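A small sketch of "the particles are the distribution": any question about the belief is answered by averaging over the swarm. The two-humped swarm below (centres at ±1, spread 0.2) is made up for illustration.

```python
import random

rng = random.Random(1)

# A belief with two humps: each guess sits near -1 or near +1.
particles = [rng.gauss(rng.choice([-1.0, 1.0]), 0.2) for _ in range(200)]

def expectation(f, swarm):
    """Posterior expectation of f(X), read straight off the samples."""
    return sum(f(x) for x in swarm) / len(swarm)

prob_positive = expectation(lambda x: 1.0 if x > 0 else 0.0, particles)
mean = expectation(lambda x: x, particles)
near_mean = expectation(lambda x: 1.0 if abs(x - mean) < 0.3 else 0.0, particles)

print(f"P(X > 0) is roughly {prob_positive:.2f}")
print(f"mean is {mean:.2f}; fraction of particles within 0.3 of it: {near_mean:.2f}")
```

The swarm can report "about half the mass is on each side" without any extra machinery, while its own mean lands where almost none of the guesses live.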
Why a Gaussian Sometimes Lies With Great Confidence
Two real answers: truth lives at +μ or at −μ with equal probability. The best possible single Gaussian fitted to that truth peaks exactly in between the two answers.

The further apart the two modes are, the worse that peak becomes as an answer: it sits on the unstable equilibrium between them, the place truth least wants to be.

This is not a calibration problem. It is not a training problem. It is a structural one. Any method whose output is a single mean and variance (a single bell curve) is mathematically incapable of saying "I think A or B." It can only say "somewhere around the midpoint of A and B, give or take." When A and B are very different, the value it rates most likely is one the truth almost never takes, and a filter that then shrinks its variance turns that into a very confident wrong answer.

We bring this up not because bimodal posteriors are exotic. We bring it up because you have almost certainly already met this failure mode. It just had different names.
| What you've seen | What's actually happening |
|---|---|
| LLM confident hallucination | Multimodal posterior over completions, collapsed to one path by sequential decoding |
| GAN mode collapse | Data distribution has multiple modes, generator settled on one |
| Deep ensemble members converging | Independently trained models finding the same mode |
| Self-consistency chains all going the same wrong way | Particles with no diversity, no way to recover |
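To put numbers on the single-bell-curve problem described before the table, the sketch below fits one Gaussian (by matching mean and variance) to an equal mixture of humps at ±μ and compares densities at the fitted peak. The hump width `s = 0.2` and the values of μ are illustrative assumptions.

```python
import math

def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, mu, s):
    """Truth: equal-weight humps at -mu and +mu, each with std s."""
    return 0.5 * normal_pdf(x, -mu, s) + 0.5 * normal_pdf(x, mu, s)

def gaussian_summary(mu, s):
    """Mean and std of the single Gaussian with the same first two moments."""
    return 0.0, math.sqrt(s * s + mu * mu)

s = 0.2
for mu in (0.5, 1.0, 2.0):
    mean, std = gaussian_summary(mu, s)
    fit_at_peak = normal_pdf(mean, mean, std)
    truth_at_peak = mixture_pdf(mean, mu, s)
    print(f"mu={mu}: fit peaks at x={mean}, "
          f"fitted density there={fit_at_peak:.3f}, "
          f"true density there={truth_at_peak:.2e}")
```

As μ grows, the fitted curve still calls the midpoint its most likely value while the true density there falls off exponentially.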
The instinct to summarize with a mean and variance is reasonable: Gaussians are tractable, they're fast, the math is clean. But when the posterior isn't Gaussian, that tractability is purchased at the cost of the structure that matters most. You get a confident answer. Just not the right shape of one.

Which brings us back to the visualization at the top and the three characters in it.
Three Filters, Three Philosophies
The visualization at the top shows three algorithms racing against a bistable truth: a hidden state X that lives near +1 or −1 and occasionally flips. The observation is Y = X². Squaring destroys the sign, so measuring Y is like hearing that something is at distance 1.2 from the origin without knowing which direction. Two consistent answers, always. That's the game.

Each filter represents a genuine school of thought about how to handle this. We'll go through them in the order they appear.
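A minimal simulation of a setup like this one, for readers who want to reproduce the game. The post does not spell out the dynamics, so the double-well drift `x - x**3`, the noise levels, and the step size are all assumptions picked so that X sits near ±1 and flips now and then.

```python
import random

def simulate(steps=20000, dt=0.01, sigma_x=0.5, sigma_y=0.3, seed=0):
    """Euler-Maruyama run of an assumed double-well state, observed as Y = X^2 + noise."""
    rng = random.Random(seed)
    x, xs, ys = 1.0, [], []
    for _ in range(steps):
        drift = x - x ** 3                    # wells at x = -1 and x = +1 (assumed)
        x += drift * dt + sigma_x * rng.gauss(0.0, dt ** 0.5)
        y = x * x + rng.gauss(0.0, sigma_y)   # squaring erases the sign of x
        xs.append(x)
        ys.append(y)
    return xs, ys

xs, ys = simulate()
crossings = sum(1 for a, b in zip(xs, xs[1:]) if a * b < 0)
in_wells = sum(1 for x in xs if abs(x) > 0.5) / len(xs)
print(f"zero crossings of the hidden state: {crossings}")
print(f"fraction of time spent with |x| > 0.5: {in_wells:.2f}")
print(f"last reading {ys[-1]:.2f} fits x = +{abs(ys[-1]) ** 0.5:.2f} "
      f"and x = -{abs(ys[-1]) ** 0.5:.2f} equally well")
```

Every observation is compatible with two mirror-image states, which is exactly the ambiguity each filter below has to live with.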
EKF — The Confident Liar
Filter name: Extended Kalman Filter. ML cousin: Mamba, selective state-space models (SSMs).

The EKF carries exactly one belief: a mean and a variance. A single bell curve. It doesn't have syntax for "probably A or B"; that concept doesn't fit in the data structure.

When the world is linear-Gaussian, this is fine. More than fine, actually: the Kalman filter is the optimal estimator. Sixty years of aerospace engineering ran on it. It is genuinely beautiful mathematics for the problems it's designed for.

The problem is what the EKF does when the world isn't linear-Gaussian. Our Y = X² is not linear. So the EKF does the thing all engineers do with nonlinear problems: it Taylor-expands around the current mean and pretends the result is exact. Then it updates its single bell curve as if that linearization were gospel.

Watch the orange row. For stretches, sometimes quite long ones, it tracks fine. Then truth jumps. The EKF, committed to its single mode, doesn't chase it; it parks somewhere in between, variance briefly widening, then snapping back to confident narrowness. It is wrong with a precision that is almost impressive.

The ML connection is not an analogy. Mamba and selective SSMs are, structurally, EKFs with learned parameters. Same linear-Gaussian belief over a hidden state, same input-dependent gain, same update mechanism. The EKF's failure mode, a unimodal belief that parks confidently in the wrong place, is the same failure mode you get from Mamba on tasks where the right answer has two plausible realizations. The math is the same. The failure looks the same. It just has a different name in the paper.

One-liner: the EKF assumes reality is shaped like one bell curve. When it isn't, it lies with impressive confidence.
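The linearization step can be made concrete with a scalar sketch of the EKF measurement update for Y = X². The sensor noise variance `r = 0.09` and the example beliefs are illustrative assumptions.

```python
def ekf_update(mean, var, y, r=0.09):
    """Scalar EKF measurement update for the assumed sensor y = x^2 + N(0, r)."""
    jac = 2.0 * mean                   # d(x^2)/dx evaluated at the current mean
    gain = var * jac / (jac * jac * var + r)
    mean_new = mean + gain * (y - mean * mean)
    var_new = (1.0 - gain * jac) * var
    return mean_new, var_new, gain

# Belief parked at the midpoint: the linearized sensor is flat, so the gain is zero
# and the observation changes nothing.
print(ekf_update(mean=0.0, var=1.0, y=1.0))

# Belief already leaning right: the update tightens around +1 and never looks at -1.
print(ekf_update(mean=0.8, var=0.2, y=1.0))
```

In the second call the variance shrinks sharply after a single reading, even though the same reading is just as compatible with X near −1.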
Bootstrap PF — The Desperate Survivor
Filter name: Bootstrap Particle Filter (Bootstrap PF). ML cousin: self-consistency, majority-vote ensembles.

A particle filter abandons the closed-form belief entirely. Instead: N particles, each one a hypothesis, "what if X is at this value right now?" Two steps per timestep.

- Predict: each particle takes a forward step through the dynamics, plus a kick of noise. This spreads the swarm, representing uncertainty about where X might have moved.
- Update: weight each particle by how well it explains the new observation. The particle that predicted Y closest to what you actually saw gets more weight. Then resample with replacement: draw a new population of N particles with probability proportional to weight.

That's it. No closed form. No Gaussian constraint. And it's provably correct: as N → ∞, the particle distribution converges to the true posterior at every step.

The failure mode is in the resampling. When you resample with replacement, you duplicate high-weight particles and discard low-weight ones. After a few rounds, many particles are identical: children and grandchildren of the same ancestor from two steps back. The diversity of the swarm collapses. Filter people call this sample impoverishment, the flip side of the weight degeneracy that resampling was introduced to cure. It's what you'd call mode collapse.

Watch the blue row. Early in the run, you'll see roughly even numbers of particles on each side: the filter is holding both modes. Then, at some point, usually quietly, one side wins. All the particles are in the same well. If truth is on that side, great. If it's not, the filter is stuck: there's nothing on the other side to upweight. Recovery requires particles to drift across zero through noise, which is slow and unreliable.

The self-consistency connection is structural. Generate N reasoning chains, score them by some criterion (majority vote, answer consistency, reward model), resample toward the winners. That's bootstrap PF applied to reasoning. Same diversity collapse: when all chains explore the same wrong territory, there's nothing to reweight toward the right answer.

One-liner: evolution by natural selection. Works well until the environment shifts. Then your survivors are all adapted to the wrong well.
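The predict, weight, and resample cycle described above can be sketched in a few lines. The double-well dynamics, the noise levels, and the constant observation `y = 1.0` fed in at every step are assumptions for illustration; resampling here is plain multinomial sampling via `random.choices`.

```python
import math
import random

def bootstrap_step(particles, y, rng, dt=0.01, sigma_x=0.5, sigma_y=0.3):
    """One predict/weight/resample cycle for the assumed sensor y = x^2 + noise."""
    # Predict: move each hypothesis through the (assumed) dynamics, add process noise.
    moved = [x + (x - x ** 3) * dt + sigma_x * rng.gauss(0.0, math.sqrt(dt))
             for x in particles]
    # Update: weight by the Gaussian likelihood of the observation.
    weights = [math.exp(-0.5 * ((y - x * x) / sigma_y) ** 2) for x in moved]
    if sum(weights) == 0.0:            # every particle implausible: fall back to uniform
        weights = [1.0] * len(moved)
    # Resample with replacement in proportion to weight.
    return rng.choices(moved, weights=weights, k=len(moved))

rng = random.Random(0)
swarm = [rng.uniform(-1.5, 1.5) for _ in range(200)]
for _ in range(600):
    swarm = bootstrap_step(swarm, y=1.0, rng=rng)

right = sum(1 for x in swarm if x > 0)
print(f"particles right of zero: {right}, left of zero: {len(swarm) - right}")
print(f"distinct values straight after resampling: {len(set(swarm))} of {len(swarm)}")
```

The likelihood treats both wells identically, so any imbalance between the two counts comes purely from the randomness of resampling. Running it with different seeds is a quick way to see how one-sided the split can get.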
Feedback Particle Filter — The Quiet Guide
Filter name: Feedback Particle Filter (FPF). ML cousin: diffusion model score functions, flow matching, and (we'll argue) what test-time compute should be doing.

No weights. No resampling. Each particle gets a small continuous nudge toward consistency with the observation. The nudge is computed by a function called the gain, K(x), that depends on the current shape of the entire swarm. Particles steer each other.

In Stratonovich form, with drift $a$, process noise $\sigma_B\,dB^i_t$, and $\hat{H}_t$ the swarm average of the observation function $H$, the update for particle $i$ reads:

$$dX^i_t = a(X^i_t)\,dt + \sigma_B\,dB^i_t + K(X^i_t, t) \circ \left(dY_t - \tfrac{1}{2}\bigl(H(X^i_t) + \hat{H}_t\bigr)\,dt\right)$$

The first two terms in the update equation are just standard noisy dynamics: the particle evolves forward like it would without any observation. The third term is the correction: K(x) times the innovation, which is roughly the gap between what this particle predicted and what was actually observed. The gain K tells each particle how hard to steer and in which direction, based on where in the state space it lives and what the rest of the swarm looks like.

Watch the green row. The swarm stays bimodal. Particles cluster near +1 and near −1, and neither population cannibalizes the other. When truth jumps from one well to the other, the particles on the destination side are already there, so the filter doesn't have to rebuild from scratch. It just shifts probability mass.

Now the ML connection, stated plainly: the gain function K(x) is the same mathematical object as the score function in diffusion models, $\nabla_x \log p(x \mid y)$, the gradient of the log-posterior with respect to the state. It's the conditional velocity field in flow matching. It's the input-dependent gating in Mamba, at least at the level of the idea. It's what self-consistency's majority-vote step is trying to approximate, crudely, by counting.

Yang, Mehta, and Meyn worked this out in 2013 and published it in provably exact form with convergence guarantees. The four ML communities rediscovering pieces of it over the following decade did so without citing them. This isn't a criticism; parallel discovery happens, especially across field boundaries. It's an observation: the primitive exists, it's well-understood, and knowing that it exists is useful for thinking about where the ML versions fail and why.

One-liner: instead of killing bad particles, steer everyone gently. Diversity survives. So do multimodal posteriors.
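To see what "a gain that depends on the shape of the swarm" can look like in practice, here is a rough one-dimensional sketch. It is not presented as the paper's numerical method. In 1D the gain satisfies p(x)·K(x) = −(1/σ_w²)·∫ from −∞ to x of (h(y) − ĥ)·p(y) dy, and the code approximates the integral with a running sum over sorted particles and p(x) with a kernel density estimate. The function name `fpf_gain_1d`, the bandwidth rule, and both test swarms are assumptions for illustration.

```python
import math
import random

def fpf_gain_1d(particles, h, sigma_w=1.0):
    """Approximate K(x_i) at every particle; returns (sorted particles, gains)."""
    xs = sorted(particles)
    n = len(xs)
    h_vals = [h(x) for x in xs]
    h_hat = sum(h_vals) / n                        # swarm average of h
    mean = sum(xs) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    bw = 1.06 * std * n ** (-0.2)                  # rule-of-thumb bandwidth (heuristic)
    gains, running = [], 0.0
    for i, x in enumerate(xs):
        running += (h_vals[i] - h_hat) / n         # running sum stands in for the integral
        density = sum(math.exp(-0.5 * ((x - z) / bw) ** 2) for z in xs)
        density /= n * bw * math.sqrt(2.0 * math.pi)
        gains.append(-running / (sigma_w ** 2 * density))
    return xs, gains

rng = random.Random(0)

# Sanity check: Gaussian swarm with a linear sensor should give a roughly constant
# gain close to variance / sigma_w^2, which is the Kalman gain.
gauss_swarm = [rng.gauss(0.0, 1.0) for _ in range(400)]
_, k_lin = fpf_gain_1d(gauss_swarm, h=lambda x: x)
print(f"linear sensor, unit-variance swarm: median gain {sorted(k_lin)[200]:.2f}")

# Two-well swarm with the squared sensor: the gain changes sign across zero.
two_wells = [rng.gauss(rng.choice([-1.0, 1.0]), 0.2) for _ in range(400)]
xs, k_sq = fpf_gain_1d(two_wells, h=lambda x: x * x)
left = [k for x, k in zip(xs, k_sq) if x < 0]
right = [k for x, k in zip(xs, k_sq) if x > 0]
print(f"mean gain on the left {sum(left) / len(left):.3f}, "
      f"on the right {sum(right) / len(right):.3f}")
```

With the squared sensor the gain is negative on the left and positive on the right, so a reading larger than predicted pushes both clusters outward and a smaller one pulls both inward. Neither cluster is asked to cross over or die, which is the mode-preserving behaviour described above.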
The Rosetta Stone
If you've made it this far, you deserve the full translation table. These are the same ideas wearing different hats:
| Filter people call it | ML people call it |
|---|---|
| State, Xt | Latent variable, ground truth |
| Observation Yt | Data, observed signal |
| Observation function H(X) | Encoder, forward model |
| Filtering | Online posterior inference |
| Extended Kalman Filter | Mamba, selective state-space |
| Sequential importance / resample | Self-consistency, majority vote |
| FPF (Yang–Mehta–Meyn 2013) | Diffusion score, flow matching |
Why This Is the Future
Anywhere noisy observations meet a hidden state worth estimating, you have a filtering problem.

The ML hit list is long: test-time compute, tree-of-thought, MCTS reasoning. World models in RL. Agentic loops with tool use. Diffusion and flow matching (FPF gain in disguise). LLM uncertainty and hallucination. Online learning with concept drift.

The physics-simulation hit list is even longer. SPH (smoothed particle hydrodynamics) already uses particles in every Pixar and PhysX pipeline but has no feedback loop. Cloth simulation with mocap constraints. AR/VR pose tracking: every Apple Vision Pro and Quest headset is running filters that don't preserve modes. SLAM in robotics already uses particle filters; mode-preserving SLAM via FPF is the obvious next step nobody's taken. Differentiable physics with observations. Game physics with networked players. Crowd simulation, traffic, epidemics.

The thesis is simple: anywhere people already have particles and observations, FPF eats it. Anywhere they should have particles and don't, because someone said "use a Kalman filter," FPF is the right answer they haven't tried yet.

The primitive exists. It's well-understood. It has convergence proofs. And it's sitting in a 2013 paper that the fastest-moving field in the history of science has somehow managed not to read.

Maybe it's time.