Modern ML has a "looks confident while being wrong" problem that control theorists solved decades ago. Diffusion models, flow matching, selective SSMs, and test-time compute are all secretly rediscovering the same primitive, the gain function, from different directions. None of them cite Yang/Mehta/Meyn 2013.

What does a diffusion model's score function have to do with a Mars rover's attitude estimator? More than either community seems to know.

If you spend enough time reading recent ML papers you start to notice the same shape, over and over. A little function that nudges things toward consistency with what you observe. A score. A flow. A gain. Different notation, different conferences, different journals. But underneath it's the same object doing the same job.

Four ML communities are independently rediscovering the same primitive, in parallel, without talking to each other.

This would be a fun curiosity except there's a sixty-year-old field called filtering theory that has been doing exactly this since before transistors got cheap. It runs the gyroscope in your phone, every airplane autopilot ever shipped, GPS, every Mars rover, most weather forecasting, and a surprising amount of modern statistics. It is taught in roughly zero ML curricula.

So this is the post where we try to fix that. Or at least nudge it.

We're going to do a quick tour: what filtering even is, what a "particle" is (no, not the physics kind), why a Gaussian sometimes lies to you with great confidence, and then the punchline: a twelve-year-old algorithm called the Feedback Particle Filter that we think is hiding, unattributed, inside half the ML you already know. We'll motivate it, we'll show it working, and we'll make the case that it's the right primitive for an embarrassing amount of what you already do for a living.

Before any of that — take a look at the visualization below. Three filters racing against a truth that keeps bouncing between two values. Watch the orange one in particular. Notice how confident it looks while it's being wrong. Notice when the green one wakes up.

Sixty Years of Results. Zero Mentions in the PyTorch Docs

Before we say another word about particles or gain functions or anything else that's about to feel like jargon — we want to put something on the table.

There's a whole branch of applied math called filtering theory, and it has been quietly running the world since Norbert Wiener was thinking about anti-aircraft fire control in the 1940s. Rudolf Kalman refined the core ideas in 1960. The lineage runs through people whose names appear in footnotes ML folks never read, because the footnotes are in control theory journals. The field has been producing provably optimal or near-optimal solutions to inference problems for longer than most ML researchers have been alive.

There is something going on in the world; call it X. It could be the true position of an aircraft, the temperature at a point in the ocean where there's no buoy, the latent "meaning" an LLM is tracking mid-generation, the actual state of a patient's disease. X evolves through time. Meanwhile, you have access to a related signal, call it Y, that is a noisy, incomplete, possibly garbled function of X. Squashed. Scrambled. Delayed. The question filtering theory asks is: given everything you've seen of Y so far, what should you believe about X right now?

$$
\underbrace{X_t}_{\substack{\text{the thing you} \\ \text{can't see}}} \xrightarrow{\;\text{noisy, lossy}\;} \underbrace{Y_t}_{\substack{\text{what you} \\ \text{actually measure}}} \;\Longrightarrow\; \underbrace{p(X_t \mid Y_{0:t})}_{\substack{\text{your belief} \\ \text{about X, right now}}}
$$

That last object is the posterior. The whole game is to track it as it moves and changes, in real time, as new observations arrive.

If this sounds like Bayesian inference with a time axis bolted on, that's because it is. Welcome to 1960.
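Spelled out, "Bayesian inference with a time axis" is a two-line recursion. This is a sketch in discrete time, and it assumes two things the post has not stated formally: the hidden state is Markov, and each observation depends only on the current state.

$$
p(X_t \mid Y_{0:t-1}) = \int p(X_t \mid X_{t-1})\, p(X_{t-1} \mid Y_{0:t-1})\, dX_{t-1}
$$

$$
p(X_t \mid Y_{0:t}) = \frac{p(Y_t \mid X_t)\, p(X_t \mid Y_{0:t-1})}{\int p(Y_t \mid X_t')\, p(X_t' \mid Y_{0:t-1})\, dX_t'}
$$

The first line pushes yesterday's belief through the dynamics. The second reweights it by how well each candidate state explains the new observation. Every filter in this post is a different way of approximating those two lines.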

Two algorithm families dominated the first several decades.

The Kalman filter. If the world is linear with Gaussian noise, then the posterior is exactly a Gaussian at every step, and you can update it in closed form with an embarrassingly simple set of equations. No sampling. No approximation. Exact optimal inference, dirt cheap. Ninety-nine percent of deployed engineering still runs on this or a close relative of it. You have almost certainly been inside a vehicle whose attitude estimate was computed this way.
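To make "embarrassingly simple" concrete, here is a minimal sketch of one predict-and-update cycle for a scalar state. The model coefficients `a`, `q`, `c`, `r` and their default values are illustrative assumptions, not numbers from any system mentioned here.

```python
def kalman_step(mean, var, y, a=1.0, q=0.1, c=1.0, r=0.5):
    """One predict + update cycle of a scalar Kalman filter.

    Assumed model (illustrative only):
        x_next = a * x + process noise with variance q
        y      = c * x + observation noise with variance r
    """
    # Predict: push the Gaussian belief through the linear dynamics.
    mean_pred = a * mean
    var_pred = a * a * var + q

    # Update: blend prediction and observation using the Kalman gain.
    gain = var_pred * c / (c * c * var_pred + r)
    mean_new = mean_pred + gain * (y - c * mean_pred)
    var_new = (1.0 - gain * c) * var_pred
    return mean_new, var_new, gain


# Equal trust in prior and observation: the gain comes out as one half.
mean, var, gain = kalman_step(mean=0.0, var=1.0, y=2.0, q=0.0, r=1.0)
```

The whole belief is two numbers, and the gain is the knob that decides how far the observation is allowed to pull the mean. Keep an eye on that knob; it comes back later.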

The catch: the world is often not linear-Gaussian.

The particle filter. When the world doesn't cooperate, you give up on the closed form and represent your belief as a cloud of N particles, each one a hypothesis: "what if X is at this value right now?" Propagate them forward. Weight them by how well they explain the new observation. Resample. Repeat. It's Monte Carlo applied to a moving distribution, and it will converge to the right answer as N gets large. It's provably correct. It's also provably expensive, and, as we'll see in the visualization, it has a failure mode that should look uncomfortably familiar to anyone who has watched a model ensemble collapse.

Those two families got most of the attention for most of filtering theory's history.

Then in 2013, Tao Yang, Prashant Mehta, and Sean Meyn published a paper introducing the Feedback Particle Filter. Same theoretical endpoint as the classical particle filter. Same asymptotic guarantees. But the mechanism is fundamentally different: instead of weighting and killing particles, you move them. Every particle gets a small nudge toward consistency with the observation, computed via a function called the gain function that depends on the current shape of the ensemble. No resampling. No diversity collapse. The gain function is mathematically the same object as the score function in diffusion models, the conditional velocity in flow matching, and the input-dependent gating in selective SSMs like Mamba.

The ML community has been rediscovering this wheel. We're just naming it.

What's a Particle?

A particle is a guess. That's it. A single number, or a vector, or a configuration. One hypothesis about what the hidden state X might be right now. If you have 200 particles, you have 200 simultaneous guesses, and together they describe a probability distribution.

One thing to get solid before moving on: the cloud of dots is not a set of models. The particles don't have network weights. They aren't networks. They aren't ensembles in the ML sense. Each particle is just a value of the state, a single point in the space of "what could be true right now." The whole swarm, taken together, is the belief.

This is a genuinely different object from anything in the standard ML toolkit, and that's worth sitting with for a second. Most ML uncertainty methods (deep ensembles, MC dropout, Bayesian NNs) give you multiple models, or multiple passes through one model, and a spread over outputs. Particles are not models. They're samples from the posterior directly. The distribution isn't something you compute from the particles; the particles are the distribution.

(A useful framing, if you want one: a particle is what a sample from an MCMC chain looks like when the target distribution is moving and the sample has to keep up in real time.)
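If "the particles are the distribution" still feels abstract, a few lines of NumPy may help. This is a toy sketch: the two clusters at -1 and +1 and the helper name `expectation` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# 200 particles: each entry is one guess about the hidden state.
# (Assumed toy belief: half the guesses near -1, half near +1.)
particles = np.concatenate([
    rng.normal(-1.0, 0.1, size=100),
    rng.normal(+1.0, 0.1, size=100),
])


def expectation(particles, f):
    """Any question about the belief is an average over the particles."""
    return float(np.mean(f(particles)))


belief_mean = expectation(particles, lambda x: x)         # lands near zero
prob_positive = expectation(particles, lambda x: x > 0)   # fraction of guesses above zero
```

There is no model to query and no parameters to read off. Whatever you want to know about the belief, you ask the swarm and average the answers.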

Now — why does this matter enough to write about? Because of what happens when your belief is genuinely two-humped.

Why a Gaussian Sometimes Lies With Great Confidence

Suppose the truth is one of two values, A or B, with equal probability. The best possible single Gaussian fitted to that truth peaks exactly in between the two answers.

The further apart the two modes are, the more confidently the Gaussian points at a place that is, increasingly, the worst possible answer: the unstable equilibrium between them, the place truth least wants to be.

This is not a calibration problem. It is not a training problem. It is a structural one. Any method whose output is a single mean and variance, a single bell curve, is mathematically incapable of saying "I think it's A or B, and nothing in between." The closest it can get is "somewhere around the midpoint of A and B, with confidence inversely proportional to how far apart they are." When A and B are very different, the peak of that bell curve sits on a value the truth almost never takes.
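You can check this numerically in a few lines. The sketch below fits a mean and standard deviation to samples from a two-humped truth and then asks how much of the truth actually lives near the fitted peak. The separation of 3.0, the hump width of 0.3, and the helper names are all illustrative assumptions.

```python
import numpy as np


def gaussian_summary(samples):
    """Collapse a sample set to the two numbers a single bell curve can hold."""
    return float(np.mean(samples)), float(np.std(samples))


def mass_near(samples, center, radius):
    """Fraction of samples that fall within `radius` of `center`."""
    return float(np.mean(np.abs(samples - center) < radius))


rng = np.random.default_rng(1)
separation = 3.0  # assumed distance of each mode from zero

# A two-humped truth: equal mass near -separation and +separation.
samples = np.concatenate([
    rng.normal(-separation, 0.3, size=5000),
    rng.normal(+separation, 0.3, size=5000),
])

mean, std = gaussian_summary(samples)
truth_near_peak = mass_near(samples, center=mean, radius=1.0)
```

The fitted mean lands between the humps, and the share of true samples within one unit of it is essentially nothing. The summary's single most likely value is a value the truth does not visit.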

We bring this up not because bimodal posteriors are exotic. We bring it up because you have almost certainly already met this failure mode. It just had different names.

The instinct to summarize with a mean and variance is reasonable — Gaussians are tractable, they're fast, the math is clean. But when the posterior isn't Gaussian, that tractability is purchased at the cost of the structure that matters most. You get a confident answer. Just not the right shape of one.

Which brings us back to the visualization at the top and the three characters in it.

Three Filters, Three Philosophies

The visualization at the top shows three algorithms racing against a bistable truth: a hidden state X that sits in one of two wells, one on each side of zero, and occasionally flips. The observation is the square of X plus noise. Squaring destroys the sign, so measuring Y is like hearing that something is at distance |X| from the origin without knowing which direction. Two consistent answers, always. That's the game.
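For readers who want to play along at home, here is one way to generate data with this shape. It is a sketch, not the code behind the visualization: the double-well drift `x - x**3`, the noise levels, and the step size are all assumptions chosen to give two wells near -1 and +1.

```python
import numpy as np


def observation_fn(x):
    """Squared-state sensor: the sign of x is lost."""
    return x ** 2


def simulate_bistable(n_steps, dt=0.01, process_noise_std=0.6,
                      obs_noise_std=0.2, seed=0):
    """Simulate a hidden state in a double well plus sign-blind observations.

    Assumed dynamics (illustrative): dx = (x - x^3) dt + noise,
    which has stable wells near -1 and +1 and an unstable point at 0.
    """
    rng = np.random.default_rng(seed)
    x = np.empty(n_steps)
    y = np.empty(n_steps)
    state = 1.0  # start in the right-hand well
    for t in range(n_steps):
        drift = state - state ** 3
        state = (state + drift * dt
                 + process_noise_std * np.sqrt(dt) * rng.standard_normal())
        x[t] = state
        y[t] = observation_fn(state) + obs_noise_std * rng.standard_normal()
    return x, y


x, y = simulate_bistable(n_steps=500)
```

Because `observation_fn(-x)` equals `observation_fn(x)`, no amount of staring at `y` alone tells you which well the state is in. Run it longer, or turn up `process_noise_std`, if you want to see more flips.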

Each filter represents a genuine school of thought about how to handle this. We'll go through them in the order they appear.

EKF — The Confident Liar

Filter name: Extended Kalman Filter. ML cousin: Mamba, selective state-space models (SSMs).

The EKF carries exactly one belief: a mean and a variance. A single bell curve. It doesn't have syntax for "probably A or B, but nothing in between"; that concept doesn't fit in the data structure.

When the world is linear-Gaussian, this is fine. More than fine, actually — the Kalman filter is the optimal estimator. Sixty years of aerospace engineering ran on it. It is genuinely beautiful mathematics for the problems it's designed for.

The problem is what the EKF does when the world isn't linear-Gaussian. Our observation function H(X), which squares the state, is not linear. So the EKF does the thing all engineers do with nonlinear problems: it Taylor-expands around the current mean and pretends the result is exact. Then it updates its single bell curve as if that linearization were gospel.

Watch the orange row. For stretches — sometimes quite long ones — it tracks fine. Then truth jumps. The EKF, committed to its single mode, doesn't chase it; it parks somewhere in between, variance briefly widening, then snapping back to confident narrowness. It is wrong with a precision that is almost impressive.
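Here is a minimal sketch of the EKF measurement update for a squared-state observation, which makes the parking behavior easy to see. The observation noise variance of 0.04 is an assumed value, and this covers only the update step, not the full filter.

```python
def ekf_update(mean, var, y, obs_noise_var=0.04):
    """EKF measurement update for the observation y = x^2 + noise.

    The observation function is linearized at the current mean, so the
    filter only ever sees the slope 2 * mean.
    """
    predicted_obs = mean ** 2
    jacobian = 2.0 * mean  # derivative of x^2 at the mean

    innovation_var = jacobian * var * jacobian + obs_noise_var
    gain = var * jacobian / innovation_var

    mean_new = mean + gain * (y - predicted_obs)
    var_new = (1.0 - gain * jacobian) * var
    return mean_new, var_new, gain


# Parked exactly between the wells: the slope is zero, so the gain is zero
# and the observation cannot move the belief at all.
stuck_mean, stuck_var, stuck_gain = ekf_update(mean=0.0, var=1.0, y=1.0)

# Slightly off-center: the filter commits to that side and sharpens quickly.
moved_mean, moved_var, moved_gain = ekf_update(mean=0.5, var=1.0, y=1.0)
```

The first call is the pathology in miniature. At a mean of zero the linearized sensor is flat, so every observation looks uninformative. The second call shows the other half of the problem: once the mean is on one side, a single update collapses the variance around that side.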

The ML connection is not an analogy. Mamba and selective SSMs are, structurally, EKFs with learned parameters. Same linear-Gaussian belief over a hidden state, same input-dependent gain, same update mechanism. The EKF's failure mode — a unimodal belief that parks confidently in the wrong place — is the same failure mode you get from Mamba on tasks where the right answer has two plausible realizations. The math is the same. The failure looks the same. It just has a different name in the paper.

Bootstrap PF — The Desperate Survivor

Filter name: Bootstrap Particle Filter (Bootstrap PF). ML cousin: self-consistency, majority-vote ensembles.

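A particle filter abandons the closed-form belief entirely. Instead: N particles, each one a hypothesis, "what if X is at this value right now?" Two steps per timestep.

Predict: propagate every particle forward through the dynamics, with noise.

Update: weight each particle by how well it explains the new observation, then resample in proportion to the weights.

That's it. No closed form. No Gaussian constraint. And it's provably correct: as N goes to infinity, the particle distribution converges to the true posterior at every step.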
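In code, one timestep of a bootstrap particle filter is short. This is a sketch under assumptions: the double-well drift `x - x**3`, the squared-state observation, and the noise levels are illustrative choices, and multinomial resampling is only one of several resampling schemes.

```python
import numpy as np


def bootstrap_pf_step(particles, y, rng, dt=0.01,
                      process_noise_std=0.6, obs_noise_std=0.2):
    """One predict / weight / resample cycle of a bootstrap particle filter.

    Assumed model (illustrative): double-well dynamics and y = x^2 + noise.
    """
    # Predict: move every hypothesis forward through the noisy dynamics.
    drift = particles - particles ** 3
    noise = process_noise_std * np.sqrt(dt) * rng.standard_normal(particles.shape)
    predicted = particles + drift * dt + noise

    # Weight: Gaussian likelihood of the observation under each hypothesis.
    log_w = -0.5 * ((y - predicted ** 2) / obs_noise_std) ** 2
    weights = np.exp(log_w - log_w.max())  # subtract max for numerical stability
    weights /= weights.sum()

    # Resample: draw N indices with replacement, in proportion to the weights.
    idx = rng.choice(particles.size, size=particles.size, p=weights)
    return predicted[idx], idx


rng = np.random.default_rng(0)
particles = np.concatenate([-np.ones(100), np.ones(100)])
new_particles, ancestors = bootstrap_pf_step(particles, y=1.0, rng=rng)
```

The returned `ancestors` array records which old particle each new particle was copied from. Early on, both wells survive the resampling step, because the squared observation likes them equally. The trouble is in what repeated resampling does to that ancestry over time.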

The failure mode is in the resampling. When you resample with replacement, you duplicate high-weight particles and discard low-weight ones. After a few rounds, many particles are identical: children and grandchildren of the same ancestor from two steps back. The diversity of the swarm collapses. This is what filter people call sample impoverishment.

Watch the blue row. Early in the run, you'll see roughly even numbers of particles on each side — the filter is holding both modes. Then, at some point, usually quietly, one side wins. All the particles are in the same well. If truth is on that side, great. If it's not, the filter is stuck: there's nothing on the other side to upweight. Recovery requires particles to drift across zero through noise, which is slow and unreliable.
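The collapse does not even need uneven weights. The sketch below tracks how many original ancestors still have descendants after repeated resampling with perfectly uniform weights, which is the kindest case a particle filter can hope for. The function name and the particle counts are illustrative.

```python
import numpy as np


def surviving_ancestor_counts(n_particles, n_rounds, seed=0):
    """Count distinct original ancestors left after each resampling round.

    Weights are uniform here, so any loss of diversity comes purely from
    sampling with replacement.
    """
    rng = np.random.default_rng(seed)
    ancestors = np.arange(n_particles)  # each particle starts as its own ancestor
    counts = []
    for _ in range(n_rounds):
        idx = rng.choice(n_particles, size=n_particles)  # resample with replacement
        ancestors = ancestors[idx]
        counts.append(int(np.unique(ancestors).size))
    return counts


counts = surviving_ancestor_counts(n_particles=200, n_rounds=50)
```

The counts can only go down, because a lineage that dies out never comes back. With real, uneven weights the decline is faster, and once every survivor descends from particles in one well, the other well is gone.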

The self-consistency connection is structural. Generate N reasoning chains, score them by some criterion (majority vote, answer consistency, reward model), resample toward the winners. That's bootstrap PF applied to reasoning. Same diversity collapse: when all chains explore the same wrong territory, there's nothing to reweight toward the right answer.
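The voting step is small enough to write down. This is a toy sketch: the answer strings are invented, and real self-consistency pipelines differ in how they extract and score answers.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common final answer and its share of the votes."""
    counts = Counter(answers)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)


# A healthy swarm of chains: a minority hypothesis is still alive.
diverse_winner, diverse_share = majority_vote(["17", "42", "42", "42"])

# A collapsed swarm: every chain agrees, so the vote carries no information
# about whether the shared answer is right.
collapsed_winner, collapsed_share = majority_vote(["17", "17", "17", "17"])
```

A vote share of 1.0 looks like maximal confidence, but it is also what a fully collapsed particle population looks like. The vote can only choose among hypotheses that some chain actually produced.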

Feedback Particle Filter — The Quiet Guide

Filter name: Feedback Particle Filter (FPF). ML cousin: diffusion model score functions, flow matching, and — we'll argue — what test-time compute should be doing.

No weights. No resampling. Each particle gets a small continuous nudge toward consistency with the observation. The nudge is computed by a function called the gain function, K, that depends on the current shape of the entire swarm. Particles steer each other.
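For a scalar state in continuous time, the FPF particle update is usually written along the following lines. Treat this as a sketch in notation assumed here rather than taken from the post: a is the drift, sigma_B the process noise level, Z the integrated observation with dZ = H(X) dt + sigma_W dW, and the small circle marks a Stratonovich product.

$$
dX_t^i = a(X_t^i)\,dt + \sigma_B\,dB_t^i + K(X_t^i, t) \circ \Big( dZ_t - \tfrac{1}{2}\big(H(X_t^i) + \hat{H}_t\big)\,dt \Big), \qquad \hat{H}_t \approx \frac{1}{N} \sum_{j=1}^{N} H(X_t^j)
$$

The gain itself is defined through the current particle density p. In one dimension it satisfies

$$
\frac{\partial}{\partial x}\big(p(x,t)\,K(x,t)\big) = -\frac{1}{\sigma_W^2}\big(H(x) - \hat{H}_t\big)\,p(x,t)
$$

In the linear-Gaussian case, where H(x) = x and p is Gaussian with variance Sigma, this gives K = Sigma / sigma_W^2, which is the Kalman gain.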

The first two terms in the update equation are just standard noisy dynamics: the particle evolves forward like it would without any observation. The third term is the correction: the gain K multiplied by the innovation, which is roughly the gap between what this particle predicted and what was actually observed. The gain K tells each particle how hard to steer and in which direction, based on where in the state space it lives and what the rest of the swarm looks like.
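Here is a rough one-dimensional sketch of that update. Several things in it are assumptions: the gain is estimated from the once-integrated scalar gain equation using a running sum over sorted particles and a Gaussian kernel density estimate, the bandwidth rule is a rule of thumb, and the step is a plain Euler step that drops the Stratonovich-to-Ito correction term. It is meant to show the moving parts, not to be a production filter.

```python
import numpy as np


def fpf_gain_1d(particles, h, obs_noise_std=1.0, bandwidth=None):
    """Estimate the scalar FPF gain K at every particle.

    Uses p(x) K(x) = -(1 / sigma_w^2) * integral_{-inf}^{x} (h - h_hat) p dy,
    with the integral replaced by a running sum over sorted particles and
    p(x) replaced by a Gaussian kernel density estimate.
    """
    x = np.asarray(particles, dtype=float)
    n = x.size
    hx = h(x)
    h_hat = hx.mean()

    order = np.argsort(x)
    centered = (hx - h_hat)[order]
    # Running sum up to each particle, counting the particle itself by half.
    partial = (np.cumsum(centered) - 0.5 * centered) / n

    if bandwidth is None:
        bandwidth = 1.06 * x.std() * n ** (-0.2)  # rule-of-thumb bandwidth
    diffs = (x[order, None] - x[None, :]) / bandwidth
    density = np.exp(-0.5 * diffs ** 2).sum(axis=1) / (
        n * bandwidth * np.sqrt(2.0 * np.pi))

    gain = np.empty(n)
    gain[order] = -partial / (obs_noise_std ** 2 * density)
    return gain, h_hat


def fpf_step(particles, dz, dt, drift, h, rng,
             process_noise_std=0.6, obs_noise_std=0.2):
    """One Euler step: noisy dynamics plus gain times innovation.

    Approximation: the Stratonovich correction term is omitted.
    """
    gain, h_hat = fpf_gain_1d(particles, h, obs_noise_std)
    innovation = dz - 0.5 * (h(particles) + h_hat) * dt
    noise = process_noise_std * np.sqrt(dt) * rng.standard_normal(particles.shape)
    return particles + drift(particles) * dt + noise + gain * innovation


rng = np.random.default_rng(0)
swarm = np.concatenate([rng.normal(-1.0, 0.2, 1000), rng.normal(1.0, 0.2, 1000)])
gain, _ = fpf_gain_1d(swarm, lambda v: v ** 2)
next_swarm = fpf_step(swarm, dz=1.0 * 0.01, dt=0.01,
                      drift=lambda v: v - v ** 3, h=lambda v: v ** 2, rng=rng)
```

For the squared observation and a two-humped swarm, the estimated gain comes out positive on average in the right-hand well and negative on average in the left-hand one. An observation that says "further from zero" pushes each population outward on its own side, and nothing gets killed or copied along the way.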

Watch the green row. The swarm stays bimodal. Particles cluster near both wells, and neither population cannibalizes the other. When truth jumps from one well to the other, the particles on the destination side are already there, so the filter doesn't have to rebuild from scratch. It just shifts its mass.

Now the ML connection, stated plainly: the gain function K is the score function in diffusion models, the gradient of the log-posterior with respect to the state. It's the conditional velocity field in flow matching. It's the input-dependent gating in Mamba, at least at the level of the idea. It's what self-consistency's majority-vote step is trying to approximate, crudely, by counting.

Yang, Mehta, and Meyn worked this out in 2013 and published it in provably exact form with convergence guarantees. The four ML communities rediscovering pieces of it over the following decade did so without citing them. This isn't a criticism — parallel discovery happens, especially across field boundaries. It's an observation: the primitive exists, it's well-understood, and knowing that it exists is useful for thinking about where the ML versions fail and why.

The Rosetta Stone

If you've made it this far, you deserve the full translation table. These are the same ideas wearing different hats:

Why This Is the Future

Anywhere noisy observations meet a hidden state worth estimating, you have a filtering problem.

The ML hit list is long: test-time compute, tree-of-thought, MCTS reasoning. World models in RL. Agentic loops with tool use. Diffusion and flow matching (FPF gain in disguise). LLM uncertainty and hallucination. Online learning with concept drift.

The physics-simulation hit list is even longer. SPH — smoothed particle hydrodynamics — already uses particles in every Pixar and PhysX pipeline but has no feedback loop. Cloth simulation with mocap constraints. AR/VR pose tracking — every Apple Vision Pro and Quest headset is running filters that don't preserve modes. SLAM in robotics already uses particle filters; mode-preserving SLAM via FPF is the obvious next step nobody's taken. Differentiable physics with observations. Game physics with networked players. Crowd simulation, traffic, epidemics.

Anywhere people already have particles and observations, FPF eats it. Anywhere they should have particles and don't, because someone said "use a Kalman filter," FPF is the right answer they haven't tried yet.

The primitive exists. It's well-understood. It has convergence proofs. And it's sitting in a 2013 paper that the fastest-moving field in the history of science has somehow managed not to read.

What you've seen	What's actually happening
LLM confident hallucination	Multimodal posterior over completions, collapsed to one path by sequential decoding
GAN mode collapse	Data distribution has multiple modes, generator settled on one
Deep ensemble members converging	Independently trained models finding the same mode
Self-consistency chains all going the same wrong way	Particles with no diversity, no way to recover

Filter people call it	ML people call it
State, X_t	Latent variable, ground truth
Observation Y_t	Data, observed signal
Observation function H(X)	Encoder, forward model
Filtering	Online posterior inference
Extended Kalman Filter	Mamba, selective state-space
Sequential importance / resample	Self-consistency, majority vote
FPF (Yang–Mehta–Meyn 2013)	Diffusion score, flow matching

The Gain Function Hiding in Plain Sight

Sixty Years of Results. Zero Mentions in the PyTorch Docs

What's a Particle?

Why a Gaussian Sometimes Lies With Great Confidence

Three Filters, Three Philosophies

EKF — The Confident Liar

Bootstrap PF — The Desperate Survivor

Feedback Particle Filter — The Quiet Guide

The Rosetta Stone

Why This Is the Future
