In-context learning reveals how large language models absorb and apply new patterns at inference time without changing their underlying weights. These static systems achieve a surprisingly dynamic kind of understanding through clever architecture and emergent behavior, challenging our assumptions about what "learning" even means.
Photo Credit: Rob Grzywinski
Originally posted on February 23, 2023 on LinkedIn.
The "Woozle Wuzzle" Problem
Before I boggle your mind a little bit, let's briefly review what Large Language Models (LLMs) are. (See my What is a Large Language Model post for more information.) LLMs are static models that have been trained on an enormous amount of text in order to learn a mapping function that predicts the next word. You give it "the quick brown fox" and it has been trained to produce "jumped" as the most likely next word.

On to the boggling: imagine that you have a news article about a "woozle wuzzle", a phrase that you've never seen before. You're going to find the phrase "woozle wuzzle" spread throughout that article. In your corpus as a whole, the frequency of "woozle wuzzle" is effectively zero. But within that one article, when the word "woozle" appears, it's highly likely that the next word will be "wuzzle". So if you're a model whose goal is to maximize the likelihood of predicting the next word, you'd better figure out how to get your "wuzzle" on. You can't just memorize "woozle wuzzle" because you'd have to remember a zillion of these odd combinations that show up throughout text. So what are you going to do? You're going to generalize! You're going to learn that a document may contain instances of word A followeded by word B and that, when it does, you need to remember the pairing within that document. More specifically, you're going to have to learn how to learn: how to remember and reference new phrases on the fly.
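To see why pure memorization can't get its "wuzzle" on, here's a toy sketch (my own illustration, not anything from the papers) of a next-word predictor that just counts word pairs in its training corpus. A phrase it never saw during training leaves it with nothing to say:

```python
from collections import Counter, defaultdict

def train_bigrams(corpus):
    """Count, for each word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    words = corpus.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Predict the most frequent follower seen in training, if any."""
    if counts[word]:
        return counts[word].most_common(1)[0][0]
    return None

model = train_bigrams("the quick brown fox jumped over the lazy dog")
print(predict_next(model, "fox"))     # jumped
# "woozle" never appeared in training, so memorization offers nothing:
print(predict_next(model, "woozle"))  # None
```

A model that only memorizes training-time statistics is stuck here; the generalization described above is what lets it pick up "woozle wuzzle" from the article itself.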
Memory vs Learning
I hear the heckler in the back row shouting "It's called Machine Learning for a reason, jacka**!". Machine learning occurs only during the training phase, when the model learns its mapping function. Once that's done, school's out!

"But when I use ChatGPT it remembers what I said earlier in the conversation!" That's a clever trick. When you use a GPT, you are interacting with a fixed model. Everything that it knows and learns about your conversation is contained solely within that conversation; the model itself remains completely unchanged. So how does ChatGPT "remember" what you said earlier? It doesn't! It simply includes the entire conversation each time you add a new prompt. If you've ever gone back to a ChatGPT conversation and edited an earlier prompt, you'll have noticed that you lose all of the responses after it. This is why. The same is true if your conversation is longer than the model's allowed input window: it simply won't know what you said earlier in the conversation. The sad reality is that we currently don't have scalable, performant model architectures that include a working memory. But oh brother are folks working their keisters off to make one!
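The "clever trick" can be sketched in a few lines. This is a simplified mock-up of what a chat client does, not ChatGPT's actual implementation, and the window size and word-based "token" count are made up for illustration:

```python
MAX_TOKENS = 50  # hypothetical context-window size

def build_prompt(history, new_message, max_tokens=MAX_TOKENS):
    """Resend the whole transcript every turn, truncated to the window.
    (A crude word count stands in for real tokenization.)"""
    turns = history + [new_message]
    # Drop the oldest turns until everything fits in the window.
    while sum(len(t.split()) for t in turns) > max_tokens and len(turns) > 1:
        turns = turns[1:]
    return "\n".join(turns)

history = ["User: My name is Rob.", "Assistant: Nice to meet you, Rob!"]
prompt = build_prompt(history, "User: What's my name?")
# The model only "knows" your name because it is literally in the prompt.
```

Once the transcript outgrows the window, the oldest turns fall off the front, which is exactly why the model "forgets" the start of a long conversation.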
Neural Networks: The Wild West
Any more hecklers? No? Ok then, back to the boggling! We don't really understand the inner workings of neural nets, and we certainly don't know how to control them. We come up with some clever ideas, create an architecture, cross our fingers and let the sucker go. (It's slightly more scientific than that, but only barely.) We do know that nets of sufficient complexity are capable of computing essentially anything. Specifically, it can be shown that they can implement a universal Turing machine (i.e. a computer). Yes, it's possible for a sufficiently complex neural net, with enough training, to just say "Screw this! I'm going to build a computer and simplify all of this!" The bottom line is that we can't tell the model what it should learn or how it should use what it learns. It figures all of that out on its own.

Our heckler is back and yelling "That's all crazy talk!" Let's go back to our "woozle wuzzle" example. In order to solve the problem of looking back over the text for previous instances of the current word, finding the word that came after it last time, and then predicting that the same completion will occur again, the network builds a specialized circuit called an "induction head". It then uses this circuit as the basis for most (all?) of the in-context learning that it is capable of. So not only did it build a unit of computation that it can use at inference time, it learned how to apply that unit in a number of novel ways.
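The behavior of that circuit is easy to state in code. Here's a toy, hand-written version of the look-back-and-copy rule; the real induction head is an attention pattern learned inside the network, not an explicit loop like this:

```python
def induction_predict(tokens):
    """Mimic an induction head: find the most recent previous occurrence
    of the current (final) token and predict the token that followed it."""
    current = tokens[-1]
    # Scan backwards from just before the current position.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current and i + 1 < len(tokens) - 1:
            return tokens[i + 1]
    return None

tokens = "the woozle wuzzle was spotted again when the woozle".split()
print(induction_predict(tokens))  # wuzzle
```

No "woozle wuzzle" statistics from training are needed; everything the rule uses is sitting right there in the context.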
Induction Heads: The Hidden Circuit
And if you think that "woozle wuzzle"-ing is crazy talk, papers released within the past few weeks are starting to uncover more about in-context learning. In this case, they're specifically trying to understand how the neural net is able to "rewire" itself at runtime (again, remember that the model is static) in the case of few-shot prompting. It may be that the model learns how to build a basic gradient-descent-based learning mechanism (which is how the model itself is trained!). Basically, it learned how to do linear regression. When you provide the trained model with examples followed by a question, the model uses the examples to reinterpret its own stored parameters so that the answer to the question is more likely to follow the pattern of the examples. How crazy is that?

In-context learning is just one of the capabilities unlocked by these recent models. Wait until we get to chain-of-thought reasoning! There's so much more to be boggled about! As always, reach out if anything is confusing or to share in your own bogg'dacity and bogg'daciousness!
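As a parting toy example, here's what the linear-regression hypothesis above amounts to, written out explicitly. The model isn't literally running this code; the claim is that something functionally similar may emerge inside the network when it processes a few-shot prompt:

```python
def icl_answer(examples, query):
    """Fit y = w*x + b to the prompt's (x, y) examples by closed-form
    least squares, then answer the query the way the fit predicts."""
    n = len(examples)
    sx = sum(x for x, _ in examples)
    sy = sum(y for _, y in examples)
    sxx = sum(x * x for x, _ in examples)
    sxy = sum(x * y for x, y in examples)
    w = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - w * sx) / n
    return w * query + b

# Few-shot prompt: "2 -> 5, 3 -> 7, 4 -> 9. 10 -> ?"
print(icl_answer([(2, 5), (3, 7), (4, 9)], 10))  # 21.0
```

The examples live entirely in the "prompt", the fit happens entirely at inference time, and nothing persistent changes: the same rewire-at-runtime flavor the papers are chasing.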