In-context learning reveals how large language models absorb and apply new patterns at inference time without changing their underlying weights. These static systems achieve a surprisingly dynamic kind of understanding through clever architecture and emergent behavior, challenging our assumptions about what "learning" even means.
Photo Credit: Rob Grzywinski
Originally posted on February 23, 2023 on LinkedIn.
The "Woozle Wuzzle" Problem
Before I boggle your mind a little bit, let's briefly review what Large Language Models (LLMs) are. (See my What is a Large Language Model post for more information.) LLMs are static models that have been trained on an enormous amount of text in order to learn a mapping function that predicts the next word. You give it "the quick brown fox" and it has been trained to produce "jumped" as the most likely next word.

On to the boggling: imagine that you have a news article about a "woozle wuzzle", a phrase that you've never seen before. You're going to find the phrase "woozle wuzzle" spread throughout that article. In your corpus as a whole, the frequency of "woozle wuzzle" is effectively zero. But within that one article, when the word "woozle" appears, it's highly likely that the next word will be "wuzzle". So if you're a model whose goal is to maximize the likelihood of predicting the next word, you'd better figure out how to get your "wuzzle" on. You can't just memorize "woozle wuzzle" because you'd have to remember a zillion of these odd combinations that show up throughout text. So what are you going to do? You're going to generalize! You're going to learn that a document may contain instances of word A followeded by word B and that, when it does, you need to remember the pairing within that document. More specifically, you're going to have to learn how to learn: how to remember and reference new phrases on the fly.
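To see why pure memorization can't get its "wuzzle" on, here's a toy sketch (my own illustration, not anything from the papers) of a next-word predictor that just counts word pairs in its training corpus. A phrase it never saw during training leaves it with nothing to say:

```python
from collections import Counter, defaultdict

def train_bigrams(corpus):
    """Count, for each word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    words = corpus.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Predict the most frequent follower seen in training, if any."""
    if counts[word]:
        return counts[word].most_common(1)[0][0]
    return None

model = train_bigrams("the quick brown fox jumped over the lazy dog")
print(predict_next(model, "fox"))     # jumped
# "woozle" never appeared in training, so memorization offers nothing:
print(predict_next(model, "woozle"))  # None
```

A model that only memorizes training-time statistics is stuck here; the generalization described above is what lets it pick up "woozle wuzzle" from the article itself.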
Memory vs Learning
I hear the heckler in the back row shouting "It's called Machine Learning for a reason, jacka**!". Machine learning occurs only during the training phase, when the model learns its mapping function. Once that's done, school's out!

"But when I use ChatGPT it remembers what I said earlier in the conversation!" That's a clever trick. When you use a GPT, you are interacting with a fixed model. Everything that it knows and learns about your conversation is contained solely within that conversation; the model itself remains completely unchanged. So how does ChatGPT "remember" what you said earlier? It doesn't! It simply includes the entire conversation each time you add a new prompt. If you've ever gone back to a ChatGPT conversation and edited an earlier prompt, you'll have noticed that you lose all of the responses after it. This is why. The same is true if your conversation is longer than the model's allowed input window: it simply won't know what you said earlier in the conversation. The sad reality is that we currently don't have scalable, performant model architectures that include a working memory. But oh brother are folks working their keisters off to make one!
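The "clever trick" can be sketched in a few lines. This is a simplified mock-up of what a chat client does, not ChatGPT's actual implementation, and the window size and word-based "token" count are made up for illustration:

```python
MAX_TOKENS = 50  # hypothetical context-window size

def build_prompt(history, new_message, max_tokens=MAX_TOKENS):
    """Resend the whole transcript every turn, truncated to the window.
    (A crude word count stands in for real tokenization.)"""
    turns = history + [new_message]
    # Drop the oldest turns until everything fits in the window.
    while sum(len(t.split()) for t in turns) > max_tokens and len(turns) > 1:
        turns = turns[1:]
    return "\n".join(turns)

history = ["User: My name is Rob.", "Assistant: Nice to meet you, Rob!"]
prompt = build_prompt(history, "User: What's my name?")
# The model only "knows" your name because it is literally in the prompt.
```

Once the transcript outgrows the window, the oldest turns fall off the front, which is exactly why the model "forgets" the start of a long conversation.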
Neural Networks: The Wild West
Any more hecklers? No? Ok then, back to the boggling! We don't really understand the inner workings of neural nets, and we certainly don't know how to control them. We come up with some clever ideas, create an architecture, cross our fingers and let the sucker go. (It's slightly more scientific than that, but only barely.) We do know that nets of sufficient complexity are capable of computing essentially anything. Specifically, it can be shown that they can implement a universal Turing machine (i.e. a computer). Yes, it's possible for a sufficiently complex neural net, with enough training, to just say "Screw this! I'm going to build a computer and simplify all of this!" The bottom line is that we can't tell the model what it should learn or how it should use what it learns. It figures all of that out on its own.

Our heckler is back and yelling "That's all crazy talk!" Let's go back to our "woozle wuzzle" example. In order to solve the problem of looking back over the text for previous instances of the current word, finding the word that came after it last time, and then predicting that the same completion will occur again, the network builds a specialized circuit called an "induction head". It then uses this circuit as the basis for most (all?) of the in-context learning that it is capable of. So not only did it build a unit of computation that it can use at inference time, it learned how to apply that unit in a number of novel ways.
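The behavior of that circuit is easy to state in code. Here's a toy, hand-written version of the look-back-and-copy rule; the real induction head is an attention pattern learned inside the network, not an explicit loop like this:

```python
def induction_predict(tokens):
    """Mimic an induction head: find the most recent previous occurrence
    of the current (final) token and predict the token that followed it."""
    current = tokens[-1]
    # Scan backwards from just before the current position.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current and i + 1 < len(tokens) - 1:
            return tokens[i + 1]
    return None

tokens = "the woozle wuzzle was spotted again when the woozle".split()
print(induction_predict(tokens))  # wuzzle
```

No "woozle wuzzle" statistics from training are needed; everything the rule uses is sitting right there in the context.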
Induction Heads: The Hidden Circuit
And if you think that "woozle wuzzle"-ing is crazy talk, papers released within the past few weeks are starting to uncover more about in-context learning. In this case, they're specifically trying to understand how the neural net is able to "rewire" itself at runtime (again, remember that the model is static) in the case of few-shot prompting. It may be that the model learns how to build a basic gradient-descent-based learning mechanism (which is how the model itself is trained!). Basically, it learned how to do linear regression. When you provide the trained model with examples followed by a question, the model uses the examples to reinterpret its own stored parameters so that the answer to the question is more likely to follow the pattern of the examples. How crazy is that?

In-context learning is just one of the capabilities unlocked by these recent models. Wait until we get to chain-of-thought reasoning! There's so much more to be boggled about! As always, reach out if anything is confusing or to share in your own bogg'dacity and bogg'daciousness!
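As a parting toy example, here's what the linear-regression hypothesis above amounts to, written out explicitly. The model isn't literally running this code; the claim is that something functionally similar may emerge inside the network when it processes a few-shot prompt:

```python
def icl_answer(examples, query):
    """Fit y = w*x + b to the prompt's (x, y) examples by closed-form
    least squares, then answer the query the way the fit predicts."""
    n = len(examples)
    sx = sum(x for x, _ in examples)
    sy = sum(y for _, y in examples)
    sxx = sum(x * x for x, _ in examples)
    sxy = sum(x * y for x, y in examples)
    w = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - w * sx) / n
    return w * query + b

# Few-shot prompt: "2 -> 5, 3 -> 7, 4 -> 9. 10 -> ?"
print(icl_answer([(2, 5), (3, 7), (4, 9)], 10))  # 21.0
```

The examples live entirely in the "prompt", the fit happens entirely at inference time, and nothing persistent changes: the same rewire-at-runtime flavor the papers are chasing.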