Large Language Models represent the fusion of simple prediction with emergent complexity, where teaching a machine to guess "what comes next" somehow births understanding. Like a child learning language through pattern recognition, these neural networks transform basic word prediction into sophisticated language processing, challenging our assumptions about intelligence and computation.
Photo Credit: Rob Grzywinski. Originally posted on February 22, 2023 on LinkedIn.
The Basic Building Block: Mapping Functions
All machine learning boils down to learning a mapping function, basically the y = f(x) that you learned in high school. You give the algorithm a bunch of inputs (x) and outputs (y) and its job is to learn how to map (f) between the two. As incredible as it sounds, that's it! No matter what fancy new accomplishment you read about in machine learning, it all boils down to the fact that the model learned some (likely highly complex) mapping.
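To make "learning a mapping" concrete, here is a minimal sketch in pure Python: we generate (x, y) pairs from a known rule, y = 2x + 1, then have the algorithm recover the slope and intercept on its own using plain gradient descent. The function names and learning rate are made up for illustration.

```python
# "Learning" the mapping f in y = f(x) from example pairs.
def fit_line(xs, ys, lr=0.01, steps=5000):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

xs = [0, 1, 2, 3, 4]
ys = [2 * x + 1 for x in xs]   # the "true" mapping the model must discover
w, b = fit_line(xs, ys)
print(round(w, 2), round(b, 2))  # close to 2.0 and 1.0
```

The model was never told the rule; it inferred w ≈ 2 and b ≈ 1 purely from input/output examples, which is the whole game.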
What Makes LLMs Tick?
What mapping function does a Large Language Model (LLM) learn? It simply learns to predict the next word in a sentence. That's it. No, really! During training, LLMs are fed a huge corpus of text, such as books, articles and websites, and their parameters are adjusted to maximize the likelihood of correctly predicting the next word in a sentence. If you feed it "the quick brown fox", it has been trained to return "jumped" as the most likely next word. It then takes "the quick brown fox jumped" and predicts "over". And so on.
Beyond Memorization: The Power of Generalization
Just like back in high school when you learned y = f(x), memorizing the answers to the test isn't the best study approach. The same is true with machine learning. Overfitting, the machine-learning term for memorization, results in a model that is not robust enough to accurately handle inputs it has never seen before. You want to design a model that is able to generalize from the data it was trained on so that it can accurately process new data. The how's and why's of generalization in machine learning are an active area of study and we're just starting to understand the basics. The fact that these models are able to generalize at all is fascinating.
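A quick sketch of the contrast: both "models" below are trained on the same (x, y) pairs drawn from y = 3x, but only the one that learned the underlying rule can handle an input it has never seen. (The examples and names are invented for illustration.)

```python
# Memorization vs. generalization on training pairs from y = 3x.
train = {1: 3, 2: 6, 4: 12}

# "Overfit" model: a lookup table that memorizes the training answers.
def memorizer(x):
    return train.get(x)  # None for anything it never saw

# "Generalizing" model: infers the slope from the data, applies it anywhere.
slope = sum(y / x for x, y in train.items()) / len(train)
def generalizer(x):
    return slope * x

print(memorizer(3))     # None -- x=3 was never in the training data
print(generalizer(3))   # 9.0  -- applies the learned rule
```

The lookup table scores perfectly on its "training set" and fails on everything else, which is exactly what overfitting looks like.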
Neural Networks: Brain-Inspired Architecture
While there are lots of different forms of machine learning, for problems such as language, neural networks seem to be a good fit. Neural networks were inspired by our understanding of how a brain works. They are made up of layers of connected nodes, or neurons, which receive and process information before passing it on to the next layer. These layers can be thought of as a hierarchy of processing steps, where each layer learns to represent increasingly complex features of the input data. For example, in neural nets trained to read hand-written digits, it has been found that some layers are responsible for detecting edges, others detect circular shapes and others detect the angles that lines make. These generalizations simply emerged as the model was trained.
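A single layer of such a network is surprisingly simple: each neuron computes a weighted sum of its inputs plus a bias, then applies a nonlinearity. Stacking layers gives the hierarchy of processing steps described above. The weights below are made up purely for illustration; in a real network they are the parameters adjusted during training.

```python
# One neural-network layer in pure Python, using the ReLU nonlinearity.
def relu(v):
    return max(0.0, v)

def layer(inputs, weights, biases):
    # weights[i] holds one neuron's weight per input; one output per neuron.
    return [relu(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

x = [1.0, -2.0, 0.5]                      # 3 input features
hidden = layer(x, [[0.2, -0.5, 1.0],      # first layer: 2 neurons
                   [-0.3, 0.8, 0.1]], [0.1, 0.0])
output = layer(hidden, [[1.0, -1.0]], [0.0])  # second layer: 1 neuron
print(hidden, output)
```

Everything an LLM does is, at bottom, billions of these weighted sums composed layer after layer.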
Size Matters: Width, Depth, and Architecture
The width (the number of nodes per layer) and depth of a neural network affect what it can learn and how easy it is to train. Deeper networks have more capacity to learn complex representations of data, but tend to be more difficult to train and are more prone to overfitting. Neural nets with a depth greater than three layers are considered "deep learning" models. Wider networks have more parameters and may be better able to capture more intricate features of the data, but can also require more computational resources. In the case of GPT-3, for example, there are 96 layers and 175 billion parameters!

The way that the nodes are designed and how they connect to each other also affects how the network can learn. Currently, most LLMs use a specific architecture called a "transformer" (the "T" in "GPT") which allows the model to process and analyze massive amounts of text in a way that is both efficient and surprisingly effective. Transformers use attention mechanisms that allow the neural network to focus on specific parts of the input data. This can be particularly useful for tasks such as machine translation, where the model needs to attend to different parts of the input sentence at different times.

I hope this gives you enough information that you can start to form a picture in your head of what an LLM is and what it's been trained to do. Comment if anything is unclear or if you have additional questions!
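As a coda for the curious, the attention mechanism mentioned above can be sketched in a few lines of pure Python: each position scores every other position (a dot product of a "query" vector against "key" vectors), turns the scores into weights with a softmax, and returns a weighted mix of the "value" vectors. The vectors here are tiny and hand-made for illustration; real transformers learn them.

```python
# Bare-bones scaled dot-product attention (the heart of the transformer).
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    d = len(query)
    # Score the query against every key (scaled dot product).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    # Weighted sum of values: attend more to better-matching positions.
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

keys   = [[1.0, 0.0], [0.0, 1.0]]   # two positions in the "sentence"
values = [[10.0, 0.0], [0.0, 10.0]]
print(attention([1.0, 0.0], keys, values))  # leans toward the first value
```

Because the query matches the first key more closely, the output is pulled toward the first value vector; that "focus on the relevant part of the input" is what attention buys the model.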