Large Language Models represent the fusion of simple prediction with emergent complexity, where teaching a machine to guess "what comes next" somehow births understanding. Like a child learning language through pattern recognition, these neural networks transform basic word prediction into sophisticated language processing, challenging our assumptions about intelligence and computation.
Photo Credit: Rob Grzywinski. Originally posted on February 22, 2023 on LinkedIn.
The Basic Building Block: Mapping Functions
All machine learning boils down to learning a mapping function, basically the y = f(x) that you learned in high school. You give the algorithm a bunch of inputs (x) and outputs (y) and its job is to learn how to map (f) between the two. As incredible as it sounds, that's it! No matter what fancy new accomplishment you read about in machine learning, it all boils down to the fact that the model learned some (likely highly complex) mapping.
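To make "learning a mapping" concrete, here is a minimal sketch in pure Python: we generate (x, y) pairs from a known rule, y = 2x + 1, then have the algorithm recover the slope and intercept on its own using plain gradient descent. The function names and learning rate are made up for illustration.

```python
# "Learning" the mapping f in y = f(x) from example pairs.
def fit_line(xs, ys, lr=0.01, steps=5000):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

xs = [0, 1, 2, 3, 4]
ys = [2 * x + 1 for x in xs]   # the "true" mapping the model must discover
w, b = fit_line(xs, ys)
print(round(w, 2), round(b, 2))  # close to 2.0 and 1.0
```

The model was never told the rule; it inferred w ≈ 2 and b ≈ 1 purely from input/output examples, which is the whole game.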
What Makes LLMs Tick?
What mapping function does a Large Language Model (LLM) learn? It simply learns to predict the next word in a sentence. That's it. No, really! During training, LLMs are fed a huge corpus of text, such as books, articles and websites, and their parameters are adjusted to maximize the likelihood of correctly predicting the next word in a sentence. If you feed it "the quick brown fox", it has been trained to return "jumped" as the most likely next word. It then takes "the quick brown fox jumped" and predicts "over". And so on.
Beyond Memorization: The Power of Generalization
Just like back in high school when you learned y = f(x), memorizing the answers to the test isn't the best study approach. The same is true with machine learning. Overfitting, the machine-learning term for memorization, results in a model that is not robust enough to accurately handle inputs it has never seen before. You want to design a model that is able to generalize from the data it was trained on so that it can accurately process new data. The how's and why's of generalization in machine learning are an active area of study and we're just starting to understand the basics. The fact that these models are able to generalize at all is fascinating.
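A quick sketch of the contrast: both "models" below are trained on the same (x, y) pairs drawn from y = 3x, but only the one that learned the underlying rule can handle an input it has never seen. (The examples and names are invented for illustration.)

```python
# Memorization vs. generalization on training pairs from y = 3x.
train = {1: 3, 2: 6, 4: 12}

# "Overfit" model: a lookup table that memorizes the training answers.
def memorizer(x):
    return train.get(x)  # None for anything it never saw

# "Generalizing" model: infers the slope from the data, applies it anywhere.
slope = sum(y / x for x, y in train.items()) / len(train)
def generalizer(x):
    return slope * x

print(memorizer(3))     # None -- x=3 was never in the training data
print(generalizer(3))   # 9.0  -- applies the learned rule
```

The lookup table scores perfectly on its "training set" and fails on everything else, which is exactly what overfitting looks like.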
Neural Networks: Brain-Inspired Architecture
While there are lots of different forms of machine learning, for problems such as language, neural networks seem to be a good fit. Neural networks were inspired by our understanding of how a brain works. They are made up of layers of connected nodes, or neurons, which receive and process information before passing it on to the next layer. These layers can be thought of as a hierarchy of processing steps, where each layer learns to represent increasingly complex features of the input data. For example, in neural nets trained to read hand-written digits, it has been found that some layers are responsible for detecting edges, others detect circular shapes and others detect the angles that lines make. These generalizations simply emerged as the model was trained.
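A single layer of such a network is surprisingly simple: each neuron computes a weighted sum of its inputs plus a bias, then applies a nonlinearity. Stacking layers gives the hierarchy of processing steps described above. The weights below are made up purely for illustration; in a real network they are the parameters adjusted during training.

```python
# One neural-network layer in pure Python, using the ReLU nonlinearity.
def relu(v):
    return max(0.0, v)

def layer(inputs, weights, biases):
    # weights[i] holds one neuron's weight per input; one output per neuron.
    return [relu(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

x = [1.0, -2.0, 0.5]                      # 3 input features
hidden = layer(x, [[0.2, -0.5, 1.0],      # first layer: 2 neurons
                   [-0.3, 0.8, 0.1]], [0.1, 0.0])
output = layer(hidden, [[1.0, -1.0]], [0.0])  # second layer: 1 neuron
print(hidden, output)
```

Everything an LLM does is, at bottom, billions of these weighted sums composed layer after layer.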
Size Matters: Width, Depth, and Architecture
The width (the number of nodes per layer) and depth of a neural network affect what it can learn and how easy it is to train. Deeper networks have more capacity to learn complex representations of data, but tend to be more difficult to train and are more prone to overfitting. Neural nets with a depth greater than three layers are considered "deep learning" models. Wider networks have more parameters and may be better able to capture more intricate features of the data, but can also require more computational resources. In the case of GPT-3, for example, there are 96 layers and 175 billion parameters!

The way that the nodes are designed and how they connect to each other also affects how the network can learn. Currently, most LLMs use a specific architecture called a "transformer" (the "T" in "GPT") which allows the model to process and analyze massive amounts of text in a way that is both efficient and surprisingly effective. Transformers use attention mechanisms that allow the neural network to focus on specific parts of the input data. This can be particularly useful for tasks such as machine translation, where the model needs to attend to different parts of the input sentence at different times.

I hope this gives you enough information that you can start to form a picture in your head of what an LLM is and what it's been trained to do. Comment if anything is unclear or if you have additional questions!
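As a coda for the curious, the attention mechanism mentioned above can be sketched in a few lines of pure Python: each position scores every other position (a dot product of a "query" vector against "key" vectors), turns the scores into weights with a softmax, and returns a weighted mix of the "value" vectors. The vectors here are tiny and hand-made for illustration; real transformers learn them.

```python
# Bare-bones scaled dot-product attention (the heart of the transformer).
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    d = len(query)
    # Score the query against every key (scaled dot product).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    # Weighted sum of values: attend more to better-matching positions.
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

keys   = [[1.0, 0.0], [0.0, 1.0]]   # two positions in the "sentence"
values = [[10.0, 0.0], [0.0, 10.0]]
print(attention([1.0, 0.0], keys, values))  # leans toward the first value
```

Because the query matches the first key more closely, the output is pulled toward the first value vector; that "focus on the relevant part of the input" is what attention buys the model.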