This article explains how LLMs (Large Language Models) work, assuming only that you know how to add and multiply two numbers. The article is completely self-contained: we start by building a simple generative AI with pen and paper, then work through everything needed for a solid understanding of modern LLMs and the transformer architecture. The article strips away all the fancy language and jargon from ML and represents everything simply as numbers, so that when you later read jargon-laden material, you can tie it back to these ideas.

Going from addition and multiplication to today's most advanced AI models, without assuming other knowledge or pointing you to other sources, means covering a lot of ground. This is not a toy LLM explanation: a determined person could, in principle, recreate a modern LLM from the information here. I have cut out every unnecessary word and line, so the article is not really meant for skimming.

What will we cover?

  1. Simple neural networks
  2. How is this model trained?
  3. How does all this generate language?
  4. Why do LLMs work so well?
  5. Embeddings
  6. Subword tokenization
  7. Self-attention
  8. SoftMax
  9. Residual connections
  10. Layer normalization
  11. Dropout
  12. Multi-head attention
  13. Positional embeddings
  14. GPT architecture
  15. Transformer architecture

Let's dive in.

The first thing to note is that neural networks can only take numbers as input and can only output numbers. There are no exceptions. The art is in figuring out how to feed your input in as numbers and how to interpret the output numbers so that they achieve your goal. Finally, we build a neural network that takes the inputs you provide and gives the outputs you want (given the interpretation you chose for those outputs). Let's look at how we get from adding and multiplying numbers all the way to models like Llama 3.1.

Simple Neural Networks:

Let's work through a simple neural network that can classify objects.

  • Available object data: dominant color (RGB) and volume (in milliliters)
  • Classification: Leaf or flower

Here is the data for leaves and sunflowers.

Now let's build a neural network that performs this classification. We need to decide on input and output interpretations. Our inputs are already numbers, so we can feed them directly into the network. Our outputs are two objects, leaf and flower, which the neural network cannot output directly. Let's look at a couple of schemes we could use here:

  • We can make the network output a single number. If the number is positive, we say it’s a leaf, and if it’s negative, we say it’s a flower.
  • Or we can make the network output two numbers. We interpret the first as the leaf number and the second as the flower number, and the prediction is whichever number is larger.

Both schemes let the network output numbers that we can interpret as leaf or flower. We will use the second scheme here, because it generalizes nicely to the things we will look at later. Here is a neural network that performs this classification using that scheme. Let's work through it:

For example, the blue circle: (32 * 0.10) + (107 * -0.29) + (56 * -0.07) + (11.2 * 0.46) = -26.6

Some jargon:

Neuron/Node: the number in a circle

Weight: the colored number on a line

Layer: A collection of neurons is called a layer. This network can be thought of as having 3 layers.

To calculate the prediction/output of this network (called a "forward pass"), we start from the left. We have the data available for the neurons in the input layer. To move "forward" to the next layer, we multiply the number in each circle by the weight on the line connecting it to a neuron in the next layer, and add the products up. We showed the math for the blue and orange circles above. Running the entire network, the first number in the output layer comes out higher, so we interpret this as "the network classified these (RGB, vol) values as leaf." A well-trained network can take various (RGB, vol) inputs and classify the objects correctly.
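To make the forward pass concrete, here is a minimal sketch in Python (NumPy). The input values and the blue neuron's weights come from the example above; the orange neuron's weights and the output-layer weights are made up for illustration, so the orange neuron's value will not match the -47.1 in the figure.

```python
import numpy as np

# Input from the article: (R, G, B, volume) for a leaf-like object
x = np.array([32.0, 107.0, 56.0, 11.2])

# Weights from the input layer to the 2-neuron middle layer.
# The first row reproduces the "blue circle" math from the article;
# the second row is made up for illustration.
W1 = np.array([
    [0.10, -0.29, -0.07, 0.46],   # blue neuron
    [-0.17, -0.05, 0.39, -0.21],  # orange neuron (illustrative)
])

# Weights from the middle layer to the 2-neuron output layer (illustrative).
W2 = np.array([
    [-0.30, 0.12],   # "leaf" output neuron
    [0.08, -0.25],   # "flower" output neuron
])

hidden = W1 @ x          # multiply-and-add for each middle neuron
output = W2 @ hidden     # multiply-and-add for each output neuron

print(hidden)            # first entry is about -26.6, matching the article
print("leaf" if output[0] > output[1] else "flower")
```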

This model has no concept of what a leaf or a flower is, or of what (RGB, vol) means. It simply takes in 4 numbers and gives out 2 numbers. The decision that "if the first output number is larger, it's a leaf" is our interpretation. And finally, it is up to us to choose the weights so that the model takes our input numbers and gives us the right two numbers.

An interesting side effect of this is that we can take the same network and, instead of feeding it RGB and volume, feed it 4 different numbers like cloud cover, humidity, etc., and interpret the two outputs as "sunny in an hour" or "rainy in an hour." If the weights are well calibrated, the same network can do both tasks. The network just gives you two numbers; whether you interpret them as a classification, a prediction, or something else is entirely up to you.

A few things left out for simplicity (skipping them does not impair understanding):

  • Activation layer: One important thing missing from this network is an "activation layer." That's a fancy term for applying a non-linear function to the number in each circle (ReLU is a common one: it takes a number and sets it to 0 if it's negative, leaving it unchanged if it's positive). So in the case above, we would take the middle layer and replace its two numbers (-26.6 and -47.1) with 0 before proceeding to the next layer. Of course, we would need to retrain the weights to make the network useful again. Without an activation layer, all the additions and multiplications in the network collapse into a single layer. In our case, you could write the green circle directly as a weighted sum of RGB, with no middle layer needed: (0.10 * -0.17 + 0.12 * 0.39 - 0.36 * 0.1) * r + (-0.29 * -0.17 - 0.05 * 0.39 - 0.21 * 0.1) * g, and so on. With a non-linearity this is generally impossible, which helps the network handle more complex situations.
  • Bias: Networks generally also include another number per node that is simply added to the weighted sum when calculating the node's value; this number is called the "bias." So if the bias of the top blue node were 0.25, its value would be (32 * 0.10) + (107 * -0.29) + (56 * -0.07) + (11.2 * 0.46) + 0.25 = -26.35. The word "parameters" is generally used to refer to all of these trainable numbers in the model (weights and biases), as opposed to the neuron/node values themselves.
  • Softmax: We generally don't interpret the output layer directly as the model's answer; we convert the numbers into probabilities (i.e., make sure every number is positive and they all sum to 1). If all the numbers in the output layer were already positive, one way to do this would be to divide each number by the sum of all the numbers in the output layer. The "softmax" function, which handles both positive and negative numbers, is what is normally used (a small sketch of ReLU and softmax follows this list).
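A minimal sketch of ReLU and softmax as described above (the specific input numbers are just for illustration):

```python
import numpy as np

def relu(values: np.ndarray) -> np.ndarray:
    # Negative numbers become 0, positive numbers pass through unchanged.
    return np.maximum(values, 0.0)

def softmax(values: np.ndarray) -> np.ndarray:
    # Exponentiating makes every number positive (even the negative ones);
    # dividing by the sum makes them add up to 1, so they read as probabilities.
    exps = np.exp(values - values.max())  # subtracting the max avoids overflow
    return exps / exps.sum()

print(relu(np.array([-26.6, -47.1])))    # [0. 0.]
print(softmax(np.array([0.6, 0.4])))     # ~[0.55, 0.45], sums to 1
```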

How is this model trained?

In the example above, we magically had weights that let us put data into the model and get good outputs. But how are these weights determined? The process of setting these weights (or "parameters") is called "training" the model, and we need training data to do it.

Let's assume we have some data where we already know whether each input corresponds to a leaf or a flower. This is our "training data," and because it has a leaf/flower label for each set of (R, G, B, vol) numbers, it is "labeled data."

Here’s how it works:

  • Start with random numbers: set every parameter/weight to a random value.
  • Now, say we input the data corresponding to a leaf (R = 32, G = 107, B = 56, vol = 11.2). We want the leaf number in the output layer to be the larger one. Let's say we want the number corresponding to leaf to be 0.8 and the number corresponding to flower to be 0.2 (these numbers are just for illustrating training; in reality we wouldn't want 0.8 and 0.2 here, they would be probabilities, and we would want them to be 1 and 0).
  • We know what numbers we want in the output layer, and we know the numbers we actually get from the randomly chosen parameters (which will be different from what we want). So for each neuron in the output layer, we take the difference between the number we want and the number we have, and then add those differences up. For example, if the output layer has 0.6 and 0.4 in its two neurons, we get (0.8 - 0.6) = 0.2 and (0.4 - 0.2) = 0.2, for a total of 0.4. We can call this the "loss." Ideally we want the loss to be close to 0; in other words, we want to "minimize the loss."
  • Once we have the loss, we can slightly change each parameter and see whether increasing or decreasing it increases or decreases the loss. This is called the "gradient" of that parameter. Then we can move each parameter a small amount in the direction where the loss goes down. Once all the parameters have been nudged slightly, the loss should be lower.
  • Keep repeating this process and the loss keeps shrinking, until eventually you have a "trained" set of weights/parameters. This entire process is called "gradient descent" (a small code sketch follows this list).
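A minimal sketch of this loop using numerical gradients, under an illustrative setup (a single linear layer, made-up data, and the simple "sum of differences" loss; real training computes the gradients with calculus, but the idea is the same):

```python
import numpy as np

# Tiny labeled dataset: (R, G, B, vol) -> [leaf_target, flower_target]
inputs = np.array([[32.0, 107.0, 56.0, 11.2],     # a leaf
                   [290.0, 280.0, 110.0, 22.5]])  # a flower (made-up numbers)
targets = np.array([[1.0, 0.0],
                    [0.0, 1.0]])

weights = np.random.randn(2, 4) * 0.01   # start with small random parameters

def loss(w):
    preds = inputs @ w.T                  # forward pass for every example
    return np.abs(targets - preds).mean() # average loss over all examples

lr, eps = 1e-5, 1e-4
for epoch in range(1000):
    grad = np.zeros_like(weights)
    for i in range(weights.shape[0]):
        for j in range(weights.shape[1]):
            nudged = weights.copy()
            nudged[i, j] += eps           # nudge one parameter a tiny bit...
            grad[i, j] = (loss(nudged) - loss(weights)) / eps  # ...and see how the loss changes
    weights -= lr * grad                  # move every parameter against its gradient
print(loss(weights))                      # much smaller than the starting loss
```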

A few notes:

  • You often have multiple training examples, and slightly changing the weights to minimize the loss for one example may worsen the loss for another. One way to deal with this is to define the loss as the average loss across all the examples and take the gradient of that average loss. Each such iteration is called an "epoch." You then keep repeating epochs, finding weights that reduce the average loss.
  • You don't actually need to "wiggle the weights" to calculate the gradient of each weight; it can be inferred from the formula (e.g., if the weight is 0.17 in the last step, and the value of the neuron is positive, and we want a larger number in the output, we can see that increasing this number to 0.18 would help).

In practice, training deep networks is a difficult and complex process because gradients can easily spiral out of control during training, moving to 0 or infinity ("vanishing gradient" and "exploding gradient" problems). The simple definition of loss we discussed here is perfectly valid, but it is rarely used because there are better functional forms that fit specific purposes. Training modern models with billions of parameters presents unique challenges requiring large computing resources (memory constraints, parallelization, etc.).

How does all this help generate language?

Neural networks take some numbers, do some math based on the trained parameters, and give out some other numbers. Everything else is about interpretation, and about training the parameters (i.e., setting them to good numbers). If we can interpret two output numbers as "leaf/flower" or "rain or sun in an hour," we can also interpret them as "the next character of a sentence."

However, English has more than 2 characters, so we need to expand the output layer to have a neuron for each of the 26 letters (and let's also throw in some symbols like space, period, etc.). Each neuron corresponds to a character, and we say the output character is whichever neuron in the output layer has the highest number. Now we have a network that can take an input and output a character.

What if we replace the input to our network with the characters "Humpty Dumpt"? We would like the network to complete the rhyme, so with well-chosen weights it should output "y." One problem remains: how do we feed this list of characters into the network? Our network only accepts numbers!

One simple solution is to assign a number to each character. Let's say a = 1, b = 2, and so on. Now we can input "Humpty Dumpt" and train the network to give us "y." Our network looks like this.

Now that we can give the network a list of characters and have it predict one character ahead, we can use this to build out entire sentences. For example, after predicting "y," we append that "y" to the list of characters we have and ask the network to predict the next character. If it is well trained, it should give us a space, and so on. Eventually we should be able to recursively generate "Humpty Dumpty sat on a wall." We have generative AI; we have a network that can generate language! In practice nobody actually feeds in arbitrarily assigned numbers like this, and we will see smarter schemes shortly. If you can't wait, check out the encoding section in the appendix.

Astute readers will note that, as the diagram is set up, we can't actually feed "Humpty Dumpty" into the network: there are only 12 neurons in the input layer, one for each character of "Humpty Dumpt" (including the space). So how do we put the "y" in for the next pass? Adding a 13th neuron would require modifying the entire network, which isn't workable. The solution is simple: we drop the "H" and send the most recent 12 characters. So we send "umpty Dumpty," and the network predicts a space. Then we feed it "mpty Dumpty " (the last 12 characters again) and it produces "s," and so on. It looks like this.
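A minimal sketch of this sliding-window generation loop (it assumes a hypothetical `predict_next_char` function standing in for the trained network, and a context of 12 characters as in the example):

```python
CONTEXT_LENGTH = 12

def predict_next_char(context: str) -> str:
    # Stand-in for the trained network: in reality this would convert the
    # 12 characters to numbers, run a forward pass, and pick the character
    # whose output neuron has the highest value.
    rhyme = "Humpty Dumpty sat on a wall."
    position = rhyme.find(context)
    return rhyme[position + len(context)]

text = "Humpty Dumpt"
while not text.endswith("."):
    context = text[-CONTEXT_LENGTH:]     # keep only the last 12 characters
    text += predict_next_char(context)   # append the prediction and repeat
print(text)  # Humpty Dumpty sat on a wall.
```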

Note that we are throwing away a lot of information by the last line, where we feed the model only "sat on a wall." So what do today's latest networks do? More or less exactly this. The length of the input we can feed into a network is fixed (determined by the size of the input layer), and it is called the "context length": the context the network is given to make its next prediction. Modern networks can have very large context lengths (many thousands of words), which helps. There are some methods for feeding in sequences of unbounded length, but their performance, while impressive, has since been surpassed by models with large (but fixed) context lengths.

One other thing a careful reader might notice is that we use different interpretations for the same character on the input and output sides. For example, when we input "h" we simply feed in the number 8, but on the output side we don't ask the model for a single number (8 for "h," 9 for "i," etc.); instead we ask it for 26 numbers and interpret the output as "h" if the 8th number is the highest. Why not use the same, consistent interpretation at both ends? We could, but for language, giving ourselves the freedom to choose between different interpretations lets us build better models, and the most effective currently known interpretations for the input and the output happen to differ. In fact, the way we are feeding numbers into this model is not the best way to do it, and we will look at better methods shortly.

Why do large language models work so well?

Generating "Humpty Dumpty sat on a wall" character by character is a far cry from what modern LLMs can do. There are several differences and innovations that take us from the simple generative AI above to human-like bots. Let's go through them.

Embeddings

Remember we said that the way we were feeding characters into the model is not the best way: we just arbitrarily assigned a number to each character. What if there were better numbers we could assign that would let us train better networks? How do we find those better numbers? Here's a clever trick.

When we trained the models above, we did it by shifting the weights around and seeing which changes gave us a smaller loss, slowly and iteratively changing the weights. At each iteration we would:

  • Feed the input
  • Calculate the output layer
  • Compare it to the ideally desired output and calculate the average loss.
  • Adjust the weights and start over

In this process, the inputs are fixed. That made sense when the inputs were (RGB, vol). But the numbers we now feed in for a, b, c, etc., were picked arbitrarily by us. So what if, at each iteration, in addition to shifting the weights a bit, we also shifted the inputs a bit, checking whether we can lower the loss by using different numbers to represent "a," and so on? We are definitely reducing the loss and making the model better (that is the direction we shift the input for "a" in, by design). Essentially, we apply gradient descent not just to the weights but also to the number representation of the inputs, which were arbitrary numbers to begin with. This is called an "embedding." It is a mapping from inputs to numbers and, as we just saw, it needs to be trained. The process of training embeddings is much like training parameters. One big advantage, though, is that once an embedding is trained, it can be reused in other models if desired, and the same embedding is used consistently to represent a given token/character/word.

We talked about embeddings as having one number per character. However, in reality embeddings have more than one number, because it is hard to capture the richness of a concept with a single number. Look back at our leaf and flower example: we had four numbers for each object (the size of the input layer). Each of those four numbers conveyed a property, and the model could use all of them to effectively guess the object. If we only had one number, say just the red channel of the color, the model's job would be much harder. We are trying to capture human language here; we will need more than one number.

So instead of representing each character with a single number, can we represent it with multiple numbers to capture the richness? Let's assign a bunch of numbers to each character. We will call an ordered collection of numbers a "vector" (ordered, in the sense that each number has a position, and swapping the positions of two numbers gives a different vector: in the leaf/flower data, if we swapped the R and G numbers for a leaf, we would get a different color, so it would no longer be the same vector). The length of a vector is simply how many numbers it contains. We will assign a vector to each character. Two questions arise:

  • If we assign a vector to each character instead of a number, how do we now feed "Humpty Dumpt" into the network? The answer is simple. Say we assign a vector of 10 numbers to each character. Then instead of an input layer of 12 neurons, we use an input layer of 120 neurons, since each of the 12 characters of "Humpty Dumpt" contributes 10 numbers. We just place the neurons side by side and we are good to go.
  • How do we find these vectors? Thankfully, we just learned how to train embedding numbers, and training embedding vectors is no different. Instead of 12 inputs you now have 120, and all you do is shift them around to minimize the loss. Afterward, you take the first 10 of them and that is the vector corresponding to "h," and so on.

All the embedding vectors must, of course, be the same length; otherwise we would have no way of feeding arbitrary character combinations into the network. For example, "Humpty Dumpt" and, in the next iteration, "umpty Dumpty": in both cases we are feeding in 12 characters, and if each of the 12 characters were not represented by a vector of length 10, we could not reliably fit them all into an input layer of length 120. Let's visualize these embedding vectors:

We will call an ordered collection of same-sized vectors a matrix. The matrix above is called the embedding matrix. You tell it the column number corresponding to your character, and looking at that column of the matrix gives you the vector used to represent that character. This applies more generally to embedding any collection of things: you just need as many columns as the things you want to embed.
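A minimal sketch of an embedding-matrix lookup (the matrix values and the character-to-column assignment here are made up for illustration; in a real model the matrix entries are trained along with the weights):

```python
import numpy as np

chars = "abcdefghijklmnopqrstuvwxyz "           # our vocabulary of characters
char_to_column = {c: i for i, c in enumerate(chars)}

embedding_length = 10
# One column per character, 10 numbers per column (random placeholders).
embedding_matrix = np.random.randn(embedding_length, len(chars))

def embed(text: str) -> np.ndarray:
    # Look up each character's column and lay the vectors side by side,
    # giving 12 characters * 10 numbers = 120 inputs for "humpty dumpt".
    columns = [embedding_matrix[:, char_to_column[c]] for c in text]
    return np.concatenate(columns)

print(embed("humpty dumpt").shape)  # (120,)
```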

Subword Tokenization

So far we have been working with characters as the basic building block of language. This has its limitations: the neural network weights have to do a lot of heavy lifting, since they must make sense of specific sequences of characters appearing next to each other (i.e., words), and then of words appearing next to other words. What if we assigned embeddings directly to words and had the network predict the next word? The network doesn't understand anything beyond numbers anyway, so we can assign a 10-length vector to "humpty," "dumpty," "sat," "on," etc., feed it two words, and have it give us the next word. A "token" is the term for a single unit that we embed and feed to the model. Our models so far have used characters as tokens; now we are proposing to use entire words as tokens (you could of course use entire sentences or phrases as tokens if you wanted).

Using word tokenization has one profound effect on the model. There are more than 180K words in English, so with our output-interpretation scheme of one neuron per possible output, we would need hundreds of thousands of neurons in the output layer instead of around 26. With the hidden-layer sizes needed to achieve meaningful results in modern networks, this issue is less pressing. What is worth noting, however, is that since we treat each word separately and start each with an arbitrary embedding, very similar words (e.g., "cat" and "cats") start with no relationship at all. You would expect the model to learn that the two words are related, but can we somehow use this obvious similarity to get a head start and simplify the problem?

Yes, we can. The most common tokenization scheme in today's language models is to break words into subwords and embed those. In the cats example, we would break "cats" into two tokens, "cat" and "##s." Now it's easier for the model to grasp the concept of "##s" following other familiar words, and so on. This also reduces the number of tokens we need (SentencePiece is a common tokenizer, with vocabulary-size options in the tens of thousands versus the hundreds of thousands of words in English). A tokenizer takes the input text (e.g., "Humpty Dumpt"), splits it into tokens, and gives you the numbers you need to look up the embedding vectors for those tokens in the embedding matrix. For example, with character-level tokenization and the embedding matrix arranged as above, the tokenizer would first split "Humpty Dumpt" into the characters ['h', 'u', …, 't'] and then hand back the numbers [8, 21, …, 20], because you look up the 8th column of the embedding matrix to get the vector for "h," the 21st for "u," and so on. Unlike before, the model is fed the embedding vector, not the number 8. The arrangement of columns in the matrix is completely arbitrary; as long as we always look up the same column for "h" and get the same vector every time, we are fine. Tokenizers just give us arbitrary (but fixed) numbers to make the lookup easy; what we really need from them is to split the sentence into tokens.
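To make the idea concrete, here is a toy greedy longest-match subword tokenizer with a made-up vocabulary (real tokenizers such as SentencePiece or BPE learn their vocabulary from data and use more sophisticated algorithms; this only shows words breaking into reusable pieces):

```python
# Made-up subword vocabulary; "##" marks a piece that continues a word.
vocab = ["humpty", "dumpty", "sat", "on", "a", "wall", "cat", "##s", "##ed"]

def tokenize_word(word: str) -> list[str]:
    tokens, rest, prefix = [], word, ""
    while rest:
        # Take the longest vocabulary entry that matches the start of what's left.
        for size in range(len(rest), 0, -1):
            candidate = prefix + rest[:size]
            if candidate in vocab:
                tokens.append(candidate)
                rest, prefix = rest[size:], "##"
                break
        else:
            return [word]   # no match at all: keep the word whole (toy fallback)
    return tokens

print(tokenize_word("cats"))    # ['cat', '##s']
print(tokenize_word("dumpty"))  # ['dumpty']
```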

With embeddings and subword tokenization, the model can look like this.

The next few sections cover more recent advances in language modeling, the ones that have made LLMs as powerful as they are today. To understand them, though, there are a few basic mathematical concepts you need to know:

  • Matrices and matrix multiplication
  • The general concept of functions in mathematics
  • Raising numbers to powers (e.g., a^3 = a*a*a)
  • Sample mean, variance, and standard deviation

I have added a summary of these concepts in the appendix.

Self-Attention

So far we have seen just one simple neural network structure (called a feedforward network): one with a number of layers, where each layer is fully connected to the next (i.e., there is a line connecting every pair of neurons in consecutive layers). But as you can imagine, there is nothing stopping us from removing connections, adding connections, or building more complex structures. Let's explore a particularly important structure: self-attention.

If you look at the structure of human language, the next word we want to predict depends on all the words that came before it, but it may depend on some of them more than others. For example, if we are trying to predict the next word in "Damian had a secret child, a girl, and all his belongings, along with the magical orb, will belong to ____," the word here could be "her" or "him," and it depends heavily on a much earlier word in the sentence: girl/boy.

The good news is that our simple feedforward model connects to all the words in the context, so it can learn appropriate weights for the important words. But here is the problem: the weights connecting particular positions through the feedforward layers are fixed. If the important word were always in the same position, the network would learn the weights appropriately and we would be fine. However, the word relevant to the next prediction could be anywhere in the sentence. In the sentence above, when guessing "her" or "him," one very important word for the prediction is girl/boy, regardless of where it appears. So we need weights that depend not just on the position but also on the content at that position. How do we achieve this?

Self-attention does something like adding up the embedding vectors for each word, but instead of directly adding them, it applies a weight to each. So if the embedding vectors for humpty, dumpty, and sat are x1, x2, and x3 respectively, it multiplies each by a weight (a number) before adding them up. Something like output = 0.5 * x1 + 0.25 * x2 + 0.25 * x3, where the output is the self-attention output. If we write the weights as u1, u2, u3, then output = u1 * x1 + u2 * x2 + u3 * x3. So how do we find these weights u1, u2, u3?

Ideally, we want these weights to depend on the vectors we are adding: as we saw, some vectors may matter more than others. But matter more to whom? To the word we are about to predict. So we also want the weights to depend on the word we are about to predict. Now, that is a problem: of course we don't know the word we are about to predict before we predict it. So self-attention uses the word immediately preceding the one we are about to predict, i.e., the last word available in the sentence (I don't really know why this and not something else, but a lot of things in deep learning are trial and error, and I suspect this simply works well).

Great. So we want weights for these vectors, and we want each weight to depend on the word we are aggregating and on the word immediately preceding the one we are about to predict. Essentially, we want a function u1 = F(x1, x3), where x1 is the word we are weighting and x3 is the last word in the sequence we have (assuming we only have 3 words). A straightforward way to achieve this is to have a vector for x1 (call it k1) and a separate vector for x3 (call it q3) and take their dot product. This gives a number, and it depends on both x1 and x3. How do we get the vectors k1 and q3? We build a tiny single-layer neural network that maps x1 to k1 (and x2 to k2, x3 to k3, etc.), and another network that maps x3 to q3, and so on. In matrix notation, we basically have weight matrices Wk and Wq such that k1 = Wk * x1 and q1 = Wq * x1, etc. Now we can take the dot product of k1 and q3 to get a scalar, so u1 = F(x1, x3) = (Wk * x1) · (Wq * x3).

One additional thing in self-attention is that we don't take the weighted sum of the embedding vectors themselves. Instead, we take a weighted sum of some "value" of each embedding vector, obtained through yet another small single-layer network. So, similar to k1 and q1, we also have a v1 for the word x1, obtained through a matrix Wv as v1 = Wv * x1, and it is these v's that get aggregated. So if we have only 3 words and are trying to predict the fourth, the whole thing looks like this.

The plus signs represent simple vector addition, which requires the vectors to have the same length. One last modification not shown here: the scalars u1, u2, u3, etc., will not necessarily add up to 1. If we want them to behave like weights, we need them to add up to 1, so we apply a familiar trick here and run them through the softmax function.
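A minimal sketch of this single-head self-attention calculation in NumPy, under the 3-word setup above (the embedding and matrix values are random placeholders; only the structure, keys/queries/values plus a softmax over the scores, follows the description):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10                                   # length of each embedding vector

# Embedding vectors for "humpty", "dumpty", "sat" (random placeholders).
x = rng.normal(size=(3, d))              # x[0]=x1, x[1]=x2, x[2]=x3

# The three trainable matrices of the self-attention block.
Wk = rng.normal(size=(d, d))
Wq = rng.normal(size=(d, d))
Wv = rng.normal(size=(d, d))

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

keys = x @ Wk.T                          # k1, k2, k3
values = x @ Wv.T                        # v1, v2, v3
query = Wq @ x[-1]                       # q3, from the last available word

scores = keys @ query                    # u1, u2, u3 = k_i . q3
weights = softmax(scores)                # make the weights sum to 1
output = weights @ values                # weighted sum of the values

print(weights, output.shape)             # 3 weights summing to 1, a length-10 vector
```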

This is self-attention. There is also "cross-attention," where q3 comes from the last word but the k's and v's come from a different sentence altogether. This is valuable in, for example, translation tasks. Now we know what attention is.

All of this can now be put in a box and called a "self-attention block." Basically, a self-attention block takes in embedding vectors and spits out a single output vector of a user-chosen length. The block has three parameter matrices, Wk, Wq, and Wv, and it doesn't need to be more complicated than that. There are many such blocks in the machine learning literature, and they are usually represented as boxes with names in diagrams, something like this:

One thing to note about self-attention is that, so far, position appears not to matter. We use the same W matrices throughout, so switching Humpty and Dumpty wouldn't really change anything: all the numbers would end up the same. This means that while attention can work out what to pay attention to, it doesn't depend on the position of the words. However, we know word positions matter in English, and we can improve performance by giving the model some sense of a word's position.

And so, when attention is used, we often don't feed the embedding vectors directly into the self-attention block. Later we will see how "positional encoding" is added to the embedding vectors before they are fed into attention blocks.

Note for the initiated: those who have read about self-attention before will notice that we are not referencing the K and Q matrices in the usual way or applying masks, etc. That is because of how these models are usually trained: a batch of data is fed in, and the model is simultaneously trained to predict "dumpty" from "humpty," "sat" from "humpty dumpty," and so on. This is a matter of training efficiency and does not affect interpretation or the model's outputs, so we chose to skip such efficiency hacks here.

SoftMax

We talked briefly about softmax in the very first note. Here is the problem softmax tries to solve: in our output interpretation we have as many neurons as there are options, and we say the network's choice is the neuron with the highest value. Then we said we would calculate the loss as the difference between the value the network provides and the ideal value we want. But what is that ideal value? In the leaf/flower example we set it to 0.8. But why 0.8? Why not 5, or 10, or 10 million? The higher the better for that training example, so ideally we would want infinity there! But that would make the problem intractable: all the losses would be infinite, and our plan of minimizing the loss by nudging parameters (remember "gradient descent"?) would fail. How do we deal with this?

One simple thing we can do is cap the values we want, say between 0 and 1. That would make all losses finite, but now we have a new problem: what happens when the network overshoots? Say it outputs (5, 1) for leaf and flower in one case and (0, 1) in another. The first case made the right choice, but its loss is worse! So we need a way to transform the outputs of the last layer into the (0, 1) range while preserving their order. We can use any function to do this (a "function" in mathematics is simply a mapping from one number to another: you put one number in, you get another out). One possible option is the logistic function (see the graph below), which maps all numbers to numbers between 0 and 1.

Now we have a number between 0 and 1 for each neuron in the last layer, and we can calculate the loss by setting the correct neuron to 1, the others to 0, and taking the difference between that and what the network gives us. This works, but can we do better?

Going back to our "Humpty Dumpty" example: suppose we are generating "dumpty" character by character and the model makes a mistake on one character. Instead of giving "m" the highest value in the last layer, it gives "u" the highest value, but "m" is a close second.

Now we could continue with "duu" and try to predict the next character, but since there aren't many good continuations of "humpty duu..", the model's confidence will be low. On the other hand, "m" was a close second, so we could also give "m" a go, predict the next few characters, and see what happens. Maybe it gives us a better overall word?

So what we are really saying is: don't just blindly pick the maximum value, try a few. What's a good way to do that? Well, we have to assign a probability to each option; say we pick the top one 50% of the time, the second 25% of the time, and so on. That's a reasonable approach. But perhaps we would like the chances to depend on the underlying model predictions. If the model's predictions for "m" and "u" are close to each other here (compared to the other values), then maybe exploring those two with a 50-50 chance is a good idea.

So we need a nice rule that takes all these numbers and converts them into probabilities. That is what softmax does. It is a generalization of the logistic function above, but with additional features: if you give it 10 arbitrary numbers, it gives you 10 outputs, each between 0 and 1, and, importantly, all 10 add up to 1, so they can be interpreted as probabilities. You will find softmax as the last layer in nearly every language model.
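A minimal sketch of sampling the next character from softmax probabilities instead of always taking the maximum (the raw output numbers here are made up):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

chars = ["m", "u", "x"]
raw_outputs = np.array([2.9, 3.0, -1.0])   # "u" is highest, "m" a close second

probs = softmax(raw_outputs)               # roughly [0.47, 0.52, 0.01]
rng = np.random.default_rng()
next_char = rng.choice(chars, p=probs)     # usually "u", but often "m" too
print(dict(zip(chars, probs.round(2))), next_char)
```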

Residual Connections

Over the preceding sections we have slowly changed how we visualize networks: we now use boxes/blocks to denote concepts. This notation is particularly useful for the concept of residual connections. Let's look at a residual connection combined with a self-attention block.

We simplify things by putting "input" and "output" in boxes, but fundamentally, it’s just the same collection of neurons/numbers as shown above.

So what is happening here? We take the output of the self-attention block and, before passing it to the next block, add the original input back to it. The first thing to note is that this requires the dimensions of the self-attention output to match the dimensions of the input. Since, as we mentioned, the self-attention output length is chosen by the user, this is not a problem. But why do this? We won't go into all the details, but the key point is that as networks get deeper (with more layers between input and output), training becomes increasingly difficult, and residual connections have been shown to help with these training difficulties.
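A minimal sketch of the idea, with the self-attention block abstracted as any function whose output has the same length as its input (the `block` here is a stand-in, not a full attention implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 10
W = rng.normal(size=(d, d)) * 0.1

def block(x: np.ndarray) -> np.ndarray:
    # Stand-in for a self-attention (or any other) block whose output
    # has the same length as its input.
    return np.maximum(W @ x, 0.0)

def block_with_residual(x: np.ndarray) -> np.ndarray:
    return block(x) + x        # add the original input back to the block's output

x = rng.normal(size=d)
print(block_with_residual(x).shape)   # (10,)
```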

Layer Normalization

Layer normalization is a fairly simple layer that takes the data coming into it, subtracts its mean, and divides by its standard deviation (there is a bit more on standard deviation in the appendix). For example, to apply layer normalization right after the input, we take all the neurons in the input layer and compute two statistics: their mean and their standard deviation. Say the mean is M and the standard deviation is S. Layer normalization then replaces each of these neurons with (x - M) / S, where x denotes the original value of that neuron.

Now how does this help? It essentially stabilizes the input vector and helps with training deep networks. One concern is that by normalizing the inputs we may be removing useful information that could help us learn something valuable about our target. To address this, the layer-norm layer has a scale and a bias parameter: it multiplies each normalized neuron by a scalar and then adds a bias, and these scalar and bias values are trainable parameters. This allows the network to learn any variation that turns out to be valuable for the predictions. And since these are the only parameters, the layer-norm block doesn't have many parameters to train. The whole thing looks something like this.

The scale and bias are trainable parameters. You can see that layer normalization is a relatively simple block: each number is operated on only point-wise (after the initial mean and standard deviation are calculated). It is reminiscent of an activation layer (e.g., ReLU), with the main difference being that here we have a few trainable parameters (far fewer than in other layers, because of the simple point-wise operation).
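A minimal sketch of layer normalization as described (the scale and bias start at 1 and 0 here; in a real model they are trained):

```python
import numpy as np

def layer_norm(x: np.ndarray, scale: np.ndarray, bias: np.ndarray) -> np.ndarray:
    mean = x.mean()
    std = x.std()                            # see the standard-deviation appendix
    normalized = (x - mean) / (std + 1e-5)   # small constant avoids dividing by 0
    return scale * normalized + bias         # trainable point-wise scale and shift

x = np.array([32.0, 107.0, 56.0, 11.2])
d = len(x)
print(layer_norm(x, scale=np.ones(d), bias=np.zeros(d)))
```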


Disclaimer: experienced ML professionals will note that there is no discussion of batch normalization here. In fact, we haven't introduced the concept of batching at all in this article. For the most part, batching is another training-acceleration concept that is not core to understanding the main ideas (with the possible exception of batch norm, which we do not need here).

Dropout

Dropout is a simple yet effective way to avoid overfitting in models. Overfitting is a term for when a model is trained too well on the training data, and it works well on that dataset but does not perform well on examples it has not seen. Techniques that help avoid overfitting are called "regularization techniques," and dropout is one of them.

When you train a model, it makes errors and overfits in certain ways. If you train another model, it does the same, but in different ways. What if you trained several such models and averaged their outputs? This is generally called an "ensemble of models," and you predict by combining the ensemble's outputs; ensembles generally perform better than the individual models.

You can do the same in neural networks. You can build several (slightly different) models and then combine their outputs to get a better model. However, this can be computationally expensive. Dropout does not create an ensemble model, but it is a technique that captures some of the essence of the concept.

The concept is simple: by inserting a dropout layer, during training you randomly delete a certain percentage of the direct neuron connections between the layers it sits between. Taking our initial network and inserting a dropout layer between the input and the middle layer with a 50% dropout rate, it might look like this.

This forces the network to train with a lot of redundancy. Essentially, you are training many different models at the same time, but they share weights.

Now for inference, we could follow the same approach as with an ensemble of models: make several predictions with dropout turned on and then combine them. However, that is computationally intensive, and since the models share common weights, why not just make a prediction using all the weights at once? That should give us some approximation of what the ensemble would provide.

One issue, though: a model trained with 50% of the weights will have very different numbers flowing through its middle neurons than one using all the weights. What we want is more ensemble-style averaging. How do we get that? Well, a simple way is to multiply all the weights by 0.5, since we are now using twice as many weights. This is what dropout does during inference: it uses the full network with all the weights, and simply multiplies the weights by (1 - p), where p is the dropout probability. And this has been shown to work quite well as a regularization technique.
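A minimal sketch of dropout applied to a layer's values (dropping values rather than drawing out individual connections, which is how it is usually implemented; the numbers are placeholders):

```python
import numpy as np

rng = np.random.default_rng(2)

def dropout(x: np.ndarray, p: float, training: bool) -> np.ndarray:
    if training:
        mask = rng.random(x.shape) >= p   # randomly keep each value with probability 1 - p
        return x * mask
    return x * (1.0 - p)                  # at inference: use everything, scaled down

x = np.array([1.0, -2.0, 3.0, 0.5, -1.5, 2.5])
print(dropout(x, p=0.5, training=True))   # roughly half the values zeroed out
print(dropout(x, p=0.5, training=False))  # all values, multiplied by 0.5
```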

Multi-Head Attention

This is the key block of the transformer architecture. We have already seen what an attention block is, and the output of a single attention block is a vector whose length is chosen by the user (the length of the v's). Multi-head attention essentially runs several attention heads in parallel (all taking the same inputs), and then we take all their outputs and simply concatenate them. It looks like this.

Keep in mind that the arrows going from v1 -> v1h1 etc. are linear layers: there is a matrix on each arrow that applies a transformation. I just didn't show them to avoid clutter.

What is happening here is that we generate the same keys, queries, and values as before for each head; however, we then apply a linear transformation on top of them (a separate one for each k, q, v and for each head) before using them. These extra layers do not exist in plain self-attention.

As an aside, this strikes me as a slightly surprising way of building multi-head attention. For example, why not just add a new layer and give each head its own separate Wk, Wq, Wv matrices, rather than sharing these weights and adding a transformation on top? If you know, let me know; I really don't.
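A minimal sketch of multi-head attention as described: one shared set of keys/queries/values, a separate (made-up) linear transformation of them per head, and the head outputs concatenated (all numbers are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(3)
d, n_heads = 10, 3

x = rng.normal(size=(3, d))               # embeddings for 3 words

# Shared k, q, v matrices (as in single-head self-attention).
Wk, Wq, Wv = (rng.normal(size=(d, d)) for _ in range(3))

# One extra linear transformation per head, applied on top of k, q, v.
head_Wk = [rng.normal(size=(d, d)) for _ in range(n_heads)]
head_Wq = [rng.normal(size=(d, d)) for _ in range(n_heads)]
head_Wv = [rng.normal(size=(d, d)) for _ in range(n_heads)]

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def attention(keys, query, values):
    weights = softmax(keys @ query)
    return weights @ values

keys, values = x @ Wk.T, x @ Wv.T
query = Wq @ x[-1]

head_outputs = []
for h in range(n_heads):
    out = attention(keys @ head_Wk[h].T,    # head-specific transform of the k's
                    head_Wq[h] @ query,     # ... of the query
                    values @ head_Wv[h].T)  # ... of the v's
    head_outputs.append(out)

multi_head_output = np.concatenate(head_outputs)  # length 3 * 10 = 30
print(multi_head_output.shape)
```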

Positional Encoding and Embeddings

We briefly talked about the motivation for positional encoding in the self-attention section. What is it? The figure shows "positional encoding," but using positional embeddings is more common than encodings, so here we will talk about the common positional embedding (the appendix also covers the positional encoding used in the original paper). A positional embedding is just like an ordinary embedding, except that instead of embedding the word vocabulary, we embed the numbers 1, 2, 3, etc. So it is a matrix of the same vector length as the word embeddings, with one column per position. That's really all there is to it.
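A minimal sketch of adding positional embeddings to word embeddings before they go into the attention block (both matrices are random placeholders; in a real model both are trained):

```python
import numpy as np

rng = np.random.default_rng(4)
d, vocab_size, max_positions = 10, 27, 12

word_embeddings = rng.normal(size=(d, vocab_size))          # one column per token
position_embeddings = rng.normal(size=(d, max_positions))   # one column per position

token_ids = [7, 20, 12]   # made-up column numbers for three tokens
vectors = [word_embeddings[:, t] + position_embeddings[:, pos]   # element-wise add
           for pos, t in enumerate(token_ids)]
print(np.stack(vectors).shape)  # (3, 10): three position-aware vectors
```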

GPT Architecture

Let's talk about the GPT architecture, which is what most GPT-style models use. If you have been following the article so far, this should be fairly trivial to understand. Using the box notation, this is what the architecture looks like at a high level.

At this point, all the blocks other than the "GPT transformer block" have been covered in detail. The + sign here simply means the two vectors are added together (which means the two embeddings must be the same size). Let's look at this GPT transformer block.

And that's pretty much it. It is called a "transformer" block here because it is derived from, and is a type of, the transformer architecture, which we will look at in the next section. This doesn't affect understanding, since we have already covered all the building blocks shown. Let's recap everything we have covered on the way to this GPT architecture:

  • We have seen how neural networks take numbers as input and output other numbers, having weights as parameters to train the model.
  • We can attach interpretations to these input/output numbers and give the neural network real meaning.
  • We can chain neural networks together to create bigger networks, and we can call each one a "block" and denote it with a box to make diagrams easier. Each block still does the same thing: take in a bunch of numbers and output a bunch of numbers.
  • We learned about different types of blocks to achieve different purposes.
  • GPT is just a special arrangement of these blocks shown above with the interpretation discussed in Part 1.

This architecture has been modified over time as companies have built up to the powerful modern LLMs, but the basics remain the same.

Now this GPT transformer is actually called a "decoder" in the original transformer paper that introduced the transformer architecture. Let’s take a look at it.

Transformer Architecture

This is one of the key innovations driving the rapid acceleration in the capabilities of recent language models. The transformer not only improved prediction quality, it is also easier and more efficient to train than previous models, allowing for larger model sizes. The GPT architecture above is based on it.

Looking at the GPT architecture, you can see it is great for generating the next word in a sequence: it fundamentally follows the same logic we discussed in Part 1, starting with a few words and generating one more at a time. But what if we wanted to do translation? Say we have a sentence in German (e.g., "Wo wohnst du?" = "Where do you live?") and we want to translate it into English. How do we train the model to do that?

Well, the first step would be to figure out how to input German words, which means expanding our embeddings to include both German and English. Now, here is a simple way of inputting the information: why not just concatenate the German sentence at the beginning of whatever English has been generated so far and feed that as the context? To make it easier for the model, we can add a separator. At each step this would look something like this.

This is effective, but there is room for improvement.

  • If the context length is fixed, sometimes the original sentence gets lost.
  • The model has a lot to learn here: two languages at once, plus it needs to learn that the separator token marks where it should start translating.
  • The whole German sentence is processed with a different offset for each word we generate. That means the same sentence has a different internal representation at each successive step, and the model has to be able to work with all of them to do the translation.

The transformer was originally created for exactly this task, and it consists of two separate blocks, an "encoder" and a "decoder." One block simply takes the German sentence and produces an intermediate representation (again, fundamentally just a bunch of numbers); this block is called the encoder.

The second block generates words (we have seen plenty of that so far). The only difference is that, in addition to feeding it the words generated so far, we also feed it the encoded German sentence (from the encoder block). So as it generates language, its context is effectively all the words generated so far plus the German sentence. This block is called the decoder.

Each of these encoders and decoders consists of a number of blocks, notably attention blocks interleaved with other layers. Let's look at the diagram of the transformer from the paper "Attention Is All You Need" and try to understand it.

The vertical set of blocks on the left is called the "encoder," and the set on the right is called the "decoder." Let's go over anything we haven't already covered.

Summary of how to read the diagram: Each box here is a block that takes some input in the form of neurons and spits out a set of neurons as output to be processed by the next block or interpreted. The arrows show where the output of a block goes. As you can see, we often take the output of one block and feed it as input to several blocks. Let’s look at each of these.

Feedforward: a "feedforward network" is one that does not contain cycles. Our original network in Section 1 is a feedforward network, and in fact this block uses a very similar structure: it contains two linear layers, with a ReLU (see ReLU in the first section) and a dropout layer. Keep in mind that this feedforward network is applied to each position independently: the information at position 0 has its own feedforward network, position 1 has its own, and so on, but there is no cross-linking between the neurons of one position and the feedforward network of another. This is important because if there were, it would let the network cheat during training by looking ahead.

Cross attention: you will notice the decoder has a multi-head attention block with arrows coming from the encoder. What is going on here? Remember the values, keys, and queries in self-attention and multi-head attention? They all came from the same sequence, and the query was just from the last word of that sequence. So what if we kept the query but took the values and keys from a completely different sequence? That is what is happening here: the values and keys come from the output of the encoder. Nothing has changed mathematically except where the inputs for keys and values come from.
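A minimal sketch of cross-attention: the same math as the earlier self-attention sketch, except the keys and values come from a different sequence (standing in for the encoder output); all numbers are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(5)
d = 10

decoder_x = rng.normal(size=(4, d))    # embeddings of the English words so far
encoder_out = rng.normal(size=(3, d))  # stand-in for the encoder's output (German side)

Wk, Wq, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

query = Wq @ decoder_x[-1]             # query still comes from the decoder's last word
keys = encoder_out @ Wk.T              # ...but keys and values come from the encoder
values = encoder_out @ Wv.T

weights = softmax(keys @ query)
output = weights @ values
print(weights, output.shape)           # 3 weights over the German positions, length-10 vector
```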

Nx: the Nx here simply means that this block is repeated N times in a chain; essentially, we stack blocks one after another, passing the output of one block as the input to the next. This is a way to make the network deeper. Now, the diagram might leave you confused about how the encoder output is fed to the decoder. Say N = 5: do we feed the output of each encoder layer into the corresponding decoder layer? No. We run the encoder all the way through once, take that single representation, and feed the same one into each of the 5 decoder layers.

Add & Norm block: this is basically a residual connection followed by layer normalization, both of which we covered above (the diagram's author seems to have been trying to save space by combining them into one box).

Everything else has already been covered. Now you have a complete explanation of the transformer architecture, built up entirely from simple addition and multiplication operations, and fully self-contained! You know what every line, every sum, every box, and every word means in terms of how to build it from scratch. Theoretically, these notes contain everything you need to code up a transformer from scratch. If you are interested, this repository does exactly that for the GPT architecture above.

Appendix

Matrix Multiplication

We introduced matrices and vectors in the context of embeddings above. A matrix has two dimensions (numbers of rows and columns). A vector can also be thought of as a matrix with one of its dimensions being one. The product of two matrices is defined as follows.

The dot represents multiplication. Now let's revisit the calculation of the blue and orange neurons from the very first figure. If we write the weights as a matrix and the inputs as a vector, the whole operation can be written as follows.

If the weight matrix is called "W" and the input is "x," then Wx is the result (the middle layer in this case). We could also transpose the two and write xW; it's a matter of preference.
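A small worked example reproducing the middle-layer calculation with a matrix-vector product (the first row of W and the input are from the article; the second row is illustrative):

```python
import numpy as np

W = np.array([[0.10, -0.29, -0.07, 0.46],     # blue neuron's weights (from the article)
              [-0.17, -0.05, 0.39, -0.21]])   # orange neuron's weights (illustrative)
x = np.array([32.0, 107.0, 56.0, 11.2])       # (R, G, B, vol)

print(W @ x)   # first entry is about -26.6: each entry is a row of W dotted with x
```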

Standard Deviation

We used the concept of standard deviation in the layer normalization section. The standard deviation is a statistical measure of how spread out the values (in a set of numbers) are; for example, if the values are all the same, the standard deviation is zero, and it is generally high when each value is far from the mean of those values. The formula for the standard deviation of a set of numbers a1, a2, a3, … (say n numbers) goes as follows: subtract the mean (of these numbers) from each number, square the answer for each of the n numbers, add up all those squares, divide by n, and then take the square root of the result.
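A short sketch of that formula, checked against NumPy (the numbers are arbitrary):

```python
import numpy as np

a = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = a.sum() / len(a)
variance = ((a - mean) ** 2).sum() / len(a)   # average squared distance from the mean
std = variance ** 0.5                          # square root of the variance

print(std, np.std(a))   # both print 2.0
```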

Positional Encoding

We talked about positional embeddings above. A positional encoding is simply a vector of the same length as the word embedding, except it encodes positions instead of words from the vocabulary: we assign a unique vector to each position (e.g., one vector for position 1, a different vector for position 2, and so on). A simple way of doing this would be to fill the vector for a position with the position number itself, so the vector for position 1 would be [1,1,1,…,1], for position 2 it would be [2,2,2,…,2], and so on. This can be problematic because we end up with large numbers in the vectors, which creates challenges during training. We could, of course, normalize these vectors by dividing every number by the maximum position: if there are 3 words in total, position 1 becomes [.33, .33, …, .33] and position 2 becomes [.67, .67, …, .67], etc. But now the encoding for position 1 keeps changing (it is different when we feed a 4-word sentence as input), which again makes it hard for the network to learn. So we want a scheme that gives each position a unique vector without the numbers blowing up. Essentially, if the context length is d (i.e., the maximum number of tokens/words the network can use to predict the next token/word; see the discussion in the "How does all this help generate language?" section) and the embedding vectors have length 10 (say), then we need a matrix with 10 rows and d columns where all the columns are unique and all the numbers lie between 0 and 1. Since there are infinitely many numbers between 0 and 1, such a matrix can be constructed for any finite d, and it can be done in many ways.

The approach used in the "Attention Is All You Need" paper goes as follows:

  • Draw 10 sine curves, one for each embedding dimension i, of the form si(p) = sin(p / 10000^(i/d)) (that is, 10000 raised to the power i/d).
  • Fill the encoding matrix with numbers from these curves: the number in row i, column p is si(p).

Why choose this scheme? Changing the power on 10000 changes the wavelength of the sine function when viewed along the p-axis, and with 10 sine curves of 10 different wavelengths, it takes a very long time (i.e., a very large p) before all 10 values repeat at the same time, which is what gives each position a unique vector. The actual paper uses both sine and cosine functions, and the encoding has the form si(p) = sin(p / 10000^(i/d)) if i is even, and si(p) = cos(p / 10000^(i/d)) if i is odd.
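A minimal sketch of building this encoding matrix, following the sine/cosine form above (`embedding_length` and `max_positions` are just illustrative sizes):

```python
import numpy as np

def positional_encoding(embedding_length: int, max_positions: int) -> np.ndarray:
    # One column per position p, one row per embedding dimension i.
    matrix = np.zeros((embedding_length, max_positions))
    for p in range(max_positions):
        for i in range(embedding_length):
            wavelength = 10000 ** (i / embedding_length)
            if i % 2 == 0:
                matrix[i, p] = np.sin(p / wavelength)
            else:
                matrix[i, p] = np.cos(p / wavelength)
    return matrix

enc = positional_encoding(embedding_length=10, max_positions=12)
print(enc.shape)   # (10, 12): a unique length-10 vector for each of 12 positions
```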
