This article synthesizes a comprehensive analysis of Large Language Models (LLMs), outlining the fundamental paradigm shift they represent in artificial intelligence and detailing the core technical pillars of their development and operation. The central insight is that universal models capable of handling diverse problems can be created by acquiring world and language knowledge through large-scale, self-supervised pre-training tasks. This has moved the field from training specialized systems on labeled data to a new paradigm of pre-training foundational models which are then adapted through fine-tuning, alignment, and prompting.

The lifecycle of an LLM is defined by five key stages:

  1. Pre-training: Foundation models like BERT and GPT are built using self-supervised objectives, such as masked or causal language modeling, on massive, unlabeled text corpora. This stage imbues the model with general linguistic and world knowledge.
  2. Generative Model Development: Modern LLMs are primarily decoder-only generative models. Their development involves scaling up model size, data, and compute, governed by predictable “scaling laws.” Key challenges include efficient distributed training and enabling models to process long sequences, which requires architectural innovations in attention mechanisms, KV caching, and positional embeddings.
  3. Prompting: The primary method for interacting with a trained LLM is through prompting. Prompt engineering has emerged as a critical discipline, with techniques ranging from simple instructions to advanced methods like Chain-of-Thought (CoT), which elicits complex reasoning by instructing the model to generate intermediate steps.
  4. Alignment: To ensure LLMs are helpful, harmless, and follow instructions, they undergo an alignment process. This typically involves two phases: Supervised Fine-Tuning (SFT) on high-quality instruction-response data, followed by Reinforcement Learning from Human Feedback (RLHF) or alternatives like Direct Preference Optimization (DPO). RLHF trains a “reward model” on human preference data and then uses it to fine-tune the LLM’s policy, steering its behavior toward desired outcomes.
  5. Inference: The operational phase of deploying an LLM involves generating responses to new prompts. This process is bifurcated into a compute-bound “prefilling” stage for the input prompt and a memory-bound “decoding” stage for generating the output token-by-token. Efficiency at this stage is critical and is achieved through sophisticated batching strategies, KV cache management, and advanced decoding algorithms. Inference-time scaling, through methods like Best-of-N sampling and step-level verification, further enhances performance on complex reasoning tasks without retraining the model.

1. The Foundational Paradigm: Pre-training

The development of LLMs represents a fundamental shift in AI research methodologies. The core idea is that “a significant amount of knowledge about the world can be learned by simply training these AI systems on huge amounts of unlabeled data.” This has moved the field from a paradigm of training specialized systems from scratch to one that leverages large-scale pre-training to create powerful foundation models.

1.1 Self-Supervised Pre-training Tasks

Pre-training is typically accomplished through self-supervised tasks on vast unlabeled text corpora. These tasks create their own supervision signals from the data itself.

| Pre-training Method | Description | Applicable Architectures |
| --- | --- | --- |
| Causal Language Modeling | Predicts the next token in a sequence given all preceding tokens. This is the standard objective for decoder-only models like GPT. | Decoder-only |
| Masked Language Modeling | Predicts randomly masked tokens in a sequence based on the surrounding unmasked tokens (bidirectional context). The core task for BERT. | Encoder-only, Encoder-Decoder |
| Permuted Language Modeling | A variation of MLM that predicts masked tokens in a random order, capturing dependencies beyond simple sequential context. | Encoder-only |
| Denoising Autoencoding | Trains an encoder-decoder model to reconstruct an original, clean sequence from a corrupted (e.g., masked or shuffled) input. | Encoder-Decoder |
| Discriminative Training | Creates classification tasks from unlabeled text, such as predicting whether two sentences are sequential (Next Sentence Prediction in BERT). | Encoder-only |
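
As a concrete illustration of how these objectives manufacture their own labels, the following sketch builds causal and masked language modeling training examples from a single token sequence. It is a minimal NumPy sketch; the token ids, mask token, and 15% masking rate are illustrative assumptions rather than settings from any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = np.array([12, 87, 5, 901, 33, 7, 245, 88])  # toy token ids (assumed)
MASK_ID = 0                                           # illustrative mask token id

# Causal language modeling: inputs are tokens[:-1], targets are tokens[1:],
# so every position predicts the next token from its left context only.
clm_inputs, clm_targets = tokens[:-1], tokens[1:]

# Masked language modeling: randomly mask ~15% of positions and predict
# only the masked tokens from the full bidirectional context.
mask = rng.random(len(tokens)) < 0.15
mlm_inputs = np.where(mask, MASK_ID, tokens)
mlm_targets = np.where(mask, tokens, -1)  # -1 marks positions excluded from the loss

print(clm_inputs, clm_targets)
print(mlm_inputs, mlm_targets)
```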

1.2 Model Architectures

The pre-training task is closely tied to the model’s architecture, which dictates how information flows and what context is available for predictions.

  • Decoder-only: Architectures like GPT use causal self-attention, where each token can only attend to preceding tokens. They are naturally suited for generative tasks.
  • Encoder-only: Architectures like BERT use fully visible self-attention, where each token can attend to all other tokens in the sequence. They excel at understanding and encoding sequences for downstream classification or extraction tasks.
  • Encoder-Decoder: These architectures, common in machine translation, use an encoder to process an input sequence and a decoder to generate an output sequence, with cross-attention linking the two.
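
The practical difference between these architectures largely comes down to the attention mask. The sketch below contrasts the causal mask of a decoder-only model with the fully visible mask of an encoder-only model; the sequence length is arbitrary.

```python
import numpy as np

seq_len = 5  # illustrative length

# Decoder-only (causal): token i may attend only to positions <= i.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Encoder-only (fully visible): every token may attend to every other token.
full_mask = np.ones((seq_len, seq_len), dtype=bool)

# An encoder-decoder model additionally uses cross-attention, where every
# decoder position may attend to every encoder position.
print(causal_mask.astype(int))
print(full_mask.astype(int))
```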

2. Building Generative Models (LLMs)

While the term LLM can broadly cover models like BERT, it is now commonly used to refer to large-scale generative models, primarily based on the decoder-only Transformer architecture.

2.1 Core Components and Training

  • Architecture: Most modern LLMs are decoder-only Transformers. Key architectural choices include the use of pre-layer normalization (output = input + F(LNorm(input))) for improved training stability, and advanced activation functions like GeLU, SwiGLU, or GeGLU in the feed-forward network (FFN) sub-layers. Some models also remove bias terms from affine transformations to further stabilize training.
  • Training Objective: The standard training objective is maximum likelihood estimation, where the model’s parameters (θ) are optimized to maximize the log-probability of sequences in the training dataset (D).
    • θ̂ = argmax_θ ∑_{x∈D} L_θ(x), where L_θ(x) = ∑_{i=1}^{m} log Pr_θ(x_i | x_0, …, x_{i−1})
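
The following sketch evaluates this log-likelihood for a single toy sequence, assuming the model is summarized by a matrix of next-token logits; the vocabulary size and values are illustrative, not from a real model.

```python
import numpy as np

def causal_lm_log_likelihood(logits, tokens):
    """Sum of log Pr_θ(x_i | x_0, ..., x_{i-1}) for i = 1..m.

    logits: (m, vocab) array where row i-1 holds the model's next-token
            logits after reading tokens[:i].
    tokens: (m+1,) token ids including the start token x_0.
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    targets = tokens[1:]
    return log_probs[np.arange(len(targets)), targets].sum()

# Toy example: vocabulary of 10 tokens, sequence x_0..x_4.
rng = np.random.default_rng(0)
tokens = np.array([1, 4, 7, 2, 9])
logits = rng.normal(size=(len(tokens) - 1, 10))
print(causal_lm_log_likelihood(logits, tokens))  # the quantity maximized over θ
```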

2.2 Training at Scale

Scaling up models, data, and compute is fundamental to LLM performance. This presents significant engineering challenges.

  • Scaling Laws: Research has demonstrated predictable relationships, often power laws, between model performance (test loss) and factors like model size (N), dataset size (D), and compute. The Chinchilla scaling law, for example, models the test loss as a sum of terms related to model and dataset size, plus an irreducible error:
    • L(N,D) = 406.4 * N^−0.34 + 410.7 * D^−0.28 + 1.69 (a minimal evaluation sketch follows this list)
  • Distributed Training: Training large models requires distributing the workload across many devices. Key parallelism strategies include:
    • Data Parallelism: Replicating the model on multiple workers, each processing a different subset of the data batch.
    • Pipeline Parallelism: Splitting the model’s layers across devices, creating a pipeline where micro-batches of data flow through the layers sequentially but with overlapping computation.
    • Tensor Parallelism: Splitting individual matrix operations (e.g., within a self-attention or FFN layer) across multiple devices.
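
As referenced under Scaling Laws above, the sketch below evaluates the Chinchilla loss formula for several ways of splitting a fixed compute budget between parameters and tokens. The budget, the candidate model sizes, and the C ≈ 6·N·D compute approximation are standard rules of thumb used here as assumptions.

```python
def chinchilla_loss(n_params, n_tokens):
    """Predicted test loss L(N, D) from the fitted Chinchilla form above."""
    return 406.4 * n_params ** -0.34 + 410.7 * n_tokens ** -0.28 + 1.69

budget = 1e23  # FLOPs budget (illustrative); compute approximated as C ≈ 6 * N * D
for n_params in [1e9, 1e10, 7e10, 3e11]:
    n_tokens = budget / (6 * n_params)
    print(f"N={n_params:.0e}  D={n_tokens:.0e}  L={chinchilla_loss(n_params, n_tokens):.3f}")
```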

2.3 Long Sequence Modeling

Standard Transformers have a computational cost that grows quadratically with sequence length, making it challenging to handle long contexts. Several techniques address this limitation:

  • Efficient Architectures: Methods like sparse attention, linear attention, and sliding window attention reduce the quadratic complexity of the attention mechanism.
  • KV Cache Management: The Key-Value (KV) cache, which stores keys and values for past tokens to avoid recomputation during decoding, grows linearly with sequence length. Techniques like fixed-size caches (e.g., sliding window, FIFO) and memory models (e.g., retrieval-augmented, compressive) manage this memory footprint. Multi-Query Attention (MQA) reduces the cache size by sharing keys and values across attention heads.
  • Position Extrapolation and Interpolation: To handle sequences longer than those seen during training, positional embedding methods must generalize.
    • Relative Positional Embeddings (e.g., T5 Bias): Add a learned bias to the attention score based on the relative distance between query and key tokens.
    • Rotary Positional Embedding (RoPE): Encodes positional information by rotating token embeddings in a complex space, which naturally handles relative positions and can be adapted to longer contexts through techniques like linear position interpolation.
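
A minimal NumPy sketch of RoPE with linear position interpolation follows; the head dimension, sequence length, and interpolation factor are illustrative assumptions, and the rotate-half pairing of dimensions is one common convention rather than the only one.

```python
import numpy as np

def rope(x, positions, base=10000.0, scale=1.0):
    """Apply rotary positional embedding to x of shape (seq_len, dim).

    scale > 1 implements linear position interpolation: positions are divided
    by scale so a longer context maps back into the trained position range.
    """
    seq_len, dim = x.shape
    half = dim // 2
    # Rotation frequency for each dimension pair.
    inv_freq = 1.0 / (base ** (np.arange(half) / half))
    # Position-dependent angles, shape (seq_len, half).
    angles = np.outer(positions / scale, inv_freq)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1, x2) pair by its angle.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# Example: queries for a context twice the training length can reuse the
# trained position range by setting scale=2.0.
q = np.random.default_rng(0).normal(size=(8, 64))
q_rot = rope(q, positions=np.arange(8), scale=2.0)
print(q_rot.shape)
```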

3. Interacting with LLMs: Prompting Techniques

Prompting is the primary interface for guiding an LLM to perform a task without updating its parameters. Effective prompt design, or prompt engineering, has become a critical skill.

3.1 Foundational Prompting Strategies

  • Clarity and Specificity: The most crucial principle is to describe the task clearly and precisely. Vague prompts lead to generic or irrelevant responses.
  • In-Context Learning (ICL): Providing examples (demonstrations) within the prompt, also known as few-shot prompting, allows the model to learn the desired task format and behavior from context.
  • Role Prompting: Instructing the model to adopt a specific persona (e.g., “You are a researcher with a deep background in psychology”) can significantly improve the quality and style of its output.

3.2 Advanced Prompting Methods

For complex tasks, especially those requiring reasoning, more sophisticated prompting techniques are necessary.

  • Chain-of-Thought (CoT) Prompting: This technique improves reasoning by instructing the model to generate intermediate, step-by-step thought processes before arriving at a final answer. This can be triggered by providing few-shot examples that include reasoning steps or by using simple instructions like “Let’s think step by step.”
  • Problem Decomposition: This involves breaking a complex problem into smaller, more manageable sub-problems. In least-to-most prompting, an LLM is first prompted to generate a sequence of sub-problems and then prompted to solve each one sequentially, using the answers to previous sub-problems as context.
  • Self-Refinement: This is an iterative process where an LLM generates an initial output, is then prompted to provide feedback or critique on that output, and finally prompted to refine the output based on its own feedback.
  • Ensembling: This combines multiple outputs to produce a better final result.
    • Prompt Ensembling: Using multiple different prompts for the same task and combining the results.
    • Output Ensembling (Self-Consistency): Generating multiple reasoning paths from a single CoT prompt via sampling, and selecting the final answer that appears most frequently among the outputs (see the sketch after this list).
  • Tool Use and Retrieval-Augmented Generation (RAG): To overcome knowledge cutoffs and improve factuality, LLMs can be augmented with external tools. In RAG, an external information retrieval system fetches relevant documents, which are then added to the prompt’s context to ground the LLM’s response.
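
As noted under Output Ensembling above, self-consistency reduces to sampling several reasoning paths and majority-voting the final answer. The sketch below assumes a hypothetical `generate` callable that returns a (reasoning, answer) pair; it stands in for whatever sampling interface the serving stack provides.

```python
from collections import Counter

def self_consistency(generate, prompt, n_samples=10, temperature=0.8):
    """Sample several chain-of-thought completions and majority-vote the answer.

    `generate` is an assumed interface, not a real API: any sampling-based
    LLM call that returns (reasoning, final_answer) for a CoT prompt.
    """
    answers = []
    for _ in range(n_samples):
        _reasoning, answer = generate(prompt, temperature=temperature)
        answers.append(answer)
    # The most frequent final answer across sampled reasoning paths wins.
    return Counter(answers).most_common(1)[0][0]
```

In practice the prompt would be a few-shot CoT prompt and `generate` a temperature-sampled call to the model, so that different runs explore different reasoning paths.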

3.3 Automated Prompting

Manual prompt design can be labor-intensive. Automated methods aim to discover optimal prompts programmatically.

  • Hard Prompt Learning (APE): Automatic Prompt Engineer (APE) uses an LLM to generate a diverse set of candidate prompts, evaluates them on a task, and selects the best-performing one.
  • Soft Prompt Learning: Instead of natural language tokens, soft prompts are learnable embedding vectors that are prepended to the input.
    • Prompt Tuning: Learns a small set of “soft prompt embeddings” while keeping the main LLM parameters frozen.
    • Prefix Tuning: A similar method that learns a prefix for each Transformer layer.
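
A minimal PyTorch sketch of prompt tuning follows: only the prepended soft prompt embeddings receive gradients while the base model stays frozen. The `base_model` is assumed to accept precomputed input embeddings (an `inputs_embeds` argument), which mirrors common Transformer APIs but is an assumption here.

```python
import torch
import torch.nn as nn

class PromptTunedModel(nn.Module):
    """Minimal prompt-tuning wrapper: only the soft prompt embeddings train."""

    def __init__(self, base_model, embed_layer, n_prompt_tokens=20):
        super().__init__()
        self.base_model = base_model
        self.embed = embed_layer
        for p in self.base_model.parameters():
            p.requires_grad = False  # freeze the LLM
        dim = embed_layer.embedding_dim
        self.soft_prompt = nn.Parameter(torch.randn(n_prompt_tokens, dim) * 0.02)

    def forward(self, input_ids):
        tok = self.embed(input_ids)                      # (batch, seq, dim)
        prompt = self.soft_prompt.unsqueeze(0).expand(tok.size(0), -1, -1)
        inputs_embeds = torch.cat([prompt, tok], dim=1)  # prepend soft prompt
        return self.base_model(inputs_embeds=inputs_embeds)  # assumed interface
```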

4. Aligning LLMs with Human Values and Instructions

Alignment is the process of steering LLM behavior to be helpful, honest, and harmless, ensuring its outputs align with human expectations and values. It is a crucial step after pre-training.

4.1 Supervised Fine-Tuning (SFT) for Instruction Alignment

The first step in alignment is typically SFT, where a pre-trained LLM is further trained on a high-quality dataset of instruction-response pairs.

  • Process: The model is trained to maximize the conditional probability of the desired response y given the instruction x. During training, the loss is only calculated on the response tokens, not the instruction tokens (see the masking sketch after this list).
  • Data: SFT datasets can be generated manually by human annotators or automatically by using a powerful “teacher” LLM to generate responses to a set of instructions. High-quality, diverse data is more important than sheer quantity.
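
As referenced under Process above, the sketch below shows response-only loss masking, using the common convention of labeling ignored positions with -100 so the cross-entropy loss skips them. The inputs are assumed to be token-id tensors produced by some tokenizer; this is a sketch of the masking logic, not a full training loop.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # positions with this label contribute no loss

def build_sft_labels(instruction_ids, response_ids):
    """Concatenate instruction and response; mask the instruction in the labels."""
    input_ids = torch.cat([instruction_ids, response_ids])
    labels = torch.cat([
        torch.full_like(instruction_ids, IGNORE_INDEX),  # no loss on the prompt
        response_ids,                                     # loss only on the response
    ])
    return input_ids, labels

def sft_loss(logits, labels):
    """Next-token cross-entropy over response positions only."""
    # Shift so position i predicts label i+1, as in causal language modeling.
    return F.cross_entropy(logits[:-1], labels[1:], ignore_index=IGNORE_INDEX)
```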

4.2 Human Preference Alignment: RLHF and DPO

While SFT teaches the model to follow instructions, it struggles to capture nuanced human preferences like tone, style, or safety. Human preference alignment methods address this.

4.2.1 Reinforcement Learning from Human Feedback (RLHF)

RLHF is a three-stage process that fine-tunes an LLM based on human preferences:

  1. Collect Human Preference Data: For a given set of prompts, generate multiple outputs from the SFT model. Human annotators then rank these outputs from best to worst.
  2. Train a Reward Model (RM): A separate LLM is trained to predict which response a human would prefer. It takes a prompt and a response as input and outputs a scalar reward score. The RM is typically trained on the pairwise preference data using a loss function that encourages the preferred response to receive a higher reward.
  3. Fine-Tune the LLM with RL: The SFT model (now the “policy”) is fine-tuned using reinforcement learning. The policy generates a response to a prompt, the RM evaluates it and provides a reward, and an algorithm like Proximal Policy Optimization (PPO) updates the policy’s parameters to maximize the expected reward. A KL-divergence penalty is used to prevent the policy from deviating too far from the original SFT model, which helps maintain language quality.
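
Two pieces of this pipeline are compact enough to sketch: the pairwise loss used to train the reward model (stage 2) and the KL-penalized reward optimized in the RL stage (stage 3). The sketch below is a minimal PyTorch rendering under those assumptions; the beta coefficient is illustrative.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen, reward_rejected):
    """Pairwise preference loss: push the preferred response's reward higher.

    reward_chosen / reward_rejected: scalar rewards from the reward model for
    the human-preferred and dispreferred responses to the same prompt.
    """
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

def kl_penalized_reward(reward, logprob_policy, logprob_ref, beta=0.1):
    """Reward optimized in the RL stage: the RM score minus a KL-style penalty
    that keeps the policy close to the SFT reference model (beta is illustrative)."""
    return reward - beta * (logprob_policy - logprob_ref)
```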

4.2.2 Direct Preference Optimization (DPO)

DPO is an alternative to RLHF that achieves a similar goal without explicitly training a reward model or using complex reinforcement learning. It directly optimizes the LLM on the preference data using a specialized loss function derived from the same underlying preference model as RLHF. This simplifies the training pipeline significantly.
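A minimal sketch of the DPO objective follows, assuming the per-response log-probabilities under the trained policy and the frozen reference model have already been computed; beta is an illustrative hyperparameter.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss for a batch of preference pairs.

    Each argument is the summed log-probability of a full response under the
    policy being trained (logp_*) or the frozen reference model (ref_logp_*).
    beta controls how strongly the policy may deviate from the reference.
    """
    policy_margin = logp_chosen - logp_rejected
    reference_margin = ref_logp_chosen - ref_logp_rejected
    return -F.logsigmoid(beta * (policy_margin - reference_margin)).mean()
```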

4.3 Advanced Alignment Techniques

  • Process-based vs. Outcome-based Supervision: For multi-step reasoning tasks, outcome-based supervision rewards the model only if the final answer is correct. Process-based supervision provides feedback on each intermediate reasoning step, which can catch flawed reasoning even when the final answer happens to be correct.
  • Inference-time Alignment (Best-of-N): This technique generates multiple (N) candidate responses from the LLM at inference time and uses the trained reward model to select the one with the highest score.
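
A minimal sketch of Best-of-N selection, assuming hypothetical `generate` and `reward_model` callables in place of a real sampling API and a trained reward model:

```python
def best_of_n(generate, reward_model, prompt, n=8):
    """Best-of-N: sample N candidates and return the highest-reward one.

    `generate` and `reward_model` are assumed callables standing in for a
    sampling-based LLM call and a trained reward model, respectively.
    """
    candidates = [generate(prompt) for _ in range(n)]
    scores = [reward_model(prompt, c) for c in candidates]
    return candidates[scores.index(max(scores))]
```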

5. The Operational Phase: LLM Inference

Inference is the process of using a trained LLM to generate outputs for new inputs. Efficiency and quality are the primary concerns.

5.1 The Prefilling-Decoding Framework

LLM inference is a two-phase autoregressive process:

  1. Prefilling: The input prompt is processed in parallel in a single forward pass. This stage is compute-bound and generates the initial KV cache, which contains the keys and values for all tokens in the prompt.
  2. Decoding: The output is generated one token at a time. In each step, the model uses the KV cache from all previous tokens (both prompt and generated) to predict the next token. This stage is memory-bound because its main bottleneck is the latency of reading and writing the large KV cache from GPU memory.
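
The two phases can be sketched as a single generation loop. The `model.forward(token_ids, kv_cache)` interface below is an assumption standing in for a real inference engine; greedy selection is used only to keep the sketch short.

```python
def generate(model, prompt_ids, max_new_tokens, eos_id):
    """Two-phase autoregressive generation with a KV cache.

    `model.forward(token_ids, kv_cache)` is an assumed interface returning
    next-token logits and an updated cache; it is not a real library API.
    """
    # Prefilling: one parallel forward pass over the whole prompt builds the cache.
    logits, kv_cache = model.forward(prompt_ids, kv_cache=None)
    next_token = int(logits[-1].argmax())  # greedy choice for simplicity
    output = []
    # Decoding: one token per step, reusing cached keys/values for all past tokens.
    for _ in range(max_new_tokens):
        output.append(next_token)
        if next_token == eos_id:
            break
        logits, kv_cache = model.forward([next_token], kv_cache=kv_cache)
        next_token = int(logits[-1].argmax())
    return output
```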

5.2 Decoding Algorithms

These algorithms search the vast space of possible output sequences to find a high-quality one.

  • Greedy Decoding: At each step, selects the single token with the highest probability. Fast but often leads to repetitive and suboptimal output.
  • Beam Search: Maintains a fixed number (K, the beam width) of the most probable partial sequences at each step, exploring a wider part of the search space than greedy search.
  • Sampling-based Decoding: Introduces randomness for more diverse and creative outputs.
    • Top-k Sampling: Samples from the k most likely next tokens.
    • Top-p (Nucleus) Sampling: Samples from the smallest set of tokens whose cumulative probability exceeds a threshold p.
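
A minimal NumPy sketch of top-p (nucleus) sampling follows; the toy logits and threshold are illustrative.

```python
import numpy as np

def top_p_sample(logits, p=0.9, rng=None):
    """Nucleus sampling: draw from the smallest set of tokens whose
    cumulative probability exceeds the threshold p."""
    rng = rng or np.random.default_rng()
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]              # token ids by descending probability
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # smallest prefix with mass >= p
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=nucleus_probs))

# Toy example: a peaked distribution over a 6-token vocabulary.
print(top_p_sample(np.array([2.0, 1.0, 0.5, 0.1, -1.0, -2.0]), p=0.9))
```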

5.3 Efficient Inference Techniques

Serving LLMs at scale requires optimizing for throughput and latency.

  • Batching: Processing multiple requests simultaneously to maximize GPU utilization.
    • Static Batching: Groups requests of similar length and pads them to a uniform size. Inefficient due to padding and waiting for the entire batch to finish.
    • Continuous Batching (Iteration-based Scheduling): A more dynamic approach where the scheduler can add new requests to or remove finished requests from the batch between individual token generation steps. This keeps the inference engine continuously utilized and significantly improves throughput. A scheduling sketch follows this list.
  • Chunked Prefilling: For very long prompts, the prefilling step can be broken into smaller chunks. This allows decoding steps for other requests in the batch to be interleaved with the prefilling chunks, reducing latency for those other requests.
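
As noted under Continuous Batching above, the essence of iteration-level scheduling is that the running batch is revised between token-generation steps. The sketch below assumes a hypothetical `engine.step(batch)` interface that advances every request in the batch by one token and reports which requests just finished.

```python
from collections import deque

def continuous_batching(engine, requests, max_batch_size=8):
    """Iteration-level scheduling sketch: finished requests free their slots
    immediately, and waiting requests join without waiting for a full batch.

    `engine.step(batch)` is an assumed interface, not a real library call.
    """
    waiting = deque(requests)
    running, completed = [], []
    while waiting or running:
        # Admit new requests into any free slots before the next step.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        finished = engine.step(running)   # one decoding step for the whole batch
        for req in finished:              # release slots as soon as a request ends
            running.remove(req)
            completed.append(req)
    return completed
```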

5.4 Inference-time Scaling

Performance on complex tasks can be improved by allocating more compute during inference, without retraining the model.

  • Context Scaling: Expanding the prompt with more information, such as demonstrations (ICL), reasoning steps (CoT), or retrieved documents (RAG).
  • Search Scaling: Expanding the search space during decoding, for example, by increasing the beam width in beam search or generating more candidates for reranking.
  • Verification and Reranking: This involves a “predict-then-verify” cycle.
    • Parallel Scaling (Best-of-N): The LLM generates K independent candidate solutions (e.g., by sampling). A separate verifier model (which can be a reward model or another LLM) scores each solution, and the highest-scoring one is selected.
    • Sequential Scaling (Self-Refinement): The LLM generates an initial solution, which is then iteratively critiqued and refined by another LLM or the same LLM in subsequent turns.
    • Step-level Search: A verifier is used to score or prune paths at each intermediate step of a reasoning chain, allowing the model to explore the reasoning space more effectively, similar to how tree search algorithms work.
