What Is a Large Language Model (LLM)
A precise technical definition of large language models — covering transformer architecture, pretraining methodology, instruction tuning, and the distinction between base models and assistant-tuned variants.
Definition
A large language model (LLM) is a neural network trained on a large text corpus to model the conditional probability of each token given the tokens that precede it. Given a sequence of input tokens, an LLM outputs a probability distribution over its vocabulary for the next token. Text is generated autoregressively: a token is sampled from this distribution, appended to the input, and the process repeats.
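The sampling loop described above can be sketched in a few lines. This is a minimal illustration, not how any production model is implemented: `toy_logits` is a hypothetical stand-in for a trained network, and the vocabulary is just the integers 0 to 3.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_logits(context, vocab_size):
    """Hypothetical stand-in for a model: favors the token after the last one."""
    favored = (context[-1] + 1) % vocab_size
    return [5.0 if i == favored else 0.0 for i in range(vocab_size)]

def generate(context, steps, vocab_size, rng):
    """Autoregressive loop: sample a token, append it, repeat."""
    tokens = list(context)
    for _ in range(steps):
        probs = softmax(toy_logits(tokens, vocab_size))
        next_token = rng.choices(range(vocab_size), weights=probs, k=1)[0]
        tokens.append(next_token)
    return tokens

print(generate([0], steps=5, vocab_size=4, rng=random.Random(0)))
```

Each step conditions on everything generated so far, which is why an early sampled token influences all later ones.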
The term “large” is relative and historically contingent, but it is conventionally applied to models with roughly 1 billion parameters or more. As of 2024, frontier models range from 70B to an estimated 1T+ parameters.
Architecture
Contemporary LLMs are built on the Transformer architecture (Vaswani et al., 2017). Core components:
- Token embedding layer: maps discrete tokens to dense vector representations in a high-dimensional space
- Positional encoding: injects sequence-order information (learned or sinusoidal absolute positions, or rotary embeddings, RoPE)
- Multi-head self-attention: computes pairwise relevance between tokens in the context window in parallel; in decoder-only models a causal mask restricts each token to attending to itself and earlier tokens
- Feed-forward sublayers: apply nonlinear transformations independently at each token position (often with SwiGLU activation)
- Layer normalization: stabilizes training and gradient flow (pre-norm is now standard over post-norm)
- Final linear + softmax: projects hidden states to probability distributions over the vocabulary
Decoder-only transformers (GPT family, Llama, Claude, Gemini) are the dominant architecture for generative tasks. Encoder-decoder models (T5, BART) are used for translation and summarization.
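The self-attention component listed above, in the causally masked form used by decoder-only models, can be sketched as follows. This is a simplified single-head version under stated assumptions: it takes queries, keys, and values directly and omits the learned projection matrices, multiple heads, and batching that real implementations use.

```python
import math

def softmax(xs):
    """Convert raw scores into a probability distribution."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    queries, keys, values: lists of vectors, one per token position.
    Position i attends only to positions 0..i.
    """
    d = len(queries[0])
    outputs = []
    for i, q in enumerate(queries):
        # Scores against the current and earlier positions only (the causal mask).
        scores = [sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        weights = softmax(scores)
        # Weighted sum of the visible value vectors.
        out = [sum(w * values[j][k] for j, w in enumerate(weights))
               for k in range(len(values[0]))]
        outputs.append(out)
    return outputs

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in causal_attention(x, x, x):
    print(row)
```

The first position can only see itself, so its output equals its own value vector; later positions blend the values of everything before them.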
Training Phases
Pretraining
The model is trained on a large text corpus using next-token prediction (causal language modeling). The objective minimizes the cross-entropy loss between the predicted distribution and the actual next token, averaged over every position in each training sequence. Pretraining corpora typically include web crawls, books, code repositories, and academic papers, totaling roughly 1–15 trillion tokens for frontier models.
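As a small worked example of this objective, the sketch below computes the average cross-entropy for a sequence, given the probabilities a model assigned at each position. The function name and the toy two-token vocabulary are illustrative assumptions; real training code operates on logits in batches.

```python
import math

def next_token_cross_entropy(probs_per_position, target_tokens):
    """Average negative log-probability assigned to the actual next tokens."""
    total = 0.0
    for probs, target in zip(probs_per_position, target_tokens):
        total += -math.log(probs[target])
    return total / len(target_tokens)

# Two positions over a two-token vocabulary.
predicted = [[0.5, 0.5], [0.25, 0.75]]
actual_next = [0, 1]
print(next_token_cross_entropy(predicted, actual_next))
```

The loss falls toward zero as the model puts more probability on the tokens that actually occur.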
Supervised Fine-Tuning (SFT)
Base pretrained models are fine-tuned on curated instruction-following datasets — human-written (prompt, response) pairs that demonstrate the desired assistant behavior. This phase aligns the model’s output distribution with the format of helpful responses.
RLHF / DPO Alignment
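Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO) further refines the model using human preference judgments between response pairs. RLHF fits a reward model to those judgments and then optimizes the LLM against it with reinforcement learning; DPO trains directly on the preference pairs without a separate reward model. This phase optimizes for helpfulness, harmlessness, and honesty (HHH) beyond what SFT alone achieves.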
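To make the preference objective concrete, here is a minimal sketch of the DPO loss for a single preference pair. The inputs are total log-probabilities of the chosen and rejected responses under the model being trained (the policy) and under a frozen reference model; the argument names and the default `beta=0.1` are illustrative assumptions, not values from any particular training recipe.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) response pair."""
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)), written as log(1 + exp(-z)).
    return math.log1p(math.exp(-z))

# Policy identical to the reference: no preference learned yet.
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
# Policy has shifted probability toward the chosen response.
print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

When the policy matches the reference the loss equals log 2, and it decreases as the policy raises the chosen response relative to the rejected one. This simple form is not numerically hardened for very large negative `z`.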
Model Variants
| Variant | Training Method | Primary Use Case |
|---|---|---|
| Base model | Pretraining only | Text completion, research |
| Instruction-tuned | Pretraining + SFT | Chat, task completion |
| RLHF-aligned | + Preference learning | Safety-aligned assistants |
| Domain-specific | Fine-tuned on vertical data | Legal, medical, code |
| Quantized (GGUF/AWQ) | Post-training compression | Local / edge deployment |
| Multimodal | + Vision/audio encoders | Image+text understanding |
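The post-training compression row can be illustrated with a deliberately simple sketch: symmetric 8-bit quantization with one scale per weight list. This is an assumption-laden toy and not the GGUF or AWQ scheme, both of which use more elaborate block-wise or activation-aware methods.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    scale = max(abs(w) for w in weights) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero input: any scale works
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the stored integers."""
    return [v * scale for v in q]

weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
print(q, scale)
print(dequantize(q, scale))
```

Each weight is stored in one byte instead of two or four, at the cost of a rounding error of at most about half the scale per weight.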
Key Metrics
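Perplexity: the exponential of the average per-token negative log-likelihood on a held-out test corpus, which measures how well the model predicts unseen text. Lower is better. Useful for comparing base models that share a tokenizer; less meaningful for instruction-tuned models.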
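A minimal sketch of the perplexity calculation, assuming the per-token natural-log probabilities have already been collected from a model (the function name is illustrative):

```python
import math

def perplexity(token_log_probs):
    """exp of the average negative log-likelihood per token (natural log)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# A model that is uniformly unsure among 4 tokens at every position.
uniform_guess = [math.log(0.25)] * 3
print(perplexity(uniform_guess))
```

A model that spreads probability evenly over 4 tokens has a perplexity of about 4, which is why perplexity is often read as an effective number of equally likely choices per token.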
MMLU: Massive Multitask Language Understanding, a 57-subject multiple-choice benchmark testing world knowledge and reasoning. Scores above 85% are generally read as indicating strong general capability.
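HumanEval: measures the functional correctness of generated code on 164 hand-written Python programming problems, each checked by unit tests. Pass@1 is the fraction of problems solved by a single sampled solution; rates above 80% indicate strong coding ability.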
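Pass@k is usually estimated by drawing n samples per problem, counting the c samples that pass the tests, and applying the unbiased estimator published with the benchmark. A minimal sketch for one problem follows; the function and argument names are illustrative.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, of which c passed the tests."""
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a pass.
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

For k = 1 the estimate reduces to c / n, the plain fraction of passing samples; the benchmark score is the mean of this value over all problems.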
Context window: maximum sequence length the model can process in a single forward pass. Ranges from 2K tokens (the original GPT-3) to 1M+ tokens (Gemini 1.5 Pro).