This article synthesizes the foundational principles of machine learning as presented in the book “Mathematics for Machine Learning.” The core thesis is that machine learning is not a collection of disparate algorithms but a field deeply rooted in and expressed through the language of mathematics. A solid grasp of these mathematical underpinnings provides a unified understanding of how and why machine learning models work.

The framework of a machine learning system is presented through three essential components:

  1. Data: Fundamentally represented as vectors, with datasets forming matrices.
  2. Models: Simplified descriptions of a data-generating process, which can be viewed through a non-probabilistic lens (predictors and loss functions) or a probabilistic one (likelihoods and priors).
  3. Learning: The process of fitting a model to data, which is framed as a numerical optimization problem aimed at achieving strong performance on new, unseen data.

The text is structured around two main parts. The first establishes the mathematical foundations, drawing from key fields:

  • Linear Algebra & Analytic Geometry: The language for representing data and transformations.
  • Matrix Decompositions: Tools for analyzing, simplifying, and approximating data and models.
  • Vector Calculus: The engine for optimizing model parameters.
  • Probability & Distributions: The framework for quantifying uncertainty and modeling data generation.

The second part applies these foundations to the “Four Pillars of Machine Learning”:

  1. Regression: Predicting continuous values.
  2. Dimensionality Reduction: Creating compact data representations.
  3. Density Estimation: Modeling the underlying data distribution.
  4. Classification: Predicting discrete labels.

A central theme is the duality between non-probabilistic and probabilistic approaches. Concepts like Empirical Risk Minimization and regularization in the former have direct analogues in Maximum Likelihood/Maximum A Posteriori Estimation and priors in the latter. Ultimately, the synthesis demonstrates that practical questions in machine learning are intrinsically linked to fundamental choices in the underlying mathematical model.

“Math is linked in the popular mind with phobia and anxiety. You’d think we’re discussing spiders.” — (Strogatz, 2014, page 281)

Part I: The Mathematical Foundations

The text establishes that a core set of mathematical disciplines forms the bedrock upon which machine learning is built. These concepts are not merely prerequisites but are actively used to define data, construct models, and execute learning algorithms.

1.1 The Language of Data and Mappings: Linear Algebra & Analytic Geometry

Linear algebra provides the fundamental objects and operations for handling data, while analytic geometry provides the means to measure relationships between them.

  • Data Representation: All data, once processed, is represented numerically as vectors. A collection of data points (examples) forms a table, which is represented as a matrix. This vector-based view is a cornerstone of the entire framework.
  • Matrices as Operators: Beyond storing data, matrices play a central role as linear operators. They compactly represent systems of linear equations (Ax = b) and define linear mappings that transform vectors from one space to another.
  • Vector Spaces: The concept of a vector space formalizes the rules for manipulating vectors, defining a set of objects that is “closed” under addition and scalar multiplication. This concept underpins many ML models, including dimensionality reduction techniques that operate on vector subspaces.
  • Similarity and Distance: Analytic geometry formalizes intuitive notions of similarity through mathematical constructs.
    • Inner Products: Generalize the dot product to measure the relationship between two vectors.
    • Norms: Define the length or magnitude of a vector.
    • Orthogonality: Describes when vectors are perpendicular, a key concept for creating independent bases and projections.
  • Orthogonal Projections: This is a crucial tool for data compression and feature extraction. Projecting a high-dimensional data vector onto a lower-dimensional subspace is the geometric basis for Principal Component Analysis (PCA).
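The projection idea above can be sketched numerically. This is a minimal NumPy illustration (not from the book's code): projecting a 3-D vector onto the xy-plane via the standard projection-matrix formula.

```python
import numpy as np

# Orthogonal projection of x onto the subspace spanned by the columns of B:
# pi(x) = B (B^T B)^{-1} B^T x
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])               # basis of the xy-plane in R^3
x = np.array([3.0, 4.0, 5.0])

P = B @ np.linalg.inv(B.T @ B) @ B.T     # projection matrix
x_proj = P @ x                           # -> [3., 4., 0.]

# The residual is orthogonal to the subspace, and P is idempotent.
assert np.allclose(B.T @ (x - x_proj), 0.0)
```

Dropping the third coordinate of `x_proj` gives the 2-D "compressed" representation, which is exactly the mechanism PCA exploits.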

1.2 Decomposing Complexity: Matrix Decompositions

Matrix decompositions are powerful techniques for breaking down matrices into constituent parts, revealing their intrinsic properties and enabling efficient computation and approximation.

  • Eigendecomposition (A = PDP⁻¹): This decomposition applies to square matrices and is fundamental for analyzing linear transformations. It reframes the transformation as a simple scaling along the directions of the eigenvectors, with the scaling factors given by the eigenvalues. This is visualized as a three-step process: a change of basis, a scaling operation, and a reversal of the basis change.
  • Singular Value Decomposition (SVD) (A = UΣVᵀ): Hailed as the “fundamental theorem of linear algebra,” the SVD is a more general decomposition that applies to any rectangular matrix.
    • It breaks down any linear mapping into a sequence of three fundamental operations: a rotation (basis change in the domain), a scaling and dimensionality change, and another rotation (basis change in the codomain).
    • The singular values in the diagonal matrix Σ quantify the amplification of the transformation in different directions.
    • It is the core mechanism for low-rank matrix approximation, where a complex matrix is approximated by the sum of a few simple, rank-1 matrices. This is used extensively in data compression, topic modeling, and recommendation systems.
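The low-rank approximation described above can be computed directly with NumPy's SVD routine. This sketch (illustrative data, not from the book) keeps the top k singular values and checks the approximation error against the discarded spectrum:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))

# Thin SVD: A = U @ diag(s) @ Vt, singular values sorted descending.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-k approximation: keep the k largest singular values,
# i.e. the sum of the k dominant rank-1 matrices s_i * u_i v_i^T.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The Frobenius-norm error of the truncated SVD equals the norm
# of the discarded singular values (Eckart-Young theorem).
err = np.linalg.norm(A - A_k, "fro")
assert np.isclose(err, np.sqrt(np.sum(s[k:] ** 2)))
```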

1.3 The Engine of Learning: Vector Calculus & Optimization

If linear algebra provides the structure, vector calculus and optimization provide the engine that powers the learning process by enabling models to find optimal parameters.

  • Gradients and Derivatives: Vector calculus extends differentiation to functions of multiple variables (vectors and matrices).
    • The gradient (∇f) is a vector of partial derivatives that points in the direction of the steepest ascent of a function.
    • The Jacobian is the matrix of all first-order partial derivatives of a vector-valued function, representing the best local linear approximation.
    • The Hessian is the matrix of second-order partial derivatives, describing the local curvature of a function.
  • Optimization Algorithms: The goal of learning is typically framed as minimizing an objective function (e.g., error or loss).
    • Gradient Descent: An iterative algorithm that finds a local minimum by repeatedly taking steps in the direction of the negative gradient. Variants like Stochastic Gradient Descent (SGD) use subsets of data for computational efficiency.
  • Constrained Optimization and Duality: Many real-world problems involve constraints. Lagrange multipliers provide a method to solve such problems by converting them into an unconstrained form. This gives rise to primal and dual optimization problems, which are central to models like Support Vector Machines.
  • Convex Optimization: A special class of optimization problems in which both the objective function and the feasible set are convex. For these problems, any local minimum is also a global minimum, guaranteeing that an optimal solution can be found.
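The gradient descent update described above is a one-line rule: repeat x ← x − η∇f(x). Here is a minimal sketch on a convex quadratic (example of my own construction), where the algorithm is guaranteed to reach the global minimum:

```python
import numpy as np

def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Minimize a function by repeatedly stepping against its gradient."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Convex quadratic f(x) = ||x - target||^2 with gradient 2(x - target).
target = np.array([3.0, -1.0])
grad = lambda x: 2.0 * (x - target)

x_min = gradient_descent(grad, x0=[0.0, 0.0])
# Converges to the unique global minimum [3, -1].
```

Stochastic gradient descent follows the same template, but evaluates `grad` on a random mini-batch of the data at each step rather than on the full dataset.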

1.4 The Framework for Uncertainty: Probability & Distributions

Probability theory provides the language for modeling uncertainty, which is inherent in data and model predictions.

  • Quantifying Uncertainty: The framework is built on random variables, which map outcomes from a sample space to numerical values. The behavior of these variables is described by probability distributions.
  • Fundamental Rules: All of probability theory is derived from two simple rules: the sum rule (for marginalization) and the product rule (for joint probabilities). These rules combine to form Bayes’ theorem, which is the cornerstone of Bayesian inference for updating beliefs in light of new data.
  • Key Distributions:
    • Gaussian (Normal) Distribution: The most important distribution for continuous variables due to its analytical convenience. The Central Limit Theorem states that sums of independent random variables tend toward a Gaussian. It is fully defined by its mean and covariance and is closed under marginalization, conditioning, and linear transformations.
    • Exponential Family: A general class of distributions that includes the Gaussian, Bernoulli, Binomial, and others. This family possesses useful properties, notably the concept of conjugate priors, where the posterior distribution remains in the same family as the prior, simplifying Bayesian calculations.
  • Summary Statistics: Properties of distributions are summarized by statistics like the mean (expected value), variance (spread), and covariance (relationship between variables).
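The sum and product rules, and how they combine into Bayes' theorem, can be shown on a classic discrete example. The numbers below are hypothetical, chosen only to illustrate the mechanics:

```python
# Bayes' theorem on a discrete example: a screening test for a rare condition.
p_disease = 0.01                 # prior p(D)
p_pos_given_disease = 0.95       # likelihood p(+ | D)
p_pos_given_healthy = 0.05       # p(+ | not D)

# Product rule: p(+, D) = p(+ | D) p(D).
# Sum rule: marginalize the joint over the disease state.
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' theorem: p(D | +) = p(+ | D) p(D) / p(+)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
# ~= 0.16: even with an accurate test, a positive result leaves the
# posterior low because the prior is small.
```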

Part II: The Four Pillars of Machine Learning

The mathematical concepts from Part I are the building blocks for the core tasks, or “pillars,” of machine learning. The text frames the approach to solving these tasks through two parallel philosophical lenses: a non-probabilistic view based on risk and a probabilistic view based on likelihoods.

2.1 Core Learning Principles

Before examining the pillars, several overarching principles of model fitting are established.

| Principle | Non-Probabilistic View | Probabilistic View |
| --- | --- | --- |
| Objective | Empirical Risk Minimization (ERM): minimize a loss function (e.g., squared error) over the training data. | Maximum Likelihood Estimation (MLE): find parameters θ that maximize the likelihood p(Data \| θ). |
| Overfitting Control | Regularization: add a penalty term (e.g., λ‖θ‖²) to the loss. | Maximum A Posteriori (MAP) estimation: place a prior p(θ) on the parameters; the prior plays the role of the regularizer. |
| Full Uncertainty | Not applicable (produces point estimates). | Bayesian Inference: compute the full posterior distribution p(θ \| Data). |
| Generalization | The ultimate goal is to minimize the expected risk on unseen data, estimated using a held-out test set or K-fold cross-validation. | The marginal likelihood (model evidence) p(Data \| Model) allows models to be compared without a separate validation set. |
| Model Structure | The structure of the predictor is defined by its hypothesis class (e.g., linear functions). | Directed Graphical Models (Bayesian Networks) provide a visual language for the conditional dependencies between random variables in the model. |

The key challenge in model fitting is navigating the trade-off between underfitting (the model is too simple to capture the data’s structure) and overfitting (the model is too complex and fits the training data’s noise, leading to poor generalization).

2.2 Pillar 1: Regression

Task: Predicting a continuous output variable.

  • Application Example: Predicting a person’s salary based on their age.
  • Model: A typical model is Linear Regression, where the output is a linear function of the input features, plus Gaussian noise: y = f(x) + ε = θᵀx + ε. Non-linear relationships are handled by transforming the input features using basis functions (φ(x)).
  • Learning Approaches:
    • MLE on a Gaussian likelihood leads to the classic least-squares problem.
    • MAP with a Gaussian prior on the parameters θ is equivalent to Ridge Regression (L2 regularization).
    • Bayesian Linear Regression computes the full posterior distribution over θ, enabling predictions that include uncertainty estimates.
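The MLE and MAP solutions above both have closed forms, which makes the ERM/regularization vs. likelihood/prior correspondence concrete. A minimal NumPy sketch on synthetic data (my own example, not the book's code):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 3))
theta_true = np.array([2.0, -1.0, 0.5])
y = X @ theta_true + 0.1 * rng.standard_normal(50)   # y = X theta + noise

# MLE with Gaussian noise = ordinary least squares:
# theta = (X^T X)^{-1} X^T y
theta_mle = np.linalg.solve(X.T @ X, X.T @ y)

# MAP with a zero-mean Gaussian prior = ridge regression (L2):
# theta = (X^T X + lam I)^{-1} X^T y, lam set by the prior precision
lam = 1.0
theta_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# The prior shrinks the MAP estimate toward zero relative to MLE.
assert np.linalg.norm(theta_map) < np.linalg.norm(theta_mle)
```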

2.3 Pillar 2: Dimensionality Reduction

Task: Finding a compact, lower-dimensional representation (code) of high-dimensional data while preserving as much relevant information as possible.

  • Application Example: Compressing images or visualizing high-dimensional datasets.
  • Model: Principal Component Analysis (PCA) is the archetypal method. It finds a linear projection of the data onto a lower-dimensional subspace.
  • Methodology: PCA can be derived from two equivalent perspectives:
    1. Maximum Variance: Find the orthogonal directions (principal components) that capture the maximum variance in the data.
    2. Minimum Reconstruction Error: Find the subspace that minimizes the average squared distance between the original data points and their projections.
  • The solution is found by performing an eigendecomposition of the data’s covariance matrix or an SVD of the centered data matrix.
  • Probabilistic PCA (PPCA) provides a generative, latent-variable model perspective on PCA, which allows it to handle missing data more naturally.
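The two solution routes mentioned above (eigendecomposition of the covariance matrix, SVD of the centered data) can be sketched side by side. Illustrative data of my own; eigenvalues of the covariance equal the squared singular values scaled by 1/(N−1):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))  # correlated data

# 1. Center the data.
Xc = X - X.mean(axis=0)

# 2. Eigendecompose the (symmetric) covariance matrix.
cov = Xc.T @ Xc / (len(Xc) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)     # returned in ascending order
order = np.argsort(eigvals)[::-1]          # re-sort descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 3. Project onto the top-k principal components to get the code.
k = 2
Z = Xc @ eigvecs[:, :k]

# Equivalent route via the SVD of the centered data matrix.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
assert np.allclose(eigvals, s ** 2 / (len(Xc) - 1))
```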

2.4 Pillar 3: Density Estimation

Task: Modeling the underlying probability distribution of a dataset in an unsupervised manner.

  • Application Example: Anomaly detection or generating new data samples that resemble the original data.
  • Model: A Gaussian Mixture Model (GMM) is a flexible model that represents a complex data distribution as a weighted sum of multiple simpler Gaussian components.
  • Methodology: Since the component assignments for each data point are unknown (latent variables), a direct MLE solution is intractable. The parameters are learned using the Expectation-Maximization (EM) algorithm:
    1. E-Step (Expectation): Calculates the posterior probability (or “responsibility”) that each data point belongs to each Gaussian component, given the current model parameters.
    2. M-Step (Maximization): Updates the model parameters (means, covariances, and weights of the Gaussians) to maximize the likelihood, using the responsibilities as soft assignments.
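The two EM steps above can be written out for a one-dimensional, two-component GMM. This is a bare-bones sketch on synthetic, well-separated clusters (my own toy setup), omitting convergence checks and numerical safeguards a real implementation would need:

```python
import numpy as np

rng = np.random.default_rng(3)
# Two clearly separated 1-D Gaussian clusters.
x = np.concatenate([rng.normal(-4, 1, 300), rng.normal(4, 1, 300)])

# Initial GMM parameters: means, variances, mixture weights.
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])
pi = np.array([0.5, 0.5])

def gauss(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

for _ in range(50):
    # E-step: responsibilities r[n, k] = p(component k | x_n).
    joint = pi * gauss(x[:, None], mu, var)          # shape (N, 2)
    r = joint / joint.sum(axis=1, keepdims=True)

    # M-step: re-estimate parameters using the soft counts.
    Nk = r.sum(axis=0)
    pi = Nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / Nk
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk

# The recovered means land near the true cluster centers -4 and 4.
```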

2.5 Pillar 4: Classification

Task: Predicting a discrete class label for a given input.

  • Application Example: Determining whether an email is spam or not spam.
  • Model: The Support Vector Machine (SVM) is a powerful classifier that seeks to find a linear decision boundary (a hyperplane) that separates the classes.
  • Methodology:
    • Maximum Margin: The SVM is unique in that it optimizes for the hyperplane that has the maximum possible margin (distance) to the nearest data points of any class. These critical points on the margin are called support vectors.
    • Hinge Loss & Regularization: The SVM objective can be formulated as minimizing a combination of the hinge loss (which penalizes misclassified points and correctly classified points inside the margin) and an L2 regularizer on the model weights (which encourages a larger margin).
    • The Kernel Trick: The SVM’s greatest strength is its ability to learn non-linear decision boundaries. The dual formulation of its optimization problem depends only on inner products between data points. The kernel trick replaces this inner product with a kernel function k(xᵢ, xⱼ), which corresponds to an inner product in a high-dimensional (even infinite-dimensional) feature space. This allows the SVM to find a linear separator in this complex feature space, which corresponds to a non-linear separator in the original input space, without ever explicitly computing the feature mapping.
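The kernel trick's key computation is the Gram matrix of pairwise kernel evaluations. A short sketch (my own example) using the RBF kernel, whose implicit feature space is infinite-dimensional:

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """RBF kernel k(x, y) = exp(-gamma * ||x - y||^2): an inner product
    in an infinite-dimensional feature space, evaluated without ever
    constructing that space."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = rbf_kernel(X, X)

# A valid Gram matrix is symmetric positive semidefinite, which is
# exactly what the SVM's dual optimization problem requires.
assert np.allclose(K, K.T)
assert np.all(np.linalg.eigvalsh(K) >= -1e-10)
```

In the dual SVM, every occurrence of the inner product xᵢᵀxⱼ is simply replaced by the entry K[i, j].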
