Mathematical Foundations of Machine Learning

Prof. Matthieu Bloch

Monday, October 7, 2024 (v1.0)

Last time

  • Last class: Wednesday October 02, 2024
    • We talked about the kernel trick and how we can solve regression without ever seeing the RKHS
    • We talked about Aronszajn's theorem and how it helps identify kernels
    • Homework 4 will help clarify ideas around this
  • Today: We will start working towards the SVD.
  • To be effectively prepared for today's class, you should have:
    1. Come to class on Wednesday October 02, 2024
    2. Gone over the slides and read the associated lecture notes
    3. Started reviewing your notes for the midterm
  • Logistics: use office hours to review for the midterm!
    • Jack Hill office hours: Wednesday 11:30am-12:30pm in TSRB and hybrid
    • Anuvab Sen office hours: Thursday 12pm-1pm in TSRB and hybrid
    • Dr. Bloch: Monday, October 07, 2024 6pm - online
  • Homework 4: posted, due Sunday October 20.
    • We agreed to give you room to breathe, prepare for the midterm and enjoy Fall break
    • However, the pace of subsequent homework will be faster (~1/week)

Midterm

  • Date: Wednesday October 9, 2024 3:30pm-4:45pm in the usual classroom
    • Please be on time and plan to finish when asked
    • Open notes (slides, annotations, lecture notes, your notes)
    • No Internet, no AI, no textbook
  • Coverage
    • Everything up to and including Lecture 11 on Wednesday September 25, 2024
    • Representer theorem is in scope (functionals and RKHS are not in scope)
    • Remember what \(B\)-splines are, but don't worry about reproducing the detailed proofs from Homework 1
  • My expectations
    • You have worked on all the homework
    • You have read the lecture notes
  • What I am thinking about testing (this is not meant to be a hint as to what will be on the test)
    • Abstract Hilbert spaces: can you manipulate functions as if they were vectors?
    • Orthogonality principle: can you apply it properly to solve an optimization problem?
    • Least squares regression: are you able to characterize the solution?

Aronszajn's theorem

An inner product kernel is a mapping \(k:\bbR^d\times\bbR^d\to\bbR\) for which there exists a Hilbert space \(\calH\) and a mapping \(\Phi:\bbR^d\to\calH\) such that \[\forall \bfu,\bfv\in\bbR^d\quad k(\bfu,\bfv)=\langle\Phi(\bfu),\Phi(\bfv)\rangle_\calH\]

A function \(k:\bbR^d\times\bbR^d\to\bbR\) is a positive semidefinite kernel if

  1. \(k\) is symmetric, i.e., \(k(\bfu,\bfv)=k(\bfv,\bfu)\)
  2. for all \(\{\bfx_i\}_{i=1}^N\), the Gram matrix \(\bfK\) is positive semidefinite, i.e., \[\forall \bfa\in\bbR^N\quad\bfa^\intercal\bfK\bfa\geq 0\text{ with }\bfK=[K_{i,j}]\text{ and }K_{i,j}\eqdef k(\bfx_i,\bfx_j)\]

A function \(k:\bbR^d\times\bbR^d\to\bbR\) is an inner product kernel if and only if \(k\) is a positive semidefinite kernel.
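As a quick sanity check of these definitions, here is a minimal numerical sketch (illustrative code, not from the lecture notes): for the homogeneous quadratic kernel \(k(\bfu,\bfv)=(\bfu^\intercal\bfv)^2\) on \(\bbR^2\), the explicit feature map \(\Phi(\bfu)=(u_1^2,u_2^2,\sqrt{2}\,u_1u_2)\) reproduces the kernel as an inner product.

```python
import numpy as np

def k(u, v):
    """Homogeneous quadratic kernel k(u, v) = (u^T v)^2."""
    return float(u @ v) ** 2

def phi(u):
    """Explicit feature map on R^2: Phi(u) = (u1^2, u2^2, sqrt(2) u1 u2)."""
    return np.array([u[0] ** 2, u[1] ** 2, np.sqrt(2) * u[0] * u[1]])

rng = np.random.default_rng(0)
u, v = rng.standard_normal(2), rng.standard_normal(2)

# The two quantities agree up to floating-point error.
print(k(u, v), phi(u) @ phi(v))
```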

Examples

Regression using linear and quadratic functions in \(\bbR^d\)

Regression using Radial Basis Functions

  • Examples of kernels (a short numerical sketch follows this list)
    • Homogeneous polynomial kernel: \(k(\bfu,\bfv) = (\bfu^\intercal\bfv)^m\) with \(m\in\bbN^*\)
    • Inhomogeneous polynomial kernel: \(k(\bfu,\bfv) = (\bfu^\intercal\bfv+c)^m\) with \(c>0\), \(m\in\bbN^*\)
    • Radial basis function (RBF) kernel: \(k(\bfu,\bfv) = \exp\left(-\frac{\norm{\bfu-\bfv}^2}{2\sigma^2}\right)\) with \(\sigma^2>0\)
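The sketch below (illustrative function names and random data, not from the notes) builds the Gram matrix for each of these kernels and checks that its smallest eigenvalue is nonnegative, as Aronszajn's theorem requires of a positive semidefinite kernel.

```python
import numpy as np

def poly_hom(U, V, m=2):
    """Homogeneous polynomial kernel (u^T v)^m for all pairs of rows of U and V."""
    return (U @ V.T) ** m

def poly_inhom(U, V, c=1.0, m=2):
    """Inhomogeneous polynomial kernel (u^T v + c)^m."""
    return (U @ V.T + c) ** m

def rbf(U, V, sigma2=1.0):
    """RBF kernel exp(-||u - v||^2 / (2 sigma^2))."""
    sq_dists = np.sum(U**2, 1)[:, None] + np.sum(V**2, 1)[None, :] - 2 * U @ V.T
    return np.exp(-sq_dists / (2 * sigma2))

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))   # 20 points in R^3

for kernel in (poly_hom, poly_inhom, rbf):
    K = kernel(X, X)               # Gram matrix K_ij = k(x_i, x_j)
    lam_min = np.linalg.eigvalsh(K).min()
    print(kernel.__name__, lam_min >= -1e-10)  # nonnegative up to round-off
```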

Systems of symmetric equations

  • Least squares problems involve the normal equations \(\bfX^\intercal\bfX \bftheta=\bfX^\intercal\bfy\)

  • This is a system of symmetric equations \(\bfA\bfx=\bfy\) with \(\bfA^\intercal=\bfA\)

    • Ultimately we will talk about the non-symmetric/non-square case

A real-valued matrix \(\bfA\) is symmetric if \(\bfA^\intercal=\bfA\)

A complex-valued matrix \(\bfA\) is Hermitian if \(\bfA^\dagger=\bfA\) (also written \(\bfA^H=\bfA\))

Given a matrix \(\matA\in\bbC^{n\times n}\), if a nonzero vector \(\vecv\in\bbC^n\) satisfies \(\matA\bfv=\lambda\bfv\) for some \(\lambda\in\bbC\), then \(\lambda\) is an eigenvalue associated to the eigenvector \(\bfv\).

  • If \(\lambda\) is an eigenvalue, there are infinitely many eigenvectors associated to it (any nonzero scalar multiple of an eigenvector is again an eigenvector); see the sketch below
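A minimal numpy sketch (illustrative values) of the eigenvalue definition: \(\matA\bfv=\lambda\bfv\) holds for an eigenvector \(\bfv\) and for any nonzero scalar multiple of it.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])          # symmetric, so its eigenvalues are real

lam, V = np.linalg.eigh(A)          # eigendecomposition of a symmetric matrix
v = V[:, 0]                         # eigenvector associated to lam[0]

# A v = lambda v holds for v and for any nonzero multiple of v.
print(np.allclose(A @ v, lam[0] * v))
print(np.allclose(A @ (3.7 * v), lam[0] * (3.7 * v)))
```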

Change of basis

  • Consider the canonical basis \(\set{e_i}_{i=1}^n\) for \(\bbR^n\); every vector can be viewed as coefficients \(\set{\alpha_i}_{i=1}^n\), \[ \bfx = \sum_{i=1}^n \alpha_i e_i = \mat{cccc}{\alpha_1&\alpha_2&\cdots&\alpha_n}^\intercal \]

  • How do we find the representation of \(\bfx\) in another basis \(\set{v_i}_{i=1}^n\)? Write \(e_i=\sum_{j=1}^n\beta_{ij}v_j\)

  • Regroup the coefficients \[ \bfx = \cdots + \left(\sum_{i=1}^n\beta_{ij}\alpha_i\right) v_j + \cdots \]

  • In matrix form \[ \bfx_{\text{new}} = \mat{cccc}{\beta_{11}&\beta_{21}&\cdots&\beta_{n1}\\ \beta_{12}&\beta_{22}&\cdots&\beta_{n2}\\\vdots&\vdots&\vdots&\vdots\\\beta_{1n}&\beta_{2n}&\cdots&\beta_{nn}}\bfx \]
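A numerical sketch of this change of basis (illustrative code, not from the notes): if the new basis vectors are stored as the columns of a matrix \(\matV\), the matrix of \(\beta\)'s above is \(\matV^{-1}\), so the new coordinates are \(\matV^{-1}\bfx\).

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.standard_normal((3, 3))     # columns v_1, v_2, v_3 form a (generic) basis of R^3
x = rng.standard_normal(3)          # coordinates of x in the canonical basis

# The change-of-basis matrix [beta_ji] on the slide is V^{-1}.
x_new = np.linalg.inv(V) @ x        # coordinates of the same vector in the basis {v_j}

# Reconstructing x from its new coordinates recovers the original vector.
print(np.allclose(V @ x_new, x))
```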

Similarity

  • A change of basis matrix \(\matP\) is full rank (basis vectors are linearly independent)
  • Any full rank matrix \(\matP\) can be viewed as a change of basis
  • \(\matP^{-1}\) takes you back to the original basis
  • Warning: the columns of \(\bfP\) describe the old coordinates as a function of the new ones

Given \(\matA,\bfB\in\bbR^{n\times n}\), \(\bfB\) is similar to \(\bfA\) if there exists an invertible matrix \(\bfP\in\bbR^{n\times n}\) such that \(\bfB=\bfP^{-1}\bfA\bfP\)

  • Intuition: similar matrices are the same up to a change of basis

\(\matA\in\bbR^{n\times n}\) is diagonalizable if it is similar to a diagonal matrix, i.e., there exists an invertible matrix \(\bfP\in\bbR^{n\times n}\) such that \(\bfD=\bfP^{-1}\bfA\bfP\) with \(\matD\) diagonal

  • Not all matrices are diagonalizable!
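A small sketch (illustrative values) of both points: similar matrices share their eigenvalues, and the \(2\times 2\) Jordan block with eigenvalue \(0\) is a standard example of a matrix that is not diagonalizable.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
A = M + M.T                              # symmetric, hence real eigenvalues
P = rng.standard_normal((3, 3))          # generically invertible: a change of basis
B = np.linalg.inv(P) @ A @ P             # B is similar to A

# Similar matrices share their eigenvalues (compared up to ordering and round-off).
print(np.allclose(np.sort(np.linalg.eigvalsh(A)),
                  np.sort(np.linalg.eigvals(B).real)))

# Not every matrix is diagonalizable: J below has 0 as its only eigenvalue, so if it
# were similar to a diagonal matrix, that matrix would be 0, forcing J = 0: contradiction.
J = np.array([[0.0, 1.0],
              [0.0, 0.0]])
print(np.linalg.eigvals(J))              # [0. 0.]
```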

Spectral theorem

Every complex matrix \(\matA\) has at least one complex eigenvector, and every real symmetric matrix has real eigenvalues and at least one real eigenvector.

Every matrix \(\matA\in\bbC^{n\times n}\) is unitarily similar to an upper triangular matrix, i.e., \[ \bfA = \bfV\boldsymbol{\Delta}\bfV^\dagger \] with \(\boldsymbol{\Delta}\) upper triangular and \(\bfV^\dagger=\bfV^{-1}\).

Every Hermitian matrix is unitarily similar to a real-valued diagonal matrix.

  • Note that if \(\matA = \matV\matD\matV^\dagger\) then \(\matA = \sum_{i=1}^n\lambda_i \vecv_i\vecv_i^\dagger\)
  • How about real-valued matrices \(\matA\in\bbR^{n\times n}\)?
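A sketch of the Hermitian case (illustrative, using a random real symmetric matrix as a special case): the eigenvector matrix is unitary (here orthogonal) and \(\matA\) is recovered from the rank-one expansion \(\sum_i\lambda_i\vecv_i\vecv_i^\dagger\).

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = (M + M.T) / 2                   # real symmetric (a special case of Hermitian)

lam, V = np.linalg.eigh(A)          # real eigenvalues, orthonormal eigenvectors as columns

# V is orthogonal (unitary with real entries): V^T V = I.
print(np.allclose(V.T @ V, np.eye(4)))

# A = V diag(lam) V^T = sum_i lam_i v_i v_i^T.
A_rebuilt = sum(lam[i] * np.outer(V[:, i], V[:, i]) for i in range(4))
print(np.allclose(A, A_rebuilt))
```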

Symmetric positive definite matrices

A symmetric matrix \(\matA\) is positive definite if it has positive eigenvalues, i.e., \(\forall i\in\set{1,\cdots,n}\quad\lambda_i>0\).

A symmetric matrix \(\matA\) is positive semidefinite if it has nonnegative eigenvalues, i.e., \(\forall i\in\set{1,\cdots,n}\quad\lambda_i\geq 0\).

  • Convention: \(\lambda_1\geq \lambda_2\geq \cdots \geq \lambda_n\)
  • Variational form of extreme eigenvalues for symmetric positive definite matrices \(\bfA\) \[ \lambda_1 = \max_{\vecx\in\bbR^n:\norm[2]{\bfx}=1}\vecx^\intercal \matA\vecx = \max_{\vecx\in\bbR^n\setminus\set{\boldsymbol{0}}}\frac{\vecx^\intercal \matA\vecx}{\norm[2]{\vecx}^2}\qquad \lambda_n = \min_{\vecx\in\bbR^n:\norm[2]{\bfx}=1}\vecx^\intercal \matA\vecx = \min_{\vecx\in\bbR^n\setminus\set{\boldsymbol{0}}}\frac{\vecx^\intercal \matA\vecx}{\norm[2]{\vecx}^2} \]
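The variational form can be checked numerically; in the sketch below (illustrative data), every Rayleigh quotient \(\vecx^\intercal\matA\vecx/\norm[2]{\vecx}^2\) falls between \(\lambda_n\) and \(\lambda_1\).

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + np.eye(5)             # symmetric positive definite by construction

lam = np.linalg.eigvalsh(A)         # ascending order: lam[0] = lambda_n, lam[-1] = lambda_1

x = rng.standard_normal((5, 1000))
x /= np.linalg.norm(x, axis=0)      # random unit vectors
rayleigh = np.einsum('in,ij,jn->n', x, A, x)   # x^T A x for each column of x

# Every Rayleigh quotient lies in [lambda_n, lambda_1]; eigenvectors attain the extremes.
print(lam[0] <= rayleigh.min(), rayleigh.max() <= lam[-1])
```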

For a symmetric matrix \(\matA\) with eigendecomposition \(\matA=\sum_{i=1}^n\lambda_i\vecv_i\vecv_i^\intercal\) and any analytic function \(f\), we have \[ f(\matA) = \sum_{i=1}^n f(\lambda_i)\vecv_i\vecv_i^\intercal \]
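For instance, taking \(f(t)=\sqrt{t}\) on a positive definite matrix gives a matrix square root; the sketch below (illustrative) builds it from the eigendecomposition and checks that it squares back to \(\matA\).

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = M @ M.T + np.eye(4)             # symmetric positive definite, so sqrt is well defined

lam, V = np.linalg.eigh(A)

# f(A) = sum_i f(lambda_i) v_i v_i^T, here with f = sqrt.
sqrtA = (V * np.sqrt(lam)) @ V.T    # same as V @ diag(sqrt(lam)) @ V^T

# Sanity check: sqrtA is symmetric and squares back to A.
print(np.allclose(sqrtA, sqrtA.T), np.allclose(sqrtA @ sqrtA, A))
```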

System of symmetric definite equations

  • Consider the system \(\vecy=\matA\vecx\) with \(\matA\) symmetric positive definite

Let \(\set{\vecv_i}_{i=1}^n\) be orthonormal eigenvectors of \(\matA\) with eigenvalues \(\set{\lambda_i}_{i=1}^n\). The solution of \(\vecy=\matA\vecx\) is \[ \vecx = \sum_{i=1}^n\frac{1}{\lambda_i}\dotp{\vecy}{\vecv_i}\vecv_i \]
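A sketch of this formula (illustrative data): expanding \(\vecy\) in the orthonormal eigenbasis recovers the same solution as a direct solve.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = M @ M.T + np.eye(4)             # symmetric positive definite
x_true = rng.standard_normal(4)
y = A @ x_true

lam, V = np.linalg.eigh(A)          # orthonormal eigenvectors as columns of V

# x = sum_i (1/lambda_i) <y, v_i> v_i
x_hat = sum((1.0 / lam[i]) * (y @ V[:, i]) * V[:, i] for i in range(4))
print(np.allclose(x_hat, x_true), np.allclose(x_hat, np.linalg.solve(A, y)))
```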

  • Assume that there exists some observation error \(\vecy=\matA\vecx+\vece\)
    • \(\vece\) is unknown
    • we try to reconstruct \(\vecx\) as \(\widetilde{\vecx}\eqdef\matA^{-1}\vecy\)

\[ \frac{1}{\lambda_1^2}\norm{\vece}^2\leq \norm[2]{\vecx-\widetilde{\vecx}}^2\leq \frac{1}{\lambda_n^2}\norm{\vece}^2. \]
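A numerical sketch of this sensitivity bound (illustrative data, with \(\widetilde{\vecx}=\matA^{-1}\vecy\)): the squared reconstruction error indeed lies between \(\norm{\vece}^2/\lambda_1^2\) and \(\norm{\vece}^2/\lambda_n^2\).

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = M @ M.T + np.eye(4)              # symmetric positive definite
lam = np.linalg.eigvalsh(A)          # ascending: lam[0] = lambda_n, lam[-1] = lambda_1

x = rng.standard_normal(4)
e = 0.01 * rng.standard_normal(4)    # unknown observation error
y = A @ x + e

x_tilde = np.linalg.solve(A, y)      # reconstruct by applying A^{-1}
err2 = np.linalg.norm(x - x_tilde) ** 2
bound_lo = np.linalg.norm(e) ** 2 / lam[-1] ** 2
bound_hi = np.linalg.norm(e) ** 2 / lam[0] ** 2

print(bound_lo <= err2 <= bound_hi)  # True: the error is amplified by at most 1/lambda_n^2
```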

Next time

  • Next class: Wednesday October 09, 2024
  • To be effectively prepared for the midterm, you should:
    1. Not panic
    2. Not start reviewing your notes the night before
    3. Continue reviewing your notes
    4. Post questions on Piazza
    5. Come to office hours with questions
  • Optional
    • Export slides for next lecture as PDF (be on the lookout for an announcement when they're ready)