Mathematical Foundations of Machine Learning

Prof. Matthieu Bloch

Wednesday, September 11, 2024 (v1.1)

Last time

  • Last class: Monday September 09, 2024
    • We talked about separability
    • This is a way to ensure that our understanding in finite dimensions carries over to infinite dimension
    • We essentially guarantee the existence of a countable orthobasis
    • Not every Hilbert space is separable!
  • To be effectively prepared for today's class, you should have:
    1. Gone over slides and read associated lecture notes
    2. Considered submitting Homework 2 (due Friday September 13, 2024)
  • Logistics
    • Office hours
      • Jack Hill: 11:30am-12:30pm on Wednesdays in TSRB and hybrid
      • Anuvab: 12pm-1pm on Thursdays in TSRB and hybrid
      • Dr. Bloch: 6pm on Friday online
    • Solutions of HW1 will be released shortly
    • Grading of HW1 in progress

Isomorphism with \(\ell_2\)

  • Key takeaway for separable Hilbert spaces
    • \(x=\sum_{i=1}^\infty\dotp{x}{v_i}v_i\) is perfectly well defined for an orthonormal basis
    • Parseval's identity tells us that \(\norm{x}^2=\sum_{k\geq 1}\abs{\dotp{x}{v_k}}^2\).
    • We don't even need to worry about the nature of \(\calH\); we can think only about the coefficients \(\dotp{x}{v_i}\)

Any separable Hilbert space is isomorphic to \(\ell_2\); the isomorphism preserves the norm and inner product.
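
The following is a minimal NumPy sketch (a finite-dimensional stand-in, since we cannot manipulate a genuinely infinite orthobasis numerically): it expands a vector on an orthonormal basis of \(\bbR^5\) and checks Parseval's identity on the coefficients.

```python
# Finite-dimensional stand-in for a separable Hilbert space: expand a vector on an
# orthonormal basis and verify Parseval's identity on the coefficient sequence.
import numpy as np

rng = np.random.default_rng(0)
d = 5

# Orthonormal basis: columns of Q from a QR factorization of a random matrix
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
x = rng.standard_normal(d)

coeffs = Q.T @ x                    # coefficients <x, v_i>
x_reconstructed = Q @ coeffs        # x = sum_i <x, v_i> v_i

print(np.allclose(x, x_reconstructed))                        # True
print(np.isclose(np.sum(coeffs**2), np.linalg.norm(x)**2))    # Parseval: True
```

The point of the isomorphism with \(\ell_2\) is that the same two checks hold coordinate-wise in any separable Hilbert space.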

Why does this matter?

  • Why are we doing all this theory?
    1. Abstract Hilbert spaces allow us to do geometry on objects that are functions \[ \min_{\hat{x}\in\calT}\norm[2]{x-\hat{x}} \]
    2. Machine Learning is about learning functions for classification and regression.
    3. Interesting Hilbert spaces of functions are infinite dimensional
    4. With proper concepts (completeness, separability), infinite dimension is not a problem
    5. We can worry about coordinates instead of worrying about the vectors themselves
    6. We can understand what makes a good representation and approximation of a function
  • Is that all?
    • No! We'll actually soon introduce the notion of Reproducing Kernel Hilbert Space (RKHS)
    • We'll see that infinite dimension can be useful to introduce features in ML tasks
    • We'll see that we do geometry even when it's not a priori clear (think tRNA strings)

Non-orthogonal bases in finite dimension

Let \(\set{v_i}_{i=1}^n\) be a linearly independent set in a Hilbert space \(\calH\) of dimension \(n\). Then, for any \(x\in\calH\), \(x=\sum_{i=1}^n\alpha_iv_i\) for some \(\bfalpha\in\bbR^n\). In addition, there exist \(A,B>0\) such that \[ A\norm[2]{\bfalpha}^2 \leq \norm[\calH]{x}^2\leq B\norm[2]{\bfalpha}^2 \]

  • Remarks
    • Inequality is tight on both sides
    • For orthobases, \(A=B=1\)
  • Interpretation:
    • The values of \(A\) and \(B\) govern the stability of the representation; a numerical sketch of these constants follows below
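
As an illustration of the constants: in \(\bbR^n\), with the basis vectors stored as the columns of a matrix \(V\), the tight values of \(A\) and \(B\) are the extreme eigenvalues of the Gram matrix \(V^\intercal V\), since \(\norm{\sum_i\alpha_iv_i}^2=\bfalpha^\intercal V^\intercal V\bfalpha\). A small NumPy sketch (the specific basis below is just an example):

```python
# For a non-orthogonal basis of R^3 stored as the columns of V, the tight constants
# A and B in  A||alpha||^2 <= ||x||^2 <= B||alpha||^2  are the extreme eigenvalues
# of the Gram matrix G, since ||sum_i alpha_i v_i||^2 = alpha^T G alpha.
import numpy as np

V = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])     # linearly independent, not orthogonal (columns)
G = V.T @ V                          # Gram matrix, G_ij = <v_i, v_j>
eigvals = np.linalg.eigvalsh(G)
A, B = eigvals[0], eigvals[-1]       # tight lower/upper constants

rng = np.random.default_rng(1)
alpha = rng.standard_normal(3)
x = V @ alpha
print(A * (alpha @ alpha) <= x @ x <= B * (alpha @ alpha))   # True
```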

Dual basis in finite dimension

  • Recall from orthobases:
    • Perfectly stable representation \(A=B=1\)
    • Efficient computation of representations: \(\alpha_i=\dotp{x}{v_i}\)

For any \(x\in\calH\) with basis \(\set{v_i}_{i=1}^n\) we have \[ x=\sum_{i=1}^n\alpha_iv_i\qquad\text{with}\qquad\bfalpha = \matG^{-1}\mat{c}{\dotp{x}{v_1}\\\dotp{x}{v_2}\\\vdots\\\dotp{x}{v_n}}, \] where \(\matG\) is the Gram matrix with entries \([\matG]_{ij}\eqdef\dotp{v_i}{v_j}\). There also exists a basis \(\set{\tilde{v}_i}_{i=1}^n\), called the dual basis, such that \(\alpha_i=\dotp{x}{\tilde{v}_i}\)
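
A minimal NumPy sketch of this recipe (the random basis is just for illustration): with the basis vectors as the columns of a matrix \(V\), build the Gram matrix, solve for \(\bfalpha\), and form the dual basis as the columns of \(V\matG^{-1}\), which is one standard way to realize \(\set{\tilde{v}_i}\) in \(\bbR^n\).

```python
# Dual-basis computation in R^n: with basis vectors as columns of V, G = V^T V is the
# Gram matrix, alpha = G^{-1} [<x,v_1>,...,<x,v_n>]^T, and the dual basis vectors are
# the columns of V G^{-1}, so that alpha_i = <x, v~_i>.
import numpy as np

rng = np.random.default_rng(2)
n = 4
V = rng.standard_normal((n, n))           # generic matrix => linearly independent columns
x = rng.standard_normal(n)

G = V.T @ V                                # Gram matrix
alpha = np.linalg.solve(G, V.T @ x)        # expansion coefficients
V_dual = V @ np.linalg.inv(G)              # dual basis as columns

print(np.allclose(V @ alpha, x))             # x = sum_i alpha_i v_i
print(np.allclose(V_dual.T @ x, alpha))      # alpha_i = <x, v~_i>
print(np.allclose(V_dual.T @ V, np.eye(n)))  # biorthogonality: <v~_i, v_j> = delta_ij
```

The last check shows the defining property of the dual basis: it is biorthogonal to \(\set{v_i}\).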

Non-orthogonal bases in infinite dimension

\(\set{v_i}_{i=1}^\infty\) is a Riesz basis for a Hilbert space \(\calH\) if \(\text{cl}(\text{span}(\set{v_i}_{i=1}^\infty))=\calH\) and there exist \(A,B>0\) such that \[ A\sum_{i=1}^\infty\alpha_i^2\leq \norm[\calH]{\sum_{i=1}^\infty\alpha_iv_i}^2\leq B\sum_{i=1}^\infty\alpha_i^2 \] uniformly for all sequences \(\set{\alpha_i}_{i\geq 1}\) with \(\sum_{i\geq 1}\alpha_i^2<\infty\).

  • In infinite dimension, the existence of \(A,B>0\) is not automatic.

Dual basis in infinite dimension

  • Computing the expansion on a Riesz basis is not as simple in infinite dimension: the Gram matrix is "infinite"

  • The Gramian is a linear operator \[ \calG:\ell_2(\bbZ)\to\ell_2(\bbZ): \bfx\mapsto \bfy\text{ with }[\calG(\bfx)]_n\eqdef y_n=\sum_{\ell=-\infty}^\infty\dotp{v_\ell}{v_n}x_\ell \]

  • Fact: there exists another linear operator \(\calH:\ell_2(\bbZ)\to\ell_2(\bbZ)\) such that \[ \calH(\calG(\bfx)) = \bfx \] We can replicate what we did in finite dimension!
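
A toy numerical sketch, under the assumption that truncating the Gramian to a large finite section is a reasonable proxy for the infinite operator: the system \(v_n=e_n+\tfrac{1}{2}e_{n+1}\) of shifted vectors in \(\ell_2(\bbZ)\) has a tridiagonal Gramian with \(\dotp{v_n}{v_n}=1.25\) and \(\dotp{v_n}{v_{n\pm 1}}=0.5\); its truncations stay well conditioned, so applying the (truncated) inverse recovers \(\bfx\).

```python
# Toy illustration: truncate the tridiagonal Gramian of v_n = e_n + 0.5 e_{n+1} to a
# large finite section, check its eigenvalues stay in a fixed interval [A, B] with A > 0,
# and verify that applying the inverse of the truncation recovers x (H(G(x)) = x).
import numpy as np

N = 200                                              # truncation size (approximation only)
G = 1.25 * np.eye(N) + 0.5 * (np.eye(N, k=1) + np.eye(N, k=-1))
eigvals = np.linalg.eigvalsh(G)
print(eigvals[0], eigvals[-1])                       # stays within (0.25, 2.25)

rng = np.random.default_rng(3)
x = rng.standard_normal(N)
print(np.allclose(np.linalg.solve(G, G @ x), x))     # inverse of the truncation recovers x
```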

Regression

  • A fundamental problem in supervised machine learning can be cast as follows \[ \textsf{Given dataset }\calD\eqdef\{(\vecx_i,y_i)\}_{i=1}^n\textsf{, how do we find $f$ such that $f(\bfx_i)\approx y_i$ for all }i? \]

    • Often \(\vecx_i\in\bbR^d\), but sometimes \(\vecx_i\) is a weirder object (think tRNA string)
    • If \(y_i\in\calY\subseteq\bbR\) with \(\card{\calY}<\infty\), the problem is called classification
    • If \(y_i\in\calY=\bbR\), the problem is called regression
  • We need to introduce several ingredients to make the question well defined

    1. We need a class \(\calF\) to which \(f\) should belong
    2. We need a loss function \(\ell:\bbR\times\bbR\to\bbR^+\) to measure the quality of our approximation
  • We can then formulate the question as \[ \min_{f\in\calF}\sum_{i=1}^n\ell(f(\bfx_i),y_i) \]

  • We will focus quite a bit on the square loss \(\ell(u,v)\eqdef (u-v)^2\); the resulting problem is called least-squares regression (a tiny numerical illustration of the objective follows below)
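
A tiny sketch of the objective \(\sum_{i=1}^n\ell(f(\bfx_i),y_i)\) with the square loss, evaluated for one hand-picked candidate \(f\) on a made-up dataset (both the dataset and the candidate are assumptions for illustration only; the next slides show how to minimize over \(f\)).

```python
# Evaluate the empirical objective sum_i l(f(x_i), y_i) with the square loss
# for one candidate predictor f on a small made-up dataset D = {(x_i, y_i)}.
import numpy as np

X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])   # x_i in R^2
y = np.array([1.0, 2.0, 3.0])

def loss(u, v):                     # square loss l(u, v) = (u - v)^2
    return (u - v) ** 2

def f(x):                           # one candidate predictor (to be optimized over F)
    return x[0] + x[1]

objective = sum(loss(f(xi), yi) for xi, yi in zip(X, y))
print(objective)                    # 0 + 1 + 1 = 2.0 for this candidate
```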

Least-squares linear regression

  • A common choice of \(\calF\) is the set of continuous linear functions.

    • \(f:\bbR^d\to\bbR\) is linear iff \[ \forall \bfx,\bfy\in\bbR^d,\lambda,\mu\in\bbR\quad f(\lambda\bfx + \mu\bfy) = \lambda f(\bfx)+\mu f(\bfy) \]

    • We will see that every continuous linear function on \(\bbR^d\) is actually an inner product, i.e., \[ \exists \bftheta_f\in\bbR^d\textsf{ s.t. } f(\bfx)=\bftheta_f^\intercal\bfx \quad\forall \bfx\in\bbR^d \]
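
A quick numerical check of this representation in \(\bbR^3\) (the particular linear \(f\) below is made up for illustration): recover \(\bftheta_f\) by evaluating \(f\) on the canonical basis vectors, then verify \(f(\bfx)=\bftheta_f^\intercal\bfx\).

```python
# Representation of a linear function on R^d as an inner product: theta_f is obtained
# by evaluating f on the canonical basis vectors, and then f(x) = theta_f^T x for all x.
import numpy as np

d = 3
def f(x):                              # an arbitrary linear function on R^3 (illustrative)
    return 2.0 * x[0] - x[1] + 0.5 * x[2]

theta_f = np.array([f(e) for e in np.eye(d)])   # theta_f[j] = f(e_j)

rng = np.random.default_rng(4)
x = rng.standard_normal(d)
print(np.isclose(f(x), theta_f @ x))   # True
```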

Least-squares affine regression

  • Canonical form I

    • Stack the \(\vecx_i\) as rows of a matrix \(\matX\in\bbR^{n\times d}\) and the \(y_i\) as entries of a column vector \(\bfy\in\bbR^n\) \[ \min_{\bftheta\in\bbR^d} \norm[2]{\bfy-\matX\bftheta}^2\textsf{ with } \matX\eqdef\mat{c}{-\vecx_1^\intercal-\\\vdots\\-\vecx_n^\intercal-} \]
  • Canonical form II

    • Allow for affine functions (not just linear)

    • Add a 1 to every \(\vecx_i\) \[ \min_{\bftheta\in\bbR^{d+1}} \norm[2]{\bfy-\matX\bftheta}^2\textsf{ with } \matX\eqdef\mat{c}{1-\vecx_1^\intercal-\\\vdots\\1-\vecx_n^\intercal-} \]
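
A minimal NumPy sketch of canonical form II on synthetic data (the generative model and noise level are assumptions for illustration): prepend a column of ones to \(\matX\) and solve the least-squares problem with np.linalg.lstsq.

```python
# Affine least squares (canonical form II): add a column of ones to X so the first
# component of theta plays the role of the intercept, then solve min ||y - X theta||^2.
import numpy as np

rng = np.random.default_rng(5)
n, d = 50, 3
X = rng.standard_normal((n, d))
theta_true = np.array([1.0, -2.0, 0.5])
y = 0.7 + X @ theta_true + 0.1 * rng.standard_normal(n)   # affine model + noise

X_aug = np.hstack([np.ones((n, 1)), X])                    # add a 1 to every x_i
theta_hat, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
print(theta_hat)      # approximately [0.7, 1.0, -2.0, 0.5] (intercept first)
```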

Nonlinear regression using a basis

  • Let \(\calF\) be a \(d\)-dimensional subspace of a vector space of functions with basis \(\set{\psi_i}_{i=1}^d\)

    • We model \(f(\bfx) = \sum_{i=1}^d\theta_i\psi_i(\bfx)\)
  • The problem becomes \[ \min_{\bftheta\in\bbR^d}\norm[2]{\bfy-\boldsymbol{\Psi}\bftheta}^2\textsf{ with }\boldsymbol{\Psi}\eqdef \mat{c}{-\psi(\bfx_1)^\intercal-\\\vdots\\-\psi(\bfx_n)^\intercal-}\eqdef\mat{cccc}{\psi_1(\bfx_1)&\psi_2(\bfx_1)&\cdots&\psi_d(\bfx_1)\\ \vdots&\vdots&\vdots&\vdots\\ \psi_1(\bfx_n)&\psi_2(\bfx_n)&\cdots&\psi_d(\bfx_n) } \]

  • We are recovering a nonlinear function of a continuous variable

    • This is the exact same computational framework as linear regression.
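
A minimal sketch with monomial basis functions \(\psi_i(x)=x^{i-1}\) on scalar inputs (the target function and basis choice are assumptions for illustration): build \(\boldsymbol{\Psi}\) with np.vander and reuse exactly the same least-squares solver as before.

```python
# Nonlinear regression with a basis: fit f(x) = sum_i theta_i psi_i(x) with monomial
# basis functions psi_i(x) = x^{i-1}; the computation is the same least-squares problem,
# applied to the feature matrix Psi instead of X.
import numpy as np

rng = np.random.default_rng(6)
n, d = 40, 4                                            # n samples, d basis functions
x = np.sort(rng.uniform(-1, 1, size=n))
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(n)    # nonlinear target + noise

Psi = np.vander(x, N=d, increasing=True)                # Psi[i, j] = psi_{j+1}(x_i) = x_i^j
theta_hat, *_ = np.linalg.lstsq(Psi, y, rcond=None)
y_hat = Psi @ theta_hat                                 # nonlinear in x, linear in theta
print(np.linalg.norm(y - y_hat))                        # residual of the cubic fit
```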

Next time

  • Next class: Monday September 16, 2024
    • We will talk about solving regression problems
    • This will highlight why linear algebra concepts are relevant for machine learning
  • To be effectively prepared for next class, you should:
    1. Go over today's slides and read associated lecture notes
    2. Work on Homework 2 (due date: Friday September 13, 2024)
  • Optional
    • Export slides for next lecture as PDF (be on the lookout for an announcement when they're ready)