Random Matrix Lecture 02

[[lecture-data]]

:LiArrowBigLeftDash: Random Matrix Lecture 01 | Random Matrix Lecture 03 :LiArrowBigRightDash:

Class Notes

pg 7 - 12 (1.4 - 2.1)

Summary

  • Proof of concentration of Gaussian random vector norms (see end of [[Random Matrix Lecture 01]] / [[concentration inequality for magnitude of standard gaussian random vector]])
  • how to think about high-dimensional geometry
  • multiplying by a Gaussian matrix
  • Gaussian process viewpoint

1. Random Vector Theory

1.4 Consequences for High-Dimensional Geometry

Last time, we saw the statement and proof for the [[concentration inequality for magnitude of standard gaussian random vector]]. This is one of the fundamental facts of high-dimensional geometry.

This tells us that, with high probability, we have

$$\begin{align} \lvert \lvert x \rvert \rvert ^2 &= d + {\cal O}(\sqrt{ d }) \\ \implies \lvert \lvert x \rvert \rvert &= \sqrt{ d + {\cal O}(\sqrt{ d }) } \\ &= \sqrt{ d } \cdot \sqrt{ 1+ {\cal O}\left( \frac{1}{\sqrt{ d }} \right)} \\ &= \sqrt{ d }\left( 1+{\cal O}\left( \frac{1}{\sqrt{ d }} \right) \right) \\ &= \sqrt{ d } + {\cal O}(1) \end{align}$$

ie, the standard [[gaussian random vector]] falls, with high probability, inside a spherical shell of width ${\cal O}(1)$ around the sphere of radius $\sqrt{ d }$.
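
As a quick numerical sanity check (my own sketch, not from the lecture), the following NumPy snippet samples standard Gaussian vectors in several dimensions and measures how far their norms land from $\sqrt{ d }$; the deviations should stay ${\cal O}(1)$ even as $d$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# For each dimension d, draw standard Gaussian vectors and measure how far
# their norms land from sqrt(d). The spread should stay O(1), not grow with d.
for d in [10, 100, 1_000, 10_000]:
    x = rng.standard_normal((2_000, d))      # 2000 samples of x ~ N(0, I_d)
    deviations = np.linalg.norm(x, axis=1) - np.sqrt(d)
    print(f"d = {d:6d}   mean deviation = {deviations.mean():+.3f}   "
          f"std = {deviations.std():.3f}")
```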

Important

The standard [[gaussian random vector]] tends towards {as||the surface of the sphere of radius $\sqrt{ d }$||"typical set" shape} and falls within an "error region" surrounding {ha||that surface that has a width of ${\cal O}(1)$||error region description}

Note

This means a high-dimensional Gaussian has a {1||non-convex} typical set - it is the shape of {2||a hollow spherical shell||shape}!

  • (Sometimes these vectors are called thin-shell vectors in convex geometry and probability)
  • Contrast this with how we normally think of a Gaussian high-probability/"typical" set in 1 or 2 dimensions, which is usually some sort of solid blob around the origin.

There are some other high-dimensional behaviors of random variables that can be explained in the same way.

Example

Consider $x \sim \text{Uniform}([-1,+1]^d)$. A simple calculation yields

$$\mathbb{E}[\lvert \lvert x \rvert \rvert ^2 ] = d\, \mathbb{E}[x_{1}^2] = \frac{d}{3}$$

And a similar calculation to the concentration inequality in the gaussian case yields, with high probability, that

$$\lvert \lvert x \rvert \rvert = \sqrt{ \frac{d}{3} } + {\cal O}(1)$$

Which means almost all the mass of this distribution is at the corners of this $d$-dimensional box instead of the inscribed solid unit ball $B^d := \{ x \in \mathbb{R}^d : \lvert \lvert x \rvert \rvert \leq 1 \}$. The concentration inequality in this case also implies that the ball relative to the box has bound

$$\begin{align} \frac{\text{Vol}(B^d)}{\text{Vol}([-1, +1]^d)} &= \mathbb{P}_{x \sim \text{Unif}([-1, +1]^d)}\,[\lvert \lvert x \rvert \rvert \leq 1] \\ &\leq \mathbb{P}_{x} [\,\lvert \, \lvert \lvert x \rvert \rvert ^2 - d \,\rvert \geq d-1] \\ &= \exp(-{\cal O}(d)) \end{align}$$

ie, the volume of the unit ball is exponentially smaller than the volume of the box.
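
Since this probability is far too small to estimate by direct Monte Carlo sampling, here is a sketch (my own, using the standard ball-volume formula $\text{Vol}(B^d) = \pi^{d/2}/\Gamma(d/2+1)$, which is not derived in the notes) that computes the exact ratio in log-space:

```python
from math import lgamma, log, pi

# Exact ratio Vol(B^d) / Vol([-1,1]^d) via Vol(B^d) = pi^(d/2) / Gamma(d/2 + 1).
# Working in log-space avoids underflow; the ratio is exponentially small in d,
# consistent with the exp(-O(d)) bound above.
for d in [2, 10, 50, 100]:
    log_ratio = (d / 2) * log(pi) - lgamma(d / 2 + 1) - d * log(2)
    print(f"d = {d:4d}   Vol(B^d) / Vol([-1,1]^d) = exp({log_ratio:8.1f})")
```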

Our [[concentration inequality for magnitude of standard gaussian random vector]] also gives us an approximation

Take Away

$${\cal N}(0, I_{d}) \approx \text{Unif}(S^{d-1}(\sqrt{ d }))$$

Caution

This is an informal approximation!

We get this by noting

  1. When $x \sim {\cal N}(0, I_{d})$, we have $\text{Law}\left( \frac{x}{\lvert \lvert x \rvert \rvert} \right) = \text{Unif}(S^{d-1})$ ( which we get from [[normalized standard gaussian random vectors have the orthogonally invariant distribution on the unit sphere]] )
  2. $\lvert \lvert x \rvert \rvert \approx \sqrt{ d }$, which follows from the [[concentration inequality for magnitude of standard gaussian random vector]]

This idea is formalized in [[Borel's central limit theorem]]

(see [[gaussian random vectors are approximately uniform on the hollow sphere]])

Theorem

Let $x \sim \text{Unif}(S^{d-1})$. Then $\sqrt{ d }\, x_{1} \to {\cal N}(0,1)$ in distribution as $d \to \infty$.

  • Further, for any fixed $k \geq 1$, we have $\sqrt{ d }\,(x_{1},\dots,x_{k}) \to {\cal N}(0, I_{k})$ in distribution as $d \to \infty$

(without proof, for now)

See [[Borel's central limit theorem]]
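
A small numerical illustration (my own sketch): sample uniform points on $S^{d-1}$ by normalizing Gaussian vectors (item 1 above) and check that $\sqrt{ d }\,x_{1}$ looks approximately standard normal.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_samples = 500, 100_000

# Sample uniform points on S^{d-1} by normalizing standard Gaussian vectors
# (the orthogonal-invariance fact from item 1 above).
g = rng.standard_normal((n_samples, d))
x = g / np.linalg.norm(g, axis=1, keepdims=True)

# Borel's CLT: sqrt(d) * x_1 should be approximately N(0, 1) for large d.
z = np.sqrt(d) * x[:, 0]
print("mean ≈", round(z.mean(), 3), "   var ≈", round(z.var(), 3))
print("P(|z| <= 1.96) ≈", round(float(np.mean(np.abs(z) <= 1.96)), 3))  # ≈ 0.95
```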


2. Rectangular Matrices

We begin by looking at asymmetric, rectangular matrices with iid Gaussian entries.

We are particularly interested in the d×m matrix

$$G \sim {\cal N}(0,1)^{d \times m}$$

2.1 Gaussian Random Field Viewpoint

We can use our results about Gaussian random vectors to characterize the images of fixed vectors under such a matrix.

Proposition

Let $G \sim {\cal N}(0,1)^{d \times m}$ be a [[random matrix]] and $y \in \mathbb{R}^m$. Then

$$\text{Law}(Gy) = {\cal N}(0, \lvert \lvert y \rvert \rvert ^2 I_{d})$$
Proof

Gy is linear in G. The entries of G form a [[gaussian random vector]], and so Gy must be Gaussian as well. Thus, it suffices to calculate the mean and covariance to find its law.

Since each entry of $G$, call them $G_{i\alpha}$, is iid ${\cal N}(0,1)$, we get

$$\mathbb{E}(Gy)_{i} = \mathbb{E}\left[ \sum_{\alpha=1}^m G_{i\alpha} y_{\alpha} \right] = 0$$

And for the covariance

$$\begin{align} \text{Cov}(\,(Gy)_{i}, (Gy)_{j}\,) &= \mathbb{E}[\,(Gy)_{i}(Gy)_{j}\,] \\ &= \mathbb{E}\left( \left( \sum_{\alpha=1}^m G_{i\alpha} y_{\alpha} \right) \left( \sum_{\beta=1}^m G_{j\beta}y_{\beta} \right) \right) \\ &= \sum_{\alpha,\beta=1}^m y_{\alpha}y_{\beta} \cdot \mathbb{E} [G_{i\alpha} G_{j\beta}] \\ &= \begin{cases} 0 & i \neq j \\ \lvert \lvert y \rvert \rvert ^2 & i=j \end{cases} \\ \implies \text{Cov}(Gy,Gy) &= \lvert \lvert y \rvert \rvert ^2 I_{d} \end{align}$$

Above, we found the expectation and covariance in terms of the entry-wise products and sums. We can also do the calculation in terms of matrices.

Proof (matrices)

Expectation
If there is only one random matrix involved, we can write

$$\mathbb{E}[Gy] = \mathbb{E}[G]\,y \;(= 0) \quad (*)$$

$(*)$ In terms of our same setting above. This is a simple result of linearity of expectation.

Covariance
If $g_{1},\dots,g_{m} \in \mathbb{R}^d$ are the columns of $G$ so that each $g_{\alpha} \sim {\cal N}(0, I_{d})$ iid, then we see that

$$\begin{align} \text{Cov}(Gy) &= \mathbb{E}\left[ (Gy)(Gy)^{\intercal} \right] \\ &= \mathbb{E}\left[ \left( \sum_{\alpha=1}^m y_{\alpha} g_{\alpha} \right)\left( \sum_{\beta=1}^m y_{\beta} g_{\beta}^{\intercal} \right) \right] \\ &= \sum_{\alpha,\beta=1}^m y_{\alpha} y_{\beta} \cdot \mathbb{E}\left[ g_{\alpha}g_{\beta}^{\intercal} \right] \\ (*) &= \sum_{\alpha=1}^m y_{\alpha}^2 \cdot I_{d} \\ &= \lvert \lvert y \rvert \rvert ^2 I_{d} \end{align}$$

Where again $(*)$ is in terms of the setting above. In this line we use the law of the $g_{\alpha}$.

see [[gaussian random matrix transforms vectors into gaussian random vectors]]
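
A quick empirical check of the proposition (my own sketch; the dimensions and sample sizes are arbitrary): for a fixed $y$, the vector $Gy$ over many draws of $G$ should have mean $\approx 0$ and covariance $\approx \lvert \lvert y \rvert \rvert ^2 I_{d}$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, trials = 3, 20, 100_000

y = rng.standard_normal(m)                 # any fixed y; this choice is arbitrary

# Draw many independent G ~ N(0,1)^{d x m} and record Gy for each draw.
G = rng.standard_normal((trials, d, m))
Gy = G @ y                                  # shape (trials, d)

print("||y||^2        =", round(float(y @ y), 3))
print("empirical mean =", np.round(Gy.mean(axis=0), 3))   # ≈ (0, ..., 0)
print("empirical cov  =\n", np.round(np.cov(Gy.T), 3))    # ≈ ||y||^2 * I_d
```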

Using the same techniques, we can easily find the joint law of $Gy_{1},\dots,Gy_{n}$ for $n < \infty$.

Proposition

Let $y_{1},\dots,y_{n} \in \mathbb{R}^m$ and $G \sim {\cal N}(0,1)^{d \times m}$. Then

$$\text{Law}\left(\begin{bmatrix} Gy_{1} \\ \vdots \\ Gy_{n} \end{bmatrix}\right) = {\cal N}\left(0,\, \begin{bmatrix} \lvert \lvert y_{1} \rvert \rvert ^2 I_{d} & \langle y_{1},y_{2} \rangle I_{d} & \dots & \langle y_{1},y_{n} \rangle I_{d} \\ \langle y_{2} , y_{1} \rangle I_{d} & \lvert \lvert y_{2} \rvert \rvert ^2 I_{d} & \dots & \langle y_{2}, y_{n} \rangle I_{d} \\ \vdots & \vdots & \ddots & \vdots \\ \langle y_{n}, y_{1} \rangle I_{d} & \langle y_{n} , y_{2} \rangle I_{d} & \dots & \lvert \lvert y_{n} \rvert \rvert ^2 I_{d} \end{bmatrix}\right)$$

Or, if we define $Y$ such that the $i$th column is $y_{i}$, then the covariance matrix is given by

$$\text{Cov}\left(\begin{bmatrix} Gy_{1} \\ \vdots \\ Gy_{n} \end{bmatrix}\right) = Y^{\intercal}Y \otimes I_{d}$$
Proof

This follows immediately from [[gaussian random matrix transforms vectors into gaussian random vectors]] by applying the calculation to each block of the covariance matrix. (the expectation is identical)

see [[joint of gaussian random transform of finite vectors]]
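
To see the Kronecker/block structure concretely, here is a small empirical check (my own sketch): stack $(Gy_{1},\dots,Gy_{n})$ over many draws of $G$ and compare the sample covariance to $Y^{\intercal}Y \otimes I_{d}$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n, trials = 2, 10, 3, 200_000

Y = rng.standard_normal((m, n))              # fixed vectors y_1, ..., y_n as columns

# For each draw of G, stack (Gy_1, ..., Gy_n) into a single vector of length n*d.
G = rng.standard_normal((trials, d, m))
stacked = (G @ Y).transpose(0, 2, 1).reshape(trials, n * d)

empirical = np.cov(stacked.T)                # sample (nd x nd) covariance
target = np.kron(Y.T @ Y, np.eye(d))         # Y^T Y  (kron)  I_d
# The gap should be small relative to the entries of Y^T Y (which are O(m) here).
print("max |empirical - target| =",
      round(float(np.abs(empirical - target).max()), 3))
```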

This [[joint of gaussian random transform of finite vectors]] is the characterization of a Gaussian random field, which is a term for a high(er)-dimensional Gaussian Process.


In particular, this characterization associates the [[random vector]] $Gy \in \mathbb{R}^d$ to each fixed vector $y \in \mathbb{R}^m$.

(see [[gaussian random field]])

Geometric Interpretations of Multiplication

From the characterization, we can better understand what G is doing when it transforms vectors.

Expected Magnitude

$$\begin{align} \mathbb{E}[\lvert \lvert Gy \rvert \rvert ^2 ] &= \mathrm{Tr}(\text{Cov}(Gy) ) \\ &= d \lvert \lvert y \rvert \rvert ^2 \end{align}$$

Expected Direction

$$\mathbb{E}[\langle Gy_{1}, Gy_{2} \rangle] = \mathrm{Tr}(\text{Cov}(Gy_{1},Gy_{2})) = d\,\langle y_{1},y_{2} \rangle$$

Exercise

Verify this.

In fact, $G$ acts in this way on the entire [[Gram matrix]] for any finite collection of vectors.

Recall

Let $v_{1},\dots,v_{n} \in \mathbb{F}^k$ be a (finite) collection of vectors in $\mathbb{F}^k$, which has [[inner product]] $\langle \cdot,\cdot \rangle$. The Gram Matrix $G$ of the $v_{i}$ is given by

$$G_{ij} = \langle v_{i},v_{j} \rangle, \qquad G = V^{\intercal}V$$

Where $V$ is the matrix with $v_{i}$ as its $i$th column.

We can see then that if $G \sim {\cal N}(0,1)^{d \times m}$ and $Y \in \mathbb{R}^{m \times n}$, then

Expected Gram Matrix

$$\mathbb{E}[\,(GY)^{\intercal}(GY)\,] = d\, Y^{\intercal}Y$$
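
A small check of this Gram-matrix identity (my own sketch, with arbitrary sizes): averaging $(GY)^{\intercal}(GY)$ over many draws of $G$ should recover $d\,Y^{\intercal}Y$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n, trials = 30, 15, 4, 20_000

Y = rng.standard_normal((m, n))              # fixed point cloud as columns of Y

# Average the random Gram matrix (GY)^T (GY) over many independent draws of G.
acc = np.zeros((n, n))
for _ in range(trials):
    G = rng.standard_normal((d, m))
    GY = G @ Y
    acc += GY.T @ GY
empirical = acc / trials

# The gap should be small relative to the entries of d * Y^T Y.
print("max |avg (GY)^T (GY) - d Y^T Y| ≈",
      round(float(np.abs(empirical - d * (Y.T @ Y)).max()), 3))
```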

Notes on the Gram Matrix

Important

Two point clouds have the same Gram matrix if and only if it is possible to get from one point cloud to the other via some orthogonal transformation.

From the [[joint of gaussian random transform of finite vectors]], we can also consider

$$\hat{G} := \frac{1}{\sqrt{ d }} G, \qquad \text{Law}(\hat{G}) = {\cal N}\left( 0, \frac{1}{d} \right)^{d \times m}$$

Then we can see that

$$\begin{align} \mathbb{E}[\,\lvert \lvert \hat{G}y \rvert \rvert ^2 \,] &= \lvert \lvert y \rvert \rvert ^2 \\ \mathbb{E}[\,\langle \hat{G}y_{1}, \hat{G}y_{2} \rangle \,] &= \langle y_{1}, y_{2} \rangle \\ \mathbb{E}\left[ \,(\hat{G}Y)^{\intercal}(\hat{G}Y)\, \right] &= Y^{\intercal}Y \end{align}$$

ie, in expectation, $\hat{G}$ preserves the lengths, angles, and Gram matrices. That is, it preserves all the geometry of $\mathbb{R}^m$.

see [[normalized gaussian random matrix preserves geometry in expectation]]
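
A sketch of this (mine, not from the notes): for one fixed $y$, the average of $\lvert \lvert \hat{G}y \rvert \rvert ^2$ matches $\lvert \lvert y \rvert \rvert ^2$ at every $d$, but the fluctuations around that average shrink only as $d$ grows, which is exactly the caveat discussed below.

```python
import numpy as np

rng = np.random.default_rng(0)
m, trials = 100, 2_000
y = rng.standard_normal(m)
target = float(y @ y)                          # ||y||^2

# For each d, repeatedly draw G_hat = G / sqrt(d) and record ||G_hat y||^2.
# The mean matches ||y||^2 for every d; the spread shrinks only as d grows.
for d in [1, 10, 100, 1000]:
    sq_norms = np.empty(trials)
    for t in range(trials):
        G_hat = rng.standard_normal((d, m)) / np.sqrt(d)
        sq_norms[t] = float(np.sum((G_hat @ y) ** 2))
    print(f"d = {d:5d}   mean ||Ghat y||^2 ≈ {sq_norms.mean():7.2f} "
          f"(target {target:.2f}),   std ≈ {sq_norms.std():6.2f}")
```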

Note

Note that this holds for any $d$, including $d \ll m$ and $d = 1$. This is because we are taking expectations.

If $y$ is selected first, before we realize $\hat{G}$, then we hope that $\lvert \lvert \hat{G}y \rvert \rvert$ will concentrate about $\lvert \lvert y \rvert \rvert$. Our calculations for the joint show that

$$\text{Law}(\hat{G}y) = {\cal N}\left( 0, \frac{1}{d}\lvert \lvert y \rvert \rvert ^2 I_{d} \right)$$

So (with rescaling, obviously), our [[concentration inequality for magnitude of standard gaussian random vector]] holds and we should see the desired behavior.

Caution

Assuming we ensure $\hat{G}$ is drawn independently of $y$, the probability of actually preserving geometry still depends on $d$.

Example

If $d = 1$, then $\hat{G} = g^{\intercal}$ for some $g \sim {\cal N}(0, I_{m})$. We still have

$$\mathbb{E}[\lvert \lvert \hat{G}y_{i} \rvert \rvert ^2 ] = \mathbb{E}[\langle g,y_{i} \rangle ^2 ] = \lvert \lvert y_{i} \rvert \rvert ^2$$

But clearly $\langle g,y_{i} \rangle ^2 \approx \lvert \lvert y_{i} \rvert \rvert ^2$ cannot hold simultaneously for many $y_{i}$, regardless of whether we choose them before realizing $\hat{G}$.

Exercise

Construct some examples of this case
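
One possible construction (a sketch of mine, not an official solution): take the standard basis vectors $y_{i} = e_{i}$, so that $\langle g, e_{i} \rangle^2 = g_{i}^2$. These average to $1 = \lvert \lvert e_{i} \rvert \rvert ^2$, but individually most of them are nowhere near $1$.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 1_000

# Candidate example: y_i = e_i (standard basis), so <g, e_i>^2 = g_i^2.
# These average to 1 = ||e_i||^2, but most individual values are far from 1.
g = rng.standard_normal(m)
sq = g ** 2
print("mean of <g, e_i>^2      :", round(float(sq.mean()), 3))   # ≈ 1
print("min / max               :", round(float(sq.min()), 4), "/",
      round(float(sq.max()), 2))
print("fraction within 10% of 1:", round(float(np.mean(np.abs(sq - 1) < 0.1)), 3))
```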

Caution

If $d < m$, then once $\hat{G}$ is realized, we can pick $y \in \text{Null}(\hat{G})$ so that $0 = \lvert \lvert \hat{G}y \rvert \rvert \neq \lvert \lvert y \rvert \rvert$.

  • In this case, $\hat{G}$ will not preserve the geometry of arbitrary vectors $y$.
  • In particular, we can find vectors $y$ that depend on $\hat{G}$ such that the geometry is badly distorted (see the sketch below).
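
Here is the promised sketch (mine): for $d < m$, an SVD gives a unit vector in $\text{Null}(\hat{G})$ whose image has norm (numerically) zero.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 5, 20                              # d < m, so Null(G_hat) is nontrivial

G_hat = rng.standard_normal((d, m)) / np.sqrt(d)

# The last right-singular vectors of G_hat span its null space (m - d of them
# when d < m); take one of them as the adversarial y.
_, _, Vt = np.linalg.svd(G_hat)
y = Vt[-1]                                 # unit vector with G_hat @ y ≈ 0

print("||y||       =", round(float(np.linalg.norm(y)), 6))    # = 1
print("||G_hat y|| =", float(np.linalg.norm(G_hat @ y)))      # ≈ 0 (machine precision)
```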

Next time:
As long as

  1. $d$ is not too small
  2. The point cloud $y_{1},\dots,y_{n}$ is fixed before drawing $\hat{G}$

then $\hat{G}$ approximately preserves the geometry of the $y_{i}$, acting as an approximate isometry even for $d \ll m$.
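
As a numerical preview of that claim (a sketch of mine; the precise statement and proof come next lecture), fix a point cloud first, draw $\hat{G}$, and measure the worst relative distortion of pairwise distances:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 10_000, 20
Y = rng.standard_normal((m, n))            # point cloud fixed BEFORE drawing G_hat

def max_distortion(d):
    """Worst relative change of a pairwise distance under G_hat = G / sqrt(d)."""
    G_hat = rng.standard_normal((d, m)) / np.sqrt(d)
    Z = G_hat @ Y
    worst = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            orig = np.linalg.norm(Y[:, i] - Y[:, j])
            proj = np.linalg.norm(Z[:, i] - Z[:, j])
            worst = max(worst, abs(proj / orig - 1))
    return worst

# Even for d far smaller than m, the fixed point cloud is only mildly distorted.
for d in [10, 100, 1_000]:
    print(f"d = {d:5d}   worst relative distortion ≈ {max_distortion(d):.3f}")
```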

Review

#flashcards/math
