2025-02-17 graphs lecture 8

[[lecture-data]]

2025-02-17

Quick note from last lecture on sparse matrix representations
Compressed Sparse Row (CSR) representation

$A \in \mathbb{R}^{n \times m}$ can become a set of 3 tensors/"lists". We get this representation by realizing that we can encode the row index in the COO representation a bit more efficiently.

Instead of writing out the row index for each element, we can collect the column indices for each row, put each of these collections together, and then have a pointer tell us where to start reading.

If A has z non-zero entries, then

  • the column tensor contains z entries. Each entry contains the column index for one of the non-zero elements, sorted by ascending row index.
  • the row (pointer) tensor contains n+1 entries
    • the first n entries give the index in the column tensor where each row's elements start
    • the last entry is z
  • the value tensor contains z entries: the non-zero elements themselves, sorted by row and then column index.

If you've taken a numerical linear algebra class, then you're familiar with seeing how the total number of operations (e.g., for a sparse matrix-vector product) is

$$\text{total operations} = \sum_{i=1}^{n} |N_i| = \sum_{i=1}^{n} d_i = \mathbf{1}^T A \mathbf{1} = |E|$$
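
To make this concrete, here is a minimal NumPy sketch (my own illustration, not from the lecture) of a CSR matrix-vector product; the inner loop for row $i$ runs exactly $|N_i| = d_i$ times, which is where the operation count above comes from.

```python
import numpy as np

# Illustrative sketch: CSR mat-vec y = A @ x.
# col[k] : column index of the k-th nonzero (rows in ascending order)
# ptr[i] : index in col/val where row i's entries start; ptr[n] = z
# val[k] : the k-th nonzero value
def csr_matvec(ptr, col, val, x):
    n = len(ptr) - 1
    y = np.zeros(n)
    for i in range(n):                      # one pass over the rows
        for k in range(ptr[i], ptr[i + 1]): # |N(i)| = d_i operations for row i
            y[i] += val[k] * x[col[k]]
    return y                                # total work: sum_i d_i multiplies

# A = [[0, 2, 0],
#      [1, 0, 3],
#      [0, 0, 4]]
ptr = np.array([0, 1, 3, 4])
col = np.array([1, 0, 2, 2])
val = np.array([2.0, 1.0, 3.0, 4.0])
print(csr_matvec(ptr, col, val, np.array([1.0, 1.0, 1.0])))  # [2. 4. 4.]
```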
Summary

Last Time

  • information theoretic threshold
  • community detection

Today

  • See [[School/Archived/Equivariant Machine Learning/course_PDFs/Lecture 8.pdf]]
  • Popular GNN architectures that can be expressed in convolutional form
    • MPNN
    • GCN
    • ChebNet
    • GraphSAGE
  • GATs

1. Graph Signals and Graph Signal Processing

Recall the definition of a graph neural network:

GNN

A "full-fledged" GNN layer is a multi-dimensional generalization of the multi-layer graph perceptron given by
x=σ(U)=σ(k=0K1Skx1H,k)
H,kRd1×d for 1L, where

Here, XRn×d is still called an -layer embedding.

This is a convolutional GNN because it is based on graph convolution.
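
As a reference point for the architectures below, here is a short NumPy sketch (my own, following the layer formula above; the choice of $\sigma$ and the shapes are illustrative) of one convolutional GNN layer.

```python
import numpy as np

# Sketch (not from the lecture): one convolutional GNN layer
# x_l = sigma( sum_{k=0}^{K-1} S^k x_{l-1} H_{l,k} )
def gnn_layer(S, x_prev, H, sigma=np.tanh):
    """S: (n, n) shift operator; x_prev: (n, d_prev); H: list of K matrices of shape (d_prev, d_next)."""
    U = np.zeros((x_prev.shape[0], H[0].shape[1]))
    Skx = x_prev                      # S^0 x_{l-1}
    for H_k in H:
        U = U + Skx @ H_k             # accumulate S^k x_{l-1} H_{l,k}
        Skx = S @ Skx                 # advance to S^{k+1} x_{l-1}
    return sigma(U)
```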

Message Passing Neural Network (MPNN)

This type of network was initially introduced by Gilmer et al. In a message passing neural network, each layer consists of 2 operations: the message and the update.

The message is a function that processes

  • the signal at node i
  • the message/signal at each of the neighbors of i
  • the edge weights
$$m_\ell = M(x_{\ell-1}), \qquad (m_\ell)_i = \sum_{j \in N(i)} M\big((x_{\ell-1})_i, (x_{\ell-1})_j, A_{ij}\big)$$

The update is a function on the signal of the graph from the previous layer to the next, taking into account the message and the current value at each node:

$$(x_\ell)_i = U\big((x_{\ell-1})_i, (m_\ell)_i\big)$$

(see message passing neural network)

Idea

Consider an MPNN. As long as M is linear, the "message" can be expressed as a graph convolution.

Example

$$M\big[(x_{\ell-1})_i, (x_{\ell-1})_j, A_{ij}\big] = \alpha (x_{\ell-1})_i + \beta A_{ij} (x_{\ell-1})_j \quad \implies \quad m_\ell = \alpha x_{\ell-1} + \beta A x_{\ell-1}$$

which is a graph convolution with $K=2$, $S=A$, $h_0=\alpha$, $h_1=\beta$

As long as U is the composition of a pointwise nonlinearity $\sigma$ with a linear function, then $x_\ell$ can be expressed as a GNN layer

Example

$$U\big[(x_{\ell-1})_i, (m_\ell)_i\big] = \sigma\big(\alpha' (x_{\ell-1})_i + \beta' (m_\ell)_i\big) \quad \implies \quad x_\ell = \sigma\big((\alpha' + \beta'\alpha)\, x_{\ell-1} + \beta'\beta\, A x_{\ell-1}\big)$$

Which is a graph convolution with $K=2$, $S=A$, $h_0=\alpha'+\beta'\alpha$, $h_1=\beta'\beta$

(see MPNNs can be expressed as graph convolutions)
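
A quick numerical sanity check of this claim (my own sketch; the values of $\alpha, \alpha', \beta, \beta'$ are arbitrary): computing the message and update node-wise gives the same output as the two-tap graph convolution with $h_0 = \alpha' + \beta'\alpha$, $h_1 = \beta'\beta$.

```python
import numpy as np

# Sketch (not from the lecture): a linear MPNN equals a two-tap graph convolution.
rng = np.random.default_rng(0)
n = 5
A = rng.random((n, n)); A = (A + A.T) / 2; np.fill_diagonal(A, 0)  # symmetric edge weights
x = rng.standard_normal(n)
a, b, a_p, b_p = 0.3, 0.7, 0.5, 1.2        # alpha, beta, alpha', beta' (illustrative values)

# MPNN form: message then update
m = a * x + b * (A @ x)                    # m = alpha*x + beta*A x
x_mpnn = np.tanh(a_p * x + b_p * m)        # update with pointwise nonlinearity

# Graph-convolution form: h0*x + h1*S x with S = A
h0, h1 = a_p + b_p * a, b_p * b
x_conv = np.tanh(h0 * x + h1 * (A @ x))

print(np.allclose(x_mpnn, x_conv))         # True
```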

Question

From the perspective of learning the weights ($h_0, h_1$ vs. $\alpha, \alpha', \beta, \beta'$), what is different?

Graph Convolutional Networks

Introduced by Kipf and Welling, 2017, a graph convolutional network layer is given by

$$(x_\ell)_i = \sigma\left[\sum_{j \in N(i)} \frac{(x_{\ell-1})_j}{|N(i)|}\, H_\ell\right]$$

Note that the node signals $(x_{\ell-1})_j$ here are row vectors, so $H_\ell$ multiplies on the right. We can think of each layer as a "degree-normalized aggregation"

(see graph convolutional network)

Idea

We can write a graph convolutional network layer as a graph convolution. This is easy to see by simply writing

$$x_\ell = \sigma(S x_{\ell-1} H_\ell)$$

Which is exactly the form of a graph convolution with

  • $K = 2$
  • $S = D_{\text{binary}}^{-1/2} A_{\text{binary}} D_{\text{binary}}^{-1/2}$
  • $H_{\ell,0} = 0$, $H_{\ell,1} = H_\ell$

And $A_{\text{binary}}$ is the binary adjacency matrix (obtained as `(A > 0).astype(float)` in Python, for example)

(see GCN layers can be written as graph convolutions)
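
A minimal NumPy sketch of this form (my own illustration; the nonlinearity is arbitrary):

```python
import numpy as np

# Sketch (not from the lecture): one GCN layer x_l = sigma(S x_{l-1} H_l),
# with S the degree-normalized binary adjacency.
def gcn_layer(A, x_prev, H, sigma=np.tanh):
    A_bin = (A > 0).astype(float)                          # binary adjacency
    d = A_bin.sum(axis=1)                                  # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    S = D_inv_sqrt @ A_bin @ D_inv_sqrt                    # S = D^{-1/2} A D^{-1/2}
    return sigma(S @ x_prev @ H)                           # one local diffusion step
```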

Note

Advantages of GCNs:

  • One local diffusion step per layer
  • Simple and interpretable
  • Neighborhood-size normalization prevents the vanishing/exploding gradient problem (there is only one diffusion step)

Disadvantages

  • no edge weights (only the binary adjacency is used)
  • only supports the adjacency as the shift operator $S$
  • no self-loops unless they are present in $A$ (the embedding at layer $\ell$ is not informed by the node's own embedding at layer $\ell-1$). Even if self-loops are in the graph, there is no ability to weight $(x_{\ell-1})_i$ and $(x_{\ell-1})_j$, $j \in N(i)$, differently.
ChebNet

The idea behind ChebNets, which were introduced by Defferrard et al. in 2016, is to create a "fast" implementation of graph convolution in the spectral domain. Each layer of a ChebNet is given by

$$x_\ell = \sigma\left(\sum_{k=0}^{K-1} T_k(\bar{L})\, x_{\ell-1} H_{\ell,k}\right)$$

Where $\bar{L} = \frac{2L}{\lambda_{\max}} - I$ to ensure the eigenvalues lie in $[-1, 1]$, and the $T_k$ are the Chebyshev Polynomials.

We use the Chebyshev polynomials instead of Taylor polynomials to approximate an analytic function because

  1. By the chebyshev equioscillation theorem, they provide a better approximation with fewer coefficients
  2. chebyshev polynomials are orthogonal, meaning we can find an orthonormal basis for free. There is no need to implement the filters in the spectral domain.

(see ChebNet)
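
A sketch of one ChebNet layer in NumPy (my own, under the notation above): the $T_k(\bar{L})\,x_{\ell-1}$ terms are built with the Chebyshev recurrence stated further below, so the eigendecomposition of $L$ is never computed.

```python
import numpy as np

# Sketch (not from the lecture): one ChebNet layer
# x_l = sigma( sum_{k=0}^{K-1} T_k(L_bar) x_{l-1} H_{l,k} ),
# with T_k(L_bar) x built by the Chebyshev recurrence -- no eigendecomposition of L.
def chebnet_layer(L, x_prev, H, lam_max, sigma=np.tanh):
    """L: (n, n) Laplacian; x_prev: (n, d_prev); H: list of K matrices of shape (d_prev, d_next)."""
    n = L.shape[0]
    L_bar = (2.0 / lam_max) * L - np.eye(n)       # rescale spectrum into [-1, 1]
    T_prev, T_curr = x_prev, L_bar @ x_prev       # T_0(L_bar) x_{l-1}, T_1(L_bar) x_{l-1}
    U = T_prev @ H[0]                             # k = 0 term
    for H_k in H[1:]:                             # k = 1, ..., K-1
        U = U + T_curr @ H_k
        T_prev, T_curr = T_curr, 2 * (L_bar @ T_curr) - T_prev  # T_{k+1} = 2 L_bar T_k - T_{k-1}
    return sigma(U)
```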

Learning in the spectral domain provides an "inductive bias" for learning localized spectral filters
![[2025-02-17-graph.png]]
In practice, we need to use an analytic function surrogate for this type of filter (solid red line).

Problem

Recall that we can represent any analytic function with convolutional graph filters, but translating back to the spectral domain is expensive because we need to compute the graph Fourier transform $\mathrm{GFT}(x) = \hat{x} = V^T x$ and also the inverse graph Fourier transform of $h(\lambda)\hat{x}$, namely $y = V h(\lambda)\hat{x}$.

This costs 2 additional matrix-vector multiplications, plus the diagonalization of L

In the typical spectral representation of a convolutional graph filter, we have

$$h(\lambda) = \sum_{k=0}^{K-1} h_k \lambda^k$$

where $S = L$ is the Laplacian and $\lambda$ ranges over its eigenvalues. Since the spectral graph filter acts on the signal pointwise in the spectral domain, the same weights $h_k$ are shared across all eigenvalues (frequency components).

Thus the expressivity of these filters is quite strong: they can represent any analytic function of the Laplacian. But, as noted above, translating to and from the spectral domain costs 2 additional matrix-vector multiplications plus the diagonalization of $L$. Is there a way to keep the expressivity of the graph convolution without having to compute as many expensive operations?
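
To make the cost comparison concrete, here is a small NumPy sketch (my own) showing that the same polynomial filter can be applied either in the spectral domain (requiring the eigendecomposition of $L$ plus the two extra matrix-vector products) or directly in the node domain with powers of $L$.

```python
import numpy as np

# Sketch (not from the lecture): the same polynomial filter applied
# (a) in the spectral domain via the eigendecomposition of L, and
# (b) directly in the node domain, with no diagonalization.
rng = np.random.default_rng(1)
n = 6
A = rng.random((n, n)); A = (A + A.T) / 2; np.fill_diagonal(A, 0)
L = np.diag(A.sum(axis=1)) - A                   # combinatorial Laplacian
x = rng.standard_normal(n)
h = np.array([0.5, -0.2, 0.1])                   # filter taps h_0, h_1, h_2 (illustrative)

# (a) spectral: GFT, pointwise multiply by h(lambda), inverse GFT
lam, V = np.linalg.eigh(L)
x_hat = V.T @ x                                  # GFT(x) = V^T x
y_spec = V @ (np.polyval(h[::-1], lam) * x_hat)  # y = V h(lambda) x_hat

# (b) node domain: y = sum_k h_k L^k x
y_node = sum(hk * np.linalg.matrix_power(L, k) @ x for k, hk in enumerate(h))
print(np.allclose(y_spec, y_node))               # True
```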

Chebyshev Polynomial

The Chebyshev Polynomials can be calculated using the recurrence relation:
$$T_0(\lambda) = 1, \qquad T_1(\lambda) = \lambda, \qquad T_{k+1}(\lambda) = 2\lambda T_k(\lambda) - T_{k-1}(\lambda)$$

And each $T_n$ has the property that $T_n(\cos\theta) = \cos(n\theta)$
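
A quick numerical check of this property using the recurrence (my own sketch):

```python
import numpy as np

# Sketch: verify the recurrence reproduces T_n(cos t) = cos(n t).
t = np.linspace(0, np.pi, 7)
lam = np.cos(t)
T_prev, T_curr = np.ones_like(lam), lam          # T_0, T_1
for n in range(2, 6):
    T_prev, T_curr = T_curr, 2 * lam * T_curr - T_prev
    assert np.allclose(T_curr, np.cos(n * t))    # T_n(cos t) == cos(n t)
print("recurrence matches cos(n*t) for n = 2..5")
```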

In conventional signal processing, filters based on these polynomials have better cutoff behavior in the spectral domain (compared to general polynomial filters)

For the formal theorem statement/proof: see chebyshev equioscillation theorem

Using $h(\lambda) = \sum_{k=0}^{K-1} h_k \lambda^k$ requires more coefficients than

$$h(\lambda) = \sum_{k=0}^{K'-1} h'_k T_k(\lambda),$$

i.e., $K' < K$ for the same approximation quality.

Intuition for why this is better than the Taylor polynomials: the Chebyshev approximation minimizes the maximum error over the whole interval $[-1,1]$ (equioscillation), whereas a truncated Taylor series is only accurate near its expansion point.

Additionally,

Theorem

Let $\displaystyle \langle f, g \rangle = \int_{-1}^{1} f(x)\, g(x)\, \frac{1}{\sqrt{1-x^2}}\, dx$

The Chebyshev Polynomials $T_0, T_1, \dots$ are orthogonal with respect to the inner product $\langle \cdot, \cdot \rangle$.

(see chebyshev polynomials are orthogonal)

Proof

$$
\begin{aligned}
\int_{-1}^{1} \frac{T_n(x)\,T_m(x)}{\sqrt{1-x^2}}\,dx
&\overset{x=\cos\theta,\ dx=-\sin\theta\,d\theta}{=} \int_{\pi}^{2\pi} T_n(\cos\theta)\,T_m(\cos\theta)\,\frac{-\sin\theta}{\sqrt{1-\cos^2\theta}}\,d\theta \\
&= \int_{\pi}^{2\pi} \cos(n\theta)\cos(m\theta)\,d\theta \qquad \text{since } T_n(\cos\theta)=\cos(n\theta) \\
&= \int_{\pi}^{2\pi} \frac{e^{in\theta}+e^{-in\theta}}{2}\cdot\frac{e^{im\theta}+e^{-im\theta}}{2}\,d\theta \\
&= \frac{1}{4}\int_{\pi}^{2\pi} e^{i(n+m)\theta}+e^{-i(n+m)\theta}+e^{i(n-m)\theta}+e^{-i(n-m)\theta}\,d\theta \\
&= \begin{cases}
\frac{1}{4}\int_{\pi}^{2\pi} 4\,d\theta = \theta\big|_{\pi}^{2\pi} = \pi & \text{if } n=m=0 \\
\frac{1}{4}\int_{\pi}^{2\pi} 2\,d\theta + \frac{1}{4}\int_{\pi}^{2\pi} 2\cos((n+m)\theta)\,d\theta = \frac{\pi}{2}+0=\frac{\pi}{2} & \text{if } n=m\neq 0 \\
\frac{1}{4}\int_{\pi}^{2\pi} 2\cos((n+m)\theta)+2\cos((n-m)\theta)\,d\theta = 0 & \text{if } n\neq m
\end{cases}
\end{aligned}
$$

(On $[\pi, 2\pi]$ we have $\sqrt{1-\cos^2\theta} = |\sin\theta| = -\sin\theta$, so the factor from the substitution cancels to 1.)

Note

Advantages

  • Fast and cheap - easy to calculate polynomials and they are orthogonal for free
  • localized spectral filters

Disadvantages

  • less stable than Taylor approximations
    • perturbing the adjacency matrix can cause large fluctuations in the chebyshev polynomial approximation, whereas the taylor approximations will remain basically the same
  • S is restricted to L
  • difficult to interpret in the node domain
Graph SAGE

Introduced by Hamilton-Ying-Leskovec in 2017, each Graph SAGE layer implements the following 2 operations:

Aggregate

$$(U_\ell)_i = \text{AGGREGATE}\big(\{(x_{\ell-1})_j,\ j \in N(i)\}\big)$$

Concatenate

$$(x_\ell)_i = \sigma\big(\text{CONCAT}\big((x_{\ell-1})_i,\ (U_\ell)_i\big)\big)$$

The standard AGGREGATE operation is an average over $N(i)$. Letting $H_\ell = [H_0 \;\; H_1]^T$, we get

$$(U_\ell)_i = \frac{1}{|N(i)|}\sum_{j\in N(i)} (x_{\ell-1})_j, \qquad U_\ell = S x_{\ell-1}, \quad S = D^{-1/2} A D^{-1/2}$$
$$x_\ell = \sigma\big([x_{\ell-1}\;\; U_\ell]\, H_\ell\big) = \sigma\big(x_{\ell-1} H_0 + S x_{\ell-1} H_1\big)$$

where $A$ is the binary adjacency matrix, which is equivalent to a GCN with the nodewise normalization $(x_\ell)_i \leftarrow (x_\ell)_i / \|(x_\ell)_i\|_2$. This normalization helps in some cases (empirically).

(see graph SAGE)
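
A sketch of one GraphSAGE layer with mean aggregation in NumPy (my own; here I use the plain neighborhood average for the AGGREGATE step, one common choice, and include the nodewise $\ell_2$ normalization mentioned above):

```python
import numpy as np

# Sketch (not from the lecture): one GraphSAGE layer in the equivalent form
# x_l = sigma(x_{l-1} H0 + U_l H1), with U_l the neighborhood mean.
def sage_layer(A, x_prev, H0, H1, sigma=np.tanh):
    A_bin = (A > 0).astype(float)                        # binary adjacency
    deg = np.maximum(A_bin.sum(axis=1, keepdims=True), 1.0)
    U = (A_bin @ x_prev) / deg                           # AGGREGATE: mean over N(i)
    x_new = sigma(x_prev @ H0 + U @ H1)                  # CONCAT + linear map, then sigma
    norms = np.linalg.norm(x_new, axis=1, keepdims=True) + 1e-12
    return x_new / norms                                 # nodewise L2 normalization
```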

notes

  • SAGE implementations allow for a variety of AGGREGATE functions, including max and LSTMs (which look at N(i) as a sequence, losing permutation equivariance). In these cases, SAGE is no longer a graph convolution.

    • There are both advantages and disadvantages of staying a GCN and using other functions.
  • the authors of SAGE popularized the AGGREGATE/UPDATE representation of GNNs (with $K=2$), meaning

$$x_\ell = \sigma(S x_{\ell-1} H_\ell)$$

where $S x_{\ell-1}$ is the "aggregate" and $\sigma(\,\cdot\, H_\ell)$ is the "update"