2025-02-17 graphs lecture 8

[[lecture-data]]

2025-02-17

Quick note from last lecture on sparse matrix representations
Compressed Sparse Row (CSR) representation

$A \in \mathbb{R}^{n \times m}$ can become a set of 3 tensors/"lists". We get this representation by realizing that we can encode the row index in the COO representation a bit more efficiently.

Instead of writing out the row index for each element, we can collect the column indices for each row, put each of these collections together, and then have a pointer tell us where to start reading.

If A has z non-zero entries, then

  • the column tensor contains z entries. Each entry contains the column index for one of the non-zero elements, sorted by ascending row index.
  • the row (pointer) tensor contains n+1 entries
    • the first n entries give the index in the column tensor where each row's elements start
    • the last entry is z
  • the value tensor contains z entries: the non-zero elements themselves, sorted by row and then column index.

If you've taken a numerical linear algebra class, then you're familiar with seeing how the total number of operations (e.g., for a sparse matrix-vector product) is

$$\text{total operations} = \sum_{i=1}^{n} |N_i| = \sum_{i=1}^{n} d_i = \mathbf{1}^T A \mathbf{1} = |E|$$
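
To make this concrete, here is a minimal NumPy sketch (my own illustration, not from the lecture) of a CSR matrix-vector product; the inner loop for row $i$ runs exactly $|N_i| = d_i$ times, which is where the operation count above comes from.

```python
import numpy as np

# Illustrative sketch: CSR mat-vec y = A @ x.
# col[k] : column index of the k-th nonzero (rows in ascending order)
# ptr[i] : index in col/val where row i's entries start; ptr[n] = z
# val[k] : the k-th nonzero value
def csr_matvec(ptr, col, val, x):
    n = len(ptr) - 1
    y = np.zeros(n)
    for i in range(n):                      # one pass over the rows
        for k in range(ptr[i], ptr[i + 1]): # |N(i)| = d_i operations for row i
            y[i] += val[k] * x[col[k]]
    return y                                # total work: sum_i d_i multiplies

# A = [[0, 2, 0],
#      [1, 0, 3],
#      [0, 0, 4]]
ptr = np.array([0, 1, 3, 4])
col = np.array([1, 0, 2, 2])
val = np.array([2.0, 1.0, 3.0, 4.0])
print(csr_matvec(ptr, col, val, np.array([1.0, 1.0, 1.0])))  # [2. 4. 4.]
```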
Summary

Last Time

  • information theoretic threshold
  • community detection

Today

  • See [[School/Archived/Equivariant Machine Learning/course_PDFs/Lecture 8.pdf]]
  • Popular GNN architectures that can be expressed in convolutional form
    • MPNN
    • GCN
    • ChebNet
    • GraphSAGE
  • GATs

1. Graph Signals and Graph Signal Processing

Recall the definition of a graph neural network:

GNN

A "full-fledged" GNN layer is a multi-dimensional generalization of the multi-layer graph perceptron given by
x=σ(U)=σ(k=0K1Skx1H,k)
H,kRd1×d for 1L, where

Here, XRn×d is still called an -layer embedding.

This is a convolutional GNN because it is based on graph convolution.
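
As a reference point for the architectures below, here is a short NumPy sketch (my own, following the layer formula above; the choice of $\sigma$ and the shapes are illustrative) of one convolutional GNN layer.

```python
import numpy as np

# Sketch (not from the lecture): one convolutional GNN layer
# x_l = sigma( sum_{k=0}^{K-1} S^k x_{l-1} H_{l,k} )
def gnn_layer(S, x_prev, H, sigma=np.tanh):
    """S: (n, n) shift operator; x_prev: (n, d_prev); H: list of K matrices of shape (d_prev, d_next)."""
    U = np.zeros((x_prev.shape[0], H[0].shape[1]))
    Skx = x_prev                      # S^0 x_{l-1}
    for H_k in H:
        U = U + Skx @ H_k             # accumulate S^k x_{l-1} H_{l,k}
        Skx = S @ Skx                 # advance to S^{k+1} x_{l-1}
    return sigma(U)
```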

Message Passing Neural Network (MPNN)

This type of network was initially introduced by Gilmer et al. In a message passing neural network, each layer consists of 2 operations: the message and the update.

The message is a function that processes

  • the signal at node i
  • the message/signal at each of the neighbors of i
  • the edge weights
$$m_\ell = M(x_{\ell-1}), \qquad (m_\ell)_i = \sum_{j \in N(i)} M\big((x_{\ell-1})_i, (x_{\ell-1})_j, A_{ij}\big)$$

The update is a function on the signal of the graph from the previous layer to the next, taking into account the message and the current value at each node:

$$(x_\ell)_i = U\big((x_{\ell-1})_i, (m_\ell)_i\big)$$

(see message passing neural network)

Idea

Consider an MPNN. As long as M is linear, the "message" can be expressed as a graph convolution.

Example

$$M\big[(x_{\ell-1})_i, (x_{\ell-1})_j, A_{ij}\big] = \alpha (x_{\ell-1})_i + \beta A_{ij} (x_{\ell-1})_j \quad \implies \quad m_\ell = \alpha x_{\ell-1} + \beta A x_{\ell-1}$$

which is a graph convolution with $K=2$, $S=A$, $h_0=\alpha$, $h_1=\beta$

As long as U is the composition of a pointwise nonlinearity $\sigma$ with a linear function, then $x_\ell$ can be expressed as a GNN layer

Example

$$U\big[(x_{\ell-1})_i, (m_\ell)_i\big] = \sigma\big(\alpha' (x_{\ell-1})_i + \beta' (m_\ell)_i\big) \quad \implies \quad x_\ell = \sigma\big((\alpha' + \beta'\alpha)\, x_{\ell-1} + \beta'\beta\, A x_{\ell-1}\big)$$

Which is a graph convolution with $K=2$, $S=A$, $h_0=\alpha'+\beta'\alpha$, $h_1=\beta'\beta$

(see MPNNs can be expressed as graph convolutions)
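
A quick numerical sanity check of this claim (my own sketch; the values of $\alpha, \alpha', \beta, \beta'$ are arbitrary): computing the message and update node-wise gives the same output as the two-tap graph convolution with $h_0 = \alpha' + \beta'\alpha$, $h_1 = \beta'\beta$.

```python
import numpy as np

# Sketch (not from the lecture): a linear MPNN equals a two-tap graph convolution.
rng = np.random.default_rng(0)
n = 5
A = rng.random((n, n)); A = (A + A.T) / 2; np.fill_diagonal(A, 0)  # symmetric edge weights
x = rng.standard_normal(n)
a, b, a_p, b_p = 0.3, 0.7, 0.5, 1.2        # alpha, beta, alpha', beta' (illustrative values)

# MPNN form: message then update
m = a * x + b * (A @ x)                    # m = alpha*x + beta*A x
x_mpnn = np.tanh(a_p * x + b_p * m)        # update with pointwise nonlinearity

# Graph-convolution form: h0*x + h1*S x with S = A
h0, h1 = a_p + b_p * a, b_p * b
x_conv = np.tanh(h0 * x + h1 * (A @ x))

print(np.allclose(x_mpnn, x_conv))         # True
```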

Question

From the perspective of learning the weights ($h_0, h_1$ vs. $\alpha, \alpha', \beta, \beta'$), what is different?

Graph Convolutional Networks

Introduced by Kipf and Welling, 2017, a graph convolutional network layer is given by

$$(x_\ell)_i = \sigma\left[\sum_{j \in N(i)} \frac{(x_{\ell-1})_j}{|N(i)|}\, H_\ell\right]$$

Note that the node signals $(x_{\ell-1})_j$ here are row vectors, so $H_\ell$ multiplies on the right. We can think of each layer as a "degree-normalized aggregation"

(see graph convolutional network)

Idea

We can write a graph convolutional network layer as a graph convolution. This is easy to see by simply writing

$$x_\ell = \sigma(S x_{\ell-1} H_\ell)$$

Which is exactly the form of a graph convolution with

  • $K = 2$
  • $S = D_{\text{binary}}^{-1/2} A_{\text{binary}} D_{\text{binary}}^{-1/2}$
  • $H_{\ell,0} = 0$, $H_{\ell,1} = H_\ell$

And $A_{\text{binary}}$ is the binary adjacency matrix (obtained as `(A > 0).astype(float)` in Python, for example)

(see GCN layers can be written as graph convolutions)
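
A minimal NumPy sketch of this form (my own illustration; the nonlinearity is arbitrary):

```python
import numpy as np

# Sketch (not from the lecture): one GCN layer x_l = sigma(S x_{l-1} H_l),
# with S the degree-normalized binary adjacency.
def gcn_layer(A, x_prev, H, sigma=np.tanh):
    A_bin = (A > 0).astype(float)                          # binary adjacency
    d = A_bin.sum(axis=1)                                  # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    S = D_inv_sqrt @ A_bin @ D_inv_sqrt                    # S = D^{-1/2} A D^{-1/2}
    return sigma(S @ x_prev @ H)                           # one local diffusion step
```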

Note

Advantages of GCNs:

  • One local diffusion step per layer
  • Simple and interpretable
  • Neighborhood-size normalization prevents the vanishing/exploding gradient problem (there is only one diffusion step)

Disadvantages

  • no edge weights (only the binary adjacency is used)
  • only supports the adjacency as the shift operator $S$
  • no self-loops unless they are present in $A$ (the embedding at layer $\ell$ is not informed by the node's own embedding at layer $\ell-1$). Even if self-loops are in the graph, there is no ability to weight $(x_{\ell-1})_i$ and $(x_{\ell-1})_j$, $j \in N(i)$, differently.
ChebNet

The idea behind ChebNets, which were introduced by Defferrard et al. in 2016, is to create a "fast" implementation of graph convolution in the spectral domain. Each layer of a ChebNet is given by

$$x_\ell = \sigma\left(\sum_{k=0}^{K-1} T_k(\bar{L})\, x_{\ell-1} H_{\ell,k}\right)$$

Where $\bar{L} = \frac{2L}{\lambda_{\max}} - I$ to ensure the eigenvalues lie in $[-1, 1]$, and the $T_k$ are the Chebyshev Polynomials.

We use the Chebyshev polynomials instead of Taylor polynomials to approximate an analytic function because

  1. By the chebyshev equioscillation theorem, they provide a better approximation with fewer coefficients
  2. chebyshev polynomials are orthogonal, meaning we can find an orthonormal basis for free. There is no need to implement the filters in the spectral domain.

(see ChebNet)
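
A sketch of one ChebNet layer in NumPy (my own, under the notation above): the $T_k(\bar{L})\,x_{\ell-1}$ terms are built with the Chebyshev recurrence stated further below, so the eigendecomposition of $L$ is never computed.

```python
import numpy as np

# Sketch (not from the lecture): one ChebNet layer
# x_l = sigma( sum_{k=0}^{K-1} T_k(L_bar) x_{l-1} H_{l,k} ),
# with T_k(L_bar) x built by the Chebyshev recurrence -- no eigendecomposition of L.
def chebnet_layer(L, x_prev, H, lam_max, sigma=np.tanh):
    """L: (n, n) Laplacian; x_prev: (n, d_prev); H: list of K matrices of shape (d_prev, d_next)."""
    n = L.shape[0]
    L_bar = (2.0 / lam_max) * L - np.eye(n)       # rescale spectrum into [-1, 1]
    T_prev, T_curr = x_prev, L_bar @ x_prev       # T_0(L_bar) x_{l-1}, T_1(L_bar) x_{l-1}
    U = T_prev @ H[0]                             # k = 0 term
    for H_k in H[1:]:                             # k = 1, ..., K-1
        U = U + T_curr @ H_k
        T_prev, T_curr = T_curr, 2 * (L_bar @ T_curr) - T_prev  # T_{k+1} = 2 L_bar T_k - T_{k-1}
    return sigma(U)
```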

Learning in the spectral domain provides an "inductive bias" for learning localized spectral filters
![[2025-02-17-graph.png]]
In practice, we need to use an analytic function surrogate for this type of filter (solid red line).

Problem

Recall that we can represent any analytic function with convolutional graph filters, but translating back to the spectral domain is expensive because we need to compute the graph Fourier transform $\mathrm{GFT}(x) = \hat{x} = V^T x$ and also the inverse graph Fourier transform of $h(\lambda)\hat{x}$, namely $y = V h(\lambda)\hat{x}$.

This costs 2 additional matrix-vector multiplications, plus the diagonalization of L

In the typical spectral representation of a convolutional graph filter, we have

$$h(\lambda) = \sum_{k=0}^{K-1} h_k \lambda^k$$

where $S = L$ is the Laplacian and $\lambda$ ranges over its eigenvalues. Since the spectral graph filter acts on the signal pointwise in the spectral domain, the same weights $h_k$ are shared across all eigenvalues (frequency components).

Thus the expressivity of these filters is quite strong: they can represent any analytic function of the Laplacian. But, as noted above, translating to and from the spectral domain costs 2 additional matrix-vector multiplications plus the diagonalization of $L$. Is there a way to keep the expressivity of the graph convolution without having to compute as many expensive operations?
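
To make the cost comparison concrete, here is a small NumPy sketch (my own) showing that the same polynomial filter can be applied either in the spectral domain (requiring the eigendecomposition of $L$ plus the two extra matrix-vector products) or directly in the node domain with powers of $L$.

```python
import numpy as np

# Sketch (not from the lecture): the same polynomial filter applied
# (a) in the spectral domain via the eigendecomposition of L, and
# (b) directly in the node domain, with no diagonalization.
rng = np.random.default_rng(1)
n = 6
A = rng.random((n, n)); A = (A + A.T) / 2; np.fill_diagonal(A, 0)
L = np.diag(A.sum(axis=1)) - A                   # combinatorial Laplacian
x = rng.standard_normal(n)
h = np.array([0.5, -0.2, 0.1])                   # filter taps h_0, h_1, h_2 (illustrative)

# (a) spectral: GFT, pointwise multiply by h(lambda), inverse GFT
lam, V = np.linalg.eigh(L)
x_hat = V.T @ x                                  # GFT(x) = V^T x
y_spec = V @ (np.polyval(h[::-1], lam) * x_hat)  # y = V h(lambda) x_hat

# (b) node domain: y = sum_k h_k L^k x
y_node = sum(hk * np.linalg.matrix_power(L, k) @ x for k, hk in enumerate(h))
print(np.allclose(y_spec, y_node))               # True
```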

Chebyshev Polynomial

The Chebyshev Polynomials can be calculated using the recurrence relation:
$$T_0(\lambda) = 1, \qquad T_1(\lambda) = \lambda, \qquad T_{k+1}(\lambda) = 2\lambda T_k(\lambda) - T_{k-1}(\lambda)$$

And each $T_n$ has the property that $T_n(\cos\theta) = \cos(n\theta)$
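
A quick numerical check of this property using the recurrence (my own sketch):

```python
import numpy as np

# Sketch: verify the recurrence reproduces T_n(cos t) = cos(n t).
t = np.linspace(0, np.pi, 7)
lam = np.cos(t)
T_prev, T_curr = np.ones_like(lam), lam          # T_0, T_1
for n in range(2, 6):
    T_prev, T_curr = T_curr, 2 * lam * T_curr - T_prev
    assert np.allclose(T_curr, np.cos(n * t))    # T_n(cos t) == cos(n t)
print("recurrence matches cos(n*t) for n = 2..5")
```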

In conventional signal processing, filters based on these polynomials have better cutoff behavior in the spectral domain (compared to general polynomial filters)

For the formal theorem statement/proof: see chebyshev equioscillation theorem

Using $h(\lambda) = \sum_{k=0}^{K-1} h_k \lambda^k$ requires more coefficients than

$$h(\lambda) = \sum_{k=0}^{K'-1} h'_k T_k(\lambda),$$

i.e., $K' < K$ for the same approximation quality.

Intuition for why this is better than the Taylor polynomials: the Chebyshev approximation minimizes the maximum error over the whole interval $[-1,1]$ (equioscillation), whereas a truncated Taylor series is only accurate near its expansion point.

Additionally,

Theorem

Let $\displaystyle \langle f, g \rangle = \int_{-1}^{1} f(x)\, g(x)\, \frac{1}{\sqrt{1-x^2}}\, dx$

The Chebyshev Polynomials $T_0, T_1, \dots$ are orthogonal with respect to the inner product $\langle \cdot, \cdot \rangle$.

(see chebyshev polynomials are orthogonal)

Proof

$$
\begin{aligned}
\int_{-1}^{1} \frac{T_n(x)\,T_m(x)}{\sqrt{1-x^2}}\,dx
&\overset{x=\cos\theta,\ dx=-\sin\theta\,d\theta}{=} \int_{\pi}^{2\pi} T_n(\cos\theta)\,T_m(\cos\theta)\,\frac{-\sin\theta}{\sqrt{1-\cos^2\theta}}\,d\theta \\
&= \int_{\pi}^{2\pi} \cos(n\theta)\cos(m\theta)\,d\theta \qquad \text{since } T_n(\cos\theta)=\cos(n\theta) \\
&= \int_{\pi}^{2\pi} \frac{e^{in\theta}+e^{-in\theta}}{2}\cdot\frac{e^{im\theta}+e^{-im\theta}}{2}\,d\theta \\
&= \frac{1}{4}\int_{\pi}^{2\pi} e^{i(n+m)\theta}+e^{-i(n+m)\theta}+e^{i(n-m)\theta}+e^{-i(n-m)\theta}\,d\theta \\
&= \begin{cases}
\frac{1}{4}\int_{\pi}^{2\pi} 4\,d\theta = \theta\big|_{\pi}^{2\pi} = \pi & \text{if } n=m=0 \\
\frac{1}{4}\int_{\pi}^{2\pi} 2\,d\theta + \frac{1}{4}\int_{\pi}^{2\pi} 2\cos((n+m)\theta)\,d\theta = \frac{\pi}{2}+0=\frac{\pi}{2} & \text{if } n=m\neq 0 \\
\frac{1}{4}\int_{\pi}^{2\pi} 2\cos((n+m)\theta)+2\cos((n-m)\theta)\,d\theta = 0 & \text{if } n\neq m
\end{cases}
\end{aligned}
$$

(On $[\pi, 2\pi]$ we have $\sqrt{1-\cos^2\theta} = |\sin\theta| = -\sin\theta$, so the factor from the substitution cancels to 1.)

Note

Advantages

  • Fast and cheap - easy to calculate polynomials and they are orthogonal for free
  • localized spectral filters

Disadvantages

  • less stable than Taylor approximations
    • perturbing the adjacency matrix can cause large fluctuations in the chebyshev polynomial approximation, whereas the taylor approximations will remain basically the same
  • S is restricted to L
  • difficult to interpret in the node domain
Graph SAGE

Introduced by Hamilton-Ying-Leskovec in 2017, each Graph SAGE layer implements the following 2 operations:

Aggregate

$$(U_\ell)_i = \text{AGGREGATE}\big(\{(x_{\ell-1})_j,\ j \in N(i)\}\big)$$

Concatenate

$$(x_\ell)_i = \sigma\big(\text{CONCAT}\big((x_{\ell-1})_i,\ (U_\ell)_i\big)\big)$$

The standard AGGREGATE operation is an average over $N(i)$. Letting $H_\ell = [H_0 \;\; H_1]^T$, we get

$$(U_\ell)_i = \frac{1}{|N(i)|}\sum_{j\in N(i)} (x_{\ell-1})_j, \qquad U_\ell = S x_{\ell-1}, \quad S = D^{-1/2} A D^{-1/2}$$
$$x_\ell = \sigma\big([x_{\ell-1}\;\; U_\ell]\, H_\ell\big) = \sigma\big(x_{\ell-1} H_0 + S x_{\ell-1} H_1\big)$$

where $A$ is the binary adjacency matrix, which is equivalent to a GCN with the nodewise normalization $(x_\ell)_i \leftarrow (x_\ell)_i / \|(x_\ell)_i\|_2$. This normalization helps in some cases (empirically).

(see graph SAGE)
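
A sketch of one GraphSAGE layer with mean aggregation in NumPy (my own; here I use the plain neighborhood average for the AGGREGATE step, one common choice, and include the nodewise $\ell_2$ normalization mentioned above):

```python
import numpy as np

# Sketch (not from the lecture): one GraphSAGE layer in the equivalent form
# x_l = sigma(x_{l-1} H0 + U_l H1), with U_l the neighborhood mean.
def sage_layer(A, x_prev, H0, H1, sigma=np.tanh):
    A_bin = (A > 0).astype(float)                        # binary adjacency
    deg = np.maximum(A_bin.sum(axis=1, keepdims=True), 1.0)
    U = (A_bin @ x_prev) / deg                           # AGGREGATE: mean over N(i)
    x_new = sigma(x_prev @ H0 + U @ H1)                  # CONCAT + linear map, then sigma
    norms = np.linalg.norm(x_new, axis=1, keepdims=True) + 1e-12
    return x_new / norms                                 # nodewise L2 normalization
```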

notes

  • SAGE implementations allow for a variety of AGGREGATE functions, including max and LSTMs (which look at N(i) as a sequence, losing permutation equivariance). In these cases, SAGE is no longer a graph convolution.

    • There are both advantages and disadvantages of staying a GCN and using other functions.
  • the authors of SAGE popularized the AGGREGATE/UPDATE representation of GNNs (with $K=2$), meaning

$$x_\ell = \sigma(S x_{\ell-1} H_\ell)$$

where $S x_{\ell-1}$ is the "aggregate" and $\sigma(\,\cdot\, H_\ell)$ is the "update"