2025-02-18 equivariant lecture 3

[[lecture-data]]

2025-02-18

Readings

Summary

Today


Invariant functions on sets
Zaheer et al 2017
Goals:

  1. Define a class of functions that can deal with sets/multisets as inputs
  2. Translate this class of functions into an efficient neural network
  3. Show that it works well (state-of-the-art) for several real-world applications

Sets/multisets: $\{x_1,\dots,x_n\}=\{x_{\sigma(1)},\dots,x_{\sigma(n)}\}$ for all permutations $\sigma:[n]\to[n]$
+++ see notes
We can think of these as $\mathbb{R}^n / S_n$, or unordered tuples from the reals

example
$(1,1,2)=(1,2,1)\neq(1,2)$ as multisets

Motivation

prior implementations of invariant functions on these sets
$f:\mathbb{R}^n\to\mathbb{R}$, $f(x_1,\dots,x_n)=f(x_{\sigma(1)},\dots,x_{\sigma(n)})$

Invariant polynomials

$e_1(x_1,\dots,x_n)=\sum_{i=1}^n x_i,\quad e_2=\sum_{1\le i<j\le n} x_i x_j,\quad \dots,\quad e_n=x_1 x_2\cdots x_n,\quad e_{n+1}=0$
Equivalently, the power sums $p_1=\sum_i x_i,\ p_2=\sum_i x_i^2,\ p_3=\sum_i x_i^3,\ \dots,\ p_n=\sum_i x_i^n$ are also a basis/generating set for the symmetric polynomials
+++++ see notes

This requires $n$ polynomials of degree up to $n$.
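Both generating sets can be checked numerically; a minimal sketch (the helper names are my own, not from the lecture):

```python
# Hedged sketch: evaluate the elementary symmetric polynomials e_k and the
# power sums p_k, and check numerically that both are permutation invariant.
import numpy as np

def elementary_symmetric(x):
    # np.poly gives the coefficients of prod_i (t - x_i): [1, -e1, e2, -e3, ...]
    coeffs = np.poly(x)
    return np.array([(-1) ** k * coeffs[k] for k in range(1, len(x) + 1)])

def power_sums(x):
    x = np.asarray(x, dtype=float)
    return np.array([np.sum(x ** k) for k in range(1, len(x) + 1)])

x = np.array([1.0, 1.0, 2.0])
perm = np.array([2, 0, 1])
assert np.allclose(elementary_symmetric(x), elementary_symmetric(x[perm]))
assert np.allclose(power_sums(x), power_sums(x[perm]))
```

For the earlier example $(1,1,2)$ this gives $e=(4,5,2)$ and $p=(4,6,10)$.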

Another way to describe them

Canonicalization

Given $(x_1,\dots,x_n)$, find $\sigma\in S_n$ that sorts the entries such that
$x_{\sigma(1)}\le x_{\sigma(2)}\le\dots\le x_{\sigma(n)}$
A function $h$ is invariant if and only if $h$ is a function of the sorted entries

choose a "representative" or "template" of my tuple

Deep Sets Paper

Input $X=\{x_1,\dots,x_m\}$, $x_i\in\mathfrak{X}$, where $\mathfrak{X}$ is countable. Consider the space of functions $f:2^{\mathfrak{X}}\to\mathbb{R}$
$f$ is permutation invariant if and only if $f$ can be decomposed as

$f(X)=\rho\left(\sum_{x\in X}\phi(x)\right)$

for suitable $\phi:\mathfrak{X}\to\mathbb{R}$ and $\rho:\mathbb{R}\to\mathbb{R}$

(can think of this as a layer in a neural network)
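A minimal Deep Sets layer as a sketch (the one-layer $\phi$, linear $\rho$, and all shapes are illustrative choices of mine, not from the paper):

```python
# Hedged sketch: f(X) = rho(sum_i phi(x_i)) with a random one-layer phi and a
# linear rho; the sum pooling is what makes the whole map permutation invariant.
import numpy as np

rng = np.random.default_rng(0)
W_phi = rng.normal(size=(1, 8))   # phi: R -> R^8 (illustrative width)
W_rho = rng.normal(size=(8, 1))   # rho: R^8 -> R

def deep_sets(x):
    feats = np.tanh(np.asarray(x, dtype=float)[:, None] * W_phi)  # (m, 8)
    pooled = feats.sum(axis=0)     # order-independent pooling over the set
    return (pooled @ W_rho).item()

x = np.array([0.5, -1.0, 2.0])
assert np.isclose(deep_sets(x), deep_sets(x[::-1]))   # invariant to reordering
```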

Wait whoa

With the previous polynomial representation, we needed $n$ features per element.

Note

$f(x)=\rho\left(\sum_{i=1}^M \phi(x_i)\right)$

in 2nd lecture

Here $G=S_M$, the permutations of $M$ objects, so $|G|=M!$, which is a huge set

But Deep Sets tells us that we only need to sum over $M$ objects instead of $M!$ group elements

Proof (sketch)
($\Leftarrow$) Suppose we have a function $f$ of the form

$f(x)=\rho\left(\sum_{i=1}^M \phi(x_i)\right)$

Then $f$ is invariant to any permutation $\pi$:

$f(\pi(x))=\rho\left(\sum_{i=1}^M \phi(x_{\pi(i)})\right)=\rho\left(\sum_{i=1}^M \phi(x_i)\right)$

($\Rightarrow$) (not the cleanest version, but this was the first paper to show it)

  1. use $\phi$ to encode each possible input as a real number
    • injective over the whole set of sets
  2. learn $\rho:\mathbb{R}\to\mathbb{R}$ that executes the desired function on that encoding

Since $\mathfrak{X}$ is countable, there exists a bijection $c:\mathfrak{X}\to\mathbb{N}$. We can encode each subset of $\mathfrak{X}$ (i.e. each element of $2^{\mathfrak{X}}$)

Let $Y\subseteq\mathfrak{X}$. We can encode $Y$ as a binary sequence $0.a_1 a_2 a_3\dots$
such that $a_j=1 \iff x_j\in Y$, where $x_j=c^{-1}(j)$
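The encoding step can be made concrete for finite subsets (a sketch; the toy universe and the identity bijection are my assumptions):

```python
# Hedged sketch: with a bijection c: X -> N, setting phi(x) = 2^{-c(x)} makes
# sum_{x in Y} phi(x) the binary expansion 0.a1a2... of the subset Y; for
# finite subsets this sum is injective, so rho can recover Y from it.
from fractions import Fraction

def phi(x, c):
    return Fraction(1, 2 ** c(x))

def encode(Y, c):
    return sum((phi(x, c) for x in Y), Fraction(0))   # order of Y is irrelevant

c = lambda x: x   # toy universe X = {1, 2, 3, ...} with the identity bijection
assert encode({1, 3}, c) == Fraction(5, 8)            # 0.101 in binary
assert encode({1, 3}, c) != encode({2, 3}, c)         # distinct subsets, distinct codes
```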

There are also better theoretical proofs of the same thing: Tabaghi and Wang 2024
$f:\mathbb{R}^{k\times n}\to\mathbb{R}^s$

If this function is $S_n$ invariant (permutation invariant over the elements of the set), then there exist some
$\phi:\mathbb{R}^k\to\mathbb{R}^L$ and $\rho:\mathbb{R}^L\to\mathbb{R}^s$ such that

$f(V)=\rho\left(\sum_{v_i\in \mathrm{cols}(V)}\phi(v_i)\right)$

and the latent dimension $L$ can be chosen such that $L\le 2kn$

architecture (++++++see notes)

point clouds

invariant functions on point clouds

motivation:

n points in d dimensions

$f:\mathbb{R}^{d\times n}\to\mathbb{R}$, $E(d)$ invariant, and $V\mapsto f(V)$ permutation invariant in the columns

translations

permutations
Want invariance under $O(d)$ and $S_n$

want functions of the Gram matrix $V^\top V$ that are invariant to simultaneous permutations of rows and columns
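A numerical sketch of why the Gram matrix is the right object: it is untouched by $O(d)$ acting on the left, and a column permutation of $V$ becomes conjugation of $V^\top V$ (variable names are mine):

```python
# Hedged sketch: G = V^T V removes the O(d) action and converts column
# permutations of V into simultaneous row/column permutations of G.
import numpy as np

rng = np.random.default_rng(1)
d, n = 3, 5
V = rng.normal(size=(d, n))

Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # a random orthogonal matrix
assert np.allclose((Q @ V).T @ (Q @ V), V.T @ V)        # O(d) invariance of G

P = np.eye(n)[rng.permutation(n)]              # a random permutation matrix
G = V.T @ V
assert np.allclose((V @ P).T @ (V @ P), P.T @ G @ P)    # permutation -> conjugation
```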

Why invariant theory?

  1. instead of constructing the Gram matrix with $O(n^2)$ complexity, we'll be able to learn invariant functions with $O(nd)$ complexity
  2. ++++ see notes
  3. Galois theory - interesting mathematical properties

Galois Theory (setup)
$S_n$ is acting by conjugation (permuting rows and columns simultaneously)

the $S_n$ above is a subgroup of $S_n\times S_{\binom{n}{2}}$ (which permutes the diagonal entries and the off-diagonal entries independently)
2024-02-18-equivariant.png

++++see notes
$f$ is $S_n$ invariant if $f$ is $S_n\times S_{\binom{n}{2}}$ invariant
$f_d, f_o$ characterize ++++ notes

Galois theory approach

2024-02-18-equivariant-1.png
$\mathrm{Id}\le S_n\le S_n\times S_{\binom{n}{2}}$

++see notes

$\mathbb{R}(f_d,f_o)=\left\{\dfrac{p}{q}\ :\ p,q\ \text{polynomials generated by}\ (f_d,f_o)\right\}$
$f_d=\{\sum_i x_{ii},\ \sum_i x_{ii}^2,\ \dots\}$
$f_o=\{\sum_{i\ne j} x_{ij},\ \sum_{i\ne j} x_{ij}^2,\ \dots\}$

that is, the field generated by $f_d$ and $f_o$; these sets are field generators for the space.
And note that $\mathbb{R}(f_d,f_o,f)$ is an algebraic extension of $\mathbb{R}(f_d,f_o)$.
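The generators $f_d, f_o$ are easy to compute; a sketch checking that they are invariant under conjugation by a permutation (the function names and the truncation at three power sums are mine):

```python
# Hedged sketch: diagonal and off-diagonal power sums of a symmetric matrix,
# both invariant under G -> P^T G P for a permutation matrix P.
import numpy as np

def f_d(G, kmax=3):
    diag = np.diag(G)
    return np.array([np.sum(diag ** k) for k in range(1, kmax + 1)])

def f_o(G, kmax=3):
    off = G[~np.eye(len(G), dtype=bool)]       # all entries with i != j
    return np.array([np.sum(off ** k) for k in range(1, kmax + 1)])

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 4))
G = A + A.T                                    # symmetric test matrix
P = np.eye(4)[rng.permutation(4)]
assert np.allclose(f_d(G), f_d(P.T @ G @ P))
assert np.allclose(f_o(G), f_o(P.T @ G @ P))
```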

(galois theory is not a necessary argument, uses easier argument - construct set of invariants that are only fixed by desired group, then can separate orbits almost everywhere)

using Deep Sets
we can universally approximate the invariant functions everywhere outside an invariant, closed, zero-measure "bad set" where orbits are not separable

From O(n2) to O(nd) invariant features
$V=[v_1\ \cdots\ v_n]\in\mathbb{R}^{d\times n}$

VC=[]

Invariant machine learning model with $O(nd)$ features:
$V\mapsto h(\mathrm{DeepSet}(C^T V),\ C^T C)$, and this is $S_n$ invariant
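A sketch of the claimed invariance (how $C$ is actually constructed is in the elided notes; here $C$ is just a fixed placeholder matrix, and the pooling and outer $h$ are arbitrary stand-ins):

```python
# Hedged sketch: V -> h(DeepSet(C^T V), C^T C). Permuting the columns of V only
# permutes the columns of C^T V, which the sum pooling ignores; C^T C does not
# depend on V at all. Hence the model is S_n invariant.
import numpy as np

rng = np.random.default_rng(3)
d, n = 3, 5
C = rng.normal(size=(d, n))                    # placeholder for the reference matrix

def deepset(M):
    return np.tanh(M).sum(axis=1)              # sum over columns: S_n invariant

def model(V):
    a, B = deepset(C.T @ V), C.T @ C
    return float(a.sum() + np.trace(B))        # arbitrary stand-in for h

V = rng.normal(size=(d, n))
P = np.eye(n)[rng.permutation(n)]
assert np.isclose(model(V), model(V @ P))      # S_n invariance
```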

Theorem (Huang Blum-Smith 2024 etc)