GCNs

Semi-Supervised Classification with Graph Convolutional Networks

MIM Lab
Katsuya Ogata

Agenda
  1. Basic Knowledge of GNN
  2. Introduction
  3. Fast Approximate Convolutions on Graphs
  4. Semi-Supervised Node Classification
  5. Experiments
  6. Results
  7. Discussion
  8. Appendix
Basic Knowledge of GNN

Basic Elements of a Graph

  • Node
    A vertex in the graph, e.g., a user in a social network or an atom in a molecule.

  • Edge
    A connection between two nodes, e.g., friendship between users or bonds between atoms.

  • Adjacency Matrix
    A matrix representing the connectivity of nodes. If node $i$ and node $j$ are connected, $A_{ij} = 1$; otherwise, $A_{ij} = 0$.

Basic Knowledge of GNN

Graph Spectral Theory

  • Study the properties of graphs by analyzing the eigenvalues and eigenvectors of matrices associated with the graph.

  • This analysis reveals crucial structural and global properties, such as:

    • Connectivity (how well the graph is connected)
    • Bipartiteness (if the graph can be divided into two independent sets)
    • The presence of certain motifs or communities
Basic Knowledge of GNN

Graph Laplacian

  • Definition
    $L := D - A$

    Where:

    • $D$ = Degree matrix (diagonal, $D_{ii}$ = degree of node $i$)
    • $A$ = Adjacency matrix
  • Properties

    • Symmetric (for undirected graphs)
    • Captures the difference between a node and its neighbors
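
A minimal NumPy sketch (not from the paper) that builds the Laplacian of a small, hypothetical undirected graph:

```python
import numpy as np

# Hypothetical 4-node undirected graph (adjacency matrix A)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

D = np.diag(A.sum(axis=1))   # degree matrix: D_ii = degree of node i
L = D - A                    # unnormalized graph Laplacian L = D - A

assert np.allclose(L, L.T)   # symmetric, since the graph is undirected
```
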
Basic Knowledge of GNN

Normalized Laplacian

  • Definition
    $\tilde{L} := D^{-1/2} L D^{-1/2} = I - D^{-1/2} A D^{-1/2} = I - \tilde{A}$

    Where:

    • $\tilde{A} = D^{-1/2} A D^{-1/2}$ is the normalized adjacency matrix
  • Why normalize?

    • Removes scale differences due to node degrees
    • Makes it easier to compare graphs with different structures
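
A minimal sketch of the normalized Laplacian for the same kind of toy graph, assuming no isolated nodes (so $D^{-1/2}$ exists):

```python
import numpy as np

A = np.array([[0, 1, 1, 0],          # hypothetical undirected graph
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

D_inv_sqrt = np.diag(A.sum(axis=1) ** -0.5)   # D^{-1/2}
A_norm = D_inv_sqrt @ A @ D_inv_sqrt          # normalized adjacency D^{-1/2} A D^{-1/2}
L_norm = np.eye(len(A)) - A_norm              # normalized Laplacian I - D^{-1/2} A D^{-1/2}

# Eigenvalues of the normalized Laplacian lie in [0, 2], regardless of node degrees
print(np.linalg.eigvalsh(L_norm))
```
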
Basic Knowledge of GNN

Graph Fourier Transform

  • Forward transform

    $F(x) = U^T x$

  • Inverse transform

    $F^{-1}(x) = U x$

Where:

  • $U$: matrix of eigenvectors ($\tilde{L} = U \Lambda U^T$)
  • $x$: graph signal
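
A sketch of the forward and inverse transform on a toy graph; `np.linalg.eigh` supplies the eigenvectors $U$ of the symmetric normalized Laplacian (the graph and signal are illustrative assumptions):

```python
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D_inv_sqrt = np.diag(A.sum(axis=1) ** -0.5)
L_norm = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt

lam, U = np.linalg.eigh(L_norm)       # L_norm = U diag(lam) U^T

x = np.array([1.0, 0.0, 2.0, -1.0])   # an arbitrary graph signal

x_hat = U.T @ x     # forward transform  F(x) = U^T x
x_rec = U @ x_hat   # inverse transform  F^{-1}(x_hat) = U x_hat

assert np.allclose(x, x_rec)          # the round trip recovers the signal
```
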
Basic Knowledge of GNN

Spectral Convolution

  • Definition (using Fourier domain):

    $g * x = F^{-1}(F(g) \odot F(x)) = U\,(U^T g \odot U^T x)$

Where:

  • $\odot$: element-wise multiplication
Basic Knowledge of GNN

Practical Filtering

  • Direct use of $U^T g$ is often impractical.

  • Instead, we typically use a learnable diagonal matrix $g_w$:

    $g_w * x = U g_w U^T x$

This simplifies the filter design and makes learning scalable.
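
A sketch of filtering with a free diagonal filter $g_w$; the weights `w` below are arbitrary placeholders standing in for learnable parameters:

```python
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D_inv_sqrt = np.diag(A.sum(axis=1) ** -0.5)
L_norm = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt
lam, U = np.linalg.eigh(L_norm)

x = np.array([1.0, 0.0, 2.0, -1.0])   # graph signal
w = np.array([0.5, 1.0, -0.3, 0.8])   # placeholder filter coefficients, one per eigenvalue

filtered = U @ (w * (U.T @ x))        # g_w * x = U g_w U^T x, with g_w = diag(w)
```
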

Basic Knowledge of GNN

Summary

  1. Transform the graph signal $x$ into the spectral domain
  2. Apply a filter in the spectral domain
  3. Return to the original space

Introduction

Loss Functions of GNN

$L = L_0 + \lambda L_{\text{reg}}$

where:

  • $L_0$: supervised loss over the labeled part of the graph
  • $\lambda$: weighting factor controlling the strength of the regularization
  • $L_{\text{reg}}$: graph regularization term
Introduction

Regularization Term

$L_{\text{reg}} = \sum_{i, j} A_{ij} \| f(X_i) - f(X_j) \|^2 = f(X)^\top \Delta f(X)$

where:

  • $f(\cdot)$: neural-network-like differentiable function
  • $X$: matrix of node feature vectors $X_i$
  • $A$: adjacency matrix
  • $\Delta = D - A$: unnormalized graph Laplacian
Introduction

Homophily Hypothesis

  • Homophily refers to the tendency of connected nodes to share similar attributes or labels.

  • In graph learning, it is assumed that:

    "Connected nodes are likely to belong to the same class."

Introduction

Proposed Methods

$\mathrm{loss} = L_0$

$\mathrm{output} = f(X, A)$

Where:

  • $X \in \mathbb{R}^{N \times D}$: Node feature matrix,
    where $N$ = number of nodes, $D$ = number of features
  • $A \in \mathbb{R}^{N \times N}$: Adjacency matrix
  • $f(\cdot)$: Neural network mapping features and graph structure
  • $L_0$: Supervised loss on labeled nodes
Fast Approximate Convolutions on Graphs

Computational Cost Issues

$g_\theta \star x = U g_\theta U^T x$

$(\theta \in \mathbb{R}^N,\ x \in \mathbb{R}^N)$

  1. Multiplication with eigenvector matrix $U$:

    • Computational complexity is $O(N^2)$ ($N$: number of nodes)
    • Very expensive for large graphs
  2. Eigendecomposition of graph Laplacian $L$:

    • Necessary to obtain $U$
    • Also computationally prohibitive for large graphs
Fast Approximate Convolutions on Graphs

Solution: Approximation via Chebyshev Polynomials

Proposal by Hammond et al. (2011):
Approximate $g_\theta(\Lambda)$ by a truncated expansion in terms of Chebyshev polynomials $T_k(x)$ up to $K^{th}$ order.

Approximation (Eq. 4):

$g_{\theta'}(\Lambda) \approx \sum_{k=0}^K \theta'_k T_k(\tilde{\Lambda})$

Fast Approximate Convolutions on Graphs

Components of Eq. (4) and Chebyshev Polynomials

  • $\tilde{\Lambda}$ (Rescaled eigenvalues):

    $\tilde{\Lambda} = \frac{2}{\lambda_{max}}\Lambda - I_N$

    • $\lambda_{max}$: Largest eigenvalue of $\tilde{L}$
    • The range of $\tilde{\Lambda}$ becomes $[-1, 1]$, matching the domain of Chebyshev polynomials.
  • Chebyshev polynomials $T_k(x)$:

    • $T_0(x) = 1$
    • $T_1(x) = x$
    • $T_k(x) = 2xT_{k-1}(x) - T_{k-2}(x)$ (recursive definition)
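
The recurrence translates directly into code; this small helper (`chebyshev` is a hypothetical name) evaluates $T_k$ on a scalar or NumPy array:

```python
import numpy as np

def chebyshev(k: int, x: np.ndarray) -> np.ndarray:
    """Evaluate T_k(x) via T_0 = 1, T_1 = x, T_k = 2x*T_{k-1} - T_{k-2}."""
    t_prev, t_curr = np.ones_like(x), x
    if k == 0:
        return t_prev
    for _ in range(k - 1):
        t_prev, t_curr = t_curr, 2 * x * t_curr - t_prev
    return t_curr

x = np.linspace(-1, 1, 5)
print(chebyshev(3, x))   # equals 4x^3 - 3x on [-1, 1]
```
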
Fast Approximate Convolutions on Graphs

Convolution using the Approximation

Applying the approximation (Eq. 4) to the original convolution definition
$g_\theta \star x = U g_\theta(\Lambda) U^T x$:

$g_{\theta'} \star x \approx U \left( \sum_{k=0}^K \theta'_k T_k(\tilde{\Lambda}) \right) U^T x$

$g_{\theta'} \star x \approx \sum_{k=0}^K \theta'_k\, U T_k(\tilde{\Lambda})\, U^T x$

Fast Approximate Convolutions on Graphs

Using

$(U\tilde{\Lambda}U^T)^k = U\tilde{\Lambda}^k U^T$

$T_k(\tilde{L}) = U T_k(\tilde{\Lambda}) U^T$

we get:

$g_{\theta'} \star x \approx \sum_{k=0}^K \theta'_k T_k(\tilde{L})\,x \quad (\text{Eq. } 5)$

where:

$\tilde{L} = \frac{2}{\lambda_{max}}L - I_N$ (here $L$ denotes the normalized Laplacian $I_N - D^{-1/2}AD^{-1/2}$)

Fast Approximate Convolutions on Graphs

Important Properties:

  1. $K$-localized:

    • $T_k(\tilde{L})$ is a $K^{th}$-order polynomial in the Laplacian.
    • The convolution result depends only on nodes that are at most $K$ steps away from the central node ($K^{th}$-order neighborhood).
  2. Computational Complexity:

    • Evaluating Eq. (5) is $O(|E|)$ ($|E|$: number of edges).
    • $T_k(\tilde{L})x$ can be computed efficiently through repeated sparse matrix-vector multiplications.
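
A sketch of this $K$-localized filtering using only sparse matrix-vector products, assuming $\lambda_{max} \approx 2$ so that $\tilde{L} = L - I_N$ with $L$ the normalized Laplacian; the helper and toy graph are illustrative, not the authors' implementation:

```python
import numpy as np
import scipy.sparse as sp

def chebyshev_filter(A: sp.spmatrix, x: np.ndarray, theta: np.ndarray) -> np.ndarray:
    """Approximate g_{theta'} * x = sum_k theta'_k T_k(L~) x with the Chebyshev
    recurrence applied to vectors (repeated sparse matrix-vector products)."""
    n = A.shape[0]
    deg = np.asarray(A.sum(axis=1)).ravel()
    D_inv_sqrt = sp.diags(deg ** -0.5)             # assumes no isolated nodes
    L = sp.eye(n) - D_inv_sqrt @ A @ D_inv_sqrt    # normalized Laplacian
    L_tilde = L - sp.eye(n)                        # rescaled Laplacian (lambda_max ~ 2)

    t_prev, t_curr = x, L_tilde @ x                # T_0(L~) x and T_1(L~) x
    out = theta[0] * t_prev
    if len(theta) > 1:
        out = out + theta[1] * t_curr
    for k in range(2, len(theta)):                 # T_k = 2 L~ T_{k-1} - T_{k-2}
        t_prev, t_curr = t_curr, 2 * (L_tilde @ t_curr) - t_prev
        out = out + theta[k] * t_curr
    return out

# Toy usage: random symmetric graph, K = 3 (theta has K + 1 coefficients)
A = sp.random(100, 100, density=0.05, format="csr", random_state=0)
A = ((A + A.T + sp.eye(100)) > 0).astype(float)    # symmetrize; self-loops avoid zero degrees
x = np.random.default_rng(0).normal(size=100)
y = chebyshev_filter(A, x, theta=np.array([1.0, 0.5, 0.25, 0.1]))
```
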
Fast Approximate Convolutions on Graphs

Approximation and Simplification

We approximate $\lambda_{max} \approx 2$ and $K = 1$. (It is expected that the neural network parameters will adapt to this change in scale during training.)

Under these approximations, Eq. 5 simplifies to:

$g_{\theta'} \star x \approx \theta'_0 x + \theta'_1 (L - I_N)x$

$g_{\theta'} \star x = \theta'_0 x - \theta'_1 D^{-1/2}AD^{-1/2}x$

With a single shared parameter $\theta = \theta'_0 = -\theta'_1$:

$g_{\theta'} \star x \approx \theta(I_N + D^{-1/2}AD^{-1/2})x$

Fast Approximate Convolutions on Graphs

Renormalization Trick

To alleviate the problem of numerical instabilities, the following renormalization trick is introduced:

$I_N + D^{-1/2}AD^{-1/2} \rightarrow \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$

Where:

  • $\tilde{A} = A + I_N$ (Adjacency matrix with self-connections added)
  • $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$ (Degree matrix of $\tilde{A}$)
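
A sketch of the renormalization trick with SciPy sparse matrices (an illustrative helper, not the authors' code):

```python
import numpy as np
import scipy.sparse as sp

def renormalized_adjacency(A: sp.spmatrix) -> sp.spmatrix:
    """A_hat = D~^{-1/2} A~ D~^{-1/2}, with A~ = A + I_N and D~_ii = sum_j A~_ij."""
    A_tilde = A + sp.eye(A.shape[0])                     # add self-connections
    deg_tilde = np.asarray(A_tilde.sum(axis=1)).ravel()  # D~_ii (always >= 1)
    D_inv_sqrt = sp.diags(deg_tilde ** -0.5)
    return (D_inv_sqrt @ A_tilde @ D_inv_sqrt).tocsr()

# Toy usage on a 3-node path graph
A = sp.csr_matrix(np.array([[0., 1., 0.],
                            [1., 0., 1.],
                            [0., 1., 0.]]))
print(renormalized_adjacency(A).toarray())
```
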
Fast Approximate Convolutions on Graphs

Generalization to Multiple Channels/Filters

$Z = \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}X\Theta \quad (\text{Eq. } 8)$

Where:

  • $X \in \mathbb{R}^{N \times C}$ is the matrix of input graph signals ($C$ input channels, i.e., a $C$-dimensional feature vector per node).
  • $\Theta \in \mathbb{R}^{C \times F}$ is the matrix of filter parameters.
  • $Z \in \mathbb{R}^{N \times F}$ is the convolved signal matrix.
  • This filtering operation has a complexity of $O(|E|FC)$, as $\tilde{A}X$ can be efficiently implemented as a product of a sparse matrix with a dense matrix.
Semi-Supervised Node Classification

Two-Layer GCN

$H^{(l+1)} = \sigma\!\left(\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}H^{(l)}W^{(l)}\right)$

$Z = f(X, A) = \mathrm{softmax}\!\left(\hat{A}\,\mathrm{ReLU}(\hat{A}XW^{(0)})\,W^{(1)}\right)$

Where:

  • $\hat{A} = \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$, $W^{(0)} \in \mathbb{R}^{C \times H}$, $W^{(1)} \in \mathbb{R}^{H \times F}$ ($H$: number of hidden units, $F$: number of output classes)
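
A self-contained NumPy sketch of this two-layer forward pass; the toy graph, dimensions, and randomly initialized weights are assumptions (no training loop is shown):

```python
import numpy as np

def gcn_forward(A: np.ndarray, X: np.ndarray,
                W0: np.ndarray, W1: np.ndarray) -> np.ndarray:
    """Z = softmax(A_hat ReLU(A_hat X W0) W1) with the renormalized adjacency A_hat."""
    A_tilde = A + np.eye(A.shape[0])                        # add self-loops
    d_inv_sqrt = np.diag(A_tilde.sum(axis=1) ** -0.5)
    A_hat = d_inv_sqrt @ A_tilde @ d_inv_sqrt               # D~^{-1/2} A~ D~^{-1/2}

    H = np.maximum(A_hat @ X @ W0, 0.0)                     # hidden layer + ReLU
    logits = A_hat @ H @ W1
    e = np.exp(logits - logits.max(axis=1, keepdims=True))  # row-wise softmax
    return e / e.sum(axis=1, keepdims=True)

# Toy usage: N = 4 nodes, C = 3 features, H = 8 hidden units, F = 2 classes
rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = rng.normal(size=(4, 3))
W0, W1 = rng.normal(size=(3, 8)), rng.normal(size=(8, 2))
Z = gcn_forward(A, X, W0, W1)   # each row sums to 1: class probabilities per node
```
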

Semi-Supervised Node Classification

Loss Function

Evaluate the cross-entropy error over all labeled examples:

$L = -\sum_{l \in \mathcal{Y}_L} \sum_{f=1}^{F} Y_{lf} \ln Z_{lf}$

Explanation of symbols:

  • $\mathcal{Y}_L$: index set of labeled nodes
  • $F$: number of output classes
  • $Y_{lf}$: true label for class $f$ of node $l$.
  • $Z_{lf}$: predicted probability for class $f$ of node $l$.
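
A sketch of this loss as a masked cross-entropy, assuming `Z` holds predicted probabilities (e.g., from the forward pass above) and `Y` one-hot labels; the labeled index set is a placeholder:

```python
import numpy as np

def masked_cross_entropy(Z: np.ndarray, Y: np.ndarray, labeled_idx: list) -> float:
    """L = - sum_{l in Y_L} sum_f Y_lf * ln Z_lf (only labeled nodes contribute)."""
    Z_l = np.clip(Z[labeled_idx], 1e-12, 1.0)   # numerical safety before taking ln
    return float(-(Y[labeled_idx] * np.log(Z_l)).sum())

# Toy usage: 4 nodes, 2 classes, nodes 0 and 2 labeled
Z = np.array([[0.9, 0.1], [0.4, 0.6], [0.2, 0.8], [0.5, 0.5]])
Y = np.array([[1, 0], [0, 1], [0, 1], [1, 0]], dtype=float)
loss = masked_cross_entropy(Z, Y, labeled_idx=[0, 2])
```
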
Experiments

Datasets Overview

Label Rate: Number of labeled nodes used for training / Total nodes

Experiments

Citeseer, Cora, and Pubmed (Citation networks)

Structure:

  • Nodes: Documents
  • Edges: Citation links (treated as undirected)
  • Features: Sparse bag-of-words vectors
  • Adjacency Matrix: Binary, symmetric

Training Setup:

  • Only 20 labels per class for training
  • All feature vectors available
  • Each document has a class label
Experiments

Knowledge Graph Structure (NELL)

Original Format:

  • Entities connected with directed, labeled edges (relations)
  • Example: (entity₁, relation, entity₂)

Preprocessing:

  • Convert each triplet $(e_1, r, e_2)$ to a bipartite graph by assigning separate relation nodes $r_1$ and $r_2$: edges $(e_1, r_1)$ and $(e_2, r_2)$
  • 55,864 relation nodes + 9,891 entity nodes
  • Extended features: 61,278-dim sparse vectors

Extreme Semi-supervised Setting:

  • Only 1 labeled example per class (210 classes total)
Experiments

For Runtime Analysis (Random graphs)

Generation Process:

  • $N$ nodes → $2N$ edges assigned uniformly at random
  • Feature Matrix: Identity matrix $I_N$ (featureless approach)
  • Each node represented by unique one-hot vector
  • Dummy labels: $Y_i = 1$ for all nodes

Purpose: Measure training time per epoch across different graph sizes
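
A sketch of this generation process; the handling of duplicate and self-edges is an assumption, since the slide does not specify it:

```python
import numpy as np
import scipy.sparse as sp

def random_graph(n: int, seed: int = 0):
    """N nodes with 2N edges placed uniformly at random, identity features, dummy labels."""
    rng = np.random.default_rng(seed)
    src = rng.integers(0, n, size=2 * n)
    dst = rng.integers(0, n, size=2 * n)
    A = sp.coo_matrix((np.ones(2 * n), (src, dst)), shape=(n, n)).tocsr()
    A = ((A + A.T) > 0).astype(float)   # symmetrize and binarize (assumption)
    X = sp.identity(n, format="csr")    # featureless: X = I_N, a one-hot vector per node
    y = np.ones(n, dtype=int)           # dummy labels Y_i = 1 for every node
    return A, X, y

A, X, y = random_graph(1000)            # vary n to measure training time per epoch
```
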

Experiments

Model Configuration

  • Architecture: 2-layer GCN (Section 3.1)
  • Test Set: 1,000 labeled examples
  • Validation Set: 500 labeled examples for hyperparameter optimization (dropout rate, number of hidden units, and L2 regularization factor)
  • Optimizer: Adam (learning rate = 0.01)
  • Max Epochs: 200
  • Early Stopping: Window size = 10
  • Weight Initialization: Glorot & Bengio (2010)
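
The setup on this slide can be summarized as a small configuration dictionary; the dropout rate, hidden-unit count, and L2 factor are the values the paper reports for the citation networks (tuned on the validation set), so treat them as dataset-specific assumptions:

```python
config = {
    "architecture": "2-layer GCN",   # Section 3.1
    "optimizer": "Adam",
    "learning_rate": 0.01,
    "max_epochs": 200,
    "early_stopping_window": 10,     # stop if validation loss does not improve for 10 epochs
    "dropout": 0.5,                  # tuned on the validation set (citation networks)
    "hidden_units": 16,
    "weight_decay": 5e-4,            # L2 regularization factor
    "weight_init": "Glorot & Bengio (2010)",
    "num_validation_nodes": 500,
    "num_test_nodes": 1000,
}
```
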
Results

Semi-Supervised Node Classification Results

Results

Propagation Model Evaluation

Comparing Different Variants

Results

Training Time Analysis

Key Finding

Linear scalability enables application to very large graphs

Discussion

Why GCN Outperforms Traditional Methods

1. End-to-End Learning

  • Other methods: Multi-step pipeline (embedding learning → classifier training)
  • GCN: Unified optimization with single loss function

2. Efficient Information Propagation

  • Graph Laplacian methods: Limited by assumption that edges = node similarity
  • GCN: Propagates feature information through neighbors at each layer

3. Computational Efficiency

  • Complexity: O(|E|) - linear in number of edges
  • Speed: 3-4x faster than Planetoid (Cora: 13s→4s)
Discussion

Limitations and Future Work

Memory Requirements

  • Full-batch training memory grows linearly with dataset size; possible directions: mini-batch SGD, approximate methods, distributed training

Directed edges and edge features

  • Native directed graph support, Edge feature integration, Heterogeneous graph handling

Current Limiting Assumptions

$\tilde{A} = A + \lambda I_N$

  • $\lambda$: learnable trade-off parameter weighting self-connections against edges to neighboring nodes

## Iterative Classification Algorithm (ICA)

### Two-stage Process

1. **Local Classifier**:
   - Train on labeled nodes using local features only
   - Bootstrap unlabeled nodes
2. **Relational Classifier**:
   - Use local features + aggregation operator
   - 10 iterations with random node ordering
   - Hyperparameters chosen via validation

**Note**: TSVM omitted due to scalability issues with large class numbers

---

### Renormalization Trick Superior

- **Best overall performance** across all datasets
- Balances efficiency and representation power

### Graph Structure Matters

- **MLP baseline** performs significantly worse
- Confirms importance of graph convolution operations

### Simpler Can Be Better

- **Renormalization trick** outperforms complex Chebyshev polynomials
- **Fewer parameters** → better generalization
- **Lower computational cost** → practical advantages

---

# Appendix

---

<!-- _header: Related Work -->

## Graph-Based Semi-Supervised Learning

- **Traditional**: Graph Laplacian regularization, graph embedding (DeepWalk, etc.). Multi-step pipelines were a limitation.
- **Recent**: Planetoid injects label info during embedding.

## Neural Networks on Graphs

- **Early Work**: Graph Neural Networks (Gori et al., 2005)
- **Convolution-Based**:
  - Spectral Methods (Bruna et al., 2014): O($N^2$) complexity.
  - Localized Convolutions (Defferrard et al., 2016): Fast Chebyshev approximation.
  - Degree-Specific Weights (Duvenaud et al., 2015): Scalability issues for wide degree distributions.