Introduction
The proliferation of complex, relational data structures calls for data processing approaches that go beyond traditional neural networks. This write-up provides a comprehensive analysis of neural networks and Graph Neural Networks (GNNs), examining their mathematical foundations, architectural principles, and practical applications using the structured approach of Stanford CS224W: Machine Learning with Graphs, taught by Jure Leskovec. The write-up also leans heavily on Leni Aniva's CS224W summary and notes, which can be found here: http://cs224w.Stanford.edu.
Neural Networks are remarkably successful in processing grid-like data structures by leveraging weighted transformations and gradient-based optimization. Graph Neural Networks extend these capabilities to irregular, relational data by incorporating message passing mechanisms and permutation-invariant operations.
This work explores the mathematical underpinnings of both paradigms, from fundamental concepts of weighted sums and backpropagation in traditional networks to the specialized message passing frameworks that enable GNNs to process graph-structured information. We examine key GNN architectures including Graph Convolutional Networks (GCN), Graph Attention Networks (GAT), and GraphSAGE, analyzing their theoretical properties and computational characteristics. The analysis extends to PyTorch Geometric (PyG), a comprehensive framework that facilitates practical implementation of Graph Neural Networks while maintaining compatibility with standard deep learning workflows (PyTorch). Applications across social networks, molecular analysis, and knowledge graphs demonstrate the transformative potential of these approaches in addressing complex relational learning problems.
Graphs
Graphs provide a general language for describing and analyzing entities with relations and interactions. Instead of treating domains as collections of isolated data points, CS224W describes graphs as networks of relationships between entities, enabling more faithful modeling of the underlying phenomena. By leveraging the graph nature of relational databases, relational deep learning offers powerful analysis of data through its relations. Compared with the grid-like inputs of traditional neural networks, graphs pose several challenges: there is no fixed node ordering, their size is arbitrary, and their topological structure is complex, as opposed to the spatial locality provided by grids. Additionally, graphs are often dynamic and/or have multimodal features.
The process of creating a graph involves considering what defines a "node" and an "edge" in the context of application, as well as directionality, weights, properties, types, and attributes.
Graphs consist of objects (nodes, vertices), interactions (edges, links), and the overall system (network, graph). Other design choices include undirected vs. directed edges, whether self-loops are allowed, and whether multiple edges between the same pair of nodes are allowed (multigraphs).
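As a concrete illustration, here is a minimal sketch of these choices as data structures, assuming the networkx library (which the write-up does not otherwise rely on):

```python
import networkx as nx

G = nx.Graph()       # undirected graph, at most one edge per node pair
D = nx.DiGraph()     # directed edges
M = nx.MultiGraph()  # multiple (parallel) edges between the same pair of nodes allowed

D.add_edge("a", "a")              # a self-loop, allowed unless the application forbids it
M.add_edge("a", "b", weight=1.0)  # edges can carry attributes such as weights
M.add_edge("a", "b", weight=2.0)  # a second, parallel edge between the same nodes
print(M.number_of_edges("a", "b"))  # 2
```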
A heterogeneous graph is defined as $G = (V, E, R, T)$:
- Nodes with node types, $v_i \in V$
- Edges with relation types, $(v_i, r, v_j) \in E$
- Node type $T(v_i)$
- Relation type $r \in R$
- Nodes and edges can have attributes/features
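Since PyTorch Geometric (PyG) is discussed later in this write-up, here is a minimal sketch of how such a graph can be expressed with PyG's HeteroData container; the node types ("author", "paper"), the relation type ("writes"), and all sizes are made-up placeholders:

```python
import torch
from torch_geometric.data import HeteroData

data = HeteroData()

# Two node types, each with its own feature matrix (num_nodes x num_features)
data["author"].x = torch.randn(4, 16)
data["paper"].x = torch.randn(6, 32)

# One relation type: directed "author writes paper" edges,
# stored as a 2 x num_edges tensor of (source, target) index pairs
data["author", "writes", "paper"].edge_index = torch.tensor(
    [[0, 1, 2, 3],
     [0, 0, 1, 2]]
)

print(data.node_types, data.edge_types)
```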
 
Another type is the bipartite graph, where nodes can be divided into two disjoint sets $U$ and $V$ such that every link connects a node in $U$ to a node in $V$ (the sets remain independent: there are no links within a set).
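For instance, a user–item interaction graph is bipartite. A minimal sketch, again with networkx and made-up user/item names:

```python
import networkx as nx

B = nx.Graph()
# Two disjoint node sets: users and items
B.add_nodes_from(["u1", "u2", "u3"], bipartite=0)
B.add_nodes_from(["i1", "i2"], bipartite=1)
# Every edge connects a user to an item; there are no user-user or item-item edges
B.add_edges_from([("u1", "i1"), ("u2", "i1"), ("u3", "i2")])
print(nx.is_bipartite(B))  # True
```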

Traditional Machine Learning
Using an input graph, we can extract node-, link-, and graph-level features, then train a model (SVM, neural network, etc.) to map those features to labels. Aniva emphasizes that using effective features over graphs is the key to achieving good model performance.
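A minimal sketch of this pipeline, assuming networkx and scikit-learn and using Zachary's karate club (a small social network that ships with networkx) as a toy node-classification task:

```python
import networkx as nx
import numpy as np
from sklearn.linear_model import LogisticRegression

G = nx.karate_club_graph()

# Hand-designed node-level features: degree, clustering coefficient, betweenness centrality
betweenness = nx.betweenness_centrality(G)
X = np.array(
    [[G.degree(v), nx.clustering(G, v), betweenness[v]] for v in G.nodes()]
)

# Labels: which of the two factions each club member eventually joined
y = np.array([G.nodes[v]["club"] == "Officer" for v in G.nodes()], dtype=int)

# Map the hand-designed features to labels with an off-the-shelf classifier
clf = LogisticRegression().fit(X, y)
print(clf.score(X, y))  # training accuracy of the feature-based model
```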

Node-level prediction: characterize the structure and position of a node in the network:
- Node degree (number of neighbors)
- Node importance and position, captured by different types of centrality (a computational sketch follows this list):
Eigenvector centrality: a node is important if it is surrounded by important nodes.
We define the centrality $c_v$ of a node $v$ as the (normalized) sum of the centralities of its neighboring nodes, $c_v = \frac{1}{\lambda} \sum_{u \in N(v)} c_u$. This leads to a set of simultaneous linear equations, $\lambda c = A c$, where $\lambda$ is a normalization constant (the largest eigenvalue of $A$) and $A$ is the adjacency matrix of the graph.
Betweenness centrality: a node is important if it is a gatekeeper (i.e. when it lies on many shortest paths between other nodes), useful for social networks
Closeness centrality: a node is important if it has small shortest-path distances to all other nodes, $c_v = \frac{1}{\sum_{u \neq v} d(u, v)}$.
- Node's subgraphs - Graphlets: connected, non-isomorphic, rooted subgraphs of a larger graph.
 - Use Graphlet Degree Vectors (GDVs), described in the CS224W slides as "a count vector of rooted subgraphs," to characterize nodes based on their participation in different graphlet patterns. As opposed to network motifs, graphlets do not require enrichment tests; the emphasis is instead on comparison, clustering, and alignment.
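A minimal computational sketch of the node-level features above, using networkx and its built-in karate club graph (graphlet degree vectors are omitted, since networkx has no built-in routine for them):

```python
import networkx as nx

G = nx.karate_club_graph()

degree      = dict(G.degree())                # node degree: number of neighbors
eigenvector = nx.eigenvector_centrality(G)    # important if surrounded by important nodes
betweenness = nx.betweenness_centrality(G)    # lies on many shortest paths between other nodes
closeness   = nx.closeness_centrality(G)      # small shortest-path distances to all other nodes

# The most "gatekeeper-like" node according to betweenness centrality
print(max(betweenness, key=betweenness.get))
```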
 
We also have:
- Motifs: Recurring, significant patterns of interconnections in complex networks.
 - Network Motifs: Small subgraph patterns that appear significantly more frequently than expected in random networks, indicating functional significance. These are statistically significant and are usually used to form functional hypotheses.
 - Clustering coefficients: how connected node $v$'s neighboring nodes are.
 - An induced subgraph is formed by taking a subset of the vertices of a larger graph together with all edges whose endpoints both lie in that subset.
 - Isomorphic: graphs with identical topologies
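A short networkx sketch of the last three notions:

```python
import networkx as nx

G = nx.karate_club_graph()

# Clustering coefficient of node 0: how connected node 0's neighbors are to each other
print(nx.clustering(G, 0))

# Induced subgraph on a vertex subset: only edges between the chosen vertices are kept
H = G.subgraph([0, 1, 2, 3])

# Isomorphism: identical topology up to a relabeling of the nodes
print(nx.is_isomorphic(H, nx.complete_graph(4)))
```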
 

Edge/Link-level prediction: predict new/missing/unknown links based on the existing links. Aniva writes this can be either "trying to find missing links or finding new links as time progresses".
There are two formulations of link-level prediction:
- Links missing at random: remove a random set of links and then aim to predict them
 - Links over time: given a graph $G[t_0, t_0']$ defined by the edges observed up to time $t_0'$, output a ranked list $L$ of edges (not in $G[t_0, t_0']$) that are predicted to appear in $G[t_1, t_1']$, i.e. over a later time interval.
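A minimal sketch of the "links missing at random" formulation, using a simple common-neighbors heuristic in place of a learned model:

```python
import random
import networkx as nx

G = nx.karate_club_graph()

# Hide a random 10% of the edges, then try to recover them
edges = list(G.edges())
random.shuffle(edges)
hidden = set(edges[: len(edges) // 10])
G_train = G.copy()
G_train.remove_edges_from(hidden)

# Score every non-edge by its number of common neighbors and rank the candidates
scores = {
    (u, v): len(list(nx.common_neighbors(G_train, u, v)))
    for u, v in nx.non_edges(G_train)
}
ranked = sorted(scores, key=scores.get, reverse=True)

# How many hidden links appear among the top-ranked candidates?
top = ranked[: len(hidden)]
hits = sum(1 for (u, v) in top if (u, v) in hidden or (v, u) in hidden)
print(f"recovered {hits} of {len(hidden)} hidden links")
```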
 
Graph-level prediction: prediction for an entire graph or for a subgraph of it, for example predicting a property of a molecule from its molecular graph or forecasting how a network evolves over time.
Graph Kernel Methods are widely used in traditional ML for graph-level prediction. Instead of designing feature vectors, we design kernels that measure the similarity between graphs. Graph-level graphlet features, for example, count the number of occurrences of different graphlets in a graph.
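A toy sketch of this idea, restricted to the two connected 3-node graphlets (the open wedge and the triangle); a real graphlet kernel would cover larger graphlets and normalize the count vectors:

```python
import networkx as nx

def graphlet3_counts(G):
    """Count the two connected 3-node graphlets: open wedges and triangles."""
    triangles = sum(nx.triangles(G).values()) // 3          # each triangle is counted at its 3 nodes
    wedges = sum(d * (d - 1) // 2 for _, d in G.degree())    # length-2 paths centered at each node
    open_wedges = wedges - 3 * triangles                     # remove the wedges inside triangles
    return [open_wedges, triangles]

def graphlet_kernel(G1, G2):
    """Kernel value: similarity of two graphs via their graphlet count vectors."""
    f1, f2 = graphlet3_counts(G1), graphlet3_counts(G2)
    return sum(a * b for a, b in zip(f1, f2))

print(graphlet_kernel(nx.cycle_graph(5), nx.path_graph(5)))
```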
Graph Representation Learning alleviates the need to do feature engineering every single time. Here, our goal is efficient task-independent feature learning for machine learning with graphs.

Node Embeddings

Representation learning avoids the need to do feature engineering every time; the goal is to map nodes or graphs to vectors. These mappings are called embeddings.
Embedding: a list of numbers (a vector) representing a more complex piece of data (text, an image, a graph). Embeddings are useful for judging the similarity of complex data through the similarity of their vectors (close together, large dot product).
Node embedding:
- Each node in the graph is given its own vector.
There are 3 design choices (a minimal sketch follows this list):
- Encoder: maps each node to its embedding; it can be shallow (a simple lookup table mapping nodes to vectors) or deep, as seen in GNNs later in the text.
- Decoder: given 2 embeddings, measures their similarity, usually with a dot product.
- Similarity: a measure of similarity between nodes in the graph that the encoder and decoder are trained to approximate; it could be distance in the graph, neighborhood topology, etc. For example, neighbors or nodes with similar neighborhoods should end up with similar vectors.
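A minimal PyTorch sketch of the shallow encoder / dot-product decoder pattern (the number of nodes, the embedding dimension, and the node ids below are placeholders):

```python
import torch
import torch.nn as nn

num_nodes, dim = 100, 64

# Shallow encoder: a lookup table assigning each node id a trainable vector
encoder = nn.Embedding(num_nodes, dim)

def decoder(z_u, z_v):
    """Decoder: similarity of two embeddings, here a plain dot product."""
    return (z_u * z_v).sum(dim=-1)

u, v = torch.tensor([0]), torch.tensor([1])
score = decoder(encoder(u), encoder(v))
# Training adjusts the embedding table so that this score matches the chosen
# notion of node similarity (e.g. co-occurrence on random walks).
print(score)
```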
 
Graph embedding:
- One vector for the whole graph, obtained either by aggregating the node embeddings (e.g. summing or averaging them) or by adding a "virtual node" connected to the nodes of interest and learning its embedding.
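A minimal sketch of the aggregation approach (the node embeddings here are random placeholders standing in for learned ones):

```python
import torch

# Pretend each of the 5 nodes already has a learned 8-dimensional embedding
node_embeddings = torch.randn(5, 8)

# Simplest graph embedding: aggregate (here, average) the node embeddings
graph_embedding = node_embeddings.mean(dim=0)
print(graph_embedding.shape)  # torch.Size([8])
```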
 
Random Walk Embedding

An unsupervised/self-supervised node similarity measure can be based on random walks: nodes that tend to co-occur on short random walks should be similar. We are not using node labels or features, so the embeddings are task-independent. Random walks provide both expressivity and efficiency: the former because they give a flexible, stochastic definition of node similarity that incorporates lower- and higher-order neighborhood information; the latter because the graph does not need to be exhaustively traversed during training.
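A minimal sketch of the walk-generation step (DeepWalk-style, unbiased walks); the walks would then be fed into a skip-gram-style objective so that nodes co-occurring on walks receive similar embeddings:

```python
import random
import networkx as nx

def random_walk(G, start, length):
    """Generate one unbiased random walk of at most `length` nodes from `start`."""
    walk = [start]
    for _ in range(length - 1):
        neighbors = list(G.neighbors(walk[-1]))
        if not neighbors:  # dead end (possible in directed graphs)
            break
        walk.append(random.choice(neighbors))
    return walk

G = nx.karate_club_graph()
# A few short walks starting from every node
walks = [random_walk(G, v, length=10) for v in G.nodes() for _ in range(5)]
print(walks[0])
```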
More coming soon.