“Overview In this article I will answer questions commonly asked by users who are not familiar with AI/ML/CV with graphs or graph neural networks. I provide a Pytorch example to clarify the thinking behind this relatively new and exciting model.
Overview In this article I will answer questions commonly asked by users who are not familiar with AI/ML/CV with graphs or graph neural networks. I provide a Pytorch example to clarify the thinking behind this relatively new and exciting model.
The questions I ask in this part of the tutorial are:
Why is the graph data structure useful?
Why is it difficult to define convolutions on graphs?
What makes a neural network a graph neural network?
To answer these questions, I’ll provide motivating examples, papers, and python code to make it a tutorial on Graph Neural Networks (GNNs). The reader will need some basic machine learning and computer vision knowledge, however, I will also provide some background and intuitive explanations as I go along.
First, let’s briefly review what is a graph? A graph is a set of nodes (vertices) connected by directed/undirected edges. Nodes and edges usually come from some expert knowledge or intuition about the problem. So it can be atoms in molecules, users in social networks, cities in transportation systems, athletes in team sports, neurons in the brain, interacting objects in dynamic physical systems, pixels in images, bounding boxes or split mask. In other words, in many practical situations, you actually decide what are the nodes and edges in the graph.
In many practical cases, it’s actually you who decide what the nodes and edges in the graph are.
This is a very flexible data structure that generalizes many other data structures. For example, if there are no edges, then it becomes a set; if there are only “vertical” edges, and any two nodes are connected by exactly one path, then we have a tree. This flexibility is good and bad, which I will discuss in this tutorial.
1. Why is the graph data structure useful?
In the context of computer vision (cv) and machine learning (ml), studying graphs and the models that learn from them can give us at least four benefits:
1.1 We can get closer to solving important problems that were too challenging before, such as: drug discovery in cancer (Veselkov et al., Nature, 2019); better understanding of the human brain connectome (Diez & Sepulcre, Nature Communications, 2019); Materials discovery for energy and environmental challenges (Xie et al., Nature Communications, 2019).
1.2 In most cv/ml applications, data can actually be seen as graphs, even if you once represented them as another data structure. Representing your data as graphs gives you a lot of flexibility and can give you a very different and interesting perspective on your problem. For example, instead of learning from image pixels, you can learn from “superpixels”, as described in (Liang et al., ECCV2016) and in our forthcoming BMVC paper. Graphs also allow you to impose a relational inductive bias in the data – some prior knowledge about the problem. For example, if you want to reason about human pose, your relational bias can be a graph of human skeleton joints (Yan et al., AAAI, 2018); or if you want to reason about video, your relational bias can be moving bounding boxes Figures of (Wang & Gupta, ECCV2018). Another example is representing facial landmarks as graphs (Antonakos et al., CVPR, 2015) to reason about facial attributes and identities.
1.3 Your favorite neural network itself can be thought of as a graph where nodes are neurons and edges are weights, or nodes are layers and edges represent a flow of forward/backward passes (in this case we The discussion is about computational graphs used in tensorflow, pytorch, and other dl frameworks). Applications can be optimization of computational graphs, neural architecture search, analysis of training behavior, etc.
1.4 Finally, you can solve many problems more efficiently where data can be more naturally represented as graphs. This includes, but is not limited to, molecular and social network classification (Knyazev et al., Neurips-W, 2018) and generation (Simonovsky & Komodakis, ICANN, 2018), 3D mesh classification and correspondence (Fey et al., CVPR, 2018) and generation (Wang et al., ECCVV, 2018), modeling behavior of dynamic interacting objects (Kipf et al., ICML, 2018), visual scene graph modeling (see upcoming ICcv workshop) and question answering (Narasimhan, Neurips, 2018), program synthesis (Allamanis et al., ICLR, 2018), different reinforcement learning tasks (Bapst et al., ICML, 2019) and many other exciting problems.
Since my previous research was about recognizing and analyzing faces and emotions, I especially liked the diagram below.
2. Why is it difficult to define convolutions on graphs?
To answer this question, I first give some motivation for using convolutions in general, and then describe “convolutions on images” in graph terms, which should make the transition to “convolutions on graphs” smoother.
2.1 Why is convolution useful?
Let’s understand why we care so much about convolution and why we want to use it to draw graphs. Compared to fully connected neural networks (a.k.a.nns or mlps), convolutional networks (a.k.a.cnns or convnets) have some of the following advantages explained in terms of an image of a nice old Chevrolet .
First, ConvNets exploit natural priors in images, which are more formally described in (Bronstein et al., 2016), e.g.
1. Translation invariance – If we translate the car in the image above left/right/up/down, we should still be able to recognize it as a car. This can be exploited by sharing filters at all locations, i.e. applying convolutions.
2. Location – Nearby pixels are closely related, usually representing some semantic concept, such as a scroll wheel or a window. This can be exploited by using a relatively large filter that captures image features in a local spatial neighborhood.
3. Compositional (or Hierarchical) – Larger regions in an image are often the semantic parents of the smaller regions it contains. For example, a car is the parent of the doors, windows, wheels, driver, etc., and the driver is the parent of the head, arms, etc. This is exploited implicitly by stacking convolutional layers and applying pooling.
Second, the number of trainable parameters (i.e. filters) in a convolutional layer does not depend on the input dimension, so technically we can train the exact same model on 28×28 and 512×512 images. In other words, the model is parametric.
Ideally, our goal is to develop a model that is as flexible as Graph Neural Nets, capable of summarizing and learning from any data, but at the same time, we want to control (regulate) by turning on/off certain priors. ) this flexibility factor.
All these excellent properties make ConvNets less prone to overfitting (higher training set accuracy and lower validation/test set accuracy), more accurate in different vision tasks, and easily scalable to Large image and datasets. Therefore, it is attractive to transfer all these properties to a graph neural network GNN to normalize its flexibility and make it scalable when we want to solve the important task of the input data adopting a graph structure. Ideally, our goal is to develop a model that is as flexible as a GNN and can summarise and learn from any data, but at the same time we want to control (regulate) by turning on/off certain priors This flexibility factor. This can open many interesting directions of research. Controlling this tradeoff is challenging, however.
2.2 Convolving images with graphs
Let us consider an undirected graph G with N nodes. Edge E represents an undirected connection between nodes. Nodes and edges usually come from your intuition about the problem. For images, our intuition is that nodes are pixels or superpixels (a group of weirdly shaped pixels) and edges are the spatial distances between them. For example, the MNIST image in the lower left is usually represented as a matrix of size 28×28. We can also represent this as a set of N = 28 * 28 = 784 pixels. So our graph G will have N = 784 nodes, and the edges of pixels with closer edge locations will have larger values (thicker edges in the image below), and the edges of distant pixels will have smaller values (thinner edges).
When we train Neural Networks or ConvNets on images, we implicitly define images on a graph – the image below is a regular 2D grid. Since this grid is the same for all training and test images and is regular, i.e. all pixels of the grid are connected to each other in exactly the same way on all images (i.e. have the same number of neighbors, side lengths Wait). then this regular grid diagram has no information to help us distinguish one image from another. Below, I visualize some 2D and 3D regular grids, where the order of nodes is color-coded. By the way, I’m using NetworkX in Python to do this, e. g. G = networkx. grid_graph([4, 4]).
With this 4×4 regular grid in hand, let’s briefly look at how 2D convolution works to understand why it’s so hard to translate this operator into a graph. Filters on a regular grid have the same order of nodes, but modern convolutional networks typically have smaller filters, such as 3×3 in the example below. This filter has 9 values: W 1, W 2, …, W 3, which are the values we are updating during training with the backpropagator to minimize loss and solve downstream tasks. In the example below, we heuristically initialize this filter as an edge detector.
When we do convolution, we slide this filter in two directions: to the right and to the bottom, but there’s nothing stopping us from starting at the bottom corner – it’s important to slide in all possible positions. At each location, we compute the dot product between the value on the grid (denoted by X) and the filter value W: W:X? W? + X? W? +…+X? W? , and store the result in the output image. In our visualization, we change the color of the nodes during the slide to match the color of the nodes in the grid. In a regular grid, we can always put the X? W? The nodes of the mesh are matched with the nodes of the grid. Unfortunately, this is not the case for graphs, which I will explain below.
The dot product used above is one of the so called “aggregation operators”. Broadly speaking, the goal of aggregation operators is to summarize data into a simplified form. In the example above, the dot product aggregates the 3×3 matrix into a single value. Another example is pooling in ConvNets. Remember that methods like max pooling or total pooling are permutation-invariant, i.e. they merge the same values from the spatial region even if you randomly shuffle all the pixels within the region. Just to be clear, the dot product is not permutation invariant, just because in general: X? W? +X? W? ≠X? W? +X? W? .
Now, let’s use our MNIST image and illustrate what regular grids, filters and convolutions mean. Remember our graph terminology, this regular 28×28 grid will be our graph G, so each cell in this grid is a node, and the node feature is the actual image X, i.e. each node There will be only one feature – pixel intensity from 0 (black) to 1 (white).
Next, we define a filter and make it a well-known Gabor filter with some (almost) arbitrary parameters. Once we have the image and filter, we can perform the convolution by sliding the filter (7 bits in this case) over that image and placing the result of the dot product onto the output matrix.
This is all cool, but as I mentioned before, it gets tricky when you try to generalize convolutions to graphs.
A node is a collection, no permutation of that collection will change it.Therefore, the aggregation operators that people apply should be permutation-invariant
As I already mentioned, the dot product used above to compute the convolution at each step is order sensitive. This sensitivity enables us to learn edge detectors similar to Gabor filters, which are important for capturing image features. The problem is that there is no well-defined order of nodes in a graph, and unless you learn to order the nodes, or come up with some heuristics, that will result in a consistent order (norm) from graph to graph. In short, a node is a collection, and any permutation of this collection does not change it. Therefore, the aggregation operators that people apply should be permutation-invariant. The most popular options are the mean (GCN, Kipf & Welling, ICLR, 2017) and summation (GIN, Xu et al., ICLR, 2019) of all neighbors, i.e. summation or mean pooling followed by a trainable vector W projection. For other aggregators, see (Hamilton et al., NIPS, 2017).
For example, for the top left graph above, the output of the summation aggregator for node 1 is X? = (X? +X? +X? +X?) W? , for node 2: X? =(X?+X?+X?+X?) W? etc for nodes 3, 4 and 5, i.e. we need to apply this aggregator to all nodes. As a result, we will get a graph with the same structure, but the node features will now contain neighbor features. We can use the same idea with the graph on the right.
Colloquially, people call this averaging or summation a “convolution” because we also “slide” from one node to another and apply an aggregation operator at each step. However, it is important to keep in mind that this is a very specific form of convolution, where the filter has no sense of direction. Below, I’ll show what these filters look like and give ideas on how to make them even better.
3.What makes a neural network a graph neural network
You know how a classical neural network works, right? We have some C-dimensional features X as input to the network. Using our running MNIST example, X will be our C=784 dimensional pixel feature (i.e. the “flattened” image). These features are multiplied by the C×F dimension weights W that we update during training to bring the output closer to our expectations. The results can be used directly to solve the task (e.g. in the case of regression), or they can be further fed into some non-linearity (activation) such as ReLU or other differentiable (or more precisely, differentiable) functions to form multiple layer network. Typically, the output of some layers is:
The signal in MNIST is so strong that just using the formula above and the cross-entropy loss, you can get 91% accuracy without any non-linearity and other tricks (I used a slightly modified PyTorch example to do this at this point). This kind of model is called multinomial (or multiclass, since we have 10 classes of numbers) logistic regression.
Now, how do we turn ordinary neural networks into graph neural networks? As you already know, the core idea behind GNNs is aggregation by “neighbors”. Here, it’s important to understand that in many cases it’s you who actually specify the “neighbors”.
Let’s consider a simple case first, when you get some graphs. For example, this could be a fragment (subgraph) of a social network with 5 people, with an edge between a pair of nodes indicating whether two people are friends (or at least one of them thinks so). The adjacency matrix (usually denoted A) in the lower right figure is a way to represent these edges in matrix form, which is very convenient for our deep learning framework. Yellow cells in the matrix represent edges, while blue means no edges.
Now, let’s create an adjacency matrix A for our MNIST example based on pixel coordinates (full code is provided at the end of the article):
import numpy as np
from scipy. spatial. distance import cdist
img_size = 28 # MNIST image length and width
col, row = np. meshgrid(np.arange(img_size), np.arange(img_size))
coord = np. stack((col, row), axis=2). reshape(-1, 2) / img_size
dist = cdist(coord, coord) # see bottom left
sigma = 0.2 * np. pi # Gaussian distribution width
A = np. exp (- dist / sigma ** 2) # See the middle of the figure below
This is a typical but not the only way to define an adjacency matrix for vision tasks (Defferrard et al., NIPS, 2016; Bronstein et al., 2016). This adjacency matrix is our prior matrix, which is our inductive bias, we intuitively connect nearby pixels together, while distant pixels should not or should have very thin edges (smaller edges), with This is imposed on the model. This is due to the observation that in natural images, nearby pixels often correspond to the same object or frequently interacting objects (the locality principle we mentioned in Section 2.1), so connecting such pixels is very useful significance.
So now, instead of just feature X, we have some fancy matrix A with values in the range[0, 1]. It is important to note that once we know that the input is a graph, we assume that there is no canonical order of nodes to be consistent across all other graphs in the dataset. In the case of images, this means that the pixels are assumed to be randomly shuffled. In practice, finding the canonical order of nodes is combinatorially unsolvable. Even though technically for MNIST we can cheat by knowing this order (since the data is originally from a regular grid), it doesn’t work for real graph datasets.