For a machine learning project at Polytechnique we required random walks on a graph, similar to the random walks generated within DeepWalk or PinSAGE. In this post I compare a simple, pythonic implementation based on NetworkX (version 2.2) with a less intuitive C implementation that outperforms the Python implementation by 50x (serial) and 300x (6 threads). The code for this post can be found on GitHub in the notebook called post.ipynb.
The code based on NetworkX is given below.
import random

def nx_random_walk(G, num_walks=10, num_steps=3):
    walks = list()
    for i in G.nodes():
        for walk in range(num_walks):
            curr_walk = [i]
            curr = i
            for step in range(num_steps):
                neighbors = list(G.neighbors(curr))
                if len(neighbors) > 0:
                    curr = random.choice(neighbors)
                # If a node has no neighbors, the walk stays in place
                curr_walk.append(curr)
            walks.append(curr_walk)
    return walks
walks is a list of lists, with num_walks lists per node of the graph. We assume an undirected graph.
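As a quick sanity check, here is a minimal sketch of calling the function on a small random graph (the generator and parameters are arbitrary choices for illustration):

import networkx as nx

G = nx.fast_gnp_random_graph(100, 0.05, seed=111413)
walks = nx_random_walk(G, num_walks=2, num_steps=3)
# 100 nodes * 2 walks per node, each walk has num_steps + 1 = 4 entries
print(len(walks), len(walks[0]))  # 200 4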
The final C code and Python wrapper can be found in the rwalk package on GitHub.
The C implementation uses NumPy to offer an easy interface to Python. The graph is stored by two 1-D NumPy arrays, ptr and neighs. ptr has shape (n+1,), where n is the number of nodes in the graph. We assume that the graph nodes are enumerated by consecutive integers starting at 0. ptr[0] = 0 and for all nodes i there holds: ptr[i+1] - ptr[i] is the degree of node i. neighs holds the neighbors such that neighs[ptr[i]:ptr[i+1]] are the neighbors of node i. Therefore neighs has shape (ptr[n],). The format is equivalent to the CSR format for sparse matrices, and ptr, neighs are the equivalents of indptr, indices in scipy.sparse.csr_matrix. This format is memory efficient and allows fast access to the neighbors of node i, which is crucial for the random walk algorithm. For brevity we will not consider the complex topic of graph reordering.
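To make the layout concrete, here is a small sketch for a toy undirected graph (the graph itself is made up for illustration):

import numpy as np

# Toy graph with edges 0-1, 0-2, 1-2, 2-3; each undirected edge is stored twice
# Adjacency lists: 0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]
neighs = np.array([1, 2, 0, 2, 0, 1, 3, 2], dtype=np.int32)
ptr = np.array([0, 2, 4, 7, 8], dtype=np.int32)

i = 2
print(neighs[ptr[i]:ptr[i+1]])  # [0 1 3], the neighbors of node 2
print(ptr[i+1] - ptr[i])        # 3, the degree of node 2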
The rwalk package also provides a helper to read edge lists as generated by NetworkX.
import networkx as nx
import rwalk

seed = 111413
n = 20000
avg_deg = 50
p = avg_deg / n
G = nx.fast_gnp_random_graph(n, p, seed=seed)
nx.write_edgelist(G, "test.edgelist", data=False)
ptr, neighs = rwalk.read_edgelist("test.edgelist")
The C code is given below.
#include "rwalk.h"
#include <omp.h>
#include <stdlib.h>
void random_walk(int const* ptr, int const* neighs, int n, int num_walks,
int num_steps, int seed, int nthread, int* walks) {
if (nthread > 0) {
omp_set_num_threads(nthread);
}
#pragma omp parallel
{
int thread_num = omp_get_thread_num();
unsigned int private_seed = (unsigned int)(seed + thread_num);
#pragma omp for
for (int i = 0; i < n; i++) {
int offset, num_neighs;
for (int walk = 0; walk < num_walks; walk++) {
int curr = i;
offset = i * num_walks * (num_steps + 1) + walk * (num_steps + 1);
walks[offset] = i;
for (int step = 0; step < num_steps; step++) {
num_neighs = ptr[curr + 1] - ptr[curr];
if (num_neighs > 0) {
curr = neighs[ptr[curr] + (rand_r(&private_seed) % num_neighs)];
}
walks[offset + step + 1] = curr;
}
}
}
}
}
The implementation contains two subtleties:

- We use rand_r instead of rand, as rand is not guaranteed to be thread-safe in multi-threaded applications.
- We must check num_neighs > 0, as otherwise % num_neighs is undefined. num_neighs > 0 can be well predicted by the branch predictor, so the if condition is not a performance bottleneck.

To call the compiled C code from Python, we will use the method described in SciPy Lectures. This is neither the only nor the most elegant method, but it is easy, works well on Linux and forces a C interface. We consider the last point to be an advantage, as using only C types keeps the syntax and data representation clean and simple.
Based on the information above, we have to compile the C code into a shared library, which we achieve with the following Makefile.
CC?=gcc  # Set compiler if CC is not set
CFLAGS= -fopenmp -fPIC -O3 -D NDEBUG -Wall -Werror  # Full optimization. Code does not use floats

all: librwalk.so

librwalk.so: rwalk.o
	$(CC) $(CFLAGS) -shared -Wl,-soname,librwalk.so -o librwalk.so rwalk.o
	rm rwalk.o

rwalk.o: rwalk.c
	$(CC) -c $(CFLAGS) rwalk.c -o rwalk.o

clean:
	rm -rf librwalk.so rwalk.o __pycache__
Following this method, we use the code below in rwalk/__init__.py to interface with the librwalk.so library.
import numpy as np
import numpy.ctypeslib as npct
from ctypes import c_int
from os.path import dirname

# 1-D contiguous arrays of int32, as expected by the C code
array_1d_int = npct.ndpointer(dtype=np.int32, ndim=1, flags='CONTIGUOUS')

librwalk = npct.load_library("librwalk", dirname(__file__))
print("rwalk: Loading library from: {}".format(dirname(__file__)))

librwalk.random_walk.restype = None
librwalk.random_walk.argtypes = [array_1d_int, array_1d_int, c_int, c_int,
                                 c_int, c_int, c_int, array_1d_int]

def random_walk(ptr, neighs, num_walks=10, num_steps=3, nthread=-1, seed=111413):
    assert(ptr.flags['C_CONTIGUOUS'])
    assert(neighs.flags['C_CONTIGUOUS'])
    assert(ptr.dtype == np.int32)
    assert(neighs.dtype == np.int32)

    n = ptr.size - 1
    walks = -np.ones((n * num_walks, num_steps + 1), dtype=np.int32, order='C')
    assert(walks.flags['C_CONTIGUOUS'])

    librwalk.random_walk(
        ptr,
        neighs,
        n,
        num_walks,
        num_steps,
        seed,
        nthread,
        np.reshape(walks, (walks.size,), order='C'))

    return walks
It is important to assert the datatype and layout, as the C library only works in this specific case. We can then call the function like walks = rwalk.random_walk(ptr, neighs), where ptr and neighs are e.g. the outputs of rwalk.read_edgelist("test.edgelist"). A complete end-to-end sketch follows.
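Putting the pieces together, a minimal end-to-end sketch (assuming test.edgelist was written as shown earlier):

import rwalk

ptr, neighs = rwalk.read_edgelist("test.edgelist")
walks = rwalk.random_walk(ptr, neighs, num_walks=10, num_steps=3)
# One row per walk: n * num_walks rows and num_steps + 1 columns
print(walks.shape)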
The speedup over the NetworkX implementation is about 50x for the C implementation using one thread and about 300x using six threads (the machine had six cores). These numbers are machine dependent and we encourage the reader to run the post.ipynb notebook for themselves and check the numbers!
We used Amdahl's law to obtain a least squares estimate for p, the fraction of the program that can be parallelized. For the specific example presented, p is consistently larger than 95%.
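For reference, Amdahl's law predicts a runtime on N threads of t(N) = t(1) * ((1 - p) + p / N). A minimal sketch of such a least squares estimate, using SciPy's curve_fit (the runtimes below are placeholders, not the actual measurements):

import numpy as np
from scipy.optimize import curve_fit

def amdahl_runtime(nthreads, p, t1):
    # Runtime on nthreads threads, given serial runtime t1 and parallel fraction p
    return t1 * ((1 - p) + p / nthreads)

nthreads = np.array([1, 2, 3, 4, 5, 6])
runtimes = np.array([1.00, 0.53, 0.37, 0.29, 0.24, 0.21])  # placeholder timings

(p, t1), _ = curve_fit(amdahl_runtime, nthreads, runtimes, p0=(0.9, 1.0))
print("Estimated parallel fraction p = {:.3f}".format(p))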
To further investigate the scaling of the parallel implementation, we plotted the runtime for one to six threads and compared it to the runtime predicted by Amdahl's law using the p estimated before. We observe good agreement and conclude that the implementation provided is well parallelized and would scale well to an even higher number of cores.
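Continuing the sketch above, such a comparison plot can be produced with a few lines of matplotlib, reusing the fitted p and t1:

import matplotlib.pyplot as plt

plt.plot(nthreads, runtimes, 'o', label='measured')
plt.plot(nthreads, amdahl_runtime(nthreads, p, t1), '-', label="Amdahl's law")
plt.xlabel('number of threads')
plt.ylabel('runtime [s]')
plt.legend()
plt.show()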
The provided C implementation greatly outperforms the Python implementation. Interfacing C using NumPy ctypes works very well but is not suited for cross-platform deployment. If one is comfortable writing C code, it might thus be a viable option to implement core parts of the algorithm in C. Additionally, this offers a simple way to circumvent the Global Interpreter Lock.
I hope you find this post interesting and I am happy to answer any questions or inquiries via email.