## Fast random walks in C with Python interface

For a machine learning project at Polytechnique we required random walks on a graph, similar to the walks generated by DeepWalk or PinSAGE. In this post I compare a simple, pythonic implementation based on NetworkX (version 2.2) with a less intuitive C implementation that outperforms the Python version by about 50x (serial) and 300x (6 threads). The code for this post can be found on GitHub in the notebook called `post.ipynb`.

### NetworkX Implementation

The code based on NetworkX is given below.

```python
import random

def nx_random_walk(G, num_walks=10, num_steps=3):
    walks = list()
    for i in G.nodes():
        for walk in range(num_walks):
            curr_walk = [i]
            curr = i
            for step in range(num_steps):
                neighbors = list(G.neighbors(curr))
                if len(neighbors) > 0:
                    curr = random.choice(neighbors)
                # the walk stays at `curr` if the node has no neighbors
                curr_walk.append(curr)
            walks.append(curr_walk)
    return walks
```

`walks` is a list of lists, with `num_walks` walks per node of the graph. We assume an undirected graph.

### C/NumPy Implementation

The final C code and Python wrapper can be found in the rwalk package on GitHub.

#### Graph Representation

The C implementation uses NumPy to offer an easy interface to Python. The graph is stored in two 1-D NumPy arrays, `ptr` and `neighs`. `ptr` has shape `(n+1,)`, where `n` is the number of nodes in the graph; we assume the nodes are enumerated by consecutive integers starting at `0`. `ptr[0] = 0`, and for every node `i`, `ptr[i+1] - ptr[i]` is the degree of node `i`. `neighs` holds the adjacency lists such that `neighs[ptr[i]:ptr[i+1]]` are the neighbors of node `i`; `neighs` therefore has shape `(ptr[n],)`. The format is equivalent to the CSR format for sparse matrices, with `ptr` and `neighs` playing the roles of `indptr` and `indices` in `scipy.sparse.csr_matrix`. This representation is memory efficient and allows fast access to the neighbors of node `i`, which is crucial for the random walk algorithm. For brevity we will not consider the complex topic of graph reordering.
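As an illustration, the `ptr`/`neighs` arrays for a small undirected graph can be built directly from an edge list. The helper below is a sketch for illustration only, not part of the rwalk package:

```python
import numpy as np

# Sketch: build the (ptr, neighs) CSR arrays for an undirected graph
# given as a list of edges over nodes 0..n-1. Not part of rwalk.
def edges_to_csr(edges, n):
    deg = np.zeros(n, dtype=np.int32)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1  # undirected: each edge counts for both endpoints
    ptr = np.zeros(n + 1, dtype=np.int32)
    np.cumsum(deg, out=ptr[1:])       # ptr[i+1] - ptr[i] == degree of i
    neighs = np.zeros(ptr[n], dtype=np.int32)
    fill = ptr[:-1].copy()            # next free slot per node
    for u, v in edges:
        neighs[fill[u]] = v; fill[u] += 1
        neighs[fill[v]] = u; fill[v] += 1
    return ptr, neighs

# Triangle graph on 3 nodes:
ptr, neighs = edges_to_csr([(0, 1), (0, 2), (1, 2)], n=3)
print(list(ptr))                   # -> [0, 2, 4, 6]
print(list(neighs[ptr[0]:ptr[1]])) # -> [1, 2]  (neighbors of node 0)
```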

#### Input

The rwalk package also offers the possibility to read edge lists such as those written by NetworkX:

```python
import networkx as nx

seed = 111413
n = 20000
avg_deg = 50
p = avg_deg / n
G = nx.fast_gnp_random_graph(n, p, seed=seed)
nx.write_edgelist(G, "test.edgelist", data=False)
```

#### C Code

The C code is given below.

```c
#include "rwalk.h"
#include <omp.h>
#include <stdlib.h>

void random_walk(int const* ptr, int const* neighs, int n, int num_walks,
                 int num_steps, int seed, int nthread, int* walks) {
  if (nthread > 0) {
    omp_set_num_threads(nthread);
  }
#pragma omp parallel
  {
    int thread_num = omp_get_thread_num();
    unsigned int private_seed = (unsigned int)(seed + thread_num);
#pragma omp for
    for (int i = 0; i < n; i++) {
      int offset, num_neighs;
      for (int walk = 0; walk < num_walks; walk++) {
        int curr = i;
        offset = i * num_walks * (num_steps + 1) + walk * (num_steps + 1);
        walks[offset] = i;
        for (int step = 0; step < num_steps; step++) {
          num_neighs = ptr[curr + 1] - ptr[curr];
          if (num_neighs > 0) {
            curr = neighs[ptr[curr] + (rand_r(&private_seed) % num_neighs)];
          }
          walks[offset + step + 1] = curr;
        }
      }
    }
  }
}
```

The implementation contains two subtleties:

- It is crucial to use `rand_r` instead of `rand` in multi-threaded applications.
- We have to explicitly check that `num_neighs > 0`, as otherwise `% num_neighs` is undefined. The condition `num_neighs > 0` is well predicted by the branch predictor, so the `if` is not a performance bottleneck.
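A further detail worth spelling out is the offset arithmetic: each walk occupies `num_steps + 1` consecutive entries of the flat `walks` buffer, so the offset computed in the C loop is exactly row-major indexing of an `(n * num_walks, num_steps + 1)` array. A quick sketch (parameter values chosen arbitrarily):

```python
# Sketch: the flat offset in the C loop equals row-major indexing of a
# (n * num_walks, num_steps + 1) array; walk `walk` of node `i` lands
# in row i * num_walks + walk.
num_walks, num_steps = 2, 3

def offset(i, walk):
    return i * num_walks * (num_steps + 1) + walk * (num_steps + 1)

def row(i, walk):
    return i * num_walks + walk

assert offset(3, 1) == row(3, 1) * (num_steps + 1)  # 28 == 7 * 4
```

This is why the Python wrapper below can allocate `walks` as a 2-D array and pass a flat view to C.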

#### Compilation

We will use the method described in SciPy Lectures. This is neither the only nor the most elegant method, but it is easy, works well on Linux and forces a C interface. We consider the last point to be an advantage as using only C types keeps the syntax and data representation clean and simple.

Based on the information above we have to compile the C code into a shared library which we achieve with the following Makefile.

```makefile
CC ?= gcc  # Set compiler if CC is not set
CFLAGS = -fopenmp -fPIC -O3 -D NDEBUG -Wall -Werror  # Full optimization; code does not use floats

all: librwalk.so

librwalk.so: rwalk.o
	$(CC) $(CFLAGS) -shared -Wl,-soname,librwalk.so -o librwalk.so rwalk.o
	rm rwalk.o

rwalk.o: rwalk.c
	$(CC) -c $(CFLAGS) rwalk.c -o rwalk.o

clean:
	rm -rf librwalk.so rwalk.o __pycache__
```

#### Python Interface

Based on the method used, we use the following code in `rwalk/__init__.py` to interface with the `librwalk.so` library.

```python
import numpy as np
import numpy.ctypeslib as npct
from ctypes import c_int
from os.path import dirname

array_1d_int = npct.ndpointer(dtype=np.int32, ndim=1, flags='CONTIGUOUS')

librwalk = npct.load_library("librwalk", dirname(__file__))

librwalk.random_walk.restype = None
librwalk.random_walk.argtypes = [array_1d_int, array_1d_int, c_int, c_int,
                                 c_int, c_int, c_int, array_1d_int]

def random_walk(ptr, neighs, num_walks=10, num_steps=3, nthread=-1, seed=111413):
    assert(ptr.flags['C_CONTIGUOUS'])
    assert(neighs.flags['C_CONTIGUOUS'])
    assert(ptr.dtype == np.int32)
    assert(neighs.dtype == np.int32)

    n = ptr.size - 1
    walks = -np.ones((n * num_walks, num_steps + 1), dtype=np.int32, order='C')
    assert(walks.flags['C_CONTIGUOUS'])

    librwalk.random_walk(
        ptr,
        neighs,
        n,
        num_walks,
        num_steps,
        seed,
        nthread,
        np.reshape(walks, (walks.size,), order='C'))

    return walks
```

It is important to `assert` the datatype and layout as the C library only works in this specific case. We can then call the function like `walks = rwalk.random_walk(ptr, neighs)` where `ptr` and `neighs` are e.g. the outputs of `rwalk.read_edgelist("test.edgelist")`.

### Runtime Analysis

The speedup over the NetworkX implementation is about 50x for the C implementation using one thread and about 300x for the C implementation using six threads (the machine had six cores). These numbers are also machine dependent and we encourage the reader to run the `post.ipynb` notebook for themselves and check the numbers!

We used Amdahl's law to obtain a least-squares estimate of `p`, the fraction of the program that can be parallelized. For the specific example presented, the estimated `p` is consistently larger than 95%.
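The fit itself is straightforward: Amdahl's law, `T(k) = T(1) * ((1 - p) + p / k)`, is linear in `p`, so an ordinary least-squares solve recovers it. The sketch below uses synthetic timings for illustration; the real measurements are in the notebook.

```python
import numpy as np

# Sketch: least-squares fit of the parallel fraction p in Amdahl's law,
# T(k) = T(1) * ((1 - p) + p / k), from runtimes at k = 1..6 threads.
# The timings below are synthetic, for illustration only.
threads = np.arange(1, 7)
t1 = 1.0
p_true = 0.96
runtimes = t1 * ((1 - p_true) + p_true / threads)

# Rearranged, T(k) - T(1) = -T(1) * (1 - 1/k) * p, which is linear in p.
a = -t1 * (1 - 1 / threads)   # one-column design matrix
b = runtimes - t1             # observations
p_hat = np.linalg.lstsq(a[:, None], b, rcond=None)[0][0]
print(round(p_hat, 2))  # -> 0.96
```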

To further investigate the scaling of the parallel implementation we plotted the runtime for one to six threads and compared it to the runtime predicted by Amdahl's law using the `p` estimated before. We observe good agreement and conclude that the implementation is well parallelized and would scale well to an even higher number of cores.

### Conclusion

The provided C implementation greatly outperforms the Python implementation. Interfacing C from Python via NumPy's `ctypes` support works very well but is not suited for cross-platform deployment. If one is comfortable writing C code, it can thus be a viable option to implement core parts of an algorithm in C. Additionally, this offers a simple way to circumvent the Global Interpreter Lock.

I hope you find this post interesting and I am happy to answer any questions or inquiries via E-Mail.
