For a machine learning project at Polytechnique we required random walks on a graph, similar to the random walks generated within DeepWalk or PinSAGE. In this post I compare a simple, pythonic implementation based on NetworkX (version 2.2) with a less intuitive C implementation that outperforms the Python implementation by 50x (serial) and 300x (6 threads). The code for this post can be found on GitHub in the notebook called post.ipynb.
The code based on NetworkX is given below.
import random

def nx_random_walk(G, num_walks=10, num_steps=3):
    walks = list()
    for i in G.nodes():
        for walk in range(num_walks):
            curr_walk = [i]
            curr = i
            for step in range(num_steps):
                neighbors = list(G.neighbors(curr))
                if len(neighbors) > 0:
                    curr = random.choice(neighbors)
                # If a node has no neighbors, the walk stays in place
                curr_walk.append(curr)
            walks.append(curr_walk)
    return walks
walks is a list of lists, with num_walks lists per node of the graph. We assume an undirected graph.
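As a quick sanity check, here is a minimal sketch of calling the function on a small random graph (the generator and parameters are arbitrary choices for illustration):

import networkx as nx

G = nx.fast_gnp_random_graph(100, 0.05, seed=111413)
walks = nx_random_walk(G, num_walks=2, num_steps=3)
# 100 nodes * 2 walks per node, each walk has num_steps + 1 = 4 entries
print(len(walks), len(walks[0]))  # 200 4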
The final C code and Python wrapper can be found in the rwalk package on GitHub.
The C implementation uses NumPy to offer an easy interface to Python. The graph is stored by two 1-D NumPy arrays, ptr and neighs. ptr has shape (n+1,), where n is the number of nodes in the graph. We assume that the graph nodes are enumerated by consecutive integers starting at 0. ptr[0] = 0 and for all nodes i there holds: ptr[i+1] - ptr[i] is the degree of node i. neighs holds the neighbors such that neighs[ptr[i]:ptr[i+1]] are the neighbors of node i. Therefore neighs has shape (ptr[n],). The format is equivalent to the CSR format for sparse matrices, and ptr, neighs are the equivalents of indptr, indices in scipy.sparse.csr_matrix. This format is memory efficient and allows fast access to the neighbors of node i, which is crucial for the random walk algorithm. For brevity we will not consider the complex topic of graph reordering.
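To make the layout concrete, here is a small sketch for a toy undirected graph (the graph itself is made up for illustration):

import numpy as np

# Toy graph with edges 0-1, 0-2, 1-2, 2-3; each undirected edge is stored twice
# Adjacency lists: 0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]
neighs = np.array([1, 2, 0, 2, 0, 1, 3, 2], dtype=np.int32)
ptr = np.array([0, 2, 4, 7, 8], dtype=np.int32)

i = 2
print(neighs[ptr[i]:ptr[i+1]])  # [0 1 3], the neighbors of node 2
print(ptr[i+1] - ptr[i])        # 3, the degree of node 2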
The rwalk package also provides a helper to read edge lists as generated by NetworkX.
import networkx as nx
import rwalk

seed = 111413
n = 20000
avg_deg = 50
p = avg_deg / n
G = nx.fast_gnp_random_graph(n, p, seed=seed)
nx.write_edgelist(G, "test.edgelist", data=False)
ptr, neighs = rwalk.read_edgelist("test.edgelist")
The C code is given below.
#include "rwalk.h"
#include <omp.h>
#include <stdlib.h>
void random_walk(int const* ptr, int const* neighs, int n, int num_walks,
int num_steps, int seed, int nthread, int* walks) {
if (nthread > 0) {
omp_set_num_threads(nthread);
}
#pragma omp parallel
{
int thread_num = omp_get_thread_num();
unsigned int private_seed = (unsigned int)(seed + thread_num);
#pragma omp for
for (int i = 0; i < n; i++) {
int offset, num_neighs;
for (int walk = 0; walk < num_walks; walk++) {
int curr = i;
offset = i * num_walks * (num_steps + 1) + walk * (num_steps + 1);
walks[offset] = i;
for (int step = 0; step < num_steps; step++) {
num_neighs = ptr[curr + 1] - ptr[curr];
if (num_neighs > 0) {
curr = neighs[ptr[curr] + (rand_r(&private_seed) % num_neighs)];
}
walks[offset + step + 1] = curr;
}
}
}
}
}
The implementation contains two subtleties:

- We use rand_r instead of rand, as rand is not guaranteed to be thread-safe in multi-threaded applications.
- We must check num_neighs > 0, as otherwise % num_neighs is undefined. num_neighs > 0 can be well predicted by the branch predictor, so the if condition is not a performance bottleneck.

To call the compiled C code from Python, we will use the method described in SciPy Lectures. This is neither the only nor the most elegant method, but it is easy, works well on Linux and forces a C interface. We consider the last point to be an advantage, as using only C types keeps the syntax and data representation clean and simple.
Based on the information above, we have to compile the C code into a shared library, which we achieve with the following Makefile.
CC?=gcc  # Set compiler if CC is not set
CFLAGS= -fopenmp -fPIC -O3 -D NDEBUG -Wall -Werror  # Full optimization. Code does not use floats

all: librwalk.so

librwalk.so: rwalk.o
	$(CC) $(CFLAGS) -shared -Wl,-soname,librwalk.so -o librwalk.so rwalk.o
	rm rwalk.o

rwalk.o: rwalk.c
	$(CC) -c $(CFLAGS) rwalk.c -o rwalk.o

clean:
	rm -rf librwalk.so rwalk.o __pycache__
Following this method, we use the code below in rwalk/__init__.py to interface with the librwalk.so library.
import numpy as np
import numpy.ctypeslib as npct
from ctypes import c_int
from os.path import dirname

# 1-D contiguous arrays of int32, as expected by the C code
array_1d_int = npct.ndpointer(dtype=np.int32, ndim=1, flags='CONTIGUOUS')

librwalk = npct.load_library("librwalk", dirname(__file__))
print("rwalk: Loading library from: {}".format(dirname(__file__)))

librwalk.random_walk.restype = None
librwalk.random_walk.argtypes = [array_1d_int, array_1d_int, c_int, c_int,
                                 c_int, c_int, c_int, array_1d_int]

def random_walk(ptr, neighs, num_walks=10, num_steps=3, nthread=-1, seed=111413):
    assert(ptr.flags['C_CONTIGUOUS'])
    assert(neighs.flags['C_CONTIGUOUS'])
    assert(ptr.dtype == np.int32)
    assert(neighs.dtype == np.int32)

    n = ptr.size - 1
    walks = -np.ones((n * num_walks, num_steps + 1), dtype=np.int32, order='C')
    assert(walks.flags['C_CONTIGUOUS'])

    librwalk.random_walk(
        ptr,
        neighs,
        n,
        num_walks,
        num_steps,
        seed,
        nthread,
        np.reshape(walks, (walks.size,), order='C'))

    return walks
It is important to assert the datatype and layout, as the C library only works in this specific case. We can then call the function like walks = rwalk.random_walk(ptr, neighs), where ptr and neighs are e.g. the outputs of rwalk.read_edgelist("test.edgelist"). A complete end-to-end sketch follows.
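Putting the pieces together, a minimal end-to-end sketch (assuming test.edgelist was written as shown earlier):

import rwalk

ptr, neighs = rwalk.read_edgelist("test.edgelist")
walks = rwalk.random_walk(ptr, neighs, num_walks=10, num_steps=3)
# One row per walk: n * num_walks rows and num_steps + 1 columns
print(walks.shape)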
The speedup over the NetworkX implementation is about 50x for the C implementation using one thread and about 300x using six threads (the machine had six cores). These numbers are machine dependent and we encourage the reader to run the post.ipynb notebook for themselves and check the numbers!
We used Amdahl's law to obtain a least squares estimate for p, the fraction of the program that can be parallelized. For the specific example presented, p is consistently larger than 95%.
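For reference, Amdahl's law predicts a runtime on N threads of t(N) = t(1) * ((1 - p) + p / N). A minimal sketch of such a least squares estimate, using SciPy's curve_fit (the runtimes below are placeholders, not the actual measurements):

import numpy as np
from scipy.optimize import curve_fit

def amdahl_runtime(nthreads, p, t1):
    # Runtime on nthreads threads, given serial runtime t1 and parallel fraction p
    return t1 * ((1 - p) + p / nthreads)

nthreads = np.array([1, 2, 3, 4, 5, 6])
runtimes = np.array([1.00, 0.53, 0.37, 0.29, 0.24, 0.21])  # placeholder timings

(p, t1), _ = curve_fit(amdahl_runtime, nthreads, runtimes, p0=(0.9, 1.0))
print("Estimated parallel fraction p = {:.3f}".format(p))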
To further investigate the scaling of the parallel implementation, we plotted the runtime for one to six threads and compared it to the runtime predicted by Amdahl's law using the p estimated before. We observe good agreement and conclude that the implementation provided is well parallelized and would scale well to an even higher number of cores.
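Continuing the sketch above, such a comparison plot can be produced with a few lines of matplotlib, reusing the fitted p and t1:

import matplotlib.pyplot as plt

plt.plot(nthreads, runtimes, 'o', label='measured')
plt.plot(nthreads, amdahl_runtime(nthreads, p, t1), '-', label="Amdahl's law")
plt.xlabel('number of threads')
plt.ylabel('runtime [s]')
plt.legend()
plt.show()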
The provided C implementation greatly outperforms the Python implementation. Interfacing C using NumPy ctypes works very well but is not suited for cross-platform deployment. If one is comfortable writing C code, it might thus be a viable option to implement core parts of the algorithm in C. Additionally, this offers a simple way to circumvent the Global Interpreter Lock.
I hope you find this post interesting and I am happy to answer any questions or inquiries via email.