This post is about how to use petsc4py and PyCUDA together to write applications that use several GPUs in parallel.
PETSc supports CUDA GPUs via the CUSP C++ library. The PETSc-provided VECCUSP and AIJCUSP classes store vectors and matrices, respectively, on the GPU, and vector operations and matrix-vector products with these classes are performed on the GPU. All of the PETSc linear solvers (except BiCG) can thus run entirely on the GPU.
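To see what that buys you, here's a minimal sketch of a linear solve that runs on the GPU, assuming a PETSc build configured with CUSP support (the 'aijcusp' type name and the tridiagonal test matrix are just for illustration); apart from the two GPU type choices, it's an ordinary petsc4py program:

from petsc4py import PETSc

n = 16
A = PETSc.Mat().create()
A.setSizes([n, n])
A.setType('aijcusp')  # store the matrix on the GPU
A.setUp()
rstart, rend = A.getOwnershipRange()
for i in range(rstart, rend):
    # assemble a simple tridiagonal (1D Laplacian) matrix
    A[i, i] = 2.0
    if i > 0:
        A[i, i - 1] = -1.0
    if i < n - 1:
        A[i, i + 1] = -1.0
A.assemble()

x, b = A.createVecs()  # vectors compatible with A, stored on the GPU
b.set(1.0)

ksp = PETSc.KSP().create()
ksp.setOperators(A)
ksp.setType('cg')
ksp.getPC().setType('jacobi')
ksp.solve(b, x)  # the Krylov iterations run on the GPU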
Sometimes, you'll want to perform computations with vectors that can't be expressed using the available vector operations or matrix-vector products. In this case, you'll need access to the underlying GPU (device) buffer, which you can pass to your own CUDA kernel that performs the computation.
In C++, this is done using the PETSc functions VecCUSPGetArrayRead and VecCUSPGetArrayWrite, which expose the underlying CUSP vectors. From the CUSP vector, you can get a pointer to the raw device memory using the thrust::raw_pointer_cast function. This device pointer can be passed to your own custom kernels, to functions from other GPU libraries, wherever you need it.
Here's how to do it with petsc4py and PyCUDA. We'll multiply two vectors elementwise using a custom kernel. The following bits of code all go in a single Python script, gpumult.py. First, we'll create the input and output vectors:
from petsc4py import PETSc
from pycuda import autoinit
import pycuda.driver as cuda
import pycuda.compiler as compiler
import pycuda.gpuarray as gpuarray
import numpy as np
N = 8

# create input vectors (N is the global size;
# PETSc splits the entries across the processes)
a = PETSc.Vec().create()
a.setSizes(N)
a.setType('cusp')
b = PETSc.Vec().create()
b.setSizes(N)
b.setType('cusp')

# create the output vector
c = PETSc.Vec().create()
c.setSizes(N)
c.setType('cusp')

# set initial values
a.set(3.0)
b.set(2.0)
c.set(0.0)
Next, we'll use the getCUDAHandle method to get raw CUDA pointers to the underlying GPU buffers:
a_ptr = a.getCUDAHandle()
b_ptr = b.getCUDAHandle()
c_ptr = c.getCUDAHandle()
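Each handle points to that process's local part of its vector on the GPU, so we'll also grab the local length for use in the kernel launch:

n_local = c.getLocalSize()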
Next, we'll write a CUDA kernel implementing the elementwise product, and use PyCUDA to compile and interface with it:

kernel_text = '''
__global__ void gpuElemwiseProduct(double* a_d,
        double* b_d, double* c_d, int n) {
    int i = threadIdx.x + blockIdx.x*blockDim.x;
    if (i < n)  // guard against threads past the end of the arrays
        c_d[i] = a_d[i]*b_d[i];
}
'''

mult = compiler.SourceModule(kernel_text,
    options=['-O2']).get_function('gpuElemwiseProduct')
mult.prepare('PPPi')  # three pointers and an int
Now, we'll perform the multiplication. Note that each process launches the kernel only over its own local part of the vectors:

mult.prepared_call((1, 1, 1), (n_local, 1, 1),
    a_ptr, b_ptr, c_ptr, n_local)
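For local sizes larger than a single thread block, you'd pick a fixed block size and launch enough blocks to cover the local length; the bounds check in the kernel takes care of the leftover threads. A sketch, assuming a block size of 256:

block_size = 256
grid_size = (n_local + block_size - 1) // block_size  # round up
mult.prepared_call((grid_size, 1, 1), (block_size, 1, 1),
    a_ptr, b_ptr, c_ptr, n_local)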
The API requires us to "restore" the CUDA handles when we're done with them:

a.restoreCUDAHandle(a_ptr)
b.restoreCUDAHandle(b_ptr)
c.restoreCUDAHandle(c_ptr)

As it happens, restoreCUDAHandle doesn't care which handle you pass back in, but it's tidy to pass the one you got.
Now look at the output vector:

c.view()
Here’s a run of the above program on 2 processes:
$ mpiexec -n 2 python gpumult.py
Vec Object: 2 MPI processes
type: mpicusp
Process [0]
6.
6.
6.
6.
Process [1]
6.
6.
6.
6.
You can also use the raw pointer to construct a PyCUDA GPUArray, or pass it to a function from the scikits.cuda library, or wherever else you need a device pointer.
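For example, here's a rough sketch of wrapping the handle in a GPUArray, assuming PETSc was built with double-precision scalars. Note that the shape is the local size, PyCUDA doesn't own the memory (so the vector must outlive the array), and the handle still has to be restored:

a_ptr = a.getCUDAHandle()
a_gpu = gpuarray.GPUArray(shape=(a.getLocalSize(),),
    dtype=np.float64, gpudata=a_ptr)
print(a_gpu.get())  # copy the local values back to the host
a.restoreCUDAHandle(a_ptr)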