Notes on GPU Programming with Julia

Eric Aubanel
January 2021

Introduction

Julia supports two different programming models for programming Nvidia GPUs: CUarrays, a high-level model for data parallel programming with arrays, and CUDAnative, a low-level model for writing CUDA kernels. Low-level programming of AMD GPUs is possible using AMDGPUnative, but will not be used here. See JuliaGPU for a complete list of Julia packages for GPU programming.

A simple dynamic programming problem will be used to as an example for GPU programming. The subset sum problem asks the question whether there is a subset of a set $s$ of positive integers that sums to a given sum $S$. For example, for the set $s={2,6,8,19}$ there is a subset that sums to 10 (${2,8}$), but not one that sums to 11. The answer in general can be computed using the recurrence relation:

\[ F[i,j] = \begin{cases} F[i,j-1] \lor F[i-s_j,j-1] & \text{if $j > 0$}\,;\\ 1 & \text{if $i = 0$}\,;\\ 0 & \text{otherwise.} \end{cases} \]

where $F[i,j]$ is 1 if the subset $s_j$ of the first $j$ integers sums to $i$. The algorithm fills table $F$ column by column from left to right. There’s one column for each integer in set $s$, and the length of each column is $S+1$. Note that I have reversed rows and columns of the usual definition of the recurrence relation to make the Julia implementation more efficient, as it stores two-dimensional arrays in column major order. Here is a simple implementation:

function subsetSum(s, S)
    n = length(s)
    F = zeros(Int8,S+1,n)
    for i in 1:n
        F[1,i] = 1
    end
    # subset of first integer summing to itself
    s[1] S && (F[s[1]+1,1] = 1)
    @inbounds for j in 2:n
        for i in 2:S+1
            F[i,j] = F[i,j-1]
            i > s[j] && (F[i,j] = F[i,j] | F[i-s[j],j-1])
        end
    end
    return Bool(F[S+1,n])
end
subsetSum (generic function with 1 method)

I've used Int8 here instead of Bool as the type for the table elements, trading off space for computational efficiency. It's possible to make this algorithm more efficient (for instance, stopping as soon as there is a 1 in the last row), but I'm keeping it simple to focus on parallelization.

Here's the script to benchmark a subset sum problem with a random array of 10 positive integers in the range 1 to 10^7 (with a fixed seed for reproducibility)

using Random, BenchmarkTools
Random.seed!(12)
s = rand(1:10000000,10)
S = (10000000*10)÷4
@btime subsetSum(s,S)
274.534 ms (2 allocations: 238.42 MiB)
false

The answer is false, since there isn't a subset of the generated array s that sums to S. This test was performed on a 2.40GHz Intel Xeon Platinum 8260M CPU. Not shown here is a verification that this program is actually correct! This was done by summing each member set of the powerset (set of all subsets, using Combinatorics.jl) of s, for a problem with a set of 10 elements in the range 1:100, and verifying that subsetSum() returned true for all of these sums, and false for all other possible values of S.

Parallel Subset Sum

The easiest way to parallelize the algorithm is to do a data decomposition of each column, since computation of each column depends on elements from the previous column and computation of elements of a column can be done independently. The computations in each column are data-parallel, in that the same operation is being performed on different data. A nice way to notate a data-parallel algorithm is to use the set notation I employ in my book, Elements of Parallel Computing [1], which comes from the article on parallel algorithms in the Algorithms and Theory of Computation Handbook [2]. The inner for loop can be replaced by:

\[ \{F[i,j] \leftarrow F[i,j-1] : i \in [2..S+1]\}\\ \{F[i,j] \leftarrow F[i,j] \lor F[i-s[j],j-1] : i \in [2..S+1] \mid i > s[j]\}\\ \]

GPU Programming with Julia

Installation of the CUDA.jl package needed is not covered here, as it is well described in Julia's CUDA documentation. This document doesn't assume any GPU programming experience, but effective GPU programming relies on understanding the important differences between GPUs and CPUs. In addition to my book [1], a good introduction is given in the article Understanding Throughput Oriented Architectures [3]. The programs given below were verified in two ways: 1) by checking that the resulting F matrices were identical to the matrix produced sequentially by subsetSum() for the large problem used for benchmarking and 2) by applying the same test described above using powersets.

CuArray Type

CuArray is array type for Nvidia GPUs, which facilitates allocating data on the GPU and copying data between the GPU and CPU. It also supports array operations on the GPU. Here's a parallel subset sum using CuArrays:

using CUDA

function subsetSumCuArrays(s, S)
    n = length(s)
    F_d = CUDA.zeros(Int8, S+1, n)
    s_d = CuArray{Int64,1}(s)
    F_d[1,:] .= 1
    s_d[1] S && (F_d[s_d[1]+1,1] = 1)
    @views @inbounds for j in 2:n
        F_d[2:S+1,j] .=  F_d[2:S+1,j-1]
        if(s_d[j] <= S)
            F_d[s_d[j]+1:S+1,j] .= F_d[s_d[j]+1:S+1,j] .| F_d[1:S+1-s_d[j],j-1]
        end
    end
    synchronize()
    return Bool(F_d[S+1,n])
end
subsetSumCuArrays (generic function with 1 method)

This program shows two ways of creating a CuArray: using the CUDA version of the zeros method and creating a CuArray from regular array s. In both cases the array is created on the GPU. A good convention is to append _d to the name of the array to document that it is a GPU array. In the second case the CuArray is created from regular array s, which involves copying array s to the GPU.

Once you have your CuArrays, you can perform operations on them just as you would with regular arrays, with one important difference. Scalar operations are extremely slow on a GPU, which derive their performance from high memory bandwidth and massive on-chip multithreaded parallelism. Therefore operations on CuArrays should be done using Julia's dot syntax, which performs operations on each element of the arrays in the expression. So, the set notated operation $\{F[i,j] \leftarrow F[i,j-1] : i \in [2..S+1]\}$ becomes in Julia:

F_d[2:S+1,j] .=  F_d[2:S+1,j-1]

The statement $\{F[i,j] \leftarrow F[i,j] \lor F[i-s[j],j-1] : i \in [2..S+1] \mid i > s[j]\}$ constrains the $\lor$ operation to take place on elements $s[j]+1$ to $S+1$, and so becomes in Julia:

F_d[s_d[j]+1:S+1,j] .=F_d[s_d[j]+1:S+1,j] .| F_d[1:S+1-s_d[j],j-1]

contained in an if statement since a set element that is larger than S can't be a member of the subset. Note the use of the @views macro, which is needed so that all enclosed uses of the sub-indexing notation (e.g. F_d[2:S+1,j]) don't allocate new arrays, but instead provide views of the parent array.

Another important difference between arrays and CuArrays is that operations on the latter are non-blocking. This can seem strange for CPU programmers, who expect that all the effects of a statement are complete once processing moves to the next statement. For operations on CuArrays, the statement may not be complete by the time the next CPU statement (i.e. not involving CuArrays) is executed. If a statement is running on the GPU, there's no problem having processing continue on the CPU. However, we can be sure that multiple statements that involve the GPU will be completed as expected.

One remaining problem is how do we know that the GPU has finished its work? If we wanted to retrieve the F_d CuArray from the GPU (using F = Array(F_d)), then we could be certain that F_d wouldn't be copied to the CPU until the previous statements had completed. In the program above we are just interested in one element of F_d, so we use the synchronize() function from CUDA.jl to make sure that all GPU processing is finished.

In summary, the CuArray type allows programmers to work with arrays on the GPU in the same way as they are used to with CPU arrays. However, to obtain good performance it's important to understand that CuArray require data parallel operations to obtain good performance, and that these operations take place on a separate device, with consequences for data movement and completion of statements.

Let's see how well the CuArray parallelization of subset sum performs, on an Nvidia Tesla V100-SXM2 GPU:

@btime subsetSumCuArrays(s,S)
4.413 ms (743 allocations: 21.34 KiB)
false

CUDAnative

We found good speedup with CuArrays, but let's see if we can do better by programming at a lower level. Thankfully we can stay with Julia, and have the choice to focus on which statements using CuArrays that we want to rewrite; here we'll focus on the statements in the for loop.

Native CUDA programming involves writing functions, called kernels, that are designed to run on a GPU. The programming model is SPMD (single program, multiple data), where each GPU thread executes the same kernel. What each thread does is determined by the programmer using the thread's unique id. Obtaining a thread's id is a bit more complicated than for CPU threads, since CUDA threads are grouped into blocks, and these blocks are in turn grouped into grids. These groups can be indexed in three dimensions, but we'll stay in the x dimension here. A thread's unique id can be computed as:

id = (blockIdx().x - 1) * blockDim().x + threadIdx().x

threadIdx().x and blockIdx().x return a thread's local id in its block and the id of its block, respectively; blockDim().x returns the number of threads in each block. This expression converts from a thread's local id to a unique global id. Note that 1-based indexing is used, as usual in Julia.

We can now write a kernel to perform the operations of the body of the for loop:

function subsetKernel(F, s, S, j)
    id = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    id += 1 # 2 ≤ id ≤ S+1
    if(id  S+1)
        F[id,j] = F[id,j-1]
        if(id > s[j])
            F[id,j] = F[id,j] | F[id - s[j], j-1]		
        end
    end
    return nothing
end
subsetKernel (generic function with 1 method)

Here one thread is assigned the computation of one element of column j of F, from indices 2 to S+1. The if(id ≤ S+1) statement is used because there may be more threads than needed. Before launching a kernel, the number of blocks and the size of each block need to be chosen. The latter is best chosen as a multiple of 32 (see CUDA documentation on warps), and there is often a recommended block size. This means that the number of threads desired may not be a multiple of the block size, so some threads in one block may be idle. Other than the computation of the id and the if statement the kernel looks just like the body of the sequential inner for loop. Each thread executes the same kernel, and S of them compute an element of column j of F; this is the SPMD programming model. Here is the rest of the program:

const MAX_THREADS_PER_BLOCK = CUDA.attribute(
   CUDA.CuDevice(0), CUDA.DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK,
)
 
function subsetSumCuNative(s, S)
    n = length(s)
    F_d = CUDA.zeros(Int8, S+1, n)
    s_d = CuArray{Int64,1}(s)
    F_d[1,:] .= 1
    if(s_d[1] S) 
        F_d[s_d[1]+1,1] = 1
    end
    blockSize = MAX_THREADS_PER_BLOCK
    nbl = cld(S, blockSize)
    for j in 2:n
        @cuda blocks=nbl threads=blockSize subsetKernel(F_d, s_d, S, j)
    end
    synchronize()
    return Bool(F_d[S+1,n])
end
subsetSumCuNative (generic function with 1 method)

The first 5 lines of subsetSumCuNative() are exactly the same as in subsetSumCuArrays. We haven't bothered to change the two statements initializing F_d into kernels because they don't have a signficant impact on performance. The size of each block is chosen to be the maximum allowable by the GPU, using the CUDA.attribute() function. The number of blocks is then computed as the ceiling of the division of S (the number of iterations to be assigned to threads) by the block size, using the cld() function. Launching a kernel involves using the @cuda macro and indicating the number of blocks and size of of each block, as well as the kernel and its arguments.

Note that the kernel operates on a single column, rather than iterating over all columns. Recall that the elements of one column depend on elements of the previous column, therefore we want synchronization to occur after each column in computed. The CUDA programming model discourages synchronization of all threads, and offers instead synchronization within a block (not discussed here). Global synchronization is achieved by calling the kernel for each iteration of the for j loop, as a kernel won't begin executing until the previous kernel call is finished.

Let's see how well this version with CUDAnative performs:

@btime subsetSumCuNative(s,S)
3.727 ms (140 allocations: 4.61 KiB)
false

This is a small improvement on the previous version with CuArrays only. For other programs CUDAnative may provide a larger performance boost. CuNative can also provide more functionality than the array operations of CuArrays.

It turns out that we can do even better. The approach used above is a good first step in parallel programming: maximize the parallelism. The next step is to see if agglomerating the work into larger chunks improves performance, to increase the amount of work each thread performs. In the next version of the kernel iterations are assigned to threads in round-robin fashion:

function subsetKernelAgglom(F, s, S, j)
    id = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    id += 1 # 2 ≤ id ≤ S+1
    stride = blockDim().x * gridDim().x
    for i in id:stride:S+1
        F[i,j] = F[i,j-1]
        if(i > s[j])
            F[i,j] = F[i,j] | F[i - s[j], j-1]
        end
    end
    return nothing
end
subsetKernelAgglom (generic function with 1 method)

Here stride is the total number of threads, which we calculate with the help of gridDim().x, the number of blocks. Note that we didn't give each thread a contiguous block of iterations to compute. The round-robin assignment means that reads by multiple threads can be coalesced into a single transaction, making use of the high memory bandwidth. Our subset sum function has another parameter to set the number of iterations that each thread will compute:

function subsetSumCuNativeChunk(s, S, chunkSize)
    n = length(s)
    F_d = CUDA.zeros(Int8, S+1, n)
    s_d = CuArray{Int64,1}(s)
    F_d[1,:] .= 1
    if(s_d[1] S) 
        F_d[s_d[1]+1,1] = 1
    end
    blockSize = MAX_THREADS_PER_BLOCK
    ntTot = S÷chunkSize
    nbl = cld(ntTot, blockSize)
    for j in 2:n
        @cuda blocks=nbl threads=blockSize subsetKernelAgglom(F_d, s_d, S, j)
    end
    synchronize()
    return Bool(F_d[S+1,n])
end
subsetSumCuNativeChunk (generic function with 1 method)

Whereas in the first kernel the total number of threads was equal to S, here it is S divided by the chunk size. Let's see how the performance of this kernel compares to the previous one:

@btime subsetSumCuNativeChunk(s,S,2)
3.698 ms (140 allocations: 4.61 KiB)
@btime subsetSumCuNativeChunk(s,S,4)
3.193 ms (140 allocations: 4.61 KiB)
@btime subsetSumCuNativeChunk(s,S,8)
2.933 ms (140 allocations: 4.61 KiB)
@btime subsetSumCuNativeChunk(s,S,16)
2.926 ms (140 allocations: 4.61 KiB)
false

This chunked version gives us a further modest improvement, for chunks larger than 2, but note the the diminishing returns in performance as the chunk size is increased.

Final Thoughts

Julia's CUDA support offers an easy way to get into GPU programming, and an incremental way to write kernels to improve performance. There's much more to GPU programming for high performance, however. This includes understanding the SIMD execution of each warp of threads, memory layout on the GPU including registers and local memory, choosing the right block size to maximize a quantity known as occupancy, and so on. Use of Nvidia's profiling tools can help in performance debugging. The thing is, unless you're writing a highly optimized library that will be used heavily, you may be able to get away with good-enough performance with minimal effort.

1

Elements of Parallel Computing (Chapman & Hall/CRC Press, 2016)

2

Algorithms and Theory of Computation Handbook (Chapman & Hall/CRC Press, 2010)

3

Understanding Throughput-oriented Architectures, Michael Garland and David B. Kirk, Commun. ACM, 53(11):58–66, November 2010