Accelerating your C++ on GPU with SYCL
Leveraging the power of graphics cards for compute applications is all the rage right now in fields such as machine learning, computer vision and high-performance computing. Technologies like OpenCL expose this power through a hardware-independent programming model, allowing you to write code which abstracts over different architecture capabilities. The dream is “write once, run anywhere”, be it an Intel CPU, AMD discrete GPU, DSP, etc. Unfortunately, for everyday programmers, OpenCL has something of a steep learning curve; a simple Hello World program can be a hundred or so lines of pretty ugly-looking code. To ease this pain, the Khronos Group have developed a new standard called SYCL, which is a C++ abstraction layer on top of OpenCL. Using SYCL, you can develop these general-purpose GPU (GPGPU) applications in clean, modern C++ without most of the faff associated with OpenCL. Here’s a simple vector multiplication example written in SYCL using the parallel STL implementation:
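Something along these lines works (this is a sketch rather than canonical code; the exact headers and buffer-iterator helpers depend on the version of the Parallel STL implementation you’re using):

```cpp
// A sketch against the SYCL 1.2.1-era API and the SYCL Parallel STL helpers
#include <CL/sycl.hpp>

#include <array>
#include <numeric>

#include <sycl/execution_policy>
#include <experimental/algorithm>
#include <sycl/helpers/sycl_buffers.hpp>

using namespace std::experimental::parallel;
using namespace sycl::helpers;

int main() {
  constexpr size_t array_size = 1024 * 512;
  std::array<cl::sycl::cl_int, array_size> in, out;
  std::iota(begin(in), end(in), 0);

  {
    cl::sycl::buffer<cl::sycl::cl_int> in_buffer(in.data(), cl::sycl::range<1>(array_size));
    cl::sycl::buffer<cl::sycl::cl_int> out_buffer(out.data(), cl::sycl::range<1>(array_size));
    cl::sycl::queue q;

    // Name the kernel, wrap the queue in an execution policy, and dispatch
    sycl::sycl_execution_policy<class Mul> sycl_policy(q);
    transform(sycl_policy, begin(in_buffer), end(in_buffer), begin(out_buffer),
              [](cl::sycl::cl_int x) { return x * 2; });
  }
}
```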
For comparison, here’s a mostly equivalent version written in OpenCL using the C++ API (don’t spend much time reading this; just note that it looks ugly and is really long):
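A sketch using the classic OpenCL C++ bindings follows; note that most error handling is omitted here, and checking every return code properly makes real versions longer still:

```cpp
// A sketch using the OpenCL C++ bindings (cl.hpp); error handling abbreviated
#include <CL/cl.hpp>

#include <array>
#include <numeric>
#include <string>
#include <vector>

int main() {
  constexpr size_t array_size = 1024 * 512;
  std::array<cl_int, array_size> in, out;
  std::iota(begin(in), end(in), 0);

  // Find an OpenCL platform and a device to run on
  std::vector<cl::Platform> platforms;
  cl::Platform::get(&platforms);
  if (platforms.empty()) return -1;

  std::vector<cl::Device> devices;
  platforms[0].getDevices(CL_DEVICE_TYPE_ALL, &devices);
  if (devices.empty()) return -1;

  cl::Context context(devices);
  cl::CommandQueue queue(context, devices[0]);

  // The kernel is supplied as a string and compiled at runtime
  std::string source =
      "__kernel void vec_scal_mul(__global const int* in,\n"
      "                           __global int* out) {\n"
      "  size_t i = get_global_id(0);\n"
      "  out[i] = in[i] * 2;\n"
      "}\n";
  cl::Program program(context, source);
  if (program.build(devices) != CL_SUCCESS) return -1;

  // Allocate device buffers, copying the input data over
  cl::Buffer in_buffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                       sizeof(cl_int) * array_size, in.data());
  cl::Buffer out_buffer(context, CL_MEM_WRITE_ONLY,
                        sizeof(cl_int) * array_size);

  // Bind the arguments and dispatch the kernel over the whole array
  cl::Kernel kernel(program, "vec_scal_mul");
  kernel.setArg(0, in_buffer);
  kernel.setArg(1, out_buffer);
  queue.enqueueNDRangeKernel(kernel, cl::NullRange,
                             cl::NDRange(array_size), cl::NullRange);

  // Copy the results back to the host (blocking read)
  queue.enqueueReadBuffer(out_buffer, CL_TRUE, 0,
                          sizeof(cl_int) * array_size, out.data());
}
```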
In this post I’ll give an introduction to using SYCL to accelerate your C++ code on the GPU.
Lightning intro to GPGPU
Before I get started on how to use SYCL, I’ll give a brief outline of why you might want to run compute jobs on the GPU for those who are unfamiliar. If you’ve already used OpenCL, CUDA or similar, feel free to skip ahead.
The key difference between a GPU and a CPU is that, rather than having a small number of complex, powerful cores (1-8 for common consumer desktop hardware), a GPU has a huge number of small, simple processing elements.
Picture a comically simplified CPU with four cores. Each core has a set of registers and is attached to various levels of cache (some might be shared, some not), and then to main memory.
In a GPU, by contrast, tiny processing elements are grouped into execution units. Each processing element has a bit of memory attached to it, and each execution unit has some memory shared between its processing elements. After that, there’s some GPU-wide memory, then the same main memory which the CPU uses. The elements within an execution unit execute in lockstep, where each element executes the same instruction on a different piece of data.
The benefit which GPUs bring is the amount of data which can be processed at the same time. If you’re on a CPU, maybe you can process a large handful of pixels at a given time if you use multithreading and vector instructions, but GPUs can process orders of magnitude more than this. The sheer amount of data which can be processed at once makes GPUs very well-suited to applications like graphics (duh), mathematical processing, neural networks, etc.
There are many aspects of GPGPU programming which make it an entirely different beast to everyday CPU programming. For example, transferring data from main memory to the GPU is slow. Really slow. Like, kill all your performance and get you fired slow. Therefore, the tradeoff with GPU programming is to use as much of the accelerator’s ridiculously high throughput as you can to hide the latency of shipping the data to and from it.
There are other issues which might not be immediately apparent, like the cost of branching. Since the processing elements in an execution unit work in lockstep, nested branches which cause them to take different paths (divergent control flow) are a real problem. This is often solved by executing all branches for all elements and masking out the unneeded results. That’s an exponential explosion in work based on the depth of nesting, which is A Bad Thing ™. Of course, there are optimizations which can mitigate this, but the idea stands: simple assumptions and knowledge you bring from the CPU world might cause you big problems in the GPU world.
Before we get back to SYCL, some short pieces of terminology. The host is the main CPU running on your machine which executes your normal C or C++ code, and the device is what will be running your OpenCL code. A device could be the same as the host, or it could be some accelerator sitting in your machine, a simulator, whatever. A kernel is a special function which is the entry point to the code which will run on your device. It will often be supplied with buffers for input and output data which have been set up by the host.
Back to SYCL
There are currently two implementations of SYCL available: triSYCL, an experimental open source version by Xilinx (mostly used as a testbed for the standard), and ComputeCpp, an industry-strength implementation by Codeplay (currently in open beta). Only ComputeCpp supports execution of kernels on the GPU, so we’ll be using that in this post.
Step 1 is to get ComputeCpp up and running on your machine. The main components are a runtime library which implements the SYCL API, and a Clang-based compiler which compiles both your host code and your device code. At the time of writing, Intel CPUs and some AMD GPUs are officially supported on Ubuntu and CentOS. It should be pretty easy to get it working on other Linux distributions (I got it running on my Arch system, for instance). Support for more hardware and operating systems is being worked on, so check the supported platforms document for an up-to-date list. The dependencies and components are listed here. You might also want to download the SDK, which contains samples, documentation, build system integration files, and more. I’ll be using the SYCL Parallel STL in this post, so get that if you want to play along at home.
Once you’re all set up, we can get GPGPUing! As noted in the introduction, my first sample used the SYCL parallel STL implementation. We’ll now take a look at how to write that code with bare SYCL.
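Here’s the full example up front (again a sketch, written against the SYCL 1.2.1-style API which ComputeCpp implements; the identifiers match the walkthrough below):

```cpp
#include <CL/sycl.hpp>

#include <array>
#include <numeric>

int main() {
  constexpr size_t array_size = 1024 * 512;
  std::array<cl::sycl::cl_int, array_size> in, out;
  std::iota(begin(in), end(in), 0);

  {
    cl::sycl::queue device_queue;
    cl::sycl::range<1> n_items{array_size};
    cl::sycl::buffer<cl::sycl::cl_int, 1> in_buffer(in.data(), n_items);
    cl::sycl::buffer<cl::sycl::cl_int, 1> out_buffer(out.data(), n_items);

    device_queue.submit([&](cl::sycl::handler& cgh) {
      constexpr auto sycl_read = cl::sycl::access::mode::read;
      constexpr auto sycl_write = cl::sycl::access::mode::write;

      auto in_accessor = in_buffer.get_access<sycl_read>(cgh);
      auto out_accessor = out_buffer.get_access<sycl_write>(cgh);

      cgh.parallel_for<class VecScalMul>(n_items, [=](cl::sycl::id<1> wiID) {
        out_accessor[wiID] = in_accessor[wiID] * 2;
      });
    });
  } // <- queue and buffers are destructed here; more on this below
}
```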
I’ll break this down piece-by-piece.
The first thing we do is include the SYCL header file, which will put the SYCL runtime library at our command:
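```cpp
#include <CL/sycl.hpp>
```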
Here we construct a large array of integers and initialize it with the numbers from `0` to `array_size-1` (this is what `std::iota` does). Note that we use `cl::sycl::cl_int` to ensure compatibility:
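```cpp
constexpr size_t array_size = 1024 * 512;  // any size will do
std::array<cl::sycl::cl_int, array_size> in, out;
std::iota(begin(in), end(in), 0);
```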
Next we open up a new scope. This achieves two things:

1. `device_queue` will be destructed at the end of the scope, which will block until the kernel has completed.
2. `in_buffer` and `out_buffer` will also be destructed, which will force data transfer back to the host and allow us to access the data from `in` and `out`.
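In the code it’s nothing more than a pair of braces:

```cpp
{
  // the queue, buffers, and kernel dispatch all live in here
} // <- destructors run here: the kernel completes and the data is copied back
```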
Now we create our command queue. The command queue is where all work (kernels) will be enqueued before being dispatched to the device. There are many ways to customise the queue, such as providing a device to enqueue on or setting up asynchronous error handlers, but the default constructor will do for this example; it looks for a compatible GPU and falls back on the host CPU if it fails:
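```cpp
cl::sycl::queue device_queue;
```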
Next we create a range, which describes the shape of the data which the kernel will be executing on. In our simple example, it’s a one-dimensional array, so we use `cl::sycl::range<1>`. If the data were two-dimensional we would use `cl::sycl::range<2>` and so on. Alongside `cl::sycl::range`, there is `cl::sycl::nd_range`, which allows you to specify work group sizes as well as an overall range, but we don’t need that for our example:
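```cpp
cl::sycl::range<1> n_items{array_size};
```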
In order to control data sharing and transfer between the host and devices, SYCL provides a `buffer` class. We create two SYCL buffers to manage our input and output arrays:
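```cpp
cl::sycl::buffer<cl::sycl::cl_int, 1> in_buffer(in.data(), n_items);
cl::sycl::buffer<cl::sycl::cl_int, 1> out_buffer(out.data(), n_items);
```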
After setting up all of our data, we can enqueue our actual work. There are a few ways to do this, but a simple method for setting up a parallel execution is to call the `.submit` function on our queue. To this function we pass a command group functor which will be executed when the runtime schedules that task. A command group handler sets up any last resources needed by the kernel and dispatches it:
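```cpp
device_queue.submit([&](cl::sycl::handler& cgh) {
  // accessor creation and the kernel dispatch go in here
});
```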
In order to control access to our buffers and to tell the runtime how we will be using the data, we need to create accessors. It should be clear that we are creating one accessor for reading from `in_buffer`, and one accessor for writing to `out_buffer`:
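```cpp
constexpr auto sycl_read = cl::sycl::access::mode::read;
constexpr auto sycl_write = cl::sycl::access::mode::write;

auto in_accessor = in_buffer.get_access<sycl_read>(cgh);
auto out_accessor = out_buffer.get_access<sycl_write>(cgh);
```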
Now that we’ve done all the setup, we can actually do some computation on our device. Here we dispatch a kernel on the command group handler `cgh` over our range `n_items`. The actual kernel itself is a lambda which takes a work-item identifier and carries out our computation. In this case, we are reading from `in_accessor` at the index of our work-item identifier, multiplying it by `2`, then storing the result in the relevant place in `out_accessor`. That `<class VecScalMul>` is an unfortunate byproduct of how SYCL needs to work within the confines of standard C++, so we need to give a unique class name to the kernel for the compiler to be able to do its job:
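```cpp
cgh.parallel_for<class VecScalMul>(n_items, [=](cl::sycl::id<1> wiID) {
  out_accessor[wiID] = in_accessor[wiID] * 2;
});
```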
After this point, our kernel will have completed and we could access `out` and expect to see the correct results.
There are quite a few new concepts at play here, but hopefully you can see the power and expressiveness we get using these techniques. However, if you just want to toss some code at your GPU and not worry about the customisation, then you can use the SYCL Parallel STL implementation.
SYCL Parallel STL
The SYCL Parallel STL is an implementation of the Parallelism TS which dispatches your algorithm function objects as SYCL kernels. We already saw an example of this at the top of the page, so let’s run through it quickly.
So far, so similar. Again we’re creating a couple of arrays to hold our input and output data:
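```cpp
constexpr size_t array_size = 1024 * 512;
std::array<cl::sycl::cl_int, array_size> in, out;
std::iota(begin(in), end(in), 0);
```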
Here we are creating our buffers and our queue like in the last example:
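```cpp
cl::sycl::buffer<cl::sycl::cl_int> in_buffer(in.data(), cl::sycl::range<1>(array_size));
cl::sycl::buffer<cl::sycl::cl_int> out_buffer(out.data(), cl::sycl::range<1>(array_size));
cl::sycl::queue q;
```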
Here’s where things get interesting. We create a `sycl_execution_policy` from our queue and give it a name to use for the kernel. This execution policy can then be used like `std::execution::par` or `std::execution::seq`:
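```cpp
sycl::sycl_execution_policy<class Mul> sycl_policy(q);
```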
Now our kernel dispatch looks like a call to `std::transform` with an execution policy provided. That closure we pass in will be compiled for and executed on the device without us having to do any more complex setup:
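```cpp
transform(sycl_policy, begin(in_buffer), end(in_buffer), begin(out_buffer),
          [](cl::sycl::cl_int x) { return x * 2; });
```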
Of course, you can do more than just `transform`. At the time of writing, the SYCL Parallel STL supports these algorithms:
- `sort`
- `transform`
- `for_each`
- `for_each_n`
- `count_if`
- `reduce`
- `inner_product`
- `transform_reduce`
That covers things for this short introduction. If you want to keep up to date with developments in SYCL, be sure to check out sycl.tech. Notable recent developments include ports of Eigen and TensorFlow to SYCL, bringing expressive artificial intelligence programming to OpenCL devices. Personally, I’m excited to see how the high-level programming models can be exploited for automatic optimization of heterogeneous programs, and how they can support even higher-level technologies like HPX or SkelCL.
Let me know what you think of this article on Twitter @TartanLlama or leave a comment below!