Sunday, February 25, 2018

clmempatterns: Benchmarking GPU memory access strides

Typically, streaming memory loads on GPUs are arranged by the programmer so that sequential threads access sequential addresses, in order to enforce coalescing. CPUs, on the other hand, tend to favor having each individual thread access sequential addresses, which increases spatial locality and makes better use of the cache. Are there any intermediate patterns between these two extreme cases? And how would a CPU or GPU device behave in such an intermediate situation?

This is where the clmempatterns benchmark tool comes into play. The benchmark leverages OpenCL to explore memory performance under different access stride selections. Accessing a memory space is benchmarked under every access stride that is an integer power of 2, from unit stride up to a stride equal to the total number of threads. For example, imagine a simplified scenario in which we have to access 16 elements using a total of 4 threads. For CPUs a good choice is typically unit-stride accesses by each thread, as shown in the figure below.

Accessing memory with unit strides
However, on GPUs a fairly good choice is typically a stride equal to the total number of threads. This would apply accesses as shown below:
Accessing memory with strides of 4, which equals the total number of threads
Between these two extremes there are many intermediate cases where various strides can be applied. In this simplified example we could apply strides of 2, as shown below:

Accessing memory with strides of 2
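
As a rough sketch of what the two extreme patterns above look like in code, the following OpenCL C kernels read a buffer the GPU-friendly way (stride equal to the total number of threads) and the CPU-friendly way (each thread walking its own contiguous block). These kernels are purely illustrative and are not part of the benchmark itself; the names are made up and they assume that the element count divides evenly by the thread count.

/* Illustrative only: stride equal to the total thread count, so that in every
 * iteration consecutive threads touch consecutive addresses (coalesced). */
__kernel void read_gpu_style(__global const int *in, __global int *out, const uint n) {
    const uint tid = get_global_id(0);
    const uint nthreads = get_global_size(0);
    int acc = 0;
    for (uint idx = tid; idx < n; idx += nthreads)
        acc += in[idx];
    out[tid] = acc;   /* write something back so the reads are not optimized away */
}

/* Illustrative only: unit stride per thread, i.e. each thread reads its own
 * contiguous chunk of n/nthreads elements (the CPU-friendly layout). */
__kernel void read_cpu_style(__global const int *in, __global int *out, const uint n) {
    const uint tid = get_global_id(0);
    const uint nthreads = get_global_size(0);
    const uint per_thread = n / nthreads;   /* assumes n is a multiple of nthreads */
    int acc = 0;
    for (uint i = 0; i < per_thread; i++)
        acc += in[tid * per_thread + i];
    out[tid] = acc;
}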
In a real execution there can be tens of different cases, as the memory space can be large, each one applying a different power-of-2 stride. If we assume a total of N elements, where N is an exact power of 2, then exactly log2(N) bits are required to address them. For instance, if 2^26 elements are accessed then 26 bits are required for indexing them. Using a smaller number of threads to process these elements, e.g. 2^20, yields a thread index space of 20 bits in total. Thus, using a stride equal to the total number of threads leads to the following representation of the whole element space:

[b25 b24 b23 b22 b21 b20] b19 b18 b17 b16 b15 b14 b13 b12 b11 b10 b09 b08 b07 b06 b05 b04 b03 b02 b01 b00

Each cell represents a bit, and these 26 bits constitute the whole element space. The red bits (marked here with brackets) represent the part of the address that is determined by the thread stride, and the green bits (the remaining ones) are designated by the thread index. This means that each thread uses its thread index to define the green part of the address and thereafter enumerates sequentially each possible value of the red part, applying a memory access for each resulting element address.

Of course, there are other intermediate cases, seen below, that are also tested by the benchmark:

b25 [b24 b23 b22 b21 b20 b19] b18 b17 b16 b15 b14 b13 b12 b11 b10 b09 b08 b07 b06 b05 b04 b03 b02 b01 b00

b25 b24 [b23 b22 b21 b20 b19 b18] b17 b16 b15 b14 b13 b12 b11 b10 b09 b08 b07 b06 b05 b04 b03 b02 b01 b00

b25 b24 b23 [b22 b21 b20 b19 b18 b17] b16 b15 b14 b13 b12 b11 b10 b09 b08 b07 b06 b05 b04 b03 b02 b01 b00

...

b25 b24 b23 b22 b21 b20 b19 b18 b17 b16 b15 b14 b13 b12 b11 b10 b09 b08 b07 [b06 b05 b04 b03 b02 b01] b00

b25 b24 b23 b22 b21 b20 b19 b18 b17 b16 b15 b14 b13 b12 b11 b10 b09 b08 b07 b06 [b05 b04 b03 b02 b01 b00]

Each one corresponds to a different shift of the red part within the whole address representation. The last one is the other extreme case, typically preferred on CPUs, where each thread accesses elements residing at sequential addresses.
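
To make the addressing concrete, here is a minimal sketch in OpenCL C of how a thread could compose its element addresses from the green (thread index) bits and the red (iteration) bits for an arbitrary power-of-2 stride. It is written purely for illustration under the bit layout described above; the kernel name and parameters are my own assumptions and it is not necessarily the kernel implemented in clmempatterns.

/* Hypothetical kernel, for illustration only. log2_stride selects the bit position
 * of the red block and log2_granularity its width (e.g. 6 -> 64 elements per thread). */
__kernel void strided_read(__global const int *in, __global int *out,
                           const uint log2_stride, const uint log2_granularity) {
    const uint tid  = get_global_id(0);
    /* green bits below the red block come straight from the low bits of the thread index */
    const uint low  = tid & ((1u << log2_stride) - 1u);
    /* the remaining green bits are placed above the red block */
    const uint high = (tid >> log2_stride) << (log2_stride + log2_granularity);
    int acc = 0;
    /* enumerating every value of the red block accesses memory with a stride of 2^log2_stride */
    for (uint i = 0; i < (1u << log2_granularity); i++)
        acc += in[high | (i << log2_stride) | low];
    out[tid] = acc;
}

Setting log2_stride to 20 reproduces the first layout above (stride equal to the total number of threads), while setting it to 0 reproduces the last one (each thread reading sequential addresses).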

So, what would the memory access bandwidth be in all these cases? Answering this is the purpose of the clmempatterns benchmark tool. In the figure below you can see memory bandwidth measurements obtained with the tool by accessing 64M int elements using 1M threads in total on a GTX-1060 GPU. As seen, any power-of-2 stride of 32 or greater leads to good memory bandwidth.
clmempatterns benchmark execution on GTX-1060 (64M int elements, 1M grid size, 256 workitems/workgroup, granularity: 64 elements/workitem)

The tool is open source and you may freely experiment with it. I would be glad if you let me know about any interesting results you get.

URL: https://github.com/ekondis/clmempatterns