Sunday, February 25, 2018

clmempatterns: Benchmarking GPU memory access strides

On GPUs, streaming memory loads are typically arranged by the programmer so that sequential threads access sequential addresses, in order to enforce coalescing. CPUs, on the other hand, tend to favor having each individual thread access sequential addresses, which increases spatial locality and makes better use of the cache. Are there any intermediate patterns between these two extreme cases? How would a CPU or GPU device behave in such an intermediate situation?

This is where the clmempatterns benchmark tool comes into play. This benchmark leverages OpenCL to explore memory performance under different access stride selections. A memory space is traversed using every access stride that is an integer power of 2, from 1 up to the total number of threads. For example, imagine a simplified scenario in which we have to access 16 elements using a total of 4 threads. For CPUs, a good choice is typically unit-stride accesses, as shown in the figure below.

Accessing memory with unit strides
On GPUs, however, a fairly good choice is typically to use a stride equal to the total number of threads. This applies accesses as shown below:
Accessing memory with a stride of 4, which equals the total number of threads
There are also intermediate cases where other strides can be applied. In this simplified example we could use a stride of 2, as shown below:

Accessing memory with a stride of 2
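
For reference, with 16 elements and 4 threads the three patterns above correspond to the following per-thread element indices (the stride-2 assignment shown here is one natural choice, consistent with the bit-splitting scheme described further below):

unit stride (CPU-like):  thread 0 -> 0,1,2,3     thread 1 -> 4,5,6,7     thread 2 -> 8,9,10,11    thread 3 -> 12,13,14,15
stride 4 (GPU-like):     thread 0 -> 0,4,8,12    thread 1 -> 1,5,9,13    thread 2 -> 2,6,10,14    thread 3 -> 3,7,11,15
stride 2:                thread 0 -> 0,2,4,6     thread 1 -> 1,3,5,7     thread 2 -> 8,10,12,14   thread 3 -> 9,11,13,15
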
In real scenarios there can be tens of different cases, as the memory space can be large, each applying a different power-of-two stride. If we assume a total of N elements, where N is an exact power of 2, then exactly log2(N) bits are required to address them. For instance, if 2^26 elements are accessed, then 26 bits are required for indexing these elements. Using a smaller number of threads to process these elements, e.g. 2^20, yields a thread index space of 20 bits in total. Thus, using a stride equal to the total number of threads leads to the following representation of the element address space:

[ b25 b24 b23 b22 b21 b20 ] b19 b18 b17 b16 b15 b14 b13 b12 b11 b10 b09 b08 b07 b06 b05 b04 b03 b02 b01 b00   (stride 2^20; the bracketed bits form the stride part, the remaining bits the thread index part)

Each cell represents a bit; together these 26 bits form the whole element address space. The stride part of the address is the portion enumerated by each thread, whereas the thread index part is determined by the thread's index. That is, each thread uses its index to fix the thread-index bits and then sequentially enumerates every possible value of the stride bits, issuing a memory access for each resulting element address.

Of course, there are other intermediate cases, also tested by the benchmark, in which the stride part is placed at different bit positions:

b25 [ b24 b23 b22 b21 b20 b19 ] b18 b17 b16 b15 b14 b13 b12 b11 b10 b09 b08 b07 b06 b05 b04 b03 b02 b01 b00   (stride 2^19)
b25 b24 [ b23 b22 b21 b20 b19 b18 ] b17 b16 b15 b14 b13 b12 b11 b10 b09 b08 b07 b06 b05 b04 b03 b02 b01 b00   (stride 2^18)
...
b25 b24 b23 b22 b21 b20 b19 b18 b17 b16 b15 b14 b13 b12 b11 b10 b09 b08 b07 [ b06 b05 b04 b03 b02 b01 ] b00   (stride 2^1)
b25 b24 b23 b22 b21 b20 b19 b18 b17 b16 b15 b14 b13 b12 b11 b10 b09 b08 b07 b06 [ b05 b04 b03 b02 b01 b00 ]   (stride 2^0 = 1)

Each case corresponds to a different shift of the stride part within the address representation. The last one is the other extreme case, typical of CPUs, where each thread accesses elements residing at sequential addresses.
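
As a concrete illustration, the following is a minimal OpenCL C kernel sketch of such a strided access pattern. Note that this is an illustrative sketch only, not the actual clmempatterns kernel; the kernel name and its arguments are hypothetical.

__kernel void strided_read(__global const int *in,
                           __global int *out,
                           const uint stride,            /* power-of-two stride */
                           const uint elems_per_thread)  /* e.g. 64 */
{
    const uint tid  = get_global_id(0);
    /* Split the thread index around the stride bit-field: its low bits stay
       below the stride part and its high bits are placed above it. */
    const uint low  = tid & (stride - 1u);
    const uint high = tid / stride;
    uint idx = high * stride * elems_per_thread + low;

    int sum = 0;
    for (uint i = 0; i < elems_per_thread; ++i) {
        sum += in[idx];   /* consecutive iterations are 'stride' elements apart */
        idx += stride;
    }
    out[tid] = sum;       /* keep a result so the loads are not optimized away */
}

Sweeping the stride argument over every power of two, from 1 up to the total number of threads, reproduces all of the bit layouts listed above.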

So, what would the memory access bandwidth be in all these cases? This is exactly what the clmempatterns benchmark tool measures. In the figure below you can see memory bandwidth measurements obtained with this tool while accessing 64M int elements using 1M total threads on a GTX-1060 GPU. As seen, any power-of-two stride of 32 or greater leads to good memory bandwidth.
clmempatterns benchmark execution on GTX-1060 (64M int elements, 1M grid size, 256 workitems/workgroup, granularity: 64 elements/workitem)

The tool is open source and you may freely experiment with it. I would be glad if you let me know about any interesting results you get.

URL: https://github.com/ekondis/clmempatterns

Friday, April 21, 2017

GDC17 AMD Ryzen CPU Optimization slides

Click here to see the slides (in PDF format) of the GDC17 presentation on optimizing for the AMD Ryzen CPU. In the slides, the yet-unreleased CodeXL v2.3 is shown. Enjoy!


Sunday, January 22, 2017

Fine grained memory management on AMD Vega GPUs


Recently, some details of the upcoming AMD Vega GPU architecture were disclosed. As you might have noticed, one of the most interesting features is the new memory architecture. The upcoming Vega GPUs reportedly support fine-grained memory management and employ fast HBM2 memory. In addition, the GPU employs a mechanism that AMD calls the high bandwidth cache controller, and the GPU memory is referred to as the high bandwidth cache. In this sense, AMD claims that the GPU memory is used as a cache, i.e. a secondary level of the hierarchy, with main system memory serving as the last level. This makes me think that the Vega architecture might support the same fine-grained shared virtual memory infrastructure that NVidia's Pascal already implements on its GP100 GPU, but that is not clear yet. No reference to shared virtual memory was made, so this is just personal speculation. The mere fact that the GPU is able to utilize system memory does not imply that it uses the same virtual addressing mechanism as the CPU. However, it is reasonable to expect this will be supported, since AMD wants to grab a significant portion of the HPC market through its ROCm platform.

If this is the case then it will prove to be an exciting feature with the benefits I had referred to in a previous post of mine.




Source: Anandtech

Monday, December 26, 2016

OpenCL/ROCm clinfo output on AMD Fiji

This month, with the release of AMD ROCm v1.4, we also got a taste of the preview version of the OpenCL runtime on ROCm. For anyone curious about it, here is the clinfo output on an AMD R9 Nano GPU (external URL on gist):

Number of platforms:                             1
  Platform Profile:                              FULL_PROFILE
  Platform Version:                              OpenCL 2.0 AMD-APP (2300.5)
  Platform Name:                                 AMD Accelerated Parallel Processing
  Platform Vendor:                               Advanced Micro Devices, Inc.
  Platform Extensions:                           cl_khr_icd cl_amd_event_callback cl_amd_offline_devices


  Platform Name:                                 AMD Accelerated Parallel Processing
Number of devices:                               1
  Device Type:                                   CL_DEVICE_TYPE_GPU
  Vendor ID:                                     1002h
  Board name:                                    Fiji [Radeon R9 FURY / NANO Series]
  Device Topology:                               PCI[ B#1, D#0, F#0 ]
  Max compute units:                             64
  Max work items dimensions:                     3
    Max work items[0]:                           1024
    Max work items[1]:                           1024
    Max work items[2]:                           1024
  Max work group size:                           256
  Preferred vector width char:                   4
  Preferred vector width short:                  2
  Preferred vector width int:                    1
  Preferred vector width long:                   1
  Preferred vector width float:                  1
  Preferred vector width double:                 1
  Native vector width char:                      4
  Native vector width short:                     2
  Native vector width int:                       1
  Native vector width long:                      1
  Native vector width float:                     1
  Native vector width double:                    1
  Max clock frequency:                           1000Mhz
  Address bits:                                  64
  Max memory allocation:                         3221225472
  Image support:                                 Yes
  Max number of images read arguments:           128
  Max number of images write arguments:          8
  Max image 2D width:                            16384
  Max image 2D height:                           16384
  Max image 3D width:                            2048
  Max image 3D height:                           2048
  Max image 3D depth:                            2048
  Max samplers within kernel:                    29440
  Max size of kernel argument:                   1024
  Alignment (bits) of base address:              1024
  Minimum alignment (bytes) for any datatype:    128
  Single precision floating point capability
    Denorms:                                     No
    Quiet NaNs:                                  Yes
    Round to nearest even:                       Yes
    Round to zero:                               Yes
    Round to +ve and infinity:                   Yes
    IEEE754-2008 fused multiply-add:             Yes
  Cache type:                                    Read/Write
  Cache line size:                               64
  Cache size:                                    16384
  Global memory size:                            4294967296
  Constant buffer size:                          3221225472
  Max number of constant args:                   8
  Local memory type:                             Scratchpad
  Local memory size:                             65536
  Max pipe arguments:                            0
  Max pipe active reservations:                  0
  Max pipe packet size:                          0
  Max global variable size:                      3221225472
  Max global variable preferred total size:      4294967296
  Max read/write image args:                     64
  Max on device events:                          0
  Queue on device max size:                      0
  Max on device queues:                          0
  Queue on device preferred size:                0
  SVM capabilities:
    Coarse grain buffer:                         Yes
    Fine grain buffer:                           Yes
    Fine grain system:                           No
    Atomics:                                     No
  Preferred platform atomic alignment:           0
  Preferred global atomic alignment:             0
  Preferred local atomic alignment:              0
  Kernel Preferred work group size multiple:     64
  Error correction support:                      0
  Unified memory for Host and Device:            0
  Profiling timer resolution:                    1
  Device endianess:                              Little
  Available:                                     Yes
  Compiler available:                            Yes
  Execution capabilities:
    Execute OpenCL kernels:                      Yes
    Execute native function:                     No
  Queue on Host properties:
    Out-of-Order:                                No
    Profiling :                                  Yes
  Queue on Device properties:
    Out-of-Order:                                No
    Profiling :                                  No
  Platform ID:                                   0x7f7273868198
  Name:                                          gfx803
  Vendor:                                        Advanced Micro Devices, Inc.
  Device OpenCL C version:                       OpenCL C 2.0
  Driver version:                                1.1 (HSA,LC)
  Profile:                                       FULL_PROFILE
  Version:                                       OpenCL 1.2
  Extensions:                                    cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_amd_media_ops cl_amd_media_ops2 cl_khr_subgroups cl_khr_depth_images


Sunday, September 18, 2016

NVidia Pascal GPU architecture's most exciting feature

A few months ago NVidia announced the Pascal GPU architecture and, more specifically, the GP100 GPU. This is a monstrous GPU with more than 15 billion transistors, built on a 16nm FinFET process. Though the quoted performance numbers are arguably impressive (10.6 TFlops SP, 5.3 TFlops DP), I personally think that raw throughput is not the most impressive feature of this GPU.

The most impressive feature, as advertised, is the unified memory support. The term was first introduced with CUDA 6 on CC 3.0 and CC 3.5 devices (Kepler architecture), but at the time it did not actually provide any real benefits other than programming laziness. In particular, the run-time took care of moving the whole data set to/from GPU memory whenever it was used on either the host or the GPU. The GP100 memory unification seems far more complete, as according to the specifications it takes memory unification to the next level: it supports data migration at the granularity of a memory page! This means that the programmer is able to "see" the whole system memory, and the run-time takes care of moving each memory page at the time it is actually needed. This is a great feature! It allows porting CPU programs to CUDA without caring about which data will actually be accessed.

For instance, imagine having a huge tree or graph structure and a GPU kernel that needs to access just a few of its nodes, without knowing which ones beforehand. Using the Kepler-era memory unification would require copying the whole structure from host to GPU memory, which could potentially cannibalize performance. The Pascal memory unification instead copies only the memory pages on which the accessed nodes reside. This relieves the programmer of a great pain, and that is why I consider it the most exciting feature.

I really hope this feature will eventually be supported on consumer GPU variants and will not remain just an HPC feature of the Tesla products. I also hope that AMD will support such a feature in its emerging ROCm platform.

Resources:

Thursday, May 19, 2016

mixbench on an AMD Fiji GPU

Recently, I had the quite pleasant opportunity to be granted a Radeon R9 Nano GPU card. This card features the Fiji GPU and as such it is a compute beast, featuring 4096 shader units and HBM memory with bandwidth reaching 512 GB/sec. If one considers the card's remarkably small size and low power consumption, it proves to be a great and efficient device for handling parallel compute tasks via OpenCL (or HIP, but more on that in a later post).

AMD R9 Nano GPU card

One of the first experiments I tried on it was, of course, the mixbench microbenchmark tool. The execution results, plotted with gnuplot on the memory bandwidth/compute throughput plane, are depicted here:

mixbench-ocl-ro as executed on the R9 Nano
The GPU effectively approaches 8 TFLOPS of single-precision compute performance on heavily compute-intensive kernels, whereas it exceeds 450 GB/sec of memory bandwidth on memory-oriented kernels.

For anyone interested in trying mixbench on their CUDA/OpenCL/HIP GPU please follow the link to github:
https://github.com/ekondis/mixbench

Here is an example of execution on Ubuntu Linux:



Acknowledgement: I would like to greatly thank the Radeon Open Compute department of AMD for kindly supplying the Radeon R9 Nano GPU card for the support of our research.

Saturday, March 19, 2016

Raspberry PI 3 is here!

A few days ago the Raspberry PI 3 arrived home, as I had ordered one when I heard of its launch. It is certainly a faster PI than the PI 2, thanks to the ARM Cortex-A53 cores. The roughly +50% performance ratio holds more or less true, depending on the application of course. There are some other additions as well, like WiFi and Bluetooth.

The Raspberry PI 3

A closer look of the PI 3

As usual, I am providing some nbench execution results. These are consistent with the +50% performance claim. For those interested I had published nbench results on the PI 2 in the past.

BYTEmark* Native Mode Benchmark ver. 2 (10/95)
Index-split by Andrew D. Balsa (11/97)
Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec.  : Old Index   : New Index
                    :                  : Pentium 90* : AMD K6/233*
--------------------:------------------:-------------:------------
NUMERIC SORT        :          654.04  :      16.77  :       5.51
STRING SORT         :          72.459  :      32.38  :       5.01
BITFIELD            :      1.9972e+08  :      34.26  :       7.16
FP EMULATION        :          134.28  :      64.44  :      14.87
FOURIER             :          6677.3  :       7.59  :       4.27
ASSIGNMENT          :          10.381  :      39.50  :      10.25
IDEA                :          2740.7  :      41.92  :      12.45
HUFFMAN             :          1008.9  :      27.98  :       8.93
NEURAL NET          :          9.8057  :      15.75  :       6.63
LU DECOMPOSITION    :          365.38  :      18.93  :      13.67
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX       : 34.272
FLOATING-POINT INDEX: 13.131
Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================
CPU                 : 4 CPU ARMv7 Processor rev 4 (v7l)
L2 Cache            :
OS                  : Linux 4.1.18-v7+
C compiler          : gcc-4.9
libc                : libc-2.19.so
MEMORY INDEX        : 7.162
INTEGER INDEX       : 9.769
FLOATING-POINT INDEX: 7.283
Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38
* Trademarks are property of their respective holder.

As I came across some reports on temperature issues of the PI 3, I wanted to run some experiments on its power consumption. I used a power meter into which I plugged the power supply unit feeding the PI. I ran a few experiments and got the following power consumption ratings:


PI running state               Power consumption
Idle                           1.4 W
Single-threaded benchmark      2.2 W
Multithreaded benchmark        4.0 W
After running "poweroff"       0.5 W

So, in my case it doesn't seem to consume too much power. However, a comparison with the PI 2 should be performed in order to get a better picture.