Thursday, May 19, 2016

mixbench on an AMD Fiji GPU

Recently, I had the pleasure of being granted a Radeon R9 Nano GPU card. This card features the Fiji GPU, which makes it a compute beast: 4096 shader units and HBM memory with bandwidth reaching 512GB/sec. Considering the card's remarkably small size and low power consumption, it proves to be a great and efficient device for handling parallel compute tasks via OpenCL (or HIP, but more on this in a later post).

AMD R9 Nano GPU card

One of the first experiments I tried on it was the mixbench microbenchmark tool, of course. The execution results, plotted with gnuplot on the memory bandwidth/compute throughput plane, are depicted here:

mixbench-ocl-ro as executed on the R9 Nano
GPU performance effectively approaches 8 TeraFlops of single precision compute on heavily compute intensive kernels, whereas it exceeds 450GB/sec of memory bandwidth on memory oriented kernels.

For anyone interested in trying mixbench on their CUDA/OpenCL/HIP GPU, please follow the link to GitHub:
https://github.com/ekondis/mixbench

Here is an example of execution on Ubuntu Linux:
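A typical build-and-run session looks roughly like the following (a sketch only; I'm assuming the build procedure described in the repository's README, and package names and build steps may differ between mixbench versions and Ubuntu releases):

```shell
# Install build prerequisites (Ubuntu package names may vary by release)
sudo apt-get install build-essential cmake ocl-icd-opencl-dev

# Fetch and build mixbench
git clone https://github.com/ekondis/mixbench.git
cd mixbench
make

# Run the OpenCL read-only variant (the one plotted above)
./mixbench-ocl-ro
```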



Acknowledgement: I would like to warmly thank the Radeon Open Compute department of AMD for kindly supplying the Radeon R9 Nano GPU card in support of our research.

Saturday, March 19, 2016

Raspberry PI 3 is here!

Some days ago the Raspberry PI 3 arrived home, as I had ordered one as soon as I heard of its launch. It's certainly a faster PI than the PI 2, thanks to its ARM Cortex-A53 cores. The claimed +50% performance ratio holds more or less true, depending on the application of course. There are some other additions as well, like WiFi and Bluetooth.

The Raspberry PI 3

A closer look at the PI 3

As usual, I am providing some nbench execution results. These are consistent with the +50% performance claim. For those interested, I published nbench results for the PI 2 in the past.

BYTEmark* Native Mode Benchmark ver. 2 (10/95)
Index-split by Andrew D. Balsa (11/97)
Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec.  : Old Index   : New Index
                    :                  : Pentium 90* : AMD K6/233*
--------------------:------------------:-------------:------------
NUMERIC SORT        :          654.04  :      16.77  :       5.51
STRING SORT         :          72.459  :      32.38  :       5.01
BITFIELD            :      1.9972e+08  :      34.26  :       7.16
FP EMULATION        :          134.28  :      64.44  :      14.87
FOURIER             :          6677.3  :       7.59  :       4.27
ASSIGNMENT          :          10.381  :      39.50  :      10.25
IDEA                :          2740.7  :      41.92  :      12.45
HUFFMAN             :          1008.9  :      27.98  :       8.93
NEURAL NET          :          9.8057  :      15.75  :       6.63
LU DECOMPOSITION    :          365.38  :      18.93  :      13.67
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX       : 34.272
FLOATING-POINT INDEX: 13.131
Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================
CPU                 : 4 CPU ARMv7 Processor rev 4 (v7l)
L2 Cache            :
OS                  : Linux 4.1.18-v7+
C compiler          : gcc-4.9
libc                : libc-2.19.so
MEMORY INDEX        : 7.162
INTEGER INDEX       : 9.769
FLOATING-POINT INDEX: 7.283
Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38
* Trademarks are property of their respective holder.

As I came across some reports on temperature issues with the PI 3, I wanted to run some experiments on its power consumption. I used a power meter into which I plugged the power supply unit feeding the PI. I ran a few experiments and got the following power consumption readings:


PI running state           : Power consumption
Idle                       : 1.4W
Single threaded benchmark  : 2.2W
Multithreaded benchmark    : 4.0W
After running "poweroff"   : 0.5W

So, in my case it doesn't seem to consume too much power. However, a comparison with the PI 2 should be performed in order to get a better picture.

Sunday, November 22, 2015

mixbench benchmark OpenCL implementation

Four and a half months ago I posted an article about the mixbench benchmark. This benchmark assesses the performance of an artificial kernel with mixed compute and memory operations, corresponding to various operational intensities (Flops/byte ratios). The implementation was based on CUDA, and therefore only NVidia GPUs could be used.

Now I've ported the CUDA implementation to OpenCL, and here I provide some performance numbers on an AMD R7-260X. Here is the output when using a 128MB memory buffer:

mixbench-ocl (compute & memory balancing GPU microbenchmark)
Use "-h" argument to see available options
------------------------ Device specifications ------------------------
Device:              Bonaire
Driver version:      1800.11 (VM)
GPU clock rate:      1175 MHz
Total global mem:    1871 MB
Max allowed buffer:  1336 MB
OpenCL version:      OpenCL 2.0 AMD-APP (1800.11)
Total CUs:           14
-----------------------------------------------------------------------
Buffer size: 128MB
Workgroup size: 256
Workitem stride: NDRange
Loading kernel source file...
Precompilation of kernels... [>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>]
--------------------------------------------------- CSV data --------------------------------------------------
Single Precision ops,,,,              Double precision ops,,,,              Integer operations,,,
Flops/byte, ex.time,  GFLOPS, GB/sec, Flops/byte, ex.time,  GFLOPS, GB/sec, Iops/byte, ex.time,   GIOPS, GB/sec
     0.000,  273.95,    0.00,  62.71,      0.000,  519.39,    0.00,  66.15,     0.000,  258.30,    0.00,  66.51
     0.065,  252.12,    4.26,  66.01,      0.032,  506.86,    2.12,  65.67,     0.065,  252.08,    4.26,  66.02
     0.133,  241.49,    8.89,  66.69,      0.067,  487.11,    4.41,  66.13,     0.133,  241.59,    8.89,  66.67
     0.207,  235.72,   13.67,  66.05,      0.103,  474.25,    6.79,  65.66,     0.207,  236.35,   13.63,  65.87
     0.286,  225.46,   19.05,  66.67,      0.143,  453.92,    9.46,  66.23,     0.286,  225.05,   19.08,  66.80
     0.370,  219.59,   24.45,  66.01,      0.185,  442.80,   12.12,  65.47,     0.370,  220.15,   24.39,  65.84
     0.462,  209.03,   30.82,  66.78,      0.231,  421.14,   15.30,  66.29,     0.462,  209.10,   30.81,  66.76
     0.560,  203.60,   36.92,  65.92,      0.280,  409.07,   18.37,  65.62,     0.560,  203.99,   36.85,  65.80
     0.667,  192.80,   44.55,  66.83,      0.333,  388.95,   22.09,  66.26,     0.667,  193.27,   44.44,  66.67
     0.783,  187.81,   51.46,  65.75,      0.391,  378.34,   25.54,  65.27,     0.783,  187.86,   51.44,  65.73
     0.909,  177.09,   60.63,  66.70,      0.455,  357.29,   30.05,  66.12,     0.909,  177.18,   60.60,  66.66
     1.048,  171.62,   68.82,  65.69,      0.524,  345.04,   34.23,  65.35,     1.048,  171.59,   68.83,  65.70
     1.200,  160.76,   80.15,  66.79,      0.600,  325.75,   39.55,  65.92,     1.200,  160.57,   80.24,  66.87
     1.368,  155.33,   89.86,  65.67,      0.684,  313.23,   44.56,  65.13,     1.368,  155.30,   89.88,  65.68
     1.556,  144.48,  104.05,  66.89,      0.778,  293.56,   51.21,  65.84,     1.556,  144.62,  103.95,  66.82
     1.765,  139.33,  115.60,  65.51,      0.882,  281.60,   57.20,  64.82,     1.765,  139.33,  115.60,  65.50
     2.000,  128.79,  133.40,  66.70,      1.000,  261.47,   65.70,  65.70,     2.000,  128.86,  133.32,  66.66
     2.267,  117.57,  155.26,  68.50,      1.133,  235.53,   77.50,  68.38,     2.267,  117.49,  155.36,  68.54
     2.571,  112.96,  171.10,  66.54,      1.286,  246.34,   78.46,  61.02,     2.571,  112.65,  171.57,  66.72
     2.923,  101.62,  200.77,  68.68,      1.462,  257.16,   79.33,  54.28,     2.923,  101.13,  201.72,  69.01
     3.333,   96.64,  222.22,  66.67,      1.667,  268.00,   80.13,  48.08,     3.333,   95.65,  224.51,  67.35
     3.818,   83.93,  268.65,  70.36,      1.909,  278.84,   80.86,  42.36,     3.818,   72.92,  309.24,  80.99
     4.400,   80.58,  293.16,  66.63,      2.200,  289.68,   81.55,  37.07,     4.400,   73.59,  321.00,  72.95
     5.111,   67.67,  364.96,  71.41,      2.556,  300.58,   82.16,  32.15,     5.111,   74.28,  332.49,  65.05
     6.000,   64.45,  399.83,  66.64,      3.000,  311.43,   82.75,  27.58,     6.000,   75.29,  342.26,  57.04
     7.143,   50.01,  536.76,  75.15,      3.571,  322.26,   83.30,  23.32,     7.143,   76.25,  352.04,  49.29
     8.667,   48.34,  577.52,  66.64,      4.333,  333.09,   83.81,  19.34,     8.667,   77.26,  361.33,  41.69
    10.800,   33.47,  866.12,  80.20,      5.400,  343.93,   84.29,  15.61,    10.800,   78.25,  370.48,  34.30
    14.000,   32.22,  932.99,  66.64,      7.000,  354.77,   84.74,  12.11,    14.000,   79.26,  379.32,  27.09
    19.333,   20.68, 1505.69,  77.88,      9.667,  376.91,   82.62,   8.55,    19.333,   80.27,  387.93,  20.07
    30.000,   19.37, 1663.32,  55.44,     15.000,  378.17,   85.18,   5.68,    30.000,   81.26,  396.41,  13.21
    62.000,   18.46, 1802.66,  29.08,     31.000,  389.93,   85.36,   2.75,    62.000,   33.57,  991.64,  15.99
       inf,   16.68, 2059.77,   0.00,        inf,  397.94,   86.34,   0.00,       inf,   33.54, 1024.43,   0.00
---------------------------------------------------------------------------------------------------------------

And here is the "memory bandwidth" vs "compute throughput" plot for the single precision floating point experiment results:

The source code of mixbench is freely available, hosted in a GitHub repository at https://github.com/ekondis/mixbench. I would be happy to include results from other GPUs as well, so please try the tool and let me know about your results and thoughts.

Monday, November 16, 2015

OpenCL 2.1 and SPIR-V standards released!

I've just noticed that the OpenCL 2.1 and SPIR-V standards were released today!

I just hope that vendors will not take too long to introduce up-to-date SDKs and drivers.

OpenCL 2.1
SPIR-V

Wednesday, October 28, 2015

OpenCL on the Raspberry PI 2

OpenCL can be enabled on the Raspberry PI 2! However, you'll be disappointed to know that I'm referring to the utilization of its CPU, not GPU. Nevertheless, running OpenCL on the PI could be useful for development and experimentation on an embedded platform.

You'll need the POCL (Portable Computing Language) implementation, which relies on LLVM. I used the just-released v0.12 of POCL and the LLVM v3.5 supplied with Raspbian Jessie.

After compiling and installing POCL via the usual procedure (you might need to install some libraries from the Raspbian repositories, e.g. libhwloc-dev, libclang-dev or mesa-common-dev), you'll be able to compile OpenCL programs on the PI. I tested the clpeak benchmark program, but the compute results were rather poor:

Platform: Portable Computing Language
Device: pthread
Driver version : 0.12-pre (Linux ARM)
Compute units : 4
Clock frequency : 900 MHz

Global memory bandwidth (GBPS)
float : 0.85
float2 : 0.87
float4 : 0.76
float8 : 0.75
float16 : 0.81

Single-precision compute (GFLOPS)
float : 0.03
float2 : 0.03
float4 : 0.03
float8 : 0.03
float16 : 0.03

Transfer bandwidth (GBPS)
enqueueWriteBuffer : 0.79
enqueueReadBuffer : 0.69
enqueueMapBuffer(for read) : 12427.57
memcpy from mapped ptr : 0.69
enqueueUnmap(after write) : 18970.70
memcpy to mapped ptr : 0.70

Kernel launch latency : 190270.91 us

In addition, the integer benchmark could not be executed for some reason. However, the memory bandwidth result was decent, and using a personal benchmark tool I could measure more than 1.4GB/sec of memory bandwidth, which is really nice for a PI!


Saturday, July 4, 2015

mixbench: A GPU benchmark for mixed compute/transfer bound kernels

I have just released mixbench on GitHub. It is a benchmark tool which assesses performance bounds on GPUs (compute or memory bound) under mixed workloads. Unfortunately, it's currently implemented only in CUDA, so only NVidia GPUs can be used. The compute part can be SP Flops, DP Flops or Int ops, and the memory part is global memory traffic. Running multiple experiments over a wide range of operational intensity values allows examining GPU performance under different kernel characteristics.

Running the program under a GTX-480 gives the following output:

mixbench (compute & memory balancing GPU microbenchmark)
------------------------ Device specifications ------------------------
Device:              GeForce GTX 480
CUDA driver version: 5.50
GPU clock rate:      1401 MHz
Memory clock rate:   924 MHz
Memory bus width:    384 bits
WarpSize:            32
L2 cache size:       768 KB
Total global mem:    1535 MB
ECC enabled:         No
Compute Capability:  2.0
Total SPs:           480 (15 MPs x 32 SPs/MP)
Compute throughput:  1344.96 GFlops (theoretical single precision FMAs)
Memory bandwidth:    177.41 GB/sec
-----------------------------------------------------------------------
Total GPU memory 1610285056, free 1195106304
Buffer size: 256MB
Trade-off type:compute with global memory (block strided)
---- EXCEL data ----
Operations ratio ;  Single Precision ops ;;;  Double precision ops ;;;    Integer operations   
  compute/memory ;    Time;  GFLOPS; GB/sec;    Time;  GFLOPS; GB/sec;    Time;   GIOPS; GB/sec
       0/32      ; 240.531;    0.00; 142.85; 475.150;    0.00; 144.63; 240.205;    0.00; 143.04
       1/31      ; 233.548;    9.20; 142.52; 460.193;    4.67; 144.66; 233.484;    9.20; 142.56
       2/30      ; 225.249;   19.07; 143.01; 445.144;    9.65; 144.73; 225.235;   19.07; 143.02
       3/29      ; 218.552;   29.48; 142.48; 430.575;   14.96; 144.64; 218.745;   29.45; 142.35
       4/28      ; 210.345;   40.84; 142.93; 415.425;   20.68; 144.74; 210.091;   40.89; 143.10
       5/27      ; 203.132;   52.86; 142.72; 400.472;   26.81; 144.78; 203.275;   52.82; 142.62
       6/26      ; 194.468;   66.26; 143.56; 385.434;   33.43; 144.86; 194.314;   66.31; 143.67
       7/25      ; 187.470;   80.19; 143.19; 370.915;   40.53; 144.74; 187.475;   80.18; 143.18
       8/24      ; 175.115;   98.11; 147.16; 355.723;   48.30; 144.89; 175.132;   98.10; 147.14
       9/23      ; 171.760;  112.53; 143.78; 341.353;   56.62; 144.70; 171.920;  112.42; 143.65
      10/22      ; 163.397;  131.43; 144.57; 326.007;   65.87; 144.92; 163.252;  131.54; 144.70
      11/21      ; 155.797;  151.62; 144.73; 311.655;   75.80; 144.70; 155.814;  151.61; 144.71
      12/20      ; 146.573;  175.82; 146.51; 296.386;   86.95; 144.91; 146.662;  175.71; 146.42
      13/19      ; 138.853;  201.06; 146.93; 281.757;   99.08; 144.81; 138.941;  200.93; 146.83
      14/18      ; 129.727;  231.75; 148.98; 266.401;  112.86; 145.10; 129.744;  231.72; 148.97
      15/17      ; 121.228;  265.72; 150.57; 251.283;  128.19; 145.28; 121.339;  265.47; 150.43
      16/16      ; 120.065;  286.18; 143.09; 235.740;  145.75; 145.75; 120.122;  286.04; 143.02
      17/15      ; 111.357;  327.84; 144.64; 219.472;  166.34; 146.77; 111.528;  327.34; 144.41
      18/14      ; 106.430;  363.19; 141.24; 231.498;  166.98; 129.87; 106.541;  362.82; 141.10
      19/13      ;  96.118;  424.50; 145.22; 243.534;  167.54; 114.63;  96.494;  422.85; 144.66
      20/12      ;  89.602;  479.34; 143.80; 256.247;  167.61; 100.57;  89.642;  479.13; 143.74
      21/11      ;  81.976;  550.13; 144.08; 269.055;  167.61;  87.80;  83.091;  542.74; 142.15
      22/10      ;  76.066;  621.10; 141.16; 282.898;  167.00;  75.91;  76.068;  621.08; 141.15
      23/ 9      ;  65.631;  752.57; 147.24; 295.743;  167.01;  65.35;  76.895;  642.33; 125.67
      24/ 8      ;  60.809;  847.57; 141.26; 307.479;  167.62;  55.87;  80.099;  643.45; 107.24
      25/ 7      ;  52.032; 1031.82; 144.45; 321.449;  167.02;  46.76;  83.296;  644.53;  90.23
      26/ 6      ;  48.321; 1155.49; 133.33; 334.305;  167.02;  38.54;  86.519;  645.35;  74.46
      27/ 5      ;  49.519; 1170.90; 108.42; 347.157;  167.02;  30.93;  89.729;  646.19;  59.83
      28/ 4      ;  50.704; 1185.90;  84.71; 360.013;  167.02;  23.86;  92.891;  647.31;  46.24
      29/ 3      ;  52.024; 1197.09;  61.92; 372.867;  167.02;  17.28;  96.115;  647.94;  33.51
      30/ 2      ;  53.377; 1206.97;  40.23; 385.722;  167.02;  11.13;  99.328;  648.61;  21.62
      31/ 1      ;  53.437; 1245.80;  20.09; 397.203;  167.60;   5.41; 101.247;  657.52;  10.61
      32/ 0      ;  53.558; 1283.08;   0.00; 410.012;  167.60;   0.00; 102.494;  670.47;   0.00
--------------------

The results for single and double precision Flops are illustrated in the following charts:
% of peak SP Flops and memory bandwidth performance related with the operational intensity
% of peak DP Flops and memory bandwidth performance related with the operational intensity
Compute throughput (SP Flops) vs memory bandwidth

Compute throughput (DP Flops) vs memory bandwidth

Publication:

Since this work was initially part of published research please cite the following publication where applicable:

Konstantinidis, E. and Cotronis, Y., "A Practical Performance Model for Compute and Memory Bound GPU Kernels," 23rd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 651-658, 4-6 March 2015.
doi: 10.1109/PDP.2015.51
URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7092788&isnumber=7092002

Monday, June 15, 2015

IWOCL 2015 presentations available online


IWOCL 2015 (International Workshop on OpenCL) presentations are available online for free download. It's good that the organizers provided them not long after the conference took place.


For more info about IWOCL: Link