.. _software-struct:

Software Structure

As the previous section illustrated, the GPU is a SIMT (SIMD) machine for data-parallel applications. This section introduces how the GPU evolved from the graphics GPU to the general-purpose GPU (GPGPU), describes the software architecture of GPUs, and explores AI software frameworks designed for GPUs, NPUs, and CPUs.

Vector Processor

As described in the book Computer Architecture: A Quantitative Approach, the vector processor VMIPS introduces the Vector-Length Register (VLR) and Vector Mask (VM) to support SIMD execution. The Vector Mask functions like conditional instructions on CPUs: it provides a form of per-element conditional execution.

✅ Vector-Length Registers: Handling Loops Not Equal to 64

for (i=0; i <n; i=i+1)
  Y[i] = a * X[i] + Y[i];

In the code above, the value of n is not known at compile time.

Solution:

The compiler converts the loop into multiple iterations, each processing up to the maximum vector length (MVL), as shown below. For VMIPS, the MVL is 64.

low = 0;
VL = (n % MVL); /*find odd-size piece using modulo op % */
for (j = 0; j <= (n/MVL); j=j+1) { /*outer loop*/
  for (i = low; i < (low+VL); i=i+1) /*runs for length VL*/
    Y[i] = a * X[i] + Y[i] ; /*main operation*/
  low = low + VL; /*start of next vector*/
  VL = MVL; /*reset the length to maximum vector length*/
}

The inner loop of the preceding code is vectorizable with length VL, which equals either (n % MVL) or MVL. The VLR register must be set twice, once at each place where the variable VL is assigned.
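The strip-mined loop above can be checked with a small scalar simulation in plain C; the function name daxpy_strip_mined and the MVL constant below are illustrative, not part of VMIPS:

```c
#include <assert.h>

#define MVL 64  /* maximum vector length of VMIPS */

/* Walks the array in strips of at most MVL elements: the odd-size
 * (n % MVL) piece first, then full-length MVL strips. */
void daxpy_strip_mined(int n, double a, const double *x, double *y) {
    int low = 0;
    int vl = n % MVL;                     /* odd-size piece first */
    for (int j = 0; j <= n / MVL; j++) {  /* outer strip loop */
        for (int i = low; i < low + vl; i++)  /* one vector op of length vl */
            y[i] = a * x[i] + y[i];
        low += vl;                        /* start of next strip */
        vl = MVL;                         /* later strips use the full MVL */
    }
}
```

For n = 130, the strips have lengths 2, 64, and 64, covering all 130 elements exactly once.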

✅ Vector Mask Registers: Handling IF Statements in Vector Loops

for (i = 0; i < 64; i=i+1)
  if (X[i] != 0)
    X[i] = X[i] - Y[i];

For the VMIPS vector processor, the above code can be implemented using the Vector Mask (VM) register, as shown below.

Assembly code for the VMIPS vector processor (from page 276 of the Quantitative Approach book)

LV V1,Rx         ;load vector X into V1
LV V2,Ry         ;load vector Y
L.D F0,#0        ;load FP zero into F0
SNEVS.D V1,F0    ;sets VM(i) to 1 if V1(i)!=F0
SUBVV.D V1,V1,V2 ;subtract under vector mask
SV V1,Rx         ;store the result in X
  • Code reference here [3].
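The masked SNEVS.D/SUBVV.D sequence can be modeled in scalar C, with an array vm[] standing in for the Vector Mask register (the function and names below are illustrative):

```c
#include <assert.h>

#define VLEN 64  /* VMIPS vector length */

/* vm[i] plays the role of the Vector Mask bit for element i:
 * first set the mask (SNEVS.D), then subtract under the mask (SUBVV.D). */
void masked_sub(double *x, const double *y, int n) {
    int vm[VLEN];
    for (int i = 0; i < n; i++)   /* SNEVS.D: vm[i] = (x[i] != 0) */
        vm[i] = (x[i] != 0.0);
    for (int i = 0; i < n; i++)   /* SUBVV.D under vector mask */
        if (vm[i])
            x[i] = x[i] - y[i];
}
```

Elements whose mask bit is 0 are left untouched, exactly as in the IF statement of the original loop.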

General Purpose GPU

Since GLSL shaders allow general C-style code to be written in them, applying a software framework instead of the OpenGL API lets the system run data-parallel computation on the GPU for speedup, and even have the CPU and GPU executing simultaneously. Furthermore, any language that allows code running on the CPU to poll a GPU shader for return values can be used to create a GPGPU framework [1].

Mapping data in GPU

As described in the previous section on GPUs, consider the array calculation y[] = a * x[] + y[]. A subset of this computation is mapped to the GPU as follows.

// Invoke DAXPY with 256 threads per Thread Block
__host__
int nblocks = (n+255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
// DAXPY in CUDA
__device__
void daxpy(int n, double a, double *x, double *y) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}
  • name<<<dimGrid, dimBlock>>>(… parameter list …):

    • dimGrid: Number of Blocks in Grid

    • dimBlock: Number of Threads per Block (256 in this example)
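The index expression in the kernel can be sketched in C to show that consecutive (block, thread) pairs yield consecutive, non-overlapping element indices (the helper global_index below is illustrative):

```c
#include <assert.h>

/* Mirrors the CUDA expression
 * i = blockIdx.x * blockDim.x + threadIdx.x. */
int global_index(int block_idx, int block_dim, int thread_idx) {
    return block_idx * block_dim + thread_idx;
}
```

With dimBlock = 256, block 1 starts at element 256, exactly where block 0 ends, so every element gets exactly one thread.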

Assembly code of PTX (from page 300 of Quantitative book)

// The following sequence of PTX instructions is for one iteration of the
// DAXPY loop above.
shl.u32 R8, blockIdx, 9       ; Thread Block ID * Block size (512)
add.u32 R8, R8, threadIdx     ; R8 = i = my CUDA Thread ID
shl.u32 R8, R8, 3             ; byte offset
ld.global.f64 RD0, [X+R8]     ; RD0 = X[i]
ld.global.f64 RD2, [Y+R8]     ; RD2 = Y[i]
mul.f64 RD0, RD0, RD4         ; Product in RD0 = RD0 * RD4 (scalar a)
add.f64 RD0, RD0, RD2         ; Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64 [Y+R8], RD0     ; Y[i] = sum (X[i]*a + Y[i])
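The first three PTX instructions compute the element index and its byte offset; a C sketch (the helper ptx_byte_offset is illustrative) makes the arithmetic concrete:

```c
#include <assert.h>
#include <stdint.h>

/* R8 = blockIdx*512 + threadIdx (shl.u32 by 9, then add.u32),
 * then shifted left by 3 to form the byte offset of an 8-byte double. */
uint32_t ptx_byte_offset(uint32_t block_idx, uint32_t thread_idx) {
    uint32_t r8 = (block_idx << 9) + thread_idx;  /* i = Thread Block ID * 512 + thread ID */
    return r8 << 3;                               /* i * sizeof(double) */
}
```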

✅ Conditional Branching in GPUs [4]:

CUDA code for the example (referring to page 302 of the Quantitative Approach book)

__device__
void lane_mask_ex(double *X, double *Y, double *Z) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (X[i] != 0)
    X[i] = X[i] - Y[i];
  else X[i] = Z[i];
}
  • Code from here [5].

The following two instructions illustrate conditional (predicated) instruction execution on GPUs.

predicate = cond       // predicate is the mask register
@predicate instruction

This IF statement could compile to the following PTX instructions (assuming that R8 already has the scaled thread ID), with *Push, *Comp, *Pop indicating the branch synchronization markers inserted by the PTX assembler that push the old mask, complement the current mask, and pop to restore the old mask:

Assembly code of PTX (referring to page 302 of the Quantitative Approach book)

ld.global.f64 RD0, [X+R8]     ; RD0 = X[i]
setp.neq.s32 P1, RD0, #0      ; P1 is predicate register 1
@!P1, bra ELSE1, *Push        ; Push old mask, set new mask bits
                              ; if P1 false, go to ELSE1
ld.global.f64 RD2, [Y+R8]     ; RD2 = Y[i]
sub.f64 RD0, RD0, RD2         ; Difference in RD0
st.global.f64 [X+R8], RD0     ; X[i] = RD0
@P1, bra ENDIF1, *Comp        ; complement mask bits
                              ; if P1 true, go to ENDIF1
ELSE1:
ld.global.f64 RD0, [Z+R8]     ; RD0 = Z[i]
st.global.f64 [X+R8], RD0     ; X[i] = RD0
ENDIF1:
ret, *Pop                     ; pop to restore old mask

The PTX:

setp.neq.s32 P1, RD0, #0

On actual NVIDIA hardware (SASS), the instruction typically becomes:

ISETP.NE.AND P1, PT, R0, RZ, PT ; PT is always-true predicate

This instruction compares the 32-bit signed integer value in register RD0 with the constant 0. The result of the comparison is written to the predicate register P1.

Semantics:

P1 = (RD0 != 0)

Each thread (lane) in the warp evaluates this comparison independently.

Example (8-thread warp):

RD0 values:  [5, 3, 0, 7, 1, 0, 4, 2]
P1 result:   [1, 1, 0, 1, 1, 0, 1, 1]

This instruction does not modify the active thread mask. It only produces a predicate value that will be used by later predicated instructions or branches.
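The per-lane predicate evaluation above can be modeled in C, packing each lane's P1 bit into an integer mask (the helper warp_predicate_mask is illustrative):

```c
#include <assert.h>

/* Each lane evaluates P1 = (RD0 != 0) independently; bit i of the
 * returned mask holds lane i's predicate, matching the 8-thread
 * example above. */
unsigned warp_predicate_mask(const int *rd0, int lanes) {
    unsigned mask = 0;
    for (int lane = 0; lane < lanes; lane++)
        if (rd0[lane] != 0)
            mask |= 1u << lane;
    return mask;
}
```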

@!P1 bra ELSE1, *Push

This is a predicated branch instruction.

The prefix @!P1 means the instruction executes only for threads where the predicate P1 is false.

Semantics:

if (!P1)
    branch to ELSE1

If all threads agree on the predicate value, the warp simply branches or falls through. However, if some threads have P1 = 1 and others have P1 = 0, control flow divergence occurs.

Mask Stack Operation

The *Push modifier indicates that the hardware must update the SIMT control-flow stack.

When divergence occurs:

  1. The current active mask is pushed onto the stack.

  2. The warp execution mask is split into two masks:

    • mask_then = active_mask & P1

    • mask_else = active_mask & !P1

  3. Execution proceeds with one mask while the other path is saved for later execution.

Conceptual behavior:

push(active_mask)

mask_then = active_mask &  P1
mask_else = active_mask & !P1

execute THEN block using mask_then
later execute ELSE block using mask_else
Threads in the THEN mask execute the fall-through path, while threads in the ELSE mask branch to ELSE1.

Keep in mind, however, that the only choice for a SIMD Lane in a clock cycle is to perform the operation specified in the PTX instruction or be idle; two SIMD Lanes cannot simultaneously execute different instructions [4].
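The push/split behavior can be sketched in C for an assumed 8-lane warp, with bit i of each mask marking whether Lane i is active (the helper split_masks is illustrative):

```c
#include <assert.h>

/* On divergence the old active mask is saved (pushed), and the
 * THEN/ELSE masks partition it by the predicate:
 *   mask_then = active_mask &  P1
 *   mask_else = active_mask & !P1 */
void split_masks(unsigned active, unsigned p1,
                 unsigned *mask_then, unsigned *mask_else) {
    *mask_then = active & p1;   /* lanes taking the THEN path */
    *mask_else = active & ~p1;  /* lanes taking the ELSE path */
}
```

The two masks are disjoint and together cover every active lane, which is why the hardware can execute one path, then the other, and finally pop the old mask.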

The following table explains how the elements of saxpy() are mapped to the Lanes of a SIMD Thread (Warp), which belongs to a Thread Block (Core) within a Grid.

Table 13 Mapping saxpy code to Fig. 68.

saxpy()        Instance in Fig. 68             Description

blockDim.x     The size of a Thread Block      blockDim: in this example, configured as in
                                               Fig. 68, 16 (SIMD Threads) * 16 (Lanes)
                                               = 256 CUDA Threads per Thread Block

blockIdx.x     The index of a Thread Block     blockIdx: the index of the Thread Block
                                               within the Grid

threadIdx.x    The index of a CUDA Thread      threadIdx: the index of the CUDA Thread
               (an element)                    within its Thread Block

  • With Fermi, each 32-wide thread of SIMD instructions is mapped to 16 physical SIMD Lanes, so each SIMD instruction in a thread of SIMD instructions takes two clock cycles to complete.

  • You could say that it has 16 Lanes, the vector length would be 32, and the Chime is 2 clock cycles.

  • The mapping of y[0..31] = a * x[0..31] + y[0..31] to <Core, Warp, CUDA Thread> of the GPU is shown in the following table. x[0..31] maps to 32 CUDA Threads; two CUDA Threads map to one SIMD Lane as in Fig. 67.

Table 14 Map <Core, Warp> to saxpy

          Warp-0                        Warp-1                          ...  Warp-15

Core-0    y[0..31] =                    y[32..63] =                     ...  y[480..511] =
          a * x[0..31] + y[0..31]       a * x[32..63] + y[32..63]            a * x[480..511] + y[480..511]

...       ...                           ...                             ...  ...

Core-15   y[7680..7711] = a * …         ...                             ...  y[8160..8191] =
                                                                             a * x[8160..8191] + y[8160..8191]

  • Each CUDA Thread runs the GPU function code saxpy. Fermi has a register file of size 32768 x 32-bit. As shown in Fig. 59, the number of registers in a Thread Block is: 16 (SIMD Threads) * 32 (CUDA Threads per SIMD Thread) * 64 (TLR, Thread-Level Registers) = 32768 x 32-bit (the register file).

  • When mapping to fragments/pixels in a graphics GPU, x[0..15] corresponds to a two-dimensional tile of fragments/pixels at pixel[0..3][0..3], since images use tile-based grouping to cluster similar colors together.
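The linear-index-to-tile mapping can be sketched as follows; the helper tile_coord and the row-major layout are assumptions for illustration:

```c
#include <assert.h>

/* Maps a linear element index 0..15 onto a 4x4 fragment/pixel tile
 * pixel[row][col], as in pixel[0..3][0..3] above (row-major assumed). */
void tile_coord(int i, int *row, int *col) {
    *row = i / 4;  /* which row of the 4x4 tile */
    *col = i % 4;  /* which column within that row */
}
```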

Work between CPU and GPU in CUDA

The previous daxpy() GPU code did not include the host (CPU) side code that triggers the GPU function.

The following example shows the host (CPU) side of a CUDA program that calls saxpy on the GPU [2]:

#include <stdio.h>

__global__
void saxpy(int n, float a, float * x, float * y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

int main(void)
{
  ...
  cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);
  ...
  saxpy<<<(N+255)/256, 256>>>(N, 2.0, d_x, d_y);
  ...
  cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);
  ...
}

The main() function runs on the CPU, while saxpy() runs on the GPU. The CPU copies data from x and y to the corresponding device arrays d_x and d_y using cudaMemcpy.

The saxpy kernel is launched with the following statement:

saxpy<<<(N+255)/256, 256>>>(N, 2.0, d_x, d_y);

This launches the kernel with Thread Blocks containing 256 threads, and uses integer arithmetic to determine the number of Thread Blocks needed to process all N elements in the arrays. The expression (N+255)/256 ensures full coverage of the input data.
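The rounding-up launch arithmetic and the kernel's if (i < n) guard can be modeled together in C (count_active is an illustrative helper, not a CUDA API):

```c
#include <assert.h>

/* (n + block - 1) / block Thread Blocks give nblocks*block >= n
 * threads; the kernel's `if (i < n)` guard idles the excess ones.
 * Returns how many threads actually write a result. */
int count_active(int n, int threads_per_block) {
    int nblocks = (n + threads_per_block - 1) / threads_per_block;
    int active = 0;
    for (int i = 0; i < nblocks * threads_per_block; i++)
        if (i < n)  /* the kernel guard */
            active++;
    return active;
}
```

For N = 1000, (N+255)/256 = 4 blocks launch 1024 threads, and the guard leaves exactly 1000 of them active.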

Using cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost, the CPU can pass data in x and y to the GPU, and retrieve the results back to y.

Since both memory transfers are handled by DMA and do not require CPU involvement, performance can be improved by running the CPU and GPU independently, each accessing its own cache.

After the DMA copy from CPU memory to GPU memory, the GPU performs the entire loop for y[] = a * x[] + y[]; using a single Grid of threads.

The DMA memcpy moves the data from CPU memory into GPU memory, from which the L1 cache of each GPU core is filled.

Many GPUs support scatter and gather operations to access DRAM efficiently for stream processing tasks [6] [1] [7].

When the GPU function performs dense array computation, such as an MPEG4 encoder or deep-learning weight tuning, it may achieve a significant speedup [8]. However, when the GPU function is a simple matrix addition and the CPU idles while waiting for the GPU's result, the program may run slower than doing the matrix addition on the CPU alone. Arithmetic intensity is defined as the number of operations performed per word of memory transferred. It is important for GPGPU applications to have high arithmetic intensity; otherwise the memory access latency will limit the computational speedup [1].
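For DAXPY itself the arithmetic intensity is low: each element performs 2 floating-point operations (one multiply, one add) while moving 3 memory words (load x[i], load y[i], store y[i]). A small C sketch (daxpy_arithmetic_intensity is an illustrative name):

```c
#include <assert.h>
#include <math.h>

/* Arithmetic intensity of DAXPY: 2 flops per element against 3 memory
 * words per element, so the ratio is 2/3 regardless of n. */
double daxpy_arithmetic_intensity(long n) {
    double flops = 2.0 * (double)n;  /* one multiply + one add per element */
    double words = 3.0 * (double)n;  /* load x[i], load y[i], store y[i] */
    return flops / words;
}
```

An intensity of 2/3 flops per word means DAXPY is memory-bound, which is why a lone matrix/vector addition rarely benefits from GPU offload.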

The wiki [9] lists GPU-accelerated applications that gain speedup as follows:

General-Purpose Computing on GPUs has found its way into fields as diverse as machine learning, oil exploration, scientific image processing, linear algebra, statistics, 3D reconstruction, and even stock-options pricing. In addition, the section “GPU accelerated video decoding and encoding” [9] gives more applications for GPU acceleration.

Table 15 The differences for speedup in architecture of CPU and GPU

Item          CPU                                    GPU

Application   Non-data parallel                      Data parallel

Architecture  SISD, small vector (e.g. 4*32 bits)    Large SIMD (e.g. 16*32 bits)

Cache         Smaller and faster                     Larger and slower (see the Note below)

ILP           Pipeline; Superscalar, SMT;            Pipeline; SIMT
              Super-pipeline

Core          Fewer threads for SMT (2 or 4)         More threads (16 or 32)

Branch        Conditional instructions               Mask & conditional instructions

Note: GPU Cache

In theory, for data-parallel applications using the GPU's SIMT execution, the GPU schedules many threads and aims for throughput rather than the single-thread speedup pursued by SISD CPUs.

In practice, however, GPUs provide only a small L1 cache, similar to CPUs, and hide cache misses by scheduling another thread.

As a result, GPUs often have shallower cache hierarchies than CPUs, which commonly add L2 and L3 caches.

OpenCL, Vulkan and SPIR-V

digraph G {
  rankdir=LR;

  compound=true;
  node [shape=record];
  SW_LAYER [label="{ GLSL | OpenCL } | SPIR-V | GPU machine code"];
}

Fig. 86 OpenCL and GLSL(OpenGL)

Table 16 OpenCL and OpenGL SW system

Name of SW   GPU language   Level of GPU language

OpenCL       OpenCL         C99 dialect (with C pointers, …)

OpenGL       GLSL           C-like (no C pointers, …)

Vulkan       SPIR-V         IR

_images/opencl-to-spirv-offine-compilation.png

Fig. 87 Offline Compilation of OpenCL Kernels into SPIR-V Using Open Source Tooling [11]

  • clang: compiles OpenCL to SPIR-V for the runtime+driver, or compiles OpenCL to LLVM IR, which the “SPIR-V LLVM Translator” then translates to SPIR-V for the runtime+driver.

  • clspv: compiles OpenCL to SPIR-V directly.

digraph ShaderToLLVMIR {
    rankdir=LR;
    node [shape=record, style=filled, color=black];

    // Source Languages
    GLSL [label="GLSL ES (OpenGL)", fillcolor=white];
    OpenCL_C [label="OpenCL C", fillcolor=white];

    // Intermediate Representation
    SPIRV [label="SPIR-V", fillcolor=orange];

    GPU_ISA [label="GPU ISA in Assembly and Binary", fillcolor=grey];

    // LLVM IR
    LLVM_IR [label="LLVM IR", fillcolor=orange];

    // Tools with oval shapes
    node [shape=oval, style=filled, fillcolor=lightgreen];
    Glslang [label="glslangValidator", fillcolor=lightblue];
    CL_SPIRV [label="OpenCL-SPIRV Translator"];

    // Tools with oval shapes
    node [shape=oval, style=filled, fillcolor=yellow];
    SPIRV_LLVM [label="SPIRV-LLVM Translator"];
    LLVMCompiler [label="Backend Compiler"];

    // Edges
    GLSL -> Glslang -> SPIRV;
    OpenCL_C -> CL_SPIRV -> SPIRV ->  SPIRV_LLVM -> LLVM_IR -> LLVMCompiler -> GPU_ISA;
}

Fig. 88 GPU Compiler Components and Flow

The flow and relationships among GLSL, OpenCL, SPIR-V (Vulkan/OpenCL), LLVM IR, and the GPU compiler are shown in Fig. 86, Fig. 87 and Fig. 88. As shown in Fig. 88, OpenCL C to SPIR-V (OpenCL) can be compiled using the clang + llvm-spirv tools or a proprietary converter.

As shown in Fig. 88, both GLSL and OpenCL use frontend tools to generate SPIR-V. The driver can invoke either the GLSL or OpenCL compiler based on metadata fields in the SPIR-V, as illustrated in Fig. 89 and the following figures, which describe offline compilation from GLSL/OpenCL to SPIR-V and online execution using the generated SPIR-V files.

digraph SPIRV_Deployment {
    rankdir=LR;
    node [shape=box, style=filled, fillcolor=lightgray, fontname="Helvetica"];

    subgraph cluster_glsl {
        label = "From GLSL";
        glsl_src [label="GLSL Shader\n (.vert/.frag/.comp)", fillcolor=lightblue];
        glsl_compiler [label="glslangValidator\n(or similar compiler)", fillcolor=lightgreen];
        spirv_glsl [label="SPIR-V\n (from GLSL)", fillcolor=gold];
        glsl_src -> glsl_compiler -> spirv_glsl;
    }

    subgraph cluster_opencl {
        label = "From OpenCL C";
        opencl_src [label="OpenCL C (.cl)", fillcolor=lightblue];
        clang_spirv [label="Clang + SPIR-V Backend", fillcolor=lightgreen];
        spirv_opencl [label="SPIR-V\n (from OpenCL C)", fillcolor=gold];
        opencl_src -> clang_spirv -> spirv_opencl;
    }

    subgraph cluster_opencl_runtime {
        label = "OpenCL Runtime (Host)";
        spirv_loader [label="clCreateProgramWithIL()", fillcolor=orange];
        spirv_glsl -> spirv_loader;
        spirv_opencl -> spirv_loader;
        spirv_loader -> device_driver [label="Load\n SPIR-V\n into driver"];
        device_driver [label="OpenCL Driver\n(SPIR-V → Device IR → Machine Code)", fillcolor=plum];
        device_driver -> execution [label="Compiled &\n Run on device"];
        execution [label="Execute on\n OpenCL Device", fillcolor=lightyellow];
    }

    // Styling
    edge [fontname="Helvetica"];
}

Fig. 89 Compiling and Deploying GPU Code from GLSL, Vulkan, and OpenCL

Based on the flows above, the public standards OpenGL and OpenCL provide tools for converting between these data formats, as illustrated in Fig. 90. The corresponding LLVM IR and SPIR-V files are listed below.

digraph G {
  rankdir=LR;

  // Data Nodes
  node [shape=record, style=filled, fillcolor=white];
  glsl [label="glsl"];
  openclc [label="OpenCL C"];
  spirv [label="spirv"];
  llvm [label="llvm-ir"];

  // Tools Nodes
  node [shape=oval, style=filled, fillcolor=lightgreen];
  glslang [label="glslangValidator"];
  spirv_cross [label="spirv-cross"];
  clspv [label="clspv"];
  llvm_spirv [label="llvm-spirv"];

  glsl -> glslang -> spirv;
  glsl -> spirv_cross -> spirv [dir="back"];
  openclc -> clspv -> spirv;
  spirv -> llvm_spirv -> llvm;
  llvm -> llvm_spirv -> spirv;
}

Fig. 90 Conversion between GLSL, OpenCL C, SPIR-V and LLVM IR

References/add-matrix.ll

; ModuleID = 'add-matrix.ll'
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v24:32:32-v32:32:32-v48:64:64-v64:64:64-v96:128:128-v128:128:128-v192:256:256-v256:256:256-v512:512:512-v1024:1024:1024-G1"
target triple = "spir64-unknown-unknown"

; Function Attrs: nounwind
define spir_func <4 x i32> @add_mat(<4 x i32> %a, <4 x i32> %b) #0 {
entry:
  %sum = add <4 x i32> %a, %b
  ret <4 x i32> %sum
}

attributes #0 = { nounwind }

!spirv.MemoryModel = !{!0}
!opencl.enable.FP_CONTRACT = !{}
!spirv.Source = !{!1}
!opencl.spir.version = !{!2}
!opencl.used.extensions = !{!3}
!opencl.used.optional.core.features = !{!3}
!spirv.Generator = !{!4}

!0 = !{i32 2, i32 2}
!1 = !{i32 0, i32 0}
!2 = !{i32 1, i32 2}
!3 = !{}
!4 = !{i16 6, i16 14}

References/add-matrix.spvasm

; SPIR-V
; Version: 1.0
; Generator: Khronos LLVM/SPIR-V Translator; 14
; Bound: 10
; Schema: 0
               OpCapability Addresses
               OpCapability Linkage
               OpCapability Kernel
          %1 = OpExtInstImport "OpenCL.std"
               OpMemoryModel Physical64 OpenCL
               OpSource Unknown 0
               OpName %add_mat "add_mat"
               OpName %a "a"
               OpName %b "b"
               OpName %entry "entry"
               OpName %sum "sum"
               OpDecorate %add_mat LinkageAttributes "add_mat" Export
       %uint = OpTypeInt 32 0
     %v4uint = OpTypeVector %uint 4
          %4 = OpTypeFunction %v4uint %v4uint %v4uint
    %add_mat = OpFunction %v4uint None %4
          %a = OpFunctionParameter %v4uint
          %b = OpFunctionParameter %v4uint
      %entry = OpLabel
        %sum = OpIAdd %v4uint %a %b
               OpReturnValue %sum
               OpFunctionEnd

Convert between spirv and llvm-ir

% pwd
$HOME/git/lbd/References
% llvm-as -o add-matrix.bc add-matrix.ll
% llvm-spirv -o add-matrix.spv add-matrix.bc
% spirv-dis -o add-matrix.spvasm add-matrix.spv
// Convert the spirv back to llvm-ir and check that the converted
// llvm-ir matches the original.
% llvm-spirv -r add-matrix.spv -o add-matrix.spv.bc
% llvm-dis add-matrix.spv.bc -o add-matrix.spv.bc.ll
% diff add-matrix.ll add-matrix.spv.bc.ll
1c1
< ; ModuleID = 'add-matrix.ll'
---
> ; ModuleID = 'add-matrix.spv.bc'

Install llvm-spirv and llvm with brew install

% brew install spirv-llvm-translator
% brew install llvm

The following explains how the driver identifies whether the SPIR-V source is from GLSL or OpenCL.

SPIR-V binaries contain metadata that can help identify whether they were generated from OpenCL, GLSL, or another language.

  • Execution Model

    Defined by the OpEntryPoint instruction. It is a strong indicator of the source language.

    ExecutionModel   Typical Source   Notes

    Kernel           OpenCL           Used only by OpenCL C
    GLCompute        GLSL or HLSL     Used in compute shaders
    Fragment         GLSL or HLSL     For pixel shaders
    Vertex           GLSL or HLSL     For vertex shaders

  • Capabilities

    Declared using OpCapability. They provide clues about the SPIR-V’s execution model and source.

    Capability   Likely Source

    Kernel       OpenCL
    Addresses    OpenCL
    Linkage      OpenCL
    Shader       GLSL or HLSL

  • Extensions

    Declared using OpExtension. Some are tied to specific compilers or languages.

    Extension                            Likely Source

    SPV_KHR_no_integer_wrap_decoration   OpenCL
    SPV_INTEL_unified_shared_memory      OpenCL (Intel)
    SPV_AMD_shader_ballot                GLSL (graphics)

  • Memory Model

    Defined by OpMemoryModel.

    • OpenCL → OpenCL source

    • GLSL450 → GLSL or HLSL source

  • How to Inspect

    Use the spirv-dis tool to disassemble SPIR-V to human-readable form:

    spirv-dis kernel.spv -o kernel.spvasm
    

    Look for these at the top of the file:

    Example (GLSL):

    OpCapability Shader
    OpMemoryModel Logical GLSL450
    OpEntryPoint GLCompute %main "main"
    

    Example (OpenCL):

    OpCapability Kernel
    OpCapability Addresses
    OpMemoryModel Physical64 OpenCL
    OpEntryPoint Kernel %foo "foo"
    

Summary

Feature                  Indicates

OpEntryPoint Kernel      OpenCL
OpCapability Shader      GLSL or HLSL
OpMemoryModel OpenCL     OpenCL
OpMemoryModel GLSL450    GLSL or HLSL

  • Comparison of OpenCL and OpenGL's compute shader.

    • Same:

      Both target general-purpose computing on the GPU.

    • Difference:

      OpenCL covers GPUs as well as other accelerator devices/processors. OpenCL code is C on the device and C/C++ on the host, based on the OpenCL runtime. A compute shader is written in the GLSL shader language, runs in the OpenGL graphics environment, and can integrate with and access data from the OpenGL API easily [10].

  • OpenGL/GLSL vs Vulkan/SPIR-V.

    • High-level API and shader: OpenGL, GLSL.

    • Low-level API and shader: Vulkan, SPIR-V.

Although the OpenGL API sits at a higher level and offers the many advantages described in the sections above, it sometimes cannot compete in efficiency with Direct3D, whose lower-level APIs let user programs manage memory themselves [13]. The Vulkan API is a lower-level C/C++ API that fills this gap, allowing user programs to do these things in the OpenGL world and compete against Microsoft Direct3D; an example is here [14]. Meanwhile, GLSL is a C-like language. The Vulkan infrastructure provides a tool, glslangValidator [15], to compile GLSL into an Intermediate Representation (IR) called SPIR-V offline. As a result, it saves part of the online compilation time from GLSL to GPU instructions, since SPIR-V is an IR at a level close to LLVM IR [16]. In addition, the Vulkan API reduces the GPU driver's effort in optimization and code generation [13]. These standards give programmers the option of using Vulkan/SPIR-V instead of OpenGL/GLSL, pre-compiling GLSL into SPIR-V offline to save part of the online compilation time.

With the Vulkan and SPIR-V standards, the GPU can also be used through OpenCL for parallel programming of heterogeneous systems [17] [18]. Similar to CUDA, an OpenCL example for the fast Fourier transform (FFT) is here [19]. Once OpenCL grows into a popular standard, with more computer languages or frameworks supporting it, the GPU will take over more jobs from the CPU [20].

Most GPUs have 16 or 32 Lanes in a SIMD processor (Warp). Vulkan provides subgroup operations for data-parallel programming across the Lanes of a SIMD processor [21].

Subgroup operations provide a fast way to move data between Lanes within a Warp. Assume each Warp has four Lanes. The following table lists the results of the reduce, inclusive, and exclusive operations.

Table 17 Each Lane's value after the Reduce, Inclusive and Exclusive operations, respectively

Lane           0             1          2          3

Initial value  a             b          c          d

Reduce         OP(abcd)      OP(abcd)   OP(abcd)   OP(abcd)

Inclusive      OP(a)         OP(ab)     OP(abc)    OP(abcd)

Exclusive      not defined   OP(a)      OP(ab)     OP(abc)

  • Reduce: e.g. subgroupAdd. Inclusive: e.g. subgroupInclusiveAdd. Exclusive: e.g. subgroupExclusiveAdd.

  • For examples:

    • ADD operation: OP(abcd) = a+b+c+d.

    • MAX operation: OP(abc) = MAX(a,b,c).

  • When Lane i is inactive, its value is excluded.

    • For instance, if Lane 0 is inactive, then for the MUL operation: OP(abcd) = b*c*d.
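The three operations can be modeled in C for a 4-lane warp with OP = + (the helper subgroup_add_ops is illustrative; for the exclusive scan, lane 0 is undefined per Table 17, and this sketch assumes 0, the identity for +):

```c
#include <assert.h>

#define LANES 4

/* Models subgroupAdd / subgroupInclusiveAdd / subgroupExclusiveAdd
 * over a 4-lane warp with OP = +. */
void subgroup_add_ops(const int v[LANES], int reduce[LANES],
                      int incl[LANES], int excl[LANES]) {
    int total = 0;
    for (int i = 0; i < LANES; i++)
        total += v[i];
    int running = 0;
    for (int i = 0; i < LANES; i++) {
        reduce[i] = total;   /* every lane sees OP(abcd) */
        excl[i] = running;   /* OP of lanes 0..i-1 (0 assumed for lane 0) */
        running += v[i];
        incl[i] = running;   /* OP of lanes 0..i */
    }
}
```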

The following is a code example.

An example of subgroup operations in glsl for vulkan

vec4 sum = vec4(0, 0, 0, 0);
if (gl_SubgroupInvocationID < 16u) {
  sum += subgroupAdd(data[gl_SubgroupInvocationID]);
}
else {
  sum += subgroupInclusiveMul(data[gl_SubgroupInvocationID]);
}
subgroupMemoryBarrier();
  • NVIDIA's GPU provides __syncwarp() for subgroupMemoryBarrier(), or the compiler synchronizes the Lanes within the same Warp.

To let the Lanes in the same SIMD processor work efficiently, data-uniformity analysis provides many optimization opportunities in register allocation, transformation, and code generation [22].

The expansion of LLVM IR from CPUs to GPUs is becoming increasingly influential. In fact, LLVM IR has been expanding steadily since version 3.1, as I have observed.

Unified IR Conversion Flows

This section outlines the intermediate representation (IR) flows for graphics (Microsoft DirectX, OpenGL) and OpenCL compilation across major GPU vendors: NVIDIA, AMD, ARM, Imagination Technologies, and Apple.

Graphics Compilation Flow (Microsoft DirectX & OpenGL)

Graphics shaders are compiled from high-level languages (HLSL, GLSL) into vendor-specific GPU binaries via intermediate representations like DXIL and SPIR-V.

✅ Each node in the graph is color-coded to indicate its category or role within the structure.

digraph G {
    node [shape=box, style=filled];
    PUIR [label="Public standard of IRs", fillcolor=lightyellow];
    PRIR [label="Private IRs", fillcolor=lightgray];
    VD [label="Vendor Driver", shape=oval, fillcolor=lightgreen];
    GPU [label="GPU", fillcolor=lightblue];
}
  • Vendor Driver will call Vendor Compiler for on-line compilation.

digraph OpenCL_OpenGL_Compilation {
    rankdir=LR;
    node [shape=box];

    // Source Languages
    OpenCL_C [style=filled, fillcolor=lightyellow];
    GLSL [style=filled, fillcolor=lightyellow];

    // Shared IRs
    SPIR [style=filled, fillcolor=lightyellow];
    SPIRV [label="SPIR-V IR", style=filled, fillcolor=lightyellow];
    LLVM_IR [style=filled, fillcolor=lightyellow];

    // Vendor Drivers
    node [shape=oval];
    "NVIDIA Driver" [style=filled, fillcolor=lightgreen];
    "AMD Driver" [style=filled, fillcolor=lightgreen];
    "ARM Driver" [style=filled, fillcolor=lightgreen];
    "Imagination Driver" [style=filled, fillcolor=lightgreen];
    "Apple Driver" [style=filled, fillcolor=lightgreen];

    // Private IRs
    node [shape=box];
    "PTX IR" [style=filled, fillcolor=gray];
    "GCN IR" [style=filled, fillcolor=gray];
    "Burst IR" [style=filled, fillcolor=gray];
    "Metal IR" [style=filled, fillcolor=gray];

    // GPU Targets
    "NVIDIA GPU ISA" [style=filled, fillcolor=lightblue];
    "AMD GPU ISA" [style=filled, fillcolor=lightblue];
    "ARM Mali ISA" [style=filled, fillcolor=lightblue];
    "Imagination USC ISA" [style=filled, fillcolor=lightblue];
    "Apple GPU ISA" [style=filled, fillcolor=lightblue];

    // OpenCL Flow
    OpenCL_C -> SPIR -> LLVM_IR;
    OpenCL_C -> SPIRV;

    // OpenGL Flow
    GLSL -> SPIRV -> LLVM_IR;

    // LLVM based Compilation Flow
    LLVM_IR -> "PTX IR" -> "NVIDIA Driver" -> "NVIDIA GPU ISA";
    LLVM_IR -> "GCN IR" -> "AMD Driver" -> "AMD GPU ISA";
    LLVM_IR -> "ARM Driver" -> "ARM Mali ISA";
    LLVM_IR -> "Burst IR" -> "Imagination Driver" -> "Imagination USC ISA";
    LLVM_IR -> "Metal IR" -> "Apple Driver" -> "Apple GPU ISA";
}

Fig. 91 Graphics and OpenCL Compiler IR Conversion Flow

  • OpenCL C is the device side code in C language while Host side code is C/C++.

  • OpenCL C is compiled to SPIR-V in later versions of OpenCL, while earlier versions used SPIR. SPIR-V has now largely replaced SPIR as the standard intermediate representation.

Table 18 Comparison of PTX, GCN IR, Burst IR, and Metal IR

PTX (NVIDIA)

  Abstraction Level: Virtual ISA; portable across GPU generations; hides hardware scheduling.

  Register Model: Virtual registers (%r, %f); mapped to physical registers during SASS lowering.

GCN IR (AMD)

  Abstraction Level: Machine IR; tightly coupled to the GCN/RDNA architecture; exposes Wavefront semantics.

  Register Model: Explicit vector (vN) and scalar (sN) registers; register pressure affects occupancy. AMD's compiler backend can lower vector operations to scalar instructions on low-end GPUs, while preserving vector operations on high-end architectures.

Burst IR (Imagination)

  Abstraction Level: Power-aware IR; optimized for burst-core scheduling and latency hiding.

  Register Model: Operand staging model; abstracted register usage; mapped late to the USC ISA.

Metal IR (Apple)

  Abstraction Level: LLVM-inspired IR; abstracted from developers; tuned for tile shading and threadgroup fusion.

  Register Model: Region-based register allocation; dynamic renaming; not exposed as a physical register model.

✅ NVIDIA, AMD, ARM and Imagination all have exposed LLVM IR and convert SPIR-V IR to LLVM IR.

  • SPIR:

    • For OpenCL development, the IR started from SPIR (LLVM-based IR).

    • SPIR's limitation: tightly coupled to specific LLVM versions, making it brittle across LLVM releases.

  • SPIR-V:

    • A complete redesign: binary format, not tied to LLVM.

    • Designed for Vulkan, but also supports OpenCL and OpenGL.

    • Enables cross-vendor portability, shader reflection, and custom extensions.

    • Used in graphics and compute pipelines, including ML workloads via Vulkan compute.

    • A Vulkan shader written in GLSL is compiled to SPIR-V, then passed to the GPU driver.

    • An OpenCL kernel written in C can be compiled to SPIR-V, then lowered to LLVM IR internally by vendors like AMD or NVIDIA.

⚠️ Apple

  • Uses LLVM IR Partially. Apple supports SPIR-V in Metal and OpenCL, but LLVM IR is not always exposed.

  • Metal shaders are compiled via Metal IR, which is LLVM-inspired but not standard LLVM IR and is not exposed to developers.

  • Apple’s ML compiler stack may use LLVM IR internally, but it’s abstracted from developers.

  • Apple is not a vendor of GPU IP, so it does not expose LLVM IR in its ML or graphics APIs for the reasons below:

    • Security: opaque compilation prevents tampering

    • Performance tuning: Apple controls the entire stack for optimal hardware use

    • Developer simplicity: high-level APIs reduce friction

Notes:

  • HLSL → DXIL → DirectX is Microsoft’s graphics pipeline, used on Windows and Xbox.

  • GLSL → SPIR-V → OpenGL/Vulkan is cross-platform and supported by all vendors.

  • Final GPU ISA varies by vendor:

    • NVIDIA: PTX → SASS

    • AMD: LLVM IR → GCN/RDNA

    • ARM: Mali ISA

    • Imagination: USC ISA

    • Apple: Metal GPU ISA

Notes:

  • OpenCL C → SPIR → Vendor Driver → GPU ISA is the standard compilation path.

  • Some vendors (e.g., AMD, NVIDIA) may bypass SPIR and compile directly to LLVM IR or PTX.

  • Apple deprecated OpenCL in favor of Metal, but legacy support remains.

✅ References

ML and GPU Compilation

This section outlines the intermediate representation (IR) flows used by NVIDIA, AMD, and ARM in machine learning and GPU compilation pipelines. It includes both inference engines and compiler toolchains.

✅ Each node in the graph is color-coded to indicate its category or role within the structure. In AI, the term runtime is usually used instead of driver, the term used in graphics.

digraph G {
    node [shape=box, style=filled];
    PUIR [label="Public standard of IRs", fillcolor=lightyellow];
    PRIR [label="Private IRs", fillcolor=lightgray];
    VDLR [label="Vendor Driver,\nLibrary or Runtime", shape=oval, fillcolor=lightgreen];
    MA [label="Machine", fillcolor=lightblue];
}

NVIDIA IR Conversion Flow

NVIDIA supports both TensorRT-based inference and MLIR-based compilation targeting CUDA GPUs, as shown in Fig. 92.

digraph NVIDIA_IR_Flow {
        rankdir=LR;
  
        node [shape=box]; 
  
        ONNX [style=filled, fillcolor=lightyellow];
        "TensorRT Graph IR" [style=filled, fillcolor=lightgray];
        "CUDA Kernel IR" [style=filled, fillcolor=lightgray];
        PTX [style=filled, fillcolor=lightyellow];
        SASS [label="SASS (NVIDIA GPU ISA)", style=filled, fillcolor=lightblue];
        TensorRT [style=filled, shape=oval, fillcolor=lightgreen];

        TOSA [style=filled, fillcolor=lightyellow];
        "MLIR GPU Dialects" [style=filled, fillcolor=lightyellow];
        "LLVM IR" [style=filled, fillcolor=lightyellow];

        ONNX -> "TensorRT Graph IR" -> "CUDA Kernel IR" -> PTX;
        "TensorRT Graph IR" -> TensorRT;

        TOSA -> "MLIR GPU Dialects" -> "LLVM IR" -> PTX -> SASS;
    }

Fig. 92 NVIDIA IR Conversion Flow

  • SASS stands for Streaming ASSembler, and it represents the native instruction set architecture (ISA) for NVIDIA GPUs.

  • TensorRT is a C++ library and runtime developed by NVIDIA for deep learning inference—the phase where trained models make predictions.

    • It works with models trained in frameworks like TensorFlow, PyTorch, and ONNX, converting them into highly optimized engines for execution on NVIDIA hardware [23] [24].

  • CUDA Kernel IR is a bridge between LLVM IR and PTX/SASS, or a direct output from TensorRT.

  • LLVM IR is foundational in many flows, but TensorRT may skip it and directly emit CUDA kernels.

  • Although MLIR dialects may be publicly available, they are typically hardware-dependent and tailored to specific vendors’ GPU architectures. As a result, their applicability is limited to the corresponding hardware platforms.

  • The MLIR GPU dialect is public, but in this flow it targets NVIDIA’s GPUs.

AMD IR Conversion Flow

AMD uses ROCm and MIOpen for ML workloads, with LLVM-based compilation targeting GCN or RDNA architectures, as shown in Fig. 93.

digraph ROCm_Runtime_PyTorch_MIOpen_PreMLIR_Flow {
    rankdir=LR;
    node [shape=box];

    // Entry points
    PyTorch_Model [label="PyTorch Model", style=filled, fillcolor=lightyellow];
    ONNX_Model [label="ONNX Model", style=filled, fillcolor=lightyellow];
    MLIR_TOSA [label="MLIR (TOSA dialect)", style=filled, fillcolor=lightyellow];

    // Pre-MLIR optimization
    MIOpen_PreMLIR [label="MIOpen Graph IR (Pre-MLIR)", style=filled, fillcolor=gray];

    // Compilation layers
    ONNX_MLIR [label="ONNX-MLIR", style=filled, fillcolor=lightyellow];
    MLIR_GPU [label="MLIR GPU dialect", style=filled, fillcolor=lightyellow];
    LLVM_IR [style=filled, fillcolor=lightyellow];
    ROCm_BC [label="ROCm device libraries (.bc)", shape=oval, style=filled, fillcolor=lightgreen];
    GCN_IR [style=filled, fillcolor=gray];

    // Post-GCN optimization
    MIOpen_IR [label="MIOpen Graph IR (Post-GCN)", style=filled, fillcolor=gray];

    // Runtime layers
    ROCr_Runtime [label="ROCr Runtime", shape=oval, style=filled, fillcolor=lightgreen];

    // GPU Targets
    GPU [label="AMD GPU ISA", style=filled, fillcolor=lightblue];
 
   // Flow paths
    PyTorch_Model -> ONNX_Model;
    ONNX_Model -> MIOpen_PreMLIR -> ONNX_MLIR -> LLVM_IR;
    MLIR_TOSA -> MLIR_GPU -> LLVM_IR;

    LLVM_IR -> ROCm_BC;
    LLVM_IR -> GCN_IR;
    GCN_IR -> MIOpen_IR -> ROCr_Runtime -> GPU;

    // === Layering for better spacing ===
    { rank = same; MIOpen_PreMLIR; ONNX_MLIR }
    { rank = same; GCN_IR; MIOpen_IR }
}

Fig. 93 AMD IR Conversion Flow

  • ROCm is not just a compiler or driver — it includes a full runtime stack that enables AMD GPUs to execute compute workloads across HIP (Heterogeneous-compute Interface for Portability), OpenCL, and ML frameworks. It’s analogous to NVIDIA’s CUDA runtime but built around open standards like HSA (Heterogeneous System Architecture) [25] and LLVM.

  • MIOpen Graph IR takes different forms at different stages: the (Pre-MLIR) and (Post-GCN) representations differ in form and structure.

    • Developers interact with MIOpen via high-level APIs (e.g., miopenConvolutionForward) — not via direct IR manipulation.

    • While MIOpen itself is open source (GitHub repo), its graph IR format and transformation passes are internal.

ARM IR Conversion Flow

ARM supports both CPU/NPU deployment (e.g., Ethos-U/N) and GPU execution (e.g., Mali via Vulkan). The IR flow diverges depending on the target, as shown in Fig. 94.

digraph ARM_IR_Flow {
        rankdir=LR;

        node [shape=box];

        "ONNX / TFLite" [style=filled, fillcolor=lightyellow];
        TOSA [style=filled, fillcolor=lightyellow];
        "MLIR Dialects" [style=filled, fillcolor=lightgray];
        "LLVM IR" [style=filled, fillcolor=lightyellow];
        "ARM Codegen" [style=filled, shape=oval, fillcolor=white];
        "Ethos-N / Cortex-M" [style=filled, fillcolor=lightblue];
        "MLIR GPU Dialects" [style=filled, fillcolor=lightgray];
        SPIRV [style=filled, fillcolor=lightyellow];
        "Mali GPU (Vulkan)" [style=filled, fillcolor=lightblue];
        "Ethos-N Driver" [style=filled, shape=oval, fillcolor=lightgreen];
        Vulkan [style=filled, shape=oval, fillcolor=lightgreen];

        "ONNX / TFLite" -> TOSA;

        TOSA -> "MLIR Dialects" -> "LLVM IR" -> "ARM Codegen" ->
        "Ethos-N / Cortex-M";
        "ARM Codegen" -> "Ethos-N Driver";

        TOSA -> "MLIR GPU Dialects" -> SPIRV -> "Mali GPU (Vulkan)";
        SPIRV -> Vulkan;
    }

Fig. 94 ARM IR Conversion Flow

  • Node “Mali GPU (Vulkan)” is the SPIR-V compilation flow that was illustrated in the previous section.

  • Ethos-N is ARM’s NPU. Cortex-M is ARM’s CPU.

ARM Codegen generally emits instructions for CPU/NPU execution, but for certain NN operations (especially those requiring vendor-specific acceleration), it may generate function calls into the Ethos-N driver, which then orchestrates execution on the NPU.

✅ Common Case: Direct NPU/CPU Instruction Generation

  • For operations that are well-supported by the NPU or CPU, the codegen backend emits hardware-specific instructions or IR directly.

  • These are scheduled for execution on the CPU or passed to the Ethos-N via its driver stack.

⚙️ Special Case: Function Calls to Ethos-N Driver

  • For complex or fused neural network operations (e.g., custom activation functions, quantized convolutions, or optimized memory layouts), the codegen may emit calls (LLVM-IR `call`) to precompiled driver functions.

  • These function calls act as entry points into the Ethos-N runtime, which handles:

    • Memory management

    • Scheduling

    • Firmware-level execution

    • Hardware-specific optimizations

Imagination Technologies IR Conversion Flow

digraph Imagination_IR_Flow {
        rankdir=LR;
        node [shape=box];

        "ONNX / TFLite" [style=filled, fillcolor=lightyellow];
        TVM [style=filled, fillcolor=lightgreen];
        oneAPI [style=filled, fillcolor=lightgreen];
        OpenCL [style=filled, fillcolor=lightgreen];
        "E-Series Compiler IR" [style=filled, fillcolor=lightgray];
        "Burst Processor IR" [style=filled, fillcolor=lightgray];
        "Neural Core Execution" [style=filled, fillcolor=lightblue];

        "ONNX / TFLite" -> TVM -> "E-Series Compiler IR" -> "Burst Processor IR" -> "Neural Core Execution";
        "ONNX / TFLite" -> oneAPI -> "E-Series Compiler IR";
        "ONNX / TFLite" -> OpenCL -> "E-Series Compiler IR";
    }

Fig. 95 Imagination Technologies IR Conversion Flow

Comparison Summary

===========  =================  ====================  ======================  ======================
Vendor       High-Level IR      Mid-Level IR          Low-Level IR            Libraries / Runtimes
===========  =================  ====================  ======================  ======================
NVIDIA       ONNX, TensorRT IR  MLIR GPU Dialects     PTX → SASS              TensorRT
AMD          ONNX, MIOpen IR    MLIR Dialects         LLVM IR → GCN ISA       MIOpen, ROCm
ARM          ONNX, TFLite       TOSA, MLIR Dialects   LLVM IR / SPIR-V        Ethos-N Driver, Vulkan
Imagination  ONNX, TFLite       E-Series Compiler IR  Burst IR → Neural Core  OpenCL, TVM, oneAPI
===========  =================  ====================  ======================  ======================

Accelerate ML/DL on OpenCL/SYCL

_images/opencl_ml_graph.png

Fig. 96 Implement ML graph scheduler both on compiler and runtime

As shown in Fig. 96, the Device, such as a GPU or a CPU+NPU, is capable of running the entire ML graph. However, if the Device has only an NPU, then operations like Avg-Pool, which require CPU support, must run on the Host side. This introduces communication overhead between the Host and the Device.
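The Host/Device placement problem above can be sketched as follows; the op names and the `npuOps` capability set are illustrative assumptions, not a real NPU’s supported-op list:

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Where each op of the ML graph runs: on the Device (NPU) if supported,
// otherwise it falls back to the Host CPU (e.g., Avg-Pool above).
struct Placement { std::string op; bool onDevice; };

std::vector<Placement> placeOps(const std::vector<std::string>& graph,
                                const std::set<std::string>& npuOps) {
    std::vector<Placement> plan;
    for (const auto& op : graph)
        plan.push_back({op, npuOps.count(op) > 0});
    return plan;
}

// Count Host<->Device transitions: each one is a data transfer and thus
// communication overhead between the Host and the Device.
int transferCount(const std::vector<Placement>& plan) {
    int transfers = 0;
    for (std::size_t i = 1; i < plan.size(); ++i)
        if (plan[i].onDevice != plan[i - 1].onDevice) ++transfers;
    return transfers;
}
```

A scheduler that reorders or fuses ops to minimize `transferCount` is exactly the kind of scheduling software discussed next.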

Similar to OpenGL shaders, the “kernel” function may be compiled either online or offline and then sent to the GPU as a programmable function.

To run ML (Machine Learning) efficiently, all ML platforms for GPU/NPU implement scheduling software in both the graph compiler and the runtime. If OpenCL were extended to support ML graphs, then graph compilers such as TVM, or open-source runtimes, would have a chance to leverage this scheduling effort from programmers [26]. CUDA Graphs follows a similar idea [27] [28].
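The CUDA Graphs idea (record a fixed sequence of kernel launches once, then replay the whole graph with a single cheap submission) can be sketched in plain C++. `LaunchGraph` is a hypothetical stand-in for illustration, not the CUDA API:

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Minimal record/replay sketch: the per-launch scheduling cost is paid
// once at record time; replay submits the whole graph in one pass.
class LaunchGraph {
public:
    void record(std::function<void()> kernel) {
        kernels_.push_back(std::move(kernel));
    }
    void replay() {
        for (auto& k : kernels_) k();  // one submission for the whole graph
    }
private:
    std::vector<std::function<void()>> kernels_;
};
```

Usage: record the graph once, then call `replay()` every iteration of inference; the scheduling decisions are amortized across all replays.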

  • SYCL: Uses C++ templates to optimize and generate code for OpenCL and CUDA. It provides a consistent language, APIs, and ecosystem in which to write and tune code for different accelerator architectures: CPUs, GPUs, and FPGAs [29].

    • SYCL uses generic programming with templates and generic lambda functions to enable higher-level application software to be cleanly coded with optimized acceleration of kernel code across an extensive range of acceleration backend APIs, such as OpenCL and CUDA [30].
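A minimal, serial stand-in for SYCL’s templated `parallel_for` illustrates this style of generic programming; the `parallel_for` here is a plain loop for illustration, not the real SYCL API, which would dispatch the kernel to a device backend such as OpenCL or CUDA:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Generic dispatch: the kernel is a template parameter, so the same
// kernel source can be retargeted to different backends at compile time.
template <typename Kernel>
void parallel_for(std::size_t n, Kernel k) {
    for (std::size_t i = 0; i < n; ++i) k(i);  // serial "backend"
}

// SAXPY written once as a lambda-style kernel, in the SYCL idiom.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    parallel_for(y.size(), [&](std::size_t i) { y[i] = a * x[i] + y[i]; });
}
```

This is the same SAXPY loop used in the vector-processor example earlier, now expressed as a kernel passed to a generic dispatcher.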

_images/sycl.png

Fig. 97 SYCL = C++ templates and compiler for data-parallel AI applications on CPUs, GPUs, and FPGAs.

  • DPC++ (oneAPI Data Parallel C++) compiler: Based on SYCL, DPC++ can compile DPC++ code for both the CPU host and the GPU device. DPC++ is a language developed by Intel and may be adopted into standard C++. The GPU-side (kernel) code is written in C++ but does not support exception handling [31] [32].

    • Features of Kernel Code:

      • Not supported:

        Dynamic polymorphism, dynamic memory allocation (so no object management using the new or delete operators), static variables, function pointers, runtime type information (RTTI), and exception handling. No virtual member functions or variadic functions may be called from kernel code, and recursion is not allowed within kernel code.

      • Supported:

        Lambdas, operator overloading, templates, classes, and static polymorphism [33].
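A small plain-C++ sketch of kernel-style code that stays within these restrictions: templates, lambdas, operator overloading, and static polymorphism (CRTP instead of virtual functions), with no heap allocation, exceptions, or recursion. The names (`Activation`, `Relu`, `kernel_body`) are illustrative, not part of the DPC++ API:

```cpp
#include <cassert>

// Static polymorphism via CRTP: the activation is resolved at compile
// time, so no virtual member functions are needed (virtuals are not
// allowed in kernel code).
template <typename Derived>
struct Activation {
    float operator()(float x) const {  // operator overloading: supported
        return static_cast<const Derived&>(*this).apply(x);
    }
};

struct Relu : Activation<Relu> {
    float apply(float x) const { return x > 0.0f ? x : 0.0f; }
};

// A kernel-style body: templates and lambdas are supported; no new/delete,
// no RTTI, no exceptions, no recursion appear anywhere in this code.
template <typename Act>
float kernel_body(float x, const Act& act) {
    auto scale = [](float v) { return 2.0f * v; };  // lambdas: supported
    return scale(act(x));
}
```

Swapping `Relu` for another `Activation` subclass changes the kernel’s behavior entirely at compile time, which is how DPC++ kernels achieve flexibility without dynamic polymorphism.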