.. _software-struct:

Software Structure

As the previous section illustrated, the GPU is a SIMT (SIMD) machine for data-parallel applications. This section introduces how the GPU evolved from the graphics GPU to the general-purpose GPU (GPGPU), describes the software architecture of GPUs, and explores AI software frameworks designed for GPUs, NPUs, and CPUs.

Vector Processor

As described in the book Computer Architecture: A Quantitative Approach, the vector processor VMIPS introduces the Vector-Length Register (VLR) and Vector Mask (VM) to support SIMD execution. The Vector Mask functions like conditional instructions on CPUs: it provides a form of per-element conditional execution.

✅ Vector-Length Registers: Handling Loops Not Equal to 64

for (i=0; i <n; i=i+1)
  Y[i] = a * X[i] + Y[i];

In the code above, the value of n is not known at compile time.

Solution:

The compiler converts the loop into multiple iterations, each processing up to the maximum vector length (MVL), as shown below. For VMIPS, the MVL is 64.

low = 0;
VL = (n % MVL); /*find odd-size piece using modulo op % */
for (j = 0; j <= (n/MVL); j=j+1) { /*outer loop*/
  for (i = low; i < (low+VL); i=i+1) /*runs for length VL*/
    Y[i] = a * X[i] + Y[i] ; /*main operation*/
  low = low + VL; /*start of next vector*/
  VL = MVL; /*reset the length to maximum vector length*/
}

The inner loop of the preceding code is vectorizable with length VL, which equals either (n % MVL) or MVL. The VLR register must be set twice, once at each place where the variable VL is assigned.
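The strip-mined loop above can be checked with a small scalar simulation in plain C; the function name daxpy_strip_mined and the MVL constant below are illustrative, not part of VMIPS:

```c
#include <assert.h>

#define MVL 64  /* maximum vector length of VMIPS */

/* Walks the array in strips of at most MVL elements: the odd-size
 * (n % MVL) piece first, then full-length MVL strips. */
void daxpy_strip_mined(int n, double a, const double *x, double *y) {
    int low = 0;
    int vl = n % MVL;                     /* odd-size piece first */
    for (int j = 0; j <= n / MVL; j++) {  /* outer strip loop */
        for (int i = low; i < low + vl; i++)  /* one vector op of length vl */
            y[i] = a * x[i] + y[i];
        low += vl;                        /* start of next strip */
        vl = MVL;                         /* later strips use the full MVL */
    }
}
```

For n = 130, the strips have lengths 2, 64, and 64, covering all 130 elements exactly once.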

✅ Vector Mask Registers: Handling IF Statements in Vector Loops

for (i = 0; i < 64; i=i+1)
  if (X[i] != 0)
    X[i] = X[i] - Y[i];

For the VMIPS vector processor, the above code can be implemented using the Vector Mask (VM) register, as shown below.

Assembly code for the VMIPS vector processor (from page 276 of the Quantitative Approach book)

LV V1,Rx         ;load vector X into V1
LV V2,Ry         ;load vector Y
L.D F0,#0        ;load FP zero into F0
SNEVS.D V1,F0    ;sets VM(i) to 1 if V1(i)!=F0
SUBVV.D V1,V1,V2 ;subtract under vector mask
SV V1,Rx         ;store the result in X
  • Code reference here [3].
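The masked SNEVS.D/SUBVV.D sequence can be modeled in scalar C, with an array vm[] standing in for the Vector Mask register (the function and names below are illustrative):

```c
#include <assert.h>

#define VLEN 64  /* VMIPS vector length */

/* vm[i] plays the role of the Vector Mask bit for element i:
 * first set the mask (SNEVS.D), then subtract under the mask (SUBVV.D). */
void masked_sub(double *x, const double *y, int n) {
    int vm[VLEN];
    for (int i = 0; i < n; i++)   /* SNEVS.D: vm[i] = (x[i] != 0) */
        vm[i] = (x[i] != 0.0);
    for (int i = 0; i < n; i++)   /* SUBVV.D under vector mask */
        if (vm[i])
            x[i] = x[i] - y[i];
}
```

Elements whose mask bit is 0 are left untouched, exactly as in the IF statement of the original loop.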

General Purpose GPU

Since GLSL shaders allow general C-style code to be written in them, applying a software framework instead of the OpenGL API lets the system run data-parallel computation on the GPU for speedup, and even have the CPU and GPU executing simultaneously. Furthermore, any language that allows code running on the CPU to poll a GPU shader for return values can be used to create a GPGPU framework [1].

Mapping data in GPU

As described in the previous section on GPUs, consider the array calculation y[] = a * x[] + y[]. A subset of this computation is mapped to the GPU as follows.

// Invoke DAXPY with 256 threads per Thread Block
__host__
int nblocks = (n+255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
// DAXPY in CUDA
__device__
void daxpy(int n, double a, double *x, double *y) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}
  • name<<<dimGrid, dimBlock>>>(… parameter list …):

    • dimGrid: Number of Blocks in Grid

    • dimBlock: Number of Threads per Block (256 in this example)
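The index expression in the kernel can be sketched in C to show that consecutive (block, thread) pairs yield consecutive, non-overlapping element indices (the helper global_index below is illustrative):

```c
#include <assert.h>

/* Mirrors the CUDA expression
 * i = blockIdx.x * blockDim.x + threadIdx.x. */
int global_index(int block_idx, int block_dim, int thread_idx) {
    return block_idx * block_dim + thread_idx;
}
```

With dimBlock = 256, block 1 starts at element 256, exactly where block 0 ends, so every element gets exactly one thread.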

Assembly code of PTX (from page 300 of Quantitative book)

// The following sequence of PTX instructions is for one iteration of the
// DAXPY loop above.
shl.u32 R8, blockIdx, 9       ; Thread Block ID * Block size (512)
add.u32 R8, R8, threadIdx     ; R8 = i = my CUDA Thread ID
shl.u32 R8, R8, 3             ; byte offset
ld.global.f64 RD0, [X+R8]     ; RD0 = X[i]
ld.global.f64 RD2, [Y+R8]     ; RD2 = Y[i]
mul.f64 RD0, RD0, RD4         ; Product in RD0 = RD0 * RD4 (scalar a)
add.f64 RD0, RD0, RD2         ; Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64 [Y+R8], RD0     ; Y[i] = sum (X[i]*a + Y[i])
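The first three PTX instructions compute the element index and its byte offset; a C sketch (the helper ptx_byte_offset is illustrative) makes the arithmetic concrete:

```c
#include <assert.h>
#include <stdint.h>

/* R8 = blockIdx*512 + threadIdx (shl.u32 by 9, then add.u32),
 * then shifted left by 3 to form the byte offset of an 8-byte double. */
uint32_t ptx_byte_offset(uint32_t block_idx, uint32_t thread_idx) {
    uint32_t r8 = (block_idx << 9) + thread_idx;  /* i = Thread Block ID * 512 + thread ID */
    return r8 << 3;                               /* i * sizeof(double) */
}
```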

✅ Conditional Branching in GPUs [4]:

CUDA code for the example (referring to page 302 of the Quantitative Approach book)

__device__
void lane_mask_ex(double *X, double *Y, double *Z) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (X[i] != 0)
    X[i] = X[i] - Y[i];
  else X[i] = Z[i];
}
  • Code from here [5].

The following two instructions illustrate conditional (predicated) instruction execution on GPUs.

predicate = cond       // predicate is the mask register
@predicate instruction

This IF statement could compile to the following PTX instructions (assuming that R8 already has the scaled thread ID), with *Push, *Comp, *Pop indicating the branch synchronization markers inserted by the PTX assembler that push the old mask, complement the current mask, and pop to restore the old mask:

Assembly code of PTX (referring to page 302 of the Quantitative Approach book)

ld.global.f64 RD0, [X+R8]     ; RD0 = X[i]
setp.neq.s32 P1, RD0, #0      ; P1 is predicate register 1
@!P1, bra ELSE1, *Push        ; Push old mask, set new mask bits
                              ; if P1 false, go to ELSE1
ld.global.f64 RD2, [Y+R8]     ; RD2 = Y[i]
sub.f64 RD0, RD0, RD2         ; Difference in RD0
st.global.f64 [X+R8], RD0     ; X[i] = RD0
@P1, bra ENDIF1, *Comp        ; complement mask bits
                              ; if P1 true, go to ENDIF1
ELSE1:
ld.global.f64 RD0, [Z+R8]     ; RD0 = Z[i]
st.global.f64 [X+R8], RD0     ; X[i] = RD0
ENDIF1:
ret, *Pop                     ; pop to restore old mask

The PTX:

setp.neq.s32 P1, RD0, #0

On actual NVIDIA hardware (SASS), the instruction typically becomes:

ISETP.NE.AND P1, PT, R0, RZ, PT ; PT is always-true predicate

This instruction compares the 32-bit signed integer value in register RD0 with the constant 0. The result of the comparison is written to the predicate register P1.

Semantics:

P1 = (RD0 != 0)

Each thread (lane) in the warp evaluates this comparison independently.

Example (8-thread warp):

RD0 values:  [5, 3, 0, 7, 1, 0, 4, 2]
P1 result:   [1, 1, 0, 1, 1, 0, 1, 1]

This instruction does not modify the active thread mask. It only produces a predicate value that will be used by later predicated instructions or branches.
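The per-lane predicate evaluation above can be modeled in C, packing each lane's P1 bit into an integer mask (the helper warp_predicate_mask is illustrative):

```c
#include <assert.h>

/* Each lane evaluates P1 = (RD0 != 0) independently; bit i of the
 * returned mask holds lane i's predicate, matching the 8-thread
 * example above. */
unsigned warp_predicate_mask(const int *rd0, int lanes) {
    unsigned mask = 0;
    for (int lane = 0; lane < lanes; lane++)
        if (rd0[lane] != 0)
            mask |= 1u << lane;
    return mask;
}
```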

@!P1 bra ELSE1, *Push

This is a predicated branch instruction.

The prefix @!P1 means the instruction executes only for threads where the predicate P1 is false.

Semantics:

if (!P1)
    branch to ELSE1

If all threads agree on the predicate value, the warp simply branches or falls through. However, if some threads have P1 = 1 and others have P1 = 0, control flow divergence occurs.

Mask Stack Operation

The *Push modifier indicates that the hardware must update the SIMT control-flow stack.

When divergence occurs:

  1. The current active mask is pushed onto the stack.

  2. The warp execution mask is split into two masks:

    • mask_then = active_mask & P1

    • mask_else = active_mask & !P1

  3. Execution proceeds with one mask while the other path is saved for later execution.

Conceptual behavior:

push(active_mask)

mask_then = active_mask &  P1
mask_else = active_mask & !P1

execute THEN block using mask_then
later execute ELSE block using mask_else
Threads in the THEN mask execute the fall-through path, while threads in the ELSE mask branch to ELSE1.

Keep in mind, however, that the only choice for a SIMD Lane in a clock cycle is to perform the operation specified in the PTX instruction or be idle; two SIMD Lanes cannot simultaneously execute different instructions [4].
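The push/split behavior can be sketched in C for an assumed 8-lane warp, with bit i of each mask marking whether Lane i is active (the helper split_masks is illustrative):

```c
#include <assert.h>

/* On divergence the old active mask is saved (pushed), and the
 * THEN/ELSE masks partition it by the predicate:
 *   mask_then = active_mask &  P1
 *   mask_else = active_mask & !P1 */
void split_masks(unsigned active, unsigned p1,
                 unsigned *mask_then, unsigned *mask_else) {
    *mask_then = active & p1;   /* lanes taking the THEN path */
    *mask_else = active & ~p1;  /* lanes taking the ELSE path */
}
```

The two masks are disjoint and together cover every active lane, which is why the hardware can execute one path, then the other, and finally pop the old mask.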

The following table explains how the elements of saxpy() are mapped to the Lanes of a SIMD Thread (Warp), which belongs to a Thread Block (Core) within a Grid.

Table 13 Mapping saxpy code to Fig. 68.

saxpy()        Instance in Fig. 68             Description

blockDim.x     The size of a Thread Block      blockDim: in this example, configured as in
                                               Fig. 68, 16 (SIMD Threads) * 16 (Lanes)
                                               = 256 CUDA Threads per Thread Block

blockIdx.x     The index of a Thread Block     blockIdx: the index of the Thread Block
                                               within the Grid

threadIdx.x    The index of a CUDA Thread      threadIdx: the index of the CUDA Thread
               (an element)                    within its Thread Block

  • With Fermi, each 32-wide thread of SIMD instructions is mapped to 16 physical SIMD Lanes, so each SIMD instruction in a thread of SIMD instructions takes two clock cycles to complete.

  • You could say that it has 16 Lanes, the vector length would be 32, and the Chime is 2 clock cycles.

  • The mapping of y[0..31] = a * x[0..31] + y[0..31] to <Core, Warp, CUDA Thread> of the GPU is shown in the following table. x[0..31] maps to 32 CUDA Threads; two CUDA Threads map to one SIMD Lane as in Fig. 67.

Table 14 Map <Core, Warp> to saxpy

          Warp-0                        Warp-1                          ...  Warp-15

Core-0    y[0..31] =                    y[32..63] =                     ...  y[480..511] =
          a * x[0..31] + y[0..31]       a * x[32..63] + y[32..63]            a * x[480..511] + y[480..511]

...       ...                           ...                             ...  ...

Core-15   y[7680..7711] = a * …         ...                             ...  y[8160..8191] =
                                                                             a * x[8160..8191] + y[8160..8191]

  • Each CUDA Thread runs the GPU function code saxpy. Fermi has a register file of size 32768 x 32-bit. As shown in Fig. 59, the number of registers in a Thread Block is: 16 (SIMD Threads) * 32 (CUDA Threads per SIMD Thread) * 64 (TLR, Thread-Level Registers) = 32768 x 32-bit (the register file).

  • When mapping to fragments/pixels in a graphics GPU, x[0..15] corresponds to a two-dimensional tile of fragments/pixels at pixel[0..3][0..3], since images use tile-based grouping to cluster similar colors together.
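The linear-index-to-tile mapping can be sketched as follows; the helper tile_coord and the row-major layout are assumptions for illustration:

```c
#include <assert.h>

/* Maps a linear element index 0..15 onto a 4x4 fragment/pixel tile
 * pixel[row][col], as in pixel[0..3][0..3] above (row-major assumed). */
void tile_coord(int i, int *row, int *col) {
    *row = i / 4;  /* which row of the 4x4 tile */
    *col = i % 4;  /* which column within that row */
}
```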

Work between CPU and GPU in CUDA

The previous daxpy() GPU code did not include the host (CPU) side code that triggers the GPU function.

The following example shows the host (CPU) side of a CUDA program that calls saxpy on the GPU [2]:

#include <stdio.h>

__global__
void saxpy(int n, float a, float * x, float * y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

int main(void)
{
  ...
  cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);
  ...
  saxpy<<<(N+255)/256, 256>>>(N, 2.0, d_x, d_y);
  ...
  cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);
  ...
}

The main() function runs on the CPU, while saxpy() runs on the GPU. The CPU copies data from x and y to the corresponding device arrays d_x and d_y using cudaMemcpy.

The saxpy kernel is launched with the following statement:

saxpy<<<(N+255)/256, 256>>>(N, 2.0, d_x, d_y);

This launches the kernel with Thread Blocks containing 256 threads, and uses integer arithmetic to determine the number of Thread Blocks needed to process all N elements in the arrays. The expression (N+255)/256 ensures full coverage of the input data.
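The rounding-up launch arithmetic and the kernel's if (i < n) guard can be modeled together in C (count_active is an illustrative helper, not a CUDA API):

```c
#include <assert.h>

/* (n + block - 1) / block Thread Blocks give nblocks*block >= n
 * threads; the kernel's `if (i < n)` guard idles the excess ones.
 * Returns how many threads actually write a result. */
int count_active(int n, int threads_per_block) {
    int nblocks = (n + threads_per_block - 1) / threads_per_block;
    int active = 0;
    for (int i = 0; i < nblocks * threads_per_block; i++)
        if (i < n)  /* the kernel guard */
            active++;
    return active;
}
```

For N = 1000, (N+255)/256 = 4 blocks launch 1024 threads, and the guard leaves exactly 1000 of them active.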

Using cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost, the CPU can pass data in x and y to the GPU, and retrieve the results back to y.

Since both memory transfers are handled by DMA and do not require CPU involvement, performance can be improved by running the CPU and GPU independently, each accessing its own cache.

After the DMA copy from CPU memory to GPU memory, the GPU performs the entire loop for y[] = a * x[] + y[]; using a single Grid of threads.

The DMA memcpy moves the data from CPU memory into GPU memory, from which the L1 cache of each GPU core is filled.

Many GPUs support scatter and gather operations to access DRAM efficiently for stream processing tasks [6] [1] [7].

When the GPU function performs dense array computation, such as an MPEG4 encoder or deep-learning weight tuning, it may achieve a significant speedup [8]. However, when the GPU function is a simple matrix addition and the CPU idles while waiting for the GPU's result, the program may run slower than doing the matrix addition on the CPU alone. Arithmetic intensity is defined as the number of operations performed per word of memory transferred. It is important for GPGPU applications to have high arithmetic intensity; otherwise the memory access latency will limit the computational speedup [1].
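For DAXPY itself the arithmetic intensity is low: each element performs 2 floating-point operations (one multiply, one add) while moving 3 memory words (load x[i], load y[i], store y[i]). A small C sketch (daxpy_arithmetic_intensity is an illustrative name):

```c
#include <assert.h>
#include <math.h>

/* Arithmetic intensity of DAXPY: 2 flops per element against 3 memory
 * words per element, so the ratio is 2/3 regardless of n. */
double daxpy_arithmetic_intensity(long n) {
    double flops = 2.0 * (double)n;  /* one multiply + one add per element */
    double words = 3.0 * (double)n;  /* load x[i], load y[i], store y[i] */
    return flops / words;
}
```

An intensity of 2/3 flops per word means DAXPY is memory-bound, which is why a lone matrix/vector addition rarely benefits from GPU offload.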

The wiki [9] lists GPU-accelerated applications that gain speedup as follows:

General-Purpose Computing on GPUs has found its way into fields as diverse as machine learning, oil exploration, scientific image processing, linear algebra, statistics, 3D reconstruction, and even stock-options pricing. In addition, the section “GPU accelerated video decoding and encoding” [9] gives more applications for GPU acceleration.

Table 15 The differences for speedup in architecture of CPU and GPU

Item          CPU                                    GPU

Application   Non-data parallel                      Data parallel

Architecture  SISD, small vector (e.g. 4*32 bits)    Large SIMD (e.g. 16*32 bits)

Cache         Smaller and faster                     Larger and slower (see the Note below)

ILP           Pipeline; Superscalar, SMT;            Pipeline; SIMT
              Super-pipeline

Core          Fewer threads for SMT (2 or 4)         More threads (16 or 32)

Branch        Conditional instructions               Mask & conditional instructions

Note: GPU Cache

In theory, for data-parallel applications using the GPU's SIMT execution, the GPU schedules many threads and aims for throughput rather than the single-thread speedup pursued by SISD CPUs.

In practice, however, GPUs provide only a small L1 cache, similar to CPUs, and hide cache misses by scheduling another thread.

As a result, GPUs often have shallower cache hierarchies than CPUs, which commonly add L2 and L3 caches.

OpenCL, Vulkan and SPIR-V

digraph G {
  rankdir=LR;

  compound=true;
  node [shape=record];
  SW_LAYER [label="{ GLSL | OpenCL } | SPIR-V | GPU machine code"];
}

Fig. 86 OpenCL and GLSL(OpenGL)

Table 16 OpenCL and OpenGL SW system

Name of SW   GPU language   Level of GPU language

OpenCL       OpenCL         C99 dialect (with C pointers, …)

OpenGL       GLSL           C-like (no C pointers, …)

Vulkan       SPIR-V         IR

_images/opencl-to-spirv-offine-compilation.png

Fig. 87 Offline Compilation of OpenCL Kernels into SPIR-V Using Open Source Tooling [11]

  • clang: compiles OpenCL to SPIR-V for the runtime+driver, or compiles OpenCL to LLVM IR, which the “SPIR-V LLVM Translator” then translates to SPIR-V for the runtime+driver.

  • clspv: compiles OpenCL to SPIR-V directly.

digraph ShaderToLLVMIR {
    rankdir=LR;
    node [shape=record, style=filled, color=black];

    // Source Languages
    GLSL [label="GLSL ES (OpenGL)", fillcolor=white];
    OpenCL_C [label="OpenCL C", fillcolor=white];

    // Intermediate Representation
    SPIRV [label="SPIR-V", fillcolor=orange];

    GPU_ISA [label="GPU ISA in Assembly and Binary", fillcolor=grey];

    // LLVM IR
    LLVM_IR [label="LLVM IR", fillcolor=orange];

    // Tools with oval shapes
    node [shape=oval, style=filled, fillcolor=lightgreen];
    Glslang [label="glslangValidator", fillcolor=lightblue];
    CL_SPIRV [label="OpenCL-SPIRV Translator"];

    // Tools with oval shapes
    node [shape=oval, style=filled, fillcolor=yellow];
    SPIRV_LLVM [label="SPIRV-LLVM Translator"];
    LLVMCompiler [label="Backend Compiler"];

    // Edges
    GLSL -> Glslang -> SPIRV;
    OpenCL_C -> CL_SPIRV -> SPIRV ->  SPIRV_LLVM -> LLVM_IR -> LLVMCompiler -> GPU_ISA;
}

Fig. 88 GPU Compiler Components and Flow

The flow and relationships among GLSL, OpenCL, SPIR-V (Vulkan/OpenCL), LLVM IR, and the GPU compiler are shown in Fig. 86, Fig. 87 and Fig. 88. As shown in Fig. 88, OpenCL C to SPIR-V (OpenCL) can be compiled using the clang + llvm-spirv tools or a proprietary converter.

As shown in Fig. 88, both GLSL and OpenCL use frontend tools to generate SPIR-V. The driver can invoke either the GLSL or OpenCL compiler based on metadata fields in the SPIR-V, as illustrated in Fig. 89 and the following figures, which describe offline compilation from GLSL/OpenCL to SPIR-V and online execution using the generated SPIR-V files.

digraph SPIRV_Deployment {
    rankdir=LR;
    node [shape=box, style=filled, fillcolor=lightgray, fontname="Helvetica"];

    subgraph cluster_glsl {
        label = "From GLSL";
        glsl_src [label="GLSL Shader\n (.vert/.frag/.comp)", fillcolor=lightblue];
        glsl_compiler [label="glslangValidator\n(or similar compiler)", fillcolor=lightgreen];
        spirv_glsl [label="SPIR-V\n (from GLSL)", fillcolor=gold];
        glsl_src -> glsl_compiler -> spirv_glsl;
    }

    subgraph cluster_opencl {
        label = "From OpenCL C";
        opencl_src [label="OpenCL C (.cl)", fillcolor=lightblue];
        clang_spirv [label="Clang + SPIR-V Backend", fillcolor=lightgreen];
        spirv_opencl [label="SPIR-V\n (from OpenCL C)", fillcolor=gold];
        opencl_src -> clang_spirv -> spirv_opencl;
    }

    subgraph cluster_opencl_runtime {
        label = "OpenCL Runtime (Host)";
        spirv_loader [label="clCreateProgramWithIL()", fillcolor=orange];
        spirv_glsl -> spirv_loader;
        spirv_opencl -> spirv_loader;
        spirv_loader -> device_driver [label="Load\n SPIR-V\n into driver"];
        device_driver [label="OpenCL Driver\n(SPIR-V → Device IR → Machine Code)", fillcolor=plum];
        device_driver -> execution [label="Compiled &\n Run on device"];
        execution [label="Execute on\n OpenCL Device", fillcolor=lightyellow];
    }

    // Styling
    edge [fontname="Helvetica"];
}

Fig. 89 Compiling and Deploying GPU Code from GLSL, Vulkan, and OpenCL

Based on the flows above, the public standards OpenGL and OpenCL provide tools for converting between these data formats, as illustrated in Fig. 90. The corresponding LLVM IR and SPIR-V files are listed below.

digraph G {
  rankdir=LR;

  // Data Nodes
  node [shape=record, style=filled, fillcolor=white];
  glsl [label="glsl"];
  openclc [label="OpenCL C"];
  spirv [label="spirv"];
  llvm [label="llvm-ir"];

  // Tools Nodes
  node [shape=oval, style=filled, fillcolor=lightgreen];
  glslang [label="glslangValidator"];
  spirv_cross [label="spirv-cross"];
  clspv [label="clspv"];
  llvm_spirv [label="llvm-spirv"];

  glsl -> glslang -> spirv;
  glsl -> spirv_cross -> spirv [dir="back"];
  openclc -> clspv -> spirv;
  spirv -> llvm_spirv -> llvm;
  llvm -> llvm_spirv -> spirv;
}

Fig. 90 Conversion between GLSL, OpenCL C, SPIR-V and LLVM IR

References/add-matrix.ll

; ModuleID = 'add-matrix.ll'
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v24:32:32-v32:32:32-v48:64:64-v64:64:64-v96:128:128-v128:128:128-v192:256:256-v256:256:256-v512:512:512-v1024:1024:1024-G1"
target triple = "spir64-unknown-unknown"

; Function Attrs: nounwind
define spir_func <4 x i32> @add_mat(<4 x i32> %a, <4 x i32> %b) #0 {
entry:
  %sum = add <4 x i32> %a, %b
  ret <4 x i32> %sum
}

attributes #0 = { nounwind }

!spirv.MemoryModel = !{!0}
!opencl.enable.FP_CONTRACT = !{}
!spirv.Source = !{!1}
!opencl.spir.version = !{!2}
!opencl.used.extensions = !{!3}
!opencl.used.optional.core.features = !{!3}
!spirv.Generator = !{!4}

!0 = !{i32 2, i32 2}
!1 = !{i32 0, i32 0}
!2 = !{i32 1, i32 2}
!3 = !{}
!4 = !{i16 6, i16 14}

References/add-matrix.spvasm

; SPIR-V
; Version: 1.0
; Generator: Khronos LLVM/SPIR-V Translator; 14
; Bound: 10
; Schema: 0
               OpCapability Addresses
               OpCapability Linkage
               OpCapability Kernel
          %1 = OpExtInstImport "OpenCL.std"
               OpMemoryModel Physical64 OpenCL
               OpSource Unknown 0
               OpName %add_mat "add_mat"
               OpName %a "a"
               OpName %b "b"
               OpName %entry "entry"
               OpName %sum "sum"
               OpDecorate %add_mat LinkageAttributes "add_mat" Export
       %uint = OpTypeInt 32 0
     %v4uint = OpTypeVector %uint 4
          %4 = OpTypeFunction %v4uint %v4uint %v4uint
    %add_mat = OpFunction %v4uint None %4
          %a = OpFunctionParameter %v4uint
          %b = OpFunctionParameter %v4uint
      %entry = OpLabel
        %sum = OpIAdd %v4uint %a %b
               OpReturnValue %sum
               OpFunctionEnd

Convert between spirv and llvm-ir

% pwd
$HOME/git/lbd/References
% llvm-as -o add-matrix.bc add-matrix.ll
% llvm-spirv -o add-matrix.spv add-matrix.bc
% spirv-dis -o add-matrix.spvasm add-matrix.spv
// Convert the spirv back to llvm-ir and check that the converted
// llvm-ir matches the original.
% llvm-spirv -r add-matrix.spv -o add-matrix.spv.bc
% llvm-dis add-matrix.spv.bc -o add-matrix.spv.bc.ll
% diff add-matrix.ll add-matrix.spv.bc.ll
1c1
< ; ModuleID = 'add-matrix.ll'
---
> ; ModuleID = 'add-matrix.spv.bc'

Install llvm-spirv and llvm with brew install

% brew install spirv-llvm-translator
% brew install llvm

The following explains how the driver identifies whether the SPIR-V source is from GLSL or OpenCL.

SPIR-V binaries contain metadata that can help identify whether they were generated from OpenCL, GLSL, or another language.

  • Execution Model

    Defined by the OpEntryPoint instruction. It is a strong indicator of the source language.

    ExecutionModel   Typical Source   Notes

    Kernel           OpenCL           Used only by OpenCL C
    GLCompute        GLSL or HLSL     Used in compute shaders
    Fragment         GLSL or HLSL     For pixel shaders
    Vertex           GLSL or HLSL     For vertex shaders

  • Capabilities

    Declared using OpCapability. They provide clues about the SPIR-V’s execution model and source.

    Capability   Likely Source

    Kernel       OpenCL
    Addresses    OpenCL
    Linkage      OpenCL
    Shader       GLSL or HLSL

  • Extensions

    Declared using OpExtension. Some are tied to specific compilers or languages.

    Extension                            Likely Source

    SPV_KHR_no_integer_wrap_decoration   OpenCL
    SPV_INTEL_unified_shared_memory      OpenCL (Intel)
    SPV_AMD_shader_ballot                GLSL (graphics)

  • Memory Model

    Defined by OpMemoryModel.

    • OpenCL → OpenCL source

    • GLSL450 → GLSL or HLSL source

  • How to Inspect

    Use the spirv-dis tool to disassemble SPIR-V to human-readable form:

    spirv-dis kernel.spv -o kernel.spvasm
    

    Look for these at the top of the file:

    Example (GLSL):

    OpCapability Shader
    OpMemoryModel Logical GLSL450
    OpEntryPoint GLCompute %main "main"
    

    Example (OpenCL):

    OpCapability Kernel
    OpCapability Addresses
    OpMemoryModel Physical64 OpenCL
    OpEntryPoint Kernel %foo "foo"
    

Summary

Feature                  Indicates

OpEntryPoint Kernel      OpenCL
OpCapability Shader      GLSL or HLSL
OpMemoryModel OpenCL     OpenCL
OpMemoryModel GLSL450    GLSL or HLSL

  • Comparison of OpenCL and OpenGL's compute shader.

    • Same:

      Both target general-purpose computing on the GPU.

    • Difference:

      OpenCL covers GPUs as well as other accelerator devices/processors. OpenCL code is C on the device and C/C++ on the host, based on the OpenCL runtime. A compute shader is written in the GLSL shader language, runs in the OpenGL graphics environment, and can integrate with and access data from the OpenGL API easily [10].

  • OpenGL/GLSL vs Vulkan/SPIR-V.

    • High-level API and shader: OpenGL, GLSL.

    • Low-level API and shader: Vulkan, SPIR-V.

Although the OpenGL API sits at a higher level and offers the many advantages described in the sections above, it sometimes cannot compete in efficiency with Direct3D, whose lower-level APIs let user programs manage memory themselves [13]. The Vulkan API is a lower-level C/C++ API that fills this gap, allowing user programs to do these things in the OpenGL world and compete against Microsoft Direct3D; an example is here [14]. Meanwhile, GLSL is a C-like language. The Vulkan infrastructure provides a tool, glslangValidator [15], to compile GLSL into an Intermediate Representation (IR) called SPIR-V offline. As a result, it saves part of the online compilation time from GLSL to GPU instructions, since SPIR-V is an IR at a level close to LLVM IR [16]. In addition, the Vulkan API reduces the GPU driver's effort in optimization and code generation [13]. These standards give programmers the option of using Vulkan/SPIR-V instead of OpenGL/GLSL, pre-compiling GLSL into SPIR-V offline to save part of the online compilation time.

With the Vulkan and SPIR-V standards, the GPU can also be used through OpenCL for parallel programming of heterogeneous systems [17] [18]. Similar to CUDA, an OpenCL example for the fast Fourier transform (FFT) is here [19]. Once OpenCL grows into a popular standard, with more computer languages or frameworks supporting it, the GPU will take over more jobs from the CPU [20].

Most GPUs have 16 or 32 Lanes in a SIMD processor (Warp). Vulkan provides subgroup operations for data-parallel programming across the Lanes of a SIMD processor [21].

Subgroup operations provide a fast way to move data between Lanes within a Warp. Assume each Warp has four Lanes. The following table lists the results of the reduce, inclusive, and exclusive operations.

Table 17 Each Lane's value after the Reduce, Inclusive and Exclusive operations, respectively

Lane           0             1          2          3

Initial value  a             b          c          d

Reduce         OP(abcd)      OP(abcd)   OP(abcd)   OP(abcd)

Inclusive      OP(a)         OP(ab)     OP(abc)    OP(abcd)

Exclusive      not defined   OP(a)      OP(ab)     OP(abc)

  • Reduce: e.g. subgroupAdd. Inclusive: e.g. subgroupInclusiveAdd. Exclusive: e.g. subgroupExclusiveAdd.

  • For examples:

    • ADD operation: OP(abcd) = a+b+c+d.

    • MAX operation: OP(abc) = MAX(a,b,c).

  • When Lane i is inactive, its value is excluded.

    • For instance, if Lane 0 is inactive, then for the MUL operation: OP(abcd) = b*c*d.
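The three operations can be modeled in C for a 4-lane warp with OP = + (the helper subgroup_add_ops is illustrative; for the exclusive scan, lane 0 is undefined per Table 17, and this sketch assumes 0, the identity for +):

```c
#include <assert.h>

#define LANES 4

/* Models subgroupAdd / subgroupInclusiveAdd / subgroupExclusiveAdd
 * over a 4-lane warp with OP = +. */
void subgroup_add_ops(const int v[LANES], int reduce[LANES],
                      int incl[LANES], int excl[LANES]) {
    int total = 0;
    for (int i = 0; i < LANES; i++)
        total += v[i];
    int running = 0;
    for (int i = 0; i < LANES; i++) {
        reduce[i] = total;   /* every lane sees OP(abcd) */
        excl[i] = running;   /* OP of lanes 0..i-1 (0 assumed for lane 0) */
        running += v[i];
        incl[i] = running;   /* OP of lanes 0..i */
    }
}
```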

The following is a code example.

An example of subgroup operations in glsl for vulkan

vec4 sum = vec4(0, 0, 0, 0);
if (gl_SubgroupInvocationID < 16u) {
  sum += subgroupAdd(data[gl_SubgroupInvocationID]);
}
else {
  sum += subgroupInclusiveMul(data[gl_SubgroupInvocationID]);
}
subgroupMemoryBarrier();
  • NVIDIA's GPU provides __syncwarp() for subgroupMemoryBarrier(), or the compiler synchronizes the Lanes within the same Warp.

To let the Lanes in the same SIMD processor work efficiently, data-uniformity analysis provides many optimization opportunities in register allocation, transformation, and code generation [22].

The expansion of LLVM IR from CPUs to GPUs is becoming increasingly influential. In fact, LLVM IR has been expanding steadily since version 3.1, as I have observed.

Unified IR Conversion Flows

This section outlines the intermediate representation (IR) flows for graphics (Microsoft DirectX, OpenGL) and OpenCL compilation across major GPU vendors: NVIDIA, AMD, ARM, Imagination Technologies, and Apple.

Graphics Compilation Flow (Microsoft DirectX & OpenGL)

Graphics shaders are compiled from high-level languages (HLSL, GLSL) into vendor-specific GPU binaries via intermediate representations like DXIL and SPIR-V.

✅ Each node in the graph is color-coded to indicate its category or role within the structure.

digraph G {
    node [shape=box, style=filled];
    PUIR [label="Public standard of IRs", fillcolor=lightyellow];
    PRIR [label="Private IRs", fillcolor=lightgray];
    VD [label="Vendor Driver", shape=oval, fillcolor=lightgreen];
    GPU [label="GPU", fillcolor=lightblue];
}
  • Vendor Driver will call Vendor Compiler for on-line compilation.

digraph OpenCL_OpenGL_Compilation {
    rankdir=LR;
    node [shape=box];

    // Source Languages
    OpenCL_C [style=filled, fillcolor=lightyellow];
    GLSL [style=filled, fillcolor=lightyellow];

    // Shared IRs
    SPIR [style=filled, fillcolor=lightyellow];
    SPIRV [label="SPIR-V IR", style=filled, fillcolor=lightyellow];
    LLVM_IR [style=filled, fillcolor=lightyellow];

    // Vendor Drivers
    node [shape=oval];
    "NVIDIA Driver" [style=filled, fillcolor=lightgreen];
    "AMD Driver" [style=filled, fillcolor=lightgreen];
    "ARM Driver" [style=filled, fillcolor=lightgreen];
    "Imagination Driver" [style=filled, fillcolor=lightgreen];
    "Apple Driver" [style=filled, fillcolor=lightgreen];

    // Private IRs
    node [shape=box];
    "PTX IR" [style=filled, fillcolor=gray];
    "GCN IR" [style=filled, fillcolor=gray];
    "Burst IR" [style=filled, fillcolor=gray];
    "Metal IR" [style=filled, fillcolor=gray];

    // GPU Targets
    "NVIDIA GPU ISA" [style=filled, fillcolor=lightblue];
    "AMD GPU ISA" [style=filled, fillcolor=lightblue];
    "ARM Mali ISA" [style=filled, fillcolor=lightblue];
    "Imagination USC ISA" [style=filled, fillcolor=lightblue];
    "Apple GPU ISA" [style=filled, fillcolor=lightblue];

    // OpenCL Flow
    OpenCL_C -> SPIR -> LLVM_IR;
    OpenCL_C -> SPIRV;

    // OpenGL Flow
    GLSL -> SPIRV -> LLVM_IR;

    // LLVM based Compilation Flow
    LLVM_IR -> "PTX IR" -> "NVIDIA Driver" -> "NVIDIA GPU ISA";
    LLVM_IR -> "GCN IR" -> "AMD Driver" -> "AMD GPU ISA";
    LLVM_IR -> "ARM Driver" -> "ARM Mali ISA";
    LLVM_IR -> "Burst IR" -> "Imagination Driver" -> "Imagination USC ISA";
    LLVM_IR -> "Metal IR" -> "Apple Driver" -> "Apple GPU ISA";
}

Fig. 91 Graphics and OpenCL Compiler IR Conversion Flow

  • OpenCL C is the device side code in C language while Host side code is C/C++.

  • OpenCL C is compiled to SPIR-V in later versions of OpenCL, while earlier versions used SPIR. SPIR-V has now largely replaced SPIR as the standard intermediate representation.

Table 18 Comparison of PTX, GCN IR, Burst IR, and Metal IR

PTX (NVIDIA)

  Abstraction Level: Virtual ISA; portable across GPU generations; hides hardware scheduling.

  Register Model: Virtual registers (%r, %f); mapped to physical registers during SASS lowering.

GCN IR (AMD)

  Abstraction Level: Machine IR; tightly coupled to the GCN/RDNA architecture; exposes Wavefront semantics.

  Register Model: Explicit vector (vN) and scalar (sN) registers; register pressure affects occupancy. AMD's compiler backend can lower vector operations to scalar instructions on low-end GPUs, while preserving vector operations on high-end architectures.

Burst IR (Imagination)

  Abstraction Level: Power-aware IR; optimized for burst-core scheduling and latency hiding.

  Register Model: Operand staging model; abstracted register usage; mapped late to the USC ISA.

Metal IR (Apple)

  Abstraction Level: LLVM-inspired IR; abstracted from developers; tuned for tile shading and threadgroup fusion.

  Register Model: Region-based register allocation; dynamic renaming; not exposed as a physical register model.

✅ NVIDIA, AMD, ARM and Imagination all have exposed LLVM IR and convert SPIR-V IR to LLVM IR.

  • SPIR:

    • For OpenCL development, the IR started from SPIR (LLVM-based IR).

    • SPIR's limitation: tightly coupled to specific LLVM versions, making it brittle across LLVM releases.

  • SPIR-V:

    • A complete redesign: binary format, not tied to LLVM.

    • Designed for Vulkan, but also supports OpenCL and OpenGL.

    • Enables cross-vendor portability, shader reflection, and custom extensions.

    • Used in graphics and compute pipelines, including ML workloads via Vulkan compute.

    • A Vulkan shader written in GLSL is compiled to SPIR-V, then passed to the GPU driver.

    • An OpenCL kernel written in C can be compiled to SPIR-V, then lowered to LLVM IR internally by vendors like AMD or NVIDIA.

⚠️ Apple

  • Uses LLVM IR Partially. Apple supports SPIR-V in Metal and OpenCL, but LLVM IR is not always exposed.

  • Metal shaders are compiled via Metal IR, which is LLVM-inspired but not standard LLVM IR and is not exposed to developers.

  • Apple’s ML compiler stack may use LLVM IR internally, but it’s abstracted from developers.

  • Apple is not a vendor of GPU IP, so it does not expose LLVM IR in its ML or graphics APIs for the reasons below:

    • Security: opaque compilation prevents tampering

    • Performance tuning: Apple controls the entire stack for optimal hardware use

    • Developer simplicity: high-level APIs reduce friction

Notes:

  • HLSL → DXIL → DirectX is Microsoft’s graphics pipeline, used on Windows and Xbox.

  • GLSL → SPIR-V → OpenGL/Vulkan is cross-platform and supported by all vendors.

  • Final GPU ISA varies by vendor:

    • NVIDIA: PTX → SASS

    • AMD: LLVM IR → GCN/RDNA

    • ARM: Mali ISA

    • Imagination: USC ISA

    • Apple: Metal GPU ISA

Notes:

  • OpenCL C → SPIR → Vendor Driver → GPU ISA is the standard compilation path.

  • Some vendors (e.g., AMD, NVIDIA) may bypass SPIR and compile directly to LLVM IR or PTX.

  • Apple deprecated OpenCL in favor of Metal, but legacy support remains.

✅ References

ML and GPU Compilation

This section outlines the intermediate representation (IR) flows used by NVIDIA, AMD, and ARM in machine learning and GPU compilation pipelines. It includes both inference engines and compiler toolchains.

✅ Each node in the graph is color-coded to indicate its category or role within the structure. In AI, the term runtime is usually used instead of driver, the term used in graphics.

digraph G {
    node [shape=box, style=filled];
    PUIR [label="Public standard of IRs", fillcolor=lightyellow];
    PRIR [label="Private IRs", fillcolor=lightgray];
    VDLR [label="Vendor Driver,\nLibrary or Runtime", shape=oval, fillcolor=lightgreen];
    MA [label="Machine", fillcolor=lightblue];
}

NVIDIA IR Conversion Flow

NVIDIA supports both TensorRT-based inference and MLIR-based compilation targeting CUDA GPUs, as shown in Fig. 92.

digraph NVIDIA_IR_Flow {
        rankdir=LR;
  
        node [shape=box]; 
  
        ONNX [style=filled, fillcolor=lightyellow];
        "TensorRT Graph IR" [style=filled, fillcolor=lightgray];
        "CUDA Kernel IR" [style=filled, fillcolor=lightgray];
        PTX [style=filled, fillcolor=lightyellow];
        SASS [label="SASS (NVIDIA GPU ISA)", style=filled, fillcolor=lightblue];
        TensorRT [style=filled, shape=oval, fillcolor=lightgreen];

        TOSA [style=filled, fillcolor=lightyellow];
        "MLIR GPU Dialects" [style=filled, fillcolor=lightyellow];
        "LLVM IR" [style=filled, fillcolor=lightyellow];

        ONNX -> "TensorRT Graph IR" -> "CUDA Kernel IR" -> PTX;
        "TensorRT Graph IR" -> TensorRT;

        TOSA -> "MLIR GPU Dialects" -> "LLVM IR" -> PTX -> SASS;
    }

Fig. 92 NVIDIA IR Conversion Flow

  • SASS stands for Streaming ASSembler, and it represents the native instruction set architecture (ISA) for NVIDIA GPUs.

  • TensorRT is a C++ library and runtime developed by NVIDIA for deep learning inference—the phase where trained models make predictions.

    • It works with models trained in frameworks like TensorFlow, PyTorch, and ONNX, converting them into highly optimized engines for execution on NVIDIA hardware [23] [24].

  • CUDA Kernel IR is a bridge between LLVM IR and PTX/SASS, or a direct output from TensorRT.

  • LLVM IR is foundational in many flows, but TensorRT may skip it and directly emit CUDA kernels.

  • Although MLIR dialects may be publicly available, they are typically hardware-dependent and tailored to specific vendors’ GPU architectures. As a result, their applicability is limited to the corresponding hardware platforms.

  • The MLIR GPU dialect is public, but in this flow it targets NVIDIA’s GPUs.

AMD IR Conversion Flow

AMD uses ROCm and MIOpen for ML workloads, with LLVM-based compilation targeting GCN or RDNA architectures, as shown in Fig. 93.

digraph ROCm_Runtime_PyTorch_MIOpen_PreMLIR_Flow {
    rankdir=LR;
    node [shape=box];

    // Entry points
    PyTorch_Model [label="PyTorch Model", style=filled, fillcolor=lightyellow];
    ONNX_Model [label="ONNX Model", style=filled, fillcolor=lightyellow];
    MLIR_TOSA [label="MLIR (TOSA dialect)", style=filled, fillcolor=lightyellow];

    // Pre-MLIR optimization
    MIOpen_PreMLIR [label="MIOpen Graph IR (Pre-MLIR)", style=filled, fillcolor=gray];

    // Compilation layers
    ONNX_MLIR [label="ONNX-MLIR", style=filled, fillcolor=lightyellow];
    MLIR_GPU [label="MLIR GPU dialect", style=filled, fillcolor=lightyellow];
    LLVM_IR [style=filled, fillcolor=lightyellow];
    ROCm_BC [label="ROCm device libraries (.bc)", shape=oval, style=filled, fillcolor=lightgreen];
    GCN_IR [style=filled, fillcolor=gray];

    // Post-GCN optimization
    MIOpen_IR [label="MIOpen Graph IR (Post-GCN)", style=filled, fillcolor=gray];

    // Runtime layers
    ROCr_Runtime [label="ROCr Runtime", shape=oval, style=filled, fillcolor=lightgreen];

    // GPU Targets
    GPU [label="AMD GPU ISA", style=filled, fillcolor=lightblue];
 
   // Flow paths
    PyTorch_Model -> ONNX_Model;
    ONNX_Model -> MIOpen_PreMLIR -> ONNX_MLIR -> LLVM_IR;
    MLIR_TOSA -> MLIR_GPU -> LLVM_IR;

    LLVM_IR -> ROCm_BC;
    LLVM_IR -> GCN_IR;
    GCN_IR -> MIOpen_IR -> ROCr_Runtime -> GPU;

    // === Layering for better spacing ===
    { rank = same; MIOpen_PreMLIR; ONNX_MLIR }
    { rank = same; GCN_IR; MIOpen_IR }
}

Fig. 93 AMD IR Conversion Flow

  • ROCm is not just a compiler or driver — it includes a full runtime stack that enables AMD GPUs to execute compute workloads across HIP (Heterogeneous-compute Interface for Portability), OpenCL, and ML frameworks. It’s analogous to NVIDIA’s CUDA runtime but built around open standards like HSA (Heterogeneous System Architecture) [25] and LLVM.

  • MIOpen Graph IR takes different forms at different stages: the (Pre-MLIR) and (Post-GCN) representations differ in form and structure.

    • Developers interact with MIOpen via high-level APIs (e.g., miopenConvolutionForward) — not via direct IR manipulation.

    • While MIOpen itself is open source (GitHub repo), its graph IR format and transformation passes are internal.

ARM IR Conversion Flow

ARM supports both CPU/NPU deployment (e.g., Ethos-U/N) and GPU execution (e.g., Mali via Vulkan). The IR flow diverges depending on the target, as shown in Fig. 94.

digraph ARM_IR_Flow {
        rankdir=LR;

        node [shape=box];

        "ONNX / TFLite" [style=filled, fillcolor=lightyellow];
        TOSA [style=filled, fillcolor=lightyellow];
        "MLIR Dialects" [style=filled, fillcolor=lightgray];
        "LLVM IR" [style=filled, fillcolor=lightyellow];
        "ARM Codegen" [style=filled, shape=oval, fillcolor=white];
        "Ethos-N / Cortex-M" [style=filled, fillcolor=lightblue];
        "MLIR GPU Dialects" [style=filled, fillcolor=lightgray];
        SPIRV [style=filled, fillcolor=lightyellow];
        "Mali GPU (Vulkan)" [style=filled, fillcolor=lightblue];
        "Ethos-N Driver" [style=filled, shape=oval, fillcolor=lightgreen];
        Vulkan [style=filled, shape=oval, fillcolor=lightgreen];

        "ONNX / TFLite" -> TOSA;

        TOSA -> "MLIR Dialects" -> "LLVM IR" -> "ARM Codegen" ->
        "Ethos-N / Cortex-M";
        "ARM Codegen" -> "Ethos-N Driver";

        TOSA -> "MLIR GPU Dialects" -> SPIRV -> "Mali GPU (Vulkan)";
        SPIRV -> Vulkan;
    }

Fig. 94 ARM IR Conversion Flow

  • Node “Mali GPU (Vulkan)” is the SPIR-V compilation flow that was illustrated in the previous section.

  • Ethos-N is ARM’s NPU. Cortex-M is ARM’s CPU.

ARM Codegen generally emits instructions for CPU/NPU execution, but for certain NN operations (especially those requiring vendor-specific acceleration), it may generate function calls into the Ethos-N driver, which then orchestrates execution on the NPU.

✅ Common Case: Direct NPU/CPU Instruction Generation

  • For operations that are well-supported by the NPU or CPU, the codegen backend emits hardware-specific instructions or IR directly.

  • These are scheduled for execution on the CPU or passed to the Ethos-N via its driver stack.

⚙️ Special Case: Function Calls to Ethos-N Driver

  • For complex or fused neural network operations (e.g., custom activation functions, quantized convolutions, or optimized memory layouts), the codegen may emit calls (LLVM-IR `call`) to precompiled driver functions.

  • These function calls act as entry points into the Ethos-N runtime, which handles:

    • Memory management

    • Scheduling

    • Firmware-level execution

    • Hardware-specific optimizations

Imagination Technologies IR Conversion Flow

digraph Imagination_IR_Flow {
        rankdir=LR;
        node [shape=box];

        "ONNX / TFLite" [style=filled, fillcolor=lightyellow];
        TVM [style=filled, fillcolor=lightgreen];
        oneAPI [style=filled, fillcolor=lightgreen];
        OpenCL [style=filled, fillcolor=lightgreen];
        "E-Series Compiler IR" [style=filled, fillcolor=lightgray];
        "Burst Processor IR" [style=filled, fillcolor=lightgray];
        "Neural Core Execution" [style=filled, fillcolor=lightblue];

        "ONNX / TFLite" -> TVM -> "E-Series Compiler IR" -> "Burst Processor IR" -> "Neural Core Execution";
        "ONNX / TFLite" -> oneAPI -> "E-Series Compiler IR";
        "ONNX / TFLite" -> OpenCL -> "E-Series Compiler IR";
    }

Fig. 95 Imagination Technologies IR Conversion Flow

Comparison Summary

===========  =================  ====================  ======================  ======================
Vendor       High-Level IR      Mid-Level IR          Low-Level IR            Libraries / Runtimes
===========  =================  ====================  ======================  ======================
NVIDIA       ONNX, TensorRT IR  MLIR GPU Dialects     PTX → SASS              TensorRT
AMD          ONNX, MIOpen IR    MLIR Dialects         LLVM IR → GCN ISA       MIOpen, ROCm
ARM          ONNX, TFLite       TOSA, MLIR Dialects   LLVM IR / SPIR-V        Ethos-N Driver, Vulkan
Imagination  ONNX, TFLite       E-Series Compiler IR  Burst IR → Neural Core  OpenCL, TVM, oneAPI
===========  =================  ====================  ======================  ======================

Accelerate ML/DL on OpenCL/SYCL

_images/opencl_ml_graph.png

Fig. 96 Implement ML graph scheduler both on compiler and runtime

As shown in Fig. 96, the Device, such as a GPU or a CPU+NPU, is capable of running the entire ML graph. However, if the Device has only an NPU, then operations like Avg-Pool, which require CPU support, must run on the Host side. This introduces communication overhead between the Host and the Device.
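The Host/Device placement problem above can be sketched as follows; the op names and the `npuOps` capability set are illustrative assumptions, not a real NPU’s supported-op list:

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Where each op of the ML graph runs: on the Device (NPU) if supported,
// otherwise it falls back to the Host CPU (e.g., Avg-Pool above).
struct Placement { std::string op; bool onDevice; };

std::vector<Placement> placeOps(const std::vector<std::string>& graph,
                                const std::set<std::string>& npuOps) {
    std::vector<Placement> plan;
    for (const auto& op : graph)
        plan.push_back({op, npuOps.count(op) > 0});
    return plan;
}

// Count Host<->Device transitions: each one is a data transfer and thus
// communication overhead between the Host and the Device.
int transferCount(const std::vector<Placement>& plan) {
    int transfers = 0;
    for (std::size_t i = 1; i < plan.size(); ++i)
        if (plan[i].onDevice != plan[i - 1].onDevice) ++transfers;
    return transfers;
}
```

A scheduler that reorders or fuses ops to minimize `transferCount` is exactly the kind of scheduling software discussed next.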

Similar to OpenGL shaders, the “kernel” function may be compiled either online or offline and then sent to the GPU as a programmable function.

To run ML (Machine Learning) efficiently, all ML platforms for GPU/NPU implement scheduling software in both the graph compiler and the runtime. If OpenCL were extended to support ML graphs, then graph compilers such as TVM, or open-source runtimes, would have a chance to leverage this scheduling effort from programmers [26]. CUDA Graphs follows a similar idea [27] [28].
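The CUDA Graphs idea (record a fixed sequence of kernel launches once, then replay the whole graph with a single cheap submission) can be sketched in plain C++. `LaunchGraph` is a hypothetical stand-in for illustration, not the CUDA API:

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Minimal record/replay sketch: the per-launch scheduling cost is paid
// once at record time; replay submits the whole graph in one pass.
class LaunchGraph {
public:
    void record(std::function<void()> kernel) {
        kernels_.push_back(std::move(kernel));
    }
    void replay() {
        for (auto& k : kernels_) k();  // one submission for the whole graph
    }
private:
    std::vector<std::function<void()>> kernels_;
};
```

Usage: record the graph once, then call `replay()` every iteration of inference; the scheduling decisions are amortized across all replays.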

  • SYCL: Uses C++ templates to optimize and generate code for OpenCL and CUDA. It provides a consistent language, APIs, and ecosystem in which to write and tune code for different accelerator architectures: CPUs, GPUs, and FPGAs [29].

    • SYCL uses generic programming with templates and generic lambda functions to enable higher-level application software to be cleanly coded with optimized acceleration of kernel code across an extensive range of acceleration backend APIs, such as OpenCL and CUDA [30].
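A minimal, serial stand-in for SYCL’s templated `parallel_for` illustrates this style of generic programming; the `parallel_for` here is a plain loop for illustration, not the real SYCL API, which would dispatch the kernel to a device backend such as OpenCL or CUDA:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Generic dispatch: the kernel is a template parameter, so the same
// kernel source can be retargeted to different backends at compile time.
template <typename Kernel>
void parallel_for(std::size_t n, Kernel k) {
    for (std::size_t i = 0; i < n; ++i) k(i);  // serial "backend"
}

// SAXPY written once as a lambda-style kernel, in the SYCL idiom.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    parallel_for(y.size(), [&](std::size_t i) { y[i] = a * x[i] + y[i]; });
}
```

This is the same SAXPY loop used in the vector-processor example earlier, now expressed as a kernel passed to a generic dispatcher.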

_images/sycl.png

Fig. 97 SYCL = C++ templates and compiler for data-parallel AI applications on CPUs, GPUs, and FPGAs.

  • DPC++ (oneAPI Data Parallel C++) compiler: Based on SYCL, DPC++ can compile DPC++ code for both the CPU host and the GPU device. DPC++ is a language developed by Intel and may be adopted into standard C++. The GPU-side (kernel) code is written in C++ but does not support exception handling [31] [32].

    • Features of Kernel Code:

      • Not supported:

        Dynamic polymorphism, dynamic memory allocation (so no object management using the new or delete operators), static variables, function pointers, runtime type information (RTTI), and exception handling. No virtual member functions or variadic functions may be called from kernel code, and recursion is not allowed within kernel code.

      • Supported:

        Lambdas, operator overloading, templates, classes, and static polymorphism [33].
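A small plain-C++ sketch of kernel-style code that stays within these restrictions: templates, lambdas, operator overloading, and static polymorphism (CRTP instead of virtual functions), with no heap allocation, exceptions, or recursion. The names (`Activation`, `Relu`, `kernel_body`) are illustrative, not part of the DPC++ API:

```cpp
#include <cassert>

// Static polymorphism via CRTP: the activation is resolved at compile
// time, so no virtual member functions are needed (virtuals are not
// allowed in kernel code).
template <typename Derived>
struct Activation {
    float operator()(float x) const {  // operator overloading: supported
        return static_cast<const Derived&>(*this).apply(x);
    }
};

struct Relu : Activation<Relu> {
    float apply(float x) const { return x > 0.0f ? x : 0.0f; }
};

// A kernel-style body: templates and lambdas are supported; no new/delete,
// no RTTI, no exceptions, no recursion appear anywhere in this code.
template <typename Act>
float kernel_body(float x, const Act& act) {
    auto scale = [](float v) { return 2.0f * v; };  // lambdas: supported
    return scale(act(x));
}
```

Swapping `Relu` for another `Activation` subclass changes the kernel’s behavior entirely at compile time, which is how DPC++ kernels achieve flexibility without dynamic polymorphism.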