Appendix C: The concept of GPU compiler

Basicly CPU compiler is SISD (Single Instruction Single Data Architecture). The multimedia instructions in CPU are small scaled of SIMD (Single Instruction Multiple Data) for 4 or 16 elements while GPU is a large scaled of SIMD processor coloring millions of pixels of image in few micro seconds. Since the 2D or 3D graphic processing provides large opportunity in parallel data processing, GPU hardware usually composed of thousands of functional units in each core(grid) in N-Vidia processors.

The flow for 3D/2D graphic processing as the following diagram.

_images/opengl_flow.png

Fig. 33 OpenGL flow

The most of time for running OpenGL api is on GPU. Usually, CPU is a function call to GPU’s functions. This chapter is giving a concept for the flow above and focuses on shader compiler for GPU. Furthermore, explaining how GPU has taking more applications from CPU through GPGPU concept and related standards emerged.

3D modeling

Through creating 3D model with Triangles or Quads along on skin, the 3D model is created with polygon mesh [1] formed by all the vertices on the first image as follows,

_images/modeling1.png

Fig. 34 Creating 3D model and texturing

After the next smooth shading [1], the vertices and edge lines are covered with color (or remove edges), and model looks much more smooth [2]. Further, after texturing (texture mapping), the model looks real more [3].

To get to know how animation for a 3D modeling, please look video here [4]. In this series of video, you find the 3D modeling tools creating Java instead of C/C++ code calling OpenGL api and shaders. It’s because Java can call OpenGL api through a wrapper library [5].

Every CAD software manufacturer such as AutoDesk and Blender has their own proprietary format. To solve the problem of interoperability, neutral or open source formats were invented as intermediate formats for converting between two proprietary formats. Naturally, these formats have become hugely popular now. Two famous examples of neutral formats are STL (with a .STL extension) and COLLADA (with a .DAE extension). Here is the list, where the 3D file formats are marked with their type.

Table 41 3D file formats [6]

3D file format

Type

STL

Neutral

OBJ

ASCII variant is neutral, binary variant is proprietary

FBX

Proprietary

COLLADA

Neutral

3DS

Proprietary

IGES

Neutral

STEP

Neutral

VRML/X3D

Neutral

3D Rendering

3D rendering is the process of converting 3D models into 2D images on a computer [7]. The steps as the following Figure [8].

_images/rendering_pipeline.png

Fig. 35 Diagram of the Rendering Pipeline. The blue boxes are programmable shader stages.

For 2D animation, the model is created by 2D only (1 face only), so it only can be viewed from the same face of model. If you want to display different faces of model, multiple 2D models need to be created and switch these 2D models from face(flame) to face(flame) from time to time [9].

GLSL (GL Shader Language)

OpenGL is a standard for designing 2D/3D animation in computer graphic. To do animation well, OpenGL provides a lots of api(functions) call for graphic processing. The 3D model construction tools such as Maya, Blender, …, etc, only need to call this api to finish the 3D to 2D projecting function in computer. Any GPU hardware dependent code in these api provided by GPU manufacturer. An OpenGL program looks like the following,

Vertex shader

#version 330 core
layout (location = 0) in vec3 aPos; // the position variable has attribute position 0

out vec4 vertexColor; // specify a color output to the fragment shader

void main()
{
    gl_Position = vec4(aPos, 1.0); // see how we directly give a vec3 to vec4's constructor
    vertexColor = vec4(0.5, 0.0, 0.0, 1.0); // set the output variable to a dark-red color
}
Fragment shader

#version 330 core
out vec4 FragColor;

in vec4 vertexColor; // the input variable from the vertex shader (same name and same type)

void main()
{
    FragColor = computeColorOfThisPixel(...);
}

// openGl user program
int main(int argc, char ** argv)
{
  // init window, detect user input and do corresponding animation by calling opengl api
  ...
}

The last main() is programed by user obviously. Let’s explain what the first two main() work for. As you know, the OpenGL is a lots of api to let programmer display the 3D object into 2D computer screen explained from book of concept of computer graphic. 3D graphic model can set light and object texture by user firstly, and calculating the postion of each vertex secondly, then color for each pixel automatically by 3D software and GPU thirdly, finally display the color of each pixel in computer screen. But in order to let user/programmer add some special effect or decoration in coordinate for each vertex or in color for each pixel, OpenGL provides these two functions to do it. OpenGL uses fragment shader instead of pixel is : “Fragment shaders are a more accurate name for the same functionality as Pixel shaders. They aren’t pixels yet, since the output still has to past several tests (depth, alpha, stencil) as well as the fact that one may be using antialiasing, which renders one-fragment-to-one-pixel non-true [10]. Programmer is allowed to add their converting functions that compiler translate them into GPU instructions running on GPU processor. With these two shaders, new features have been added to allow for increased flexibility in the rendering pipeline at the vertex and fragment level [11]. Unlike the shaders example here [12], some converting functions for coordinate in vertex shader or for color in fragment shade are more complicated according the scenes of animation. Here is an example [13]. In wiki shading page [2], Gourand and Phong shading methods make the surface of object more smooth by glsl. Example glsl code of Gourand and Phong shading on OpenGL api are here [14]. Since the hardware of graphic card and software graphic driver can be replaced, the compiler is run on-line meaning driver will compile the shaders program when it is run at first time and kept in cache after compilation [15].

The shaders program is C-like syntax and can be compiled in few mini-seconds, add up this few mini-seconds of on-line compilation time in running OpenGL program is a good choice for dealing the cases of driver software or gpu hardware replacement [16].

In addition, OpenGL provides vertex buffer object (VBO) allowing vertex array data to be stored in high-performance graphics memory on the server side and promotes efficient data transfer [18] [17].

OpenGL Shader compiler

OpenGL standard is here [19]. The OpenGL is for desktop computer or server while the OpenGL ES is for embedded system [20]. Though shaders are only a small part of the whole OpenGL software/hardware system. It is still a large effort to finish the compiler implementation since there are lots of api need to be implemented. For example, there are 80 related texture APIs [21]. This implementation can be done by generating llvm extended intrinsic functions from shader parser of frontend compiler as well as llvm backend converting those intrinsic to gpu instructions as follows,

#version 320 es
uniform sampler2D x;
out vec4 FragColor;

void main()
{
    FragColor = texture(x, uv_2d, bias);
}

...
!1 = !{!"sampler_2d"}
!2 = !{i32 SAMPLER_2D} : SAMPLER_2D is integer value for sampler2D, for example: 0x0f02
; A named metadata.
!x_meta = !{!1, !2}

define void @main() #0 {
    ...
    %1 = @llvm.gpu0.texture(metadata !x_meta, %1, %2, %3); // %1: %sampler_2d, %2: %uv_2d, %3: %bias
    ...
}

...
   // gpu machine code
    sample2d_inst $1, $2, $3 // $1: %x, $2: %uv_2d, $3: %bias

About llvm intrinsic extended function, please refer this book here [22].

gvec4 texture(gsampler2D sampler, vec2 P, [float bias]);
_images/sampling_diagram.png

Fig. 36 Relationships between the texturing concept [23].

The Fig. 36 as above. The texture object is not bound directly into the shader (where the actual sampling takes place). Instead, it is bound to a ‘texture unit’ whose index is passed to the shader. So the shader reaches the texture object by going through the texture unit. There are usually multiple texture units available and the exact number depends on the capability of your graphic card [23]. A texture unit, also called a texture mapping unit (TMU) or a texture processing unit (TPU), is a hardware component in a GPU that does sampling operation. The argument sampler in texture function as above is sampler_2d index from ‘teuxture unit’ for texture object [23].

‘sampler uniform variable’:

There is a group of special uniform variables for that, according to the texture target: ‘sampler1D’, ‘sampler2D’, ‘sampler3D’, ‘samplerCube’, etc. You can create as many ‘sampler uniform variables’ as you want and assign the value of a texture unit to each one from the application. Whenever you call a sampling function on a ‘sampler uniform variable’ the corresponding texture unit (and texture object) will be used [23].

_images/sampling_diagram_binding.png

Fig. 37 Binding sampler variables [24].

As Fig. 37, the Java api gl.bindTexture binding ‘Texture Object’ to ‘Texture Unit’. The gl.getUniformLocation and gl.uniform1i associate ‘Texture Unit’ to ‘sampler uniform variables’.

gl.uniform1i(xLoc, 1): where 1 is ‘Texture Unit 1’, 2 is ‘Texture Unit 2’, …, etc [24].

The following figure depicts how driver read metadata from compiled glsl obj, OpenGL api associate ‘Sample Variable’ and gpu executing texture instruction.

_images/driverSamplerTable.png

Fig. 38 Associating Sampler Variables and gpu executing texture instruction

Explaining the detail steps for figure above as the following.

1. In order to let the ‘texture unit’ binding by driver, frontend compiler must pass the metadata of ‘sampler uniform variable’ (sampler_2d_var in this example) [27] to backend, and backend must allocate the metadata of ‘sampler uniform variable’ in the compiled binary file [25].

2. After gpu driver executing glsl on-line compiling, driver read this metadata from compiled binary file and maintain a table of {name, SamplerType} for each ‘sampler uniform variable’.

  1. Api,

xLoc = gl.getUniformLocation(prog, "x"); // prog: glsl program, xLoc

will get the location from the table for ‘sampler uniform variable’ x that driver created and set the memory address xSlot to xLoc.

SAMPLER_2D: is integer value for Sampler2D type.

  1. Api,

gl.uniform1i( xLoc, 1 );

will binding xLoc of ‘sampler uniform variable’ x to ‘Texture Unit 1’ by writing 1 to the glsl binary metadata location of ‘sampler uniform variable’ x as follows,

{xLoc, 1} : 1 is 'Texture Unit 1', xLoc is the location(memory address) of 'sampler uniform variable' x

This api will set the descriptor register of gpu with this {xLoc, 1} information [28].

  1. When executing the texture instructions from glsl binary file on gpu,

// gpu machine code
load $1, &xSlot;
sample2d_inst $1, $2, $3 // $1: %x, $2: %uv_2d, $3: %bias

the corresponding ‘Texture Unit 1’ on gpu will being executed through descriptor register of gpu {xLoc, 1} in this example since memory address xSlot includes the value of xLoc.

For instance, Nvidia texture instruction as follow,

tex.3d.v4.s32.s32  {r1,r2,r3,r4}, [tex_a, {f1,f2,f3,f4}];

Where tex_a is the texture memory address for ‘sampler uniform variable’ x, and the pixel of coordinates (x,y,z) is given by (f1,f2,f3) user input. The f4 is skipped for 3D texture.

Above tex.3d texture instruction load the calculated color of pixel (x,y.z) from texture image into GPRs (r1,r2,r3,r4)=(R,G,B,A). And fragment shader can re-calculate the color of this pixel with the color of this pixel at texture image [26].

If it is 1d texture instruction, the tex.1d as follows,

tex.1d.v4.s32.f32  {r1,r2,r3,r4}, [tex_a, {f1}];

Since ‘Texture Unit’ is limited hardware accelerator on gpu, OpenGL providing api to user program for binding ‘Texture Unit’ to ‘Sampler Variables’. As a result, user program is allowed doing load balance in using ‘Texture Unit’ through OpenGL api without recompiling glsl. Fast texture sampling is one of the key requirements for good GPU performance [24].

In addition to api for binding texture, OpenGL provides glTexParameteri api to do Texture Wrapping [29]. Furthmore the texture instruction for some gpu may including S# T# values in operands. Same with associating ‘Sampler Variables’ to ‘Texture Unit’, S# and T# are location of memory associated to Texture Wrapping descriptor registers allowing user program to change Wrapping option without re-compiling glsl.

Even glsl frontend compiler always expanding function call into inline function as well as llvm intrinsic extended function providing an easy way to do code generation through llvm td (Target Description) file written, GPU backend compiler is still a little complex than CPU backend. (But when considering the effort in frontend compier such as clang, or other toolchain such as linker and gdb/lldb, of course, CPU compiler is not easier than GPU compiler.)

Here is the software stack of 3D graphic system for OpenGL in linux [30]. And mesa open source website is here [31].

Architecture

The leading GPU architecture of Nvidia’s gpu is as the following figures.

_images/grid.png

Fig. 39 core(grid) in Nvidia gpu (figure from book [32])

_images/SIMD-processors.png

Fig. 40 SIMD processors (figure from book [33])

_images/threads-lanes.png

Fig. 41 threads and lanes in gpu (figure from book [34])

_images/memory.png

Fig. 42 core(grid) in Nvidia’s gpu (figure from book [35])

General purpose GPU

Since GLSL shaders provide a general way for writing C code in them, if applying a software frame work instead of OpenGL api, then the system can run some data parallel computation on GPU for speeding up and even get CPU and GPU executing simultaneously. Furthmore, any language that allows the code running on the CPU to poll a GPU shader for return values, can create a GPGPU framework [40]. The following is a CUDA example to run large data in array on GPU [41] as follows,

__global__
void saxpy(int n, float a, float * x, float * y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

int main(void)
{
  ...
  cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);
  ...
  cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);
  ...
}

In the programming example saxpy() above,

  • blockIdx is index of ThreadBlock

  • threadIdx is index of SIMD Thread

  • blockDim is the number of total Thread Blocks in a Grid

A GPU may has the HW structure and handle the subset of y[]=a*x[]+y[] array-calculation as follows,

  • A Grid: has 16 Thread Blocks (Cores).

  • A Core: has 16 Threads (Warps, Cuda Threads).

  • A Thread: has 16 Lanes (vector instruction with processing 16-elements).

Table 42 Map (Core,Thread) to saxpy

Thread-0

Thread-1

Thread-15

Core-0

y[0..31] = a * x[0..31] * y[0..31]

y[32..63] = a * x[32..63] + y[32..63]

y[480..511] = a * x[480..511] + y[480..511]

Core-15

y[7680..7711] = a * …

y[8160..8191] = a * x[8160..8191] + y[8160..8191]

  • Grid is Vectorizable Loop [36].

  • Thread Block <-> SIMD Processor (Core). Warp has it’s own PC and TLR (Thread Level Registers). Warp may map to one whole function or part of function. Compiler and run time may assign them to the same Warp or different Warps [37].

  • SIMD Processors are full processors with separate PCs and are programmed using threads [38]. As Fig. 40, it assigns 16 Thread blocks to 16 SIMD Processors.

  • As Fig. 39, the maximum number of SIMD Threads that can execute simultaneously per Thread Block (SIMD Processor) is 32 for the later Fermi-generation GPUs. Each SIMD Thread has 32 elements run as Fig. 41 on 16 SIMD lanes (number of functional units just same as in vector processor). So it takes 2 clock cycles to complete [39], also known as “ping pong” cycles.

  • Each thread handle 32 elements computing, assuming 4 registers for 1 element, then there are 4*32=128 Thread Level Registers, TLR, occupied in a thread to support the SIMT computing. So, assume a GPU architecture allocating 256 TLR to a Thread (Warp), then it has sufficient TLR for more complicated statement, such as a*x[i]+b*[y]+c*[z] without spilling in register allocation. 16 lanes share the 256 TLR.

  • Each Thread Block (Core/Warp) has 16 threads, so there are 16*256 = 4K TLR in a Core.

The main() run on CPU while the saxpy() run on GPU. Through cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost, CPU can pass data in x and y arrays to GPU and get result from GPU to y array. Since both of these memory transfers trigger the DMA functions without CPU operation, it may speed up by running both CPU/GPU with their data in their own cache repectively. After DMA memcpy from cpu’s memory to gpu’s, gpu operates the whole loop of matrix operation for “y[] = a*x[]+y[];” instructions with one Grid. Furthermore liking vector processor, gpu provides Vector Mask Registers to Handling IF Statements in Vector Loops as the following code [42],

for(i=0;i<64; i=i+1)
  if (X[i] != 0)
    X[i] = X[i] – Y[i];
LV V1,Rx         ;load vector X into V1
LV V2,Ry         ;load vector Y
L.D F0,#0        ;load FP zero into F0
SNEVS.D V1,F0    ;sets VM(i) to 1 if V1(i)!=F0
SUBVV.D V1,V1,V2 ;subtract under vector mask
SV V1,Rx         ;store the result in X

GPU persues throughput from SIMD application. Can hide cache-miss latence from SMT. As result GPU may hasn’t L2 and L3 like CPU for each core since GPU is highly latency-tolerant multithreading for data parallel application [43]. DMA memcpy map the data in cpu memory to each l1 cache of core on gpu memory. Many gpu provides operations scatter and gather to access DRAM data for stream processing [44] [40] [45].

When the GPU function is dense computation in array such as MPEG4 encoder or deep learning for tuning weights, it may get much speed up [46]. However when GPU function is matrix addition and CPU will idle for waiting GPU’s result. It may slow down than doing matrix addition by CPU only. Arithmetic intensity is defined as the number of operations performed per word of memory transferred. It is important for GPGPU applications to have high arithmetic intensity else the memory access latency will limit computational speedup [40].

Wiki here [47] includes speedup applications for gpu as follows:

General Purpose Computing on GPU, has found its way into fields as diverse as machine learning, oil exploration, scientific image processing, linear algebra, statistics, 3D reconstruction and even stock options pricing determination. In addition, section “GPU accelerated video decoding and encoding” for video compressing [47] gives the more applications for GPU acceleration.

Table 43 The differences for speedup in architecture of CPU and GPU

Item

CPU

GPU

Application

Non-data parallel

Data parallel

Architecture

SISD, small vector

Large SIMD

Cache

Smaller and faster

Larger and slower #A

ILP

Pipeline

Pipeline

Superscalar, SMT

SIMT

Super-pipeline

Branch

Conditional-instructions

Mask & conditional-instructions

Note

GPU-Cache

In theory for data parallel application in GPU’s SMT, GPU can schedule more threads and pursues throughput rather speedup for one single thread as SISD in CPU. However in reality, GPU provides small L1 cache like CPU’s and fill the cache-miss with scheduline another thread. So, GPU may has no L2 and L3 while CPU has deep level of caches.

Vulkan and spir-v

Though OpenGL api existed in higher level with many advantages from sections above, sometimes it cannot compete in efficience with direct3D providing lower levels api for operating memory by user program [48]. Vulkan api is lower level’s C/C++ api to fill the gap allowing user program to do these things in OpenGL to compete against Microsoft direct3D. Here is an example [49]. Meanwhile glsl is C-like language. The vulkan infrastructure provides tool, glslangValidator [50], to compile glsl into an Intermediate Representation Form (IR) called spir-v off-line. As a result, it saves part of compilation time from glsl to gpu instructions on-line since spir-v is an IR of level closing to llvm IR [51]. In addition, vulkan api reduces gpu drivers efforts in optimization and code generation [48]. These standards provide user programmer option to using vulkan/spir-v instead of OpenGL/glsl, and allow them pre-compiling glsl into spir-v off-line to saving part of on-line compilation time.

With vulkan and spir-v standard, the gpu can be used in OpenCL for Parallel Programming of Heterogeneous Systems [52] [53]. Similar with Cuda, a OpenCL example for fast Fourier transform (FFT) is here [54]. Once OpenCL grows into a popular standard when more computer languages or framework supporting OpenCL language, GPU will take more jobs from CPU [55].

Now, you find llvm IR expanding from cpu to gpu becoming influentially more and more. And actually, llvm IR expanding from version 3.1 util now as I can feel.

Accelerate ML/DL on OpenCL/SYCL

_images/opencl_ml_graph.png

Fig. 43 Implement ML graph scheduler both on compiler and runtime

As above figure, the Device of GPU or CPU+NPU is able to run the whole ML graph. However if the Device has NPU only, then the CPU operation such as Avg-Pool has to run on Host side which add communication cost between Host and Device.

In order to run ML (Machine Learning) efficiently, all platforms for ML on GPU/NPU implement scheduling SW both on graph compiler and runtime. If OpenCL can extend to support ML graph, then graph compiler such as TVM or Runtime from Open Source has chance to leverage the effort of scheduling SW from programmers [56]. Cuda graph is an idea like this [57] [58] .