The Concept of GPU Compiler¶

At the most basic level, each CPU core is a SISD (Single Instruction, Single Data) architecture. The multimedia instructions in CPUs are a small-scale form of SIMD (Single Instruction, Multiple Data), while GPUs are large-scale SIMD processors, capable of coloring millions of image pixels in just a few milliseconds.

Since 2D and 3D graphics processing offers great potential for parallel data processing, GPU hardware typically includes thousands to tens of thousands of functional units per chip, as seen in products by NVIDIA and other manufacturers.

This chapter provides an overview of how 3D animation is created and executed on a CPU+GPU system. Following that, it introduces GPU compilers and hardware features relevant to graphics applications. Finally, it explains how GPUs have taken on more computational tasks traditionally handled by CPUs, through the GPGPU (General-Purpose computing on Graphics Processing Units) concept and the emergence of related standards.

Website: Basic Theory of 3D Graphics with OpenGL [1].

Concept in Graphics and Systems ¶

3D Modeling ¶

A 3D model is built by covering its surface with triangles or quads, which together form a polygon mesh [2]. This mesh consists of all the vertices shown in the first image of Fig. 61.

_images/modeling1.png — Fig. 61 Creating 3D model and texturing¶

After applying smooth shading [2], the vertices and edge lines are covered with color (in other words, the edges are visually removed; edges never actually have black outlines). As a result, the model appears much smoother [3].

Furthermore, after texturing (texture mapping), the model looks even more realistic [4].

To understand how animation works for a 3D model, please refer to the video here [5]. According to the video on skeleton animation, joints are positioned at different poses and assigned timing (keyframes), as illustrated in Fig. 62.

_images/animation.png — Fig. 62 Set time point at keyframes¶

In this series of videos, you will see 3D modeling tools generating Java code instead of C/C++ code calling OpenGL API and shaders. This is because Java can call OpenGL API through a wrapper library [6].

Every CAD software manufacturer, such as Autodesk and Blender, has its own proprietary format. To solve interoperability problems, neutral or open source formats were created as intermediate formats for converting between proprietary formats.

Naturally, these neutral formats have become very popular. Two famous examples are STL (with a .STL extension) and COLLADA (with a .DAE extension). Below is a list showing 3D file formats along with their types.

Table 44 3D file formats [7]¶
3D file format	Type
STL	Neutral
OBJ	ASCII variant is neutral, binary variant is proprietary
FBX	Proprietary
COLLADA	Neutral
3DS	Proprietary
IGES	Neutral
STEP	Neutral
VRML/X3D	Neutral

A 3D file can store four key features: the model’s geometry, the model’s surface texture, scene details, and animation of the model [7]. Not every project takes advantage of all four features, and not every file format supports all four.

3D printing applications do not need animation. CAD and CAM applications, such as airplane design, do not need scene details.

DAE (Collada) appeared in the video animation above. Collada files belong to a neutral format used heavily in the video game and film industries. It’s managed by the non-profit technology consortium, the Khronos Group.

The file extension for the Collada format is .dae. The Collada format stores data using the XML mark-up language.

The original intention behind the Collada format was to become a standard among 3D file formats. Indeed, in 2013, it was adopted by ISO as a publicly available specification, ISO/PAS 17506. As a result, many 3D modeling programs support the Collada format.

That said, the consensus is that the Collada format hasn’t kept up with the times. It was once used heavily as an interchange format for Autodesk Max/Maya in film production, but the industry has now shifted more towards OBJ, FBX, and Alembic [7].

Graphic HW and SW Stack ¶

https://en.wikipedia.org/wiki/Free_and_open-source_graphics_device_driver

The block diagram of the Graphics Processing Unit (GPU) is shown in Fig. 63.

_images/gpu-block-diagram.png — Fig. 63 Components of a GPU: GPU has accelerated video decoding and encoding [8]¶

The roles of the CPU and GPU in graphic animation are illustrated in Fig. 64.

_images/graphic-cpu-gpu.png — Fig. 64 OpenGL and Vulkan are both rendering APIs. In both cases, the GPU executes shaders, while the CPU executes everything else [9].¶

The GPU cannot directly read user input from, say, a keyboard, mouse, or gamepad, nor can it play audio or load files from a hard drive. For this reason, the GPU cannot handle the animation work on its own [10].
A graphics driver consists of an implementation of the OpenGL state machine and a compilation stack to compile the shaders into the GPU’s machine language. This compilation, as well as pretty much anything else, is executed on the CPU, then the compiled shaders are sent to the GPU and are executed by it. (SDL = Simple DirectMedia Layer) [11].

_images/graphic-gpu-csf.png — Fig. 65 MCU and specific HW circuits to speed up the processing of CSF (Command Stream Frontend) [12].¶

The GPU driver writes commands and data from the CPU to the GPU’s memory through PCIe. In the GPU’s memory, these commands form the command stream handled by the Command Stream Frontend (CSF). A GPU chip includes tens of SIMD processors (cores). To speed up the GPU driver’s processing, the command stream is designed in a simpler form. As a result, the GPU chip includes an MCU (microcontroller unit) and specific hardware that translate the command stream into individual data structures for each SIMD processor to execute, as shown in Fig. 65. The MCU’s firmware is usually updated by the MCU itself.

The driver runs on the CPU side as shown in Fig. 66. The OpenGL API eventually calls the driver’s functions, and the driver executes these functions by issuing commands to the GPU hardware and/or sending data to the GPU.

Even so, the GPU’s rendering work, which uses data such as 3D vertices and colors sent from the CPU and stored in GPU or shared memory, consumes more computing power than the work done on the CPU.

digraph G {
rankdir=LR;

compound=true;
node [shape=record];
subgraph cluster_cpu {
label = "CPU (Client)";
CPU_SW [label=" 3D Model | JAVA | JOGL | { OpenGL API | Shaders \n (builtin-functions)} | <f1> Driver"];
}
subgraph cluster_gpu {
label = "GPU HW (Server)"
GPU_SW [label="<f1> 3D Rendering-pipeline \ndescribed in next section"];
}
CPU_SW:f1 -> GPU_SW:f1 [label=" VAO, texture, ..., from 3D model, \n shader-executable-code"];

// label = "Graphic SW Stack";
} — Fig. 66 Graphic SW Stack¶

According to the previous section, after the user creates a skeleton and skin for each model and sets keyframe times using a 3D modeling tool, the tool can either generate Java code that calls JOGL (Java OpenGL) [6] or generate OpenGL API calls directly. The frame data can be calculated by interpolating between keyframes.
As described above, for each animation frame, the client (CPU) program sets the new positions of objects (vertices) and colors while the server (driver and GPU) performs the 3D-to-2D rendering. Higher-level libraries and frameworks on top of OpenGL provide animation frameworks and tools to generate OpenGL APIs and shaders from 3D models.
Shaders may call built-in functions written in Compute Shaders, SPIR-V, or LLVM-IR. LLVM libclc is a project for OpenCL built-in functions, which can also be used in OpenGL [13]. Like CPU built-ins, new GPU ISAs or architectures must implement their own built-ins or port them from open source projects like libclc.
The 3D modeling program (on the CPU) computes the animation, generating each frame between keyframes (poses), while the GPU executes the rendering pipeline from each frame down to each pixel’s value.
These frame data are stored in the form of VAOs (Vertex Array Objects) in OpenGL. This will be explained in a later section: OpenGL.
Additionally, OpenGL provides VBOs (Vertex Buffer Objects), which allow vertex array data to be stored in high-performance graphics memory on the server side and enable efficient data transfer [14] [15].
3D animation software provides many built-in shaders. Programmers can also write their own shaders for use in game engines.
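
The keyframe interpolation mentioned above can be sketched in C++ as follows. This is only a minimal sketch, assuming linear interpolation of a single joint angle; the names `Keyframe` and `interpolate` are hypothetical and not taken from any tool, and real animation software typically interpolates full transforms rather than one angle.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical keyframe: a joint angle (degrees) fixed at a time (seconds).
struct Keyframe {
    double time;
    double angle;
};

// Linearly interpolate the joint angle at time t.
// Assumes keys is non-empty and sorted by strictly increasing time.
double interpolate(const std::vector<Keyframe>& keys, double t) {
    assert(!keys.empty());
    if (t <= keys.front().time) return keys.front().angle;
    if (t >= keys.back().time) return keys.back().angle;
    for (std::size_t i = 1; i < keys.size(); ++i) {
        if (t <= keys[i].time) {
            const Keyframe& a = keys[i - 1];
            const Keyframe& b = keys[i];
            double u = (t - a.time) / (b.time - a.time);  // 0 at a, 1 at b
            return a.angle + u * (b.angle - a.angle);
        }
    }
    return keys.back().angle;  // not reached for sorted input
}
```

With keyframes (0 s, 0°), (1 s, 90°) and (3 s, 30°), the frame at 0.5 s gets an angle of 45° and the frame at 2 s gets 60°. The CPU would evaluate such a function once per frame and then send the resulting vertex data to the GPU.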

The flow for 3D/2D graphic processing is shown in Fig. 67.

digraph G {
rankdir=LR;

compound=true;
node [shape=record];
subgraph cluster_3d {
label = "3D/2D modeling software";
subgraph cluster_code {
label = "3D/2D's code: engine, lib, shader, ...";
Api [label="<g> Generated Code | <a> OpenGL API | lib | <s> Shaders (3D animation's shaders \n or programmer writing shaders)"];
}
}
subgraph cluster_driver {
label = "Driver"
Compiler [label="On-line Compiler"];
Obj [label="obj"];
Linker [label="On-line binding (Linker)"];
Exe [label="exe"];
}
Api:a -> Obj [lhead ="cluster_driver"];
Api:s -> Compiler;
Compiler -> Obj -> Linker -> Exe;
Exe -> GPU;
Exe -> CPU [ltail ="cluster_driver"];

// label = "OpenGL Flow";
} — Fig. 67 OpenGL Flow¶

VSync

No tearing occurs when the GPU renders each frame within one display refresh period:
the GPU finishes early and, with VSync, waits for the display, so both run at the same rate, as shown below.

              A    B

GPU      | ----| ----|

Display  |-----|-----|

            B      A

Tearing occurs when the GPU takes slightly longer than one display refresh period to render
a frame, so the buffer swap lands in the middle of a refresh, as shown below.

              A

GPU      | -----|

Display  |-----|-----|

            B      A

To avoid tearing in this case, VSync makes the GPU present frames at half the refresh rate
of the display, as shown below.

              A          B

GPU      | -----|    | -----|

Display  |-----|-----|-----|-----|

            B      B    A     A

Double Buffering

While the display is reading from the frame buffer to display the current frame, we might be updating its contents for the next frame (not necessarily in raster-scan manner). This would result in so-called tearing, in which the screen shows parts of the old frame and parts of the new frame. This can be resolved by double buffering. Instead of using a single frame buffer, a modern GPU uses two of them: a front buffer and a back buffer. The display reads from the front buffer, while we write the next frame to the back buffer. When we finish, we signal the GPU to swap the front and back buffers (known as buffer swap or page flip).
VSync

Double buffering alone does not solve the entire problem, as the buffer swap might occur at an inappropriate time, for example, while the display is in the middle of displaying the old frame. This is resolved via the so-called vertical synchronization (or VSync) at the end of the raster-scan. When we signal to the GPU to do a buffer swap, the GPU will wait till the next VSync to perform the actual swap, after the entire current frame is displayed.

The text diagrams above illustrate this. The most important point is that when the VSync buffer swap is enabled, you cannot present frames faster than the refresh rate of the display. If the GPU is capable of producing higher frame rates than the display’s refresh rate, it is held to the display’s rate and no tearing occurs. If the GPU’s frame rate is the same as or lower than the display’s and your application refreshes at a fixed rate, the resulting frame rate is likely to be an integral fraction of the display’s refresh rate, i.e., 1/2, 1/3, 1/4, etc. Without VSync, the swap would cause tearing [1].
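
The integral-fraction behavior can be illustrated with a small C++ sketch. It assumes plain double buffering, a constant render time per frame, and a GPU that idles until the swap at the next VSync; the function name `vsyncFrameRate` is hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Frame rate seen on screen when VSync is on and every frame takes
// renderMs milliseconds to render, on a display refreshing at refreshHz.
// Simplifying assumption: the GPU waits for the next VSync before
// starting the following frame.
double vsyncFrameRate(double refreshHz, double renderMs) {
    assert(refreshHz > 0.0 && renderMs > 0.0);
    double periodMs = 1000.0 / refreshHz;               // one refresh interval
    double intervals = std::ceil(renderMs / periodMs);  // refreshes used per frame
    return refreshHz / intervals;
}
```

On a 60 Hz display, a 10 ms frame is shown at 60 frames per second, a 20 ms frame at 30 (1/2), and a 40 ms frame at 20 (1/3), matching the text diagrams above.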
NVIDIA G-SYNC and AMD FreeSync

If both the monitor and the graphics card in your computer support NVIDIA G-SYNC, you’re in luck. With this technology, a special chip in the display communicates with the graphics card. This lets the monitor vary its refresh rate to match the frame rate of the NVIDIA GTX graphics card, up to the maximum refresh rate of the display. This means that frames are displayed as soon as they are rendered by the GPU, eliminating screen tearing and reducing stutter whether the frame rate is higher or lower than the refresh rate of the display. This makes it well suited to situations where the frame rate varies, which happens a lot when gaming. Today, you can even find G-SYNC technology in gaming laptops!

AMD has a similar solution called FreeSync. However, it doesn’t require a proprietary chip in the monitor. In FreeSync, the AMD Radeon driver and the display firmware handle the communication. Generally, FreeSync monitors are less expensive than their G-SYNC counterparts, but gamers generally prefer G-SYNC over FreeSync, as the latter may cause ghosting, where old images leave behind artifacts [16].

Basic geometry in computer graphics ¶

This section introduces the basic geometry math used in computer graphics. The complete concepts can be found in the book Computer Graphics: Principles and Practice, 3rd Edition, by John F. Hughes et al. However, the book runs to over a thousand pages; it is very comprehensive, and understanding all the details may take considerable time.

Color ¶

Additive colors in light are shown in Fig. 69 [17] [18].
In the case of paints, mixing works subtractively: each added pigment absorbs more light, so mixing RGB paints produces a gray shade rather than white [19].

Note

Additive colors

I know it doesn’t match human intuition. However, RGB colors in light add up to white light, while mixing RGB paints results in light gray paint. This makes sense because light has no shade. This result stems from the way human eyes perceive color. Without light, no color can be sensed by the eyes.

Computer engineers should understand that exploring the underlying reasons falls into the realms of physics or the biology of the human eye structure.

Transformation ¶

Objects (Triangle/Quad) can be moved in 2D/3D using matrix representation, as explained in this wiki page [20].

The rotation matrix used is derived from another wiki page [21].

Every computer graphics book covers the transformation of objects and their positions in space. Chapter 4 of the Blue Book, OpenGL SuperBible, 7th Edition, provides a short but useful 40-page description of transformation concepts. It is good material for understanding the basics.

The following Quaternion Product (Hamilton product) is from the wiki [22] since it is not covered in the book.

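\[\mathbf{i}\mathbf{j} = -\mathbf{j}\mathbf{i} = \mathbf{k}, \quad \mathbf{j}\mathbf{k} = -\mathbf{k}\mathbf{j} = \mathbf{i}, \quad \mathbf{k}\mathbf{i} = -\mathbf{i}\mathbf{k} = \mathbf{j}.\]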
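
As a small illustration of these rules, the following C++ sketch implements the Hamilton product for a quaternion stored as (w, x, y, z) and checks the basis identities above. The type `Quat` and the helper names are hypothetical and only serve this example.

```cpp
#include <cassert>

// Quaternion w + x*i + y*j + z*k.
struct Quat {
    double w, x, y, z;
};

// Hamilton product a * b (not commutative).
Quat hamilton(const Quat& a, const Quat& b) {
    return {
        a.w * b.w - a.x * b.x - a.y * b.y - a.z * b.z,
        a.w * b.x + a.x * b.w + a.y * b.z - a.z * b.y,
        a.w * b.y - a.x * b.z + a.y * b.w + a.z * b.x,
        a.w * b.z + a.x * b.y - a.y * b.x + a.z * b.w
    };
}

bool same(const Quat& a, const Quat& b) {
    return a.w == b.w && a.x == b.x && a.y == b.y && a.z == b.z;
}

// Checks ij = k, jk = i, ki = j and the sign flip ji = -k.
void checkBasisRules() {
    const Quat i{0.0, 1.0, 0.0, 0.0};
    const Quat j{0.0, 0.0, 1.0, 0.0};
    const Quat k{0.0, 0.0, 0.0, 1.0};
    assert(same(hamilton(i, j), k));
    assert(same(hamilton(j, k), i));
    assert(same(hamilton(k, i), j));
    assert(same(hamilton(j, i), Quat{0.0, 0.0, 0.0, -1.0}));
}
```

Because the product is not commutative, the order in which rotations are combined matters, which is exactly what the sign change between ij and ji expresses.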

_images/trans-steps.png — Fig. 70 Coordinates Transform Pipeline [1]¶

Details for Fig. 70 can be found on the website [1].

Projection ¶

Only objects within the view frustum (the truncated pyramid between the near and far planes) are projected to 2D in perspective projection.

Perspective and orthographic projections (used in CAD tools) from 3D to 2D can be represented by transformation matrices as described in the previous section [23].
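
As a sketch of the matrix form, assume a symmetric view frustum with the near plane at distance \(n\), the far plane at distance \(f\), and half-width \(r\) and half-height \(t\) on the near plane (symbols introduced here for illustration only). An OpenGL-style perspective matrix then maps eye coordinates \((x_e, y_e, z_e, 1)\) to clip coordinates:

\[\begin{split}\begin{bmatrix} x_c \\ y_c \\ z_c \\ w_c \end{bmatrix} = \begin{bmatrix} \frac{n}{r} & 0 & 0 & 0 \\ 0 & \frac{n}{t} & 0 & 0 \\ 0 & 0 & -\frac{f+n}{f-n} & -\frac{2fn}{f-n} \\ 0 & 0 & -1 & 0 \end{bmatrix} \begin{bmatrix} x_e \\ y_e \\ z_e \\ 1 \end{bmatrix}\end{split}\]

Dividing by \(w_c = -z_e\) gives normalized device coordinates. As a check, a point on the near plane (\(z_e = -n\)) yields \(z_c / w_c = -1\), and a point on the far plane (\(z_e = -f\)) yields \(z_c / w_c = 1\).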

Cross product ¶

Both triangles and quads are polygons. So, objects can be formed with polygons in both 2D and 3D. The transformation in 2D or 3D is well covered in almost every computer graphics book. This section introduces the most important concept and method for determining inner and outer planes. Then, a point or object can be checked for visibility during 2D or 3D rendering.

The area of any polygon can be calculated by dividing it into triangles or quads. The area of a triangle or quad can be calculated using the cross product in 3D.

The cross product in 3D is defined by the formula and can be represented with matrix notation, as shown here [24].

\[\mathbf a \times \mathbf b = \Vert \mathbf a \Vert \Vert \mathbf b \Vert \sin(\Theta) \, \mathbf n\]

\[\begin{split}\mathbf a \times \mathbf b = \begin{vmatrix} \mathbf i & \mathbf j& \mathbf k\\ a_1& a_2& a_3\\ b_1& b_2& b_3 \end{vmatrix}\end{split}\]

The cross product in 2D is defined by a formula and can be represented with matrix notation, as proven here [25] [26].

\[\mathbf a \times \mathbf b = \Vert \mathbf a \Vert \Vert \mathbf b \Vert \sin(\Theta)\]

\[\begin{split}\begin{vmatrix} \mathbf i & \mathbf j& \mathbf k\\ a_1& a_2& 0\\ b_1& b_2& 0 \end{vmatrix} = (a_1b_2 - a_2b_1) \, \mathbf k, \qquad \mathbf a \times \mathbf b = \begin{vmatrix} a_1& a_2 \\ b_1& b_2 \end{vmatrix} = a_1b_2 - a_2b_1\end{split}\]

After the above matrix form is proven, the antisymmetry property may be demonstrated as follows:

\[\begin{split}\mathbf a \times \mathbf b &= \begin{vmatrix} a_1& a_2 \\ b_1& b_2 \end{vmatrix} = a_1b_2 - a_2b_1 \\ &= (-b_1)a_2 - (-b_2)a_1 = \begin{vmatrix} -b_1& -b_2 \\ a_1& a_2 \end{vmatrix} = (-\mathbf b) \times \mathbf a = -(\mathbf b \times \mathbf a)\end{split}\]
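
The 2D cross product and its use for polygon area can be sketched in C++ as follows. This is a minimal sketch, assuming a simple (non-self-intersecting) polygon; the names `Vec2`, `cross` and `signedArea` are hypothetical. Summing the cross products of consecutive vertices is equivalent to splitting the polygon into triangles that share the origin.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec2 {
    double x, y;
};

// 2D cross product: a1*b2 - a2*b1 (a scalar).
double cross(const Vec2& a, const Vec2& b) {
    return a.x * b.y - a.y * b.x;
}

// Signed area of a simple polygon: half the sum of the cross products
// of consecutive vertices. Positive for counter-clockwise vertex order,
// negative for clockwise order.
double signedArea(const std::vector<Vec2>& p) {
    assert(p.size() >= 3);
    double sum = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i) {
        sum += cross(p[i], p[(i + 1) % p.size()]);
    }
    return 0.5 * sum;
}
```

For a = (3, 1) and b = (2, 4), `cross(a, b)` is 10 and `cross(b, a)` is -10, which is the antisymmetry shown above. A 2 by 2 square listed counter-clockwise has a signed area of 4; listing the same vertices clockwise gives -4.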

In 2D, any two points \(\text{from } P_i \text{ to } P_{i+1}\) can form a vector and determine the inner or outer side.

For example, as shown in Fig. 71, the angle \(\Theta\) from \(P_iP_{i+1}\) to \(P_iP'_{i+1}\) is \(180^\circ\).

Using the right-hand rule and counter-clockwise order, any vector \(P_iQ\) between \(P_iP_{i+1}\) and \(P_iP'_{i+1}\), with angle \(\theta\) such that \(0^\circ < \theta < 180^\circ\), indicates the inward direction.

_images/inward-edge-normals.png — Fig. 71 Inward edge normals¶

_images/2d-vector-inward.png — Fig. 72 Inward and outward in 2D for a vector.¶

Based on this observation, the rule for inward and outward vectors is shown in Fig. 71. Facing the same direction as a specific vector, the left side is inward and the right side is outward, as shown in Fig. 72.

For each edge from \(P_i\) to \(P_{i+1}\), the inward edge normal is the vector \(\times v_i\) and the outward edge normal is \(- \times v_i\), where \(\times v_i\) denotes the 2D cross product applied to the single edge vector \(v_i\), as shown in Fig. 71.

A polygon can be created from a set of vertices. Suppose \((P_0, P_1, ..., P_n)\) defines a polygon. The line segments \(P_0P_1, P_1P_2\), etc., are the polygon’s edges. The vectors \(v_0 = P_1 - P_0, v_1 = P_2 - P_1, ..., v_n = P_0 - P_n\) represent those edges.

Using counter-clockwise ordering, the left side is inward. Thus, the inward region of a polygon can be determined.

For a convex polygon with vertices listed in counter-clockwise order, the inward edge normals point toward the interior of the polygon, and the outward edge normals point toward the unbounded exterior. This matches our usual intuition.

However, if the polygon vertices are listed in clockwise order, the interior and exterior definitions are reversed.

This cross product has an important property: going from \(v\) to \(\times v\) involves a 90° rotation in the same direction as the rotation from the positive x-axis to the positive y-axis.
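
The 90° rotation and the inward-side rule can be sketched in C++ as follows. This is a minimal sketch, assuming a convex polygon with counter-clockwise vertices; `perp` stands for the single-vector cross product \(\times v\), and all names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec2 {
    double x, y;
};

// Cross product of a single 2D vector: v rotated 90 degrees
// counter-clockwise (from +x toward +y).
Vec2 perp(const Vec2& v) {
    return {-v.y, v.x};
}

double dot(const Vec2& a, const Vec2& b) {
    return a.x * b.x + a.y * b.y;
}

// True if q lies strictly inside a convex polygon whose vertices are
// listed in counter-clockwise order: q must be on the inward (left)
// side of every edge.
bool insideConvex(const std::vector<Vec2>& p, const Vec2& q) {
    assert(p.size() >= 3);
    for (std::size_t i = 0; i < p.size(); ++i) {
        const Vec2& a = p[i];
        const Vec2& b = p[(i + 1) % p.size()];
        Vec2 inward = perp({b.x - a.x, b.y - a.y});  // inward edge normal
        if (dot(inward, {q.x - a.x, q.y - a.y}) <= 0.0) return false;
    }
    return true;
}
```

For the triangle (0, 0), (4, 0), (0, 4), the point (1, 1) is on the left of all three edges and is reported as inside, while (5, 1) and (-1, 1) each fall on the outward side of one edge.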

_images/polygon.png — Fig. 73 Draw a polygon with vertices counter clockwise¶

As shown in Fig. 73, when drawing a polygon with vectors (lines) in counter-clockwise order, the polygon will be formed, and the two sides of each vector (line) can be identified [27].

Furthermore, whether a point is inside or outside the polygon can be determined.

One simple method to test whether a point lies inside or outside a simple polygon is to cast a ray from the point in any fixed direction and count how many times it intersects the edges of the polygon.

If the point is outside the polygon, the ray will intersect its edges an even number of times. If the point is inside the polygon, it will intersect the edges an odd number of times [28].
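
The ray-casting test can be sketched in C++ as follows. This is a minimal sketch, assuming the ray is cast toward the positive x direction and that the point does not lie exactly on an edge; the name `insideEvenOdd` is hypothetical. Unlike the edge-normal rule, it also works for concave polygons.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec2 {
    double x, y;
};

// Even-odd rule: cast a ray from q toward +x and toggle "inside"
// each time the ray crosses a polygon edge.
bool insideEvenOdd(const std::vector<Vec2>& p, const Vec2& q) {
    assert(p.size() >= 3);
    bool inside = false;
    for (std::size_t i = 0, j = p.size() - 1; i < p.size(); j = i++) {
        const Vec2& a = p[i];
        const Vec2& b = p[j];
        // Does edge a-b straddle the horizontal line through q?
        if ((a.y > q.y) != (b.y > q.y)) {
            // x coordinate where the edge meets that line
            double x = a.x + (q.y - a.y) * (b.x - a.x) / (b.y - a.y);
            if (x > q.x) inside = !inside;  // crossing lies to the right of q
        }
    }
    return inside;
}
```

For the U-shaped polygon (0,0), (6,0), (6,4), (4,4), (4,2), (2,2), (2,4), (0,4), the ray from (1, 3) crosses three edges, so the point is inside, while the ray from (3, 3), which sits in the notch, crosses two edges, so the point is outside.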

_images/3d-cross-product.png — Fig. 74 Cross product definition in 3D¶

In the same way, by following the counter-clockwise direction to create a 2D polygon step by step, a 3D polygon can be constructed.

As shown in Fig. 74 from the wiki [24], the inward direction is determined by \(a \times b < 0\), and the outward direction is determined by \(a \times b > 0\) in OpenGL.

Replacing \(a\) and \(b\) with \(x\) and \(y\), as shown in Fig. 75, the positive Z-axis (\(z+\)) represents the outer surface, while the negative Z-axis (\(z-\)) represents the inner surface [29].

_images/ogl-pointing-outwards.png — Fig. 75 OpenGL pointing outwards, indicating the outer surface (z axis is +)¶

_images/3d-polygon.png — Fig. 76 3D polygon with directions on each plane¶

Fig. 76 shows an example of a 3D polygon created from 2D triangles. The direction of each plane (triangle) is given by the line perpendicular to the plane.

Cast a ray from the 3D point along the X-axis and count how many times it intersects the surface of the object. From the number of intersections (even or odd), you can tell whether the point is inside or outside [30].

An odd number means inside, and an even number means outside. As shown in Fig. 77, points on the line passing through the object satisfy this rule.

_images/in-3d-object.png — Fig. 77 Point is inside or outside of 3D object¶

By default, OpenGL treats counter-clockwise winding as front-facing, with the normal pointing outwards [14].

// unit cube
// A cube has 6 sides and each side has 4 vertices, therefore, the total number
// of vertices is 24 (6 sides * 4 verts), and 72 floats in the vertex array
// since each vertex has 3 components (x,y,z) (= 24 * 3)
//    v6----- v5
//   /|      /|
//  v1------v0|
//  | |     | |
//  | v7----|-v4
//  |/      |/
//  v2------v3

// vertex position array
GLfloat vertices[]  = {
   .5f, .5f, .5f,  -.5f, .5f, .5f,  -.5f,-.5f, .5f,  .5f,-.5f, .5f, // v0,v1,v2,v3 (front)
   .5f, .5f, .5f,   .5f,-.5f, .5f,   .5f,-.5f,-.5f,  .5f, .5f,-.5f, // v0,v3,v4,v5 (right)
   .5f, .5f, .5f,   .5f, .5f,-.5f,  -.5f, .5f,-.5f, -.5f, .5f, .5f, // v0,v5,v6,v1 (top)
  -.5f, .5f, .5f,  -.5f, .5f,-.5f,  -.5f,-.5f,-.5f, -.5f,-.5f, .5f, // v1,v6,v7,v2 (left)
  -.5f,-.5f,-.5f,   .5f,-.5f,-.5f,   .5f,-.5f, .5f, -.5f,-.5f, .5f, // v7,v4,v3,v2 (bottom)
   .5f,-.5f,-.5f,  -.5f,-.5f,-.5f,  -.5f, .5f,-.5f,  .5f, .5f,-.5f  // v4,v7,v6,v5 (back)
};

From the code above, we can see that OpenGL uses counter-clockwise and pointing outwards as the default. However, OpenGL provides glFrontFace(GL_CW) for clockwise winding [31].
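
The outward direction of a face can be checked with the 3D cross product. The C++ sketch below is a minimal illustration, not part of the cited example code; the names `Vec3`, `faceNormal` and `checkFrontFace` are hypothetical. It applies the right-hand rule to the first three vertices of the front face (v0, v1, v2) of the unit cube above.

```cpp
#include <cassert>

struct Vec3 {
    double x, y, z;
};

Vec3 sub(const Vec3& a, const Vec3& b) {
    return {a.x - b.x, a.y - b.y, a.z - b.z};
}

// 3D cross product a x b, expanded from the determinant form.
Vec3 cross(const Vec3& a, const Vec3& b) {
    return {a.y * b.z - a.z * b.y,
            a.z * b.x - a.x * b.z,
            a.x * b.y - a.y * b.x};
}

// Normal of the face through v0, v1, v2 (right-hand rule, not normalized).
Vec3 faceNormal(const Vec3& v0, const Vec3& v1, const Vec3& v2) {
    return cross(sub(v1, v0), sub(v2, v1));
}

// Checks that the front face v0, v1, v2 of the unit cube faces +z.
void checkFrontFace() {
    Vec3 n = faceNormal({0.5, 0.5, 0.5}, {-0.5, 0.5, 0.5}, {-0.5, -0.5, 0.5});
    assert(n.x == 0.0 && n.y == 0.0 && n.z == 1.0);
}
```

The counter-clockwise order v0, v1, v2 gives a normal along +z, which points away from the cube's center. Reversing the order to v2, v1, v0 flips the normal to -z, which is the case glFrontFace(GL_CW) is meant for.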

For a group of objects, a scene graph provides better animation support and saves memory [32].

OpenGL ¶

Example of OpenGL program ¶

The following example is from the OpenGL Red Book and its example code [37] [38].

References/triangles.vert

#version 400 core

layout( location = 0 ) in vec4 vPosition;

void
main()
{
    gl_Position = vPosition;
}

References/triangles.frag

#version 450 core

out vec4 fColor;

void main()
{
    fColor = vec4(0.5, 0.4, 0.8, 1.0);
}

References/01-triangles.cpp

//////////////////////////////////////////////////////////////////////////////
//
//  Triangles.cpp
//
//////////////////////////////////////////////////////////////////////////////

#include "vgl.h"
#include "LoadShaders.h"

enum VAO_IDs { Triangles, NumVAOs };
enum Buffer_IDs { ArrayBuffer, NumBuffers };
enum Attrib_IDs { vPosition = 0 };

GLuint  VAOs[NumVAOs];
GLuint  Buffers[NumBuffers];

const GLuint  NumVertices = 6;

//----------------------------------------------------------------------------
//
// init
//

void
init( void )
{
    // Reserve VAO names. In OpenGL 4.5, glCreateVertexArrays( NumVAOs, VAOs )
    // can be used instead; it also creates the objects immediately.
    glGenVertexArrays( NumVAOs, VAOs );
      // https://stackoverflow.com/questions/24441430/glgen-vs-glcreate-naming-convention
    // Make the new VAO:VAOs[Triangles] active, creating it if necessary.
    glBindVertexArray( VAOs[Triangles] );
    // Kind of like: opengl->current_vertex_array = VAOs[Triangles]
    
    GLfloat  vertices[NumVertices][2] = {
        { -0.90f, -0.90f }, {  0.85f, -0.90f }, { -0.90f,  0.85f },  // Triangle 1
        {  0.90f, -0.85f }, {  0.90f,  0.90f }, { -0.85f,  0.90f }   // Triangle 2
    };

    glCreateBuffers( NumBuffers, Buffers );
    
    // Make the buffer the active array buffer.
    // Kind of like: opengl->current_array_buffer = Buffers[ArrayBuffer]
    glBindBuffer( GL_ARRAY_BUFFER, Buffers[ArrayBuffer] );
    // glVertexAttribPointer() below attaches the active VBO:Buffers[ArrayBuffer]
    // to VAOs[Triangles] as an array of vectors with 2 floats each.
    // Kind of like:
    // opengl->current_vertex_array->attributes[attr] = {
    //     type = GL_FLOAT,
    //     size = 2,
    //     data = opengl->current_array_buffer
    // }
    // Can be replaced with glVertexArrayVertexBuffer(VAOs[Triangles], 0,
    // Buffers[ArrayBuffer], 0, sizeof(vmath::vec2));, glVertexArrayAttribFormat(), ...
    // in OpenGL 4.5.
    
    glBufferStorage( GL_ARRAY_BUFFER, sizeof(vertices), vertices, 0);

    ShaderInfo  shaders[] =
    {
        { GL_VERTEX_SHADER, "media/shaders/triangles/triangles.vert" },
        { GL_FRAGMENT_SHADER, "media/shaders/triangles/triangles.frag" },
        { GL_NONE, NULL }
    };

    GLuint program = LoadShaders( shaders );
    glUseProgram( program );

    glVertexAttribPointer( vPosition, 2, GL_FLOAT,
                           GL_FALSE, 0, BUFFER_OFFSET(0) );
    glEnableVertexAttribArray( vPosition );
    // Above two functions specify vPosition to vertex shader at layout (location = 0)
}

//----------------------------------------------------------------------------
//
// display
//

void
display( void )
{
    static const float black[] = { 0.0f, 0.0f, 0.0f, 0.0f };

    glClearBufferfv(GL_COLOR, 0, black);

    glBindVertexArray( VAOs[Triangles] );
    glDrawArrays( GL_TRIANGLES, 0, NumVertices );
}

//----------------------------------------------------------------------------
//
// main
//

#ifdef _WIN32
int CALLBACK WinMain(
  _In_ HINSTANCE hInstance,
  _In_ HINSTANCE hPrevInstance,
  _In_ LPSTR     lpCmdLine,
  _In_ int       nCmdShow
)
#else
int
main( int argc, char** argv )
#endif
{
    glfwInit();

    GLFWwindow* window = glfwCreateWindow(800, 600, "Triangles", NULL, NULL);

    glfwMakeContextCurrent(window);
    gl3wInit();

    init();

    while (!glfwWindowShouldClose(window))
    {
        display();
        glfwSwapBuffers(window);
        glfwPollEvents();
    }

    glfwDestroyWindow(window);

    glfwTerminate();
}

init():

Generate Vertex Array VAOs and bind VAOs[0].

(glGenVertexArrays(NumVAOs, VAOs); glBindVertexArray(VAOs[Triangles]); glCreateBuffers(NumBuffers, Buffers);)

A vertex-array object holds various data related to a collection of vertices. Those data are stored in buffer objects and managed by the currently bound vertex-array object.
- glBindBuffer(GL_ARRAY_BUFFER, Buffers[ArrayBuffer]);
  
  Because there are many different places where buffer objects can be in OpenGL, when we bind a buffer, we need to specify what we’d like to use it for. In our example, because we’re storing vertex data into the buffer, we use GL_ARRAY_BUFFER. The place where the buffer is bound is known as the binding target.
Following the counter-clockwise rule from the previous section, the triangle primitives are defined in the variable vertices. After the OpenGL VBO Buffers[0] is bound and filled from vertices by glBufferStorage(), the vertex data is sent to the memory of the server (GPU).

Think of the “active” buffer as just a global variable; a number of functions use the active buffer instead of taking a parameter. These global state variables are the ugly side of OpenGL [39]. In OpenGL 4.5 they can be avoided by calling glVertexArrayVertexBuffer(), glVertexArrayAttribFormat(), etc., and then calling glBindVertexArray(vao) before drawing [40] [41].
glVertexAttribPointer(vPosition, 2, GL_FLOAT, GL_FALSE, 0, BUFFER_OFFSET(0)):

During GPU rendering, each vertex position is fed to the input vPosition (location 0) of the “triangles.vert” shader, which was compiled and linked by the LoadShaders(shaders) function.

glfwSwapBuffers(window):

You’ve already used double buffering for animation. Double buffering is done by making the main color buffer have two parts: a front buffer that’s displayed in your window; and a back buffer, which is where you render the new image. When you swap the buffers (by calling glfwSwapBuffers(), for example), the front and back buffers are exchanged [78].

display():

Bind VAOs[0], set the render mode to GL_TRIANGLES, and send the vertex data in the buffer (GPU memory) into the OpenGL pipeline. Next, the GPU runs the rendering pipeline described in the next section.

triangles.vert declares the input vPosition and no output variable, so it writes to the built-in variable gl_Position, which needs no declaration. triangles.frag declares no input variable and declares the output variable fColor instead of using gl_FragColor.

The “in” and “out” keywords in the shaders above are type qualifiers. A type qualifier is used in the OpenGL Shading Language (GLSL) to modify the storage or behavior of global and locally defined variables. These qualifiers change particular aspects of a variable, such as where it gets its data from [45].

Although attribute and varying were removed from GLSL in version 1.40 and later, many materials on the web still use them [46] [47]. It is better to use “in” and “out” instead, as in the following code. OpenGL has a few ways to bind an API variable to a shader variable, such as glVertexAttrib* (used in the following code) and glBindAttribLocation() [48], …

replace attribute and varying with in and out

uniform float scale;
layout (location = 0) attribute vec2 position;
// layout (location = 0) in vec2 position;
layout (location = 1) attribute vec4 color;
// layout (location = 1) in vec4 color;
varying vec4 v_color;
// out vec4 v_color;

void main()
{
  gl_Position = vec4(position*scale, 0.0, 1.0);
  v_color = color;
}

// OpenGL API
GLfloat attrib[] = { x * 0.5f, x * 0.6f, x* 0.4f, 0.0f };
// Update the value of input attribute 1 : layout (location = 1) in vec4 color
glVertexAttrib4fv(1, attrib);

varying vec4 v_color;
// in vec4 v_color;

void main()
{
  gl_FragColor = v_color;
}

An OpenGL program is made of two shaders [43] [44]:

The vertex shader is (commonly) executed once for every vertex we want to draw. It receives some attributes as input, computes the position of this vertex in space and returns it in a variable called gl_Position. It also defines some varyings.
The fragment shader is executed once for each pixel to be rendered. It receives some varyings as input, computes the color of this pixel and returns it in a variable called fColor.

Since we have 6 vertices in our buffer, the vertex shader will be executed 6 times by the GPU (once per vertex). We can also expect all 6 instances of the shader to be executed in parallel, since a GPU has so many cores.

3D Rendering ¶

Based on the previous section of 3D modeling, the 3D modeling tool will generate a 3D vertex model and OpenGL code. Then, programmers may manually modify the OpenGL code and add or update shaders. The 3D animation will trigger the 3D rendering process for each 2D image drawing.

3D rendering is the process of converting 3D models into 2D images on a computer [33].

The steps are shown in Fig. 78.

_images/short-rendering-pipeline.png — Fig. 78 3D Graphics Rendering Pipeline¶

A fragment can be treated as a pixel in 3D space that is aligned with the pixel grid and carries attributes such as position, color, normal and texture.

The complete steps are shown in Fig. 79, from the OpenGL website [34]. The website also provides a description for each stage.

_images/rendering_pipeline.png — Fig. 79 Diagram of the Rendering Pipeline. The blue boxes are programmable shader stages.¶

In addition, the stages of the OpenGL rendering pipeline, shown in Figure 1.2 of the book OpenGL Programming Guide, 9th Edition [37], are listed as follows:

Table 45 OpenGL rendering pipeline from page 10 of book “OpenGL Programming Guide 9th Edition” [37] and [34].¶
Stage	Description
Vertex Specification	After setting data as in the previous section, `glDrawArrays()` sends data to the GPU through buffer objects.
Vertex Shading	Vertex → Vertex and other data such as color for later passes. For each vertex issued by a drawing command, a vertex shader processes the data associated with that vertex.
Tessellation Shading	Create more detail on demand when zoomed in. After the vertex shader processes each vertex, the tessellation shader stage (if active) continues processing. See reference below.
Geometry Shading	Allows additional processing of geometric primitives. This stage may create new primitives before rasterization. See Chapter 10 of the Red Book [37].
Primitive Assembly	The previous shading stages all operate on vertices, with the information about how those vertices are organized into geometric primitives being carried along internal to OpenGL. The primitive assembly stage organizes the vertices into their associated geometric primitives in preparation for clipping and rasterization.
Clipping	Clip the parts that fall outside the viewport. Occasionally, vertices will be outside of the viewport (the region of the window where you’re permitted to draw) and cause the primitive associated with that vertex to be modified so none of its pixels are outside of the viewport. This operation is called clipping and is handled automatically by OpenGL.
Rasterization	Vertex → Fragment. The job of the rasterizer is to determine which screen locations are covered by a particular piece of geometry (point, line, or triangle). Knowing those locations, along with the input vertex data, the rasterizer linearly interpolates the data values for each varying variable in the fragment shader and sends those values as inputs into your fragment shader. A fragment can be treated as a pixel in 3D space that is aligned with the pixel grid, with attributes such as position, color, normal and texture.
Fragment Shading	Determine color for each pixel. The final stage where you have programmable control over the color of a screen location is fragment shading. In this shader stage, you use a shader to determine the fragment’s final color (although the next stage, per-fragment operations, can modify the color one last time) and potentially its depth value. Fragment shaders are very powerful, as they often employ texture mapping to augment the colors provided by the vertex processing stages. A fragment shader may also terminate processing a fragment if it determines the fragment shouldn’t be drawn; this process is called fragment discard. A helpful way of thinking about the difference between shaders that deal with vertices and fragment shaders is this: vertex shading (including tessellation and geometry shading) determines where on the screen a primitive is, while fragment shading uses that information to determine what color that fragment will be.
Per-Fragment Operations	During this stage, a fragment’s visibility is determined using depth testing (also commonly known as z-buffering) and stencil testing. If a fragment successfully makes it through all of the enabled tests, it may be written directly to the framebuffer, updating the color (and possibly depth value) of its pixel, or if blending is enabled, the fragment’s color will be combined with the pixel’s current color to generate a new color that is written into the framebuffer.
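The linear interpolation done by the rasterizer can be sketched on the CPU as follows. This is a minimal illustration, not how any particular GPU implements it: the names `Vec2`, `edge` and `interpolate_varying` are hypothetical, the triangle is given directly in 2D screen coordinates, and perspective correction is ignored.

```c
#include <assert.h>

typedef struct { float x, y; } Vec2;

/* Signed area of the parallelogram spanned by (b - a) and (p - a). */
static float edge(Vec2 a, Vec2 b, Vec2 p) {
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

/* Interpolates one varying value (c0, c1, c2 at the three vertices) at the
   screen location p. Returns 1 and writes *out when p is covered by the
   triangle (v0, v1, v2); returns 0 when p is outside (no fragment). */
int interpolate_varying(Vec2 v0, Vec2 v1, Vec2 v2,
                        float c0, float c1, float c2,
                        Vec2 p, float *out) {
    float area = edge(v0, v1, v2);
    assert(area != 0.0f);               /* degenerate triangles are not handled */
    float w0 = edge(v1, v2, p) / area;  /* barycentric weight of v0 */
    float w1 = edge(v2, v0, p) / area;  /* barycentric weight of v1 */
    float w2 = edge(v0, v1, p) / area;  /* barycentric weight of v2 */
    if (w0 < 0.0f || w1 < 0.0f || w2 < 0.0f) return 0;
    *out = w0 * c0 + w1 * c1 + w2 * c2;
    return 1;
}
```

Calling the function once per pixel center inside the triangle's bounding box yields one interpolated value per covered fragment, which is the input a fragment shader would receive for that varying.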

Tessellation Shading: The core problem that Tessellation deals with is the static nature of 3D models in terms of their detail and polygon count. The thing is that when we look at a complex model such as a human face up close we prefer to use a highly detailed model that will bring out the tiny details (e.g. skin bumps, etc). A highly detailed model automatically translates to more triangles and more compute power required for processing. … One possible way to solve this problem using the existing features of OpenGL is to generate the same model at multiple levels of detail (LOD). For example, highly detailed, average and low. We can then select the version to use based on the distance from the camera. This, however, will require more artist resources and often will not be flexible enough. …

Let’s take a look at how Tessellation has been implemented in the graphics pipeline. The core components that are responsible for Tessellation are two new shader stages and in between them a fixed function stage that can be configured to some degree but does not run a shader. The first shader stage is called Tessellation Control Shader (TCS), the fixed function stage is called the Primitive Generator (PG), and the second shader stage is called Tessellation Evaluation Shader (TES). Some GPUs do not implement this fixed function stage in hardware and do not even provide the TCS, TES and Geometry Shader stages. Users can write Compute Shaders instead for this on-the-fly detail display.

The TCS works on a group of vertices called Control Points (CPs), which define a geometric surface. This surface is usually defined by some polynomial formula and the idea is that moving a CP has an effect on the entire surface. … The group of CPs is usually called a Patch [35]. Chapter 9 of the Red Book [37] has details.
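The multiple-LOD alternative mentioned in the quote, selecting a model version by camera distance, can be sketched as follows. The names `LodLevel` and `select_lod` and the two distance thresholds are assumptions made for illustration; a real engine would tune the thresholds per model.

```c
#include <assert.h>

typedef enum { LOD_HIGH = 0, LOD_MEDIUM = 1, LOD_LOW = 2 } LodLevel;

/* Picks which pre-built version of the model to draw.
   near_limit and far_limit are hypothetical, application-chosen distances. */
LodLevel select_lod(float distance, float near_limit, float far_limit) {
    assert(near_limit <= far_limit);
    if (distance < near_limit) return LOD_HIGH;    /* close: highly detailed mesh */
    if (distance < far_limit)  return LOD_MEDIUM;  /* mid-range: average mesh */
    return LOD_LOW;                                /* far away: low-polygon mesh */
}
```

Tessellation avoids keeping these three separate meshes by generating the extra triangles on the GPU when they are needed.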

For 2D animation, the model is created in 2D only (a single face), so it can only be viewed from that face. To display different faces of the model, multiple 2D models must be created and switched from frame to frame over time [36].

GLSL (GL Shader Language)¶

OpenGL is a standard specification for designing 2D and 3D graphics and animation in computer graphics. To support advanced animation and rendering, OpenGL provides a large set of APIs (functions) for graphics processing. Popular 3D modeling and animation tools—such as Maya, Blender, and others—can utilize these APIs to handle 3D-to-2D projection and rendering directly on the computer.

The hardware-specific implementation of these APIs is provided by GPU manufacturers, ensuring that rendering is optimized for the underlying hardware.

An OpenGL program typically follows a structure like the example below:

Vertex shader

#version 330 core
layout (location = 0) in vec3 aPos; // the position variable has attribute position 0

out vec4 vertexColor; // specify a color output to the fragment shader

void main()
{
    gl_Position = vec4(aPos, 1.0); // see how we directly give a vec3 to vec4's constructor
    vertexColor = vec4(0.5, 0.0, 0.0, 1.0); // set the output variable to a dark-red color
}

Fragment shader

#version 330 core
out vec4 FragColor;

in vec4 vertexColor; // the input variable from the vertex shader (same name and same type)

void main()
{
    FragColor = computeColorOfThisPixel(...);
}

OpenGL user program

int main(int argc, char ** argv)
{
  // init window, detect user input and do corresponding animation by calling opengl api
  ...
}

The last main() function, in the OpenGL user program, is written by the user as expected. Now, let’s explain the purpose of the first two main() functions, which belong to the vertex shader and the fragment shader of the OpenGL pipeline.

As discussed in the Concepts of Computer Graphics textbook, OpenGL provides a rich set of APIs that allow programmers to render 3D objects onto a 2D computer screen. The general rendering process follows these steps:

The user sets up lighting, textures, and object materials.
The system calculates the position of each vertex in 3D space.
The GPU and rendering pipeline automatically determine the color of each pixel based on lighting, textures, and interpolation.
The final image is displayed on the screen by writing pixel colors to the framebuffer.

To give programmers the flexibility to add custom effects or visual enhancements—such as modifying vertex positions for animation or applying unique coloring—OpenGL provides two programmable stages in the graphics pipeline:

Vertex Shader: Allows the user to customize how vertex coordinates are transformed and processed.
Fragment Shader: Allows the user to define how each pixel (fragment) is shaded and colored, enabling effects like lighting, textures, and transparency.

These shaders are written by the user and compiled at runtime, providing powerful control over the rendering process.

OpenGL uses the term fragment shader instead of pixel shader for this reason: “Fragment shaders are a more accurate name for the same functionality as Pixel shaders. They aren’t pixels yet, since the output still has to pass several tests (depth, alpha, stencil) as well as the fact that one may be using antialiasing, which renders one-fragment-to-one-pixel non-true” [49]. Programmers are allowed to add their own conversion functions, which the compiler translates into GPU instructions that run on the GPU processor. With these two shaders, new features have been added to allow for increased flexibility in the rendering pipeline at the vertex and fragment level [50].

Unlike the simple shader example here [51], some conversion functions for coordinates in the vertex shader or for color in the fragment shader are more complicated, depending on the scenes of the animation. Here is an example [52]. On the wiki shading page [3], the Gouraud and Phong shading methods make the surface of an object smoother by means of GLSL. Example GLSL code for Gouraud and Phong shading with the OpenGL API is here [53]. Since the graphics card hardware and the graphics driver software can be replaced, the compiler runs on-line: the driver compiles the shader program the first time it is run and keeps the result in a cache after compilation [54].

The shader program has a C-like syntax and can be compiled in a few milliseconds. Adding these few milliseconds of on-line compilation time to a running OpenGL program is a good trade-off for handling replacement of the driver software or the GPU hardware [55].

OpenGL Shader compiler ¶

The OpenGL standard is defined in [56]. OpenGL is primarily designed for desktop computers and servers, whereas OpenGL ES is a subset tailored for embedded systems [57].

Although shaders represent only a small part of the entire OpenGL software/hardware stack, implementing a compiler for them is still a significant undertaking. This is because a large number of APIs need to be supported. For instance, there are over 80 texture-related APIs alone [58].

A practical approach to implementing such a compiler involves generating LLVM extended intrinsic functions from the shader frontend (parser and AST generator). These intrinsics can then be lowered into GPU-specific instructions in the LLVM backend. The overall workflow is illustrated as follows:

Fragment shader

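#version 320 es
precision mediump float; // GLSL ES fragment shaders need a default float precision
uniform sampler2D x;
uniform float bias;      // assumed here to be supplied by the application
in vec2 uv_2d;           // assumed here to be passed in from the vertex shader
out vec4 FragColor;

void main()
{
    FragColor = texture(x, uv_2d, bias);
}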

llvm-ir

...
!1 = !{!"sampler_2d"}
!2 = !{i32 SAMPLER_2D} ; SAMPLER_2D is integer value for sampler2D, for example: 0x0f02
; A named metadata.
!x_meta = !{!1, !2}

define void @main() #0 {
    ...
    %1 = @llvm.gpu0.texture(metadata !x_meta, %1, %2, %3); ; %1: %sampler_2d, %2: %uv_2d, %3: %bias
    ...
}

asm of gpu

...
// gpu machine code
load $1, tex_a;
sample2d_inst $1, $2, $3 // $1: tex_a, $2: %uv_2d, $3: %bias

.tex_a // The driver sets the index of the GPU descriptor registers here

As shown at the end of the code above, the .tex_a memory address contains the Texture Object, which is bound by the driver during online compilation and linking. By binding a Texture Object (software representation) to a Texture Unit (hardware resource) via OpenGL API calls, the GPU can access and utilize Texture Unit hardware efficiently. This binding mechanism ensures that texture sampling and mapping are executed with minimal overhead during rendering.

For more information about LLVM extended intrinsic functions, please refer to [59].

gvec4 texture(gsampler2D sampler, vec2 P, [float bias]);

GPUs provide Texture Units to accelerate texture access in fragment shaders. However, Texture Units are expensive hardware resources, and only a limited number are available on a GPU. To manage this limitation, the OpenGL driver can associate a Texture Unit with a sampler variable using OpenGL API calls. This association can be updated or switched between shaders as needed. The following statements demonstrate how to bind and switch Texture Units across shaders:

_images/sampling_diagram.png — Fig. 80 Relationships between the texturing concept [60].¶

As shown in Fig. 80, the texture object is not bound directly to a shader (where sampling operations occur). Instead, it is bound to a texture unit, and the index of this texture unit is passed to the shader. This means the shader accesses the texture object through the assigned texture unit. Most GPUs support multiple texture units, though the exact number depends on the hardware capabilities [60].

A texture unit—also known as a Texture Mapping Unit (TMU) or Texture Processing Unit (TPU)— is a dedicated hardware component in the GPU that performs texture sampling operations.

The sampler argument in the texture sampling function refers to a sampler2D (or similar) uniform variable. This variable represents the texture unit index used to access the associated texture object [60].

Sampler Uniform Variables:

OpenGL provides a set of special uniform variables for texture sampling, named according to the texture target: sampler1D, sampler2D, sampler3D, samplerCube, etc.

You can create as many sampler uniform variables as needed and assign each one to a specific texture unit index using OpenGL API calls. Whenever a sampling function is invoked with a sampler uniform, the GPU uses the texture unit (and its bound texture object) associated with that sampler [60].

_images/sampling_diagram_binding.png — Fig. 81 Binding sampler variables [61].¶

As shown in Fig. 81, the WebGL (JavaScript) API function gl.bindTexture() binds a Texture Object to a specific Texture Unit. Then, using gl.getUniformLocation() and gl.uniform1i(), you associate the Texture Unit with a sampler uniform variable in the shader.

For example, gl.uniform1i(xLoc, 1) assigns Texture Unit 1 to the sampler variable at location xLoc. Similarly, passing 2 would refer to Texture Unit 2, and so on [61].
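The indirection between sampler variables, texture units and texture objects can be modeled in plain C as shown below. This is only a sketch of the bookkeeping, not driver code: `TextureState`, `active_texture`, `bind_texture`, `set_sampler` and `resolve_sampler` are hypothetical names that loosely mirror the API calls discussed above, and the limit of four texture units is an assumption.

```c
#include <assert.h>

#define NUM_TEXTURE_UNITS 4  /* assumed limit; real GPUs report their own */

typedef struct {
    int bound_texture[NUM_TEXTURE_UNITS]; /* texture object id per unit, 0 = none */
    int active_unit;                      /* unit affected by the next bind */
} TextureState;

/* Selects the texture unit that later bind calls will modify. */
void active_texture(TextureState *s, int unit) {
    assert(unit >= 0 && unit < NUM_TEXTURE_UNITS);
    s->active_unit = unit;
}

/* Binds a texture object to the currently active texture unit. */
void bind_texture(TextureState *s, int texture_id) {
    s->bound_texture[s->active_unit] = texture_id;
}

/* A sampler uniform stores only a texture unit index (like uniform1i). */
void set_sampler(int *sampler_uniform, int unit) {
    assert(unit >= 0 && unit < NUM_TEXTURE_UNITS);
    *sampler_uniform = unit;
}

/* What a sampling operation sees: sampler -> unit -> texture object. */
int resolve_sampler(const TextureState *s, int sampler_uniform) {
    return s->bound_texture[sampler_uniform];
}
```

Because the shader holds only the unit index, the application can bind a different texture object to the same unit and the shader samples the new texture without being recompiled.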

The following figure illustrates how the OpenGL driver reads metadata from a compiled GLSL object, how the OpenGL API links sampler uniform variables to Texture Units, and how the GPU executes the corresponding texture instructions.

_images/driverSamplerTable.png — Fig. 82 Associating Sampler Variables and gpu executing texture instruction¶

The detailed steps in the figure above are as follows:

1. To enable the GPU driver to bind the texture unit, the frontend compiler must pass metadata for each sampler uniform variable (e.g., sampler_2d_var in this example) [64] to the backend. The backend then allocates and embeds this metadata in the compiled binary file [62].
2. During the on-line compilation of the GLSL shader, the GPU driver reads this metadata from the compiled binary file. It constructs an internal table mapping each sampler uniform variable to its attributes, such as {name, type, location}. This mapping allows the driver to properly populate the Texture Descriptor in the GPU’s memory, linking the variable to a specific texture unit.
3. API:

xLoc = gl.getUniformLocation(prog, "x"); // prog: GLSL program, xLoc: location of sampler variable "x"

This API call queries the location of the sampler uniform variable named “x” from the internal table that the driver created after parsing the shader metadata.

The returned xLoc value corresponds to the location field associated with “x”, which will later be used to bind a specific texture unit to this sampler variable via gl.uniform1i(xLoc, unit_index).

SAMPLER_2D is the internal representation (usually an integer) that identifies a sampler2D type in the shader.

4. API:

gl.uniform1i(xLoc, 1);

This API call binds the sampler uniform variable x (located at xLoc) to Texture Unit 1. It works by writing the integer value 1 to the internal GLSL program memory at the location of the sampler variable x, as indicated by xLoc.

{xLoc, 1} : 1 is 'Texture Unit 1', xLoc is the memory address of 'sampler uniform variable' x

After this call, the OpenGL driver updates the Texture Descriptor table in GPU memory with this {xLoc, 1} information.

Next, the driver associates the memory address or index of the GPU’s texture descriptor with a hardware register or pointer used during fragment shader execution. For example, as shown in the diagram, the driver may write a pointer k to the .tex_a field in memory.

This .tex_a address is used by the GPU to locate the correct Texture Unit and access the texture object during shader execution.

GPU Execution of Texture Instruction

// GPU machine code
load $1, tex_a;
sample2d_inst $1, $2, $3  // $1: tex_a, $2: %uv_2d, $3: %bias

.tex_a // Set by driver to index of GPU descriptor at step 4

When the GPU executes the texture sampling instruction (e.g., sample2d_inst), it uses the .tex_a address, which was assigned by the driver in step 4, to access the appropriate Texture Descriptor from GPU memory. This descriptor corresponds to Texture Unit 1 because of the earlier API call:

gl.uniform1i(xLoc, 1);

If the GPU hardware provides dedicated texture descriptor registers or memory structures, the driver maps .tex_a to those structures [65].

Example (NVIDIA PTX texture instruction):

// The content of tex_a is bound to a texture unit, as in step 4
tex.3d.v4.s32.s32 {r1,r2,r3,r4}, [tex_a, {f1,f2,f3,f4}];

.tex_a

Here, the .tex_a register holds the texture binding information set by the driver. The vector {f1, f2, f3} represents the 3D coordinates (x, y, z) provided by the shader or program logic. The f4 value is ignored for 3D textures.

This tex.3d instruction performs a texture fetch from the bound 3D texture and loads the resulting color values into general-purpose registers:

r1: Red
r2: Green
r3: Blue
r4: Alpha

The fragment shader can then use or modify this color value based on further calculations or blending logic [63].

If a 1D texture is used instead, the texture instruction would look like:

// For compatibility with prior versions of PTX, the square brackets are not
// required and .v4 coordinate vectors are allowed for any geometry, with
// the extra elements being ignored.
tex.1d.v4.s32.f32  {r1,r2,r3,r4}, [tex_a, {f1}];

Since the ‘Texture Unit’ is a limited hardware accelerator on the GPU, OpenGL provides APIs that allow user programs to bind ‘Texture Units’ to ‘Sampler Variables’. As a result, user programs can balance the use of ‘Texture Units’ efficiently through OpenGL APIs without recompiling GLSL. Fast texture sampling is one of the key requirements for good GPU performance [61].

In addition to the API for binding textures, OpenGL provides the glTexParameteri API for texture wrapping [66]. Furthermore, the texture instruction for some GPUs may include S# and T# values in the operands. Similar to associating ‘Sampler Variables’ to ‘Texture Units’, S# and T# are memory locations associated with texture wrapping descriptor registers. This allows user programs to change wrapping options without recompiling GLSL.
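The effect of a wrapping option on a texture coordinate can be sketched as below. The function models only two modes, corresponding to the OpenGL values GL_REPEAT and GL_CLAMP_TO_EDGE passed to glTexParameteri, and uses nearest-texel lookup in one dimension. `WrapMode`, `floor_to_int` and `wrap_texel` are hypothetical names, and real hardware applies the mode inside the texture unit.

```c
#include <assert.h>

typedef enum { WRAP_REPEAT, WRAP_CLAMP_TO_EDGE } WrapMode;

/* Largest integer not greater than v (avoids depending on the math library). */
static int floor_to_int(float v) {
    int i = (int)v;                 /* truncates toward zero */
    return (v < (float)i) ? i - 1 : i;
}

/* Maps a texture coordinate s to a texel index in [0, size - 1]. */
int wrap_texel(float s, int size, WrapMode mode) {
    assert(size > 0);
    if (mode == WRAP_REPEAT)
        s = s - (float)floor_to_int(s);   /* keep only the fractional part */
    int i = floor_to_int(s * (float)size);
    if (i < 0) i = 0;                     /* clamp to the first texel */
    if (i > size - 1) i = size - 1;       /* clamp to the last texel */
    return i;
}
```

Since the mode is an argument here rather than part of the function body, switching it changes the result without touching the code, which parallels how a wrapping descriptor lets the program change options without recompiling GLSL.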

Even though the GLSL frontend compiler always expands function calls into inline functions, and LLVM intrinsic extensions provide an easy way to generate code through LLVM’s target description (TD) files, the GPU backend compiler is still somewhat more complex than the CPU backend.

(However, considering the effort required for the CPU frontend compiler such as Clang, or toolchains like the linker and GDB/LLDB, the overall difficulty of building a CPU compiler is not necessarily less than that of a GPU compiler.)

Here is the software stack of the 3D graphics system for OpenGL on Linux [11]. The Mesa open source project website is here [67].

GPU Architecture ¶

_images/gpu-terms.png — Fig. 83 Terms in Nvidia’s gpu (figure from book [84])¶

SIMT ¶

Single instruction, multiple threads (SIMT) is an execution model used in parallel computing, where single instruction, multiple data (SIMD) is combined with multithreading [68].

The leading GPU architecture of Nvidia GPUs is shown in the following figures.

_images/threads-lanes.png — Fig. 84 Threads and lanes in gpu (figure from book [73])¶

Note

A SIMD Thread executed by SIMD Processor, a.k.a. SM, has 16 Lanes.

_images/sm.png — Fig. 85 A Streaming Multiprocessor (SM) has two 16-way SIMD units and four special function units [70]. The SM has an L1 cache and a Read Only Cache (Uniform Cache). GTX480 has 48 SMs. The ALUs run at twice the clock rate of the rest of the chip, so each decoded instruction runs on 32 pieces of data on the 16 ALUs over two ALU clocks [71].¶

_images/sm2.png — Fig. 86 Multithreaded SIMD Processor (Streaming Multiprocessor SM) figure from book [74]¶

Note

A SIMD thread executed by a SIMD processor, also known as an SM, processes 32 elements.

Number of registers in a thread block = 16 (SIMD threads, i.e. warps) * 32 (CUDA threads) * 64 (TLRs, Thread-Level Registers) = 32,768 registers in the register file. Fermi has a mode bit that offers the choice of using 64 KB of SRAM as a 16 KB L1 cache with 48 KB of Local Memory or as a 48 KB L1 cache with 16 KB of Local Memory [85].

_images/threadblock.jpg — Fig. 87 SM select Thread Blocks to run¶

Two levels of scheduling:
- Level 1: Thread Block Scheduler. When an SM executes a thread block, all the threads within the block are executed at the same time. If any thread in a warp is not ready due to operand data dependencies, the scheduler switches context between warps. During a context switch, all the data of the current warp remains in the register file so it can resume quickly once its operands are ready [91].
- Level 2: Warp Scheduler. Manages CUDA threads within the same warp.

// Invoke MATMUL with 256 threads per Thread Block (host code)
int nblocks = (n + 255) / 256;
matmul<<<nblocks, 256>>>(n, A, B, C);
// MATMUL in CUDA (device code; a kernel launched with <<<...>>> is declared __global__)
__global__
void matmul(int n, double *A, double *B, double *C) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) A[i] = B[i] * C[i];  // element-wise multiply of the vectors, as in Fig. 88
}

_images/grid.png — Fig. 88 Mapping 8192 elements of matrix multiplication for Nvidia’s GPU (figure from [69]). SIMT: 16 SIMD threads in one thread block.¶

_images/memory.png — Fig. 89 GPU memory (figure from book [75])¶

The terms are summarized in the table below.

Table 46 More descriptive names for CUDA terms in the Fermi GPU.¶
More Descriptive Name	CUDA term	Structure	Description
Grid	Grid		A Grid is a vectorizable loop, as in Fig. 83.
Thread Block	Thread Block / GPU Core	Each Grid has 16 Thread Blocks.	Each Thread Block is assigned 512 elements of the vectors to work on. SIMD Processors are full processors with separate PCs and are programmed using threads [87]. As in Fig. 88, the 16 Thread Blocks are assigned to 16 SIMD Processors. A CPU core is a processor that can run multiple threads, and a CPU thread is an execution unit with its own PC (Program Counter). Similarly, once a thread block is launched on a multiprocessor (SM), all of its warps are resident until their execution finishes. Thus a new block is not launched on an SM until there is a sufficient number of free registers for all warps of the new block, and until there is enough free shared memory for the new block. By this analogy, a GPU core is a SIMD Processor that holds several SIMD Threads (Warps), and each Warp has its own PC [91].
SIMD Thread (run by SIMD Processor)	Warp (run by Streaming Multiprocessor, SM)	Each SIMD Processor has 16 SIMD Threads.	Each SIMD processor includes local memory, as in Fig. 89. Local memory is shared among SIMD lanes within a SIMD processor but not across different SIMD processors. A warp has its own PC and may correspond to a whole function or part of a function. The compiler and runtime may assign functions to the same or different warps [92].
SIMD Lane	CUDA Thread	Each SIMD Thread has 16 Lanes.	A vertical cut of a thread of SIMD instructions, corresponding to one element executed by one SIMD Lane. Each SIMD instruction is a vector instruction that processes 16 elements at a time. SIMD Lane registers: each Lane has its TLRs (Thread-Level Registers), allocated from the register file (32768 x 32-bit) by the SM, as in Fig. 85.
Chime	Chime	Each SIMD Lane has 2 chimes.	Within one clock cycle of the rest of the chip, 2 data elements are executed per lane on the CUDA cores, as in Fig. 85. The vector length is 32 (32 elements), the number of SIMD Lanes is 16, and the chime is 2. Chimes refer to ALU cycles that run in “ping-pong” mode, as in Fig. 88 for the later Fermi-generation GPUs.

Vertex unit ¶

VAR unit ¶

VAR: Variable Rate Shading unit [88].

Texture unit ¶

As depicted in section OpenGL Shader Compiler.

Speedup Features ¶

Gather-scatter data transfer: Hardware support for sparse vector access is called gather-scatter. The VMIPS instructions are LVI (load vector indexed, or gather) and SVI (store vector indexed, or scatter) [89].
Address Coalescing: This is a feature provided by the GPU, explained as follows.
- Note that unlike vector architectures, GPUs don’t have separate instructions for sequential data transfers, strided data transfers, and gather-scatter data transfers. All data transfers are gather-scatter! To regain the efficiency of sequential (unit-stride) data transfers, GPUs include special Address Coalescing hardware to recognize when the SIMD Lanes within a thread of SIMD instructions are collectively issuing sequential addresses [90].
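The two ideas above can be sketched in C: a gather loop that loads through an index vector, and a check for whether the addresses issued by the lanes are sequential. This is a simplified model under stated assumptions: `gather` and `is_coalescable` are hypothetical names, and real coalescing hardware also considers alignment and memory segment sizes, which are ignored here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Gather: dst[i] = src[index[i]], the access pattern of a load vector indexed. */
void gather(double *dst, const double *src, const size_t *index, size_t n) {
    assert(dst != NULL && src != NULL && index != NULL);
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[index[i]];
}

/* Returns 1 when the per-lane addresses are sequential (unit-stride), which is
   the case that address coalescing hardware can merge into one wide transfer. */
int is_coalescable(const uint64_t *addr, size_t lanes, size_t elem_size) {
    assert(addr != NULL && lanes > 0 && elem_size > 0);
    for (size_t i = 1; i < lanes; ++i)
        if (addr[i] != addr[0] + i * elem_size)
            return 0;
    return 1;
}
```

In a kernel such as daxpy, lane i accesses element i, so the addresses differ by exactly one element size and the check succeeds; an indexed access like the gather loop generally fails it.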

Buffers ¶

In addition to texture units and instructions, GPUs provide different buffers to speed up OpenGL pipeline rendering [42].

Color buffer

They contain the RGB or sRGB color data and may also contain alpha values for each pixel in the framebuffer. There may be multiple color buffers in a framebuffer. You’ve already used double buffering for animation. Double buffering is done by making the main color buffer have two parts: a front buffer that’s displayed in your window; and a back buffer, which is where you render the new image [76].
Depth buffer (Z buffer)

Depth is measured in terms of distance to the eye, so pixels with larger depth-buffer values are overwritten by pixels with smaller values [77] [79] [80].
Stencil Buffer

In the simplest case, the stencil buffer is used to limit the area of rendering (stenciling) [81] [80].
Frame Buffer

OpenGL offers: the color, depth and stencil buffers. This combination of buffers is known as the default framebuffer and as you’ve seen, a framebuffer is an area in memory that can be rendered to [82].
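How the depth buffer and color buffer cooperate for a single pixel can be sketched as follows. This is an illustrative CPU model only: `Color`, `Pixel` and `write_fragment` are hypothetical names, the depth comparison assumes a "smaller is closer" rule with depth values in [0, 1], and blending uses the common source-alpha formula.

```c
#include <assert.h>

typedef struct { float r, g, b, a; } Color;
typedef struct { Color color; float depth; } Pixel;  /* one framebuffer entry */

/* Applies the depth test and optional blending for one incoming fragment.
   Returns 1 if the pixel was updated, 0 if the fragment was rejected. */
int write_fragment(Pixel *px, Color src, float depth, int blend_enabled) {
    assert(px != 0 && depth >= 0.0f && depth <= 1.0f);
    if (depth >= px->depth)
        return 0;                      /* hidden behind what is already stored */
    if (blend_enabled) {
        float a = src.a;               /* combine with the current pixel color */
        px->color.r = src.r * a + px->color.r * (1.0f - a);
        px->color.g = src.g * a + px->color.g * (1.0f - a);
        px->color.b = src.b * a + px->color.b * (1.0f - a);
        px->color.a = a + px->color.a * (1.0f - a);
    } else {
        px->color = src;               /* overwrite the color buffer entry */
    }
    px->depth = depth;                 /* update the depth buffer entry */
    return 1;
}
```

A stencil test would be one more early-exit check in front of the depth comparison, using a per-pixel integer stored next to the depth value.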

General purpose GPU ¶

Since GLSL shaders provide a general way of writing C-like code, a software framework that replaces the OpenGL API can run data-parallel computations on the GPU to speed them up, and can even keep the CPU and GPU executing simultaneously. Furthermore, any language that allows code running on the CPU to poll a GPU shader for return values can be used to create a GPGPU framework [83].

Mapping data in GPU ¶

As described in the previous section on GPUs, the subset of the array calculation y[] = a * x[] + y[] is shown as follows:

// Invoke DAXPY with 256 threads per Thread Block (host code)
int nblocks = (n+255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
// DAXPY in CUDA (device code; a kernel launched with <<<...>>> is declared __global__)
__global__
void daxpy(int n, double a, double *x, double *y) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

name<<<dimGrid, dimBlock>>>(… parameter list …):
- dimGrid: Number of Blocks in Grid
- dimBlock: Number of Threads in each Block (256 in this example)
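To see how the launch configuration covers all n elements, the grid can be emulated on the CPU with two nested loops, one over blocks and one over the threads in a block. This is a sequential sketch of the index arithmetic only; `num_blocks` and `daxpy_host_sim` are hypothetical names, and on a GPU the loop bodies run in parallel.

```c
#include <assert.h>

/* Number of thread blocks needed to cover n elements (ceiling division). */
int num_blocks(int n, int threads_per_block) {
    assert(n >= 0 && threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}

/* Emulates daxpy<<<nblocks, threads_per_block>>>(n, a, x, y) sequentially. */
void daxpy_host_sim(int n, double a, const double *x, double *y,
                    int threads_per_block) {
    int nblocks = num_blocks(n, threads_per_block);
    for (int block = 0; block < nblocks; ++block) {                 /* blockIdx.x */
        for (int thread = 0; thread < threads_per_block; ++thread) { /* threadIdx.x */
            int i = block * threads_per_block + thread;  /* global element index */
            if (i < n)                  /* the last block may be partly idle */
                y[i] = a * x[i] + y[i];
        }
    }
}
```

With n = 1000 and 256 threads per block, four blocks are launched and the last 24 threads of the fourth block fail the `i < n` test, which is why the guard appears in the kernel.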

Assembly code of PTX (from page 300 of Quantitative book)

// code to set VLR, Vector Length Register, to (n % 256)
//   ...
//
shl.u32 R8, blockIdx, 9       ; Thread Block ID * Block size (512)
add.u32 R8, R8, threadIdx     ; R8 = i = my CUDA Thread ID
shl.u32 R8, R8, 3             ; byte offset
setp.neq.s32 P1, RD8, RD3     ; RD3 = n, P1 is predicate register 1
ld.global.f64 RD0, [X+R8]     ; RD0 = X[i]
ld.global.f64 RD2, [Y+R8]     ; RD2 = Y[i]
mul.f64 RD0, RD0, RD4         ; Product in RD0 = RD0 * RD4 (scalar a)
add.f64 RD0, RD0, RD2         ; Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64 [Y+R8], RD0     ; Y[i] = sum (X[i]*a + Y[i])

The VLR must be set if PTX provides such an instruction. Otherwise, a lane mask is set in a way similar to the code below.

__device__
void lane_mask_ex(int i, double *X, double *Y, double *Z) {
  // i is the element index handled by this CUDA thread
  if (X[i] != 0)
    X[i] = X[i] - Y[i];
  else X[i] = Z[i];
}

Assembly code of Vector Processor

LV V1,Rx         ;load vector X into V1
LV V2,Ry         ;load vector Y
L.D F0,#0        ;load FP zero into F0
SNEVS.D V1,F0    ;sets VM(i) to 1 if V1(i)!=F0
SUBVV.D V1,V1,V2 ;subtract under vector mask
SV V1,Rx         ;store the result in X

Assembly code of PTX (modified from the code on pages 208 - 302 of the Quantitative book)

ld.global.f64 RD0, [X+R8]     ; RD0 = X[i]
setp.neq.s32 P1, RD0, #0      ; P1 is predicate register 1
@!P1, bra ELSE1, *Push        ; Push old mask, set new mask bits
                              ; if P1 false, go to ELSE1
ld.global.f64 RD2, [Y+R8]     ; RD2 = Y[i]
sub.f64 RD0, RD0, RD2         ; Difference in RD0
st.global.f64 [X+R8], RD0     ; X[i]=RD0
@P1, bra ENDIF1, *Comp        ; complement mask bits
                              ; if P1 true, go to ENDIF1
ELSE1:
ld.global.f64 RD0, [Z+R8]     ; RD0 = Z[i]
st.global.f64 [X+R8], RD0     ; X[i] = RD0
ENDIF1:
ret, *Pop                     ; pop to restore old mask

For Lane Mask, refer to [94] [95].
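The push, complement and pop of the lane mask can be emulated in C with one bit per lane. This is a sketch under assumptions: `LANES` is set to 16 to match the Fermi description, `masked_if_else` is a hypothetical name, and both paths are executed one after the other over all lanes, with the mask deciding which lanes commit a result.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 16  /* assumed lane count, matching the Fermi description */

/* Executes "if (x[i] != 0) x[i] = x[i] - y[i]; else x[i] = z[i];" for LANES
   elements and returns the mask that was active on the "then" path. */
uint32_t masked_if_else(double *x, const double *y, const double *z) {
    assert(x != 0 && y != 0 && z != 0);
    const uint32_t all = (1u << LANES) - 1u;
    uint32_t mask = 0;

    for (int lane = 0; lane < LANES; ++lane)      /* setp: one predicate bit per lane */
        if (x[lane] != 0.0) mask |= 1u << lane;

    for (int lane = 0; lane < LANES; ++lane)      /* "then" path under the mask */
        if ((mask >> lane) & 1u) x[lane] = x[lane] - y[lane];

    uint32_t else_mask = ~mask & all;             /* *Comp: complement the mask */
    for (int lane = 0; lane < LANES; ++lane)      /* "else" path under the new mask */
        if ((else_mask >> lane) & 1u) x[lane] = z[lane];

    return mask;                                  /* *Pop would restore the old mask */
}
```

The cost model follows directly from the two loops: a divergent branch takes the time of both paths, while a warp whose lanes all agree can skip the path with an empty mask.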

The following table explains how the elements of saxpy() are mapped to the lanes of a SIMD Thread (Warp), which belongs to a Thread Block (Core) within a Grid.

Table 47 Mapping saxpy code to Fig. 88.¶
saxpy()	Instance in Fig. 88	Description
blockDim.x	The size of a Thread Block	blockDim: the number of CUDA Threads in each Thread Block; as configured in Fig. 88, it is 16 (SIMD Threads) * 32 (elements) = 512
blockIdx.x	The index of the Thread Block	blockIdx: the index of the Thread Block within the Grid
threadIdx.x	The index of the element	threadIdx: the index of the CUDA Thread (element) within its Thread Block

With Fermi, each 32-wide thread of SIMD instructions is mapped to 16 physical SIMD Lanes, so each SIMD instruction in a thread of SIMD instructions takes two clock cycles to complete.
You could say that it has 16 lanes, the vector length would be 32, and the chime is 2 clock cycles.
The mapping of y[0..31] = a * x[0..31] + y[0..31] to <Core, Warp, Cuda Thread> of the GPU is shown in the following table. x[0..31] maps to 32 Cuda Threads; two Cuda Threads map to one SIMD lane.

Table 48 Map <Core, Warp> to saxpy¶
	Warp-0	Warp-1	…	Warp-15
Core-0	y[0..31] = a * x[0..31] + y[0..31]	y[32..63] = a * x[32..63] + y[32..63]	…	y[480..511] = a * x[480..511] + y[480..511]
…	…	…	…	…
Core-15	y[7680..7711] = a * …	…	…	y[8160..8191] = a * x[8160..8191] + y[8160..8191]
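The table can be reproduced with integer arithmetic on the element index. The sketch below assumes the Fig. 88 configuration (16 cores, 16 warps per core, 32 CUDA threads per warp, 16 lanes). `Place` and `map_element` are hypothetical names, and the rule that CUDA threads t and t + 16 share lane t is an assumption made for illustration.

```c
#include <assert.h>

typedef struct {
    int core;         /* Thread Block / SIMD Processor, 0..15 */
    int warp;         /* SIMD Thread within the core, 0..15 */
    int cuda_thread;  /* element within the warp, 0..31 */
    int lane;         /* physical SIMD Lane, 0..15 */
    int chime;        /* which of the two passes over the lanes, 0..1 */
} Place;

/* Maps element index i of an 8192-element vector to its place on the GPU. */
Place map_element(int i) {
    assert(i >= 0 && i < 8192);
    Place p;
    p.core = i / 512;                 /* 512 elements per Thread Block */
    p.warp = (i % 512) / 32;          /* 32 elements per SIMD Thread */
    p.cuda_thread = i % 32;
    p.lane = p.cuda_thread % 16;      /* assumed: threads t and t+16 share a lane */
    p.chime = p.cuda_thread / 16;
    return p;
}
```

For example, element 7680 lands on Core-15, Warp-0, and element 8191 on Core-15, Warp-15, matching the first and last cells of the bottom row of the table.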

Each Cuda Thread runs the GPU function code saxpy. Fermi has a register file of size 32768 x 32-bit. As shown in Fig. 85, the number of registers in a Thread Block is: 16 (SIMD Threads, i.e. Warps) * 32 (Cuda Threads) * 64 (TLR, Thread Level Register) = 32768 x 32-bit (Register file).
When mapping to fragments/pixels in a graphics GPU, x[0..15] corresponds to a two-dimensional tile of fragments/pixels at pixel[0..3][0..3], since images use tile-based grouping to cluster similar colors together.

Work between CPU and GPU in Cuda ¶

The previous daxpy() GPU code did not include the host (CPU) side code that triggers the GPU function.

The following example shows the host (CPU) side of a CUDA program that calls saxpy on the GPU [86]:

#include <stdio.h>

__global__
void saxpy(int n, float a, float * x, float * y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

int main(void)
{
  ...
  cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);
  ...
  saxpy<<<(N+255)/256, 256>>>(N, 2.0, d_x, d_y);
  ...
  cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);
  ...
}

The main() function runs on the CPU, while saxpy() runs on the GPU. The CPU copies data from x and y to the corresponding device arrays d_x and d_y using cudaMemcpy.

The saxpy kernel is launched with the following statement:

saxpy<<<(N+255)/256, 256>>>(N, 2.0, d_x, d_y);

This launches the kernel with thread blocks containing 256 threads, and uses integer arithmetic to determine the number of thread blocks needed to process all N elements in the arrays. The expression (N+255)/256 ensures full coverage of the input data.

Using cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost, the CPU can pass data in x and y to the GPU, and retrieve the results back to y.

Since both memory transfers are handled by DMA and do not require CPU operation, the performance can be improved by running CPU and GPU independently, each accessing their own cache.

After the DMA copy from CPU memory to GPU memory, the GPU performs the full vector operation y[] = a * x[] + y[] using a single Grid of threads.

The DMA memcpy copies the data from CPU memory into GPU memory, from which each core loads it into its L1 cache.

Many GPUs support scatter and gather operations to access DRAM efficiently for stream processing tasks [97] [83] [98].

When the GPU function is a dense computation over arrays, such as an MPEG4 encoder or tuning weights in deep learning, it may achieve a large speedup [99]. However, when the GPU function is a simple one such as matrix addition, the CPU idles while waiting for the GPU’s result, and the program may run slower than doing the matrix addition on the CPU alone. Arithmetic intensity is defined as the number of operations performed per word of memory transferred. It is important for GPGPU applications to have high arithmetic intensity; otherwise, memory access latency will limit the computational speedup [83].
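
As a rough worked example of arithmetic intensity, the Python sketch below counts operations and memory words under a simplified model: each input word is read from memory once, each output word is written once, and all reuse happens on chip. Both this counting model and the choice of dense matrix multiplication as the high-intensity case are assumptions made for illustration, and the function names are hypothetical.

```python
def saxpy_intensity(n):
    """y[i] = a*x[i] + y[i] over n elements."""
    ops = 2 * n      # one multiply and one add per element
    words = 3 * n    # read x[i], read y[i], write y[i]
    return ops / words


def matrix_add_intensity(n):
    """C = A + B for n x n matrices."""
    ops = n ** 2         # one add per output element
    words = 3 * n ** 2   # read A and B, write C
    return ops / words


def matmul_intensity(n):
    """C = A * B for n x n matrices, assuming ideal on-chip reuse."""
    ops = 2 * n ** 3     # n multiplies and n adds per output element
    words = 3 * n ** 2   # read A and B once, write C once
    return ops / words
```

Under this model, saxpy and matrix addition stay below one operation per word regardless of size, while the intensity of matrix multiplication grows linearly with n (200 operations per word for n = 300). This is consistent with the observation that matrix addition gains little from being offloaded.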

The Wikipedia article [100] lists GPU-accelerated applications as follows:

General Purpose Computing on GPU has found its way into fields as diverse as machine learning, oil exploration, scientific image processing, linear algebra, statistics, 3D reconstruction and even stock options pricing determination. In addition, the section “GPU accelerated video decoding and encoding” on video compression [100] gives more applications of GPU acceleration.

Table 49 The differences for speedup in architecture of CPU and GPU¶
Item	CPU	GPU
Application	Non-data parallel	Data parallel
Architecture	SISD, small vector (e.g. 4*32 bits)	Large SIMD (e.g. 16*32 bits)
Cache	Smaller and faster	Larger and slower (see the following Note)
ILP	Pipeline	Pipeline
	Superscalar, SMT	SIMT
	Super-pipeline	
Core	Fewer threads for SMT (2 or 4)	More threads (16 or 32)
Branch	Conditional-instructions	Mask & conditional-instructions

Note

GPU-Cache

In theory, for data-parallel applications using GPU’s SMT, the GPU can schedule more threads and aims for throughput rather than speedup of a single thread, as seen in SISD on CPUs.

However, in practice, GPUs provide only a small L1 cache, similar to CPUs, and handle cache misses by scheduling another thread.

As a result, GPUs often lack L2 and L3 caches, which are common in CPUs with deeper cache hierarchies.

Volta (Cuda thread/SIMD lane with PC, Program Counter and Call Stack)¶

One way the compiler handles branch divergence within a warp is to keep executing instructions in order: if some threads do not have to execute certain instructions, it switches those threads off, then turns them back on for their relevant instructions while switching off the other threads. This process is called masking.
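
Masking can be illustrated with a small host-side Python model of one warp executing an if/else. It is a sketch only: the branch body (halve even values, otherwise compute 3v+1) and the name `run_warp_with_mask` are made up for this example, and real hardware applies the mask per instruction rather than per Python loop.

```python
def run_warp_with_mask(values):
    """Execute `if v % 2 == 0: v //= 2 else: v = 3*v + 1` on every lane."""
    lanes = list(values)
    # The condition is evaluated once per lane and becomes the mask.
    mask = [v % 2 == 0 for v in lanes]
    trace = []

    # Pass 1: the "then" path runs; lanes whose mask bit is False are off.
    for lane, active in enumerate(mask):
        if active:
            lanes[lane] //= 2
    trace.append(("then", mask))

    # Pass 2: the mask is inverted and the "else" path runs.
    inverted = [not m for m in mask]
    for lane, active in enumerate(inverted):
        if active:
            lanes[lane] = 3 * lanes[lane] + 1
    trace.append(("else", inverted))

    return lanes, trace
```

Both paths are walked by the whole warp, so the cost is the sum of the two paths even though each lane only needs one of them. For the input `[4, 7, 10, 1]` the lanes end as `[2, 22, 5, 4]`.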

_images/pre-volta-1.png — Fig. 90 SIMT Warp Execution Model of Pascal and Earlier GPUs [93]¶

_images/volta-1.png — Fig. 91 Volta Warp with Per-Thread Program Counter and Call Stack [93]¶

Starting with Nvidia’s Volta GPU, each thread in a warp has its own program counter (PC), as shown in Fig. 91.

//
__device__ void insert_after(Node *a, Node *b)
{
  Node *c;
  lock(a); lock(a->next);
  ...
  unlock(c); unlock(a);
}

Volta’s independent thread scheduling allows the GPU to yield execution of any thread, either to make better use of execution resources or to allow one thread to wait for data to be produced by another. As in the above example [93], each thread can progress with its own PC. Therefore, different threads in the same warp can run insert_after() independently, without the whole warp waiting on lock().
Volta thus provides both the efficiency of executing threads in groups and independent thread progression.

Besides letting each thread in the same Warp progress independently as above, Volta includes a schedule optimizer which, to maximize parallel efficiency, determines how to group active threads from the same warp together into SIMT units. This retains the high throughput of SIMT execution as in prior NVIDIA GPUs, but with much more flexibility: threads can now diverge and reconverge at sub-warp granularity, while the convergence optimizer in Volta will still group together threads which are executing the same code and run them in parallel for maximum efficiency. In Cuda applications, this feature gives programmers more opportunities for parallelism through __syncwarp(), as shown in Fig. 92.

_images/volta-2.png — Fig. 92 Programs use Explicit Synchronization to Reconverge Threads in a Warp [93]¶

OpenCL, Vulkan and spir-v ¶

digraph G {
rankdir=LR;

compound=true;
node [shape=record];
SW_LAYER [label="{ GLSL | OpenCL } | SPIR-V | GPU machine code"];
} — Fig. 93 OpenCL and GLSL(OpenGL)¶

Table 50 OpenCL and OpenGL SW system¶
Name of SW	GPU language	Level of GPU language
OpenCL	OpenCL	C99 dialect (with C pointer, …)
OpenGL	GLSL	C-like (no C pointer, …)
Vulkan	SPIR-V	IR

_images/opencl-to-spirv-offine-compilation.png — Fig. 94 Offline Compilation of OpenCL Kernels into SPIR-V Using Open Source Tooling [102]¶

clang: Compiles OpenCL to spirv for the runtime+driver. Alternatively, it compiles OpenCL to llvm IR, and the “SPIR-V LLVM Translator” then translates the llvm IR to spirv for the runtime+driver.
clspv: Compiles OpenCL to spirv for the runtime.

digraph G {
rankdir=LR;

node [shape=record];

glslang [style=filled,fillcolor=green];
spirv_cross [label="spirv-cross",style=filled,fillcolor=green];
glsl -> glslang -> spirv;
glsl -> spirv_cross -> spirv [dir="back"];
} — Fig. 95 Conversion between glsl and spirv¶

Comparison of OpenCL and OpenGL’s compute shader.
- Same:
  
  Both are for general-purpose computing on the GPU.
- Difference:
  
  OpenCL targets GPUs and other accelerator devices/processors. OpenCL uses the C language on the Device and C++ on the Host, based on the OpenCL runtime. A compute shader is written in the GLSL shader language, runs in the OpenGL graphics environment, and can easily integrate with and access the data of the OpenGL API [101].
OpenGL/GLSL vs Vulkan/spir-v.
- High level of API and shader: OpenGL, GLSL.
- Low level of API and shader: Vulkan, spir-v.

Although the OpenGL API is a higher-level API with many advantages, as described in the sections above, it sometimes cannot compete in efficiency with Direct3D, which provides lower-level APIs that let the user program manage memory [103]. The Vulkan API is a lower-level C/C++ API that fills this gap, allowing user programs to do these things and so compete against Microsoft Direct3D. Here is an example [104]. Meanwhile, GLSL is a C-like language. The Vulkan infrastructure provides a tool, glslangValidator [105], to compile GLSL off-line into an Intermediate Representation (IR) called spir-v. As a result, it saves part of the on-line compilation time from GLSL to GPU instructions, since spir-v is an IR at a level close to llvm IR [106]. In addition, the Vulkan API reduces the GPU driver’s effort in optimization and code generation [103]. These standards give programmers the option of using Vulkan/spir-v instead of OpenGL/GLSL, and allow them to pre-compile GLSL into spir-v off-line to save part of the on-line compilation time.

With the Vulkan and spir-v standards, the GPU can be used in OpenCL for Parallel Programming of Heterogeneous Systems [107] [108]. Similar to Cuda, an OpenCL example for the fast Fourier transform (FFT) is here [109]. Once OpenCL grows into a popular standard, with more computer languages or frameworks supporting the OpenCL language, the GPU will take more jobs from the CPU [110].

Most GPUs have 16 or 32 lanes in a SIMD processor (Warp). Vulkan provides Subgroup operations for data-parallel programming on the lanes of a SIMD processor [111].

Subgroup operations provide a fast way to move data between lanes within a Warp. Assume each Warp has four lanes. The following table lists the results of the reduce, inclusive and exclusive operations.

Table 51 Each lane’s value after the Reduce, Inclusive and Exclusive operations, respectively¶
Lane	0	1	2	3
Initial value	a	b	c	d
Reduce	OP(abcd)	OP(abcd)	OP(abcd)	OP(abcd)
Inclusive	OP(a)	OP(ab)	OP(abc)	OP(abcd)
Exclusive	not defined	OP(a)	OP(ab)	OP(abc)

Reduce: e.g. subgroupAdd. Inclusive: e.g. subgroupInclusiveAdd. Exclusive: e.g. subgroupExclusiveAdd.
For example:
- ADD operation: OP(abcd) = a+b+c+d.
- MAX operation: OP(abc) = MAX(a,b,c).
When lane i is inactive, its value is excluded from the operation.
- For instance, if lane 0 is inactive, then for the MUL operation: OP(abcd) = b*c*d.
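
The three rows of Table 51 can be reproduced with a host-side Python model. This is a sketch of the semantics described above, not of the Vulkan implementation: the function names are hypothetical, inactive lanes are represented as `None`, and the first active lane of the exclusive scan is also left as `None` to match the "not defined" entry in the table.

```python
from functools import reduce as fold
from operator import add, mul


def subgroup_reduce(op, values, active=None):
    """Every active lane receives OP over all active lanes."""
    if active is None:
        active = [True] * len(values)
    live = [v for v, on in zip(values, active) if on]
    total = fold(op, live) if live else None
    return [total if on else None for on in active]


def subgroup_inclusive(op, values, active=None):
    """Lane i receives OP over the active lanes 0..i."""
    if active is None:
        active = [True] * len(values)
    out, acc, started = [], None, False
    for v, on in zip(values, active):
        if not on:
            out.append(None)
            continue
        acc = op(acc, v) if started else v
        started = True
        out.append(acc)
    return out


def subgroup_exclusive(op, values, active=None):
    """Lane i receives OP over the active lanes before i."""
    if active is None:
        active = [True] * len(values)
    out, acc, started = [], None, False
    for v, on in zip(values, active):
        if not on:
            out.append(None)
            continue
        out.append(acc if started else None)  # first active lane: not defined
        acc = op(acc, v) if started else v
        started = True
    return out
```

With the lane values `[1, 2, 3, 4]` and ADD, the model gives `[10, 10, 10, 10]` for reduce, `[1, 3, 6, 10]` for inclusive and `[None, 1, 3, 6]` for exclusive. With lane 0 inactive, `subgroup_reduce(mul, [5, 2, 3, 4], [False, True, True, True])` gives 24 on the three active lanes, which is b*c*d.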

The following is a code example.

An example of subgroup operations in glsl for vulkan

vec4 sum = vec4(0, 0, 0, 0);
if (gl_SubgroupInvocationID < 16u) {
  sum += subgroupAdd(in[gl_SubgroupInvocationID]);
}
else {
  sum += subgroupInclusiveMul(in[gl_SubgroupInvocationID]);
}
subgroupMemoryBarrier();

Nvidia’s GPUs provide __syncwarp(), which subgroupMemoryBarrier() or the compiler can use to synchronize the lanes in the same Warp.

To let the lanes in the same SIMD processor work efficiently, data uniformity analysis provides many optimization opportunities in register allocation, transformation and code generation [112].

LLVM IR expansion from CPU to GPU is becoming increasingly influential. In fact, LLVM IR has been expanding steadily from version 3.1 until now, as I have observed.

Accelerate ML/DL on OpenCL/SYCL ¶

_images/opencl_ml_graph.png — Fig. 96 Implement ML graph scheduler both on compiler and runtime¶

As shown in Fig. 96, the Device, such as a GPU or a CPU+NPU, is capable of running the entire ML graph. However, if the Device has only an NPU, then operations like Avg-Pool, which require CPU support, must run on the Host side. This introduces communication overhead between the Host and the Device.

Similar to OpenGL shaders, the “kernel” function may be compiled either on-line or off-line and then sent to the GPU as a programmable function.

In order to run ML (Machine Learning) efficiently, all platforms for ML on GPU/NPU implement scheduling SW in both the graph compiler and the runtime. If OpenCL can be extended to support ML graphs, then graph compilers such as TVM, or open-source runtimes, have a chance to leverage the scheduling SW effort of programmers [113]. Cuda graph is an idea like this [114] [115].

SYCL: Uses C++ templates to optimize and generate code for OpenCL and Cuda. It provides a consistent language, APIs, and ecosystem in which to write and tune code for different accelerator architectures: CPUs, GPUs, and FPGAs [116].
- SYCL uses generic programming with templates and generic lambda functions to enable higher-level application software to be cleanly coded with optimized acceleration of kernel code across an extensive range of acceleration backend APIs, such as OpenCL and CUDA [117].

_images/sycl.png — Fig. 97 SYCL = C++ template and compiler for Data Parallel Applications on AI on CPUs, GPUs and FPGAs.¶

DPC++ (OneDPC) compiler: Based on SYCL, DPC++ can compile the DPC++ language for both CPU host and GPU device. DPC++ (Data Parallel C++) is a language developed by Intel and may be adopted into standard C++. The GPU-side (kernel code) is written in C++ but does not support exception handling [118] [119].
- Features of Kernel Code:
  - Not supported:
    
    Dynamic polymorphism, dynamic memory allocations (therefore no object management using new or delete operators), static variables, function pointers, runtime type information (RTTI), and exception handling. No virtual member functions, and no variadic functions, are allowed to be called from kernel code. Recursion is not allowed within kernel code.
  - Supported:
    
    Lambdas, operator overloading, templates, classes, and static polymorphism [120].

Open Sources ¶

[73]

The SIMD Thread Scheduler includes a scoreboard that lets it know which threads of SIMD instructions are ready to run, and then it sends them off to a dispatch unit to be run on the multithreaded SIMD Processor. It is identical to a hardware thread scheduler in a traditional multithreaded processor (see Chapter 3), just that it is scheduling threads of SIMD instructions. Thus, GPU hardware has two levels of hardware schedulers: (1) the Thread Block Scheduler that assigns Thread Blocks (bodies of vectorized loops) to multithreaded SIMD Processors, which ensures that thread blocks are assigned to the processors whose local memories have the corresponding data, and (2) the SIMD Thread Scheduler within a SIMD Processor, which schedules when threads of SIMD instructions should run. Book Figure 4.14 of Computer Architecture: A Quantitative Approach, 5th edition (The Morgan Kaufmann Series in Computer Architecture and Design).