The concept of GPU compiler

Basically, each CPU core is a SISD (Single Instruction, Single Data) architecture. The multimedia instructions in a CPU are a small-scale form of SIMD (Single Instruction, Multiple Data), while a GPU is a large-scale SIMD processor, coloring millions of pixels of an image in a few milliseconds. Since 2D and 3D graphics processing offers a large opportunity for data parallelism, GPU hardware from NVIDIA and other manufacturers usually contains tens of thousands of functional units per chip.

This chapter first gives an overview of how 3D animation is created and run on CPU+GPU. After that, it introduces the concept of a GPU compiler and the HW features for graphics applications. Finally, it explains how the GPU has taken over more applications from the CPU through the GPGPU concept and the related standards that emerged.

Website, Basic theory of 3D graphics with OpenGL, [1].

Concept in graphic and system

3D modeling

By covering the skin of the model with triangles or quads, the 3D model is created as a polygon mesh [2] formed by all the vertices, as in the first image of Fig. 60.

_images/modeling1.png

Fig. 60 Creating 3D model and texturing

After the next step, smooth shading [2], the vertices and edge lines are covered with color (or rather the edges are removed; an edge never actually has a black line), and the model looks much smoother [3]. Further, after texturing (texture mapping), the model looks more realistic [4].

To get to know how a 3D model is animated, please watch the video here [5]. As the video on skeleton animation shows, the joints are set to positions for different poses and each pose (keyframe) is given a time, as in Fig. 61.

_images/animation.png

Fig. 61 Set time point at keyframes

In this series of videos, you find that the 3D modeling tools generate Java code, instead of C/C++ code, calling the OpenGL API and shaders. This is because Java can call the OpenGL API through a wrapper library [6].

Every CAD software manufacturer such as AutoDesk and Blender has their own proprietary format. To solve the problem of interoperability, neutral or open source formats were invented as intermediate formats for converting between two proprietary formats. Naturally, these formats have become hugely popular now. Two famous examples of neutral formats are STL (with a .STL extension) and COLLADA (with a .DAE extension). Here is the list, where the 3D file formats are marked with their type.

Table 43 3D file formats [7]

3D file format | Type
STL            | Neutral
OBJ            | ASCII variant is neutral, binary variant is proprietary
FBX            | Proprietary
COLLADA        | Neutral
3DS            | Proprietary
IGES           | Neutral
STEP           | Neutral
VRML/X3D       | Neutral

The four key features a 3D file can store include the model’s geometry, the model’s surface texture, scene details, and animation of the model [7].

Specifically, they can store details about four key features of a 3D model, though it’s worth bearing in mind that you may not always take advantage of all four features in all projects, and not all file formats support all four features!

3D printer applications do not need to support animation. CAD and CAM applications, such as designing an airplane, do not need the scene-details feature.

DAE (Collada) appeared in the video animation above. Collada files belong to a neutral format used heavily in the video game and film industries. It’s managed by the non-profit technology consortium, the Khronos Group.

The file extension for the Collada format is .dae. The Collada format stores data using the XML mark-up language.

The original intention behind the Collada format was to become a standard among 3D file formats. Indeed, in 2013, it was adopted by ISO as a publicly available specification, ISO/PAS 17506. As a result, many 3D modeling programs support the Collada format.

That said, the consensus is that the Collada format hasn’t kept up with the times. It was once used heavily as an interchange format for Autodesk Max/Maya in film production, but the industry has now shifted more towards OBJ, FBX, and Alembic [7].

Graphic HW and SW stack

The block diagram of a Graphics Processing Unit (GPU) is shown in Fig. 62.

_images/gpu-block-diagram.png

Fig. 62 Components of a GPU: GPU has accelerated video decoding and encoding [8]

The roles of the CPU and GPU in graphics animation are shown in Fig. 63.

_images/graphic-cpu-gpu.png

Fig. 63 OpenGL and Vulkan are both rendering APIs. In both cases, the GPU executes shaders, while the CPU executes everything else [9].

  • The GPU can’t directly read user input from, say, a keyboard, mouse, or gamepad, nor play audio, load files from a hard drive, or anything like that. Therefore the GPU cannot handle the animation work by itself [10].

  • A graphics driver consists of an implementation of the OpenGL state machine and a compilation stack to compile the shaders into the GPU’s machine language. This compilation, as well as pretty much anything else, is executed on the CPU, then the compiled shaders are sent to the GPU and are executed by it. (SDL = Simple DirectMedia Layer) [11].

_images/graphic-gpu-csf.png

Fig. 64 MCU and specific HW circuits to speed up the processing of the CSF (Command Stream Frontend) [12].

The GPU driver writes commands and data from the CPU to the GPU’s memory through PCIe. These commands are called the Command Stream Frontend (CSF) in the memory of the GPU. A GPU chip includes tens of SIMD processors (cores). In order to speed up the GPU driver’s processing, the CSF is designed in a simple form. As a result, the GPU chip includes an MCU (Microcontroller Unit) and specific HW to transform the CSF into the individual data structures that each SIMD processor executes, as in Fig. 64. The firmware of the MCU is usually updated by the MCU itself.

The driver runs on the CPU side, as in Fig. 65. The OpenGL API eventually calls the driver’s functions, and the driver finishes the work by issuing GPU-HW commands and/or sending data to the GPU. Even so, the GPU’s render work, which starts from the 3D vertices, colors, … sent from the CPU and stored in GPU memory or shared memory, consumes more computing power than the CPU’s part.

digraph G {
  rankdir=LR;
  
  compound=true;
  node [shape=record];
  subgraph cluster_cpu {
    label = "CPU (Client)";
    CPU_SW [label=" 3D Model | JAVA | JOGL | { OpenGL API | Shaders \n (builtin-functions)} | <f1> Driver"];
  }
  subgraph cluster_gpu {
    label = "GPU HW (Server)"
    GPU_SW [label="<f1> 3D Rendering-pipeline \ndescribed in next section"];
  }
  CPU_SW:f1 -> GPU_SW:f1 [label=" VAO, texture, ..., from 3D model, \n shader-executable-code"];
    
//  label = "Graphic SW Stack";
}

Fig. 65 Graphic SW Stack

  • As described in the previous section, after the user creates a skeleton and skin for each model and sets keyframe times through a 3D modeling tool, the tool can either generate Java code calling JOGL (Java OpenGL) [6], or generate OpenGL API calls directly. The frame data can be calculated by interpolation between keyframes (see the interpolation sketch after this list).

  • As above, for every animation the client CPU program sets the new positions of objects (vertices) and colors, the data of one frame, and the server (driver and GPU) does the 3D-to-2D rendering. Higher-level libraries and frameworks on top of OpenGL provide animation frameworks and tools to generate the OpenGL API calls and shaders from a 3D model.

  • A shader may call builtin functions which are written in Compute Shader, SPIR-V or LLVM-IR. LLVM libclc is a project for the builtin functions of OpenCL which can be used in OpenGL too [13]. As with a CPU’s builtin functions, a new GPU ISA/architecture has to implement its builtin functions or port them from open source such as libclc.

  • The 3D modeling software (on the CPU) does the animation rendering that generates each frame between keyframes (poses), while the GPU runs the rendering pipeline from each frame down to each pixel’s value.

  • These frame data exist in the form of VAOs (Vertex Array Objects) in OpenGL. This will be explained in the later section OpenGL.

  • In addition, OpenGL provides vertex buffer object (VBO) allowing vertex array data to be stored in high-performance graphics memory on the server side and promotes efficient data transfer [14] [15].

  • 3D animation SW provides a lot of builtin shaders. Programmers can also write their own shaders for the game engine.
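The interpolation between keyframes mentioned above can be sketched in a few lines. This is a minimal sketch, assuming a hypothetical Keyframe struct that holds one joint position; real tools interpolate full skeleton transforms (often with quaternions), not a single point.

struct Vec3 { float x, y, z; };

struct Keyframe {
  float time;   // time of this pose in seconds
  Vec3 joint;   // position of one joint at this pose
};

// Linear interpolation of one joint position at time t, with a.time <= t <= b.time.
Vec3 lerpJoint(const Keyframe &a, const Keyframe &b, float t) {
  float w = (t - a.time) / (b.time - a.time);   // 0.0 at keyframe a, 1.0 at keyframe b
  return { a.joint.x + w * (b.joint.x - a.joint.x),
           a.joint.y + w * (b.joint.y - a.joint.y),
           a.joint.z + w * (b.joint.z - a.joint.z) };
}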

The flow of 3D/2D graphics processing is shown in Fig. 66.

digraph G {
  rankdir=LR;

  compound=true;
  node [shape=record];
  subgraph cluster_3d {
    label = "3D/2D modeling software";
    subgraph cluster_code {
      label = "3D/2D's code: engine, lib, shader, ...";
      Api [label="<g> Generated Code | <a> OpenGL API | lib | <s> Shaders (3D animation's shaders \n or programmer-written shaders)"];
    }
  }
  subgraph cluster_driver {
    label = "Driver"
    Compiler [label="On-line Compiler"];
    Obj [label="obj"];
    Linker [label="On-line binding (Linker)"];
    Exe [label="exe"];
  }
  Api:a -> Obj [lhead ="cluster_driver"];
  Api:s -> Compiler;
  Compiler -> Obj -> Linker -> Exe;
  Exe -> GPU;
  Exe -> CPU [ltail ="cluster_driver"]; 

//  label = "OpenGL Flow";
}

Fig. 66 OpenGL Flow

_images/db-vsync.png

Fig. 67 VSync

VSync

No tearing: the GPU and display run at the same refresh rate, since the GPU renders faster than the display refreshes.

              A    B

GPU      | ----| ----|

Display  |-----|-----|

            B      A

Tearing: the GPU renders at the display's refresh rate, but the frame is ready one VSync cycle late.

              A

GPU      | -----|

Display  |-----|-----|

            B      A

Avoiding tearing: the GPU's refresh rate is 1/2 of the display's refresh rate.

              A          B

GPU      | -----|    | -----|

Display  |-----|-----|-----|-----|

            B      B    A     A
  • Double Buffering

    While the display is reading from the frame buffer to display the current frame, we might be updating its contents for the next frame (not necessarily in raster-scan manner). This would result in the so-called tearing, in which the screen shows parts of the old frame and parts of the new frame. This could be resolved by using so-called double buffering. Instead of using a single frame buffer, modern GPU uses two of them: a front buffer and a back buffer. The display reads from the front buffer, while we can write the next frame to the back buffer. When we finish, we signal to GPU to swap the front and back buffer (known as buffer swap or page flip).

  • VSync

    Double buffering alone does not solve the entire problem, as the buffer swap might occur at an inappropriate time, for example, while the display is in the middle of displaying the old frame. This is resolved via the so-called vertical synchronization (or VSync) at the end of the raster-scan. When we signal to the GPU to do a buffer swap, the GPU will wait till the next VSync to perform the actual swap, after the entire current frame is displayed.

    As the text diagrams above show, the most important point is: when the VSync buffer swap is enabled, you cannot refresh the display faster than the refresh rate of the display! If the GPU is capable of producing higher frame rates than the display’s refresh rate, the GPU can run at that faster rate without tearing. If the GPU’s frame rate is the same as or lower than the display’s and your application refreshes at a fixed rate, the resulting refresh rate is likely to be an integral factor of the display’s refresh rate, i.e., 1/2, 1/3, 1/4, etc. Otherwise it will cause tearing [1] (see the swap-interval sketch after this list).

  • NVIDIA G-SYNC and AMD FreeSync

    If the monitor and graphics card in your computer both support NVIDIA G-SYNC, you’re in luck. With this technology, a special chip in the display communicates with the graphics card. This lets the monitor vary the refresh rate to match the frame rate of the NVIDIA GTX graphics card, up to the maximum refresh rate of the display. This means that frames are displayed as soon as they are rendered by the GPU, eliminating screen tearing and reducing stutter whether the frame rate is higher or lower than the refresh rate of the display. This makes it perfect for situations where the frame rate varies, which happens a lot when gaming. Today, you can even find G-SYNC technology in gaming laptops!

    AMD has a similar solution called FreeSync. However, this doesn’t require a proprietary chip in the monitor. In FreeSync, the AMD Radeon driver, and the display firmware handle the communication. Generally, FreeSync monitors are less expensive than their G-SYNC counterparts, but gamers generally prefer G-SYNC over FreeSync as the latter may cause ghosting, where old images leave behind artifacts [16].
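In code, VSync is typically requested through the windowing library. The following is a minimal sketch with GLFW, mirroring the main loop of the triangle example later in this chapter; glfwSwapInterval(1) asks that glfwSwapBuffers wait for one vertical retrace before swapping.

// After creating the context, request VSync: swap at most once per display refresh.
glfwMakeContextCurrent(window);
glfwSwapInterval(1);               // 0 disables VSync; 1 waits for one vertical retrace

while (!glfwWindowShouldClose(window)) {
    display();
    glfwSwapBuffers(window);       // with interval 1, blocks until the next VSync
    glfwPollEvents();
}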

Basic geometry in computer graphics

This section introduces the basic geometry math for computer graphics. The complete treatment can be found in the book “Computer Graphics: Principles and Practice, 3rd Edition” by John F. Hughes et al. But it is over a thousand pages; the book is very complete and may take much time to understand in every detail.

Color

  • Additive colors in light are shown in Fig. 68 [17] [18]. In paints, mixing adds shade (dark color), so the result becomes light grey instead [19].

_images/additive-colors.png

Fig. 68 Additive colors in light

Note

Additive colors

I know it does not match human intuition. However, the additive RGB colors in light combine to pure white light, while the additive RGB colors in paints combine to a light grey paint; this is reasonable since light has no shade. The result comes from how human eyes sense light: when there is no light, no color can be sensed. Computer engineers should know that if you try to explore the underlying nature, it belongs to the fields of physics or the biology of the human eye.

Transformation

Objects (triangles/quads) can be moved in 2D/3D with the matrix representations in the wiki here [20]. The rotation matrix there is derived in the wiki here [21].
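For reference, the standard 2D rotation matrix, which rotates a point counter-clockwise by an angle \(\theta\) about the origin, is:

\[\begin{split}R(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}, \qquad \begin{bmatrix} x' \\ y' \end{bmatrix} = R(\theta) \begin{bmatrix} x \\ y \end{bmatrix}\end{split}\]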

Every computer graphics book covers the transformation of objects and positions in space. Chapter 4 of the Blue Book, OpenGL SuperBible 7th Edition, gives a short description (40 pages) of the useful concepts and is good material for learning them. The following rules for the Quaternion Product (Hamilton product) are given from the Wiki [22], since the book omits them.

\[\mathbf{i}\mathbf{j} = -\mathbf{j}\mathbf{i} = \mathbf{k}, \quad \mathbf{j}\mathbf{k} = -\mathbf{k}\mathbf{j} = \mathbf{i}, \quad \mathbf{k}\mathbf{i} = -\mathbf{i}\mathbf{k} = \mathbf{j}.\]
_images/trans-steps.png

Fig. 69 Coordinate Transform Pipeline [1]

Details for Fig. 69 are on the website [1].

Projection

_images/ViewFrustum.png

Only objects in the frustum between the near and far planes are projected to 2D in perspective projection.

Perspective projection and orthographic projection (used in CAD tools) from 3D to 2D can both be represented by transformation matrices as in the previous section [23].
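As a simplified sketch (assuming the eye at the origin looking down the \(-z\) axis and a near plane at distance \(n\), as in OpenGL eye space), perspective projection scales a point by its depth using similar triangles:

\[x_p = \frac{n\,x}{-z}, \qquad y_p = \frac{n\,y}{-z}\]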

Cross product

Both triangles and quads are polygons, so objects can be formed from polygons in both 2D and 3D. Almost every computer graphics book already covers transformations in 2D and 3D well. This section introduces the most important concept and method for deciding the inner and outer sides of planes, so that a point or object can be checked for showing or hiding during 2D or 3D rendering.

The area of any polygon can be calculated by dividing it into triangles or quads, and the area of any triangle or quad can be calculated with the cross product in 3D. The cross product in 3D is defined by the following formula and can be represented with matrix notation as proved here [24].

\[\mathbf{a} \times \mathbf{b} = \Vert \mathbf{a} \Vert \, \Vert \mathbf{b} \Vert \sin(\theta) \, \mathbf{n}\]
\[\begin{split}\mathbf{a} \times \mathbf{b} = \begin{vmatrix} \mathbf{i} & \mathbf{j} & \mathbf{k}\\ a_1 & a_2 & a_3\\ b_1 & b_2 & b_3 \end{vmatrix}\end{split}\]

The cross product in 2D is defined by the formula and can be represented with matrix notation as proved here [25] [26].

\[\mathbf{a} \times \mathbf{b} = \Vert \mathbf{a} \Vert \, \Vert \mathbf{b} \Vert \sin(\theta)\]
\[\begin{split}\mathbf{a} \times \mathbf{b} = \begin{vmatrix} \mathbf{i} & \mathbf{j} & \mathbf{k}\\ a_1 & a_2 & 0\\ b_1 & b_2 & 0 \end{vmatrix} = \begin{vmatrix} a_1 & a_2 \\ b_1 & b_2 \end{vmatrix}\end{split}\]

After the above matrix form is proved, the antisymmetry can be proved as follows,

\[\begin{split}\mathbf{a} \times \mathbf{b} = \begin{vmatrix} a_1 & a_2 \\ b_1 & b_2 \end{vmatrix} = a_1 b_2 - a_2 b_1 = -\big(b_1 a_2 - b_2 a_1\big) = -\begin{vmatrix} b_1 & b_2 \\ a_1 & a_2 \end{vmatrix} = -\,\mathbf{b} \times \mathbf{a}\end{split}\]

In 2D, any two points \(P_i\) and \(P_{i+1}\) form a vector, which decides an inner side and an outer side. For example, as in Fig. 70, \(\Theta\) is the angle from \(P_iP_{i+1}\) to \(P_iP'_{i+1}\), which is \(180^\circ\). So, with the right-hand rule and counter-clockwise order, for any \(P_iQ\) between \(P_iP_{i+1}\) and \(P_iP'_{i+1}\), the angle from \(P_iP_{i+1}\) to \(P_iQ\) is \(\theta, 0^\circ < \theta < 180^\circ\), and the inward direction is thereby decided.

_images/inward-edge-normals.png

Fig. 70 Inward edge normals

_images/2d-vector-inward.png

Fig. 71 Inward and outward in 2D for a vector.

Based on this observation, the rule for inward and outward relative to any vector is as in Fig. 70: facing in the same direction as the vector, the left side is inward and the right side is outward, as in Fig. 71.

For each edge \(P_i P_{i+1}\), the inward edge normal is the vector \(\mathsf x\; v_i\); the outward edge normal is \(-\;\mathsf x\; v_i\), where \(\mathsf x\; v_i\) is cross-product(\(\mathsf v_i\)) as in Fig. 70.

A polygon can be created from vertices. Suppose that \((P_0, P_1, ..., P_n)\) is a polygon. The line segments \(P_0P_1, P_1P_2\), etc., are the edges of the polygon; the vectors \(v_0 = P_1 - P_0, v_1 = P_2 - P_1, ..., v_n = P_0 - P_n\) are the edge vectors of the polygon. Going counter-clockwise, the left side is inward, so the inward region of the polygon can be decided.

For a convex polygon whose vertices are listed in counter clockwise order, the inward edge normals point toward the interior of the polygon, and the outward edge normals point toward the unbounded exterior of the polygon, corresponding to our ordinary intuition. But if the vertices of a polygon are given in clockwise order, the interior and exterior swap roles.

This cross product has an important property: going from v to ×v involves a rotation by 90° in the same direction as the rotation that takes the positive x-axis to the positive y-axis.
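A minimal sketch of this sign test in code (hypothetical helper names, not from any library): the 2D cross product gives the signed area, and its sign tells which side of an edge a point lies on.

// 2D cross product: signed area of the parallelogram spanned by a and b.
float cross2d(float ax, float ay, float bx, float by) {
  return ax * by - ay * bx;
}

// Which side of edge P0->P1 does point Q lie on?
//   > 0 : left side (inward for a counter-clockwise polygon)
//   < 0 : right side (outward)
//   = 0 : on the edge line
float sideOfEdge(float p0x, float p0y, float p1x, float p1y, float qx, float qy) {
  return cross2d(p1x - p0x, p1y - p0y, qx - p0x, qy - p0y);
}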

_images/polygon.png

Fig. 72 Draw a polygon with vectices counter clockwise

As in Fig. 72, when drawing a polygon with vectors (lines) counter-clockwise, the polygon is created and the two sides of a vector (line) can be identified [27]. Further, whether a point is inside or outside the polygon can be identified. One simple way of finding whether the point is inside or outside a simple polygon is to test how many times a ray, starting from the point and going in any fixed direction, intersects the edges of the polygon. If the point is on the outside of the polygon the ray will intersect its edges an even number of times; if the point is on the inside of the polygon then it will intersect the edges an odd number of times [28].
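The following is a minimal sketch of that even-odd ray-casting test, assuming a hypothetical Point struct; it casts a ray in the +x direction and counts edge crossings.

#include <vector>

struct Point { float x, y; };

// Even-odd rule: cast a ray from q in the +x direction and count how many
// polygon edges it crosses. Odd count => inside, even count => outside.
bool pointInPolygon(const std::vector<Point> &poly, const Point &q) {
  bool inside = false;
  size_t n = poly.size();
  for (size_t i = 0, j = n - 1; i < n; j = i++) {
    const Point &a = poly[i];
    const Point &b = poly[j];
    // Does edge (a, b) straddle the horizontal line y = q.y,
    // and is the crossing point to the right of q?
    bool straddles = (a.y > q.y) != (b.y > q.y);
    if (straddles) {
      float xCross = a.x + (q.y - a.y) * (b.x - a.x) / (b.y - a.y);
      if (xCross > q.x)
        inside = !inside;   // each crossing toggles inside/outside
    }
  }
  return inside;
}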

_images/3d-cross-product.png

Fig. 73 Cross product definition in 3D

In the same way, by following the same counter-clockwise direction to create the 2D polygons one by one, a 3D polyhedron can be created. As in Fig. 73 from the wiki [24], the inward direction can be decided by a x b < 0 and the outward direction by a x b > 0 in OpenGL. Replacing a, b with x, y as in Fig. 74, the z+ axis points to the outer surface and z- to the inner surface [29].

_images/ogl-pointing-outwards.png

Fig. 74 OpenGL pointing outwards, indicating the outer surface (z axis is +)

_images/3d-polygon.png

Fig. 75 3D polygon with directions on each plane

Fig. 75 is an example of a 3D polyhedron created from 2D triangles. The direction of each plane (triangle) is the line perpendicular to the plane.

Cast a ray from the 3D point along the X-axis and check how many intersections with the outer object you find. Depending on whether the intersection count on each axis is even or odd, you can tell if your point is inside or outside [30]: inside is odd and outside is even. As in Fig. 76, points on the line going through the object satisfy this rule.

_images/in-3d-object.png

Fig. 76 Point in or out 3D object

OpenGL uses counter clockwise and pointing outwards as default [14].

// unit cube
// A cube has 6 sides and each side has 4 vertices, therefore, the total number
// of vertices is 24 (6 sides * 4 verts), and 72 floats in the vertex array
// since each vertex has 3 components (x,y,z) (= 24 * 3)
//    v6----- v5
//   /|      /|
//  v1------v0|
//  | |     | |
//  | v7----|-v4
//  |/      |/
//  v2------v3

// vertex position array
GLfloat vertices[]  = {
   .5f, .5f, .5f,  -.5f, .5f, .5f,  -.5f,-.5f, .5f,  .5f,-.5f, .5f, // v0,v1,v2,v3 (front)
   .5f, .5f, .5f,   .5f,-.5f, .5f,   .5f,-.5f,-.5f,  .5f, .5f,-.5f, // v0,v3,v4,v5 (right)
   .5f, .5f, .5f,   .5f, .5f,-.5f,  -.5f, .5f,-.5f, -.5f, .5f, .5f, // v0,v5,v6,v1 (top)
  -.5f, .5f, .5f,  -.5f, .5f,-.5f,  -.5f,-.5f,-.5f, -.5f,-.5f, .5f, // v1,v6,v7,v2 (left)
  -.5f,-.5f,-.5f,   .5f,-.5f,-.5f,   .5f,-.5f, .5f, -.5f,-.5f, .5f, // v7,v4,v3,v2 (bottom)
   .5f,-.5f,-.5f,  -.5f,-.5f,-.5f,  -.5f, .5f,-.5f,  .5f, .5f,-.5f  // v4,v7,v6,v5 (back)
};

From the code above, we can see that OpenGL uses counter-clockwise order and outward-pointing faces by default. However, OpenGL provides glFrontFace(GL_CW) for clockwise order [31].
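A minimal sketch of the related state calls follows; these are standard OpenGL functions, shown here only to illustrate the winding-order convention.

// Declare which winding order is the "front" face, and cull back faces.
glFrontFace(GL_CCW);        // counter-clockwise is front (the default)
// glFrontFace(GL_CW);      // use this instead if your vertices are wound clockwise
glEnable(GL_CULL_FACE);     // enable face culling
glCullFace(GL_BACK);        // discard faces pointing away from the viewer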

For a group of objects, a scene graph provides better animation and saves memory [32].

OpenGL

Example of OpenGL program

The following example is from the OpenGL Red Book and its example code [37] [38].

References/triangles.vert


#version 400 core

layout( location = 0 ) in vec4 vPosition;

void
main()
{
    gl_Position = vPosition;
}

References/triangles.frag

#version 450 core

out vec4 fColor;

void main()
{
    fColor = vec4(0.5, 0.4, 0.8, 1.0);
}

References/01-triangles.cpp

//////////////////////////////////////////////////////////////////////////////
//
//  Triangles.cpp
//
//////////////////////////////////////////////////////////////////////////////

#include "vgl.h"
#include "LoadShaders.h"

enum VAO_IDs { Triangles, NumVAOs };
enum Buffer_IDs { ArrayBuffer, NumBuffers };
enum Attrib_IDs { vPosition = 0 };

GLuint  VAOs[NumVAOs];
GLuint  Buffers[NumBuffers];

const GLuint  NumVertices = 6;

//----------------------------------------------------------------------------
//
// init
//

void
init( void )
{
    glGenVertexArrays( NumVAOs, VAOs ); // Same as glCreateVertexArrays( NumVAOs, VAOs );
      // https://stackoverflow.com/questions/24441430/glgen-vs-glcreate-naming-convention
    // Make the new VAO:VAOs[Triangles] active, creating it if necessary.
    glBindVertexArray( VAOs[Triangles] );
    // opengl->current_array_buffer = VAOs[Triangles]

    GLfloat  vertices[NumVertices][2] = {
        { -0.90f, -0.90f }, {  0.85f, -0.90f }, { -0.90f,  0.85f },  // Triangle 1
        {  0.90f, -0.85f }, {  0.90f,  0.90f }, { -0.85f,  0.90f }   // Triangle 2
    };

    glCreateBuffers( NumBuffers, Buffers );

    // Make the buffer the active array buffer.
    glBindBuffer( GL_ARRAY_BUFFER, Buffers[ArrayBuffer] );
    // Attach the active VBO:Buffers[ArrayBuffer] to VAOs[Triangles]
    // as an array of vectors with 4 floats each.
    // Kind of like:
    // opengl->current_vertex_array->attributes[attr] = {
    //     type = GL_FLOAT,
    //     size = 4,
    //     data = opengl->current_array_buffer
    // }
    // Can be replaced with glVertexArrayVertexBuffer(VAOs[Triangles], Triangles,
    // buffer[ArrayBuffer], ArrayBuffer, sizeof(vmath::vec2));, glVertexArrayAttribFormat(), ...
    // in OpenGL 4.5.

    glBufferStorage( GL_ARRAY_BUFFER, sizeof(vertices), vertices, 0);

    ShaderInfo  shaders[] =
    {
        { GL_VERTEX_SHADER, "media/shaders/triangles/triangles.vert" },
        { GL_FRAGMENT_SHADER, "media/shaders/triangles/triangles.frag" },
        { GL_NONE, NULL }
    };

    GLuint program = LoadShaders( shaders );
    glUseProgram( program );

    glVertexAttribPointer( vPosition, 2, GL_FLOAT,
                           GL_FALSE, 0, BUFFER_OFFSET(0) );
    glEnableVertexAttribArray( vPosition );
    // Above two functions specify vPosition to vertex shader at layout (location = 0)
}

//----------------------------------------------------------------------------
//
// display
//

void
display( void )
{
    static const float black[] = { 0.0f, 0.0f, 0.0f, 0.0f };

    glClearBufferfv(GL_COLOR, 0, black);

    glBindVertexArray( VAOs[Triangles] );
    glDrawArrays( GL_TRIANGLES, 0, NumVertices );
}

//----------------------------------------------------------------------------
//
// main
//

#ifdef _WIN32
int CALLBACK WinMain(
  _In_ HINSTANCE hInstance,
  _In_ HINSTANCE hPrevInstance,
  _In_ LPSTR     lpCmdLine,
  _In_ int       nCmdShow
)
#else
int
main( int argc, char** argv )
#endif
{
    glfwInit();

    GLFWwindow* window = glfwCreateWindow(800, 600, "Triangles", NULL, NULL);

    glfwMakeContextCurrent(window);
    gl3wInit();

    init();

    while (!glfwWindowShouldClose(window))
    {
        display();
        glfwSwapBuffers(window);
        glfwPollEvents();
    }

    glfwDestroyWindow(window);

    glfwTerminate();
}

Init():

  • Generate Vertex Array VAOs and bind VAOs[0].

    (glGenVertexArrays( NumVAOs, VAOs ); glBindVertexArray( VAOs[Triangles] ); glCreateBuffers( NumBuffers, Buffers );) A vertex-array object holds various data related to a collection of vertices. Those data are stored in buffer objects and managed by the currently bound vertex-array object.

    • glBindBuffer( GL_ARRAY_BUFFER, Buffers[ArrayBuffer] );

      Because there are many different places where buffer objects can be bound in OpenGL, when we bind a buffer, we need to specify what we’d like to use it for. In our example, because we’re storing vertex data into the buffer, we use GL_ARRAY_BUFFER. The place where the buffer is bound is known as the binding target.

  • Following the counter-clockwise rule of the previous section, the triangle primitives are defined in the variable vertices. After binding the OpenGL VBO Buffers[0] to vertices, the vertex data will be sent to the memory of the server (GPU). Think of the “active” buffer as just a global variable, and there are a bunch of functions which use the active buffer instead of taking a parameter. These global state variables are the ugly side of OpenGL [39] and can be replaced with glVertexArrayVertexBuffer(), glVertexArrayAttribFormat(), …, then calling glBindVertexArray(vao) before drawing, in OpenGL 4.5 [40] [41] (see the sketch after this list).

  • glVertexAttribPointer( vPosition, 2, GL_FLOAT, GL_FALSE, 0, BUFFER_OFFSET(0) ): during GPU rendering, each vertex position is held in vPosition and passed to the “triangles.vert” shader loaded by LoadShaders( shaders ).
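A minimal sketch of the OpenGL 4.5 direct-state-access version of this setup, assuming the same VAOs, Buffers and vertices names from the example above; binding index 0 is an arbitrary choice made here for illustration.

// Create the VAO and buffer without touching the global binding state.
glCreateVertexArrays( NumVAOs, VAOs );
glCreateBuffers( NumBuffers, Buffers );
glNamedBufferStorage( Buffers[ArrayBuffer], sizeof(vertices), vertices, 0 );

// Attach the buffer to binding index 0 of the VAO: offset 0, stride = 2 floats.
glVertexArrayVertexBuffer( VAOs[Triangles], 0, Buffers[ArrayBuffer], 0, 2 * sizeof(GLfloat) );
// Describe attribute vPosition: 2 floats, not normalized, relative offset 0.
glVertexArrayAttribFormat( VAOs[Triangles], vPosition, 2, GL_FLOAT, GL_FALSE, 0 );
// Route attribute vPosition to binding index 0 and enable it.
glVertexArrayAttribBinding( VAOs[Triangles], vPosition, 0 );
glEnableVertexArrayAttrib( VAOs[Triangles], vPosition );

// At draw time it is still glBindVertexArray( VAOs[Triangles] ) before glDrawArrays().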

glfwSwapBuffers(window):

  • You’ve already used double buffering for animation. Double buffering is done by making the main color buffer have two parts: a front buffer that’s displayed in your window; and a back buffer, which is where you render the new image. When you swap the buffers (by calling glfwSwapBuffers(), for example), the front and back buffers are exchanged [78].

display():

  • Bind VAOs[0], set the render mode to GL_TRIANGLES and send the vertex data to the buffer (GPU memory, OpenGL pipeline). Next, the GPU will run the rendering pipeline described in the next section.

The triangles.vert has the input vPosition and no output variable, so it uses the default gl_Position variable without declaration. The triangles.frag has no input variable defined and defines the output variable fColor instead of using gl_FragColor.

The “in” and “out” in shaders above are “type qualifier”. A type qualifier is used in the OpenGL Shading Language (GLSL) to modify the storage or behavior of global and locally defined variables. These qualifiers change particular aspects of the variable, such as where they get their data from and so forth [45].

Though attribute and varying were removed in GLSL versions after 1.4, many materials on websites still use them [46] [47]. It’s better to replace them with “in” and “out” as in the following code. OpenGL has a few ways of binding an API variable to a shader variable: glVertexAttrib* as in the following code, glBindAttribLocation() [48], …

replace attribute and varying with in and out

// Vertex shader
uniform float scale;
layout (location = 0) attribute vec2 position;
// layout (location = 0) in vec2 position;
layout (location = 1) attribute vec4 color;
// layout (location = 1) in vec4 color;
varying vec4 v_color;
// out vec4 v_color;

void main()
{
  gl_Position = vec4(position*scale, 0.0, 1.0);
  v_color = color;
}

// OpenGL API
GLfloat attrib[] = { x * 0.5f, x * 0.6f, x * 0.4f, 0.0f };
// Update the value of input attribute 1 : layout (location = 1) in vec4 color
glVertexAttrib4fv(1, attrib);

// Fragment shader
varying vec4 v_color;
// in vec4 v_color;

void main()
{
  gl_FragColor = v_color;
}

An OpenGL program is made of two shaders [43] [44]:

  • The vertex shader is (commonly) executed once for every vertex we want to draw. It receives some attributes as input, computes the position of this vertex in space and returns it in a variable called gl_Position. It also defines some varyings.

  • The fragment shader is executed once for each pixel to be rendered. It receives some varyings as input, computes the color of this pixel and returns it in a variable called fColor.

Since we have 6 vertices in our buffer, this shader will be executed 6 times by the GPU (once per vertex)! We can also expect all 6 instances of the shader to be executed in parallel, since a GPU has so many cores.

3D Rendering

Based on the previous section on 3D modeling, the 3D modeling tool generates the 3D vertex model and OpenGL code; programmers may then hand-modify the OpenGL code and add or modify shaders. The 3D animation triggers the 3D rendering for each 2D image drawn.

3D rendering is the process of converting 3D models into 2D images on a computer [33]. The steps are shown in Fig. 77.

_images/short-rendering-pipeline.png

Fig. 77 3D Graphics Rendering Pipeline

  • A fragment can be treated as a pixel in 3D spaces, which is aligned with the pixel grid, with attributes such as position, color, normal and texture.

The complete steps are shown in Fig. 78 from the OpenGL website [34], and the website has a description for each stage.

_images/rendering_pipeline.png

Fig. 78 Diagram of the Rendering Pipeline. The blue boxes are programmable shader stages.

In addition, the OpenGL rendering pipeline of Figure 1.2 and its stages from the book “OpenGL Programming Guide 9th Edition” [37] are listed as follows,

_images/OpenGL-pipeline.png
Table 44 OpenGL rendering pipeline from page 10 of book “OpenGL Programming Guide 9th Edition” [37] and [34].

Stage

Description

Vertex Specification

After setting the data as in the example of the previous section, glDrawArrays() sends the data to the GPU through buffer objects.

Vertex Shading

Vertex -> Vertex and other data such as color for later passes. For each vertex that is issued by a drawing command, a vertex shader will be called to process the data associated with that vertex.

Tessellation Shading

Creates more detail on demand when zooming in. After the vertex shader has processed each vertex’s associated data, the tessellation shader stage will continue processing that data, if it has been activated. See the reference below.

Geometry Shading

The next shader stage, geometry shading, allows additional processing of individual geometric primitives, including creating new ones, before rasterization. Chapter 10 of Red Book [37] has details.

Primitive Assembly

The previous shading stages all operate on vertices, with the information about how those vertices are organized into geometric primitives being carried along internal to OpenGL. The primitive assembly stage organizes the vertices into their associated geometric primitives in preparation for clipping and rasterization.

Clipping

Clipping hidden parts. Occasionally, vertices will be outside of the viewport—the region of the window where you’re permitted to draw—and cause the primitive associated with that vertex to be modified so none of its pixels are outside of the viewport. This operation is called clipping and is handled automatically by OpenGL.

Rasterization

Vertex -> Fragment. The job of the rasterizer is to determine which screen locations are covered by a particular piece of geometry (point, line, or triangle). Knowing those locations, along with the input vertex data, the rasterizer linearly interpolates the data values for each varying variable in the fragment shader and sends those values as inputs into your fragment shader. A fragment can be treated as a pixel in 3D spaces, which is aligned with the pixel grid, with attributes such as position, color, normal and texture.

Fragment Shading

Determine color for each pixel. The final stage where you have programmable control over the color of a screen location is fragment shading. In this shader stage, you use a shader to determine the fragment’s final color (although the next stage, per-fragment operations, can modify the color one last time) and potentially its depth value. Fragment shaders are very powerful, as they often employ texture mapping to augment the colors provided by the vertex processing stages. A fragment shader may also terminate processing a fragment if it determines the fragment shouldn’t be drawn; this process is called fragment discard. A helpful way of thinking about the difference between shaders that deal with vertices and fragment shaders is this: vertex shading (including tessellation and geometry shading) determines where on the screen a primitive is, while fragment shading uses that information to determine what color that fragment will be.

Per-Fragment Operations

During this stage, a fragment’s visibility is determined using depth testing (also commonly known as z-buffering) and stencil testing. If a fragment successfully makes it through all of the enabled tests, it may be written directly to the framebuffer, updating the color (and possibly depth value) of its pixel, or if blending is enabled, the fragment’s color will be combined with the pixel’s current color to generate a new color that is written into the framebuffer.

  • Tessellation Shading: The core problem that Tessellation deals with is the static nature of 3D models in terms of their detail and polygon count. The thing is that when we look at a complex model such as a human face up close, we prefer to use a highly detailed model that will bring out the tiny details (e.g. skin bumps, etc). A highly detailed model automatically translates to more triangles and more compute power required for processing. … One possible way to solve this problem using the existing features of OpenGL is to generate the same model at multiple levels of detail (LOD). For example, highly detailed, average and low. We can then select the version to use based on the distance from the camera. This, however, will require more artist resources and often will not be flexible enough. … Let’s take a look at how Tessellation has been implemented in the graphics pipeline. The core components that are responsible for Tessellation are two new shader stages, and in between them a fixed-function stage that can be configured to some degree but does not run a shader. The first shader stage is called the Tessellation Control Shader (TCS), the fixed-function stage is called the Primitive Generator (PG), and the second shader stage is called the Tessellation Evaluation Shader (TES). Some GPUs have not implemented this fixed-function stage in HW and do not even provide TCS, TES and Geometry Shader stages; users can write Compute Shaders instead for this on-the-fly detail display. This surface is usually defined by some polynomial formula, and the idea is that moving a CP has an effect on the entire surface. … The group of CPs is usually called a Patch [35]. Chapter 9 of Red Book [37] has details.

For 2D animation, the model is created in 2D only (one face only), so it can only be viewed from that same face of the model. If you want to display different faces of the model, multiple 2D models need to be created and switched from frame to frame over time [36].

GLSL (GL Shader Language)

OpenGL is a standard for designing 2D/3D animation in computer graphics. To do animation well, OpenGL provides a lot of API functions for graphics processing. 3D model construction tools such as Maya, Blender, etc., only need to call this API to project 3D onto the 2D computer screen. Any GPU-hardware-dependent code behind these APIs is provided by the GPU manufacturer. An OpenGL program looks like the following,

Vertex shader

#version 330 core
layout (location = 0) in vec3 aPos; // the position variable has attribute position 0

out vec4 vertexColor; // specify a color output to the fragment shader

void main()
{
    gl_Position = vec4(aPos, 1.0); // see how we directly give a vec3 to vec4's constructor
    vertexColor = vec4(0.5, 0.0, 0.0, 1.0); // set the output variable to a dark-red color
}

Fragment shader

#version 330 core
out vec4 FragColor;

in vec4 vertexColor; // the input variable from the vertex shader (same name and same type)

void main()
{
    FragColor = computeColorOfThisPixel(...);
}

OpenGl user program

int main(int argc, char ** argv)
{
  // init window, detect user input and do corresponding animation by calling opengl api
  ...
}

The last main() is obviously programmed by the user. Let’s explain what the first two main() functions are for. As you know, OpenGL is a large set of APIs that lets the programmer display 3D objects on a 2D computer screen, as explained in books on computer graphics concepts. A 3D graphics model first lets the user set lights and object textures, secondly calculates the position of each vertex, thirdly computes the color of each pixel automatically via the 3D software and GPU, and finally displays the color of each pixel on the computer screen.

In order to let the user/programmer add special effects or decoration to the coordinates of each vertex or to the color of each pixel, OpenGL provides these two shader functions. The reason OpenGL uses the term fragment shader instead of pixel shader is: “Fragment shaders are a more accurate name for the same functionality as Pixel shaders. They aren’t pixels yet, since the output still has to pass several tests (depth, alpha, stencil) as well as the fact that one may be using antialiasing, which renders one-fragment-to-one-pixel non-true” [49]. Programmers are allowed to add their own converting functions, which the compiler translates into GPU instructions running on the GPU processor.

With these two shaders, new features have been added to allow for increased flexibility in the rendering pipeline at the vertex and fragment level [50]. Unlike the shader examples here [51], some converting functions for coordinates in the vertex shader or for color in the fragment shader are more complicated, depending on the scenes of the animation. Here is an example [52]. In the wiki shading page [3], the Gouraud and Phong shading methods make the surface of an object smoother via GLSL. Example GLSL code for Gouraud and Phong shading on the OpenGL API is here [53]. Since the graphics card hardware and the graphics driver software can be replaced, the compiler runs on-line, meaning the driver compiles the shader program the first time it runs and keeps it in a cache after compilation [54].

The shader program has C-like syntax and can be compiled in a few milliseconds; adding these few milliseconds of on-line compilation time to a running OpenGL program is a good trade-off for handling replacement of the driver software or GPU hardware [55].
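A minimal sketch of what this on-line compile and link step looks like through the OpenGL API; it is roughly the kind of work a helper such as LoadShaders() in the earlier example performs internally. Error handling is trimmed, and the source strings vertexSrc and fragmentSrc are assumed to have been loaded by the application already.

// Compile one shader stage from source at run time (on-line compilation).
GLuint vs = glCreateShader(GL_VERTEX_SHADER);
glShaderSource(vs, 1, &vertexSrc, NULL);   // vertexSrc: GLSL text loaded by the app
glCompileShader(vs);                       // the driver's compiler runs here, on the CPU

GLuint fs = glCreateShader(GL_FRAGMENT_SHADER);
glShaderSource(fs, 1, &fragmentSrc, NULL);
glCompileShader(fs);

// Link both stages into one program object, then make it current.
GLuint program = glCreateProgram();
glAttachShader(program, vs);
glAttachShader(program, fs);
glLinkProgram(program);
glUseProgram(program);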

OpenGL Shader compiler

The OpenGL standard is here [56]. OpenGL is for desktop computers or servers, while OpenGL ES is for embedded systems [57]. Though shaders are only a small part of the whole OpenGL software/hardware system, it is still a large effort to finish the compiler implementation, since there are lots of APIs that need to be implemented. For example, there are 80 texture-related APIs [58]. This implementation can be done by generating LLVM extended intrinsic functions from the shader parser of the frontend compiler, with the LLVM backend converting those intrinsics to GPU instructions, as follows,

Fragment shader

#version 320 es
uniform sampler2D x;
out vec4 FragColor;

void main()
{
    FragColor = texture(x, uv_2d, bias);
}

llvm-ir

...
!1 = !{!"sampler_2d"}
!2 = !{i32 SAMPLER_2D} ; SAMPLER_2D is integer value for sampler2D, for example: 0x0f02
; A named metadata.
!x_meta = !{!1, !2}

define void @main() #0 {
    ...
    %1 = @llvm.gpu0.texture(metadata !x_meta, %1, %2, %3); ; %1: %sampler_2d, %2: %uv_2d, %3: %bias
    ...
}

asm of gpu

...
// gpu machine code
load $1, tex_a;
sample2d_inst $1, $2, $3 // $1: tex_a, $2: %uv_2d, $3: %bias

.tex_a // Driver sets the index of the gpu descriptor registers here

As at the bottom of the code above, the .tex_a memory address holds the Texture Object which is bound by the driver during on-line compilation/linking. By binding Texture Objects (SW) to Texture Units (HW) with the OpenGL API, the GPU uses the Texture Unit HW resources efficiently. This is explained in the following.

About LLVM extended intrinsic functions, please refer to this book here [59].

gvec4 texture(gsampler2D sampler, vec2 P, [float bias]);

The GPU provides ‘Texture Units’ to speed up the fragment shader. However, the ‘Texture Unit’ HW is an expensive resource and there are only a few of them in a GPU. The driver can associate a ‘Texture Unit’ with a sampler variable through the OpenGL API and switch it between shaders, as the following statements describe.

_images/sampling_diagram.png

Fig. 79 Relationships between the texturing concept [60].

Fig. 79 is shown above. The texture object is not bound directly into the shader (where the actual sampling takes place). Instead, it is bound to a ‘texture unit’ whose index is passed to the shader. So the shader reaches the texture object by going through the texture unit. There are usually multiple texture units available, and the exact number depends on the capability of your graphics card [60]. A texture unit, also called a texture mapping unit (TMU) or a texture processing unit (TPU), is a hardware component in a GPU that performs the sampling operation. The sampler argument of the texture function above is the sampler_2d index of the ‘texture unit’ for the texture object [60].

‘sampler uniform variable’:

There is a group of special uniform variables for that, according to the texture target: ‘sampler1D’, ‘sampler2D’, ‘sampler3D’, ‘samplerCube’, etc. You can create as many ‘sampler uniform variables’ as you want and assign the value of a texture unit to each one from the application. Whenever you call a sampling function on a ‘sampler uniform variable’ the corresponding texture unit (and texture object) will be used [60].

_images/sampling_diagram_binding.png

Fig. 80 Binding sampler variables [61].

As in Fig. 80, the Java API gl.bindTexture binds a ‘Texture Object’ to a ‘Texture Unit’. The gl.getUniformLocation and gl.uniform1i calls associate the ‘Texture Unit’ with a ‘sampler uniform variable’.

gl.uniform1i(xLoc, 1): where 1 is ‘Texture Unit 1’, 2 is ‘Texture Unit 2’, …, etc [61].
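For reference, the equivalent sequence in the C API looks roughly like this; it is a sketch assuming a texture object texObj created elsewhere and the program object prog from the example above.

// Bind the texture object to texture unit 1, then point sampler "x" at unit 1.
glActiveTexture(GL_TEXTURE1);                   // select texture unit 1
glBindTexture(GL_TEXTURE_2D, texObj);           // bind the texture object to that unit
GLint xLoc = glGetUniformLocation(prog, "x");   // location of sampler uniform x
glUniform1i(xLoc, 1);                           // sampler uniform x samples from unit 1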

The following figure depicts how the driver reads metadata from the compiled GLSL object, how the OpenGL API associates the ‘Sampler Variable’, and how the GPU executes the texture instruction.

_images/driverSamplerTable.png

Fig. 81 Associating Sampler Variables and gpu executing texture instruction

The detailed steps for the figure above are explained as follows.

1. In order to let the ‘texture unit’ be bound by the driver, the frontend compiler must pass the metadata of the ‘sampler uniform variable’ (sampler_2d_var in this example) [64] to the backend, and the backend must allocate the metadata of the ‘sampler uniform variable’ in the compiled binary file [62].

2. After the GPU driver does the GLSL on-line compiling, the driver reads this metadata from the compiled binary file and maintains a table of {name, type, location} for each ‘sampler uniform variable’. The driver also fills this information into the Texture Descriptor in the GPU’s memory.

3. Api,

xLoc = gl.getUniformLocation(prog, "x"); // prog: glsl program, xLoc

will get the location from the table for ‘sampler uniform variable’ x that driver created.

SAMPLER_2D: is integer value for Sampler2D type.

4. Api,

gl.uniform1i( xLoc, 1 );

will bind xLoc of ‘sampler uniform variable’ x to ‘Texture Unit 1’ by writing 1 to the GLSL binary metadata location of ‘sampler uniform variable’ x as follows,

{xLoc, 1} : 1 is 'Texture Unit 1', xLoc is the location(memory address) of 'sampler uniform variable' x

This API sets the texture descriptors in the GPU with this {xLoc, 1} information. Next, the driver sets the index or memory address of the GPU texture descriptor into the variable at memory address .tex_a. For example, as in the diagram, the driver sets k into .tex_a.

// gpu machine code
load $1, tex_a;
sample2d_inst $1, $2, $3 // $1: tex_a, $2: %uv_2d, $3: %bias

.tex_a // Driver sets the index of the gpu descriptor registers here at step 4

5. When executing the texture instructions from the GLSL binary file on the GPU, the corresponding ‘Texture Unit 1’ on the GPU will be used through the texture descriptor in the GPU’s memory, because .tex_a: {xLoc, 1}. The driver may set the texture descriptor in the GPU’s texture descriptor registers if the GPU provides specific texture descriptors in its architecture [65].

For instance, an Nvidia texture instruction is as follows,

// the content of tex_a bound to texture unit as step 5 above
tex.3d.v4.s32.s32  {r1,r2,r3,r4}, [tex_a, {f1,f2,f3,f4}];

.tex_a

The content of tex_a is bound to a texture unit, set by the driver at the end of step 4. The pixel coordinates (x,y,z) are given by the user input (f1,f2,f3). The f4 is skipped for a 3D texture.

The tex.3d texture instruction above loads the calculated color of pixel (x,y,z) from the texture image into the GPRs (r1,r2,r3,r4)=(R,G,B,A). The fragment shader can then re-calculate the color of this pixel using the color of this pixel from the texture image [63].

If it is a 1D texture instruction, the tex.1d form is as follows,

tex.1d.v4.s32.f32  {r1,r2,r3,r4}, [tex_a, {f1}];

Since the ‘Texture Unit’ is a limited hardware accelerator on the GPU, OpenGL provides APIs to the user program for binding ‘Texture Units’ to ‘Sampler Variables’. As a result, the user program is allowed to do load balancing in the use of ‘Texture Units’ through the OpenGL API without recompiling GLSL. Fast texture sampling is one of the key requirements for good GPU performance [61].

In addition to the APIs for binding textures, OpenGL provides the glTexParameteri API to do Texture Wrapping [66]. Furthermore, the texture instructions of some GPUs may include S# and T# values in their operands. As with associating ‘Sampler Variables’ to ‘Texture Units’, S# and T# are locations of memory associated with Texture Wrapping descriptor registers, allowing the user program to change the wrapping option without re-compiling GLSL.
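A minimal sketch of setting wrapping (and filtering) state on the currently bound texture; these are standard OpenGL calls, shown only to illustrate what the driver later feeds into the descriptor registers.

// Texture wrapping: what happens when texture coordinates fall outside [0, 1].
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_REPEAT);          // s (u) axis
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);   // t (v) axis
// Filtering: how texels are sampled when the texture is minified/magnified.
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);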

Even though the GLSL frontend compiler always expands function calls into inline functions, and the LLVM extended intrinsic functions provide an easy way to do code generation through the LLVM td (Target Description) files, a GPU backend compiler is still a little more complex than a CPU backend. (But when considering the effort in the frontend compiler such as clang, or other toolchain pieces such as the linker and gdb/lldb, of course, a CPU compiler is not easier than a GPU compiler.)

Here is the software stack of the 3D graphics system for OpenGL on Linux [11]. The Mesa open source website is here [67].

GPU Architecture

_images/gpu-terms.png

Fig. 82 Terms in Nvidia’s gpu (figure from book [84])

SIMT

Single instruction, multiple threads (SIMT) is an execution model used in parallel computing where single instruction, multiple data (SIMD) is combined with multithreading [68].

The leading GPU architecture from Nvidia is shown in the following figures.

_images/threads-lanes.png

Fig. 83 Threads and lanes in gpu (figure from book [73])

Note

A SIMD Thread executed by SIMD Processor, a.k.a. SM, has 16 Lanes.

_images/sm.png

Fig. 84 A Streaming Multiprocessor (SM) has two 16-way SIMD units and four special function units [70]. The SM has an L1 and a Read-Only Cache (Uniform Cache). GTX480 has 48 SMs. ALUs run at twice the clock rate of the rest of the chip, so each decoded instruction runs on 32 pieces of data on the 16 ALUs over two ALU clocks [71].

_images/sm2.png

Fig. 85 Multithreaded SIMD Processor (Streaming Multiprocessor SM) figure from book [74]

Note

A SIMD Thread executed by SIMD Processor, a.k.a. SM, processes 32 elements. Number of registers in a Thread Block = 16 (SM) * 32 (Cuda Thread) * 64 (TLR, Thread Level Register) = 32768 Register file. Fermi has a mode bit that offers the choice of using 64 KB of SRAM as a 16 KB L1 cache with 48 KB of Local Memory or as a 48 KB L1 cache with 16 KB of Local Memory [85].

_images/threadblock.jpg

Fig. 86 SM select Thread Blocks to run

  • Two level of scheduler.

    • Level 1: Thread Block Scheduler: Whenever an SM executes a thread block, all the threads inside the thread block are executed at the same time. When any thread in a Warp is not ready because of an operand data dependence, the SM context-switches between Warps. When switching away from a particular warp, all the data of that warp remains in the register file so that it can be quickly resumed when its operands become ready [91].

    • Level 2: Warp Scheduler: schedules the Cuda Threads within the same Warp.

// Invoke MATMUL with 256 threads per Thread Block
__host__
int nblocks = (n + 255) / 256;
matmul<<<nblocks, 256>>>(n, A, B, C);
// MATMUL in CUDA
__device__
void matmul(int n, double *A, double *B, double *C) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) A[i] = B[i] + C[i];
}
_images/grid.png

Fig. 87 Mapping 8192 elements of matmul for Nvidia’s GPU (figure from book [69]). SIMT: 16 SIMD Threads in 1 Thread Block.

_images/memory.png

Fig. 88 GPU memory (figure from book [75])

Summarize as table below.

Table 45 More Descriptive Name for Cuda term in Fermi GPU.

Each entry lists: More Descriptive Name (Cuda term), Structure, then Description.

  • Grid (Cuda term: Grid)

    Grid is the Vectorizable Loop, as Fig. 82.

  • Thread Block (Cuda term: Thread Block / GPU Core). Structure: each Grid has 16 Thread Blocks.

    Each Thread Block is assigned 512 elements of the vectors to work on. SIMD Processors are full processors with separate PCs and are programmed using threads [87]. As Fig. 87, it assigns 16 Thread Blocks to 16 SIMD Processors. A CPU core is the processor which includes multiple threads; a thread of a CPU is an execution unit with its own PC (Program Counter). Similarly, once a thread block is launched on a multiprocessor (SM), all of its warps are resident until their execution finishes. Thus a new block is not launched on an SM until there is a sufficient number of free registers for all warps of the new block, and until there is enough free shared memory for the new block. In this sense, a GPU Core is the SIMD Processor, which includes several SIMD Threads (Warps). Each Warp has its own PC [91].

  • SIMD Thread, run by a SIMD Processor (Cuda term: Warp, run by a Streaming Multiprocessor, SM). Structure: each SIMD Processor has 16 SIMD Threads.

    Each SIMD Processor has Memory: Local Memory, as Fig. 88. Local Memory is shared by the SIMD Lanes within a multithreaded SIMD Processor, but this memory is not shared between multithreaded SIMD Processors. A Warp has its own PC and may map to one whole function or part of a function. The compiler and runtime may assign functions to the same Warp or to different Warps [92].

  • SIMD Lane (Cuda term: Cuda Thread). Structure: each SIMD Thread has 16 Lanes.

    A vertical cut of a thread of SIMD instructions corresponds to one element executed by one SIMD Lane. A SIMD instruction is a vector instruction processing 16 elements. SIMD Lane registers: each Lane has its TLR (Thread Level Registers), which are allocated from the register file (32768 x 32-bit) by the SM, as Fig. 84.

  • Chime (Cuda term: Chime). Structure: each SIMD Lane has 2 chimes.

    One clock of the rest of the chip executes 2 data elements on two Cuda cores, as Fig. 84. The vector length is 32 (32 elements), the number of SIMD Lanes is 16, and the chime is 2. These ALU clock cycles are also known as “ping pong” cycles. See Fig. 87 for the later Fermi-generation GPUs.

  • Vertex unit (Cuda term: VAR unit)

    VAR: Variable Rate Shading unit [88].

  • Texture unit

    As depicted in the section OpenGL Shader Compiler.

Speedup Features

  • Gather-scatter data transfer: HW support for sparse vector access is called gather-scatter. The VMIPS instructions are LVI (load vector indexed, or gather) and SVI (store vector indexed, or scatter) [89].

  • Address Coalescing: GPU provides this feature explained as follows,

    • Note that unlike vector architectures, GPUs don’t have separate instructions for sequential data transfers, strided data transfers, and gather-scatter data transfers. All data transfers are gather-scatter! To regain the efficiency of sequential (unit-stride) data transfers, GPUs include special Address Coalescing hardware to recognize when the SIMD Lanes within a thread of SIMD instructions are collectively issuing sequential addresses [90].

Buffers

In addition to texture units and instructions, the GPU provides different buffers to speed up OpenGL pipeline rendering [42]; a small sketch of using them follows the list.

  • Color buffer

    They contain the RGB or sRGB color data and may also contain alpha values for each pixel in the framebuffer. There may be multiple color buffers in a framebuffer. You’ve already used double buffering for animation. Double buffering is done by making the main color buffer have two parts: a front buffer that’s displayed in your window; and a back buffer, which is where you render the new image [76].

  • Depth buffer (Z buffer)

    Depth is measured in terms of distance to the eye, so pixels with larger depth-buffer values are overwritten by pixels with smaller values [77] [79] [80].

  • Stencil Buffer

    In the simplest case, the stencil buffer is used to limit the area of rendering (stenciling) [81] [80].

  • Frame Buffer

    OpenGL offers: the color, depth and stencil buffers. This combination of buffers is known as the default framebuffer and as you’ve seen, a framebuffer is an area in memory that can be rendered to [82].
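A minimal sketch of touching these buffers from the API, using standard OpenGL calls, shown only to connect the buffer names above to code.

// Clear the color, depth and stencil buffers of the default framebuffer.
glClearColor(0.0f, 0.0f, 0.0f, 1.0f);
glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT | GL_STENCIL_BUFFER_BIT);

// Depth buffer: keep the fragment closest to the eye.
glEnable(GL_DEPTH_TEST);
glDepthFunc(GL_LESS);

// Stencil buffer: limit rendering to pixels whose stencil value equals 1.
glEnable(GL_STENCIL_TEST);
glStencilFunc(GL_EQUAL, 1, 0xFF);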

General purpose GPU

Since GLSL shaders provide a general way of writing C-like code in them, if a software framework is applied instead of the OpenGL API, the system can run some data-parallel computation on the GPU for speedup and even have the CPU and GPU execute simultaneously. Furthermore, any language that allows the code running on the CPU to poll a GPU shader for return values can create a GPGPU framework [83].

Mapping data in GPU

As in the previous GPU section, a subset of the y[] = a*x[] + y[] array calculation is as follows,

// Invoke DAXPY with 256 threads per Thread Block
__host__
int nblocks = (n+255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
// DAXPY in CUDA
__device__
void daxpy(int n, double a, double *x, double *y) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}
  • name<<<dimGrid, dimBlock>>>(… parameter list …):

    • dimGrid: Number of Blocks in Grid

    • dimBlock: 256 Threads in Block

Assembly code of PTX (from page 300 of Quantitative book)

// code to set VLR, Vector Length Register, to (n % 256)
//   ...
//
shl.u32 R8, blockIdx, 9       ; Thread Block ID * Block size (512)
add.u32 R8, R8, threadIdx     ; R8 = i = my CUDA Thread ID
shl.u32 R8, R8, 3             ; byte offset
setp.neq.s32 P1, RD8, RD3     ; RD3 = n, P1 is predicate register 1
ld.global.f64 RD0, [X+R8]     ; RD0 = X[i]
ld.global.f64 RD2, [Y+R8]     ; RD2 = Y[i]
mul.f64 RD0, RD0, RD4         ; Product in RD0 = RD0 * RD4 (scalar a)
add.f64 RD0, RD0, RD2         ; SuminRD0 = RD0 + RD2 (Y[i])
st.global.f64 [Y+R8], RD0     ; Y[i] = sum (X[i]*a + Y[i])
  • Need to set the VLR if PTX has this instruction. Otherwise, set the lane mask in a similar way to the code below.

__device__
void lane_mask_ex( double *X, double *Y, double *Z) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (X[i] != 0)
    X[i] = X[i] - Y[i];
  else X[i] = Z[i];
}

Assembly code of Vector Processor

LV V1,Rx         ;load vector X into V1
LV V2,Ry         ;load vector Y
L.D F0,#0        ;load FP zero into F0
SNEVS.D V1,F0    ;sets VM(i) to 1 if V1(i)!=F0
SUBVV.D V1,V1,V2 ;subtract under vector mask
SV V1,Rx         ;store the result in X

Assembly code of PTX (modified code from refering page 208 - 302 of Quantitative book)

ld.global.f64 RD0, [X+R9]     ; RD0 = X[i]
setp.neq.s32 P1, RD0, #0      ; P1 is predicate register 1
@!P1, bra ELSE1, *Push        ; Push old mask, set new mask bits
                              ; if P1 false, go to ELSE1
ld.global.f64 RD2, [Y+R8]     ; RD2 = Y[i]
sub.f64 RD0, RD0, RD2         ; Difference in RD0
st.global.f64 [X+R8], RD0     ; X[i]=RD0
ELSE1:
ld.global.f64 RD0, [Z+R8]     ; RD0 = Z[i]
st.global.f64 [X+R8], RD0     ; X[i] = RD0
ENDIF1:
ret, *Pop                     ; pop to restore old mask

The following table explains how the elements of saxpy() map to a lane of a SIMD Thread (Warp) of a Thread Block (Core) of the Grid.

Table 46 Mapping saxpy code to Fig. 87.

saxpy()      | Instance in Fig. 87        | Description
blockDim.x   | The index of Thread Block  | blockDim: in this example, configured as Fig. 87, is 16 (Thread Blocks) * 16 (SIMD Threads) = 256
blockIdx.x   | The index of SIMD Thread   | blockIdx: the index of the Thread Block within the Grid
threadIdx.x  | The index of elements      | threadIdx: the index of the SIMD Thread within its Thread Block

  • With Fermi, each 32-wide thread of SIMD instructions is mapped to 16 physical SIMD Lanes, so each SIMD instruction in a thread of SIMD instructions takes two clock cycles to complete.

  • You could say that it has 16 lanes, the vector length would be 32, and the chime is 2 clock cycles.

  • The mapping of y[0..31] = a * x[0..31] + y[0..31] to <Core, Warp, Cuda Thread> of the GPU is shown in the following table. x[0..31] maps to 32 Cuda Threads; two Cuda Threads map to one SIMD lane.

Table 47 Map (Core,Warp) to saxpy

         | Warp-0                              | Warp-1                                | …   | Warp-15
Core-0   | y[0..31] = a * x[0..31] + y[0..31]  | y[32..63] = a * x[32..63] + y[32..63] | …   | y[480..511] = a * x[480..511] + y[480..511]
…        |                                     |                                       |     |
Core-15  | y[7680..7711] = a * …               | …                                     | …   | y[8160..8191] = a * x[8160..8191] + y[8160..8191]

  • Each Cuda Thread runs the GPU function code saxpy. Fermi has a register file of 32768 x 32-bit registers. As in Fig. 84, the number of registers used by a Thread Block = 16 (SIMD Threads) * 32 (Cuda Threads per SIMD Thread) * 64 (TLR, Thread Level Registers) = 32768 x 32-bit, matching the register file.

  • When mapping to the fragments/pixels in a graphics GPU, x[0..15] corresponds to a two-dimensional tile of fragments/pixels at pixel[0..3][0..3], since images are processed tile by tile to group nearby colors together (see the sketch below).
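A hedged sketch of this linear-index-to-tile mapping; the helper below is hypothetical and not part of any graphics API, assuming a 4x4 tile stored in row-major order:

// Hypothetical: map a linear lane/element index i in [0..15] to the (x, y)
// coordinate of a pixel inside one 4x4 tile (row-major order assumed).
__device__ void index_to_tile_xy(int i, int *x, int *y) {
  *x = i % 4;   // column inside the tile: 0..3
  *y = i / 4;   // row inside the tile:    0..3
}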

Work between CPU and GPU in Cuda

The daxpy() GPU code above did not show the host (CPU) side code that triggers the GPU function. The following is the host (CPU) side of a CUDA example that calls saxpy on the GPU [86]:

#include <stdio.h>

__global__
void saxpy(int n, float a, float * x, float * y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

int main(void)
{
  ...
  cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);
  ...
  saxpy<<<(N+255)/256, 256>>>(N, 2.0, d_x, d_y);
  ...
  cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);
  ...
}

The main() runs on the CPU while saxpy() runs on the GPU. The CPU copies the data from x and y to the corresponding device arrays d_x and d_y using cudaMemcpy. The saxpy kernel is launched by the statement: saxpy<<<(N+255)/256, 256>>>(N, 2.0, d_x, d_y);. In this case the kernel is launched with thread blocks of 256 threads, and integer arithmetic is used to determine the number of thread blocks required to process all N elements of the arrays ((N+255)/256). Through cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost, the CPU passes the data in the x and y arrays to the GPU and gets the result back into the y array. Since both of these memory transfers trigger DMA functions without CPU involvement, running the CPU and GPU with their data in their own caches respectively may give a speedup. After the DMA memcpy from CPU memory to GPU memory, the GPU executes the whole loop of the array operation “y[] = a*x[]+y[];” with one Grid.
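The elided parts of main() above are, typically, the host- and device-side allocation, initialization, and cleanup. A minimal sketch of those steps, assuming the array size N used in the launch (illustrative, not taken verbatim from [86]):

// Sketch of the elided host-side setup/teardown in main().
// Needs #include <stdlib.h> for malloc/free.
int N = 1 << 20;
float *x = (float*)malloc(N * sizeof(float));
float *y = (float*)malloc(N * sizeof(float));
float *d_x, *d_y;
cudaMalloc(&d_x, N * sizeof(float));   // device (GPU) buffers
cudaMalloc(&d_y, N * sizeof(float));
for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }  // host initialization
// ... cudaMemcpy to device, kernel launch, cudaMemcpy back, as shown above ...
cudaFree(d_x); cudaFree(d_y);
free(x); free(y);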

The GPU pursues throughput for SIMD applications and can hide cache-miss latency through SMT. As a result, a GPU may not have L2 and L3 caches for each core as a CPU does, since the GPU relies on highly latency-tolerant multithreading for data-parallel applications [96]. DMA memcpy copies the data from CPU memory into GPU memory, from which each core’s L1 cache is filled. Many GPUs provide scatter and gather operations to access DRAM data for stream processing [97] [83] [98].

When the GPU function is dense computation over arrays, such as an MPEG4 encoder or deep-learning weight tuning, it may obtain a large speedup [99]. However, when the GPU function is a plain matrix addition and the CPU idles while waiting for the GPU’s result, it may be slower than doing the matrix addition on the CPU alone. Arithmetic intensity is defined as the number of operations performed per word of memory transferred. For example, saxpy performs two floating-point operations (one multiply and one add) for every three words transferred (two loads and one store), so its arithmetic intensity is only about 2/3. It is important for GPGPU applications to have high arithmetic intensity, or else memory access latency will limit the computational speedup [83].

The wiki [100] lists GPU speedup applications as follows:

General-purpose computing on GPU has found its way into fields as diverse as machine learning, oil exploration, scientific image processing, linear algebra, statistics, 3D reconstruction and even stock options pricing determination. In addition, the section “GPU accelerated video decoding and encoding” of [100] gives more applications of GPU acceleration for video compression.

Table 48 The differences for speedup in architecture of CPU and GPU

Item          | CPU                                        | GPU
Application   | Non-data parallel                          | Data parallel
Architecture  | SISD, small vector (e.g. 4*32 bits)        | Large SIMD (e.g. 16*32 bits)
Cache         | Smaller and faster                         | Larger and slower (ref. the following Note)
ILP           | Pipeline; Superscalar, SMT; Super-pipeline | Pipeline; SIMT
Core          | Fewer hardware threads for SMT (2 or 4)    | More hardware threads (16 or 32)
Branch        | Conditional instructions                   | Mask & conditional instructions

Note

GPU-Cache

In theory, for data-parallel applications under the GPU’s SMT, the GPU can schedule more threads and pursue throughput rather than the speedup of one single thread, as SISD does in a CPU. However, in reality the GPU provides a small L1 cache like the CPU’s and hides cache misses by scheduling another thread. So, a GPU may have no L2 and L3 while a CPU has deep levels of caches.

Volta (Cuda thread/SIMD lane with PC, Program Counter, and Call Stack)

One way the compiler and hardware handle branch divergence is to keep executing instructions in order: if some threads do not have to execute certain instructions, those threads are switched off; they are switched on again for their relevant instructions while the other threads are switched off. This process is called masking.

_images/pre-volta-1.png

Fig. 89 SIMT Warp Execution Model of Pascal and Earlier GPUs [93]

_images/volta-1.png

Fig. 90 Volta Warp with Per-Thread Program Counter and Call Stack [93]

  • Since Nvidia’s Volta GPU, each thread in a Warp has its own PC, as shown in Fig. 90.

// Insert Node b after Node a in a doubly linked list, using fine-grained per-node locks [93].
__device__ void insert_after(Node *a, Node *b)
{
  Node *c;
  lock(a); lock(a->next);
  ...
  unlock(c); unlock(a);
}
  • Volta’s independent thread scheduling allows the GPU to yield execution of any thread, either to make better use of execution resources or to allow one thread to wait for data to be produced by another. As in the example above [93], each thread can progress with its own PC. So, different threads in the same Warp can run the insert_after() function independently without getting stuck waiting for each other’s lock().

  • Provides both thread-group efficiency and independent thread progression.

    Besides letting each thread in the same Warp progress independently as above, to maximize parallel efficiency Volta includes a schedule optimizer which determines how to group active threads from the same warp together into SIMT units. This retains the high throughput of SIMT execution as in prior NVIDIA GPUs, but with much more flexibility: threads can now diverge and reconverge at sub-warp granularity, while the convergence optimizer in Volta will still group together threads which are executing the same code and run them in parallel for maximum efficiency. In Cuda applications, this feature gives user programmers more parallel opportunities through __syncwarp(), as in Fig. 91; a sketch follows the figure below.

_images/volta-2.png

Fig. 91 Programs use Explicit Synchronization to Reconverge Threads in a Warp [93]
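A small, hedged CUDA sketch of the idea behind Fig. 91 (the kernel below is illustrative and not taken from [93]): after diverged lanes write to shared memory, __syncwarp() reconverges the Warp before any lane reads another lane’s data.

// Hypothetical example: each lane of one warp publishes a value, then reads its
// neighbour lane's value. Assumes the kernel is launched with 32 threads per block.
__global__ void neighbour_exchange(int *out) {
  __shared__ int buf[32];          // one slot per lane of the warp
  int lane = threadIdx.x & 31;     // lane id within the warp
  buf[lane] = lane * 10;           // each lane writes its own slot
  __syncwarp();                    // reconverge: all writes are visible to the warp
  out[lane] = buf[lane ^ 1];       // safely read the neighbour lane's slot
}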

OpenCL, Vulkan and spir-v

digraph G {
  rankdir=LR;

  compound=true;
  node [shape=record];
  SW_LAYER [label="{ GLSL | OpenCL } | SPIR-V | GPU machine code"];
}

Fig. 92 OpenCL and GLSL(OpenGL)

Table 49 OpenCL and OpenGL SW system

Name of SW | GPU language | Level of GPU language
OpenCL     | OpenCL       | C99 dialect (with C pointer, …)
OpenGL     | GLSL         | C-like (no C pointer, …)
Vulkan     | SPIR-V       | IR

_images/opencl-to-spirv-offine-compilation.png

Fig. 93 Offline Compilation of OpenCL Kernels into SPIR-V Using Open Source Tooling [102]

  • clang: compiles OpenCL to spirv for the runtime+driver; or compiles OpenCL to llvm IR, and then the “SPIR-V LLVM Translator” translates llvm IR to spirv for the runtime+driver.

  • clspv: compiles OpenCL to spirv for the runtime.

digraph G {
  rankdir=LR;

  node [shape=record];

  glslang [style=filled,fillcolor=green];
  spirv_cross [label="spirv-cross",style=filled,fillcolor=green];
  glsl -> glslang -> spirv;
  glsl -> spirv_cross -> spirv [dir="back"];
}

Fig. 94 Conversion between glsl and spirv

  • Comparison of OpenCL and OpenGL’s compute shader.

    • Same:

      Both are for general-purpose computing on the GPU.

    • Difference:

      OpenCL covers the GPU and other accelerator devices/processors. OpenCL is a C language on the Device and C++ on the Host, based on the OpenCL runtime. A compute shader is written in the GLSL shader language, runs in the OpenGL graphics environment, and can integrate with and access data of the OpenGL API easily [101].

  • OpenGL/GLSL vs Vulkan/spir-v.

    • Higher-level API and shader: OpenGL, GLSL.

    • Lower-level API and shader: Vulkan, spir-v.

Though the OpenGL API sits at a higher level and has many advantages, as the sections above show, it sometimes cannot compete in efficiency with Direct3D, which provides lower-level APIs that let user programs manage memory themselves [103]. The Vulkan API is a lower-level C/C++ API that fills this gap, allowing user programs to do in the OpenGL world what previously required Microsoft Direct3D. Here is an example [104]. Meanwhile, glsl is a C-like language. The Vulkan infrastructure provides a tool, glslangValidator [105], to compile glsl into an Intermediate Representation (IR) called spir-v off-line. As a result, it saves part of the on-line compilation time from glsl to GPU instructions, since spir-v is an IR at a level close to llvm IR [106]. In addition, the Vulkan API reduces the GPU driver’s effort in optimization and code generation [103]. These standards give user programmers the option of using vulkan/spir-v instead of OpenGL/glsl, and allow them to pre-compile glsl into spir-v off-line to save part of the on-line compilation time.

With the vulkan and spir-v standards, the GPU can be used through OpenCL for parallel programming of heterogeneous systems [107] [108]. Similar to Cuda, an OpenCL example for the fast Fourier transform (FFT) is here [109]. If OpenCL grows into a popular standard, with more computer languages or frameworks supporting the OpenCL language, the GPU will take more jobs from the CPU [110].

Most GPUs have 16 or 32 lanes in a SIMD processor (Warp); vulkan provides subgroup operations for data-parallel programming across the lanes of a SIMD processor [111].

Subgroup operations provide a fast way of moving data between lanes within a Warp. Assume each Warp has four lanes, as in the table below. The following table lists the results of the reduce, inclusive, and exclusive operations.

Table 50 Each lane’s value after the Reduce, Inclusive, and Exclusive operations, respectively

Lane           | 0           | 1        | 2        | 3
Initial value  | a           | b        | c        | d
Reduce         | OP(abcd)    | OP(abcd) | OP(abcd) | OP(abcd)
Inclusive      | OP(a)       | OP(ab)   | OP(abc)  | OP(abcd)
Exclusive      | not defined | OP(a)    | OP(ab)   | OP(abc)

  • Reduce: e.g. subgroupAdd. Inclusive: e.g. subgroupInclusiveAdd. Exclusive: e.g. subgroupExclusiveAdd.

  • For examples:

    • ADD operation: OP(abcd) = a+b+c+d.

    • MAX operation: OP(abc) = MAX(a,b,c).

  • When lane i is inactive, its value is not included.

    • For instance, if lane 0 is inactive, then for the MUL operation: OP(abcd) = b*c*d.

The following is a code example.

An example of subgroup operations in glsl for vulkan

// 'data' is assumed to be a vec4 array bound in a buffer (declaration not shown).
vec4 sum = vec4(0, 0, 0, 0);
if (gl_SubgroupInvocationID < 16u) {
  // reduction over the active lanes taking this path
  sum += subgroupAdd(data[gl_SubgroupInvocationID]);
}
else {
  // inclusive scan over the other active lanes
  sum += subgroupInclusiveMul(data[gl_SubgroupInvocationID]);
}
subgroupMemoryBarrier();
  • Nvidia’s GPU provides __syncwarp(), corresponding to subgroupMemoryBarrier(), for the programmer or compiler to synchronize the lanes in the same Warp; a CUDA sketch follows below.
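For comparison with the glsl subgroup reduce above, a warp-level reduction can be written in CUDA with warp shuffles. A minimal sketch; the function name is an assumption, while __shfl_xor_sync() is the standard CUDA warp-shuffle intrinsic:

// Hypothetical warp-level reduction: after the loop, every lane of the warp holds
// the sum of all 32 lanes' inputs, similar in spirit to GLSL subgroupAdd().
__device__ float warp_reduce_add(float v) {
  for (int offset = 16; offset > 0; offset >>= 1)
    v += __shfl_xor_sync(0xffffffff, v, offset);  // butterfly exchange across lanes
  return v;
}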

In order to let the lanes in the same SIMD processor work efficiently, data-uniformity analysis provides many optimization opportunities in register allocation, transformation, and code generation [112].

Now you can see llvm IR expanding from CPU to GPU and becoming more and more influential. In fact, llvm IR has kept expanding from version 3.1 until now.

Accelerate ML/DL on OpenCL/SYCL

_images/opencl_ml_graph.png

Fig. 95 Implement ML graph scheduler both on compiler and runtime

As in Fig. 95, a Device that is a GPU, or a CPU+NPU, is able to run the whole ML graph. However, if the Device has an NPU only, then CPU operations such as Avg-Pool have to run on the Host side, which adds communication cost between Host and Device.

Like OpenGL’s shaders, the “kernel” function may be compiled on-line or off-line and sent to the GPU as a programmable function.

In order to run ML (Machine Learning) efficiently, all platforms for ML on GPU/NPU implement scheduling SW both in the graph compiler and in the runtime. If OpenCL can be extended to support ML graphs, then open source graph compilers such as TVM, or open source runtimes, have a chance to leverage programmers’ scheduling-SW effort [113]. Cuda graph is an idea like this [114] [115]; a sketch follows below.
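A hedged sketch of the Cuda graph idea from [114] [115]: a sequence of kernel launches is captured once into a graph and then re-launched as a single unit, so the runtime can schedule the whole graph instead of individual launches. kernelA/kernelB and the launch configuration are placeholders; the cudaGraph* calls are standard CUDA runtime APIs (the instantiate call uses the CUDA 11.x signature).

#include <cuda_runtime.h>

__global__ void kernelA() { /* placeholder work */ }
__global__ void kernelB() { /* placeholder work */ }

void run_with_cuda_graph(int steps) {
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // Record the kernel launches once into a graph.
  cudaGraph_t graph;
  cudaGraphExec_t graphExec;
  cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
  kernelA<<<80, 256, 0, stream>>>();
  kernelB<<<80, 256, 0, stream>>>();
  cudaStreamEndCapture(stream, &graph);
  cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);  // CUDA 11.x signature

  // Replay the whole graph with a single launch per step.
  for (int i = 0; i < steps; i++)
    cudaGraphLaunch(graphExec, stream);
  cudaStreamSynchronize(stream);

  cudaGraphExecDestroy(graphExec);
  cudaGraphDestroy(graph);
  cudaStreamDestroy(stream);
}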

  • SYCL: uses C++ templates to optimize and generate code for OpenCL and Cuda. It provides a consistent language, APIs, and ecosystem in which to write and tune code for different accelerator architectures: CPUs, GPUs, and FPGAs [116].

    • SYCL uses generic programming with templates and generic lambda functions to enable higher-level application software to be cleanly coded with optimized acceleration of kernel code across an extensive range of acceleration backend APIs, such as OpenCL and CUDA [117].

_images/sycl.png

Fig. 96 SYCL = C++ templates and a compiler for Data Parallel Applications and AI on CPUs, GPUs and FPGAs.

  • DPC++ (OneDPC) compiler: based on SYCL, DPC++ can compile the DPC++ language for the CPU host and the GPU device. DPC++ (Data Parallel C++) is a language from Intel that may be adopted by C++; its GPU side (kernel code) is C++ without exception handling [118] [119].

    • Features of Kernel Code:

      • Not supported:

        Dynamic polymorphism, dynamic memory allocations (therefore no object management using new or delete operators), static variables, function pointers, runtime type information (RTTI), and exception handling. No virtual member functions, and no variadic functions, are allowed to be called from kernel code. Recursion is not allowed within kernel code.

      • Supported:

        Lambdas, operator overloading, templates, classes, and static polymorphism [120].

Open Sources