The concept of GPU compiler

Basically, each CPU core is a SISD (Single Instruction, Single Data) architecture. The multimedia instructions in a CPU are small-scale SIMD (Single Instruction, Multiple Data), while a GPU is a large-scale SIMD processor, coloring millions of pixels of an image in a few milliseconds. Since 2D and 3D graphics processing offers abundant data parallelism, GPU hardware is usually composed of thousands of functional units per core, as in Nvidia processors.

This chapter gives an overview of how 3D animation is created and run on CPU+GPU, introduces the concept of a GPU compiler and the hardware features for graphics applications, and explains how the GPU has taken over more applications from the CPU as the GPGPU concept and related standards emerged.

Website: Basic theory of 3D graphics with OpenGL [1].

Concept in graphic and system

3D modeling

By covering the skin of the model with triangles or quads, the 3D model is created as a polygon mesh [2] formed by all the vertices, as in the first image of Fig. 52.

_images/modeling1.png

Fig. 52 Creating 3D model and texturing

After smooth shading [2], the vertices and edge lines are covered with color (or rather the edges are removed; an edge never actually has a black line), and the model looks much smoother [3]. Further, after texturing (texture mapping), the model looks more real [4].

To understand how a 3D model is animated, please watch the video here [5]. According to the video for skeleton animation, the joint positions are set at different poses and a time is given to each pose (keyframe), as in Fig. 53.

_images/animation.png

Fig. 53 Set time point at keyframes

In this series of videos, you find that the 3D modeling tool generates Java code instead of C/C++ code to call the OpenGL API and shaders. This is possible because Java can call the OpenGL API through a wrapper library [6].

Every CAD software manufacturer, such as Autodesk and Blender, has its own proprietary format. To solve the problem of interoperability, neutral or open source formats were invented as intermediate formats for converting between two proprietary formats. Naturally, these neutral formats have become hugely popular. Two famous examples are STL (with a .STL extension) and COLLADA (with a .DAE extension). Here is the list of 3D file formats, marked with their type.

Table 42 3D file formats [7]

3D file format   Type
STL              Neutral
OBJ              ASCII variant is neutral, binary variant is proprietary
FBX              Proprietary
COLLADA          Neutral
3DS              Proprietary
IGES             Neutral
STEP             Neutral
VRML/X3D         Neutral

The four key features a 3D file can store include the model’s geometry, the model’s surface texture, scene details, and animation of the model [7].

Specifically, they can store details about four key features of a 3D model, though it’s worth bearing in mind that you may not always take advantage of all four features in all projects, and not all file formats support all four features!

3D printer applications do not need to support animation. CAD and CAM applications, such as those for designing an airplane, do not need the scene-details feature.

DAE (Collada) appeared in the video animation above. Collada files belong to a neutral format used heavily in the video game and film industries. It’s managed by the non-profit technology consortium, the Khronos Group.

The file extension for the Collada format is .dae. The Collada format stores data using the XML mark-up language.

The original intention behind the Collada format was to become a standard among 3D file formats. Indeed, in 2013, it was adopted by ISO as a publicly available specification, ISO/PAS 17506. As a result, many 3D modeling programs support the Collada format.

That said, the consensus is that the Collada format hasn’t kept up with the times. It was once used heavily as an interchange format for Autodesk Max/Maya in film production, but the industry has now shifted more towards OBJ, FBX, and Alembic [7].

Graphic SW stack

The roles of the CPU and GPU in graphics animation are shown in Fig. 54.

_images/graphic-cpu-gpu.png

Fig. 54 OpenGL and Vulkan are both rendering APIs. In both cases, the GPU executes shaders, while the CPU executes everything else [8].

  • The GPU can't directly read user input from, say, a keyboard, mouse, or gamepad, nor play audio, load files from a hard drive, or anything like that. Therefore the animation work cannot be handled by the GPU alone [9].

The driver runs on the CPU side, as in Fig. 55. An OpenGL API call eventually reaches the driver's functions, and the driver finishes the work by issuing GPU hardware commands and/or sending data to the GPU. The GPU's firmware only manages clock, voltage, power consumption, etc. [10]. Even so, the GPU's render work, based on the 3D vertices, colors, and other data sent from the CPU and stored in GPU memory or shared memory, consumes more computing power than the CPU.

digraph G {
  rankdir=LR;
  
  compound=true;
  node [shape=record];
  subgraph cluster_cpu {
    label = "CPU (Client)";
    CPU_SW [label=" 3D Model | JAVA | JOGL | { OpenGL API | Shaders \n (builtin-functions)} | <f1> Driver"];
  }
  subgraph cluster_gpu {
    label = "GPU HW (Server)"
    GPU_SW [label="<f1> 3D Rendering-pipeline \ndescribed in next section"];
  }
  CPU_SW:f1 -> GPU_SW:f1 [label=" VAO, texture, ..., from 3D model, \n shader-executable-code"];
    
//  label = "Graphic SW Stack";
}

Fig. 55 Graphic SW Stack

  • According to the previous section, after the user creates the skeleton and skin for each model and sets keyframe times through the 3D modeling tool, the tool can either generate Java code calling JOGL (Java OpenGL) [6] or generate OpenGL API calls directly. The frame data can be calculated by interpolation between keyframes (see the sketch after this list).

  • As above, for every animation step the client CPU program sets the new positions of objects (vertices) and colors, the data of one frame, and the server (driver and GPU) does the 3D-to-2D rendering. Higher-level libraries and frameworks on top of OpenGL provide animation frameworks and tools to generate OpenGL API calls and shaders from the 3D model.

  • Shaders may call builtin functions which are written as compute shaders, spir-v, or LLVM-IR. LLVM libclc is a project for the builtin functions of OpenCL which can be used in OpenGL too [11]. Like a CPU's builtin functions, a new GPU ISA/architecture has to implement its builtin functions or port them from open source such as libclc.

  • The 3D model (CPU side) does the animation rendering that generates each frame between keyframes (poses), while the GPU runs the rendering pipeline that turns each frame into pixel values.

  • These frame data exist in the form of VAOs (Vertex Array Objects) in OpenGL. This will be explained in the later section on OpenGL.

  • In addition, OpenGL provides the vertex buffer object (VBO), allowing vertex array data to be stored in high-performance graphics memory on the server side, which promotes efficient data transfer [12] [13].
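As a rough illustration of the keyframe interpolation mentioned above, the following minimal C++ sketch linearly interpolates one joint position between two keyframes. The Keyframe struct and lerpJoint function are hypothetical names for illustration only, not part of any modeling tool's API; real tools also interpolate rotations, usually with quaternions.

#include <array>

// A hypothetical keyframe: a time stamp and one joint position (x, y, z).
struct Keyframe {
  float time;
  std::array<float, 3> jointPos;
};

// Linearly interpolate the joint position for a frame at time t,
// where k0.time <= t <= k1.time.
std::array<float, 3> lerpJoint(const Keyframe &k0, const Keyframe &k1, float t) {
  float w = (t - k0.time) / (k1.time - k0.time);   // 0.0 at k0, 1.0 at k1
  std::array<float, 3> p;
  for (int i = 0; i < 3; ++i)
    p[i] = (1.0f - w) * k0.jointPos[i] + w * k1.jointPos[i];
  return p;
}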

The flow of 3D/2D graphics processing is shown in Fig. 56.

digraph G {
  rankdir=LR;

  compound=true;
  node [shape=record];
  subgraph cluster_3d {
    label = "3D/2D modeling software";
    CodeGen [label="code-gen"];
    subgraph cluster_code {
      label = "Generated Code";
      Api [label="<a> OpenGL API | <s> Shaders"];
    }
    Hand [label="hand-modifying"];
  }
  subgraph cluster_driver {
    label = "Driver"
    Compiler [label="On-line Compiler"];
    Obj [label="obj"];
    Linker [label="On-line binding (Linker)"];
    Exe [label="exe"];
  }
  CodeGen -> Api [lhead ="cluster_code"];
  Api -> Hand [ltail ="cluster_code"];
  Hand -> Api [lhead ="cluster_code"];
  Api:a -> Obj [lhead ="cluster_driver"];
  Api:s -> Compiler;
  Compiler -> Obj -> Linker -> Exe;
  Exe -> GPU;
  Exe -> CPU [ltail ="cluster_driver"]; 

//  label = "OpenGL Flow";
}

Fig. 56 OpenGL Flow

_images/db-vsync.png

Fig. 57 VSync

VSync

No tearing: the GPU and the display run at the same refresh rate, even though the GPU could refresh faster than the display.

              A    B

GPU      | ----| ----|

Display  |-----|-----|

            B      A

Tearing: the GPU frame takes exactly one refresh cycle but misses the VSync, so the swap takes one cycle more than the display.

              A

GPU      | -----|

Display  |-----|-----|

            B      A

Avoid tearing: the GPU refresh rate is 1/2 of the display's refresh rate.

              A          B

GPU      | -----|    | -----|

Display  |-----|-----|-----|-----|

            B      B    A     A
  • Double Buffering

    While the display is reading from the frame buffer to display the current frame, we might be updating its contents for the next frame (not necessarily in raster-scan manner). This would result in so-called tearing, in which the screen shows parts of the old frame and parts of the new frame. This can be resolved by using so-called double buffering. Instead of using a single frame buffer, a modern GPU uses two of them: a front buffer and a back buffer. The display reads from the front buffer, while we can write the next frame to the back buffer. When we finish, we signal the GPU to swap the front and back buffer (known as buffer swap or page flip).

  • VSync

    Double buffering alone does not solve the entire problem, as the buffer swap might occur at an inappropriate time, for example, while the display is in the middle of displaying the old frame. This is resolved via the so-called vertical synchronization (or VSync) at the end of the raster-scan. When we signal to the GPU to do a buffer swap, the GPU will wait till the next VSync to perform the actual swap, after the entire current frame is displayed.

    As in the text diagrams above, the most important point is: when the VSync buffer-swap is enabled, you cannot refresh the display faster than the refresh rate of the display! If the GPU is capable of producing higher frame rates than the display's refresh rate, the GPU can run at that faster rate without tearing. If the GPU produces the same or a lower frame rate than the display's and your application refreshes at a fixed rate, the resultant refresh rate is likely to be an integral factor of the display's refresh rate, i.e., 1/2, 1/3, 1/4, etc. Otherwise it will cause tearing [1].

  • NVIDIA G-SYNC and AMD FreeSync

    If the monitor and graphics card in your computer both support NVIDIA G-SYNC, you're in luck. With this technology, a special chip in the display communicates with the graphics card. This lets the monitor vary the refresh rate to match the frame rate of the NVIDIA GTX graphics card, up to the maximum refresh rate of the display. This means that frames are displayed as soon as they are rendered by the GPU, eliminating screen tearing and reducing stutter both when the frame rate is higher and when it is lower than the refresh rate of the display. This makes it perfect for situations where the frame rate varies, which happens a lot when gaming. Today, you can even find G-SYNC technology in gaming laptops!

    AMD has a similar solution called FreeSync. However, this doesn’t require a proprietary chip in the monitor. In FreeSync, the AMD Radeon driver, and the display firmware handle the communication. Generally, FreeSync monitors are less expensive than their G-SYNC counterparts, but gamers generally prefer G-SYNC over FreeSync as the latter may cause ghosting, where old images leave behind artifacts [14].

Basic geometry in computer graphics

  • Additive colors in light, as Fig. 58 [15] [16]. In paints, by contrast, mixing adds shade (dark color), so the result becomes grey rather than white [17].

_images/additive-colors.png

Fig. 58 Additive colors in light

This section introduces the basic geometry math for computer graphics. Every computer graphics book covers the transformation of objects and positions in space. Chapter 4 of the Blue Book, OpenGL SuperBible 7th Edition, gives a short description (40 pages) of the useful concepts and is a good material for learning them. The following quaternion product (Hamilton product) rules are quoted from Wikipedia [18], since the book omits them.

\[\mathbf{i}\mathbf{j} = -\mathbf{j}\mathbf{i} = \mathbf{k}, \quad \mathbf{j}\mathbf{k} = -\mathbf{k}\mathbf{j} = \mathbf{i}, \quad \mathbf{k}\mathbf{i} = -\mathbf{i}\mathbf{k} = \mathbf{j}.\]

The complete treatment can be found in the book “Computer Graphics: Principles and Practice, 3rd edition” by John F. Hughes et al., but it is about a thousand pages. That book is very complete and may take much time to understand in every detail.

Both triangles and quads are polygons, so objects can be formed with polygons in both 2D and 3D. Transformations in 2D and 3D are covered well in almost every computer graphics book already. This section introduces the most important concept and method for deciding inner and outer planes, so that a point or object can be checked for showing or hiding during 2D or 3D rendering.

The area of any polygon can be calculated by dividing it into triangles or quads, and the area of any triangle or quad can be calculated by the cross product in 3D. The cross product in 3D is defined by the following formula and can be represented in matrix notation, as proved here [19].

\[\mathbf{a} \times \mathbf{b} = \Vert \mathbf{a} \Vert \, \Vert \mathbf{b} \Vert \sin(\Theta)\, \mathbf{n}\]
\[\mathbf{a} \times \mathbf{b} = \begin{vmatrix} \mathbf{i} & \mathbf{j} & \mathbf{k} \\ a_1 & a_2 & a_3 \\ b_1 & b_2 & b_3 \end{vmatrix}\]

The cross product in 2D is defined by the following formula and can be represented in matrix notation, as proved here [20] [21].

\[\mathbf{a} \times \mathbf{b} = \Vert \mathbf{a} \Vert \, \Vert \mathbf{b} \Vert \sin(\Theta)\]
\[\mathbf{a} \times \mathbf{b} = \begin{vmatrix} \mathbf{i} & \mathbf{j} & \mathbf{k} \\ a_1 & a_2 & 0 \\ b_1 & b_2 & 0 \end{vmatrix} = \begin{vmatrix} a_1 & a_2 \\ b_1 & b_2 \end{vmatrix}\]

Once the matrix form above is established, the antisymmetry can be proved as follows,

\[\mathbf{a} \times \mathbf{b} = \begin{vmatrix} a_1 & a_2 \\ b_1 & b_2 \end{vmatrix} = a_1 b_2 - a_2 b_1\]
\[= -(b_1 a_2 - b_2 a_1) = \begin{vmatrix} -b_1 & -b_2 \\ a_1 & a_2 \end{vmatrix} = -\,\mathbf{b} \times \mathbf{a}\]

In 2D, any two points \(P_i\) and \(P_{i+1}\) form a vector that divides the plane into an inner side and an outer side. For example, as in Fig. 59, \(\Theta\) is the angle from \(P_iP_{i+1}\) to \(P_iP'_{i+1}\), which is \(180^\circ\). So, with the right-hand rule and counter-clockwise order, for any \(P_iQ\) between \(P_iP_{i+1}\) and \(P_iP'_{i+1}\), the angle \(\theta\) from \(P_iP_{i+1}\) to \(P_iQ\) satisfies \(0^\circ < \theta < 180^\circ\), and thus the inward direction can be decided.

_images/inward-edge-normals.png

Fig. 59 Inward edge normals

_images/2d-vector-inward.png

Fig. 60 Inward and outward in 2D for a vector.

Based on this observation, the rule for inward and outward relative to any vector is as in Fig. 59: facing the same direction as a specific vector, the left side is inward and the right side is outward, as Fig. 60 shows.

For each edge \(P_i - P_{i+1}\), the inward edge normal is the vector \(\times v_i\) and the outward edge normal is \(-\times v_i\), where \(\times v_i\) denotes the cross product of \(v_i\), as in Fig. 59.
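The following small C++ sketch, under the notation above, computes the 3D cross product, the 2D (scalar) cross product, and the 90-degree rotation ×v used for the inward edge normal. The function names are chosen only for illustration.

#include <array>

using Vec3 = std::array<float, 3>;
using Vec2 = std::array<float, 2>;

// 3D cross product: a x b = (a2*b3 - a3*b2, a3*b1 - a1*b3, a1*b2 - a2*b1).
Vec3 cross3(const Vec3 &a, const Vec3 &b) {
  return { a[1]*b[2] - a[2]*b[1],
           a[2]*b[0] - a[0]*b[2],
           a[0]*b[1] - a[1]*b[0] };
}

// 2D cross product (a scalar): a x b = a1*b2 - a2*b1.
float cross2(const Vec2 &a, const Vec2 &b) {
  return a[0]*b[1] - a[1]*b[0];
}

// "x v": rotate v by 90 degrees counter-clockwise; for a polygon listed in
// counter-clockwise order this is the inward edge normal of edge vector v.
Vec2 perp(const Vec2 &v) {
  return { -v[1], v[0] };
}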

A polygon can be created from vertices. Suppose that \((P_0, P_1, ..., P_n)\) is a polygon. The line segments \(P_0P_1, P_1P_2\), etc., are the edges of the polygon; the vectors \(v_0 = P_1 - P_0, v_1 = P_2 - P_1, ..., v_n = P_0 - P_n\) are the edge vectors of the polygon. Going counter-clockwise, the left side is inward, and thus the inward region of the polygon can be decided.

For a convex polygon whose vertices are listed in counter clockwise order, the inward edge normals point toward the interior of the polygon, and the outward edge normals point toward the unbounded exterior of the polygon, corresponding to our ordinary intuition. But if the vertices of a polygon are given in clockwise order, the interior and exterior swap roles.

This cross product has an important property: going from v to ×v involves a rotation by 90° in the same direction as the rotation that takes the positive x-axis to the positive y-axis.

_images/polygon.png

Fig. 61 Draw a polygon with vertices in counter-clockwise order

As Fig. 61 shows, when drawing a polygon with vectors (lines) in counter-clockwise order, the polygon is created and the two sides of each vector (line) can be identified [22]. Further, whether a point is inside or outside the polygon can be determined. One simple way of finding whether a point is inside or outside a simple polygon is to test how many times a ray, starting from the point and going in any fixed direction, intersects the edges of the polygon. If the point is outside the polygon, the ray will intersect its edges an even number of times; if the point is inside, it will intersect the edges an odd number of times [23].
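A minimal C++ sketch of this even-odd ray-casting test is given below, assuming the polygon is stored as a list of 2D vertices; pointInPolygon is a name chosen only for illustration.

#include <vector>
#include <array>

using Pt = std::array<float, 2>;

// Even-odd rule: cast a horizontal ray from p toward +x and count
// how many polygon edges it crosses. Odd count = inside, even = outside.
bool pointInPolygon(const Pt &p, const std::vector<Pt> &poly) {
  bool inside = false;
  size_t n = poly.size();
  for (size_t i = 0, j = n - 1; i < n; j = i++) {
    const Pt &a = poly[i], &b = poly[j];
    // Does edge (a, b) straddle the horizontal line y = p[1],
    // and does the crossing point lie to the right of p?
    bool straddles = (a[1] > p[1]) != (b[1] > p[1]);
    if (straddles) {
      float xCross = (b[0] - a[0]) * (p[1] - a[1]) / (b[1] - a[1]) + a[0];
      if (p[0] < xCross)
        inside = !inside;   // each crossing toggles inside/outside
    }
  }
  return inside;
}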

_images/3d-cross-product.png

Fig. 62 Cross product definition in 3D

In the same way, by following the same counter-clockwise direction to create the 2D polygons one by one, a 3D polygon can be created. As Fig. 62 from Wikipedia [19] shows, in OpenGL the inward direction corresponds to a x b < 0 and the outward direction to a x b > 0. Replacing a, b with x, y as in Fig. 63, the z+ axis points to the outer surface and z- to the inner surface [24].

_images/ogl-pointing-outwards.png

Fig. 63 OpenGL pointing outwards, indicating the outer surface (z axis is +)

_images/3d-polygon.png

Fig. 64 3D polygon with directions on each plane

Fig. 64 is an example of a 3D polygon created from 2D triangles. The direction of each plane (triangle) is the line perpendicular to that plane.

Cast a ray from the 3D point along the X-axis and check how many intersections with the outer object you find. Depending on the number of intersections on each axis (even or odd), you can tell whether the point is inside or outside [25]: odd means inside and even means outside. As Fig. 65 shows, points on the line going through the object satisfy this rule.

_images/in-3d-object.png

Fig. 65 Point in or out 3D object

OpenGL uses counter clockwise and pointing outwards as default [12].

// unit cube
// A cube has 6 sides and each side has 4 vertices, therefore, the total number
// of vertices is 24 (6 sides * 4 verts), and 72 floats in the vertex array
// since each vertex has 3 components (x,y,z) (= 24 * 3)
//    v6----- v5
//   /|      /|
//  v1------v0|
//  | |     | |
//  | v7----|-v4
//  |/      |/
//  v2------v3

// vertex position array
GLfloat vertices[]  = {
   .5f, .5f, .5f,  -.5f, .5f, .5f,  -.5f,-.5f, .5f,  .5f,-.5f, .5f, // v0,v1,v2,v3 (front)
   .5f, .5f, .5f,   .5f,-.5f, .5f,   .5f,-.5f,-.5f,  .5f, .5f,-.5f, // v0,v3,v4,v5 (right)
   .5f, .5f, .5f,   .5f, .5f,-.5f,  -.5f, .5f,-.5f, -.5f, .5f, .5f, // v0,v5,v6,v1 (top)
  -.5f, .5f, .5f,  -.5f, .5f,-.5f,  -.5f,-.5f,-.5f, -.5f,-.5f, .5f, // v1,v6,v7,v2 (left)
  -.5f,-.5f,-.5f,   .5f,-.5f,-.5f,   .5f,-.5f, .5f, -.5f,-.5f, .5f, // v7,v4,v3,v2 (bottom)
   .5f,-.5f,-.5f,  -.5f,-.5f,-.5f,  -.5f, .5f,-.5f,  .5f, .5f,-.5f  // v4,v7,v6,v5 (back)
};

From the code above, we can see that OpenGL uses counter-clockwise order and outward-pointing normals by default. However, OpenGL provides glFrontFace(GL_CW) for clockwise order [26].

For a group of objects, a scene graph provides better animation and saves memory [27].

OpenGL

Example of OpenGL program

The following example is from the OpenGL Red Book and its example code [32] [33].

References/triangles.vert


#version 400 core

layout( location = 0 ) in vec4 vPosition;

void
main()
{
    gl_Position = vPosition;
}

References/triangles.frag

#version 450 core

out vec4 fColor;

void main()
{
    fColor = vec4(0.5, 0.4, 0.8, 1.0);
}

References/01-triangles.cpp

//////////////////////////////////////////////////////////////////////////////
//
//  Triangles.cpp
//
//////////////////////////////////////////////////////////////////////////////

#include "vgl.h"
#include "LoadShaders.h"

enum VAO_IDs { Triangles, NumVAOs };
enum Buffer_IDs { ArrayBuffer, NumBuffers };
enum Attrib_IDs { vPosition = 0 };

GLuint  VAOs[NumVAOs];
GLuint  Buffers[NumBuffers];

const GLuint  NumVertices = 6;

//----------------------------------------------------------------------------
//
// init
//

void
init( void )
{
    glGenVertexArrays( NumVAOs, VAOs ); // Same with glCreateVertexArray( NumVAOs, VAOs );
      // https://stackoverflow.com/questions/24441430/glgen-vs-glcreate-naming-convention
    // Make the new VAO:VAOs[Triangles] active, creating it if necessary.
    glBindVertexArray( VAOs[Triangles] );
    // opengl->current_array_buffer = VAOs[Triangles]

    GLfloat  vertices[NumVertices][2] = {
        { -0.90f, -0.90f }, {  0.85f, -0.90f }, { -0.90f,  0.85f },  // Triangle 1
        {  0.90f, -0.85f }, {  0.90f,  0.90f }, { -0.85f,  0.90f }   // Triangle 2
    };

    glCreateBuffers( NumBuffers, Buffers );

    // Make the buffer the active array buffer.
    glBindBuffer( GL_ARRAY_BUFFER, Buffers[ArrayBuffer] );
    // Attach the active VBO:Buffers[ArrayBuffer] to VAOs[Triangles]
    // as an array of vectors with 4 floats each.
    // Kind of like:
    // opengl->current_vertex_array->attributes[attr] = {
    //     type = GL_FLOAT,
    //     size = 4,
    //     data = opengl->current_array_buffer
    // }
    // Can be replaced with glVertexArrayVertexBuffer(VAOs[Triangles], Triangles,
    // buffer[ArrayBuffer], ArrayBuffer, sizeof(vmath::vec2));, glVertexArrayAttribFormat(), ...
    // in OpenGL 4.5.

    glBufferStorage( GL_ARRAY_BUFFER, sizeof(vertices), vertices, 0);

    ShaderInfo  shaders[] =
    {
        { GL_VERTEX_SHADER, "media/shaders/triangles/triangles.vert" },
        { GL_FRAGMENT_SHADER, "media/shaders/triangles/triangles.frag" },
        { GL_NONE, NULL }
    };

    GLuint program = LoadShaders( shaders );
    glUseProgram( program );

    glVertexAttribPointer( vPosition, 2, GL_FLOAT,
                           GL_FALSE, 0, BUFFER_OFFSET(0) );
    glEnableVertexAttribArray( vPosition );
    // Above two functions specify vPosition to vertex shader at layout (location = 0)
}

//----------------------------------------------------------------------------
//
// display
//

void
display( void )
{
    static const float black[] = { 0.0f, 0.0f, 0.0f, 0.0f };

    glClearBufferfv(GL_COLOR, 0, black);

    glBindVertexArray( VAOs[Triangles] );
    glDrawArrays( GL_TRIANGLES, 0, NumVertices );
}

//----------------------------------------------------------------------------
//
// main
//

#ifdef _WIN32
int CALLBACK WinMain(
  _In_ HINSTANCE hInstance,
  _In_ HINSTANCE hPrevInstance,
  _In_ LPSTR     lpCmdLine,
  _In_ int       nCmdShow
)
#else
int
main( int argc, char** argv )
#endif
{
    glfwInit();

    GLFWwindow* window = glfwCreateWindow(800, 600, "Triangles", NULL, NULL);

    glfwMakeContextCurrent(window);
    gl3wInit();

    init();

    while (!glfwWindowShouldClose(window))
    {
        display();
        glfwSwapBuffers(window);
        glfwPollEvents();
    }

    glfwDestroyWindow(window);

    glfwTerminate();
}

Init():

  • Generate Vertex Array VAOs and bind VAOs[0].

    (glGenVertexArrays( NumVAOs, VAOs ); glBindVertexArray( VAOs[Triangles] ); glCreateBuffers( NumBuffers, Buffers );) A vertex-array object holds various data related to a collection of vertices. Those data are stored in buffer objects and managed by the currently bound vertex-array object.

    • glBindBuffer( GL_ARRAY_BUFFER, Buffers[ArrayBuffer] );

      Because there are many different places where buffer objects can be used in OpenGL, when we bind a buffer, we need to specify what we'd like to use it for. In our example, because we're storing vertex data into the buffer, we use GL_ARRAY_BUFFER. The place where the buffer is bound is known as the binding target.

  • According to the counter-clockwise rule in the previous section, triangle primitives are defined in the variable vertices. After binding the OpenGL VBO Buffers[0] to vertices, the vertex data will be sent to the memory of the server (GPU). Think of the “active” buffer as just a global variable, and there are a bunch of functions which use the active buffer instead of taking a parameter. These global state variables are the ugly side of OpenGL [34] and can be replaced with glVertexArrayVertexBuffer(), glVertexArrayAttribFormat(), …, followed by calling glBindVertexArray(vao) before drawing, in OpenGL 4.5 [35] [36].

  • glVertexAttribPointer( vPosition, 2, GL_FLOAT, GL_FALSE, 0, BUFFER_OFFSET(0) ): During GPU rendering, each vertex position will be held in vPosition and passed to the “triangles.vert” shader because of LoadShaders( shaders ).

glfwSwapBuffers(window):

  • You’ve already used double buffering for animation. Double buffering is done by making the main color buffer have two parts: a front buffer that’s displayed in your window; and a back buffer, which is where you render the new image. When you swap the buffers (by calling glfwSwapBuffers(), for example), the front and back buffers are exchanged [71].

display():

  • Bind VAOs[0], set the render mode to GL_TRIANGLES, and send the vertex data to the buffer (GPU memory, OpenGL pipeline). Next, the GPU runs the rendering pipeline described in the next section.

The triangles.vert shader has the input vPosition and no output variable, so it uses the default variable gl_Position without declaration. The triangles.frag shader has no defined input variable and defines the output variable fColor instead of using gl_FragColor.

The “in” and “out” in shaders above are “type qualifier”. A type qualifier is used in the OpenGL Shading Language (GLSL) to modify the storage or behavior of global and locally defined variables. These qualifiers change particular aspects of the variable, such as where they get their data from and so forth [40].

Though attribute and varying were removed after version 1.4 of the OpenGL shading language, many materials on websites still use them [41] [42]. It is better to use “in” and “out” to replace them, as in the following code. OpenGL has a few ways to bind an API variable to a shader variable: glVertexAttrib* as in the following code, glBindAttribLocation() [43], …

replace attribute and varying with in and out

uniform float scale;
layout (location = 0) attribute vec2 position;
// layout (location = 0) in vec2 position;
layout (location = 1) attribute vec4 color;
// layout (location = 1) in vec4 color;
varying vec4 v_color;
// out v_color

void main()
{
  gl_Position = vec4(position*scale, 0.0, 1.0);
  v_color = color;
}
// OpenGL API
GLfloat attrib[] = { x * 0.5f, x * 0.6f, x* 0.4f, 0.0f };
// Update the value of input attribute 1 : layout (location = 1) in vec4 color
glVertexAttrib4fv(1, attrib);
varying vec4 v_color;
// in vec4 v_color;

void main()
{
  gl_FragColor = v_color;
}

An OpenGL program is made of two shaders [38] [39]:

  • The vertex shader is (commonly) executed once for every vertex we want to draw. It receives some attributes as input, computes the position of this vertex in space and returns it in a variable called gl_Position. It also defines some varyings.

  • The fragment shader is executed once for each pixel to be rendered. It receives some varyings as input, computes the color of this pixel and returns it in a variable called fColor.

Since we have 6 vertices in our buffer, this shader will be executed 6 times by the GPU (once per vertex)! We can also expect all 6 instances of the shader to be executed in parallel, since a GPU has so many cores.

3D Rendering

Based on the previous section on 3D modeling, the 3D modeling tool generates the 3D vertex model and OpenGL code; programmers may then hand-modify the OpenGL code and add or modify shaders. The 3D animation triggers a 3D rendering for each 2D image drawn.

3D rendering is the process of converting 3D models into 2D images on a computer [28]. The steps are shown in the following Fig. 66.

_images/short-rendering-pipeline.png

Fig. 66 3D Graphics Rendering Pipeline

  • A fragment can be treated as a pixel in 3D spaces, which is aligned with the pixel grid, with attributes such as position, color, normal and texture.

The complete steps are shown in the following Fig. 67 from the OpenGL website [29], and the website has a description of each stage.

_images/rendering_pipeline.png

Fig. 67 Diagram of the Rendering Pipeline. The blue boxes are programmable shader stages.

In addition, the OpenGL rendering pipeline (Figure 1.2) and its stages from the book “OpenGL Programming Guide, 9th Edition” [32] are listed as follows,

_images/OpenGL-pipeline.png
Table 43 OpenGL rendering pipeline from page 10 of book “OpenGL Programming Guide 9th Edition” [32] and [29].

Vertex Specification
  After setting data as in the example of the previous section, glDrawArrays() sends the data to the GPU through buffer objects.

Vertex Shading
  Vertex -> vertex, plus other data such as color for later passes. For each vertex that is issued by a drawing command, a vertex shader is called to process the data associated with that vertex.

Tessellation Shading
  Creates more detail on demand when zooming in. After the vertex shader has processed each vertex's associated data, the tessellation shader stage will continue processing that data, if it has been activated. See the reference below.

Geometry Shading
  The next shader stage, geometry shading, allows additional processing of individual geometric primitives, including creating new ones, before rasterization. Chapter 10 of the Red Book [32] has details.

Primitive Assembly
  The previous shading stages all operate on vertices, with the information about how those vertices are organized into geometric primitives being carried along internally to OpenGL. The primitive assembly stage organizes the vertices into their associated geometric primitives in preparation for clipping and rasterization.

Clipping
  Clips hidden parts. Occasionally, vertices will be outside of the viewport, the region of the window where you're permitted to draw, which causes the primitive associated with that vertex to be modified so that none of its pixels are outside of the viewport. This operation is called clipping and is handled automatically by OpenGL.

Rasterization
  Vertex -> fragment. The job of the rasterizer is to determine which screen locations are covered by a particular piece of geometry (point, line, or triangle). Knowing those locations, along with the input vertex data, the rasterizer linearly interpolates the data values for each varying variable in the fragment shader and sends those values as inputs to your fragment shader. A fragment can be treated as a pixel in 3D space, aligned with the pixel grid, with attributes such as position, color, normal and texture.

Fragment Shading
  Determines the color of each pixel. The final stage where you have programmable control over the color of a screen location is fragment shading. In this shader stage, you use a shader to determine the fragment's final color (although the next stage, per-fragment operations, can modify the color one last time) and potentially its depth value. Fragment shaders are very powerful, as they often employ texture mapping to augment the colors provided by the vertex processing stages. A fragment shader may also terminate processing a fragment if it determines the fragment shouldn't be drawn; this process is called fragment discard. A helpful way of thinking about the difference between shaders that deal with vertices and fragment shaders is this: vertex shading (including tessellation and geometry shading) determines where on the screen a primitive is, while fragment shading uses that information to determine what color that fragment will be.

Per-Fragment Operations
  During this stage, a fragment's visibility is determined using depth testing (also commonly known as z-buffering) and stencil testing. If a fragment successfully makes it through all of the enabled tests, it may be written directly to the framebuffer, updating the color (and possibly depth value) of its pixel, or, if blending is enabled, the fragment's color will be combined with the pixel's current color to generate a new color that is written into the framebuffer.

  • Tessellation Shading: The core problem that tessellation deals with is the static nature of 3D models in terms of their detail and polygon count. The thing is that when we look at a complex model such as a human face up close, we prefer to use a highly detailed model that brings out the tiny details (e.g. skin bumps, etc.). A highly detailed model automatically translates to more triangles and more compute power required for processing. … One possible way to solve this problem using the existing features of OpenGL is to generate the same model at multiple levels of detail (LOD), for example highly detailed, average, and low. We can then select the version to use based on the distance from the camera. This, however, will require more artist resources and often will not be flexible enough. … Let's take a look at how tessellation has been implemented in the graphics pipeline. The core components that are responsible for tessellation are two new shader stages, and in between them a fixed-function stage that can be configured to some degree but does not run a shader. The first shader stage is called the Tessellation Control Shader (TCS), the fixed-function stage is called the Primitive Generator (PG), and the second shader stage is called the Tessellation Evaluation Shader (TES). Some GPUs do not have this fixed-function stage implemented in hardware and do not even provide TCS, TES and geometry shaders; users can write compute shaders instead for this on-the-fly detail display. This surface is usually defined by some polynomial formula, and the idea is that moving a CP has an effect on the entire surface. … The group of CPs is usually called a patch [30]. Chapter 9 of the Red Book [32] has details.

For 2D animation, the model is created in 2D only (one face only), so it can only be viewed from that same face of the model. If you want to display different faces of the model, multiple 2D models need to be created, and these 2D models are switched from frame to frame over time [31].

GLSL (GL Shader Language)

OpenGL is a standard for designing 2D/3D animation in computer graphics. To support animation, OpenGL provides a lot of API functions for graphics processing. 3D model construction tools such as Maya, Blender, etc., only need to call this API to do the 3D-to-2D projection on the computer. Any GPU-hardware-dependent code behind these APIs is provided by the GPU manufacturer. An OpenGL program looks like the following,

Vertex shader

#version 330 core
layout (location = 0) in vec3 aPos; // the position variable has attribute position 0

out vec4 vertexColor; // specify a color output to the fragment shader

void main()
{
    gl_Position = vec4(aPos, 1.0); // see how we directly give a vec3 to vec4's constructor
    vertexColor = vec4(0.5, 0.0, 0.0, 1.0); // set the output variable to a dark-red color
}

Fragment shader

#version 330 core
out vec4 FragColor;

in vec4 vertexColor; // the input variable from the vertex shader (same name and same type)

void main()
{
    FragColor = computeColorOfThisPixel(...);
}

OpenGl user program

int main(int argc, char ** argv)
{
  // init window, detect user input and do corresponding animation by calling opengl api
  ...
}

The last main() is obviously programmed by the user. Let's explain what the first two main() functions do. As you know, OpenGL is a set of APIs that let the programmer display 3D objects on a 2D computer screen, as explained in books on computer graphics concepts. A 3D graphics pipeline first lets the user set the lights and object textures, secondly calculates the position of each vertex, thirdly computes the color of each pixel automatically via the 3D software and GPU, and finally displays the color of each pixel on the computer screen. In order to let the user/programmer add special effects or decorations to the coordinates of each vertex or the color of each pixel, OpenGL provides these two shader functions. OpenGL uses the term fragment shader instead of pixel shader because: “Fragment shaders are a more accurate name for the same functionality as Pixel shaders. They aren't pixels yet, since the output still has to pass several tests (depth, alpha, stencil) as well as the fact that one may be using antialiasing, which renders one-fragment-to-one-pixel non-true” [44]. The programmer is allowed to add their own converting functions, which the compiler translates into GPU instructions running on the GPU processor. With these two shaders, new features have been added to allow for increased flexibility in the rendering pipeline at the vertex and fragment level [45]. Unlike the shader examples here [46], some converting functions for coordinates in the vertex shader or for colors in the fragment shader are more complicated, depending on the scenes of the animation. Here is an example [47]. In the Wikipedia shading page [3], the Gouraud and Phong shading methods make the surface of an object smoother via GLSL. Example GLSL code for Gouraud and Phong shading on the OpenGL API is here [48]. Since the graphics card hardware and the graphics driver software can be replaced, the compiler runs on-line, meaning the driver compiles the shader programs the first time they run and keeps the result in a cache after compilation [49].

The shader programs use C-like syntax and can be compiled in a few milliseconds; adding these few milliseconds of on-line compilation time when running an OpenGL program is a good trade-off for handling the cases of driver software or GPU hardware replacement [50].
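For reference, this on-line compilation happens behind standard OpenGL calls like the following C++ sketch; the driver's GLSL compiler runs inside glCompileShader and glLinkProgram (a minimal sketch without error-log handling; vertSrc and fragSrc are assumed to hold the GLSL source strings).

// A minimal sketch: the driver compiles GLSL on-line inside these calls.
// vertSrc/fragSrc are assumed to be null-terminated GLSL source strings.
GLuint buildProgram(const char *vertSrc, const char *fragSrc) {
  GLuint vs = glCreateShader(GL_VERTEX_SHADER);
  glShaderSource(vs, 1, &vertSrc, NULL);
  glCompileShader(vs);                       // driver's on-line compile

  GLuint fs = glCreateShader(GL_FRAGMENT_SHADER);
  glShaderSource(fs, 1, &fragSrc, NULL);
  glCompileShader(fs);

  GLuint prog = glCreateProgram();
  glAttachShader(prog, vs);
  glAttachShader(prog, fs);
  glLinkProgram(prog);                       // on-line binding (link) step
  return prog;
}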

OpenGL Shader compiler

The OpenGL standard is here [51]. OpenGL is for desktop computers or servers, while OpenGL ES is for embedded systems [52]. Though shaders are only a small part of the whole OpenGL software/hardware system, it is still a large effort to finish the compiler implementation, since there are lots of APIs that need to be implemented; for example, there are 80 texture-related APIs [53]. This implementation can be done by generating LLVM extended intrinsic functions from the shader parser of the frontend compiler, with the LLVM backend converting those intrinsics into GPU instructions, as follows,

Fragment shader

#version 320 es
uniform sampler2D x;
out vec4 FragColor;

void main()
{
    FragColor = texture(x, uv_2d, bias);
}

llvm-ir

...
!1 = !{!"sampler_2d"}
!2 = !{i32 SAMPLER_2D} ; SAMPLER_2D is integer value for sampler2D, for example: 0x0f02
; A named metadata.
!x_meta = !{!1, !2}

define void @main() #0 {
    ...
    %1 = @llvm.gpu0.texture(metadata !x_meta, %1, %2, %3); ; %1: %sampler_2d, %2: %uv_2d, %3: %bias
    ...
}

asm of gpu

...
// gpu machine code
load $1, tex_a;
sample2d_inst $1, $2, $3 // $1: tex_a, $2: %uv_2d, $3: %bias

.tex_a // Driver sets the index of gpu descriptor registers here

As at the bottom of the code above, the .tex_a memory address includes the Texture Object which is bound by the driver at on-line compilation/linking time. By binding the Texture Object (SW) and Texture Unit (HW) through the OpenGL API, the GPU uses the Texture Unit hardware resources efficiently. This is explained in the following.

About LLVM intrinsic extended functions, please refer to this book here [54].

gvec4 texture(gsampler2D sampler, vec2 P, [float bias]);

The GPU provides ‘Texture Units’ to speed up the fragment shader. However, the ‘Texture Unit’ hardware is an expensive resource and there are only a few of them in a GPU. The driver can associate a ‘Texture Unit’ with a sampler variable through the OpenGL API and switch between shaders, as in the following statements.

_images/sampling_diagram.png

Fig. 68 Relationships between the texturing concept [55].

Fig. 68 is shown above. The texture object is not bound directly into the shader (where the actual sampling takes place). Instead, it is bound to a ‘texture unit’ whose index is passed to the shader. So the shader reaches the texture object by going through the texture unit. There are usually multiple texture units available, and the exact number depends on the capability of your graphics card [55]. A texture unit, also called a texture mapping unit (TMU) or a texture processing unit (TPU), is a hardware component in a GPU that does the sampling operation. The argument sampler in the texture function above is the sampler_2d index of the ‘texture unit’ for the texture object [55].

‘sampler uniform variable’:

There is a group of special uniform variables for that, according to the texture target: ‘sampler1D’, ‘sampler2D’, ‘sampler3D’, ‘samplerCube’, etc. You can create as many ‘sampler uniform variables’ as you want and assign the value of a texture unit to each one from the application. Whenever you call a sampling function on a ‘sampler uniform variable’ the corresponding texture unit (and texture object) will be used [55].

_images/sampling_diagram_binding.png

Fig. 69 Binding sampler variables [56].

As Fig. 69 shows, the Java API gl.bindTexture binds a ‘Texture Object’ to a ‘Texture Unit’. gl.getUniformLocation and gl.uniform1i associate the ‘Texture Unit’ with the ‘sampler uniform variable’.

gl.uniform1i(xLoc, 1): where 1 means ‘Texture Unit 1’, 2 means ‘Texture Unit 2’, …, etc. [56].
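The equivalent calls in C/C++ OpenGL look roughly like the following sketch; texId is assumed to be a texture object already created with glGenTextures, and prog a linked GLSL program containing "uniform sampler2D x".

// Bind the texture object texId to Texture Unit 1, then tell the
// sampler uniform "x" in program prog to sample from unit 1.
glActiveTexture(GL_TEXTURE1);            // select Texture Unit 1
glBindTexture(GL_TEXTURE_2D, texId);     // bind Texture Object to that unit
GLint xLoc = glGetUniformLocation(prog, "x");
glUseProgram(prog);
glUniform1i(xLoc, 1);                    // sampler "x" reads from unit 1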

The following figure depicts how the driver reads metadata from the compiled GLSL object, how the OpenGL API associates the ‘sampler variable’, and how the GPU executes the texture instruction.

_images/driverSamplerTable.png

Fig. 70 Associating sampler variables and the GPU executing the texture instruction

The detailed steps for the figure above are explained as follows.

1. In order to let the driver bind the ‘texture unit’, the frontend compiler must pass the metadata of the ‘sampler uniform variable’ (sampler_2d_var in this example) [59] to the backend, and the backend must allocate the metadata of the ‘sampler uniform variable’ in the compiled binary file [57].

2. After the GPU driver performs the GLSL on-line compilation, the driver reads this metadata from the compiled binary file and maintains a table of {name, type, location} for each ‘sampler uniform variable’ (see the sketch after these steps). The driver also fills this information into the Texture Descriptor in the GPU's memory.

3. The API,

xLoc = gl.getUniformLocation(prog, "x"); // prog: glsl program, xLoc

will get the location from the table for the ‘sampler uniform variable’ x that the driver created.

SAMPLER_2D is the integer value for the sampler2D type.

4. The API,

gl.uniform1i( xLoc, 1 );

will bind xLoc of the ‘sampler uniform variable’ x to ‘Texture Unit 1’ by writing 1 into the GLSL binary metadata location of the ‘sampler uniform variable’ x as follows,

{xLoc, 1} : 1 is 'Texture Unit 1', xLoc is the location(memory address) of 'sampler uniform variable' x

This API sets the texture descriptors in the GPU with this {xLoc, 1} information. Next, the driver sets the index or memory address of the GPU texture descriptor into the variable at memory address .tex_a. For example, as in the diagram, the driver sets k into .tex_a.

// gpu machine code
load $1, tex_a;
sample2d_inst $1, $2, $3 // $1: tex_a, $2: %uv_2d, $3: %bias

.tex_a // Driver sets the index of gpu descriptor registers here at step 4

5. When the texture instructions from the GLSL binary file are executed on the GPU, the corresponding ‘Texture Unit 1’ on the GPU is used through the texture descriptor in the GPU's memory, because .tex_a: {xLoc, 1}. The driver may set the texture descriptor in the GPU's texture descriptor registers if the GPU provides specific texture descriptors in its architecture [60].
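A rough C++ sketch of the driver-side table from step 2 might look like the following; the struct and map names are hypothetical, only to illustrate the {name, type, location} bookkeeping.

#include <map>
#include <string>

// Hypothetical driver-side record for one 'sampler uniform variable',
// filled from the metadata of the compiled GLSL binary (step 2).
struct SamplerRecord {
  int type;       // e.g. SAMPLER_2D
  int location;   // location returned by glGetUniformLocation()
  int unit;       // texture unit written later by glUniform1i() (step 4)
};

// One table per linked GLSL program, keyed by the variable name, e.g. "x".
using SamplerTable = std::map<std::string, SamplerRecord>;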

For instance, an Nvidia texture instruction is as follows,

// the content of tex_a bound to texture unit as step 5 above
tex.3d.v4.s32.s32  {r1,r2,r3,r4}, [tex_a, {f1,f2,f3,f4}];

.tex_a

The content of tex_a is bound to the texture unit set by the driver at the end of step 4. The pixel coordinates (x,y,z) are given by the user input (f1,f2,f3). The f4 is skipped for a 3D texture.

The tex.3d texture instruction above loads the calculated color of pixel (x,y,z) from the texture image into the GPRs (r1,r2,r3,r4)=(R,G,B,A). The fragment shader can then re-calculate the color of this pixel using the pixel's color from the texture image [58].

For a 1D texture instruction, tex.1d is as follows,

tex.1d.v4.s32.f32  {r1,r2,r3,r4}, [tex_a, {f1}];

Since the ‘Texture Unit’ is a limited hardware accelerator on the GPU, OpenGL provides an API for the user program to bind ‘Texture Units’ to ‘sampler variables’. As a result, the user program can do load balancing in the use of ‘Texture Units’ through the OpenGL API without recompiling GLSL. Fast texture sampling is one of the key requirements for good GPU performance [56].

In addition to the API for binding textures, OpenGL provides the glTexParameteri API to do texture wrapping [61]. Furthermore, the texture instruction of some GPUs may include S# T# values in the operands. As with associating ‘sampler variables’ to ‘Texture Units’, S# and T# are memory locations associated with the texture wrapping descriptor registers, allowing the user program to change the wrapping option without recompiling GLSL.
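For example, the wrapping mode for the S and T texture coordinates can be changed from the application side with calls like the following (a small sketch; the GL_REPEAT and GL_CLAMP_TO_EDGE choices are just examples).

// Select the wrapping option per axis on the currently bound 2D texture;
// no GLSL recompilation is needed when these parameters change.
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_REPEAT);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);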

Even though the GLSL frontend compiler always expands function calls into inline functions, and LLVM intrinsic extended functions provide an easy way to do code generation through the LLVM td (Target Description) files, a GPU backend compiler is still a little more complex than a CPU backend. (But when considering the effort in the frontend compiler such as clang, or other toolchain components such as the linker and gdb/lldb, the CPU compiler is of course not easier than the GPU compiler.)

Here is the software stack of the 3D graphics system for OpenGL on Linux [62], and the Mesa open source website is here [63].

Architecture

SIMT

Single instruction, multiple threads (SIMT) is an execution model used in parallel computing where single instruction, multiple data (SIMD) is combined with multithreading [64].

The leading GPU architecture from Nvidia is shown in the following figures.

_images/grid.png

Fig. 71 core(grid) in Nvidia gpu (figure from book [65])

_images/SIMD-processors.png

Fig. 72 SIMD processors (figure from book [66])

_images/threads-lanes.png

Fig. 73 threads and lanes in gpu (figure from book [67])

_images/memory.png

Fig. 74 core(grid) in Nvidia’s gpu (figure from book [68])

Buffers

In addition to the texture unit and texture instructions, the GPU provides different buffers to speed up the OpenGL rendering pipeline [37].

  • Color buffer

    They contain the RGB or sRGB color data and may also contain alpha values for each pixel in the framebuffer. There may be multiple color buffers in a framebuffer. You’ve already used double buffering for animation. Double buffering is done by making the main color buffer have two parts: a front buffer that’s displayed in your window; and a back buffer, which is where you render the new image [69].

  • Depth buffer (Z buffer)

    Depth is measured in terms of distance to the eye, so pixels with larger depth-buffer values are overwritten by pixels with smaller values [70] [72] [73].

  • Stencil Buffer

    In the simplest case, the stencil buffer is used to limit the area of rendering (stenciling) [74] [73].

  • Frame Buffer

    OpenGL offers: the color, depth and stencil buffers. This combination of buffers is known as the default framebuffer and as you’ve seen, a framebuffer is an area in memory that can be rendered to [75].
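As a small illustration of how an application selects and clears these buffers through the OpenGL API (a sketch with commonly used calls; the clear values are arbitrary):

// Enable depth and stencil testing, then clear the color, depth and
// stencil buffers of the default framebuffer at the start of a frame.
glEnable(GL_DEPTH_TEST);
glEnable(GL_STENCIL_TEST);
glClearColor(0.0f, 0.0f, 0.0f, 1.0f);
glClearDepth(1.0);   // farthest depth value
glClearStencil(0);
glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT | GL_STENCIL_BUFFER_BIT);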

General purpose GPU

Since GLSL shaders provide a general way of writing C-like code in them, by applying a software framework instead of the OpenGL API, the system can run some data-parallel computation on the GPU for speedup and even have the CPU and GPU executing simultaneously. Furthermore, any language that allows the code running on the CPU to poll a GPU shader for return values can create a GPGPU framework [76].

_images/gpu-terms.png

Fig. 75 Terms in Nvidia’s gpu (figure from book [77])

Table 44 More descriptive names for Cuda terms in a Fermi GPU, with structure and description.

Grid (Cuda term: Grid)
  Structure: A Grid is a vectorizable loop, as Fig. 75.

SIMD Processor / SIMD Block / Core (Cuda term: Cuda Thread Engine)
  Structure: Each Grid has 16 SIMD Processors.
  Description: Each multithreaded SIMD Processor is assigned 512 elements of the vectors to work on. SIMD Processors are full processors with separate PCs and are programmed using threads [79]. As Fig. 72 shows, 16 Thread Blocks are assigned to 16 SIMD Processors. A CPU core is a processor which includes multiple threads, and a CPU thread is an execution unit with its own PC (Program Counter). By the same concept, a GPU core is the SIMD Processor, which includes several SIMD Threads (Warps), and each Warp has its own PC [80].

SIMD Thread (Cuda term: Warp)
  Structure: Each Warp has 16 Cuda Threads.
  Description: A Warp has its own PC and TLRs (Thread Level Registers). A Warp may map to a whole function or to part of a function. The compiler and runtime may assign them to the same Warp or to different Warps [81].

SIMD Lane (Cuda term: Cuda Thread)
  Structure: Each SIMD Thread has 16 lanes.
  Description: A vertical cut of a thread of SIMD instructions, corresponding to one element executed by one SIMD Lane. It is a vector instruction processing 16 elements.

Chime (Cuda term: Chime)
  Structure: Each SIMD Lane has 2 chimes.
  Description: If the vector length is 32 (32 elements) and there are 16 SIMD Lanes, the chime is 2 clock cycles, also known as “ping pong” cycles. As Fig. 71 shows for the later Fermi-generation GPUs, each SIMD Thread (Warp) has 32 elements and runs, as in Fig. 73, on 16 SIMD Lanes (the number of functional units, just as in a vector processor), so it takes 2 clock cycles (a chime of 2 clock cycles) to complete [82].

Mapping data in GPU

A GPU may have the following hardware structure and handle the subset of the y[]=a*x[]+y[] array calculation as follows,

// Invoke DAXPY with 256 threads per Thread Block
__host__
int nblocks = (n+ 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
// DAXPY in CUDA
__device__
void daxpy(int n, double a, double *x, double *y) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}
shl.u32 R8, blockIdx, 9   ; Thread Block ID * Block size (512 or 2^9)
add.u32 R8, R8, threadIdx ; R8 = i = my CUDA Thread ID
shl.u32 R8, R8, 3         ; byte offset
ld.global.f64 RD0, [X+R8] ; RD0 = X[i]
ld.global.f64 RD2, [Y+R8] ; RD2 = Y[i]
mul.f64 RD0, RD0, RD4     ; Product in RD0 = RD0 * RD4 (scalar a)
add.f64 RD0, RD0, RD2     ; Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64 [Y+R8], RD0 ; Y[i] = sum (X[i]*a + Y[i])

The following table explains how the elements of saxpy() map to the lanes of SIMD Threads (Warps) of Thread Blocks (Cores) of the Grid.

Table 45 Mapping saxpy code to Fig. 71.

saxpy()       Instance in Fig. 71                             Description
blockDim.x    The number of Cuda Threads per Thread Block     blockDim: in this example, configured as Fig. 71, is 16 (Thread Blocks) * 16 (SIMD Threads) = 256
blockIdx.x    The index of the Thread Block                   blockIdx: the index of the Thread Block within the Grid
threadIdx.x   The index of the Cuda Thread (element)          threadIdx: the index of the Cuda Thread within its Thread Block

  • With Fermi, each 32-wide thread of SIMD instructions is mapped to 16 physical SIMD Lanes, so each SIMD instruction in a thread of SIMD instructions takes two clock cycles to complete.

  • You could say that it has 16 lanes, the vector length would be 32, and the chime is 2 clock cycles.

  • The mapping of y[0..31] = a * x[0..31] + y[0..31] to <Core, Warp, Cuda Thread> of the GPU is shown in the following table. x[0..31] map to 32 Cuda Threads; two Cuda Threads map to one SIMD Lane.

Table 46 Map (Core, Warp) to saxpy

Core-0,  Warp-0:  y[0..31] = a * x[0..31] + y[0..31]
Core-0,  Warp-1:  y[32..63] = a * x[32..63] + y[32..63]
...
Core-0,  Warp-15: y[480..511] = a * x[480..511] + y[480..511]
...
Core-15, Warp-0:  y[7680..7711] = a * ...
...
Core-15, Warp-15: y[8160..8191] = a * x[8160..8191] + y[8160..8191]

  • If a SIMD Lane (Cuda Thread) handles the computation of 2 elements, assuming 4 registers per element, then 4*32=128 Thread Level Registers (TLR) are occupied in a SIMD Thread (Warp) to support the SIMT computing. So, assuming a GPU architecture allocates 256 TLR to a SIMD Thread (Warp), it has sufficient TLR for a more complicated statement, such as a*X[i]+b*Y[i]+c*Z[i], without spilling in register allocation. All 16 lanes share the 256 TLR. Each Thread Block (Core) has 16 SIMD Threads (Warps), so there are 16*256 = 4K TLR in a SIMD Processor (Core, Cuda Thread Engine). A sketch of this index mapping is given after this list.

  • When mapping to the fragments/pixels of a graphics GPU, x[0..15] corresponds to a two-dimensional tile of fragments/pixels at pixel[0..3][0..3], since images are grouped in tiles to keep close colors together.
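Under the configuration above (16 cores, 16 warps of 32 elements per core, 16 lanes, 2 chimes), the element-to-hardware mapping of Table 46 can be written as a small C++ sketch; the function is only illustrative and not part of CUDA, and the element-to-chime ordering is an assumption.

#include <cstdio>

// For element index i of y[i] = a*x[i] + y[i], compute which Core (SIMD
// Processor), Warp (SIMD Thread), Lane and chime handle it, assuming the
// Fermi-like configuration in the text: 16 cores * 16 warps * 32 elements.
void mapElement(int i) {
  int core  = i / 512;          // each core handles 512 consecutive elements
  int warp  = (i % 512) / 32;   // each warp handles 32 consecutive elements
  int lane  = (i % 32) % 16;    // 32 elements share 16 lanes ...
  int chime = (i % 32) / 16;    // ... over 2 "ping pong" clock cycles
  std::printf("y[%d]: Core-%d, Warp-%d, Lane-%d, chime-%d\n",
              i, core, warp, lane, chime);
}

int main() {
  mapElement(0);      // Core-0,  Warp-0
  mapElement(45);     // Core-0,  Warp-1
  mapElement(8191);   // Core-15, Warp-15
  return 0;
}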

Work between CPU and GPU in Cuda

The daxpy() GPU code above did not show the host (CPU) side code that triggers the GPU function. The following is the host (CPU) side of a CUDA example calling saxpy on the GPU [78],

for(i=0;i<64; i=i+1)
  if (X[i] != 0)
    X[i] = X[i] - Y[i];
__global__
void saxpy(int n, float a, float * x, float * y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

int main(void)
{
  ...
  cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);
  ...
  saxpy<<<(N+255)/256, 256>>>(N, 2.0, d_x, d_y);
  ...
  cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);
  ...
}

The main() runs on the CPU while saxpy() runs on the GPU. The CPU copies the data from x and y to the corresponding device arrays d_x and d_y using cudaMemcpy. The saxpy kernel is launched by the statement: saxpy<<<(N+255)/256, 256>>>(N, 2.0, d_x, d_y);. Through cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost, the CPU passes the data in the x and y arrays to the GPU and gets the result back into the y array. Since both of these memory transfers trigger DMA without CPU operation, it may speed things up by letting the CPU and GPU run with their data in their own caches respectively. After the DMA memcpy from CPU memory to GPU memory, the GPU executes the whole loop of the array operation "y[] = a*x[]+y[];" with one Grid. Furthermore, like a vector processor, the GPU provides Vector Mask Registers to handle IF statements in vector loops, such as the loop "if (X[i] != 0) X[i] = X[i] - Y[i];" shown at the top of the previous code block, as in the following code [84],

LV V1,Rx         ;load vector X into V1
LV V2,Ry         ;load vector Y
L.D F0,#0        ;load FP zero into F0
SNEVS.D V1,F0    ;sets VM(i) to 1 if V1(i)!=F0
SUBVV.D V1,V1,V2 ;subtract under vector mask
SV V1,Rx         ;store the result in X

GPUs pursue throughput for SIMD applications and can hide cache-miss latency through SMT. As a result, a GPU may not have L2 and L3 caches per core like a CPU does, since the GPU is a highly latency-tolerant multithreading design for data-parallel applications [85]. The DMA memcpy maps the data in CPU memory to the L1 cache of each core in GPU memory. Many GPUs provide scatter and gather operations to access DRAM data for stream processing [86] [76] [87].

When the GPU function is dense computation on arrays, such as an MPEG4 encoder or deep learning weight tuning, it may get a large speedup [88]. However, when the GPU function is a simple matrix addition and the CPU idles waiting for the GPU's result, it may be slower than doing the matrix addition on the CPU only. Arithmetic intensity is defined as the number of operations performed per word of memory transferred. It is important for GPGPU applications to have high arithmetic intensity, otherwise the memory access latency will limit the computational speedup [76].
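For example, for the saxpy kernel above, each element requires one multiply and one add but transfers three words (load x[i], load y[i], store y[i]), so its arithmetic intensity is low:

\[\text{arithmetic intensity} = \frac{\text{operations}}{\text{words transferred}} = \frac{2}{3} \approx 0.67 \ \text{ops/word},\]

which is why a memory-bound kernel like saxpy gains little from the GPU compared with compute-dense workloads.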

The Wikipedia page here [89] lists applications sped up by the GPU as follows:

General-purpose computing on GPUs has found its way into fields as diverse as machine learning, oil exploration, scientific image processing, linear algebra, statistics, 3D reconstruction and even stock options pricing determination. In addition, the section “GPU accelerated video decoding and encoding” for video compression [89] gives more applications of GPU acceleration.

Table 47 The differences for speedup in architecture of CPU and GPU

Item          CPU                                  GPU
Application   Non-data parallel                    Data parallel
Architecture  SISD, small vector (e.g. 4*32 bits)  Large SIMD (e.g. 16*32 bits)
Cache         Smaller and faster                   Larger and slower (ref. the following Note)
ILP           Pipeline, Superscalar, SMT           Pipeline, SIMT, Super-pipeline
Core          Smaller threads for SMT (2 or 4)     Larger threads (16 or 32)
Branch        Conditional instructions             Mask & conditional instructions

Note

GPU-Cache

In theory, for data-parallel applications, the GPU's SMT can schedule more threads and pursues throughput rather than the speedup of one single thread as in the SISD CPU. In reality, however, the GPU provides a small L1 cache like the CPU's and hides cache misses by scheduling another thread. So a GPU may have no L2 and L3 caches, while a CPU has deep levels of caches.

Volta (Cuda thread/SIMD lane with PC, Program Couner and Call Stack)

Todo: According to the assembly of daxpy(), this is not useful. Why does this exist?

One way the compiler handles this is by keeping the execution of instructions in order: if some threads don't have to execute certain instructions, the hardware switches those threads off, and later turns them on for their relevant instructions while switching the other threads off. This process is called masking.

_images/pre-volta-1.png

Fig. 76 SIMT Warp Execution Model of Pascal and Earlier GPUs [83]

_images/volta-1.png

Fig. 77 Volta Warp with Per-Thread Program Counter and Call Stack [83]

_images/volta-2.png

Fig. 78 Programs use Explicit Synchronization to Reconverge Threads in a Warp [83]

  • Since the Volta GPU from Nvidia, each thread in a Warp has its own PC, as in Fig. 77. In CUDA applications, this feature provides more parallelism opportunities with __syncwarp() for user programmers, as in Fig. 78.

Vulkan and spir-v

_images/opencl-to-spirv-offine-compilation.png

Fig. 79 Offline Compilation of OpenCL Kernels into SPIR-V Using Open Source Tooling [91]

digraph G {
  rankdir=LR;

  node [shape=record];

  glslang [style=filled,fillcolor=green];
  spirv_cross [label="spirv-cross",style=filled,fillcolor=green];
  glsl -> glslang -> spirv;
  glsl -> spirv_cross -> spirv [dir="back"];
}

Fig. 80 Conversion between glsl and spirv

  • Comparison of OpenCL and OpenGL's compute shader.

    • Same:

      Both are for general-purpose computing on the GPU.

    • Difference:

      OpenCL covers GPUs and other accelerator devices/processors. OpenCL is a C language on the device and C++ on the host, based on the OpenCL runtime. A compute shader is a GLSL shader that runs in the OpenGL graphics environment and integrates with and accesses the data of the OpenGL API easily [90].

  • OpenGL/GLSL vs Vulkan/spir-v.

    • High level of API and shader: OpenGL, GLSL.

    • Low level of API and shader: Vulkan, spir-v.

Though the OpenGL API exists at a higher level with many advantages from the sections above, sometimes it cannot compete in efficiency with Direct3D, which provides lower-level APIs that let the user program manage memory [92]. The Vulkan API is a lower-level C/C++ API that fills this gap, allowing user programs to do these things in place of OpenGL and compete against Microsoft Direct3D. Here is an example [93]. Meanwhile, GLSL is a C-like language, and the Vulkan infrastructure provides a tool, glslangValidator [94], to compile GLSL into an Intermediate Representation (IR) called spir-v off-line. As a result, it saves part of the on-line compilation time from GLSL to GPU instructions, since spir-v is an IR at a level close to LLVM IR [95]. In addition, the Vulkan API reduces the GPU driver's effort in optimization and code generation [92]. These standards give user programmers the option of using Vulkan/spir-v instead of OpenGL/GLSL, and allow them to pre-compile GLSL into spir-v off-line to save part of the on-line compilation time.

With the Vulkan and spir-v standards, the GPU can also be used through OpenCL for parallel programming of heterogeneous systems [96] [97]. Similar to Cuda, an OpenCL example for the fast Fourier transform (FFT) is here [98]. Once OpenCL grows into a popular standard, with more computer languages or frameworks supporting it, the GPU will take over more jobs from the CPU [99].

Now you can see LLVM IR expanding from CPU to GPU and becoming more and more influential; in fact, LLVM IR has kept expanding from version 3.1 until now.

Accelerate ML/DL on OpenCL/SYCL

_images/opencl_ml_graph.png

Fig. 81 Implement ML graph scheduler both on compiler and runtime

As the figure above shows, a Device that is a GPU or CPU+NPU is able to run the whole ML graph. However, if the Device has an NPU only, then CPU operations such as Avg-Pool have to run on the Host side, which adds communication cost between Host and Device.

Like OpenGL's shaders, the “kernel” functions may be compiled on-line or off-line and sent to the GPU as programmable functions.

In order to run ML (Machine Learning) efficiently, all platforms for ML on GPU/NPU implement scheduling software both in the graph compiler and in the runtime. If OpenCL can be extended to support ML graphs, then graph compilers such as TVM or runtimes from open source have a chance to leverage programmers' effort on this scheduling software [100]. Cuda graph is an idea along these lines [101] [102].

Open Sources