Monday, November 28, 2016

Wanted: GPU rendering developers

I'm working for an international company with very large (<Trump voice>"YUUUUUGE"</Trump voice>) industry partners.

We are currently looking for excellent developers with experience in GPU rendering for a new project.

Our ideal candidates have either a:
  • Bachelor in Computer Science, Electrical/Software Engineering or Physics with a minimum of 2 years of work experience in a relevant field, or
  • Master in Computer Science, Electrical/Software Engineering or Physics, or
  • PhD in a relevant field
with a strong interest in physically based rendering and ray tracing.


Self-taught programmers are encouraged to apply if they meet the following requirements:
  • you breathe rendering and have Monte Carlo simulations running through your blood
  • you have a copy of PBRT (www.pbrt.org, version 3 was released just last week) on your bedside table
  • demonstrable experience working with open source rendering frameworks such as LuxRender, Cycles or AMD RadeonRays, or with a commercial renderer, will earn you extra brownie points
  • 5+ years of experience with C++
  • experience with CUDA or OpenCL
  • experience with version control systems and working on large projects
  • proven rendering track record (publications, Github projects, blog)

Other requirements:
  • insatiable hunger to innovate
  • a "can do" attitude
  • strong work ethic and focus on results
  • continuous self-learner
  • work well in a team
  • work independently and able to take direction
  • ability to communicate effectively
  • comfortable speaking English
  • own initiatives and original ideas are highly encouraged
  • willing to relocate to New Zealand

What we offer:
  • unique location in one of the most beautiful and greenest countries in the world
  • be part of a small, high-performance team 
  • competitive salary
  • jandals, marmite and hokey pokey ice cream

For more information, contact me at sam.lapere@live.be

If you are interested, send your CV and cover letter to sam.lapere@live.be. Applications will close on 16 December or when we find the right people. (update: spots are filling up quickly so we advanced the closing date by five days)

Monday, November 14, 2016

OpenCL path tracing tutorial 2: path tracing spheres

This tutorial consists of two parts: the first part will describe how to ray trace one sphere using OpenCL, while the second part covers path tracing of a scene made of spheres. The tutorial will be light on ray tracing/path tracing theory (there are plenty of excellent resources available online such as Scratch-a-Pixel) and will focus instead on the practical implementation of rendering algorithms in OpenCL. The end result will be a rendered image featuring realistic light effects such as indirect lighting, diffuse colour bleeding and soft shadows, all achieved with just a few lines of code:



Part 1: Ray tracing a sphere

Computing a test image on the OpenCL device

The host (CPU) sets up the OpenCL environment and launches the OpenCL kernel which will be executed on the OpenCL device (GPU or CPU) in parallel. Each work item (or thread) on the device will calculate one pixel of the image. There will thus be as many work items in the global pool as there are pixels in the image. Each work item has a unique ID which distinguishes it from all other work items in the global pool of threads and which is obtained with get_global_id(0).

The X- and Y-coordinates of each pixel can be computed by using that pixel's unique work item ID:
  • x-coordinate: take the remainder of dividing the ID by the image width (id % width)
  • y-coordinate: divide the ID by the image width (integer division, id / width)
By remapping the x and y coordinates from the [0 to width] range for x and [0 to height] range for y to the range [0 - 1] for both, and plugging those values into the red and green channels respectively, we get the following gradient image (the image is saved in PPM format which can be opened with e.g. IrfanView or Gimp):
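The kernel below computes the pixel colours on the device; the tutorial saves them to a PPM file on the host. As an aside, a minimal host-side C++ sketch of such a writer (not part of the tutorial code; function name and structure are our own) could look like this, using the plain-text "P3" variant of the format:

```cpp
#include <fstream>
#include <algorithm>

// Minimal sketch: write the same red/green gradient to a plain-text PPM file.
// Pixel (x, y) maps to red = x/width and green = y/height, as in the kernel.
void write_gradient_ppm(const char* filename, int width, int height)
{
    std::ofstream out(filename);
    out << "P3\n" << width << " " << height << "\n255\n"; // PPM header: magic, size, max value
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            int r = std::min(255, (int)(255.0f * x / width));
            int g = std::min(255, (int)(255.0f * y / height));
            out << r << " " << g << " 0\n"; // blue channel stays 0
        }
    }
}
```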


The OpenCL code to generate this image:


__kernel void render_kernel(__global float3* output, int width, int height)
{
 const int work_item_id = get_global_id(0); /* the unique global id of the work item for the current pixel */
 int x = work_item_id % width; /* x-coordinate of the pixel */
 int y = work_item_id / width; /* y-coordinate of the pixel */
 float fx = (float)x / (float)width; /* convert int to float in range [0-1] */
 float fy = (float)y / (float)height; /* convert int to float in range [0-1] */
 output[work_item_id] = (float3)(fx, fy, 0); /* simple interpolated colour gradient based on pixel coordinates */
}

Now let's use the OpenCL device for some ray tracing.


Ray tracing a sphere with OpenCL

We first define a Ray and a Sphere struct in the OpenCL code:

A Ray has 
  • an origin in 3D space (3 floats for x, y, z coordinates) 
  • a direction in 3D space (3 floats for the x, y, z coordinates of the 3D vector)
A Sphere has 
  • a radius
  • a position in 3D space (3 floats for x, y, z coordinates), 
  • an object colour (3 floats for the Red, Green and Blue channel) 
  • an emission colour (again 3 floats for each of the RGB channels)

struct Ray{
 float3 origin;
 float3 dir;
};

struct Sphere{
 float radius;
 float3 pos;
 float3 emi;
 float3 color;
};

Camera ray generation

Rays are shot from the camera (which is in a fixed position for this tutorial) through an imaginary grid of pixels into the scene, where they intersect with 3D objects (in this case spheres). For each pixel in the image, we will generate one camera ray (also called primary rays, view rays or eye rays) and follow or trace it into the scene. For camera rays, the ray origin is the camera position and the ray direction is the vector connecting the camera and the pixel on the screen.

Source: Wikipedia


The OpenCL code for generating a camera ray:


struct Ray createCamRay(const int x_coord, const int y_coord, const int width, const int height){

 float fx = (float)x_coord / (float)width;  /* convert int in range [0 - width] to float in range [0-1] */
 float fy = (float)y_coord / (float)height; /* convert int in range [0 - height] to float in range [0-1] */

 /* calculate aspect ratio */
 float aspect_ratio = (float)(width) / (float)(height);
 float fx2 = (fx - 0.5f) * aspect_ratio;
 float fy2 = fy - 0.5f;

 /* determine position of pixel on screen */
 float3 pixel_pos = (float3)(fx2, -fy2, 0.0f);

 /* create camera ray*/
 struct Ray ray;
 ray.origin = (float3)(0.0f, 0.0f, 40.0f); /* fixed camera position */
 ray.dir = normalize(pixel_pos - ray.origin);

 return ray;
}




Ray-sphere intersection

To find the intersection of a ray with a sphere, we need the parametric equation of a line, which denotes the distance from the ray origin to the intersection point along the ray direction with the parameter "t"

intersection point = ray origin + ray direction * t

The equation of a sphere follows from the Pythagorean theorem in 3D (all points on the surface of a sphere are located at a distance of radius r from its center): 

(sphere surface point − sphere center)² = radius²

In the case of a sphere centered at the origin (with coordinates [0,0,0]), the vector [sphere surface point - sphere center] reduces to the coordinates of a point on the sphere's surface (the intersection point). Combining both equations then gives

(ray origin + ray direction * t)² = radius²

Expanding this equation into a quadratic equation of the form ax² + bx + c = 0, where
  • a = (ray direction) . (ray direction)
  • b = 2 * (ray direction) . (ray origin to sphere center)
  • c = (ray origin to sphere center) . (ray origin to sphere center) - radius²
yields solutions for t (the distance to the point where the ray intersects the sphere) given by the quadratic formula t = (−b ± √(b² − 4ac)) / 2a (where b² − 4ac is called the discriminant).

Depending on whether the discriminant is negative, zero or positive, there can be zero (ray misses sphere), one (ray just touches the sphere at one point) or two solutions (ray fully intersects the sphere at two points) respectively. The distance t can be positive (intersection in front of ray origin) or negative (intersection behind ray origin). The details of the mathematical derivation are explained in this Scratch-a-Pixel article.

The ray-sphere intersection algorithm is optimised by omitting the "a" coefficient in the quadratic formula, because its value is the dot product of the normalised ray direction with itself, which equals 1. Taking the square root of the discriminant (an expensive function) needs to be performed only when the discriminant is non-negative.
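To see that dropping the "a" coefficient is valid for a normalised ray direction, the same intersection can be solved both ways and compared. The sketch below (plain C++, names and the Vec3 helper are ours, not the tutorial's) tests a ray from the origin along +z against a unit sphere centered at (0, 0, 5), where the nearest hit is at t = 4:

```cpp
#include <cmath>

// Minimal 3D vector helper for this standalone check.
struct Vec3 { float x, y, z; };
static float dot3(Vec3 a, Vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

// Full quadratic: a = d.d, b = -2 d.(c-o), c = (c-o).(c-o) - r^2
float t_full(Vec3 o, Vec3 d, Vec3 c, float r)
{
    Vec3 oc = { c.x - o.x, c.y - o.y, c.z - o.z };
    float a = dot3(d, d);
    float b = -2.0f * dot3(d, oc);
    float cc = dot3(oc, oc) - r * r;
    float disc = b * b - 4.0f * a * cc;
    return (-b - std::sqrt(disc)) / (2.0f * a); // nearest root
}

// Half-b form used in the tutorial's kernel: b = d.(c-o), disc = b*b - c
float t_half(Vec3 o, Vec3 d, Vec3 c, float r)
{
    Vec3 oc = { c.x - o.x, c.y - o.y, c.z - o.z };
    float b = dot3(oc, d);
    float cc = dot3(oc, oc) - r * r;
    float disc = b * b - cc;
    return b - std::sqrt(disc); // nearest root
}
```

Both forms return t = 4 for this configuration; the half-b version simply divides b by 2 up front, which folds the factors of 2 and 4 out of the formula.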


bool intersect_sphere(const struct Sphere* sphere, const struct Ray* ray, float* t)
{
 float3 rayToCenter = sphere->pos - ray->origin;

 /* calculate coefficients a, b, c from quadratic equation */

 /* float a = dot(ray->dir, ray->dir); // ray direction is normalised, dotproduct simplifies to 1 */ 
 float b = dot(rayToCenter, ray->dir);
 float c = dot(rayToCenter, rayToCenter) - sphere->radius*sphere->radius;
 float disc = b * b - c; /* discriminant of quadratic formula */

 /* solve for t (distance to hitpoint along ray) */

 if (disc < 0.0f) return false; /* ray misses the sphere */

 *t = b - sqrt(disc); /* nearest intersection */

 if (*t < 0.0f){
  *t = b + sqrt(disc); /* nearest intersection is behind the ray origin, try the far one */
  if (*t < 0.0f) return false;
 }

 return true;
}


Scene initialisation

For simplicity, in this first part of the tutorial the scene will be initialised on the device in the kernel function (in the second part the scene will be initialised on the host and passed to OpenCL, which is more flexible and memory efficient, but also requires more care with regard to memory alignment and the use of memory address spaces). Every work item will thus have a local copy of the scene (in this case one sphere).

__kernel void render_kernel(__global float3* output, int width, int height)
{
 const int work_item_id = get_global_id(0); /* the unique global id of the work item for the current pixel */
 int x_coord = work_item_id % width; /* x-coordinate of the pixel */
 int y_coord = work_item_id / width; /* y-coordinate of the pixel */

 /* create a camera ray */
 struct Ray camray = createCamRay(x_coord, y_coord, width, height);

 /* create and initialise a sphere */
 struct Sphere sphere1;
 sphere1.radius = 0.4f;
 sphere1.pos = (float3)(0.0f, 0.0f, 3.0f);
 sphere1.color = (float3)(0.9f, 0.3f, 0.0f);

 /* intersect ray with sphere */
 float t = 1e20;
 intersect_sphere(&sphere1, &camray, &t);

 /* if ray misses sphere, return background colour 
 background colour is a blue-ish gradient dependent on image height */
 if (t > 1e19){ 
  float fy = (float)y_coord / (float)height; /* pixel height remapped to range [0-1] */
  output[work_item_id] = (float3)(fy * 0.1f, fy * 0.3f, 0.3f);
  return;
 }

 /* if ray hits the sphere, it will return the sphere colour*/
 output[work_item_id] = sphere1.color;
}



Running the ray tracer 

Now we've got everything we need to start ray tracing! Let's begin with a plain colour sphere. When the ray misses the sphere, the background colour is returned:


A more interesting sphere with cosine-weighted colours, giving the impression of front lighting.


To achieve this effect we need to calculate the angle between the ray hitting the sphere surface and the normal at that point. The sphere normal at a specific intersection point on the surface is just the normalised vector (with unit length) going from the sphere center to that intersection point.

 float3 hitpoint = camray.origin + camray.dir * t;
 float3 normal = normalize(hitpoint - sphere1.pos);
 float cosine_factor = dot(normal, camray.dir) * -1.0f;

 output[work_item_id] = sphere1.color * cosine_factor;


Adding some stripe pattern by multiplying the colour with the sine of the height:


Screen-door effect using sine functions for both x and y-directions


Showing the surface normals (calculated in the code snippet above) as colours:



Source code

https://github.com/straaljager/OpenCL-path-tracing-tutorial-2-Part-1-Raytracing-a-sphere


Download demo (works on AMD, Nvidia and Intel)

The executable demo will render the above images.

https://github.com/straaljager/OpenCL-path-tracing-tutorial-2-Part-1-Raytracing-a-sphere/releases/tag/1.0



Part 2: Path tracing spheres

Very quick overview of ray tracing and path tracing

The following section covers the background of the ray tracing process in a very simplified way, but should be sufficient to understand the code in this tutorial. Scratch-a-Pixel provides a much more detailed explanation of ray tracing.  

Ray tracing is a general term that encompasses ray casting, Whitted ray tracing, distribution ray tracing and path tracing. So far, we have only traced rays from the camera (so-called "camera rays", "eye rays" or "primary rays") into the scene, a process called ray casting, resulting in plainly coloured images with no lighting. In order to achieve effects like shadows and reflections, new rays must be generated at the points where the camera rays intersect with the scene. These secondary rays can be shadow rays, reflection rays, transmission rays (for refractions), ambient occlusion rays or diffuse interreflection rays (for indirect lighting/global illumination). For example, shadow rays used for direct lighting are generated to point directly towards a light source while reflection rays are pointed in (or near) the direction of the reflection vector. For now we will skip direct lighting to generate shadows and go straight to path tracing, which is strangely enough easier to code, creates more realistic and prettier pictures and is just more fun.

In (plain) path tracing, rays are shot from the camera and bounce off the surface of scene objects in a random direction (like a high-energy bouncing ball), forming a chain of random rays connected together into a path. If the path hits a light emitting object such as a light source, it will return a colour which depends on the surface colours of all the objects encountered so far along the path, the colour of the light emitters, the angles at which the path hit a surface and the angles at which the path bounced off a surface. These ideas form the essence of the "rendering equation", proposed in a paper with the same name by Jim Kajiya in 1986.

Since the directions of the rays in a path are generated randomly, some paths will hit a light source while others won't, resulting in noise ("variance" in statistics due to random sampling). The noise can be reduced by shooting many random paths per pixel (= taking many samples) and averaging the results.
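The averaging step can be illustrated in isolation. The sketch below (a hypothetical stand-in, not tutorial code) "renders" a single pixel whose true value is 0.5 by averaging uniform random samples; with enough samples, the estimate converges to the true value, which is exactly why more samples per pixel mean less noise:

```cpp
#include <random>

// Average num_samples uniform random draws from [0, 1); the expected value
// is 0.5, and the error (noise) shrinks proportionally to 1/sqrt(num_samples).
float estimate_pixel(int num_samples, unsigned int seed)
{
    std::mt19937 rng(seed);
    std::uniform_real_distribution<float> uni(0.0f, 1.0f);
    float accum = 0.0f;
    for (int s = 0; s < num_samples; s++)
        accum += uni(rng);          // one random "path" result for this pixel
    return accum / num_samples;     // averaged colour
}
```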


Implementation of (plain) path tracing in OpenCL       

The code for the path tracer is based on smallpt from Kevin Beason and is largely the same as the ray tracer code from part 1 of this tutorial, with some important differences on the host side:

- the scene is initialised on the host (CPU) side, which requires a host version of the Sphere struct. Correct memory alignment in the host struct is very important to avoid shifting of values and wrongly initialised variables in the OpenCL struct, especially when using OpenCL's built-in data types such as float3 and float4. If necessary, the struct should be padded with dummy variables to ensure memory alignment (the total size of the struct must be a multiple of the size of float4).

struct Sphere
{
 cl_float radius;
 cl_float dummy1;   
 cl_float dummy2;
 cl_float dummy3;
 cl_float3 position;
 cl_float3 color;
 cl_float3 emission;
};
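The alignment rule can be checked at compile time. The sketch below uses plain floats instead of the cl_* typedefs (a float3 is stored like a float4, i.e. 16 bytes, which is why the three dummy floats after the radius are needed); the struct name and the float3 stand-in are ours:

```cpp
#include <cstddef>

// 16-byte stand-in for cl_float3 (OpenCL stores float3 like a float4).
struct alignas(16) float3_host { float x, y, z, pad; };

struct SphereHost {
    float radius;
    float dummy1, dummy2, dummy3; // padding so the next member starts at byte 16
    float3_host position;
    float3_host color;
    float3_host emission;
};

// Total size must be a multiple of sizeof(float4) = 16 bytes, and every
// float3 member must land on a 16-byte boundary.
static_assert(sizeof(SphereHost) % 16 == 0, "struct size must be a multiple of float4");
static_assert(offsetof(SphereHost, position) == 16, "float3 member must be 16-byte aligned");
```

Without the three dummy floats, position would start at byte 4 on the host but at byte 16 on the device, silently shifting every value that follows.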

- the scene (an array of spheres) is copied from the host to the OpenCL device into global memory (using CL_MEM_READ_WRITE) or constant memory (using CL_MEM_READ_ONLY)

// initialise scene
const int sphere_count = 9;
Sphere cpu_spheres[sphere_count];
initScene(cpu_spheres);

// Create buffers on the OpenCL device for the image and the scene
cl_output = Buffer(context, CL_MEM_WRITE_ONLY, image_width * image_height * sizeof(cl_float3));
cl_spheres = Buffer(context, CL_MEM_READ_ONLY, sphere_count * sizeof(Sphere));
queue.enqueueWriteBuffer(cl_spheres, CL_TRUE, 0, sphere_count * sizeof(Sphere), cpu_spheres);

- explicit memory management: once the scene is on the device, its pointer can be passed on to other device functions preceded by the keyword "__global" or "__constant".

- the host code automatically determines the local size of the kernel work group (the number of work items or "threads" per work group) by calling the OpenCL function kernel.getWorkGroupInfo(device).


The actual path tracing function

- iterative path tracing function: since OpenCL does not support recursion, the trace() function traces paths iteratively (instead of recursively) using a loop with a fixed number of bounces (iterations), representing path depth.

- each path starts off with an "accumulated colour" initialised to black and a "mask colour" initialised to pure white. The mask colour "collects" surface colours along its path by multiplication. The accumulated colour accumulates light from emitters along its path by adding emitted colours multiplied by the mask colour.

- generating random ray directions: new rays start at the hitpoint and get shot in a random direction by sampling a random point on the hemisphere above the surface hitpoint. For each new ray, a local orthogonal uvw-coordinate system and two random numbers are generated: one to pick a random azimuth angle around the horizon, the other for the altitude (with the zenith being the highest point).
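The same sampling step can be written as a standalone C++ sketch (names and the V3 helper are ours, not the tutorial's): build the local uvw frame around the surface normal and turn the two random numbers into a unit direction on the upper hemisphere.

```cpp
#include <cmath>

// Minimal 3D vector helpers for this standalone sketch.
struct V3 { float x, y, z; };
static V3 crossv3(V3 a, V3 b) { return { a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x }; }
static float dotv3(V3 a, V3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }
static V3 normv3(V3 a) { float l = std::sqrt(dotv3(a, a)); return { a.x/l, a.y/l, a.z/l }; }

// rand1, rand2 in [0, 1): rand1 picks the azimuth, rand2 the altitude.
V3 sample_hemisphere(V3 normal, float rand1, float rand2)
{
    float phi = 2.0f * 3.14159265f * rand1;   // azimuth angle around the normal
    float r2s = std::sqrt(rand2);             // radial distance from the zenith axis
    V3 w = normal;                            // frame "up" = surface normal
    V3 axis = std::fabs(w.x) > 0.1f ? V3{0.0f, 1.0f, 0.0f} : V3{1.0f, 0.0f, 0.0f};
    V3 u = normv3(crossv3(axis, w));          // first tangent
    V3 v = crossv3(w, u);                     // second tangent
    float zc = std::sqrt(1.0f - rand2);       // component along the normal
    return normv3(V3{ u.x*std::cos(phi)*r2s + v.x*std::sin(phi)*r2s + w.x*zc,
                      u.y*std::cos(phi)*r2s + v.y*std::sin(phi)*r2s + w.y*zc,
                      u.z*std::cos(phi)*r2s + v.z*std::sin(phi)*r2s + w.z*zc });
}
```

Any direction returned has unit length and a positive dot product with the normal, i.e. it always points away from the surface.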

- diffuse materials: the code for this tutorial only supports diffuse materials, which reflect incident light almost uniformly in all directions (in the hemisphere above the hitpoint)

- cosine-weighted importance sampling: because diffuse light reflection is not truly uniform, the light contribution from rays that are pointing away from the surface plane and closer to the surface normal is greater. Cosine-weighted importance sampling favours rays that are pointing away from the surface plane by multiplying their colour with the cosine of the angle between the surface normal and the ray direction.

- while ray tracing can get away with tracing only one ray per pixel to render a good image (more are needed for anti-aliasing and blurry effects like depth-of-field and glossy reflections), the inherently noisy nature of path tracing requires tracing of many paths per pixel (samples per pixel) and averaging the results to reduce noise to an acceptable level.

float3 trace(__constant Sphere* spheres, const Ray* camray, const int sphere_count, const int* seed0, const int* seed1){

 Ray ray = *camray;

 float3 accum_color = (float3)(0.0f, 0.0f, 0.0f);
 float3 mask = (float3)(1.0f, 1.0f, 1.0f);

 for (int bounces = 0; bounces < 8; bounces++){

  float t;   /* distance to intersection */
  int hitsphere_id = 0; /* index of intersected sphere */

  /* if ray misses scene, return background colour */
  if (!intersect_scene(spheres, &ray, &t, &hitsphere_id, sphere_count))
   return accum_color += mask * (float3)(0.15f, 0.15f, 0.25f);

  /* else, we've got a hit! Fetch the closest hit sphere */
  Sphere hitsphere = spheres[hitsphere_id]; /* version with local copy of sphere */

  /* compute the hitpoint using the ray equation */
  float3 hitpoint = ray.origin + ray.dir * t;
  
  /* compute the surface normal and flip it if necessary to face the incoming ray */
  float3 normal = normalize(hitpoint - hitsphere.pos); 
  float3 normal_facing = dot(normal, ray.dir) < 0.0f ? normal : normal * (-1.0f);

  /* compute two random numbers to pick a random point on the hemisphere above the hitpoint*/
  float rand1 = 2.0f * PI * get_random(seed0, seed1);
  float rand2 = get_random(seed0, seed1);
  float rand2s = sqrt(rand2);

  /* create a local orthogonal coordinate frame centered at the hitpoint */
  float3 w = normal_facing;
  float3 axis = fabs(w.x) > 0.1f ? (float3)(0.0f, 1.0f, 0.0f) : (float3)(1.0f, 0.0f, 0.0f);
  float3 u = normalize(cross(axis, w));
  float3 v = cross(w, u);

  /* use the coordinate frame and random numbers to compute the next ray direction */
  float3 newdir = normalize(u * cos(rand1)*rand2s + v*sin(rand1)*rand2s + w*sqrt(1.0f - rand2));

  /* add a very small offset to the hitpoint to prevent self intersection */
  ray.origin = hitpoint + normal_facing * EPSILON;
  ray.dir = newdir;

  /* add the colour and light contributions to the accumulated colour */
  accum_color += mask * hitsphere.emission; 

  /* the mask colour picks up surface colours at each bounce */
  mask *= hitsphere.color; 
  
  /* perform cosine-weighted importance sampling for diffuse surfaces*/
  mask *= dot(newdir, normal_facing); 
 }

 return accum_color;
}



A screenshot made with the code above (also see the screenshot at the top of this post). Notice the colour bleeding (bounced colour reflected from the floor onto the spheres), soft shadows and lighting coming from the background.



Source code

https://github.com/straaljager/OpenCL-path-tracing-tutorial-2-Part-2-Path-tracing-spheres


Downloadable demo (for AMD, Nvidia and Intel platforms, Windows only)

https://github.com/straaljager/OpenCL-path-tracing-tutorial-2-Part-2-Path-tracing-spheres/releases/tag/1.0


Useful resources

- Scratch-a-Pixel is an excellent free online resource to learn about the theory behind ray tracing and path tracing. Many code samples (in C++) are also provided. This article gives a great introduction to global illumination and path tracing.

- smallpt by Kevin Beason is a great little CPU path tracer in 100 lines of code. It formed the inspiration for the Cornell box scene and for many parts of the OpenCL code.


Up next

The next tutorial will cover the implementation of an interactive OpenGL viewport with a progressively refining image and an interactive camera with anti-aliasing and depth-of-field.

Tuesday, November 1, 2016

OpenCL path tracing tutorial 1: Firing up OpenCL

This is the first tutorial in a new series of GPU path tracing tutorials which will focus on OpenCL based rendering. The first few tutorials will cover the very basics of getting started with OpenCL and OpenCL based ray tracing and path tracing of simple scenes. Follow-up tutorials will use a cut-down version of AMD's RadeonRays framework (formerly known as FireRays) as a basis to which new features can be added in a modular manner. The goal is to incrementally work up to a full-featured GPU path tracer that includes all the features of RadeonRays. The RadeonRays source also forms the basis of AMD's Radeon ProRender Technology (which will also be integrated as a native GPU renderer in an upcoming version of Maxon's Cinema4D). In the end, developers that are new to rendering should be able to code up their own GPU renderer and integrate it into their application.


Why OpenCL?

The major benefit of OpenCL is its platform independence, meaning that the same code can run on CPUs and GPUs made by AMD, Nvidia and Intel (in theory at least, in practice there are quite a few implementation differences between the various platforms). The tutorials in this series should thus run on any PC, regardless of GPU vendor (moreover a GPU is not even required to run the program). 

Another advantage of OpenCL is that it can use all the available CPU and GPUs in a system simultaneously to accelerate parallel workloads (such as rendering or physics simulations).

In order to achieve this flexibility, some boilerplate code is required which selects an OpenCL platform (e.g. AMD or Nvidia) and one or more OpenCL devices (CPUs or GPUs). In addition, the OpenCL source must be compiled at runtime (unless the platform and device are known in advance), which adds some initialisation time when the program is first run.


OpenCL execution model quick overview

This is a super-quick overview of the OpenCL execution model, just enough to get started (there are plenty of more exhaustive sources on OpenCL available on the web). 

In order to run an OpenCL program, the following structures are required (and are provided by the OpenCL API):
  • Platform: which vendor (AMD/Nvidia/Intel)
  • Device: CPU, GPU, APU or integrated GPU
  • Context: the runtime interface between the host (CPU) and device (GPU or CPU) which manages all the OpenCL resources (programs, kernels, command queue, buffers). It receives and distributes kernels and transfers data.
  • Program: the entire OpenCL program (one or more kernels and device functions)
  • Kernel: the starting point into the OpenCL program, analogous to the main() function in a CPU program. Kernels are called from the host (CPU). They represent the basic units of executable code that run on an OpenCL device and are preceded by the keyword "__kernel"
  • Command queue: the command queue allows kernel execution commands to be sent to the device (execution can be in-order or out-of-order)
  • Memory objects: buffers and images
These structures are summarised in the diagram below (slide from AMD's Introduction to OpenCL programming):

OpenCL execution model

OpenCL memory model quick overview

The full details of the memory model are beyond the scope of this first tutorial, but we'll cover the basics here to get some understanding on how a kernel is executed on the device. 

There are four levels of memory on an OpenCL device, forming a memory hierarchy (from large and slow to tiny and fast memory):
  • Global memory (similar to RAM): the largest but also slowest form of memory, can be read and written to by all work items (threads) and all work groups on the device and can also be read/written by the host (CPU).
  • Constant memory: a small chunk of global memory on the device, can be read by all work items on the device (but not written to) and can be read/written by the host. Constant memory is slightly faster than global memory.
  • Local memory (similar to cache memory on the CPU): memory shared among work items in the same work group (work items executing together on the same compute unit are grouped into work groups). Local memory allows work items belonging to the same work group to share results. Local memory is much faster than global memory (up to 100x).
  • Private memory (similar to registers on the CPU): the fastest type of memory. Each work item (thread) has a tiny amount of private memory to store intermediate results that can only be used by that work item.



First OpenCL program

With the obligatory theory out of the way, it's time to dive into the code. To get used to the OpenCL syntax, this first program will be very simple (nothing earth shattering yet): the code will just add the corresponding elements of two floating point arrays together in parallel (all at once).

In a nutshell, what happens is the following:
  1. Initialise the OpenCL computing environment: create a platform, device, context, command queue, program and kernel and set up the kernel arguments
  2. Create two floating point number arrays on the host side and copy them to the OpenCL device
  3. Make OpenCL perform the computation in parallel (by determining global and local worksizes and launching the kernel)
  4. Copy the results of the computation from the device to the host
  5. Print the results to the console
To keep the code simple and readable, there is minimal error checking, the "cl" namespace is used for the OpenCL structures and the OpenCL kernel source is provided as a string in the CPU code. 

The code contains plenty of comments to clarify the new syntax:

// Getting started with OpenCL tutorial 
// by Sam Lapere, 2016, http://raytracey.blogspot.com
// Code based on http://simpleopencl.blogspot.com/2013/06/tutorial-simple-start-with-opencl-and-c.html

#include <iostream>
#include <vector>
#include <CL/cl.hpp> // main OpenCL include file 

using namespace cl;
using namespace std;

int main()
{
 // Find all available OpenCL platforms (e.g. AMD, Nvidia, Intel)
 vector<Platform> platforms;
 Platform::get(&platforms);

 // Show the names of all available OpenCL platforms
 cout << "Available OpenCL platforms: \n\n";
 for (unsigned int i = 0; i < platforms.size(); i++)
  cout << "\t" << i + 1 << ": " << platforms[i].getInfo<CL_PLATFORM_NAME>() << endl;

 // Choose and create an OpenCL platform
 cout << endl << "Enter the number of the OpenCL platform you want to use: ";
 unsigned int input = 0;
 cin >> input;
// Handle incorrect user input
 while (input < 1 || input > platforms.size()){
  cin.clear(); //clear errors/bad flags on cin
  cin.ignore(cin.rdbuf()->in_avail(), '\n'); // ignores exact number of chars in cin buffer
  cout << "No such platform." << endl << "Enter the number of the OpenCL platform you want to use: ";
  cin >> input;
 }

 Platform platform = platforms[input - 1];

 // Print the name of chosen OpenCL platform
 cout << "Using OpenCL platform: \t" << platform.getInfo<CL_PLATFORM_NAME>() << endl;

 // Find all available OpenCL devices (e.g. CPU, GPU or integrated GPU)
 vector<Device> devices;
 platform.getDevices(CL_DEVICE_TYPE_ALL, &devices);

 // Print the names of all available OpenCL devices on the chosen platform
 cout << "Available OpenCL devices on this platform: " << endl << endl;
 for (unsigned int i = 0; i < devices.size(); i++)
  cout << "\t" << i + 1 << ": " << devices[i].getInfo<CL_DEVICE_NAME>() << endl;

 // Choose an OpenCL device 
 cout << endl << "Enter the number of the OpenCL device you want to use: ";
 input = 0;
 cin >> input;
// Handle incorrect user input
 while (input < 1 || input > devices.size()){
  cin.clear(); //clear errors/bad flags on cin
  cin.ignore(cin.rdbuf()->in_avail(), '\n'); // ignores exact number of chars in cin buffer
  cout << "No such device. Enter the number of the OpenCL device you want to use: ";
  cin >> input;
 }
 
 Device device = devices[input - 1];

 // Print the name of the chosen OpenCL device
 cout << endl << "Using OpenCL device: \t" << device.getInfo<CL_DEVICE_NAME>() << endl << endl;

 // Create an OpenCL context on that device.
 // the context manages all the OpenCL resources 
 Context context = Context(device);

 ///////////////////
 // OPENCL KERNEL //
 ///////////////////

 // the OpenCL kernel in this tutorial is a simple program that adds two float arrays in parallel  
 // the source code of the OpenCL kernel is passed as a string to the host
 // the "__global" keyword denotes that "global" device memory is used, which can be read and written 
 // to by all work items (threads) and all work groups on the device and can also be read/written by the host (CPU)

 const char* source_string =
  " __kernel void parallel_add(__global float* x, __global float* y, __global float* z){ "
  " const int i = get_global_id(0); " // get a unique number identifying the work item in the global pool
  " z[i] = y[i] + x[i];    " // add two arrays 
  "}";

 // Create an OpenCL program by performing runtime source compilation
 Program program = Program(context, source_string);

 // Build the program and check for compilation errors 
 cl_int result = program.build({ device }, "");
 if (result) cout << "Error during compilation! (" << result << ")" << endl;

 // Create a kernel (entry point in the OpenCL source program)
 // kernels are the basic units of executable code that run on the OpenCL device
 // the kernel forms the starting point into the OpenCL program, analogous to main() in CPU code
 // kernels can be called from the host (CPU)
 Kernel kernel = Kernel(program, "parallel_add");

 // Create input data arrays on the host (= CPU)
 const int numElements = 10;
 float cpuArrayA[numElements] = { 0.0f, 1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f, 9.0f };
 float cpuArrayB[numElements] = { 0.1f, 0.2f, 0.3f, 0.4f, 0.5f, 0.6f, 0.7f, 0.8f, 0.9f, 1.0f };
 float cpuOutput[numElements] = {}; // empty array for storing the results of the OpenCL program

 // Create buffers (memory objects) on the OpenCL device, allocate memory and copy input data to device.
 // Flags indicate how the buffer should be used e.g. read-only, write-only, read-write
 Buffer clBufferA = Buffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, numElements * sizeof(cl_float), cpuArrayA);
 Buffer clBufferB = Buffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, numElements * sizeof(cl_float), cpuArrayB);
 Buffer clOutput = Buffer(context, CL_MEM_WRITE_ONLY, numElements * sizeof(cl_float), NULL);

 // Specify the arguments for the OpenCL kernel
 // (the arguments are __global float* x, __global float* y and __global float* z)
 kernel.setArg(0, clBufferA); // first argument 
 kernel.setArg(1, clBufferB); // second argument 
 kernel.setArg(2, clOutput);  // third argument 

 // Create a command queue for the OpenCL device
 // the command queue allows kernel execution commands to be sent to the device
 CommandQueue queue = CommandQueue(context, device);

 // Determine the global and local number of "work items"
 // The global work size is the total number of work items (threads) that execute in parallel
 // Work items executing together on the same compute unit are grouped into "work groups"
 // The local work size defines the number of work items in each work group
 // Important: global_work_size must be an integer multiple of local_work_size 
 std::size_t global_work_size = numElements;
 std::size_t local_work_size = 10; // could also be 1, 2 or 5 in this example
 // when local_work_size equals 10, all ten number pairs from both arrays will be added together in one go

 // Launch the kernel and specify the global and local number of work items (threads)
 queue.enqueueNDRangeKernel(kernel, NullRange, global_work_size, local_work_size);

 // Read and copy OpenCL output to CPU 
 // the "CL_TRUE" flag blocks the read operation until all work items have finished their computation
 queue.enqueueReadBuffer(clOutput, CL_TRUE, 0, numElements * sizeof(cl_float), cpuOutput);

 // Print results to console
 for (int i = 0; i < numElements; i++)
  cout << cpuArrayA[i] << " + " << cpuArrayB[i] << " = " << cpuOutput[i] << endl;

 system("PAUSE");
}


Compilation instructions (for Visual Studio on Windows)

To compile this code, it's recommended to download and install the AMD APP SDK (it works on systems with GPUs or CPUs from AMD, Nvidia and Intel, even if no AMD hardware is installed), since Nvidia's OpenCL implementation is no longer up-to-date.
  1. Start an empty Console project in Visual Studio (any recent version should work, including Express and Community) and set to Release mode 
  2. Add the SDK include path to the "Additional Include Directories" (e.g. "C:\Program Files (x86)\AMD APP SDK\2.9-1\include") 
  3. In Linker > Input, add "opencl.lib" to "Additional Dependencies" and add the OpenCL lib path to "Additional Library Directories"  (e.g. "C:\Program Files (x86)\AMD APP SDK\2.9-1\lib\x86")
  4. Add the main.cpp file (or create a new file and paste the code) and build the project

Download binaries

The executable (Windows only) for this tutorial is available at 
https://github.com/straaljager/OpenCL-path-tracing-tutorial-1-Getting-started/releases/tag/v1.0

It runs on CPUs and/or GPUs from AMD, Nvidia and Intel.


Useful References

- "A gentle introduction to OpenCL":
http://www.drdobbs.com/parallel/a-gentle-introduction-to-opencl/231002854 

- "Simple start with OpenCL":
http://simpleopencl.blogspot.co.nz/2013/06/tutorial-simple-start-with-opencl-and-c.html 

- Anteru's blogpost, Getting started with OpenCL (uses old OpenCL API)
https://anteru.net/blog/2012/11/03/2009/index.html
 
- AMD introduction to OpenCL programming:
http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/01/Introduction_to_OpenCL_Programming-201005.pdf


Up next

In the next tutorial we'll start rendering an image with OpenCL.

Tuesday, September 20, 2016

GPU path tracing tutorial 4: Optimised BVH building, faster traversal and intersection kernels and HDR environment lighting

For this tutorial, I've implemented a couple of improvements based on the high performance GPU ray tracing framework of Timo Aila, Samuli Laine and Tero Karras (Nvidia research), which is described in their 2009 paper "Understanding the efficiency of ray traversal on GPUs" and the 2012 addendum to the original paper, which contains specifically hand-tuned kernels for Fermi and Kepler GPUs (these also work on Maxwell). The code for this framework is open source and can be found at the Google Code repository (which is about to be phased out) or on GitHub. The ray tracing kernels are thoroughly optimised and deliver state-of-the-art performance (the code from this tutorial is 2-3 times faster than the previous one). For that reason, they are also used in the production grade CUDA path tracer Cycles:

- wiki.blender.org/index.php/Dev:Source/Render/Cycles/BVH

- github.com/doug65536/blender/blob/master/intern/cycles/kernel/kernel_bvh.h

- github.com/doug65536/blender/blob/master/intern/cycles/kernel/kernel_bvh_traversal.h

The major improvements from this framework are:

- Spatial split BVH: this BVH building method is based on Nvidia's "Spatial splits in bounding volume hierarchies" paper by Martin Stich. It aims to reduce BVH node overlap (a high amount of node overlap lowers ray tracing performance) by combining the object splitting strategy of regular BVH building (according to a surface area heuristic or SAH) with the space splitting method of kd-tree building. The algorithm determines for each triangle whether "splitting" it (by creating duplicate references to the triangle and storing them in its overlapping nodes) lowers the cost of ray/node intersections compared to the "unsplit" case. The result is a very high quality acceleration structure with ray traversal performance which on average is significantly higher than (or in the worst case equal to) a regular SAH BVH.
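To make the cost comparison concrete, here is a minimal sketch of the surface area heuristic the builder evaluates for each candidate split (my own illustrative code, not the framework's actual implementation; the constants cTrav and cIsect are assumed traversal/intersection costs):

```cpp
#include <cassert>
#include <cmath>

struct AABB { float min[3], max[3]; };

// Surface area of an axis-aligned bounding box
float surfaceArea(const AABB& b) {
    float dx = b.max[0] - b.min[0];
    float dy = b.max[1] - b.min[1];
    float dz = b.max[2] - b.min[2];
    return 2.0f * (dx * dy + dy * dz + dz * dx);
}

// SAH cost of splitting a node into (left, right):
// C = C_trav + (A_left/A_parent) * N_left * C_isect
//            + (A_right/A_parent) * N_right * C_isect
// The builder evaluates this for both the object split and the spatial
// split candidates and keeps whichever split has the lowest cost.
float sahCost(const AABB& parent, const AABB& left, int numLeft,
              const AABB& right, int numRight,
              float cTrav = 1.0f, float cIsect = 1.0f) {
    float invA = 1.0f / surfaceArea(parent);
    return cTrav + cIsect * invA * (surfaceArea(left)  * numLeft +
                                    surfaceArea(right) * numRight);
}
```

Note that a spatial split duplicates triangle references, so numLeft + numRight can exceed the parent's triangle count; the split is only worthwhile if the reduced node overlap still yields a lower total cost.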

- Woop ray/triangle intersection: this algorithm is explained in "Real-time ray tracing of dynamic scenes on an FPGA chip". It basically transforms each triangle in the mesh to a unit triangle with vertices (0, 0, 0), (1, 0, 0) and (0, 1, 0). During rendering, a ray is transformed into "unit triangle space" using a triangle specific affine triangle transformation and intersected with the unit triangle, which is a much simpler computation.
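The final unit-triangle test is simple enough to sketch in a few lines. This assumes the ray (origin o, direction d) has already been transformed into unit triangle space; the per-triangle affine transform itself is precomputed at BVH build time and is not shown here:

```cpp
#include <cassert>
#include <cmath>

// Ray/unit-triangle intersection in "unit triangle space".
// The unit triangle has vertices (0,0,0), (1,0,0) and (0,1,0), i.e. it
// lies in the z = 0 plane, so the test reduces to:
//   t = -o.z / d.z      (distance to the z = 0 plane)
//   u = o.x + t * d.x   (barycentric coordinates in that plane)
//   v = o.y + t * d.y
// with a hit iff t > 0, u >= 0, v >= 0 and u + v <= 1.
bool intersectUnitTriangle(const float o[3], const float d[3],
                           float& t, float& u, float& v) {
    if (std::fabs(d[2]) < 1e-8f) return false; // ray parallel to triangle plane
    t = -o[2] / d[2];
    if (t <= 0.0f) return false;               // triangle behind the ray
    u = o[0] + t * d[0];
    v = o[1] + t * d[1];
    return u >= 0.0f && v >= 0.0f && u + v <= 1.0f;
}
```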

- Hand optimised GPU ray traversal and intersection kernels: these kernels use a number of specific tricks to minimise thread divergence within a warp (a warp is a group of 32 SIMD threads which operate in lockstep, i.e. all threads within a warp must execute the same instructions). Thread divergence occurs when one or more threads within a warp follow a different code execution branch, which (in the absolute worst case) could lead to a scenario where only one thread is active while the other 31 threads in the warp are idling, waiting for it to finish. Using "persistent threads" aims to mitigate this problem: when a predefined number of CUDA threads within a warp is idling, the GPU will dynamically fetch new work for these threads in order to increase compute occupancy. The persistent threads feature is used in the original framework. To keep things simple for this tutorial, it has not been implemented as it requires generating and buffering batches of rays, but it is relatively easy to add. Another optimisation to increase SIMD efficiency in a warp is postponing ray/triangle intersection tests until all threads in the same warp have found a leaf node. Robbin Marcus wrote a very informative blogpost about these specific optimisations. In addition to these tricks, the Kepler kernel also uses the GPU's video instructions to perform min/max operations (see "renderkernel.cu" at the top).

UPDATE: an attentive reader (who knows what he's talking about) corrected a mistake in the above paragraph: "Persistent threading on the GPU was designed to work around the slow dynamic load balancing hardware of the time (GTX 260), not to address branch divergence (totally separate issue). Occupancy is again a different issue, related to how many registers your kernel needs versus how many are present in a SM to spawn threads for latency hiding."

Other new features:
- a basic OBJ loader which triangulates n-sided faces (n-gons, triangle fans)
- simple HDR environment map lighting, which for simplicity does not use any filtering (hence the blockiness) or importance sampling yet. The code is based on http://blog.hvidtfeldts.net/index.php/2012/10/image-based-lighting/
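The direction-to-texture-coordinate mapping behind that environment lookup can be sketched as follows. This uses one common lat-long (equirectangular) convention in the spirit of the linked blogpost; the exact axis convention may differ from the tutorial code:

```cpp
#include <cassert>
#include <cmath>

// Map a normalised world-space direction to (u, v) coordinates of a
// lat-long (equirectangular) HDR environment map.
// Convention assumed here (other codebases may differ):
//   u in [0,1] comes from the azimuth angle around the y axis,
//   v in [0,1] comes from the polar angle measured from +y (v = 0 at the zenith).
void dirToLatLongUV(float dx, float dy, float dz, float& u, float& v) {
    const float PI = 3.14159265358979f;
    u = (1.0f + std::atan2(dx, -dz) / PI) * 0.5f; // azimuth -> [0,1]
    v = std::acos(dy) / PI;                       // polar   -> [0,1]
}
```

Sampling the map at (u * width, v * height) without any filtering is what produces the blockiness mentioned above.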

Some renders with the code from this tutorial (the "Roman Settlement" city scene was created by LordGood and converted from a SketchUp model, also used by Mitsuba Render. The HDR maps are available at the HDR Labs website):

 

Source code
 
The tutorial's source code can be found at github.com/straaljager/GPU-path-tracing-tutorial-4

For clarity, I've tried to simplify the code where possible, keeping the essential improvements provided by the framework and cutting out the unnecessary parts. I have also added clarifying comments to the most difficult code parts where appropriate. There is quite a lot of new code, but the most important and interesting files are:

- SplitBVHBuilder.cpp contains the algorithm for building BVH with spatial splits
- CudaBVH.cpp shows the particular layout in which the BVH nodes are stored and Woop's triangle transformation method
- renderkernel.cu demonstrates two methods of ray/triangle intersection: a regular ray/triangle intersection algorithm similar to the one in GPU path tracing tutorial 3, denoted as DEBUGintersectBVHandTriangles() and a method using Woop's ray/triangle intersection named intersectBVHandTriangles()  

Demo 
 
A downloadable demo (which requires an Nvidia GPU) is available from
github.com/straaljager/GPU-path-tracing-tutorial-4/releases

 
Working with and learning this ray tracing framework was a lot of fun, head scratching and cursing (mostly the latter). It has given me a deeper appreciation for both the intricacies and strengths of GPUs and taught me a multitude of ways to optimise CUDA code to maximise performance (even down to the level of assembly/PTX). I recommend anyone who wants to build a GPU renderer to sink their teeth into it (the source code in this tutorial should make it easier to digest the complexities). It keeps astounding me what GPUs are capable of and how much they have evolved in the last decade. 

The next tutorial(s) will cover direct lighting, physical sky, area lights, textures and instancing.  I've also had a few requests from people who are new to ray tracing for a more thorough explanation of the code from previous tutorials. At some point (when time permits), I hope to create tutorials with illustrations and pseudocode of all the concepts covered.

Thursday, June 9, 2016

Real-time path traced Quake 2

Last week, Edd Biddulph released the code and some videos of a very impressive project he's working on: a real-time path traced version of Quake 2 running on OpenGL 3.3.

Project link with videos: http://amietia.com/q2pt.html
Full source code on Github: https://github.com/eddbiddulph/yquake2/tree/pathtracing

Quake 2, now with real-time indirect lighting and soft shadows
The path tracing engine behind this project is quite astonishing when you consider the number of lightsources in the level and the number of dynamic characters (each with a unique pose) that are updated every single frame. I had a very interesting talk with Edd on some of the features of his engine, revealing that he used a lot of clever optimisations (some of which take advantage of the specific properties of the Quake 2 engine). 

Copying Edd's answers here:
Why Quake 2 instead of Quake 3
I chose Quake 2 because it has area lightsources and the maps were designed with multiple-bounce lighting in mind. As far as I know, Quake 3 was not designed this way and didn't even have area lightsources for the baked lighting. Plus Quake 2's static geometry was still almost entirely defined by a binary space partitioning tree (BSP) and I found that traversing a BSP is pretty easy in GLSL and seems to perform quite well, although I haven't made any comparisons to other approaches. Quake 3 has a lot more freeform geometry such as tessellated Bezier surfaces so it doesn't lend itself so well to special optimisations. I'm a big fan of both games of course :)

How the engine updates dynamic objects
All dynamic geometry is inserted into a single structure which is re-built from scratch on every frame. Each node is an axis-aligned bounding box and has a 'skip pointer' to skip over the children. I make a node for each triangle and build the structure bottom-up after sorting the leaf nodes by morton code for spatial coherence. I chose this approach because the implementation is simple both for building and traversing, the node hierarchy is quite flexible, and building is fast although the whole CPU side is single-threaded for now (mostly because Quake 2 is single-threaded of course). I'm aware that the lack of ordered traversal results in many more ray-triangle intersection tests than are necessary, but there is little divergence and low register usage since the traversal is stackless.
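The Morton-code sort Edd mentions is a standard technique; a minimal sketch of the 30-bit 3D Morton code commonly used for this (an illustrative formulation, not necessarily his exact code) looks like this:

```cpp
#include <cassert>
#include <cstdint>

// Expand a 10-bit integer so its bits occupy every third bit position.
// This is the standard bit-interleaving trick behind 3D Morton codes.
uint32_t expandBits(uint32_t v) {
    v = (v * 0x00010001u) & 0xFF0000FFu;
    v = (v * 0x00000101u) & 0x0F00F00Fu;
    v = (v * 0x00000011u) & 0xC30C30C3u;
    v = (v * 0x00000005u) & 0x49249249u;
    return v;
}

// 30-bit Morton code for coordinates quantised to [0, 1023].
// Sorting leaf AABB centroids by this key places spatially nearby
// triangles next to each other, which is what gives the bottom-up
// rebuild its spatial coherence.
uint32_t morton3D(uint32_t x, uint32_t y, uint32_t z) {
    return (expandBits(x) << 2) | (expandBits(y) << 1) | expandBits(z);
}
```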

How to keep noise to a minimum when dealing with so many lights
The light selection is a bit more tricky. I divided lightsources into two categories - regular and 'skyportals'. A skyportal is just a light-emitting surface from the original map data which has a special texture applied, which indicates to the game that the skybox should be drawn there. Each leaf in the BSP has two lists of references to lightsources. The first list references regular lightsources which are potentially visible from the leaf according to the PVS (potentially visible set) tables. The second list references skyportals which are contained within the leaf. At an intersection point the first list is used to trace shadow rays and make explicit samples of lightsources, and the second list is used to check if the intersection point is within a skyportal surface. If it's within a skyportal then there is a contribution of light from the sky. This way I can perform a kind of offline multiple importance sampling (MIS) because skyportals are generally much larger than regular lights. For regular lights of course I use importance sampling, but I believe the weight I use is more approximate than usual because it's calculated always from the center of the lightsource rather than from the real sample position on the light.

One big point about the lights right now is that the pointlights that the original game used are being added as 4 triangular lightsources arranged in a tetrahedron so they tend to make quite a performance hit. I'd like to try adding a whole new type of lightsource such as a spherical light to see if that works out better.

Ray tracing specific optimisations
I'm making explicit light samples by tracing shadow rays directly towards points on the lightsources. MIS isn't being performed in the shader, but I'm deciding offline whether a lightsource should be sampled explicitly or implicitly.

Which parts of the rendering process use rasterisation
I use hardware rasterisation only for the primary rays and perform the raytracing in the same pass for the following reasons:
  • Translucent surfaces can be lit and can receive shadows identically to all other surfaces.
  • Hardware anti-aliasing can be used, of course.
  • Quake 2 sorts translucent BSP surfaces and draws them in a second pass, but it doesn't do this for entities (the animated objects) so I would need to change that design and I consider this too intrusive and likely to break something. One of my main goals was to preserve the behaviour of Q2's own renderer.
  • I'm able to eliminate overdraw by making a depth-only pre-pass which even uses the same GL buffers that the raytracer uses so it has little overhead except for a trick that I had to make since I packed the three 16-bit triangle indices for the raytracer into two 32-bit elements (this was necessary due to OpenGL limitations on texture buffer objects).
  • It's nice that I don't need to manage framebuffer stuff and design a good g-buffer format.
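The index packing mentioned in the overdraw point can be sketched like this. The layout below is an assumption for illustration (one obvious choice), not necessarily the one the project actually uses:

```cpp
#include <cassert>
#include <cstdint>

// Pack three 16-bit triangle indices into two 32-bit words, e.g. for
// storage in an OpenGL texture buffer object with 32-bit texels.
// Assumed layout: word0 = i0 | (i1 << 16), word1 = i2.
void packIndices(uint16_t i0, uint16_t i1, uint16_t i2,
                 uint32_t& word0, uint32_t& word1) {
    word0 = uint32_t(i0) | (uint32_t(i1) << 16);
    word1 = uint32_t(i2);
}

void unpackIndices(uint32_t word0, uint32_t word1,
                   uint16_t& i0, uint16_t& i1, uint16_t& i2) {
    i0 = uint16_t(word0 & 0xFFFFu);
    i1 = uint16_t(word0 >> 16);
    i2 = uint16_t(word1 & 0xFFFFu);
}
```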
The important project files containing the path tracing code
If you want to take a look at the main parts that I wrote, stick to src/client/refresh/r_pathtracing.c and src/client/refresh/pathtracer.glsl. The rest of my changes were mostly about adding various GL extensions and hooking in my stuff to the old refresh subsystem (Quake 2's name for the renderer). I apologise that r_pathtracing.c is such a huge file, but I did try to comment it nicely and refactoring is already on my huge TODO list. The GLSL file is converted into a C header at build time by stringifyshaders.sh which is at the root of the codebase.

More interesting tidbits
- This whole project is only made practical by the fact that the BSP files still contain surface emission data despite the game itself making no use of it at all. This is clearly a by-product of keeping the map-building process simple, and it's a very fortunate one!
- The designers of the original maps sometimes placed pointlights in front of surface lights to give the appearance that they are glowing or emitting light at their sides like a fluorescent tube diffuser. This looks totally weird in my pathtracer so I have static pointlights disabled by default. They also happen to go unused by the original game, so it's also fortunate that they still exist among the map data. 
- The weapon that is viewed in first-person is drawn with a 'depth hack' (it's literally called RF_DEPTHHACK), in which the range of depth values is reduced to prevent the weapon poking in to walls. Unfortunately the pathtracer's representation would still poke in to walls because it needs the triangles in worldspace, and this would cause the tip of the weapon to turn black (completely in shadow). I worked around this by 'virtually' scaling down the weapon for the pathtracer. This is one of the many ways in which raytracing turns out to be tricky for videogames, but I'm sure there can always be elegant solutions.
If you want to mess around with the path traced version of Quake 2 yourself (both AMD and Nvidia cards are supported as the path tracer uses OpenGL), simply follow these steps:
  • on Windows, follow the steps under section 2.3 in the readme file (link: https://github.com/eddbiddulph/yquake2/blob/pathtracing/README). Lots of websites still offer the Quake 2 demo for download (e.g. http://www.ausgamers.com/files/download/314/quake-2-demo)
  • download and unzip the Yamagi Quake 2 source code with path tracing from https://github.com/eddbiddulph/yquake2
  • following the steps under section 2.6 of the readme file, download and extract the premade MinGW build environment, run MSYS32, navigate to the source directory with the makefile, "make" the release build and replace the files "q2ded.exe", "quake2.exe" and "baseq2\game.dll" in the Quake 2 game installation with the freshly built ones
  • start the game by double clicking "quake2", open the Quake2 console with the ~ key (under the ESC key), type "gl_pt_enable 1", hit Enter and the ~ key to close the console
  • the game should now run with path tracing

Edd also said he's planning to add new special path tracing effects (such as light emitting particles from the railgun) and to implement more optimisations to reduce the path tracing noise.

Tuesday, May 17, 2016

Start your engines: source code for FireRays (AMD's high performance OpenCL based GPU ray tracing framework) available

AMD has just released the full source code of FireRays, their OpenCL based GPU renderer which was first available as an SDK library in August 2015 (see http://raytracey.blogspot.co.nz/2015/08/firerays-amds-opencl-based-high.html). This is an outstanding move by AMD which significantly lowers the barrier for developers to enter the GPU rendering arena and create an efficient OpenCL based path tracing engine that is able to run on hardware from AMD, Intel and Nvidia without extra effort. 

Here's an ugly sample render of FireRays provided by AMD:


And an old video from one of the developers:


Nvidia open sourced their high performance CUDA based ray tracing framework in 2009, but hasn't updated it since 2012 (presumably due to the lack of any real competition from AMD in this area) and has since focused more on developing OptiX, a CUDA based closed source ray tracing library. Intel open sourced Embree in 2011, which is being actively developed and updated with new features and performance improvements. They even released another open source high performance ray tracer for scientific visualisation called OSPRay.

FireRays seems to have some advanced features such as ray filtering, geometry and ray masking (to make certain objects invisible to the camera or selectively ignore effects like shadows and reflections) and support for volumetrics. Hopefully AMD will also release some in-depth documentation and getting started tutorials in order to maximise adoption of this new technology among developers who are new to GPU ray tracing.