Ray Tracey's blog: GPU path tracing tutorial 1: Drawing First Blood

Monday, October 5, 2015

GPU path tracing tutorial 1: Drawing First Blood

In early 2011 I developed a simple real-time path traced Pong game together with Kerrash on top of an open source GPU path tracer called tokaspt (developed by Thierry Berger-Perrin) which could only render spheres, but was bloody fast at it. The physics were bodged, but the game proved that path tracing of very simple scenes at 30 fps was feasible, although a bit noisy. You can still download it from https://code.google.com/p/tokap-the-once-known-as-pong/. Since that time I've always wanted to write a short and simple tutorial about GPU path tracing to show how to make your GPU draw an image with high quality ray traced colour bleeding with a minimum of code and now is a good time to do exactly that.

This tutorial is not meant as an introduction to ray tracing or path tracing as there are plenty of excellent ray tracing tutorials for beginners online such as Scratch-a-Pixel (also check out the old version which contains more articles) and Minilight (more links at the bottom of this article). The goal of this tutorial is simply to show how incredibly easy it is to turn a simple CPU path tracer into a CUDA accelerated version. Being a fan of the KISS principle from design and engineering (Keep It Simple Stupid) and aiming to avoid unnecessary complexity, I've chosen to cudafy Kevin Beason's smallpt, the most basic but still fully functional CPU path tracer around. It's a very short piece of code that doesn't require the user to install any tedious libraries to compile the code (apart from Nvidia's CUDA Toolkit).

The full CPU version of smallpt can be found at http://www.kevinbeason.com/smallpt/ Due to its compactness the code is not very easy to read, but fortunately David Cline made a great Powerpoint presentation explaining what each line in smallpt is doing with references to Peter Shirley's "Realistic Ray Tracing" book.

To keep things simple and free of needless clutter, I've stripped out the code for the tent filter, supersampling, Russian Roulette and the material BRDFs for reflective and refractive materials, leaving only the diffuse BRDF. The 3D vector class from smallpt is replaced by CUDA's own built-in float3 type (built-in CUDA types are more efficient due to automatic memory alignment) which has the same linear algebra math functions as a vector such as addition, subtraction, multiplication, normalize, length, dot product and cross product. For reasons of code clarity, there is no error checking when initialising CUDA. To compile the code, save the code in a file with ".cu" file extension and follow these CUDA installation guides to install Nvidia's GPU Computing Toolkit and configure the programming tools to work with CUDA.

After reading the slides from David Cline, the commented code below should speak for itself, but feel free to drop me a comment below if some things are still not clear.

So without further ado, here's the full CUDA code (also on Github at https://github.com/straaljager/GPU-path-tracing-tutorial-1):


// smallptCUDA by Sam Lapere, 2015
// based on smallpt, a path tracer by Kevin Beason, 2008  
 
#include <iostream>
#include <cuda_runtime.h>
#include <vector_types.h>
#include "device_launch_parameters.h"
#include <cutil_math.h> // from http://www.icmc.usp.br/~castelo/CUDA/common/inc/cutil_math.h

#define M_PI 3.14159265359f  // pi
#define width 512  // screenwidth
#define height 384 // screenheight
#define samps 1024 // samples 

// __device__ : executed on the device (GPU) and callable only from the device

struct Ray { 
 float3 orig; // ray origin
 float3 dir;  // ray direction 
 __device__ Ray(float3 o_, float3 d_) : orig(o_), dir(d_) {} 
};

enum Refl_t { DIFF, SPEC, REFR };  // material types, used in radiance(), only DIFF used here

struct Sphere {

 float rad;            // radius 
 float3 pos, emi, col; // position, emission, colour 
 Refl_t refl;          // reflection type (e.g. diffuse)

__device__ float intersect_sphere(const Ray &r) const { 
          
 // ray/sphere intersection
 // returns distance t to intersection point, 0 if no hit  
 // ray equation: p(x,y,z) = ray.orig + t*ray.dir
 // general sphere equation: x^2 + y^2 + z^2 = rad^2 
 // classic quadratic equation of form ax^2 + bx + c = 0 
 // solution x = (-b +- sqrt(b*b - 4ac)) / 2a
 // solve t^2*ray.dir*ray.dir + 2*t*(orig-p)*ray.dir + (orig-p)*(orig-p) - rad*rad = 0 
 // more details in "Realistic Ray Tracing" book by P. Shirley or Scratchapixel.com

  float3 op = pos - r.orig;    // distance from ray.orig to center sphere 
  float t, epsilon = 0.0001f;  // epsilon required to prevent floating point precision artefacts
  float b = dot(op, r.dir);    // b in quadratic equation
  float disc = b*b - dot(op, op) + rad*rad;  // discriminant quadratic equation
  if (disc<0) return 0;       // if disc < 0, no real solution (we're not interested in complex roots) 
   else disc = sqrtf(disc);    // if disc >= 0, check for solutions using negative and positive discriminant
  return (t = b - disc)>epsilon ? t : ((t = b + disc)>epsilon ? t : 0); // pick closest point in front of ray origin
 }
};

// SCENE
// 9 spheres forming a Cornell box
// small enough to be in constant GPU memory
// { float radius, { float3 position }, { float3 emission }, { float3 colour }, refl_type }
__constant__ Sphere spheres[] = {
 { 1e5f, { 1e5f + 1.0f, 40.8f, 81.6f }, { 0.0f, 0.0f, 0.0f }, { 0.75f, 0.25f, 0.25f }, DIFF }, //Left 
 { 1e5f, { -1e5f + 99.0f, 40.8f, 81.6f }, { 0.0f, 0.0f, 0.0f }, { .25f, .25f, .75f }, DIFF }, //Right 
 { 1e5f, { 50.0f, 40.8f, 1e5f }, { 0.0f, 0.0f, 0.0f }, { .75f, .75f, .75f }, DIFF }, //Back 
 { 1e5f, { 50.0f, 40.8f, -1e5f + 600.0f }, { 0.0f, 0.0f, 0.0f }, { 1.00f, 1.00f, 1.00f }, DIFF }, //Frnt 
 { 1e5f, { 50.0f, 1e5f, 81.6f }, { 0.0f, 0.0f, 0.0f }, { .75f, .75f, .75f }, DIFF }, //Botm 
 { 1e5f, { 50.0f, -1e5f + 81.6f, 81.6f }, { 0.0f, 0.0f, 0.0f }, { .75f, .75f, .75f }, DIFF }, //Top 
 { 16.5f, { 27.0f, 16.5f, 47.0f }, { 0.0f, 0.0f, 0.0f }, { 1.0f, 1.0f, 1.0f }, DIFF }, // small sphere 1
 { 16.5f, { 73.0f, 16.5f, 78.0f }, { 0.0f, 0.0f, 0.0f }, { 1.0f, 1.0f, 1.0f }, DIFF }, // small sphere 2
 { 600.0f, { 50.0f, 681.6f - .77f, 81.6f }, { 2.0f, 1.8f, 1.6f }, { 0.0f, 0.0f, 0.0f }, DIFF }  // Light
};

__device__ inline bool intersect_scene(const Ray &r, float &t, int &id){

 float n = sizeof(spheres) / sizeof(Sphere), d, inf = t = 1e20;  // t is distance to closest intersection, initialise t to a huge number outside scene
 for (int i = int(n); i--;)  // test all scene objects for intersection
  if ((d = spheres[i].intersect_sphere(r)) && d<t){  // if newly computed intersection distance d is smaller than current closest intersection distance
    t = d;  // keep track of distance along ray to closest intersection point 
    id = i; // and closest intersected object
  }
 return t<inf; // returns true if an intersection with the scene occurred, false when no hit
}

// random number generator from https://github.com/gz/rust-raytracer

__device__ static float getrandom(unsigned int *seed0, unsigned int *seed1) {
 *seed0 = 36969 * ((*seed0) & 65535) + ((*seed0) >> 16);  // hash the seeds using bitwise AND and bitshifts
 *seed1 = 18000 * ((*seed1) & 65535) + ((*seed1) >> 16);

 unsigned int ires = ((*seed0) << 16) + (*seed1);

 // Convert to float
 union {
  float f;
  unsigned int ui;
 } res;

 res.ui = (ires & 0x007fffff) | 0x40000000;  // bitwise AND, bitwise OR

 return (res.f - 2.f) / 2.f;
}

// radiance function, the meat of path tracing 
// solves the rendering equation: 
// outgoing radiance (at a point) = emitted radiance + reflected radiance
// reflected radiance is sum (integral) of incoming radiance from all directions in hemisphere above point, 
// multiplied by reflectance function of material (BRDF) and cosine incident angle 
__device__ float3 radiance(Ray &r, unsigned int *s1, unsigned int *s2){ // returns ray color

 float3 accucolor = make_float3(0.0f, 0.0f, 0.0f); // accumulates ray colour with each iteration through bounce loop
 float3 mask = make_float3(1.0f, 1.0f, 1.0f); 

 // ray bounce loop (no Russian Roulette used) 
 for (int bounces = 0; bounces < 4; bounces++){  // iteration up to 4 bounces (replaces recursion in CPU code)

  float t;           // distance to closest intersection 
  int id = 0;        // index of closest intersected sphere 

// test ray for intersection with scene
  if (!intersect_scene(r, t, id))
   return make_float3(0.0f, 0.0f, 0.0f); // if miss, return black

  // else, we've got a hit!
  // compute hitpoint and normal
  const Sphere &obj = spheres[id];  // hitobject
  float3 x = r.orig + r.dir*t;          // hitpoint 
  float3 n = normalize(x - obj.pos);    // normal
  float3 nl = dot(n, r.dir) < 0 ? n : n * -1; // front facing normal

  // add emission of current sphere to accumulated colour
  // (first term in rendering equation sum) 
  accucolor += mask * obj.emi;

  // all spheres in the scene are diffuse
  // diffuse material reflects light uniformly in all directions
  // generate new diffuse ray:
  // origin = hitpoint of previous ray in path
  // random direction in hemisphere above hitpoint (see "Realistic Ray Tracing", P. Shirley)

  // create 2 random numbers
  float r1 = 2 * M_PI * getrandom(s1, s2); // pick random number on unit circle (radius = 1, circumference = 2*Pi) for azimuth
  float r2 = getrandom(s1, s2);  // pick random number for elevation
  float r2s = sqrtf(r2); 

  // compute local orthonormal basis uvw at hitpoint to use for calculation random ray direction 
  // first vector = normal at hitpoint, second vector is orthogonal to first, third vector is orthogonal to first two vectors
  float3 w = nl; 
  float3 u = normalize(cross((fabs(w.x) > .1 ? make_float3(0, 1, 0) : make_float3(1, 0, 0)), w));  
  float3 v = cross(w,u);

  // compute random ray direction on hemisphere using polar coordinates
  // cosine weighted importance sampling (favours ray directions closer to normal direction)
  float3 d = normalize(u*cos(r1)*r2s + v*sin(r1)*r2s + w*sqrtf(1 - r2));

  // new ray origin is intersection point of previous ray with scene
  r.orig = x + nl*0.05f; // offset ray origin slightly to prevent self intersection
  r.dir = d;

  mask *= obj.col;    // multiply with colour of object       
  mask *= dot(d,nl);  // weigh light contribution using cosine of angle between incident light and normal
  mask *= 2;          // fudge factor
 }

 return accucolor;
}


// __global__ : executed on the device (GPU) and callable only from host (CPU) 
// this kernel runs in parallel on all the CUDA threads

__global__ void render_kernel(float3 *output){

 // assign a CUDA thread to every pixel (x,y) 
 // blockIdx, blockDim and threadIdx are CUDA specific keywords
 // replaces nested outer loops in CPU code looping over image rows and image columns 
 unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;   
 unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;

 unsigned int i = (height - y - 1)*width + x; // index of current pixel (calculated using thread index) 

 unsigned int s1 = x;  // seeds for random number generator
 unsigned int s2 = y;

// generate ray directed at lower left corner of the screen
// compute directions for all other rays by adding cx and cy increments in x and y direction
 Ray cam(make_float3(50, 52, 295.6), normalize(make_float3(0, -0.042612, -1))); // first hardcoded camera ray(origin, direction) 
 float3 cx = make_float3(width * .5135 / height, 0.0f, 0.0f); // ray direction offset in x direction
 float3 cy = normalize(cross(cx, cam.dir)) * .5135; // ray direction offset in y direction (.5135 is field of view angle)
 float3 r; // r is final pixel color       
    
 r = make_float3(0.0f); // reset r to zero for every pixel 

 for (int s = 0; s < samps; s++){  // samples per pixel
    
  // compute primary ray direction
  float3 d = cam.dir + cx*((.25 + x) / width - .5) + cy*((.25 + y) / height - .5);
  
  // create primary ray, add incoming radiance to pixelcolor
  r = r + radiance(Ray(cam.orig + d * 40, normalize(d)), &s1, &s2)*(1. / samps); 
 }       // Camera rays are pushed ^^^^^ forward to start in interior 

 // write rgb value of pixel to image buffer on the GPU, clamp value to [0.0f, 1.0f] range
 output[i] = make_float3(clamp(r.x, 0.0f, 1.0f), clamp(r.y, 0.0f, 1.0f), clamp(r.z, 0.0f, 1.0f));
}

inline float clamp(float x){ return x < 0.0f ? 0.0f : x > 1.0f ? 1.0f : x; } 

inline int toInt(float x){ return int(pow(clamp(x), 1 / 2.2) * 255 + .5); }  // convert RGB float in range [0,1] to int in range [0, 255] and perform gamma correction

int main(){

 float3* output_h = new float3[width*height]; // pointer to memory for image on the host (system RAM)
 float3* output_d;    // pointer to memory for image on the device (GPU VRAM)

 // allocate memory on the CUDA device (GPU VRAM)
 cudaMalloc(&output_d, width * height * sizeof(float3));
        
 // dim3 is CUDA specific type, block and grid are required to schedule CUDA threads over streaming multiprocessors
 dim3 block(8, 8, 1);   
 dim3 grid(width / block.x, height / block.y, 1);

 printf("CUDA initialised.\nStart rendering...\n");
 
 // schedule threads on device and launch CUDA kernel from host
 render_kernel <<< grid, block >>>(output_d);  

 // copy results of computation from device back to host
 cudaMemcpy(output_h, output_d, width * height *sizeof(float3), cudaMemcpyDeviceToHost);  
 
 // free CUDA memory
 cudaFree(output_d);  

 printf("Done!\n");

 // Write image to PPM file, a very simple image file format
 FILE *f = fopen("smallptcuda.ppm", "w");          
 fprintf(f, "P3\n%d %d\n%d\n", width, height, 255);
 for (int i = 0; i < width*height; i++)  // loop over pixels, write RGB values
  fprintf(f, "%d %d %d ", toInt(output_h[i].x),
                          toInt(output_h[i].y),
                          toInt(output_h[i].z));
 
 printf("Saved image to 'smallptcuda.ppm'\n");

 delete[] output_h;
 system("PAUSE");
}

Optionally, the following 3D vector algebra functions can be inserted at the top of the file instead of #including "cutil_math.h". Instead of creating a Vector3D class (with 3 floats), CUDA's built-in float3 type is used instead as built-in types have automated memory alignment and provide higher for performance. The "__host__ __device__" keywords in front of the functions allow them to run on both the CPU and GPU.

// 3D vector algebra from cutil_math.h
/*DEVICE_BUILTIN*/
struct float3 {float x, y, z;};
typedef struct float3 float3;
// add
inline __host__ __device__ float3 operator+(float3 a, float3 b){return make_float3(a.x + b.x, a.y + b.y, a.z + b.z);}
inline __host__ __device__ void operator+=(float3 &a, float3 b){a.x += b.x; a.y += b.y; a.z += b.z;}
inline __host__ __device__ float3 operator+(float3 a, float b){return make_float3(a.x + b, a.y + b, a.z + b);}
inline __host__ __device__ float3 operator+(float b, float3 a){return make_float3(b + a.x, b + a.y, b + a.z);}
inline __host__ __device__ void operator+=(float3 &a, float b){a.x += b; a.y += b; a.z += b;}
// subtract
inline __host__ __device__ float3 operator-(float3 a, float3 b){return make_float3(a.x - b.x, a.y - b.y, a.z - b.z);}
inline __host__ __device__ void operator-=(float3 &a, float3 b){a.x -= b.x; a.y -= b.y; a.z -= b.z;}
inline __host__ __device__ float3 operator-(float3 a, float b){return make_float3(a.x - b, a.y - b, a.z - b);}
inline __host__ __device__ float3 operator-(float b, float3 a){return make_float3(b - a.x, b - a.y, b - a.z);}
inline __host__ __device__ void operator-=(float3 &a, float b){a.x -= b; a.y -= b; a.z -= b;}
// multiply
inline __host__ __device__ float3 operator*(float3 a, float3 b){return make_float3(a.x * b.x, a.y * b.y, a.z * b.z);}
inline __host__ __device__ void operator*=(float3 &a, float3 b){a.x *= b.x; a.y *= b.y; a.z *= b.z;}
inline __host__ __device__ float3 operator*(float3 a, float b){return make_float3(a.x * b, a.y * b, a.z * b);}
inline __host__ __device__ float3 operator*(float b, float3 a){return make_float3(b * a.x, b * a.y, b * a.z);}
inline __host__ __device__ void operator*=(float3 &a, float b){a.x *= b; a.y *= b; a.z *= b;}
// divide
inline __host__ __device__ float3 operator/(float3 a, float3 b){return make_float3(a.x / b.x, a.y / b.y, a.z / b.z);}
inline __host__ __device__ void operator/=(float3 &a, float3 b){a.x /= b.x; a.y /= b.y; a.z /= b.z;}
inline __host__ __device__ float3 operator/(float3 a, float b){return make_float3(a.x / b, a.y / b, a.z / b);}
inline __host__ __device__ void operator/=(float3 &a, float b){a.x /= b; a.y /= b; a.z /= b;}
inline __host__ __device__ float3 operator/(float b, float3 a){return make_float3(b / a.x, b / a.y, b / a.z);}
// min
inline __host__ __device__ float3 fminf(float3 a, float3 b){return make_float3(fminf(a.x, b.x), fminf(a.y, b.y), fminf(a.z, b.z));}
// max
inline __host__ __device__ float3 fmaxf(float3 a, float3 b){return make_float3(fmaxf(a.x, b.x), fmaxf(a.y, b.y), fmaxf(a.z, b.z));}
// lerp
inline __device__ __host__ float3 lerp(float3 a, float3 b, float t){return a + t*(b - a);}
// clamp value v between a and b
inline __device__ __host__ float clamp(float f, float a, float b){return fmaxf(a, fminf(f, b));}
inline __device__ __host__ float3 clamp(float3 v, float a, float b){return make_float3(clamp(v.x, a, b), clamp(v.y, a, b), clamp(v.z, a, b));}
inline __device__ __host__ float3 clamp(float3 v, float3 a, float3 b){return make_float3(clamp(v.x, a.x, b.x), clamp(v.y, a.y, b.y), clamp(v.z, a.z, b.z));}
// dot product
inline __host__ __device__ float dot(float3 a, float3 b){return a.x * b.x + a.y * b.y + a.z * b.z;}
// length
inline __host__ __device__ float length(float3 v){return sqrtf(dot(v, v));}
// normalize
inline __host__ __device__ float3 normalize(float3 v){float invLen = rsqrtf(dot(v, v));return v * invLen;}
// floor
inline __host__ __device__ float3 floorf(float3 v){return make_float3(floorf(v.x), floorf(v.y), floorf(v.z));}
// frac
inline __host__ __device__ float fracf(float v){return v - floorf(v);}
inline __host__ __device__ float3 fracf(float3 v){return make_float3(fracf(v.x), fracf(v.y), fracf(v.z));}
// fmod
inline __host__ __device__ float3 fmodf(float3 a, float3 b){return make_float3(fmodf(a.x, b.x), fmodf(a.y, b.y), fmodf(a.z, b.z));}
// absolute value
inline __host__ __device__ float3 fabs(float3 v){return make_float3(fabs(v.x), fabs(v.y), fabs(v.z));}
// reflect 
//returns reflection of incident ray I around surface normal N
// N should be normalized, reflected vector's length is equal to length of I
inline __host__ __device__ float3 reflect(float3 i, float3 n){return i - 2.0f * n * dot(n, i);}
// cross product
inline __host__ __device__ float3 cross(float3 a, float3 b){return make_float3(a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x);}

Full code on Github:

https://github.com/straaljager/GPU-path-tracing-with-CUDA-tutorial-1

In this example, it's pretty easy to turn C/C++ code into CUDA code (CUDA is a subset of the C language). The differences with the CPU version of smallpt are as follows:

smallpt's 3D Vector struct is replaced by CUDA's built-in float3 type (linear algebra vector functions for float3 are defined in cutil_math.h)
CUDA specific keyword __device__ before functions that should run on the GPU and are only callable from the GPU
CUDA specific keyword __global__ in front of the kernel that is called from the host (CPU) and which runs in parallel on all CUDA threads
a custom random number generator that runs on the GPU
as GPUs don't handle recursion well, the radiance function needs to be converted from a recursive function to an iterative function (see Richie Sam's blogpost or Karl Li's slides for more details) with a fixed number of bounces (Russian roulette could be implemented here to terminate paths with a certain probability, but I took it out for simplicity)
in a CPU raytracer, you loop over each pixel of the image with two nested loops (one for image rows and one for image columns). On the GPU the loops are replaced by a kernel which runs for each pixel in parallel. A global thread index is computed instead from the grid dimensions, block dimensions and local thread index. See http://www.3dgep.com/introduction-to-cuda-using-visual-studio-2008/ for more details
the main() function calls CUDA specific functions to allocate memory on the CUDA device (cudaMalloc()), launch the CUDA kernel using the "<<< grid, block >>>" syntax and copy the results (in this case the rendered image) from the GPU back to the CPU, where the image is saved in PPM format (a supersimple image format)

When running the code above, we get the following image (1024 samples per pixel, brute force path tracing):

Path traced color bleeding rendered entirely on the GPU! On my laptop's GPU (Geforce 840M) it renders about 24x faster than the multithreaded CPU version (laptop Core-i7 clocked at 2.00 Ghz). The neat thing here is that it only took about 100 lines (if you take out the comments) to get path tracing working on the GPU. The beauty lies in its simplicity.

Even though the path tracing code already works well, it is actually very unoptimized and there are many techniques to speed it up:

explicit light sampling (or next event estimation): sample the light source directly instead of using brute force path tracing. This makes an enormous difference in reducing noise.
jittered sampling (also called stratified sampling): instead of sampling a pixel randomly, divide the pixel up into a number of layers (strata) in which random sampling is performed. According to Peter Shirley's book this way of sampling (which is partly structured and partly random) is one of the most important noise reduction methods
better random number generators
various importance sampling strategies: this code already performs cosine weighted importance sampling for diffuse rays, favouring rays with directions that are closer to the normal (as they contribute more to the final image). See http://www.rorydriscoll.com/2009/01/07/better-sampling/.
ray tracing acceleration structures: kd-trees, octrees, grids, bounding volume hierarchies provide massive speedups

GPU specific optimisations (see http://www.3dgep.com/optimizing-cuda-applications/ and Karl Li's course slides linked below):

using shared memory and registers whenever possible is many times faster than using global/local memory
memory alignment for coalesced reads from GPU memory
thread compaction: since CUDA launches a kernel in groups of 32 threads in parallel ("warps"), threads taking different code paths can give rise to thread divergence which reduces the GPU's occupancy. Thread compaction aims to mitigate the effects of thread divergence by bundling threads following similar code paths

I plan to cover the following topics (with CUDA implementations) in upcoming tutorials whenever I find some time:

an interactive viewport camera with progressive rendering,
textures (and bump mapping),
environment lighting,
acceleration structures,
triangles and triangle meshes
building more advanced features on top of Aila and Laine's GPU ray tracing framework which is also used by Blender's Cycles GPU renderer
dissecting some code snippets from Cycles GPU render or SmallLuxGPU

References used:

smallpt by Kevin Beason
line-for-line explanation of smallpt code by David Cline
"Realistic Ray Tracing" book by Peter Shirley
Scratch-a-Pixel : excellent online rendering tutorials (ray tracing and rasterisation)
tokaspt: a very fast CUDA path tracer by Thierry Berger Perrin (renders only spheres)
CUDA path tracing tutorial on RichieSams blog
Raytracer in Rust
Triers CUDA ray tracing tutorial
Real-time CUDA raytracer (triangle meshes and BVH acceleration structure) by Thanassis Tsiodras
A simple real-time ray tracer in CUDA and OpenGL tutorial (+ code) by Igor Sevo
CUDA path tracer by Peter Kutz and Karl Li
UPenn CUDA path tracing course by Yining Karl Li: part 1, part 2, GitHub code project
University of Pennsylvania GPU Programming and Architecture course: excellent beginner to advanced level course to learn CUDA, path tracing in CUDA, CUDA optimisation and pitfalls, performance profiling with NSight
CUDA programming and optimisation tutorials
CUDA optimisation

33 comments:

tHe.mT said...: Nice!; October 5, 2015 at 1:10 AM
Anonymous said...: Hey, what's the current status of real-time ray tracing? Is good-looking real-time ray-tracing possible on current top GPUs?; October 5, 2015 at 6:12 AM
John said...: Since your main target is high-performance path-tracing (an MP3 of real rendering), you should invest as much money in improving ECL gate speed (bipolar transistors are already near-perfect since they are inherently superior to field-effect ones), this hugely imports. Performance losses due to the speed of light limit estimate even microseconds per integrated circuit. Thus, as you increase the number of components, the losses are also increased. However, that does not mean that CISC-RISC hardware translator, caches, multithreading, superscalarity, pipelining, out-of-order execution, etc., are clever solutions. The memories cannot get faster otherwise than they get more advanced basis in the form of a faster logic.; October 5, 2015 at 9:02 AM
John said...: Don't worry. ECL is at least four times as fast. Today.; October 6, 2015 at 3:31 PM
John said...: Shrinking transistors will come to end yet very soon and performance increase will simply stop (well, for you exactly). Have fun.; October 7, 2015 at 1:15 PM
Anonymous said...: Jensen - most guys here are software guys, can you explain why ECL would be justified in consumer PCs with its power envelope?; October 7, 2015 at 3:40 PM
John said...: Seems like you took me for that Jensen of Nvidia, although we were born the same day, perhaps we are even more alike. However, thanks for the question.

The answer is :

1} from approx. 500 MHz on, CMOS power dissipation is becoming ever higher than ECL ;

2} sufficiently low-power bipolar transistors have been around for twenty years now .; October 7, 2015 at 7:41 PM
Anonymous said...: Jenson: It doesn't seem like anyone has anything to say about using ECL on modern computers, from searching Google. There must be some reason people aren't using that technology, if it is indeed so much faster than CMOS. Why is nobody using it or talking about it?; October 8, 2015 at 12:09 AM
John said...: This is not my opinion, but is quite interesting:

'IC manufacturers continue to announce new generations, 45 nanometer, 32 nanometer, 22 nanometer, and coming soon, 14 nanometer, though in fact we have been stuck at 64 nanometer for quite some time. There have been process improvements, but these are incremental improvements. The line pitch remains 64 nanometers. We cannot actually build circuits smaller than we could eight years ago. That these are improvements is no lie, but to label them by a size is a lie. If they had said “second generation 64 nanometer” instead of “45 nanometer”, if they had said “third generation 64 nanometer instead of “32 nanometer”, then they would have been speaking truth, or at least speaking hype rather than lies.'; October 8, 2015 at 9:29 AM
John said...: No, actually, it's still in extensive use where speed is crucial, formally with a solo supplier. Why it's not considered to be used even by IBM is that once it was unable to compete with CMOS in terms of manufacturing costs, but Intel with its InGaAs initiative just will show what is truly expensive.; October 8, 2015 at 1:47 PM
John said...: Now, since the speed of light limits the performance of integrated circuits, logically the ideal for gate speed would be almost 10^-44 a second, that is finite, which is good. Current state is -10. If they could be capable of reducing this gap, then you might get a rendering capability superior to path-tracing.; October 9, 2015 at 9:20 AM
Anon said...: Many thanks for producing this code - I look forward to the remaining blog posts you have planned. However, I am struggling to get the code above compiling. The math util class has been removed from the latest version of cuda (7.5). I am unfamiliar with CUDA and have looked around but finding myself reimplementing the math functions. Do you know what libraries/function calls we can use with the latest version of cuda?; October 12, 2015 at 3:22 AM
Sam Lapere said...: Anon: thanks! Yes, cutil_math has been removed from later versions of the CUDA toolkit (it's from the CUDA SDK samples in the toolkit versions released before 2010), but you can still find the file here: http://bullet.googlecode.com/svn/trunk/Extras/CUDA/cutil_math.h (or http://www.icmc.usp.br/~castelo/CUDA/common/inc/cutil_math.h since Google code is about to go the way of the dodo soon). To make things easier to compile, I'll build the next tutorial using the CUDA 7.5 toolkit (CUDA 6.5 has some nonsensical bug when using sqrt()).

You can also have a look at thttp://www.3dgep.com/introduction-to-cuda-using-visual-studio-2008/, to set up your environment variables in Visual Studio to work with CUDA. Or the "Getting Started with CUDA" guide on Nvidia's own website: https://developer.nvidia.com/how-to-cuda-c-cpp

Let me know if you still have trouble compiling and I'll take a look.

Btw, I've updated the code in the post and it should run twice as fast as before, mainly by scheduling more blocks/CUDA threads (with dim3 block). It's about 40x faster than the multithreaded CPU version now and I'm sure I can get even more speed. The cool thing about GPUs is that their performance is so unpredictable and occasionally mind blowing (more often than not actually).; October 12, 2015 at 3:41 PM
Sam Lapere said...: To everyone posting about Brigade, I've removed your comments as the engine is no longer relevant to this blog.

If you want to know more about the technology, visit:

http://www.hindawi.com/journals/ijcgt/2013/578269/

or check out this thesis: http://arauna2.nhtv.nl/files/thesis_jbikker.pdf; October 12, 2015 at 6:57 PM
Unknown said...: Brigade not relevant to your blog?? Sure, why would it be? It's only a real time path tracer that you advertized as (one of) the best in your past posts. I don't know what beef you have with Otoy or the Brigade team which led you to leave Otoy, but your disregard of Brigade that stands like the elefant in the room significantly impairs the quality of your blog.; November 9, 2015 at 7:53 AM
Sam Lapere said...: Jey Hey: nice metaphor :) just to clarify, there are no more updates on Brigade because I stopped working on it about two years ago (my last post about it, "Le Brigade nouveau est arrivé" dates from October 2013).; November 9, 2015 at 3:25 PM
raul said...: this is very cool..it looks like raytrace in real time...it is just a mod. jesus

http://imgur.com/a/ib8nW#nnn; November 25, 2015 at 3:30 AM
Anonymous said...: Hey it seems Imagination Tech has finally something to show (in Tokyo).
http://www.4gamer.net/games/144/G014402/20151117095/

Here are some videos :
https://www.youtube.com/watch?v=A0keJiuEKdg
https://www.youtube.com/watch?v=NGWKcpJOdLk
https://www.youtube.com/watch?v=hsOuj3V4Adw

I found out that Samsung explores something similar for their mobile gpu´s.
http://dl.acm.org/citation.cfm?id=2818442

Is this kind of technology useful for desktop too or obsolete because of new
features like mixed precision ?; November 26, 2015 at 6:03 PM
Sam Lapere said...: raul: those are probably the best real-time graphics I've seen. The colour bleeding on the characters looks really nice as well. I wonder how well it holds up in motion.

CPFUUU: nice find, thanks for the links. What ImgTec is showing in these videos looks more like Whitted ray tracing to me, something which could be done in real-time about ten years ago on a hefty PC. Current game engines like Frostbite and Unreal Engine 4 can produce a much higher level of realism, so I doubt they're going to convince many game devs to switch to ray tracing with these videos. They need to show something that actually looks substantially better than what can be done already.; November 27, 2015 at 1:26 PM
Unknown said...: Thanks for starting this series, Sam! When I was in high school, after learning about projections and rotations in Algebra class, I made a triangle mesh viewer (just wireframes). I kind of always wanted to learn more about rendering techniques, so these posts will be very helpful.

I programmed with Cuda a bit a number of years ago, but haven't used Cuda recently, so I was happy you posted the links for the set-up process in Visual Studio.

Some notes to other beginners:
* If you use the 3dgep.com link to figure out how to configure Visual Studio and start projects, note that copying and pasting the additional libraries directory for the linker will (at least in some cases--it did for me) copy things that look like quotation marks, but are not parsed by Visual Studio correctly. To avoid annoying errors and frustration, just type the quotation marks manually and cudart.lib should be findable.

* If you copy the code that replaces cutil_math.h, note that the float3 bits (struct float3 ... and typedef struct float3 float3) lines aren't needed, and the first line can actually cause errors. Just comment them out.

* Irfanview and XnView are two (of probably very many) free programs that can open the ppm images created in this post.

* When you start to get curious about changing dimensions and sample sizes, execution time may increase and you may get messages like: "Display driver stopped responding and has recovered." This has to do with a watchdog timeout timer (set with a default of 2 s). You may have to modify your registry in Windows to avoid this problem... (Search for TdrDelay); December 23, 2015 at 11:44 AM
Sam Lapere said...: Thanks Peter. At some point I will rewrite this tutorial and make a step-by-step guide, so even people with minimal or no programming experience can get started with GPU path tracing, a bit like what Khan Academy is doing with their Rendering 101 course (https://www.khanacademy.org/partner-content/pixar/rendering/).; December 28, 2015 at 3:22 PM
MMMovania said...: Running this demo on the latest CUDA 8 sdk crashes the display driver. I think it is due to the cutil_math.h shared with this demo. Instead in CUDA 5 and above the math functions are all resident in the file helper_math.h in C:\ProgramData\NVIDIA Corporation\CUDA Samples\v8.0\common\inc on Windows and /usr/local/cuda/samples/common/inc on linux. Just include that file and the demo works fine.; October 22, 2016 at 10:24 AM
Belgin said...: Hello, Samuel. Your postings are highly interesting and informative.
I've recently ported the program you wrote here to the CPU and ran some tests.
Firstly, I've noticed you don't need to normalize when you compute (float3 u)
and (float3 d) inside the radiance function, as the vectors are already normalized,
and the image is the same either way.
Secondly, If I return (0,0,0), when a ray doesn't intersect the scene, the
image looks weird, so instead of "return make_float3(0, 0, 0) // if miss, return black"
I return accucolor instead.
Thirdly, the offset 0.05f is a bit too high for me (artifacts around sphere-sphere intersections appear), so I just set it to 0.0001f.

That's about it.
Thanks for your wonderful post.; December 23, 2016 at 4:17 AM
Unknown said...: I compiled it in CUDA 6.5 project with VS2013 and all I get is all black output file.
What should i do to get it working?
Thanks in advance.; March 24, 2017 at 6:33 AM
Unknown said...: I compiled it in a CUDA 6.5 project with VS2013 and the output .ppm is all black.
What should I do to get it working?
Thanks in advance.; March 24, 2017 at 6:35 AM
Sam Lapere said...: Kamil, a couple of things you could try (asuming you have an Nvidia card):

- make sure you compile the code in Release mode
- reduce the number of samples or turn off Windows watchdog, it crashes the Nvidia driver after 10 seconds of inactivity
- make all the spheres in the scene emissive; March 25, 2017 at 12:32 AM
Unknown said...: Hello Sam! I've got the same problem as Kamil's, and with reducing the number of samples, I can get the correct result.
You said "turn off windows watchdog", I googled a lot but still don't know how to do that. So what does the "watchdog" mean?
Thanks for your wonderful posts!; March 26, 2017 at 11:25 PM
Unknown said...: Hello Sam! I've got the same problem as Kamil's, and with reducing the number of samples, I can get the correct result.
You said "turn off windows watchdog", I googled a lot but still don't know how to do that. So what does the "watchdog" mean?
Thanks for your wonderful posts!; March 26, 2017 at 11:26 PM
Sam Lapere said...: Hi Berry, this thread may help: http://stackoverflow.com/questions/497685/cuda-apps-time-out-fail-after-several-seconds-how-to-work-around-this

The watchdog kicks in when the CUDA device (the GPU) takes more than a certain amount of time (usually 10 seconds) to return the result of a kernel:

https://devtalk.nvidia.com/default/topic/761381/cuda-programming-and-performance/this-code-crashes-the-driver-any-ideas-why-/post/4260886/#4260886; March 26, 2017 at 11:38 PM
Unknown said...: Really appreciate your fast response! I followed these intructions and added Tdr(Timeout detection) level and delay items in Registry, and it works!; March 27, 2017 at 12:08 AM
Sam Lapere said...: Good to know :); March 27, 2017 at 12:12 AM
Unknown said...: Thanks a lot, relase mode and reducing number of samples helped.; March 31, 2017 at 5:40 AM
Anonymous said...: Yo is there a reason your spheres are not mirror textured like in Kevin's?; April 21, 2020 at 7:48 PM