Overview

Thrust is C++ template library (much like STL, the standard template library) that attempts to encapsulate the details the host-device interface on systems with a GPU or coprocessor, thus freeing the programmer from having to deal with these details. Originally developed as a front-end to CUDA, it currently supports several back-ends, including OpenMP. Today you're going to create a Thrust version of a program to compute π using a Monte-Carlo approach.

Setup

To begin, make sure your repository is up-to-date with

git fetch && git pull

and then change into the 10-thrust directory.

Examine the file pi_sequential.cc, including carefully reading the comments that explain what the program does. Notice that the main() function does little more than process the command line, gather random samples, and then compute and display the estimate of π. The genRandomSamples() function initializes the PRNG (pseudo-random number generator) and then calls genSamples() to generate random ordered pairs and count those that lie within the unit circle.

The steps below will lead you through the process of converting this program to use Thrust. We'll make multiple small changes and use the compiler to verify that our intermediate code compiles correctly after each change. It won't, however, produce correct results until the final step.

Convert to use NVIDIA CUDA compiler

Make a copy of the source code with

cp pi_sequential.cc pi_thrust.cu

and note that the file name extension is now .cu.

Edit the file and make the following changes:

In the comments at the top of the file:
- Change Sequential version to Thrust version.
- Change the $Smake: line to be
```
$Smake: nvcc -O2 -o %F %f wtime.c
```
- Change the usage statement to reflect the program name pi_thrust.

Replace the #include <gsl/gsl_rng.h> line with the lines

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/random.h>
#include <thrust/functional.h>

Remove all the code and comments for the genSamples() and genRandomSamples() functions.
Seven lines above the bottom of the main() function, replace genRandomSamples(numSamples) with 0L.

Verify that your program compiles using the nvcc compiler by typing

smake pi_thrust.cu

You can try running the code, but it will only report 0 as the value of π.

Add a command-line option to set number of samples per thread

To make our program as simple as possible we will require that the threads each compute the same number of random samples. We'll set the default to 100 but allow the user to change this on the command line. We'll also adjust the total number of samples to be a multiple of this value.

Edit pi_thrust.cu and make the following changes.

In the comments at the top of the file, change the usage statement to be be
```
pi_thrust [-n NUM_SAMPLES] [-m SAMPLES_PER_THREAD] [-q]
```
This indicates that the user can use the -m switch to specify the number of samples for each thread to compute.
Just inside the top of the main() function's body add the line
```
long samplesPerThread = 100L;
```
In the while statement a few lines further down, change the string argument to the getopt() function to be “n:m:q”. This tells the option parser to look for the command-line switch -m followed by an argument.
In the switch statement just below, copy the case 'n': block to make a case 'm': block to set the variable samplesPerThread. Make changes to the fprintf() statements as necessary.
Change the usage statement in the fprintf() function in the default block of the switch statement to include the string [-m SAMPLES_PER_THREAD] so it matches the usage statement in the preamble comments.

Save the source code and recompile to make sure the changes don't cause any errors.

Create a functor to generate and count random samples

We need to create a functor to generate random samples on the GPU device, and also to count the number of them that lie within the unit circle. Our functor will be called genRandomSamples and have two class variables; a seed for the random number generator and the number of samples each thread is responsible for computing.

If necessary, refer to our textbook, lecture slides, or the SAXPY example found at Thrust Quick Start Guide and create a functor with the following structure. This functor will not do anything (yet) except return 0.

the name should be genRandomSamples.
there should be an unsigned int class variable named seed.
there should be a long class variable named samplesPerThread.
the constructor should accept two arguments of the appropriate types and initialize the class variables with them.
the operator() function should have the signature long operator()(int segment) const and initialize a long variable named count to 0L. It should also return count.

Save and recompile the program. It should run, but still only returns 0 as the estimate of π.

Filling out the functor

Now we can fill out the body of the operator() method. Since we'll need lots of random samples, we first need to know how to use Thrust to generate pseudorandom numbers. We can initialize Thrust's random number generator (RNG) with

  thrust::default_random_engine rng(seed);

to instantiate rng as a random number generator state variable. We can set the interval on which uniDist(rng) returns uniformly distributed random numbers to be [0,1) with

thrust::uniform_real_distribution<double> uniDist(0.0, 1.0);

Add the lines

// Set up random number generator to provide uniformly distributed
// pseudorandom numbers in the range [0.0, 1.0)
thrust::default_random_engine rng(seed);
thrust::uniform_real_distribution<double> uniDist(0.0, 1.0);

before the return statement in the operator() function.

One challenge with generating pseudorandom numbers in a parallel environment is that we don't want each thread or process to obtain the same pseudorandom numbers; we want each one to get a different sequence of numbers.

A Thrust application will generate a single sequence of numbers, but Thrust provides a way to tell each thread to select a different portion of the sequence to use. The method rng.discard(N) will cause the thread to ignore the first N numbers of the sequence. Assuming that

segment is a number 0, 1, 2,... indicating the block of the random sequence to use,
each thread generates samplesPerThread random samples, and
each sample consists of two random numbers,

the statement rng.discard(segment*samplesPerThread*2) will result in each thread getting it's own sequence of pseudorandom values.

Add the lines

// Skip parts of sequence of random numbers used by prior threads
rng.discard(segment*samplesPerThread*2);

to the body of the operator() method just after the lines that configure the PRNG.

Finally we need to generate samplesPerThread random points in the unit square and count the number of them that are contained in the unit circle. This can be done similarly to how it's done in pi_sequential.cc except that we'll only generate samplesPerThread points and use uniDist(rng) rather than gsl_rng_uniform(rng).

Adapt the body of the genSamples() function in pi_sequential.cc to the body of your functor.

Recompile and make sure there are no errors.

Calling the functor

We're almost done, we just need to make some changes in the main() function.

We have seen that when working directly with CUDA we should carefully determine the grid and block sizes as they can have a dramatic impact on performance. When using Thrust we only have indirect control on this, but we can control it somewhat by the amount of work we assign to each thread. Given that Thrust also works with OpenMP, where the number of threads is often much smaller than the number of CUDA threads, it is very helpful to have a way to set the amount of work assigned to each thread. It was for these reasons that we've already modified the source code to set the variable samplesPerThread.

Because of the way Thrust uses the functor, it is useful for each thread to process the same number of samples. So that our calculations are correct, we should make sure that the total number of samples is rounded up to be a multiple of the number of samples per thread. We can do this with code like

numSamples = ((numSamples - 1 + samplesPerThread) / samplesPerThread) * samplesPerThread

As Captain Picard says, “Make it so Number One!” Modify your code to compute numSamples using the code above. This line of code should appear just after the while loop that processes the command line.

Only one thing left to do; call the appropriate Thrust transform and reduction method to invoke our functor. Study the following statement and do not work any further until you have explained each part of it to someone else in our class (you should probably also listen nicely to their explanation).

long count = thrust::transform_reduce(
    thrust::counting_iterator<long>(0),
    thrust::counting_iterator<long>(numSamples / samplesPerThread),
    genRandomSamples(time(NULL), samplesPerThread),
    0,
    thrust::plus<long>()
    );

Before continuing, explain the purpose of each part of the thrust::transform_reduce() function call to the instructor.

Okay, last step.

replace the line long count = 0L; in the timed block near the bottom of main() with the statement shown above.

Running your program

Compile your program one more time and try running it. If all has gone well you should find that it gives reasonable estimates of π.

Try using the -n flag to change the total number of samples used, up to 500,000,000 or so. Notice that the program reports the number of samples actually used, which may be different than the number requested. Why?

Try using the -m flag to change the number of samples generated by each thread; try values between 10 and 100,000. Do you find any significant changes in execution time as this number gets small or large?

Using the OpenMP back-end

To compile your program to use OpenMP rather than CUDA, do the following:

Create a copy of your program with the extension .cc or .cpp. Alternatively, rather than copying the source, you can create a symbolic link with something like
```
ln -s pi_thrust.cu pi_thrust.cc
```
The first filename listed should be your existing program name.

You can now compile the “C++” version using OpenMP with

g++ -O2 -fopenmp -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP \
  -I/usr/local/cuda/include -o pi_thrust_omp pi_thrust.cc wtime.c

Test this version of your program with 500,000,000 samples and a range of samples per thread.

What to turn in

On Friday please turn in

A printed listing of your program.
The run-time data you gathered, including the number of samples per thread that gave you the shortest run-time.