Thrust is C++ template library (much like STL, the standard template library) that attempts to encapsulate the details the host-device interface on systems with a GPU or coprocessor, thus freeing the programmer from having to deal with these details. Originally developed as a front-end to CUDA, it currently supports several back-ends, including OpenMP. Today you're going to create a Thrust version of a program to compute π using a Monte-Carlo approach.
To begin, make sure your repository is up-to-date with
git fetch && git pulland then change into the
10-thrust
directory.
Examine the file pi_sequential.cc
, including carefully
reading the comments that explain what the program does. Notice that the
main()
function does little more than process the
command line, gather random samples, and then compute and display
the estimate of π. The genRandomSamples()
function
initializes the PRNG (pseudo-random number generator) and then
calls genSamples()
to generate random ordered pairs and
count those that lie within the unit circle.
The steps below will lead you through the process of converting this program to use Thrust. We'll make multiple small changes and use the compiler to verify that our intermediate code compiles correctly after each change. It won't, however, produce correct results until the final step.
Make a copy of the source code with
cp pi_sequential.cc pi_thrust.cu
and note that the file name extension is now .cu
.
Edit the file and make the following changes:
In the comments at the top of the file:
Sequential version
to Thrust
version
.
$Smake:
line to be
$Smake: nvcc -O2 -o %F %f wtime.c
pi_thrust
.
Replace the #include <gsl/gsl_rng.h>
line
with the lines
#include <thrust/host_vector.h> #include <thrust/device_vector.h> #include <thrust/transform_reduce.h> #include <thrust/random.h> #include <thrust/functional.h>
Remove all the code and comments for
the genSamples()
and genRandomSamples()
functions.
Seven lines above the bottom of the main()
function, replace genRandomSamples(numSamples)
with 0L
.
Verify that your program compiles using the nvcc
compiler by typing
smake pi_thrust.cu
You can try running the code, but it will only report 0 as the value of π.
To make our program as simple as possible we will require that the threads each compute the same number of random samples. We'll set the default to 100 but allow the user to change this on the command line. We'll also adjust the total number of samples to be a multiple of this value.
Edit pi_thrust.cu
and make the following changes.
In the comments at the top of the file, change the usage statement to be be
pi_thrust [-n NUM_SAMPLES] [-m SAMPLES_PER_THREAD] [-q]
This indicates that the user can use the -m
switch
to specify the number of samples for each thread to compute.
Just inside the top of the main()
function's body
add the line
long samplesPerThread = 100L;
In the while
statement a few lines further down,
change the string argument to the getopt()
function
to be “n:m:q”
. This tells the option
parser to look for the command-line switch -m
followed by an argument.
In the switch
statement just below, copy
the case 'n':
block to make a case 'm':
block to set the variable samplesPerThread
. Make
changes to the fprintf()
statements as necessary.
Change the usage statement in the fprintf()
function in the default
block of the switch statement
to include the string [-m SAMPLES_PER_THREAD]
so it
matches the usage statement in the preamble comments.
Save the source code and recompile to make sure the changes don't cause any errors.
We need to create a functor to generate random samples on the GPU
device, and also to count the number of them that lie within the
unit circle. Our functor will be
called genRandomSamples
and have two class variables;
a seed for the random number generator and the number
of samples each thread is responsible for computing.
If necessary, refer to our textbook, lecture slides, or the SAXPY example found at Thrust Quick Start Guide and create a functor with the following structure. This functor will not do anything (yet) except return 0.
unsigned int
class variable
named seed.long
class variable
named samplesPerThread.operator()
function should have the signature
long operator()(int segment) const
and initialize
a long
variable named count
to 0L
. It should also return count.Save and recompile the program. It should run, but still only returns 0 as the estimate of π.
Now we can fill out the body of the operator()
method.
Since we'll need lots of random samples, we first need to know how
to use Thrust to generate pseudorandom numbers. We can initialize
Thrust's random number generator (RNG) with
thrust::default_random_engine rng(seed);
to instantiate rng
as a random number generator state
variable. We can set the interval on
which uniDist(rng)
returns uniformly distributed random
numbers to be [0,1) with
thrust::uniform_real_distribution<double> uniDist(0.0, 1.0);
Add the lines
// Set up random number generator to provide uniformly distributed // pseudorandom numbers in the range [0.0, 1.0) thrust::default_random_engine rng(seed); thrust::uniform_real_distribution<double> uniDist(0.0, 1.0);
before the return
statement in
the operator()
function.
One challenge with generating pseudorandom numbers in a parallel environment is that we don't want each thread or process to obtain the same pseudorandom numbers; we want each one to get a different sequence of numbers.
A Thrust application will generate a single sequence of numbers, but
Thrust provides a way to tell each thread to select a
different portion of the sequence to use. The
method rng.discard(N)
will cause the thread to ignore
the first N numbers of the sequence. Assuming that
segment
is a number 0, 1, 2,... indicating the
block of the random sequence to use,samplesPerThread
random
samples, andthe statement rng.discard(segment*samplesPerThread*2)
will result in each thread getting it's own sequence of pseudorandom
values.
Add the lines
// Skip parts of sequence of random numbers used by prior threads rng.discard(segment*samplesPerThread*2);
to the body of the operator()
method just after the lines that configure the PRNG.
Finally we need to generate samplesPerThread
random
points in the unit square and count the number of them that are
contained in the unit circle. This can be done similarly to how
it's done in pi_sequential.cc
except that we'll only
generate samplesPerThread
points and
use uniDist(rng)
rather
than gsl_rng_uniform(rng)
.
Adapt the body of the genSamples()
function
in pi_sequential.cc
to the body of your functor.
Recompile and make sure there are no errors.
We're almost done, we just need to make some changes in
the main()
function.
We have seen that when working directly with CUDA we should
carefully determine the grid and block sizes as they can have a
dramatic impact on performance. When using Thrust we only have
indirect control on this, but we can control it somewhat by the
amount of work we assign to each thread. Given that Thrust also
works with OpenMP, where the number of threads is often much smaller
than the number of CUDA threads, it is very helpful to have a way
to set the amount of work assigned to each thread. It was for these
reasons that we've already modified the source code to set the
variable samplesPerThread
.
Because of the way Thrust uses the functor, it is useful for each thread to process the same number of samples. So that our calculations are correct, we should make sure that the total number of samples is rounded up to be a multiple of the number of samples per thread. We can do this with code like
numSamples = ((numSamples - 1 + samplesPerThread) / samplesPerThread) * samplesPerThread
As Captain Picard says, “Make it so Number One!”
Modify your code to compute numSamples
using the code
above. This line of code should appear just after
the while
loop that processes the command line.
Only one thing left to do; call the appropriate Thrust transform and reduction method to invoke our functor. Study the following statement and do not work any further until you have explained each part of it to someone else in our class (you should probably also listen nicely to their explanation).
long count = thrust::transform_reduce( thrust::counting_iterator<long>(0), thrust::counting_iterator<long>(numSamples / samplesPerThread), genRandomSamples(time(NULL), samplesPerThread), 0, thrust::plus<long>() );
Before continuing, explain the purpose of each part of
the thrust::transform_reduce()
function call to the
instructor.
Okay, last step.
replace the line long count = 0L;
in the timed block
near the bottom of main()
with the statement shown
above.
Compile your program one more time and try running it. If all has gone well you should find that it gives reasonable estimates of π.
Try using the -n
flag to change the total number of
samples used, up to 500,000,000 or so. Notice that the program
reports the number of samples actually used, which may be
different than the number requested. Why?
Try using the -m
flag to change the number of
samples generated by each thread; try values between 10 and
100,000. Do you find any significant changes in execution time as
this number gets small or large?
To compile your program to use OpenMP rather than CUDA, do the following:
.cc
or .cpp
. Alternatively, rather than
copying the source, you can create a symbolic link with something
like
ln -s pi_thrust.cu pi_thrust.cc
The first filename listed should be your existing program name.
g++ -O2 -fopenmp -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP \ -I/usr/local/cuda/include -o pi_thrust_omp pi_thrust.cc wtime.c
Test this version of your program with 500,000,000 samples and a range of samples per thread.
On Friday please turn in