TBB Lab exercises

Lab time 06/05/2016

Practical info

You can download the current TBB release on ottavinareale with:

wget https://www.threadingbuildingblocks.org/sites/default/files/software_releases/linux/tbb44_20160413oss_lin.tgz

Install TBB in your home directory on ottavinareale by unpacking the archive.
Set the TBB install directory inside the tbbvars.sh or tbbvars.csh script (depending on your shell).
Source tbbvars.sh with the arguments intel64 linux from your shell profile script, so that the environment variables it sets stay in effect in your shell.

Simple "farm" with Mandelbrot

Compute the Mandelbrot set as in the MPI lab farm exercise. The program parameters define a square n*n (or rectangular n*m) subset of the complex plane, the number of samples i,j for each coordinate, and the maximum number of iterations per point, MaxI.

The first example will not really be a farm: generate the input stream of points with a parallel_for (either two nested loops or a single loop with a 2D range). You will need to create the appropriate range(s) for that.

The computation for each point is given by the mandelbrot function, which returns the number of iterations until the point diverges, or MaxI if the limit is reached.
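For reference, one possible form of the sequential function is sketched here. The signature is an assumption (the MPI lab version may differ); the divergence test uses the usual escape radius 2, checked on the squared modulus to avoid a square root.

```cpp
#include <complex>

// Returns the number of iterations performed before |z| exceeds 2,
// or max_iter if the orbit of c has not diverged within the limit.
int mandelbrot(std::complex<double> c, int max_iter) {
    std::complex<double> z = 0.0;
    int i = 0;
    while (i < max_iter && std::norm(z) <= 4.0) {  // std::norm(z) is |z|^2
        z = z * z + c;
        ++i;
    }
    return i;
}
```

Points inside the set always cost MaxI iterations, while points far outside cost only a few; this is the source of the load imbalance studied in the rest of the exercise.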

Choose a plane region near the border of the Mandelbrot set, so that your computation includes both points inside the set and points outside it.

measure the execution time and the speedup, and how they vary with the iteration limit parameter
check what amount of parallelism TBB chooses
check whether the load is balanced
check whether the computation length per task is balanced (e.g. measure the number of tasks that reach MaxI or, even better, compute a cumulative histogram of the number of iterations over all processed tasks and print it)
check whether you need to tune the grain size in order to achieve a good speedup when MaxI is low
try manually choosing an amount of parallelism different from the one selected by TBB
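The cumulative histogram mentioned in the list can be computed after the parallel loop, as in this sketch. It assumes that each point stored its iteration count in its own slot of a vector, so that no synchronization was needed while computing; the function name is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Returns hist, where hist[k] is the number of points whose iteration
// count is <= k, for k in [0, max_iter].
// Assumes every value in `iterations` lies in [0, max_iter].
std::vector<std::size_t> cumulative_histogram(const std::vector<int>& iterations,
                                              int max_iter) {
    std::vector<std::size_t> hist(static_cast<std::size_t>(max_iter) + 1, 0);
    for (int it : iterations)
        ++hist[static_cast<std::size_t>(it)];   // plain histogram first
    for (std::size_t k = 1; k < hist.size(); ++k)
        hist[k] += hist[k - 1];                 // then running sum
    return hist;
}
```

The number of points that reached the limit is hist[MaxI] - hist[MaxI - 1]; a large value there, concentrated in a few tasks, is a sign of poor balance.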

Can you dynamically optimize the grain according to the input parameters?
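One possible starting point, offered only as a hypothetical heuristic and not as the expected answer: size each task so that its worst-case cost (grain * MaxI iterations) stays near a fixed budget. The function name and the default budget of 100000 iterations are assumptions to be calibrated by measurement.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical heuristic: choose the number of points per task so that a task
// performs at most about target_iters_per_task iterations in the worst case.
// A low max_iter yields a large grain; a high max_iter yields a grain near 1.
std::size_t choose_grain(std::size_t total_points, int max_iter,
                         std::size_t target_iters_per_task = 100000) {
    if (total_points == 0) return 1;
    std::size_t per_point = static_cast<std::size_t>(std::max(max_iter, 1));
    std::size_t grain = target_iters_per_task / per_point;
    return std::min(std::max<std::size_t>(grain, 1), total_points);
}
```

The result can be passed as the grain size of the range given to parallel_for. Because the real cost per point depends on where the region lies, a measured average iteration count may be a better estimate than MaxI.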

Things to do

decide how you will provide the sequential Mandelbrot function as loop body (via lambda or via a loop body class with () operator)
decide if you want to set up two nested parallel_for loops or a single loop with a 2D range (it is useful to try both and compare)
decide how to perform a rough check of the balancing of the load per task (histogram, percentage of lengthy tasks…)

Useful places around the Mandelbrot set

X,Y = square center	R = square size
X = -0.7463 Y = 0.1102	R = 0.005
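To turn grid indexes into points of such a region, a mapping like the following sketch can be used. It assumes that R is the full side length of the square (not a half-width) and that samples are taken at cell centers; the helper name is hypothetical.

```cpp
#include <complex>
#include <cstddef>

// Hypothetical helper: returns sample (i, j) of an n x m grid covering the
// square of side `size` centered at (cx, cy). Row index i maps to the
// imaginary axis, column index j to the real axis.
std::complex<double> sample_point(std::size_t i, std::size_t j,
                                  std::size_t n, std::size_t m,
                                  double cx, double cy, double size) {
    double x = cx - size / 2 + (static_cast<double>(j) + 0.5) * size / static_cast<double>(m);
    double y = cy - size / 2 + (static_cast<double>(i) + 0.5) * size / static_cast<double>(n);
    return std::complex<double>(x, y);
}
```

For the region listed above, the call would be sample_point(i, j, n, m, -0.7463, 0.1102, 0.005).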

Actual farm with Mandelbrot

Restructure the program to work with a stream of points (or a stream of sets of points) that are generated and enter a parallel_do. We still compute the Mandelbrot set function, but the sequential function is modified so that it can take as input (and produce as output) a partial computation on a point of the plane.

an input parameter iter_grain is specified in the program
the input of the sequential function contains
- the original point (either as a pair of indexes or as a complex number)
- the current coordinates (as a complex number)
- the current iteration number
each time, the function computes up to iter_grain iterations on the point (fewer, if MaxI is reached); if MaxI is reached, the point computation is completed, otherwise the point is emitted in output anyway (with its values updated)
the output of the sequential function has the same type as its input (it is best to define a data structure for it)
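The list above could translate into a record and a step function like the following sketch. The names PointState and advance are hypothetical, as is the choice of a linear index field to remember where the result belongs. The function updates the record in place, so the same structure serves as input and output.

```cpp
#include <algorithm>
#include <complex>
#include <cstddef>

// Hypothetical record for a partially computed point.
struct PointState {
    std::size_t index;        // original point, as a linear grid index
    std::complex<double> c;   // original point, as a complex number
    std::complex<double> z;   // current coordinates of the orbit
    int iter;                 // iterations performed so far
};

// Runs at most iter_grain further iterations on p, never going past max_iter.
// Returns true when the point is complete (it diverged or reached max_iter);
// on false, p holds the updated partial computation to be resumed later.
bool advance(PointState& p, int iter_grain, int max_iter) {
    int budget = std::min(iter_grain, max_iter - p.iter);
    for (int k = 0; k < budget; ++k) {
        if (std::norm(p.z) > 4.0) return true;   // diverged: |z| > 2
        p.z = p.z * p.z + p.c;
        ++p.iter;
    }
    return p.iter >= max_iter || std::norm(p.z) > 4.0;
}
```

Calling advance repeatedly on a point until it returns true leaves in p.iter the same value that a single uninterrupted Mandelbrot loop would return for that point.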

You can compute on points in the stream and recycle them if the computation takes too long. Use the parallel_do methods to reinsert into the loop those points that are not completed, and let those that are completed flow out of the loop.

Does this require a parallel_do that generates new tasks?
Basic use of parallel_do implies that the grain is always 1. Design a data structure that aggregates several points into a single parallel_do task. Minimize the data copying needed to manage the structure, and design a way to repack unfinished data points into new tasks while allowing finished computations to exit the loop. Does the new structure require inserting new tasks in the parallel_do?
examine two solutions:
1. the new tasks are inserted from within the loop itself, requiring a specific kind of parallel_do
2. the new tasks are generated outside of the loop; how can you manage the synchronization between the loop and the task generators in order to prevent the loop from exiting when new tasks are about to be added?

Extensions

If you consider that the computation can have a specific grain size (the size of the sets of points provided to each parallel_do inner function), the approach lends itself to becoming a sort of divide and conquer, where the long computations of some points are treated as “big tasks” that are fed back as input after some time. These tasks can be repacked so that completed points in the same task do not uselessly flow back into the loop. Proper design of the task structure minimizes data copying in this phase too.
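The repacking step can be sketched as an in-place partition of a task's points, as below. The function name repack and the use of a std::vector as the task container are assumptions; the completion test is passed in as a predicate so that the sketch does not depend on a particular point structure.

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Splits a task's points after one processing step: completed points are
// moved to `finished`, unfinished ones stay in `chunk`, compacted in place.
// The relative order of points is not preserved.
template <typename T, typename IsDone>
void repack(std::vector<T>& chunk, std::vector<T>& finished, IsDone is_done) {
    typename std::vector<T>::iterator first_done =
        std::partition(chunk.begin(), chunk.end(),
                       [&is_done](const T& p) { return !is_done(p); });
    std::move(first_done, chunk.end(), std::back_inserter(finished));
    chunk.erase(first_done, chunk.end());
}
```

After the call, the remaining chunk can be handed back to the loop by moving the vector (std::move) rather than copying it, and small leftover chunks can be merged before being fed back so that the grain does not shrink over time.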