==== TBB Lab exercises ====

Lab time 06/05/2016

=== Practical info ===

  * You can download the current TBB release on ottavinareale with ''wget https://www.threadingbuildingblocks.org/sites/default/files/software_releases/linux/tbb44_20160413oss_lin.tgz''
  * Install TBB in your home directory on ottavinareale by unpacking the archive
  * Set your TBB install directory inside the tbbvars.sh script or tbbvars.csh (depending on your shell)
  * Call ''tbbvars.sh intel64 linux'' from your shell profile script

=== Simple "farm" with Mandelbrot ===

Compute the Mandelbrot set as in the MPI lab farm exercise. Program parameters define a square //n*n// (or rectangular //n*m//) subset of the complex plane, the number of samples //i,j// along each coordinate, and the maximum number of iterations per point //MaxI//.

The first example will not really be a farm: generate the input stream of points with a parallel for (either two nested loops or a single one with a 2D range). You will need to create appropriate range(s) for that. The computation for each point is given by the Mandelbrot function, which returns the number of iterations until the point diverges, or //MaxI// if the limit is exceeded. Choose a plane region near the border of the Mandelbrot set, so that your computation includes both points inside the set and points outside it.

  * measure the execution time and the speedup as the iteration limit parameter varies;
  * check what amount of parallelism TBB chooses
  * check if the load is balanced
  * check if the computation length per task is balanced (e.g. measure the number of tasks that reach //MaxI//, or even better compute a cumulative histogram of the iteration counts of all processed tasks and print it in output; a sketch of such a histogram follows this section)
  * check if you need to tune the grain size in order to achieve a good speedup when //MaxI// is low
  * try manually choosing a different amount of parallelism from that selected by TBB

Can you dynamically optimize the grain according to the input parameters?

== Things to do ==

  * decide how you will provide the sequential Mandelbrot function as the loop body (via a lambda or via a loop body class with an () operator)
  * decide whether to set up two nested parallel_for loops or a single loop with a 2D range (it is useful to try both and compare; a sketch of the 2D-range variant follows this section)
  * decide how to perform a rough check of the load balancing per task (histogram, percentage of lengthy tasks...)

== Useful places around the Mandelbrot set ==

^ X,Y = square center ^ R = square size ^
| X = -0.7463  Y = 0.1102 | R = 0.005 |
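A minimal sketch of the single-loop variant with a 2D range follows; the ''mandel'' kernel and all parameter names are illustrative choices, not prescribed by the exercise.

<code cpp>
#include <tbb/parallel_for.h>
#include <tbb/blocked_range2d.h>
#include <complex>
#include <vector>

// Sequential kernel: number of iterations until divergence, or MaxI.
int mandel(double cx, double cy, int MaxI) {
    std::complex<double> c(cx, cy), z(0.0, 0.0);
    int it = 0;
    while (it < MaxI && std::norm(z) <= 4.0) { z = z * z + c; ++it; }
    return it;
}

void mandel_image(std::vector<int>& out, int n, int m,
                  double x0, double y0, double step, int MaxI) {
    tbb::parallel_for(
        tbb::blocked_range2d<int>(0, n, 0, m),   // default (automatic) grain size
        [&](const tbb::blocked_range2d<int>& r) {
            for (int i = r.rows().begin(); i != r.rows().end(); ++i)
                for (int j = r.cols().begin(); j != r.cols().end(); ++j)
                    out[i * m + j] = mandel(x0 + j * step, y0 + i * step, MaxI);
        });
}
</code>

To experiment with the grain size, ''blocked_range2d'' also accepts explicit per-dimension grain sizes (''tbb::blocked_range2d<int>(0, n, grain_r, 0, m, grain_c)''), typically combined with ''tbb::simple_partitioner()'' passed as the last argument of parallel_for; to choose the amount of parallelism manually, construct a ''tbb::task_scheduler_init init(nthreads);'' object before the loop.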
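For the balancing check, one simple option is a shared histogram of per-point iteration counts kept in atomic counters. A sketch, assuming the loop body calls ''record()'' for every computed point (the number of bins is an arbitrary choice):

<code cpp>
#include <atomic>
#include <cstdio>

constexpr int NBINS = 16;
std::atomic<long> histo[NBINS];   // zero-initialized at namespace scope

// Map an iteration count in [0, MaxI] onto one of the NBINS buckets.
inline void record(int iters, int MaxI) {
    int bin = (iters * NBINS) / (MaxI + 1);
    histo[bin].fetch_add(1, std::memory_order_relaxed);
}

// After the parallel_for: print the cumulative histogram.
void print_cumulative(int MaxI) {
    long cum = 0;
    for (int b = 0; b < NBINS; ++b) {
        cum += histo[b].load();
        std::printf("iters <= %4d : %ld points\n",
                    ((b + 1) * (MaxI + 1)) / NBINS - 1, cum);
    }
}
</code>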
=== Actual farm with Mandelbrot ===

Restructure the program to work with a stream of points (or a stream of sets of points) that are generated and enter a parallel_do; we still compute the Mandelbrot set function, but the sequential function is modified so that it can take as input (and produce as output) a partial computation on a point of the plane.

  * an input parameter //iter_grain// is specified in the program
  * the input of the sequential function contains
    * the original point (either as a couple of indexes or as a complex number)
    * the current coordinates (as a complex number)
    * the current iteration number
  * each time it is called, the function computes up to //iter_grain// iterations on the point (fewer, if //MaxI// is reached); if //MaxI// is reached, the point computation is completed, otherwise the point is emitted in output anyway (with its values updated)
  * the output of the sequential function is the same as its input (better to define a data structure for it)

You can compute on points in the stream and recycle them if the computation takes too long. Use the parallel_do methods to reinsert into the loop those points that are not completed, and let those that are completed flow out of the do loop; a single-point sketch is given at the end of this page.

  * does this require a parallel_do that generates new tasks?
  * basic use of parallel_do implies the grain is always 1. Design a data structure that can aggregate several points into a single parallel_do task. Minimize the data copying required to manage the structure, and design a way to repack unfinished data points into new tasks while allowing finished computations to exit the loop. Does the new structure require inserting new tasks into the parallel_do?
  * examine two solutions:
    - the new tasks are inserted from within the loop itself, requiring a specific kind of parallel_do
    - the new tasks are generated outside of the loop; how can you manage the synchronization between the loop and the task generators in order to prevent the loop from exiting while new tasks are about to be added?

== Extensions ==

If you consider that the computation can have a specific grain size (the size of the sets of points provided to each parallel_do inner function), the approach lends itself to becoming a sort of divide and conquer, where the long computations of some points are treated as "big tasks" that are fed back to the input after some time. These tasks can be repacked so that completed points in the same task do not uselessly flow back into the loop; a block-based sketch is given below. Proper design of the task structure minimizes data copying in this phase too.
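A minimal sketch of the single-point feedback loop, using TBB's parallel_do feeder, which is the mechanism for adding new tasks from within the loop body itself; the ''PointTask'' record and the ''mandel_step'' partial-computation function are illustrative names:

<code cpp>
#include <tbb/parallel_do.h>
#include <algorithm>
#include <complex>
#include <vector>

// Partial computation state for one point of the plane.
struct PointTask {
    int i, j;                    // the original point, as a couple of indexes
    std::complex<double> c;      // the point as a complex number
    std::complex<double> z;      // the current coordinates
    int iter;                    // the current iteration number
};

// Advance one point by at most iter_grain iterations; true when completed.
bool mandel_step(PointTask& p, int iter_grain, int MaxI) {
    int limit = std::min(p.iter + iter_grain, MaxI);
    while (p.iter < limit && std::norm(p.z) <= 4.0) {
        p.z = p.z * p.z + p.c;
        ++p.iter;
    }
    return p.iter == MaxI || std::norm(p.z) > 4.0;
}

void farm(std::vector<PointTask>& stream, std::vector<int>& out,
          int m, int iter_grain, int MaxI) {
    tbb::parallel_do(stream.begin(), stream.end(),
        [&](PointTask& p, tbb::parallel_do_feeder<PointTask>& feeder) {
            if (mandel_step(p, iter_grain, MaxI))
                out[p.i * m + p.j] = p.iter;   // completed: flows out of the loop
            else
                feeder.add(p);                 // recycle the unfinished point
        });
}
</code>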
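A sketch of the loop body for the aggregated version, reusing ''PointTask'' and ''mandel_step'' from the previous sketch: points travel in blocks, completed points are written out, and only the unfinished ones are repacked into a new block, so finished computations never re-enter the loop (''Block'' is again an assumed name):

<code cpp>
// A block of points processed as one parallel_do task.
struct Block {
    std::vector<PointTask> pts;
};

void farm_blocks(std::vector<Block>& stream, std::vector<int>& out,
                 int m, int iter_grain, int MaxI) {
    tbb::parallel_do(stream.begin(), stream.end(),
        [&](Block& b, tbb::parallel_do_feeder<Block>& feeder) {
            Block pending;
            for (PointTask& p : b.pts) {
                if (mandel_step(p, iter_grain, MaxI))
                    out[p.i * m + p.j] = p.iter;   // completed point exits here
                else
                    pending.pts.push_back(p);      // repack unfinished point
            }
            if (!pending.pts.empty())
                feeder.add(pending);               // feed the repacked block back
        });
}
</code>

Note that ''feeder.add'' copies the block it is given; designing the task structure so that this copy stays small (only the still-unfinished points) is exactly the data-copying question raised above.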