Class work 1

Instructions to use the remote machine

Once you have implemented on your own laptop the program, in order to test it on andromeda (the 8 core, 16 contexts machine remotely available, do the following steps:

ssh spm@andromeda.di.unipi.it
create a subdirectory with your name, e.g. mkdir marcodanelutto
exit
scp localfile.cpp spm@andromeda.di.unipi.it:marcodanelutto/
ssh spm@andromeda.di.unipi.it
cd marcodanelutto
then compile with gcc/g++ and run as you did on your own laptop

Pthread routines of interests

pthread_mutex_t the type of the mutexes (to be initialized with a PTHREAD_MUTEX_INITIALIZER)
pthread_cont_t the type of the condition variables (to be initialized with a PTHREAD_COND_INITIALIZER)
pthread_mutex_lock(pthread_mutex_t *)
pthread_mutex_unlock(pthread_mutex_t *)
pthread_cond_signal(pthread_cond_t *)
pthread_cond_wait(pthread_cond_t *, pthread_mutex_lock *)
pthread_create(pthread_t *tid, pthread_attr_t attributes, void *(*routine) (void *) , void * param) and pthread_join(pthread_t tid, void * *retval) should be used to create a thread and await termination

More grain in a computation

To increase the time spent in the f(x) computed by the pipeline stage or by the map workers, use a loop such as

dummyload.c

for(int i=0; i<LARGEVALUE; i++) 
   x = sin(x);

For large values of iterations (millions) it is easy to get a dely ranging in fractions of seconds.

Sample pipeline

Using C (or C++) and pthread (only), implement a three stage pipeline where:

the first stage generates a stream of floats from 0.00 to max (max given through command line parameters)
the second stage squares the items appearing onto the input stream and delivers them to the output stream
the last stage accumulates (summing them up to 0.00) all the values received onto the input stream.

Each stage should be implemented as a concurrent activity (a pthread).

The stream should be implemented in such a way a special tag “EOS” is propagate when no more stream items are available. Concurrent activities receiving (or generating, in case of the first stage) the EOS should propagate (if the case) the EOS and then terminate.

The communication channels implementing the stream should be properly implemented using any of the standard data structures available in C/C++ or in the standard libraries.

Proper “timing” code should be used to print the amount of time spent computing the whole pipeline.

Sample code

Sample map

Using a code similar to the one used to implement the farm (the one discussed in class lessons), implement a map where a floating point vector of N items is processed by *NW* workers to obtain the vector with each item being the square of its original value. As for the other exercise, you should implement the map using (only) pthreads. Once it works, modify the code in such a way a stream of vectors is processed, rather than a single one.

Class work 2

Crivello di Eratostene

Set up a pipeline filtering a stream of integers from 2 to MAX, such that:

a first stage generates integers towards the rest of the pipeline stages
the I-th stage:
- if has no number stored, stores the first number received and does not forward it
- if it has a number stored
  - passes the received number if it is not multiple of the stored integer
  - discards the received number if it is a multiple of the stored integer
after EOS, each stage prints its stored number (a prime number)

       2
[GEN] ---> [CHK( )] ---> [CHK( )] ---> [CHK( )] ---> [DISCARD]
       3     
[GEN] ---> [CHK(2)] ---> [CHK( )] ---> [CHK( )] ---> [DISCARD]
       4       3
[GEN] ---> [CHK(2)] ---> [CHK( )] ---> [CHK( )] ---> [DISCARD]
       5       (4)   3                                           4 discarded
[GEN] ---> [CHK(2)] ---> [CHK( )] ---> [CHK( )] ---> [DISCARD]
       6       5             
[GEN] ---> [CHK(2)] ---> [CHK(3)] ---> [CHK( )] ---> [DISCARD]
       7       (6)   5                                           6 discarded
[GEN] ---> [CHK(2)] ---> [CHK(3)] ---> [CHK( )] ---> [DISCARD]
       8       7                   5             
[GEN] ---> [CHK(2)] ---> [CHK(3)] ---> [CHK(5)] ---> [DISCARD]

Adding stages to a pipeline

A pipeline may be created and stages may be added by invoking the method add_stage, e.g. as in

myPipe.add_stage(new Stage(...))

Prime numbers in an interval

Consider the possibility to list all the prime numbers in an interval [MIn, MAX] by:

splitting the interval in a number nw of subintervals
assigning each subinterval to a farm worker
the worker
- for each i in the interval
  - checks if i is prime
    - either using the test a^(i-1) mod i == 1? (this may give false positives)
    - or using the test at this page

This implies using a completely different parallel pattern with respect to the previous exercise, of course.

Sample solutions

Sample solutions for the prime number examples here

Class work 3

Matrix multiplication

Implement a C++ program computing the standard ijk matrix multiplication algorithm between two matrices A and B of size MxP and PxN, respectively.

Then parallelize the code using the FastFlow ParallelFor and ParalleForReduce patterns. Implement three versions:

ParallelFor applied to the first (external) loop (index i)
ParallelFor applied to the second loop (index j)
ParallelForReduce applied to the third loop (index k)

The seq ijk matrix multiplication code looks as follows:

for(size_t i=0; i<M; i++) 
  for(size_t j=0; j<P; j++) {
    C[i][j] = 0.0;
    for(size_t k=0; k<P; k++) 
       C[i][j] += A[i][k]*B[k][j];
  }

Sample solutions

See the matmul example in the FastFlow tutorial source code.

Class work 4

Macro Data-Flow matrix multiplication

Implement by using the FastFlow MDF pattern (ff_mdf), the matrix multiplication algorithm starting from the following sequential code (matrices are of size NxN, double precision elements) :

for(size_t i=0;i<N; ++i)
  for(size_t j=0;j<N;++j) {
    A[i*N+j] = i+j;
    B[i*N+j] = i*j;
  }
for(size_t i=0;i<N; ++i)
  for(size_t j=0;j<N;++j) {
    C[i*N +j] = 0;
    for(size_t k=0;k<N;++k) {
      C[i*N + j] += A[ i*N+k ]*B[ j*N+k ];  // B is transposed !
    }
  }

The initialization of the two matrices A and B has to be overlapped with the computation of the elements of the C matrix.

Indice

Class work 1

Instructions to use the remote machine

Pthread routines of interests

More grain in a computation

Sample pipeline

Sample map

Class work 2

Crivello di Eratostene

Adding stages to a pipeline

Prime numbers in an interval

Sample solutions

Class work 3

Matrix multiplication

Sample solutions

Class work 4

Macro Data-Flow matrix multiplication

Sample solutions

Class work 5

Map skeleton with dynamic scheduling

Sample solutions