A very general (and generic) introduction
CNRS
IMAG
Paul-Valéry Montpellier 3 University
Objective: accelerate computations <=> reduce computation time
Idea: run multiple tasks in parallel instead of sequentially
different tasks to complete
one or more workers to complete the tasks
n tasks to complete (n>1)
1 worker
Total time = \sum_{i=1}^{n} t_i \sim O(n) with t_i the time to complete task i
┌────────┐
│worker 1│
└────────┘
┌─
│ task 1
│ │
│ ▼
│ task 2
│ │
│ ▼
│ task 3
│ .
│ .
▼ .
Time
┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
│worker 1│ │worker 2│ │worker 3│ │worker 4│ ...
└────────┘ └────────┘ └────────┘ └────────┘
┌─
│ task 1 task 2 task 3 task 4 ...
▼
Time
n tasks to complete (n>1)
p workers (p>=n)
Total time (exercise)
\max_{i=1,\dots,n}\{t_i\}\sim O(1) with t_i the time to complete task i
Potential bottleneck? (exercise)
not enough workers to complete all tasks
n tasks to complete (n>1)
p workers (p<n)
Need: assign multiple tasks to each worker (and manage this assignment)
┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
│worker 1│ │worker 2│ │worker 3│ │worker 4│ ...
└────────┘ └────────┘ └────────┘ └────────┘
┌─
│ task 1 task 2 task 3 task 4 ...
│ │ │ │ │
│ ▼ ▼ ▼ ▼
│ task p+1 task p+2 task p+3 task p+4
│ │ │ │ │
│ ▼ ▼ ▼ ▼
│ . . . .
│ . . . .
▼ . . . .
Time
Total time (exercise)
\max_{k=1,\dots,p}\{T_k\}\sim O(n/p) with T_k = \sum_{i\in I_k} t_i the total time to complete all tasks assigned to worker k (where I_k is the set of indices of tasks assigned to worker k)
a task = “wait 1 \mu s”
Objective: run 100 tasks
Number of workers: 1, 2, 4, 6, 8
Why is the time gain not linear?
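A minimal benchmark sketch of this experiment, assuming Python's standard multiprocessing module (the slides do not prescribe a tool); the per-task work is so small that task dispatch overhead dominates, which is one reason the gain is not linear:

import time
from multiprocessing import Pool

def task(_):
    time.sleep(1e-6)  # a task = "wait 1 microsecond"

if __name__ == "__main__":
    for n_workers in (1, 2, 4, 6, 8):
        with Pool(n_workers) as pool:
            start = time.perf_counter()
            pool.map(task, range(100))  # run the 100 tasks
            print(f"{n_workers} worker(s): {time.perf_counter() - start:.6f} s")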
different tasks to complete
multiple workers to complete the tasks
one or more working resources
Potential bottleneck? (exercise)
not enough resources for all workers
Need: share the working resources among the workers (and manage this sharing)
Total time = ? (exercise)
Potential issues? (exercise)
┌──────────┐ ┌──────────┐
│resource 1│ │resource 2│ ...
└──────────┘ └──────────┘
┌─
│ task 1 task 2 ...
│ (worker 1) (worker 2)
│ │ │
│ ▼ ▼
│ task 3 task 4
│ (worker 3) (worker 4)
│ │ │
│ ▼ ▼
│ task p+1 task p+2
│ (worker 1) (worker 2)
│ │ │
│ ▼ ▼
│ task p+3 task p+4
│ (worker 3) (worker 4)
│ │ │
│ ▼ ▼
│ . .
│ . .
▼ . .
Time
Total time = \max_{\ell=1,\dots,q}\{\tau_\ell\}\sim O(n/q)
with \tau_\ell = \sum_{i\in J_\ell} t_i the total time to complete all tasks done on resource \ell (where J_\ell is the set of indices of tasks done on resource \ell)
Potential issues? multiple workers may want to use the same working resource at the same time
a task = “wait 1 \mu s”
Objective: run 100 tasks
Resources: 8 computing units
Number of workers: 1, 2, 4, 8, 16, 32
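A hedged sketch of this oversubscription experiment (the choice of multiprocessing and of a CPU-bound busy loop as stand-in for a task is ours, not prescribed by the slides). With only q computing units, pools larger than q workers cannot beat O(n/q) and typically get slower due to contention:

import os
import time
from multiprocessing import Pool

def busy_task(_):
    # CPU-bound stand-in for a task, so workers actually compete for cores
    return sum(range(10_000))

if __name__ == "__main__":
    print(f"computing units: {os.cpu_count()}")
    for n_workers in (1, 2, 4, 8, 16, 32):
        with Pool(n_workers) as pool:
            start = time.perf_counter()
            pool.map(busy_task, range(100))
            print(f"{n_workers:>2} workers: {time.perf_counter() - start:.4f} s")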
different tasks to complete
multiple workers to complete the tasks
one or more working resources
Input/Output (I/O)
Input: each task requires some material (data) to be completed; this material is stored in a storage area (memory)
Output: each task returns a result that needs to be written back to the storage area (memory)
Examples: vector/matrix/array operations, process the content of multiple files
Need: move data between the storage area and the workers (load inputs, write back results)
Total time = ? (exercise)
┌──────────┐
│resource 1│ load task 1 write load task 3 write
└──────────┘ data 1 ──► (worker 1) ──► result 1 ──► data 3 ──► (worker 3) ──► result 3 ──► . . .
┌──────────┐
│resource 2│ load task 2 write load task 4 write
└──────────┘ data 2 ──► (worker 2) ──► result 2 ──► data 4 ──► (worker 4) ──► result 4 ──► . . .
.
.
.
└─────────────────────────────────────────────────────────────────────────────────────►
Time
Total time = \max_{\ell=1,\dots,q}\{\tau_\ell\}
with \tau_\ell = \sum_{i\in J_\ell} (t_{i,\text{in}} + t_i + t_{i,\text{out}}) the total time to complete all tasks done on resource \ell (where J_\ell is the set of indices of tasks done on resource \ell)
Potential bottlenecks:
Overhead on memory access
a task = “compute the sum of a given row in a matrix”
Objective: compute all row-wise sums for a 10000 \times 1000 matrix (i.e. 10000 tasks)
Resources: 8 computing units
Number of workers: 1, 2, 4, 6, 8
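A sketch of this experiment, assuming numpy and multiprocessing (again our own choice of tools). Rows are shipped to the workers in chunks; the cost of moving matrix data to the workers and results back is exactly the I/O overhead discussed above:

import time
import numpy as np
from multiprocessing import Pool

def row_sums(rows):
    return rows.sum(axis=1)  # sum each row of the chunk

if __name__ == "__main__":
    mat = np.random.rand(10_000, 1_000)
    for n_workers in (1, 2, 4, 6, 8):
        chunks = np.array_split(mat, n_workers)  # one chunk of rows per worker
        with Pool(n_workers) as pool:
            start = time.perf_counter()
            result = np.concatenate(pool.map(row_sums, chunks))
            print(f"{n_workers} worker(s): {time.perf_counter() - start:.4f} s")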
Attention: “worker” may sometimes refer to a working resource in the literature
Sometimes tasks cannot be done in parallel
multi-core CPU: multiple computing units (called “cores”) in a single processor
different levels of local memory called “cache”
to run a computation: transfer data from shared memory to local cache (and vice-versa for results) \rightarrow potential bottleneck
┌─────────────────┬───────────────────────┐
│ │ │
┌──────────┴──┐ ┌─────── │ ────────┐ ┌─────── │ ────────┐
│ MEMORY │ │ CPU1 │ │ │ CPU2 │ │
│ │ │ ┌──────┴───────┐ │ │ ┌──────┴───────┐ │
│ │ │ │ Local Memory │ │ │ │ Local Memory │ │
│ │ │ └──────┬───────┘ │ │ └──────┬───────┘ │
│ │ │ │ │ │ │ │
│ │ │ ┌───┐ │ ┌───┐ │ │ ┌───┐ │ ┌───┐ │
│ │ │ │ C ├──┼──┤ C │ │ │ │ C ├──┼──┤ C │ │
│ │ │ └───┘ │ └───┘ │ │ └───┘ │ └───┘ │
└─────────────┘ │ │ │ │ │ │
│ ┌───┐ │ ┌───┐ │ │ ┌───┐ │ ┌───┐ │
│ │ C ├──┴──┤ C │ │ │ │ C ├──┴──┤ C │ │
│ └───┘ └───┘ │ │ └───┘ └───┘ │
│ │ │ │
└──────────────────┘ └──────────────────┘
“many-core” computing card
local memory
slower connection to shared memory than CPUs
to run a computation: transfer data from host shared memory to local memory (and vice-versa for results) \rightarrow potential bottleneck
┌───────────────────────────────────────────┐
│ │
│ ┌─────────────────┐ │
│ │ │ │
┌────────┴─┴──┐ ┌─────── │ ────────┐ ┌─────── │ ─────────┐
│ MEMORY │ │ CPU1 │ │ │ GPU │ │
│ │ │ ┌──────┴───────┐ │ │ ┌──────┴───────┐ │
│ │ │ │ Local Memory │ │ │ │ Local Memory │ │
│ │ │ └──────┬───────┘ │ │ └──────┬───────┘ │
│ │ │ │ │ │ │ │
│ │ │ ┌───┐ │ ┌───┐ │ │ ┌─┬─┬─┼─┬─┬─┬─┐ │
│ │ │ │ C ├──┼──┤ C │ │ │ │C│C│C│C│C│C│C│ │
│ │ │ └───┘ │ └───┘ │ │ ├─┼─┼─┼─┼─┼─┼─┤ │
└─────────────┘ │ │ │ │ │C│C│C│C│C│C│C│ │
│ ┌───┐ │ ┌───┐ │ │ ├─┼─┼─┼─┼─┼─┼─┤ │
│ │ C ├──┴──┤ C │ │ │ │C│C│C│C│C│C│C│ │
│ └───┘ └───┘ │ │ └─┴─┴─┴─┴─┴─┴─┘ │
│ │ │ │
└──────────────────┘ └───────────────────┘
CPU | GPU |
---|---|
tens (\sim 10) of computing units (“cores”) | thousands (\sim 1000) of computing units (“cores”) |
computing units capable of complex operations | computing units capable of simpler operations only |
larger cache memory per computing unit | very small cache memory per computing unit |
faster access to RAM | slower access to RAM |
\rightarrow efficient for general-purpose parallel programming (e.g. checking conditions) | \rightarrow fast for massively parallel computations based on simple elementary operations (e.g. linear algebra) |
\begin{bmatrix} \cdot & \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & a_{ij} & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot & \cdot \end{bmatrix}_{N \times P}
Row-wise sum: vector C = [c_{i}]_{i=1:N} of size N where c_{i} = \sum_{j=1}^P a_{ij}
Column-wise sum: vector D = [d_{j}]_{j=1:P} of size P where d_{j} = \sum_{i=1}^N a_{ij}
\begin{bmatrix} \cdot & \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & a_{ij} & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot & \cdot \end{bmatrix}_{N \times P}\ \ \rightarrow \ \ \begin{bmatrix} \vdots \\ \vdots \\ \sum_{j=1}^{P} a_{ij} \\ \vdots \\ \vdots \end{bmatrix}_{N \times 1}
\begin{array}{c} \begin{bmatrix} \cdot & \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & a_{ij} & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot & \cdot \end{bmatrix}_{N \times P} \\ \downarrow \\ \begin{bmatrix} \dots & \dots & \sum_{i=1}^{N} a_{ij} & \dots & \dots \end{bmatrix}_{1 \times P} \end{array}
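In numpy, both sums are one-liners (a small illustration of the definitions above, not part of the slides):

import numpy as np

A = np.arange(12).reshape(3, 4)  # a small N x P example (N=3, P=4)
C = A.sum(axis=1)                # row-wise sums:    c_i = sum_j a_ij, size N
D = A.sum(axis=0)                # column-wise sums: d_j = sum_i a_ij, size P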
Row-wise sum:
Exercise: parallel algorithm?
Solution 1?
Exercise: any concurrent access to memory by the parallel tasks? In input (reading)? In output (writing)?
Solution 1:
Solution 2: concurrent writes to the same entry vecD[i] \rightarrow need for synchronization (with a time cost)
Solution 3?
Any other issue?
1 launch of N parallel tasks, each running P operations
\rightarrow N “long” parallel tasks
Cost (in time) to launch parallel tasks \sim O(N)
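A minimal sketch contrasting the two launch strategies (library choice and helper names are ours): one task per row pays the O(N) launch cost, while one chunk of rows per worker launches only p tasks:

import numpy as np
from multiprocessing import Pool

def sum_row(row):
    return row.sum()          # one task per row: N launches

def sum_rows(rows):
    return rows.sum(axis=1)   # one task per chunk of rows: p launches

if __name__ == "__main__":
    mat = np.random.rand(1_000, 100)
    with Pool(4) as pool:
        vecC = np.array(pool.map(sum_row, mat))           # O(N) launch cost
        vecC2 = np.concatenate(
            pool.map(sum_rows, np.array_split(mat, 4)))   # O(p) launch cost
    assert np.allclose(vecC, vecC2)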
Solution 1?
Parallel column-wise vs parallel row-wise matrix sum algorithms
Matrix 10000 \times 10000
Objective: run 10000 tasks
Resources: 64 computing units
Number of workers: 1, 2, 4, 8, 16, 32
Exercise 1: why the performance degradation?
\rightarrow overhead for memory access
Exercise 2: why the performance difference?
\rightarrow impact of array storage order
Matrix in memory = a big array of contiguous rows (row major) or contiguous columns (column major)
Memory (row major): \begin{array}{|c|c|c|} \hline a_{11} & a_{12} & a_{13} \\ \hline \end{array}\ \begin{array}{|c|c|c|} \hline a_{21} & a_{22} & a_{23} \\ \hline \end{array}\ \begin{array}{|c|c|c|} \hline a_{31} & a_{32} & a_{33} \\ \hline \end{array}
Memory (column major): \begin{array}{|c|c|c|} \hline a_{11} & a_{21} & a_{31} \\ \hline \end{array}\ \begin{array}{|c|c|c|} \hline a_{12} & a_{22} & a_{32} \\ \hline \end{array}\ \begin{array}{|c|c|c|} \hline a_{13} & a_{23} & a_{33} \\ \hline \end{array}
Memory access: data is read from memory by blocks
To access a_{11} (row major): load \begin{array}{|c|c|c|} \hline a_{11} & a_{12} & a_{13} \\ \hline \end{array} into cache
To access a_{11} (column major): load \begin{array}{|c|c|c|} \hline a_{11} & a_{21} & a_{31} \\ \hline \end{array} into cache
To compute a_{11} + a_{12} + a_{13} (row major)?
res = 0
res = res + a_{11}
res = res + a_{12}
res = res + a_{13}
\rightarrow a single block load: \begin{array}{|c|c|c|} \hline a_{11} & a_{12} & a_{13} \\ \hline \end{array} is already in cache

To compute a_{11} + a_{12} + a_{13} (column major)?
res = 0
res = res + a_{11} (load \begin{array}{|c|c|c|} \hline a_{11} & a_{21} & a_{31} \\ \hline \end{array})
res = res + a_{12} (load \begin{array}{|c|c|c|} \hline a_{12} & a_{22} & a_{32} \\ \hline \end{array})
res = res + a_{13} (load \begin{array}{|c|c|c|} \hline a_{13} & a_{23} & a_{33} \\ \hline \end{array})
\rightarrow more memory accesses \rightarrow time consuming

Example: “big” matrix (4 \times 6)
\tiny \begin{array}{|c|c|c|c|c|c|} \hline a_{11} & a_{12} & a_{13} & a_{14} & a_{15} & a_{16} \\ \hline a_{21} & a_{22} & a_{23} & a_{24} & a_{25} & a_{26} \\ \hline a_{31} & a_{32} & a_{33} & a_{34} & a_{35} & a_{36} \\ \hline a_{41} & a_{42} & a_{43} & a_{44} & a_{45} & a_{46} \\ \hline \end{array}
Storage in memory (row major):
\tiny \begin{array}{|c|c|c|c|c|c|c|} \hline a_{11} & a_{12} & a_{13} & a_{14} & a_{15} & a_{16} \\ \hline \end{array} \begin{array}{|c|c|c|c|c|c|c|} \hline a_{21} & a_{22} & a_{23} & a_{24} & a_{25} & a_{26} \\ \hline \end{array} \begin{array}{|c|c|c|c|c|c|c|} \hline a_{31} & a_{32} & a_{33} & a_{34} & a_{35} & a_{36} \\ \hline \end{array} \begin{array}{|c|c|ccc} \hline a_{41} & a_{42} & \cdot & \cdot & \cdot \\ \hline \end{array}
Access by sub-blocks of data (e.g. sub-blocks of rows or columns)
Sum of row 1 (row major):
access block \begin{array}{|c|c|c|c|c|c|} \hline a_{11} & a_{12} & a_{13} & a_{14} & a_{15} & a_{16} \\ \hline \end{array} once, then res = res + a_{11}, \dots, res = res + a_{16}
Sum of row 1 (column major):
access block \begin{array}{|c|c|c|} \hline a_{11} & a_{21} & a_{31} \\ \hline \end{array} res = res + a_{11}
access block \begin{array}{|c|c|c|} \hline a_{12} & a_{22} & a_{32} \\ \hline \end{array} res = res + a_{12}
access block \begin{array}{|c|c|c|} \hline a_{13} & a_{23} & a_{33} \\ \hline \end{array} res = res + a_{13}
access block \begin{array}{|c|c|c|} \hline a_{14} & a_{24} & a_{34} \\ \hline \end{array} res = res + a_{14}
access block \begin{array}{|c|c|c|} \hline a_{15} & a_{25} & a_{35} \\ \hline \end{array} res = res + a_{15}
access block \begin{array}{|c|c|c|} \hline a_{16} & a_{26} & a_{36} \\ \hline \end{array} res = res + a_{16}
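The effect is easy to observe in numpy, where arrays are row major ("C order") by default and a column-major ("Fortran order") copy can be requested; a small demo (ours, not from the slides):

import time
import numpy as np

mat_c = np.random.rand(10_000, 10_000)  # row major: numpy's default "C order"
mat_f = np.asfortranarray(mat_c)        # column-major copy ("Fortran order")

for name, mat in (("row major", mat_c), ("column major", mat_f)):
    start = time.perf_counter()
    mat.sum(axis=1)                     # row-wise sums
    print(f"{name}: {time.perf_counter() - start:.4f} s")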
Matrix A = [a_{ij}]_{i=1:N}^{j=1:P} of dimension N\times{}P
Matrix B = [b_{jk}]_{j=1:P}^{k=1:Q} of dimension P\times{}Q
Matrix product: C = A \times B = [c_{ik}]_{i=1:N}^{k=1:Q} of dimension N\times{}Q where
c_{ik} = \sum_{j=1}^P a_{ij} \times b_{jk}
import numpy as np

# input
matA = np.array(...).reshape(N,P)
matB = np.array(...).reshape(P,Q)
# output
matC = np.zeros((N,Q))
# algorithm
for i in range(N):
for k in range(Q):
for j in range(P):
matC[i,k] += matA[i,j] * matB[j,k]
Exercise: parallel algorithm?
C = \begin{bmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \\ \end{bmatrix},\, A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \\ \end{bmatrix},\, B = \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \\ \end{bmatrix}
\begin{aligned} \begin{bmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{bmatrix} &= \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix} \\ &= \begin{bmatrix} A_{11} B_{11} + A_{12} B_{21} & A_{11} B_{12} + A_{12} B_{22} \\ A_{21} B_{11} + A_{22} B_{21} & A_{21} B_{12} + A_{22} B_{22} \end{bmatrix} \end{aligned}
\rightarrow see also “tiled implementation”
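A sketch of the 2 x 2 block product above in numpy (plain matrix products; a real tiled implementation would tune block sizes to the cache). The four blocks of C are independent, so each could be computed by a different worker:

import numpy as np

def block_product(A, B, h):
    # 2x2 block decomposition: the four blocks of C are independent tasks
    C = np.empty((A.shape[0], B.shape[1]))
    C[:h, :h] = A[:h, :h] @ B[:h, :h] + A[:h, h:] @ B[h:, :h]  # C11
    C[:h, h:] = A[:h, :h] @ B[:h, h:] + A[:h, h:] @ B[h:, h:]  # C12
    C[h:, :h] = A[h:, :h] @ B[:h, :h] + A[h:, h:] @ B[h:, :h]  # C21
    C[h:, h:] = A[h:, :h] @ B[:h, h:] + A[h:, h:] @ B[h:, h:]  # C22
    return C

A, B = np.random.rand(6, 6), np.random.rand(6, 6)
assert np.allclose(block_product(A, B, 3), A @ B)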
Initialization: f_0 = 0, f_1 = 1
Iteration: f_i = f_{i-1} + f_{i-2} for any i \geq 2
Sequence: 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, …
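The dependence is visible in code: each iteration needs the two previous results, so the loop below cannot be split into parallel tasks (a minimal illustration, ours):

def fibonacci(n):
    f = [0, 1]
    for i in range(2, n + 1):
        f.append(f[i - 1] + f[i - 2])  # needs the two previous iterations
    return f

print(fibonacci(19))  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, ...]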
Markov chain: sequence of random variables (X_i)_{i\geq 1} such that \mathbb{P}(X_i = x_i \vert X_1 = x_1, X_2 = x_2, \dots, X_{i-1} = x_{i-1}) = \mathbb{P}(X_i = x_i \vert X_{i-1} = x_{i-1})
X_i\in S where S is the state space
Example:
two states A and E, i.e. S = \{A, E\}
transition probability matrix:
\begin{array}{cl} \begin{array}{cc} A & E \end{array} \\ \left(\begin{array}{cc} 0.6 & 0.4 \\ 0.3 & 0.7 \\ \end{array}\right) & \begin{array}{l} A \\ E \end{array} \end{array}
Pick an initial state X_0 = x with x\in S
For i in 1,\dots,N: draw X_i according to the transition probabilities from the current state X_{i-1}
For the simulation: each draw uses the row of the transition matrix indexed by the current state (see the sketch below)
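A minimal simulation sketch, assuming numpy (the slides give only the chain and its transition matrix); note that each draw needs the realized value of the previous state:

import numpy as np

states = ["A", "E"]
trans = np.array([[0.6, 0.4],   # from A: P(A -> A), P(A -> E)
                  [0.3, 0.7]])  # from E: P(E -> A), P(E -> E)

rng = np.random.default_rng(0)
N = 20
x = 0                              # initial state X_0 = A
chain = [states[x]]
for _ in range(N):
    x = rng.choice(2, p=trans[x])  # draw X_i given X_{i-1}: inherently sequential
    chain.append(states[x])
print("".join(chain))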
Exercise: parallel algorithm?
NO! Each X_i depends on the realized value of X_{i-1}: the iterations are inherently sequential
Parallel computing can be used to run computations faster (i.e. save time)
Relationship between time gain and number of tasks run in parallel is not linear
Potential bottlenecks leading to potential performance loss:
not enough workers or working resources
concurrent access to working resources and to memory
overhead on memory access (I/O, array storage order)
cost of launching parallel tasks
Introduction to parallel computing
Advanced Programming and Parallel Computing, Master 2 MIASHS