The World of big clusters and complex message passing
CNRS
IMAG
Paul-Valéry Montpellier 3 University
GPU = graphics processing unit
“(…) a specialized electronic circuit designed to rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display device. GPUs are used in embedded systems, mobile phones, personal computers, workstations, and game consoles.” (Wikipedia)
Original purpose: image creation and manipulation for graphical rendering (video games, video editing, 3D design, etc.)
“Modern GPUs are very efficient at manipulating computer graphics and image processing. Their highly parallel structure makes them more efficient than general-purpose central processing units (CPUs) for algorithms that process large blocks of data in parallel.” (Wikipedia)
Current usage: more general massive computations based on matrix operations
Basic principle (present in most modern computing units) with three interconnected parts: control units, arithmetic logic units (ALUs), and memory
Principle: control units are large and expensive, while arithmetic units are simpler and cheaper
Design: a sub-processor = one control unit commanding several ALUs (= GPU cores) to operate on larger chunks of data
Limitation: all ALUs connected to the same control unit must obey the same commands (i.e. do the same thing at the same moment)
ALU blocks (= GPU sub-processors) are not as versatile as CPU cores but can operate over large amounts of data in batches
The central control unit syncs all the sub-processors (each one can run a different task), but the same set of ALUs cannot do different things in parallel (e.g. if statements are costly for GPUs)
Example: each block of 16 ALUs is limited to processing the same instruction over 16 pairs of operands
CPU cores optimized for latency: to finish a task as fast as possible.
\rightarrow scales with more complex operations but not with larger data
GPU cores optimized for throughput: individually slower, but they operate on bulks of data at once.
\rightarrow scales with larger data but not with more complex operations
Travelling by road:
Total time to transport 5 people over 100 km (latency):
Total time to transport 160 people over 100 km (throughput):
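A back-of-the-envelope illustration with assumed (hypothetical) vehicles: a 5-seat car driving at 100 km/h and a 40-seat bus driving at 50 km/h.
Latency: the car delivers its 5 passengers in 1 h, the bus needs 2 h \rightarrow the car wins.
Throughput: moving 160 people takes the car 32 loaded trips (\sim 63 h including return trips), but only 4 loaded trips for the bus (\sim 14 h) \rightarrow the bus wins.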
Fastest way to learn?
Latency: private teacher or small classroom
Throughput: big classroom or online course
\rightarrow arguably none is better than the others: all work and each fills a specific need
Type | Nb of cores | Memory (cache) | Memory (device) | Power per core | Clock rate |
---|---|---|---|---|---|
CPU | 10\times | 10\times - 100\times KB | 10\times - 100\times GB | 10\times W | 3 - 5 GHz |
GPU | 1000\times | 100\times B | 10\times GB | 0.01\times - 0.1\times W | 0.5 - 1 GHz |
CPU | GPU |
---|---|
Task parallelism | Data parallelism |
A few “heavyweight” cores | Many “lightweight” cores |
High memory size | High memory throughput |
Many diverse instruction sets | A few highly optimized instruction sets |
Software thread management | Hardware thread management |
CPU: directly mounted on the motherboard, < 5\times 5 cm
GPU: a HUGE board (\sim 30\times 10 cm) mounted through a PCI-Express connection
AMD Ryzen Threadripper 2990WX processor: 32 cores (each capable of running two independent threads), 3 GHz base clock rate up to 4.2 GHz (boost), 2 instructions per clock cycle, each thread processes 8 floating-point values at the same time
Nvidia RTX 2080 Ti GPU: 4352 ALUs, 1.35 GHz base clock rate up to 1.545 GHz (boost), 2 operations per clock cycle, one floating-point operation per ALU
Throughput: the maximum theoretical number of floating-point operations they can handle per second (FLOPS)
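A back-of-the-envelope computation from the figures above (counting physical cores only, since the two hardware threads of a core share its floating-point units):
CPU: 32 \times 4.2 GHz \times 2 \times 8 \approx 2.15 TFLOPS
GPU: 4352 \times 1.545 GHz \times 2 \times 1 \approx 13.4 TFLOPS
\rightarrow roughly a 6\times throughput advantage for the GPU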
Latency: the CPU clocks at 4.2 GHz, while the GPU clocks at \sim 1.5 GHz, nearly three times slower
\rightarrow one is not better than the other, they serve different purposes
image processing = large matrix operations
\rightarrow why not use GPUs for general matrix operations?
Example: deep learning = a combination of linear transformations (= matrix products) and simple non-linear operations
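As an illustration, a toy sketch in R (all names are illustrative) of one dense layer: a linear transformation (matrix product) followed by a simple non-linear operation (here a ReLU):

```r
# one dense neural-network layer: linear transformation + non-linearity
relu <- function(x) pmax(x, 0)              # simple non-linear operation
dense_layer <- function(X, W, b) {
  relu(sweep(X %*% W, 2, b, "+"))           # matrix product + bias, then ReLU
}

X <- matrix(rnorm(32 * 10), nrow = 32)      # 32 samples, 10 features
W <- matrix(rnorm(10 * 4), nrow = 10)       # weights: 10 inputs -> 4 units
b <- rnorm(4)                               # biases
H <- dense_layer(X, W, b)                   # 32 x 4 matrix of activations
```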
General-purpose computing on graphics processing units (GPGPU): computations on GPU not dedicated to image processing
Nvidia (with CUDA drivers)
AMD (with AMD drivers or OpenCL)
See this page
Example with gpuR:
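A minimal sketch, assuming the gpuR package is installed and an OpenCL-capable GPU is available (the exact API may vary with the gpuR version):

```r
library(gpuR)

n <- 2000
A <- matrix(rnorm(n * n), nrow = n)
B <- matrix(rnorm(n * n), nrow = n)

# copy the matrices to GPU device memory (single-precision floats)
gpuA <- vclMatrix(A, type = "float")
gpuB <- vclMatrix(B, type = "float")

# %*% is overloaded by gpuR: the product is computed on the GPU
gpuC <- gpuA %*% gpuB

# bring the result back to a regular R matrix in host memory
C <- as.matrix(gpuC)
```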
See the dedicated notebook
Cache = on-board memory unit for CPU cores
Very small
Example: the cache hierarchy of the K8 core in the AMD Athlon 64 CPU (figure from Wikipedia)
data are transferred from memory to cache by blocks of contiguous data (cache lines)
to be efficient, computations must use contiguous data (as illustrated below)
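A minimal sketch in R: matrices are stored column-major, so traversing a matrix column by column reads contiguous memory (cache-friendly), while traversing it row by row jumps around in memory:

```r
n <- 5000
M <- matrix(rnorm(n * n), nrow = n)

sum_by_col <- function(M) {            # contiguous (column-major) access
  s <- 0
  for (j in seq_len(ncol(M))) s <- s + sum(M[, j])
  s
}

sum_by_row <- function(M) {            # strided access across columns
  s <- 0
  for (i in seq_len(nrow(M))) s <- s + sum(M[i, ])
  s
}

system.time(sum_by_col(M))             # typically noticeably faster
system.time(sum_by_row(M))
```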
See the dedicated notebook
a front-end server: user interface to submit computations, accessible from the internet (e.g. by ssh connection or through a web interface)
many computing servers (also called computing nodes or workers) that run the computations
one or more storage servers: to store the data and results
single-node multi-core computations (multi-threading or multi-processing; see the sketch after this list)
multi-node computations (distributed computations)
GPU computing
combinations of two or more
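As announced above, a minimal sketch of the single-node multi-core case in R, using the parallel package shipped with base R (mclapply forks worker processes and is not available on Windows):

```r
library(parallel)

# run 8 independent tasks on 4 cores of a single node
res <- mclapply(1:8, function(i) {
  sum(rnorm(1e6))                      # some independent computation per task
}, mc.cores = 4)
```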
MUSE cluster at MESO@LR computing center in Montpellier
How to share the computing resources (cores, memory, full nodes) between users?
\rightarrow a resource management system (also called a scheduler) to assign resources to users depending on their request
Functions:
allocate access to computing resources (nodes, cores, memory) for some duration
provide a framework to start, execute, and monitor jobs on the allocated resources
arbitrate contention by managing a queue of pending jobs
SLURM = a job scheduler for clusters and supercomputers
Job submission:
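A minimal sketch of a SLURM submission script (hypothetical resource values; partition names, module loading, and limits depend on the actual cluster, e.g. MUSE):

```bash
#!/bin/bash
#SBATCH --job-name=my_job        # name shown in the queue
#SBATCH --ntasks=1               # one task (process)
#SBATCH --cpus-per-task=4        # 4 cores for that task
#SBATCH --mem=8G                 # 8 GB of memory
#SBATCH --time=01:00:00          # 1 hour wall-time limit

# run the computation (here an R script, as in the course examples)
Rscript my_script.R
```

The script is submitted with sbatch my_job.sh; squeue shows the state of the queue and scancel removes a job.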