GPU-based Heterogeneous Systems [PCs (CPU + GPU) = Heterogeneous Systems]
Leonel Sousa, Lídia Kuan, and Aleksandar Ilić
General Outline
- GPU-based Heterogeneous Systems
- CHPS: Collaborative execution environment for Heterogeneous Parallel Systems
- CudaMPI and CaravelaMPI: APIs for GPU computation in distributed systems
- Conclusions and future work
CHPS: Introduction
- Commodity desktop computers = heterogeneous systems
  - Multi-core general-purpose processors (CPUs)
  - Many-core graphics processing units (GPUs)
  - Special accelerators, co-processors, FPGAs, DSPs, ...
- Huge collaborative computing power, not yet fully explored
  - Most research targets the use of only one of these devices at a time, mainly for domain-specific computations
- Heterogeneity makes the problems much more complex
CHPS: Desktop Heterogeneous Systems
- Master-slave execution paradigm with distributed-memory programming techniques
- CPU (Master)
  - Global execution controller
  - Accesses the whole global memory
- Interconnection buses
  - Distributed-memory system with restrictions on communication bandwidth
- Underlying devices (Slaves)
  - Different architectures and programming models
  - Computation performed using local memories
CHPS: Unified Execution Model
- Primitive jobs
  - Minimal problem portions for parallel execution (1st agglomeration)
  - Balanced granularity: fine enough to sustain execution on every device in the system, but coarse enough to reduce data-movement overheads
- Job Queue
  - Classifies the primitive jobs to support a specific parallel execution scheme
  - Accommodates batching of primitive jobs into a single job according to the individual device demands (2nd agglomeration)
CHPS: Unified Execution Model II
- Device Query
  - Identifies all underlying devices
  - Holds per-device information (device type, current status, requested workload and performance history)
- Scheduler
  - Adopts a scheduling scheme according to device capabilities, availability and restrictions (currently, FCFS)
  - Forms the execution pair <device, computational_portion>
  - Upon execution completion, the data is transferred back to the host and the device is free to accept another request from the Scheduler
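To make the execution-pair idea concrete, below is a minimal sketch in C of how a Job Queue and an FCFS Scheduler along these lines could be organized. The structure and function names (primitive_job_t, device_t, chps_schedule) are illustrative assumptions, not the actual CHPS interfaces.

```c
/* Illustrative sketch of a Job Queue and FCFS Scheduler.
 * All names are hypothetical; the real CHPS interfaces may differ. */
#include <stddef.h>

typedef struct {            /* one primitive job (1st agglomeration)    */
    size_t offset;          /* starting index of the assigned portion   */
    size_t size;            /* amount of work in this portion           */
} primitive_job_t;

typedef struct {            /* per-device entry kept by the Device Query */
    int id;
    int busy;               /* current status                            */
    int requested_jobs;     /* how many primitive jobs it asks for       */
} device_t;

/* FCFS: the first idle device gets the next batch of primitive jobs from
 * the head of the queue (2nd agglomeration), forming the execution pair
 * <device, computational_portion>. out_batch must have room for at least
 * requested_jobs entries. Returns 1 if a pair was formed, 0 otherwise.   */
static int chps_schedule(device_t *devs, int ndevs,
                         primitive_job_t *queue, int *head, int qsize,
                         int *out_dev, primitive_job_t *out_batch, int *out_n)
{
    for (int d = 0; d < ndevs; ++d) {
        if (devs[d].busy || *head >= qsize)
            continue;
        int n = devs[d].requested_jobs;
        if (n > qsize - *head)
            n = qsize - *head;              /* do not overrun the queue  */
        for (int j = 0; j < n; ++j)
            out_batch[j] = queue[*head + j];
        *head += n;
        devs[d].busy = 1;
        *out_dev = devs[d].id;
        *out_n = n;
        return 1;                           /* one execution pair formed */
    }
    return 0;                               /* nothing to schedule now   */
}
```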
CHPS: Case study - Dense Matrix Multiplication
- The matrix multiplication is partitioned into sub-matrices, creating a set of primitive jobs
- Devices can request several primitive jobs at once
- Single updates are performed atomically
- High-performance libraries: ATLAS CBLAS [CPU], CUBLAS [GPU]
- Parameters: problem size M x N x K; primitive job parameters P x Q x R; job queue size
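As an illustration of a single primitive job, the sketch below updates one sub-block of C using ATLAS/CBLAS on the CPU; the GPU path would issue the equivalent cublasSgemm call. The blocking scheme, row-major layout and names are assumptions for the example, not the CHPS implementation.

```c
/* Sketch of one matrix-multiplication primitive job on the CPU.
 * C[i0..i0+P, j0..j0+R] += A[i0..i0+P, k0..k0+Q] * B[k0..k0+Q, j0..j0+R],
 * with A (M x K), B (K x N), C (M x N), all row-major.
 * When primitive jobs also partition the K dimension, several jobs
 * accumulate into the same C block, which is why each single update is
 * committed atomically once the job completes.                          */
#include <stddef.h>
#include <cblas.h>

static void mm_primitive_job(const float *A, const float *B, float *C,
                             int M, int N, int K,
                             int i0, int j0, int k0, int P, int Q, int R)
{
    (void)M;  /* M is implied by the block coordinates in this sketch */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                P, R, Q,
                1.0f, A + (size_t)i0 * K + k0, K,
                      B + (size_t)k0 * N + j0, N,
                1.0f, C + (size_t)i0 * N + j0, N);
}
```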
CHPS: Case study - 3D Fast Fourier Transform
- H = FFT_1D(FFT_2D(h))
- Parameters: problem size N1 x N2 x N3; primitive job parameters P1 x P2 x P3; job queue size: 2D FFT batch N1/P1, 1D FFT batch (N2/P2) x (N3/P3)
- Traditional parallel implementations require transpositions between the FFTs applied on different dimensions and after executing the final FFT
- Our implementation: only the portions of data assigned to the device are transposed, and the results are retrieved into the adequate positions after the final FFT
- High-performance libraries: FFTW [CPU], CUFFT [GPU]
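The two-stage decomposition can be illustrated on the CPU with FFTW's advanced interface. The sketch below expresses the 1D stage with strided transforms instead of the per-portion transpositions used in CHPS; the variable names and single-device layout are assumptions for the example.

```c
/* Sketch of the H = FFT_1D(FFT_2D(h)) decomposition with FFTW, for a
 * complex N1 x N2 x N3 volume stored contiguously (n3 fastest-varying). */
#include <fftw3.h>

void fft3d_two_stage(fftwf_complex *data, int N1, int N2, int N3)
{
    int n23[2] = { N2, N3 };
    int n1[1]  = { N1 };

    /* Stage 1: batch of N1 two-dimensional N2 x N3 FFTs (contiguous planes). */
    fftwf_plan p2d = fftwf_plan_many_dft(2, n23, N1,
                                         data, NULL, 1, N2 * N3,
                                         data, NULL, 1, N2 * N3,
                                         FFTW_FORWARD, FFTW_ESTIMATE);
    /* Stage 2: batch of N2*N3 one-dimensional FFTs of length N1 along the
     * first dimension, expressed with stride N2*N3 so that no explicit
     * transposition of the whole volume is needed in this sketch.          */
    fftwf_plan p1d = fftwf_plan_many_dft(1, n1, N2 * N3,
                                         data, NULL, N2 * N3, 1,
                                         data, NULL, N2 * N3, 1,
                                         FFTW_FORWARD, FFTW_ESTIMATE);
    fftwf_execute(p2d);
    fftwf_execute(p1d);
    fftwf_destroy_plan(p2d);
    fftwf_destroy_plan(p1d);
}
```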
CHPS: Experimental Results - Optimized Dense Matrix Multiplication
- Experimental setup: CPU Intel Core 2 (2.33 GHz per core, 2048 MB global memory); GPU NVIDIA GeForce 8600GT (0.54 GHz per core, 256 MB global memory)
- High-performance software: matrix multiplication ATLAS 3.8.3 [CPU] and CUBLAS 2.1 [GPU]; 3D FFT FFTW 3.2.1 [CPU] and CUFFT 2.1 [GPU]
- Single-precision floating-point arithmetic (largest problem executable by the GPU)
- Problem size M x N x K = 4096 x 4096 x 4096; primitive job parameters P x Q x R = 4096 x 128 x 4096; job queue size 32
- Exhaustive search: 32x32 tests; optimal load-balancing: GPU 2x13, CPU 3x2
- Obtained speedup compared to: 1 CPU core 4.5; 2 CPU cores 2.3; 1 GPU 1.2
CHPS: Experimental Results - Complex 3D FFT
- Complex 3D fast Fourier transform (problem not possible to execute on the GPU alone)
- Problem size N1 x N2 x N3 = 512 x 512 x 512; primitive job parameters P1 x P2 x P3 = 16 x 16 x 16
- Job queue size: 2D FFT batch 32 x 2D FFT (512x512); 1D FFT batch 32x32 1D FFTs (512)
- Exhaustive search: 2D FFT batch 160 tests; 1D FFT batch 416 tests
- Optimal load-balancing: 2D FFT batch GPU 4, CPU 1; 1D FFT batch GPU 12, CPU 4
- Obtained speedup compared to 1 CPU core: 2D FFT batch 3.2; 1D FFT batch 2.2
- Obtained speedup compared to 2 CPU cores: 2D FFT batch 1.6; 1D FFT batch ~1 (1.9*)
- Comparison to 1 GPU: not applicable (the problem does not fit on the GPU)
* without integrated transposition
CHPS: Experimental Results - Total Complex 3D FFT
- Transposition time dominates the execution for higher workloads
- Memory transfers are significant
- Matrix multiplication: GPU execution surpasses the CPU execution
Work
- CHPS: Collaborative Execution Environment for Heterogeneous Parallel Systems
- CudaMPI and CaravelaMPI: APIs for GPU computation in distributed systems
CudaMPI and CaravelaMPI: Motivation
- Distributed systems for computation
  - Take advantage of the GPUs' computational power
  - Use several platforms containing GPUs to solve a single problem
- Programming challenges:
  - Algorithm parallelization
  - Performing computation on GPUs
  - Execution in a distributed system where each platform has its own memory
  - Network communication
CudaMPI and CaravelaMPI: Proposal
- GPU general-purpose computation on a distributed system without the programmer being concerned with network communication
- Implementation of two libraries using MPI for communication in a distributed system:
  - CudaMPI: CUDA for computation on NVIDIA's Tesla-architecture GPUs; the execution kernel is passed as a parameter
  - CaravelaMPI: OpenGL and DirectX for computation on any vendor's GPU; the execution kernel is set by an XML file
CudaMPI and CaravelaMPI: Libraries
- System functions: initialize and finalize the involved environments
- Communication and execution functions: provide send and receive calls and execute a kernel
- Non-blocking functions: allow an independent function to execute while waiting for data to be received/sent
- Synchronization functions: guarantee that data sending has finished
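The sketch below shows, in plain MPI and CUDA runtime calls, the sequence that a combined receive-execute-send call of this kind has to hide from the programmer. The actual CudaMPI/CaravelaMPI function names are not reproduced here, and run_kernel_on_gpu() is a hypothetical placeholder for the user-supplied execution kernel.

```c
/* Sketch of what a combined communication-and-execution call performs on a
 * worker node; not the CudaMPI API itself.                                 */
#include <stdlib.h>
#include <mpi.h>
#include <cuda_runtime.h>

extern void run_kernel_on_gpu(float *d_data, int n);   /* hypothetical */

void worker_step(int n, int master_rank)
{
    size_t bytes = (size_t)n * sizeof(float);
    float *h_data = (float *)malloc(bytes);
    float *d_data = NULL;
    MPI_Status st;

    /* 1. Network communication: receive the input portion from the master. */
    MPI_Recv(h_data, n, MPI_FLOAT, master_rank, 0, MPI_COMM_WORLD, &st);

    /* 2. Move the data to the GPU, execute the kernel, bring the result back. */
    cudaMalloc((void **)&d_data, bytes);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
    run_kernel_on_gpu(d_data, n);
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);

    /* 3. Send the result back; a non-blocking variant would use MPI_Isend so
     *    that independent work can proceed while the transfer completes.     */
    MPI_Send(h_data, n, MPI_FLOAT, master_rank, 1, MPI_COMM_WORLD);
    free(h_data);
}
```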
CudaMPI and CaravelaMPI: Case Study - Dense Matrix Multiplication
- The matrices are distributed over P nodes
- Each step calculates a partial of the final result
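One possible realization of this partitioning is sketched below: node p holds a column block of A and the matching row block of B, computes its partial product locally (CBLAS here; a GPU node would call cuBLAS), and the partials are summed into the final result with MPI_Reduce. The layout and names are assumptions for illustration, not necessarily the exact scheme used in the case study.

```c
/* Sketch: C = A*B distributed over P nodes. Node p owns A_p (M x K/P, a
 * column block of A) and B_p (K/P x N, the matching row block of B), so
 * C = sum over p of A_p * B_p. Row-major storage; assumes K divisible by P. */
#include <stdlib.h>
#include <mpi.h>
#include <cblas.h>

void distributed_mm(const float *A_p, const float *B_p, float *C,
                    int M, int N, int K, MPI_Comm comm)
{
    int rank, P;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &P);
    int Kp = K / P;

    /* Local partial result: C_p = A_p (M x Kp) * B_p (Kp x N). */
    float *C_p = (float *)calloc((size_t)M * N, sizeof(float));
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, Kp, 1.0f, A_p, Kp, B_p, N, 0.0f, C_p, N);

    /* Sum all partial products into the final C on the master (rank 0). */
    MPI_Reduce(C_p, C, M * N, MPI_FLOAT, MPI_SUM, 0, comm);
    free(C_p);
}
```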
CudaMPI: Case study - 3D Fast Fourier Transform
- Our implementation: H = FFT_1D(FFT_2D(h))
- Portions of data are assigned to different nodes of the cluster
  - 2D FFT stage: P1 = N1 / (number of nodes)
  - 1D FFT stage: P3 = N3 / (number of nodes)
- High-performance library: CUFFT [GPU]
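For the 2D-FFT stage, each node can run its P1 planes as one batched CUFFT plan, as sketched below; the 1D stage over n1 (after redistributing data across the nodes) would be built analogously from strided length-N1 transforms. The calls follow the standard CUFFT API, but the slab layout and names are assumptions for the example, not the case-study code.

```c
/* Sketch of the 2D-FFT stage of the distributed 3D FFT on one node's slab:
 * the node holds P1 = N1 / (number of nodes) contiguous N2 x N3 planes in
 * GPU memory and transforms them with a single batched plan.              */
#include <cufft.h>

void fft2d_stage_on_slab(cufftComplex *d_slab, int P1, int N2, int N3)
{
    cufftHandle plan;
    int n[2] = { N2, N3 };

    /* P1 packed N2 x N3 transforms in one batched plan. */
    cufftPlanMany(&plan, 2, n,
                  NULL, 1, N2 * N3,      /* input : packed layout  */
                  NULL, 1, N2 * N3,      /* output: packed layout  */
                  CUFFT_C2C, P1);
    cufftExecC2C(plan, d_slab, d_slab, CUFFT_FORWARD);
    cufftDestroy(plan);
}
```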
CudaMPI and CaravelaMPI: Experimental Results - Dense Matrix Multiplication
- Cluster environment with 4 computers (Intel Core 2 Quad @ 2.4 GHz, NVIDIA GeForce 8800GT)
- Gigabit Ethernet network
- Execution times (Network, Kernel, Total) measured for CPU + MPI, CudaMPI and CaravelaMPI, on 1 and 4 nodes, for matrix sizes 256x256, 512x512, 1024x1024 and 2048x2048
[Table: per-configuration execution times in seconds]
CudaMPI and CaravelaMPI: Experimental Results - Dense Matrix Multiplication
- The lowest execution time is achieved with the CudaMPI library
- The results show that the bottleneck for the CudaMPI library is the network communication
- Using 4 nodes improves the execution time; a speed-up of up to 3.5 is achieved
- GPU execution has better performance than CPU execution
CudaMPI: Experimental Results - 3D Fast Fourier Transform
- CudaMPI library using CUFFT for computation
- Including network time, up to a 2.5x speed-up is achieved (4 nodes vs 1 node)
- Excluding network time, up to a 4x speed-up is achieved (4 nodes vs 1 node)

Problem size    Nodes  Kernel (s)  PCI-E bus (s)  Network (s)  Total (s)
256x256x256     1      1.4736      1.8494         0.0000       3.3230
256x256x256     4      0.3675      0.4397         12.6753      13.4825
512x512x64      1      5.0341      5.7882         0.0000       10.8222
512x512x64      4      1.2582      1.4318         12.2511      14.9412
1024x1024x32    1      19.2262     21.9203        0.0000       41.1465
1024x1024x32    4      4.8491      5.5027         12.3946      22.7464
2048x2048x4     1      68.1124     87.0106        0.0000       155.1230
2048x2048x4     4      17.3955     21.1429        24.2128      62.7511
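The 2048x2048x4 column illustrates both figures: excluding network time, (68.1124 + 87.0106) / (17.3955 + 21.1429) ≈ 4.0, while including network time the ratio is 155.1230 / 62.7511 ≈ 2.5.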
Conclusions and Future Work: CHPS
- The proposed unified execution environment achieves significant speedups relative to single-CPU-core execution:
  - 4.5 for dense matrix multiplication
  - 2.8 for the complex 3D fast Fourier transform
- Future work:
  - Implementation on more heterogeneous systems (more GPUs, more CPU cores, or special-purpose accelerators)
  - Asynchronous memory transfers
  - Tackle dependencies and adopt advanced scheduling policies
  - Performance prediction and application self-tuning
  - Identify performance limits in order to choose which processor to use (e.g., GPU versus CPU)
Conclusions and Future Work: CudaMPI
- For the two considered algorithms, speed-up is always achieved when only the kernel execution time is considered
- Network communication is the bottleneck of the CudaMPI library
- Future work:
  - Test the libraries using an InfiniBand network
  - Use the libraries to solve different independent applications, considering that the data is replicated
Questions? Thank you