GPU-based Heterogeneous Systems [PCs (CPU + GPU) = Heterogeneous Systems]




GPU-based Heterogeneous Systems [PCs (CPU + GPU) = Heterogeneous Systems]
Leonel Sousa, Lídia Kuan and Aleksandar Ilić

General Outline
- GPU-based heterogeneous systems
- CHPS: Collaborative execution environment for Heterogeneous Parallel Systems
- CudaMPI and CaravelaMPI: APIs for GPU computation in distributed systems
- Conclusions and future work

CHPS: Introduction
Commodity desktop computers = heterogeneous systems
- Multi-core general-purpose processors (CPUs)
- Many-core graphics processing units (GPUs)
- Special accelerators, co-processors, FPGAs, DSPs, ...
Huge collaborative computing power
- Not yet fully explored
- Most research to date targets the usage of only one of these devices at a time, mainly for domain-specific computations
- Heterogeneity makes the problems much more complex

CHPS: Desktop Heterogeneous Systems
Master-slave execution paradigm with distributed-memory programming techniques:
- CPU (master): global execution controller; accesses the whole global memory
- Interconnection buses: form a distributed-memory system with restricted communication bandwidth
- Underlying devices (slaves): different architectures and programming models; computation is performed using local memories

CHPS: Unified Execution Model
Primitive jobs
- Minimal problem portions for parallel execution (1st agglomeration)
- Balanced granularity: a partition fine enough to sustain execution on every device in the system, but coarse enough to reduce data-movement overheads
Job Queue
- Classifies the primitive jobs to support a specific parallel execution scheme
- Allows batching of primitive jobs into a single job according to individual device demands (2nd agglomeration)
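To make the two agglomeration steps concrete, the following is a minimal sketch of how primitive jobs and the job queue could be represented; the struct fields, class names and locking scheme are illustrative assumptions, not the actual CHPS data structures.

```cpp
#include <cstddef>
#include <deque>
#include <mutex>
#include <vector>

// Hypothetical CHPS-style primitive job: the smallest problem portion
// that can be executed in parallel (1st agglomeration).
struct PrimitiveJob {
    std::size_t offset;   // position of this portion in the global problem
    std::size_t size;     // granularity: fine enough for every device,
                          // coarse enough to limit data-movement overhead
};

// Hypothetical job queue: classifies primitive jobs and lets a device
// request several of them at once, batched into one job (2nd agglomeration).
class JobQueue {
public:
    void push(PrimitiveJob job) {
        std::lock_guard<std::mutex> lock(mutex_);
        jobs_.push_back(job);
    }

    // A device asks for up to `batch` primitive jobs according to its demand.
    std::vector<PrimitiveJob> request(std::size_t batch) {
        std::lock_guard<std::mutex> lock(mutex_);
        std::vector<PrimitiveJob> out;
        while (!jobs_.empty() && out.size() < batch) {
            out.push_back(jobs_.front());
            jobs_.pop_front();
        }
        return out;
    }

private:
    std::deque<PrimitiveJob> jobs_;
    std::mutex mutex_;
};
```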

CHPS: Unified Execution Model II
Device Query
- Identifies all underlying devices
- Holds per-device information (device type, current status, requested workload and performance history)
Scheduler
- Adopts a scheduling scheme according to device capabilities, availability and restrictions (currently, FCFS)
- Forms the execution pair <device, computational_portion>
- Upon execution completion, the data is transferred back to the host and the device is free to serve another request from the Scheduler
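A minimal sketch of the FCFS policy described above, forming the <device, computational_portion> execution pair; the DeviceInfo record and the schedule_fcfs helper are hypothetical names used only for illustration.

```cpp
#include <cstddef>
#include <queue>
#include <string>

// Hypothetical per-device record kept by the Device Query component.
struct DeviceInfo {
    std::string type;          // e.g. "CPU core" or "GPU"
    bool        busy = false;  // current status
    std::size_t batch = 1;     // requested workload (jobs per request)
};

// Computational portion assigned to one device: a range of primitive jobs.
struct Portion { std::size_t first_job = 0, num_jobs = 0; };

// Illustrative FCFS scheduling step: the first idle device that asks for
// work receives the next jobs in the queue, forming <device, portion>.
Portion schedule_fcfs(DeviceInfo& dev, std::queue<std::size_t>& job_queue) {
    Portion p;
    if (dev.busy || job_queue.empty()) return p;   // nothing to assign
    p.first_job = job_queue.front();
    while (!job_queue.empty() && p.num_jobs < dev.batch) {
        job_queue.pop();                           // hand out up to `batch` jobs
        ++p.num_jobs;
    }
    dev.busy = true;   // device computes; on completion results return to host
    return p;
}
```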

CHPS: Case Study - Dense Matrix Multiplication
- The matrix multiplication is based on sub-matrix partitioning, creating a set of primitive jobs
- Devices can request several primitive jobs at once
- Single updates are performed atomically
- High-performance libraries: ATLAS CBLAS [CPU], CUBLAS [GPU]
- Parameters: problem size M x N x K, primitive job parameters P x Q x R, job queue size
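For illustration, a single primitive job of this case study could be executed on the CPU side with CBLAS as sketched below: a rank-Q update of a P x R block of C. In the reported experiments P = M and R = N, so each job is a rank-Q update of the whole result matrix, which is why concurrent updates must be serialized (performed atomically). The row-major layout and the helper name are assumptions.

```cpp
#include <cblas.h>    // ATLAS or any other CBLAS implementation
#include <cstddef>

// Illustrative primitive job for the blocked matrix multiplication:
// C[p0..p0+P, r0..r0+R] += A[p0..p0+P, q0..q0+Q] * B[q0..q0+Q, r0..r0+R].
// A is M x K, B is K x N, C is M x N, all row-major (assumption).
void run_primitive_job(const float* A, const float* B, float* C,
                       int M, int N, int K,
                       int p0, int q0, int r0, int P, int Q, int R)
{
    (void)M;  // M is implicit in the block coordinates for row-major storage
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                P, R, Q,
                1.0f, A + (std::size_t)p0 * K + q0, K,
                      B + (std::size_t)q0 * N + r0, N,
                1.0f, C + (std::size_t)p0 * N + r0, N);
}
```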

CHPS: Case Study - 3D Fast Fourier Transform
- Parameters: problem size N1 x N2 x N3, primitive job parameters P1 x P2 x P3
- Job queue size: N1 / P1 for the 2D FFT batch; (N2 / P2) x (N3 / P3) for the 1D FFT batch
- Computed as H = FFT_1D(FFT_2D(h))
- A traditional parallel implementation requires transpositions between the FFTs applied on different dimensions and after executing the final FFT
- Our implementation: only the portions of data assigned to each device are transposed, and the results are retrieved at the adequate positions after the final FFT
- High-performance libraries: FFTW [CPU], CUFFT [GPU]
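A host-side sketch of the two-stage decomposition H = FFT_1D(FFT_2D(h)) using FFTW, assuming an N1 x N2 x N3 volume stored contiguously with n1 as the slowest dimension; gathering each (n2, n3) line into a contiguous buffer plays the role of the local, per-portion transposition mentioned above. Plan reuse and the CHPS batching into P1 planes and (N2/P2) x (N3/P3) lines per primitive job are omitted for clarity.

```cpp
#include <complex>
#include <cstddef>
#include <vector>
#include <fftw3.h>

// Illustrative two-stage 3D FFT: 2D FFTs over (n2, n3), then 1D FFTs over n1.
void fft3d_two_stage(std::complex<float>* h, int N1, int N2, int N3)
{
    // Stage 1: one 2D FFT (N2 x N3) per n1 plane.
    for (int n1 = 0; n1 < N1; ++n1) {
        fftwf_complex* plane = reinterpret_cast<fftwf_complex*>(h) +
                               (std::size_t)n1 * N2 * N3;
        fftwf_plan p = fftwf_plan_dft_2d(N2, N3, plane, plane,
                                         FFTW_FORWARD, FFTW_ESTIMATE);
        fftwf_execute(p);
        fftwf_destroy_plan(p);
    }

    // Stage 2: one 1D FFT along n1 for every (n2, n3) line. Each line is
    // gathered into a contiguous buffer (the local transposition of only the
    // assigned portion) and scattered back to its position afterwards.
    std::vector<std::complex<float>> line(N1);
    for (int n2 = 0; n2 < N2; ++n2)
        for (int n3 = 0; n3 < N3; ++n3) {
            for (int n1 = 0; n1 < N1; ++n1)
                line[n1] = h[((std::size_t)n1 * N2 + n2) * N3 + n3];
            fftwf_plan p = fftwf_plan_dft_1d(
                N1, reinterpret_cast<fftwf_complex*>(line.data()),
                reinterpret_cast<fftwf_complex*>(line.data()),
                FFTW_FORWARD, FFTW_ESTIMATE);
            fftwf_execute(p);
            fftwf_destroy_plan(p);
            for (int n1 = 0; n1 < N1; ++n1)
                h[((std::size_t)n1 * N2 + n2) * N3 + n3] = line[n1];
        }
}
```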

CHPS: Experimental Results - Optimized Dense Matrix Multiplication
Single-precision floating-point arithmetic (largest problem executable by the GPU)
- Problem size (M x N x K): 4096 x 4096 x 4096
- Primitive job parameters (P x Q x R): 4096 x 128 x 4096; job queue size: 32
- Exhaustive search: 32 x 32 tests
- Optimal load balancing: GPU 2 x 13, CPU 3 x 2
- Obtained speedup: 4.5 vs 1 CPU core, 2.3 vs 2 CPU cores, 1.2 vs 1 GPU

Experimental setup
- CPU: Intel Core 2, 2.33 GHz per core, 2048 MB of global memory
- GPU: NVIDIA GeForce 8600GT, 0.54 GHz per core, 256 MB of global memory
- High-performance software: ATLAS 3.8.3 (CPU) and CUBLAS 2.1 (GPU) for matrix multiplication; FFTW 3.2.1 (CPU) and CUFFT 2.1 (GPU) for the 3D FFT

CHPS: Experimental Results - Complex 3D FFT
Complex 3D fast Fourier transform (problem not possible to execute on the GPU alone)
- Problem size (N1 x N2 x N3): 512 x 512 x 512
- Primitive job parameters (P1 x P2 x P3): 16 x 16 x 16
- Job queue size: 32 for the 2D FFT batch (2D FFTs of 512 x 512); 32 x 32 for the 1D FFT batch (1D FFTs of length 512)
- Exhaustive search: 160 tests (2D FFT batch), 416 tests (1D FFT batch)
- Optimal load balancing: 2D FFT batch GPU 4 / CPU 1; 1D FFT batch GPU 12 / CPU 4
- Obtained speedup vs 1 CPU core: 3.2 (2D FFT), 2.2 (1D FFT)
- Obtained speedup vs 2 CPU cores: 1.6 (2D FFT), ~1 (1.9 without integrated transposition) (1D FFT)
- Speedup vs 1 GPU: not applicable, since the problem does not fit on the GPU

CHPS: Experimental Results - Totals
- Complex 3D FFT: transposition time dominates the execution for higher workloads; memory transfers are significant
- Matrix multiplication: the GPU execution surpasses the CPU execution

Work
- CHPS: Collaborative Execution Environment for Heterogeneous Parallel Systems
- CudaMPI and CaravelaMPI: APIs for GPU computation in distributed systems

CudaMPI and CaravelaMPI: Motivation
Distributed system for computation
- Take advantage of the GPUs' computational power
- Use several platforms containing GPUs to solve one single problem
Programming challenges
- Algorithm parallelization
- Performing the computation on GPUs
- Execution in a distributed system where each platform has its own memory
- Network communication

CudaMPI and CaravelaMPI: Proposal
General-purpose GPU computation on a distributed system without being concerned with the network communication
Implementation of two libraries using MPI for communication in the distributed system:
- CudaMPI: CUDA for computation on NVIDIA's Tesla-architecture GPUs; the execution kernel is passed as a parameter
- CaravelaMPI: OpenGL and DirectX for computation on any vendor's GPU; the execution kernel is set by an XML file

CudaMPI and CaravelaMPI: Libraries
Both libraries provide:
- System functions: initialize and finalize the involved environments
- Communication and execution functions: provide send and receive calls and execute a kernel
- Non-blocking functions: allow executing an independent function while waiting for data to be received / sent
- Synchronization functions: guarantee that the data transfer has finished
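The slides do not give the concrete CudaMPI/CaravelaMPI signatures, so the sketch below only illustrates the underlying pattern that the four function groups wrap when written directly with MPI and the CUDA runtime: initialization/finalization, a blocking send, a non-blocking receive overlapped with independent work, and a final synchronization before the kernel execution. All identifiers come from standard MPI/CUDA, not from the libraries themselves.

```cpp
#include <mpi.h>
#include <cuda_runtime.h>
#include <vector>

int main(int argc, char** argv)   // run with at least two MPI ranks
{
    // "System functions": initialize the MPI environment (and the CUDA device).
    MPI_Init(&argc, &argv);
    int rank;  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    std::vector<float> host(n, 1.0f);
    float* dev = nullptr;
    cudaMalloc(&dev, n * sizeof(float));

    if (rank == 1) {
        // "Communication": blocking send of the input block to node 0.
        MPI_Send(host.data(), n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
    } else if (rank == 0) {
        // "Non-blocking" receive: post it, then do independent work meanwhile.
        MPI_Request req;
        MPI_Irecv(host.data(), n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD, &req);
        /* ... independent host-side work could run here ... */

        // "Synchronization": guarantee the transfer finished before use.
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        // "Execution": move data to the GPU and launch the kernel
        // (in CudaMPI the kernel is a parameter; the launch is omitted here).
        cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
        // my_kernel<<<grid, block>>>(dev, n);   // placeholder kernel launch
        cudaDeviceSynchronize();
    }

    cudaFree(dev);
    MPI_Finalize();   // "System functions": tear everything down.
    return 0;
}
```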

CudaMPI and CaravelaMPI: Case Study - Dense Matrix Multiplication
- The matrices are distributed across P nodes
- Each step calculates a partial result of the final product
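The slides do not detail the data distribution, so the sketch below shows one common pattern that matches the description (P steps, each computing a partial of the final result): a 1-D ring algorithm in which every node keeps a block of rows of A and C while blocks of B circulate. CBLAS is used for the local product to keep the sketch short; in the presented results the local kernel runs on the GPU through CUBLAS.

```cpp
#include <mpi.h>
#include <cblas.h>
#include <cstddef>
#include <vector>

// Illustrative 1-D ring decomposition: node `rank` keeps M/P rows of A and C,
// a block of K/P rows of B circulates, and after P steps the local C is done.
void ring_matmul(const std::vector<float>& A,   // (M/P) x K, row-major
                 std::vector<float>&       B,   // (K/P) x N, circulating block
                 std::vector<float>&       C,   // (M/P) x N, local result
                 int M, int N, int K, int P, int rank)
{
    const int mloc = M / P, kloc = K / P;
    for (int step = 0; step < P; ++step) {
        // Which block row of B this node currently holds.
        const int q = (rank + step) % P;
        // Partial product: C_local += A[:, q*kloc .. (q+1)*kloc] * B_block.
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    mloc, N, kloc,
                    1.0f, A.data() + (std::size_t)q * kloc, K,
                          B.data(), N,
                    1.0f, C.data(), N);
        // Rotate the B block to the previous node in the ring.
        const int dest = (rank + P - 1) % P, src = (rank + 1) % P;
        MPI_Sendrecv_replace(B.data(), kloc * N, MPI_FLOAT,
                             dest, 0, src, 0,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}
```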

CudaMPI: Case Study - 3D Fast Fourier Transform
Our implementation: H = FFT_1D(FFT_2D(h))
- Portions of data are assigned to the different nodes of the cluster
- 2D FFT stage: P1 = N1 / (number of nodes)
- 1D FFT stage: P3 = N3 / (number of nodes)
- High-performance library: CUFFT [GPU]
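A small sketch of the per-node work split described above (stage 1: N1 / nodes planes of 2D FFTs; stage 2: N3 / nodes slabs of 1D FFTs), using the 256 x 256 x 256 problem size from the experiments as an assumed example; the inter-stage redistribution and the CUFFT calls are only indicated by comments.

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank, nodes;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nodes);

    const int N1 = 256, N2 = 256, N3 = 256;   // assumed problem size
    const int P1 = N1 / nodes;                // planes per node, stage 1
    const int P3 = N3 / nodes;                // slabs per node, stage 2
    // (assumes N1 and N3 are divisible by the number of nodes)

    std::printf("node %d: 2D FFTs over n1 in [%d, %d), "
                "1D FFTs over n3 in [%d, %d)\n",
                rank, rank * P1, (rank + 1) * P1,
                rank * P3, (rank + 1) * P3);

    // Stage 1: batched 2D FFT (N2 x N3) on the P1 local planes (e.g. CUFFT).
    // ... redistribution of the volume between the two stages ...
    // Stage 2: batched 1D FFT (length N1) on the local N2 x P3 lines.

    MPI_Finalize();
    return 0;
}
```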

CudaMPI and CaravelaMPI: Experimental Results - Dense Matrix Multiplication
- Cluster environment with 4 computers (Intel Core 2 Quad @ 2.4 GHz, NVIDIA GeForce 8800GT), connected by a Gigabit Ethernet network
- Problem sizes 256x256, 512x512, 1024x1024 and 2048x2048, executed on 1 and 4 nodes
- [Table with network, kernel and total times (in seconds) for CPU + MPI, CudaMPI and CaravelaMPI; the numeric values are garbled in this transcription and are not reproduced here]

CudaMPI and CaravelaMPI: Experimental Results - Dense Matrix Multiplication
- The lowest execution time is achieved with the CudaMPI library
- The results show that the bottleneck for the CudaMPI library is the latency of the network
- Using 4 nodes improves the execution time: a speed-up of up to 3.5 is achieved
- The GPU execution has better performance than the CPU

CudaMPI: Experimental Results - 3D Fast Fourier Transform
- CudaMPI library using CUFFT for computation
- Including the network time, a speed-up of up to 2.5x is achieved
- Excluding the network time, a speed-up of up to 4x is achieved (4 nodes vs 1 node)

Times in seconds (1 node / 4 nodes):
              256x256x256         512x512x64          1024x1024x32        2048x2048x4
Kernel        1.4736 / 0.3675     5.0341 / 1.2582     19.2262 / 4.8491    68.1124 / 17.3955
PCI-E bus     1.8494 / 0.4397     5.7882 / 1.4318     21.9203 / 5.5027    87.0106 / 21.1429
Network       0.0000 / 12.6753    0.0000 / 12.2511    0.0000 / 12.3946    0.0000 / 24.2128
Total         3.3230 / 13.4825    10.8222 / 14.9412   41.1465 / 22.7464   155.1230 / 62.7511

Conclusions and Future Work: CHPS
The proposed unified execution environment achieves significant speedups relative to the single-CPU-core execution:
- 4.5 for the dense matrix multiplication
- 2.8 for the complex 3D fast Fourier transform
Future work:
- Implementation on more heterogeneous systems (more GPUs, more CPU cores, or special-purpose accelerators)
- Asynchronous memory transfers
- Tackle dependencies and adopt advanced scheduling policies
- Performance prediction and application self-tuning, to identify performance limits and choose the processor to use (e.g., GPU versus CPU)

Conclusions and Future Work: CudaMPI
For the two considered algorithms:
- When considering only the kernel execution time, a speed-up is always achieved
- Network communication is the bottleneck of the CudaMPI library
Future work:
- Test the libraries using an InfiniBand network
- Use the libraries for solving different independent applications, considering that the data is replicated

Questions? Thank you