Cooperative Execution on Heterogeneous Multi-core Systems Leonel Sousa and Aleksandar Ilić INESC-ID/IST, TU Lisbon
Motivation
COMMODITY COMPUTERS = HETEROGENEOUS SYSTEMS
- Multi-core general-purpose processors (CPUs)
- Many-core graphics processing units (GPUs)
- Special accelerators, co-processors, FPGAs
=> HUGE COMPUTING POWER, not yet fully explored for COLLABORATIVE COMPUTING
HETEROGENEITY MAKES PROBLEMS MUCH MORE COMPLEX!
- Performance modeling and load balancing
- Different programming models and languages
Outline
- COLLABORATIVE ENVIRONMENT FOR HETEROGENEOUS COMPUTERS
- PERFORMANCE MODELING AND LOAD BALANCING: state of the art for heterogeneous systems and for CPU+GPU
- CASE STUDY: 2D BATCH FAST FOURIER TRANSFORM
- CONCLUSIONS AND FUTURE WORK
Desktop Heterogeneous Systems
MASTER-SLAVE paradigm
CPU (Master)
- Global execution controller
- Accesses the whole global memory
INTERCONNECTION BUSES
- Limited and asymmetric communication bandwidth
- Potential execution bottleneck
UNDERLYING DEVICES (Slaves)
- Different architectures and programming models
- Computation performed using local memories
Redefining Tasks and Primitive Jobs
TASK - basic programming unit (coarser-grained)
CONFIGURATION PARAMETERS
- Task: application and task-dependency information
- Environment: device type, number of devices
PRIMITIVE JOB WRAPPER
- DIVISIBLE TASK comprises several finer-grained Primitive Jobs
- AGGLOMERATIVE TASK allows grouping of Primitive Jobs
PRIMITIVE JOB - minimal program portion for parallel execution
CONFIGURATION PARAMETERS
- Primitive Job granularity, I/O and performance specifics
CARRIES PER-DEVICE-TYPE IMPLEMENTATIONS
- Vendor-specific programming models and tools
- Specific optimization techniques
Task Type:
  Divisible | Agglomerative | Granularity
  NO        | --            | Coarser
  YES       | NO            | Balanced
  YES       | YES           | Finer/Balanced
Collaborative Execution Environment for Heterogeneous Systems*
Task-Level Parallelism
- The TASK SCHEDULER submits independent tasks to the JOB DISPATCHER with respect to the task and environment configuration parameters and the current platform state from the DEVICE QUERY structure
Data-Level Parallelism
- PRIMITIVE JOBS may be arranged into JOB QUEUES (currently, 1D-3D grid organization) for DIVISIBLE (AGGLOMERATIVE) TASKS
- The JOB DISPATCHER uses DEVICE QUERY and JOB QUEUE information to map (agglomerated) PRIMITIVE JOBS to the requested devices; it then initiates and controls further execution
Nested Parallelism
- If provided, the JOB DISPATCHER can be configured to perceive a certain number of cores of a multi-core device as a single device
*Aleksandar Ilic and Leonel Sousa. Collaborative Execution Environment for Heterogeneous Parallel Systems, In 12th Workshop on Advances in Parallel and Distributed Computational Models (APDCM/IPDPS 2010), April 2010.
Collaborative Execution Environment for Heterogeneous Systems*
PROBLEM: How to make good DYNAMIC LOAD BALANCING decisions using PERFORMANCE MODELS of the devices, aware of:
- application demands
- implementation specifics
- platform / device heterogeneity
- complex memory hierarchies
- limited asymmetric communication bandwidth
Heterogeneous Performance Modeling and Computation Distribution
CONSTANT PERFORMANCE MODELS (CPM)
- DEVICE PERFORMANCE (SPEED): a constant positive number, typically the relative speed observed when executing a serial benchmark of a given size
- COMPUTATION DISTRIBUTION: proportional to the speed of the device
FUNCTIONAL PERFORMANCE MODELS (FPM)
- DEVICE PERFORMANCE (SPEED): a continuous function of the problem size; typically requires several benchmark runs and a significant amount of time to build
- COMPUTATION DISTRIBUTION: relies on the functional speed of the processor
FPM VS. CPM
- MORE REALISTIC: integrates features of heterogeneous processors (processor heterogeneity, heterogeneity of the memory structure, and other effects such as paging)
- MORE ACCURATE DISTRIBUTION of computation across heterogeneous devices
- APPLICATION-CENTRIC approach: characterizes speed with different functions for different applications
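The CPM distribution rule above can be sketched in a few lines. This is an illustrative sketch, not the implementation from the cited work; the function name and the remainder-handling rule are choices made here:

```python
def cpm_distribute(total_jobs, speeds):
    """Constant Performance Model: give each device a share of jobs
    proportional to its (constant) relative speed."""
    total_speed = sum(speeds)
    shares = [int(total_jobs * s / total_speed) for s in speeds]
    # hand any rounding remainder to the fastest device
    shares[speeds.index(max(speeds))] += total_jobs - sum(shares)
    return shares

# e.g. a GPU measured 3x faster than the CPU:
# cpm_distribute(100, [1.0, 3.0]) -> [25, 75]
```

An FPM distribution differs only in that each entry of `speeds` would be replaced by a function of the assigned load, which is what the bisection algorithm later in the deck solves for.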
Case Study: 2D FFT Batches
Part of a PARALLEL 3D FFT PROCEDURE: H = FFT_1D(FFT_2D(h))
Very HIGH COMMUNICATION-TO-COMPUTATION ratio
PROBLEM DEFINITION
- N_in = N_out = N_1 * N_2 * N_3 * sizeof(data)
- Performance [FLOPS]: N_1 * N_2 * N_3 * log(N_1 * N_2 * N_3)
- N_2 = const; N_3 = const; N_1 - total number of computational chunks (PRIMITIVE JOBS)
What we already know* (1)
METRIC: ABSOLUTE SPEED
- p devices: P_1, P_2, ..., P_p
- N_1 - total number of Primitive Jobs (chunks)
- Device load [chunks]: n_1, n_2, ..., n_p
- Absolute speed: s_i(n_i) = n_i / t_i(n_i), 1 ≤ i ≤ p
SOLUTION: OPTIMAL LOAD BALANCING
Lies on the straight line that passes through the origin of the coordinate system, such that:
- x_1/s_1(x_1) = x_2/s_2(x_2) = ... = x_p/s_p(x_p)
- x_1 + x_2 + ... + x_p = N_1
What we already know* (2)
1. All the p computational units execute N_1/p 2D FFT batches in parallel: n_i = N_1/p, 1 ≤ i ≤ p
2. Record the execution times: t_i(N_1/p)
3. IF max_{1 ≤ i,j ≤ p} {(t_i(N_1/p) - t_j(N_1/p)) / t_i(N_1/p)} ≤ ε
   THEN the even distribution solves the problem and the algorithm stops;
   ELSE the absolute speeds of the devices are calculated, such that: s_i(N_1/p) = (N_1/p) / t_i(N_1/p), 1 ≤ i ≤ p
*Lastovetsky, A., and R. Reddy, "Distributed Data Partitioning for Heterogeneous Processors Based on Partial Estimation of their Functional Performance Models", HeteroPar 2009, Netherlands, Lecture Notes in Computer Science, vol. 6043, Springer, pp. 91-101, 2010.
What we already know* (3)
1. The performance of each device is modeled as a constant: s_i(x) = s_i(N_1/p), 1 ≤ i ≤ p
*Lastovetsky, A., and R. Reddy, "Distributed Data Partitioning for Heterogeneous Processors Based on Partial Estimation of their Functional Performance Models", HeteroPar 2009, Netherlands, Lecture Notes in Computer Science, vol. 6043, Springer, pp. 91-101, 2010.
What we already know** (4-6)
1. Draw upper (U) and lower (L) lines through the following points:
   (0,0), (N_1/p, max_i {s_i(N_1/p)})
   (0,0), (N_1/p, min_i {s_i(N_1/p)})
2. Let x_i^(U), x_i^(L) be the intersections with s_i(x).
   IF there exists x_i^(L) - x_i^(U) > 1 THEN go to 3 ELSE go to 5
3. Bisect the angle between U and L by the line M, and calculate the intersections x_i^(M)
4. IF Σ_i x_i^(M) ≤ N_1 THEN U = M ELSE L = M; REPEAT from 2
**Lastovetsky, A., and R. Reddy, "Data Partitioning with a Functional Performance Model of Heterogeneous Processors", International Journal of High Performance Computing Applications, vol. 21, issue 1, Sage, pp. 76-90, 2007.
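The geometric bisection above admits a simple numeric sketch: every point (x, s(x)) on a line through the origin with slope k satisfies x/s(x) = 1/k, so bisecting the angle between U and L corresponds to bisecting the common execution time T shared by all devices. The following is an equivalent illustrative formulation under that observation, not the authors' code; it assumes positive speed functions and integer loads:

```python
def partition(N1, speed_fns, iters=60):
    """Find integer loads x_1..x_p summing to N1 such that all devices
    finish in (nearly) the same time, by bisecting on that common time T.
    speed_fns[i](x) is the functional speed s_i(x) of device i."""
    def loads_for(T):
        # per device: the largest load x (capped at N1) with x / s(x) <= T
        out = []
        for s in speed_fns:
            x = 0
            while x < N1 and (x + 1) / s(x + 1) <= T:
                x += 1
            out.append(x)
        return out

    # hi is large enough that one device alone could take all N1 chunks
    lo, hi = 0.0, max(N1 / s(N1) for s in speed_fns)
    for _ in range(iters):
        mid = (lo + hi) / 2
        if sum(loads_for(mid)) >= N1:
            hi = mid          # enough capacity: tighten from above
        else:
            lo = mid
    loads = loads_for(hi)
    loads[0] -= sum(loads) - N1   # absorb any rounding surplus
    return loads

# two devices with constant speeds 2 and 1:
# partition(30, [lambda x: 2.0, lambda x: 1.0]) -> [20, 10]
```

With constant speed functions this reduces to the CPM proportional split; with measured functional speeds it reproduces the FPM partitioning the slides describe.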
What we already know (7-8)
A. The speeds of each device are modeled as constants equal to the observed values [1]: s_i(x) = s_i(n_i), 1 ≤ i ≤ p
B. Functional performance modeling using piecewise linear approximation [2]
[1] Galindo I., Almeida F., and Badía-Contelles J.M., "Dynamic Load Balancing on Dedicated Heterogeneous Systems", In EuroPVM/MPI 2008, Springer, pp. 64-74, 2008.
[2] Lastovetsky, A., and R. Reddy, "Distributed Data Partitioning for Heterogeneous Processors Based on Partial Estimation of their Functional Performance Models", HeteroPar 2009, Lecture Notes in Computer Science, vol. 6043, Springer, pp. 91-101, 2010.
Goals in the CPU + GPU Environment
ONLINE PERFORMANCE MODELING
- PERFORMANCE ESTIMATION of all heterogeneous devices DURING THE EXECUTION
- No prior knowledge of the performance of an application is available on any of the devices
- Modeling of the overall CPU and GPU performance for different problem sizes (+ kernel-only GPU performance)
DYNAMIC LOAD BALANCING
- OPTIMAL DISTRIBUTION OF COMPUTATIONS (PRIMITIVE JOBS)
- Partial estimations of the performance should be built and used to decide on the optimal mapping
- The returned solution should provide load balancing within a given accuracy
COMMUNICATION AWARENESS
- MODELING THE BANDWIDTH of the interconnection buses DURING THE EXECUTION, to select problem sizes that maximize the interconnection bandwidth
- The algorithm should be aware of the asymmetric bandwidth of Host-To-Device and Device-To-Host transfers
CPU+GPU ARCHITECTURAL SPECIFICS
- Make use of ENVIRONMENT-SPECIFIC FUNCTIONS to ease performance modeling: asynchronous transfers and CUDA streams to overlap communication with computation
- Be aware of the diverse capabilities of different devices, and also of devices of the same type (e.g. GT200 vs. Fermi)
Case Study: Building Full Performance Models
FULL PERFORMANCE MODELS: PER-DEVICE REAL PERFORMANCE
- Experimentally obtained using CPU+GPU platform specifics (pageable/pinned memory)
- Exhaustive search on the full range of problem sizes
- High cost of building it!
Experimental Setup:
- CPU: Intel Core 2 Quad, 2.83 GHz/core, 4096 MB global memory, Intel MKL 10.2
- GPU: nVidia GeForce 285GTX, 1.476 GHz/core, 1024 MB global memory, CUFFT 3.1
2D FFT Batch: complex, double precision; total size (N_1 x N_2 x N_3): 256x256x256
in CPU + GPU Environment (1)
PROBLEM DEFINITION: PRIMITIVE JOB
- n^in - input data requirements
- n^out - output data requirements
- n^size - problem size
- n^flops - number of floating-point operations
METRIC: FLOPS
- p devices: P_1, P_2, ..., P_p
- N_1 - total number of Primitive Jobs (chunks)
- Device load [chunks]: n_1, n_2, ..., n_p, with n_i = {n_i^in, n_i^out, n_i^size, n_i^flops}
- FLOPS: Perf_i(n_i) = n_i^flops / t_i(n_i), 1 ≤ i ≤ p
SOLUTION: OPTIMAL LOAD BALANCING
Lies on the straight line that passes through the origin of the coordinate system, such that:
- x_i^flops = nf(x_i), with x_i taken as n_i^size
- nf(x_1)/Perf_1(x_1) = ... = nf(x_p)/Perf_p(x_p)
- x_1 + x_2 + ... + x_p = N_1
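With the FLOPS metric, each load is weighed by its flop count nf(x). A sketch of the cost model for the FFT case study follows; applying the slides' N_1*N_2*N_3*log(N_1*N_2*N_3) model to a chunk of x planes, and omitting constant factors (e.g. the usual factor 5 for FFT flops), are assumptions made here for illustration:

```python
import math

# case-study dimensions: N2 = N3 = 256 are the fixed plane sizes
N2, N3 = 256, 256

def nf(x):
    """Flop estimate for a chunk of x 2D-FFT planes, following the
    slides' cost model applied to that chunk (constants omitted)."""
    return x * N2 * N3 * math.log2(x * N2 * N3)

def perf(x, t):
    """Measured performance (FLOPS) of a device that processed
    x planes in t seconds: Perf_i(x) = nf(x) / t_i(x)."""
    return nf(x) / t
```

Balancing then means choosing loads x_i with equal nf(x_i)/Perf_i(x_i) across devices, exactly the condition on the slide.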
in CPU + GPU Environment (2)
1. All the p computational units execute N_1/p 2D FFT batches in parallel: n_i = N_1/p, 1 ≤ i ≤ p
2. IF (device is GPU) AND (task is Divisible and Agglomerative) THEN go to 3 ELSE go to 4
3. Split the given computational load into streams and use asynchronous transfers to overlap communication with computation
in CPU + GPU Environment (3) - Streaming
DIV2 STRATEGY
- SUBDIVIDE the n_i computational chunks using the DIV2 STRATEGY; no prior knowledge of the performance of the application is needed!
- The next stream has half the load of the previous stream
- The algorithm may continue splitting the workload until the last stream is assigned a load equal to 1
BANDWIDTH-AWARE DIV2 STRATEGY
- Interconnection bandwidth is subject to the amount of data that has to be transferred, not to the application-specific demands
- Run small pre-calibration tests for HOST-TO-DEVICE and DEVICE-TO-HOST transfers
- Tests can be stopped when saturation points are detected, or when transfers reach a desired value (e.g. 60% of the theoretical bandwidth)
CASE STUDY: n_min_size = 4 (Stream 0, Stream 1, Stream 2)
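The DIV2 splitting rule can be sketched as follows; plain DIV2 halves down to a stream of load 1, while the bandwidth-aware variant stops once a stream would fall below the pre-calibrated minimum size (here the case-study value n_min_size = 4). The function name and the choice of giving the current stream the larger half are illustrative:

```python
def div2_streams(n, n_min=1):
    """DIV2 strategy: each new stream gets half the load of the
    previous one; plain DIV2 splits down to a load of 1 (n_min=1),
    the bandwidth-aware variant stops at the calibrated minimum."""
    streams = []
    while n > n_min:
        half = n // 2
        streams.append(n - half)  # current stream takes the larger half
        n = half
    streams.append(n)             # last stream keeps the remaining load
    return streams

# plain DIV2:                       div2_streams(32)           -> [16, 8, 4, 2, 1, 1]
# bandwidth-aware (n_min_size = 4): div2_streams(32, n_min=4)  -> [16, 8, 4, 4]
```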
in CPU + GPU Environment (4)
1. All the p computational units execute N_1/p 2D FFT batches in parallel: n_i = N_1/p, 1 ≤ i ≤ p
2. IF (device is GPU) AND (task is Divisible and Agglomerative) THEN go to 3 ELSE go to 4
3. Split the given computational load into streams and use asynchronous transfers to overlap communication with computation
4. Execute and record the execution times: t_i(N_1/p)
5. IF max_{1 ≤ i,j ≤ p} {(t_i(N_1/p) - t_j(N_1/p)) / t_i(N_1/p)} ≤ ε
   THEN the even distribution solves the problem and the algorithm stops;
   ELSE the performance of the devices is calculated, such that: Perf_i(N_1/p) = nf(N_1/p) / t_i(N_1/p), 1 ≤ i ≤ p
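Step 5's stopping test compares per-device times pairwise, but the maximum of (t_i - t_j)/t_i over all pairs is reached between the slowest and the fastest device, so it can be checked directly. An illustrative helper (not the authors' code):

```python
def balanced(times, eps):
    """True if max over i,j of (t_i - t_j) / t_i is within eps,
    i.e. the even distribution already balances the load."""
    return (max(times) - min(times)) / max(times) <= eps
```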
in CPU + GPU Environment (5-10)
1. Traditional approach: the performance of each device is modeled as a constant: Perf_i(x) = Perf_i(N_1/p), 1 ≤ i ≤ p
2. GPU-specific modeling, using the values obtained from the streaming execution:
   - HostToDevice bandwidth
   - DeviceToHost bandwidth
   - GPU kernel performance
3. Incorporate the streaming results
in CPU + GPU Environment (11-12)
1. Draw upper (U) and lower (L) lines through the following points:
   (0,0), (N_1/p, max_i {Perf_i(N_1/p)})
   (0,0), (N_1/p, min_i {Perf_i(N_1/p)})
2. Let x_i^(U) and x_i^(L) be the intersections with Perf_i(x).
   IF there exists x_i^(L) - x_i^(U) > 1 THEN go to 3 ELSE go to 5
3. Bisect the angle between U and L by the line M, and calculate the intersections x_i^(M)
4. IF Σ_i x_i^(M) ≤ N_1 THEN U = M ELSE L = M; REPEAT from 2
5. Employ the streaming strategy on the calculated workload value
in CPU + GPU Environment (13) - Streaming
STREAMING STRATEGY
- The results obtained using the DIV2 STRATEGY characterize the application (e.g. its communication-to-computation ratio)
- The workload size for the next stream should be chosen so as to OVERLAP TRANSFERS WITH COMPUTATION in the previous stream
BANDWIDTH-AWARE STREAMING STRATEGY
- Reuses the MINIMAL WORKLOAD SIZE FROM THE DIV2 STRATEGY (obtained via the HOST-TO-DEVICE and DEVICE-TO-HOST tests)
- IF (n_curr ≥ n_min_size) THEN use the strategy (continue overlapping) ELSE restart the strategy on the n_curr load
- In case the load drops under n_min_size, the strategy is restarted on the remaining load
in CPU + GPU Environment (14) - Streaming
CASE STUDY: 2D FFT BATCH
- About 83% of the total execution time goes on data transfers: HOSTTODEVICE transfers 46%, KERNEL execution 17%, DEVICETOHOST transfers 37%
BANDWIDTH-AWARE STREAMING STRATEGY
- n_min_size = 4; overlap the HOSTTODEVICE transfers and KERNEL execution of two streams
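The case-study breakdown makes the motivation for overlap explicit; a one-line check of the arithmetic (the fractions are the slide's measured values):

```python
# measured fractions of total GPU-side time for the 2D FFT batch
h2d, kernel, d2h = 0.46, 0.17, 0.37
transfers = h2d + d2h  # -> 0.83, i.e. ~83% of the time is data movement
```

Since transfers dominate the kernel time, overlapping the HostToDevice transfer of one stream with the kernel of another is where the streaming strategy recovers most of its time.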
in CPU + GPU Environment (16-26)
1. Refine the performance models with the newly obtained results
2. GPU-specific modeling, using the values obtained from the streaming execution:
   - HostToDevice bandwidth
   - DeviceToHost bandwidth
   - GPU kernel performance
3. Incorporate the streaming results: for each stream, for each stream restart, and for every stream combination
4. Repeat the bisection-based partitioning and the streaming strategy on the refined models
Summary and Future Work
COLLABORATIVE ENVIRONMENT FOR HETEROGENEOUS COMPUTERS
TRADITIONAL APPROACHES FOR PERFORMANCE MODELING
- Approximate the performance using a number of points equal to the number of iterations (in this case, 3 POINTS per device)
PRESENTED APPROACH FOR PERFORMANCE MODELING
- Models the performance using MORE THAN 30 POINTS, in this case
- COMMUNICATION-AWARE: schedules with respect to the limited and asymmetric interconnection bandwidth
- Employs STREAMING STRATEGIES to overlap communication with computation across devices
- BUILDS SEVERAL PER-DEVICE MODELS AT THE SAME TIME: OVERALL PERFORMANCE for each device + STREAMING GPU PERFORMANCE + HOSTTODEVICE BANDWIDTH modeling + DEVICETOHOST BANDWIDTH modeling + GPU KERNEL PERFORMANCE modeling
Questions? Thank you