Extending the application of GPUs beyond processing accelerators




Extending the application of GPUs beyond processing accelerators. Leonel Sousa, Aleksandar Ilic, and Frederico Pratas. CE Lab Colloquium, TU Delft, 26 August 2009.

Motivation: GPU evolution. Graphics Processing Units (GPUs) have been present in integrated circuits since the 1980s, e.g., the Alpha-Numeric Television Interface Circuit (ANTIC), an LSI chip used in Atari microcomputers. GPU development has been thrust mainly by the gaming industry and market (e.g., Prince of Persia: 1st version 1989, 2nd version 2008). GPUs offer not only high computational power, but computational power growing at a high rate.

Motivation: CPU vs. GPU performance. Originally, GPUs were dedicated only to graphics, but their high computational capacity awakened interest in using them for other applications. [Chart: peak CPU vs. GPU performance in GFLOPS over time.]

Motivation: GPGPU. General-Purpose computation on Graphics Processing Units (GPGPU: www.gpgpu.org): acceleration of specific functions, mainly operating as a coprocessor in host computers. Examples: scientific computation (e.g., fluid dynamics); signal processing (FFT, filtering, image and video processing, ...); high-performance libraries (BLAS, ...).

Motivation. GPUs are hosted in computers and accessed through system buses (PCIe); when input and output data transfers dominate, no real speedup is achieved. To exploit the computational power of GPUs in commodity computers, either treat CPU + GPU as a heterogeneous system with slow communication, or use the GPU as a computer component on which to design hardware processing structures.

General Outline. GPU architectures and programming models. Extending the application of GPUs: CHPS, a Collaborative execution environment for Heterogeneous Parallel Systems; SDHA, applying the Stream-based computing model to Design Hardware Accelerators. Conclusions and future work.

Graphics Operations: Rasterization-Based Rendering. [Pipeline diagram: the vertex processor maps vertices into a reference system; the rasterizer converts primitives into pixels; the pixel processor colors objects with textures; the result is displayed to the screen.]

Architecture of a Traditional GPU. [Block diagram: vertex processor, rasterizer, and 12 pixel processors.]

Flow-Model and Caravela Programming Tools for GPGPU. [Flow-model diagram: input data streams and constant values feed a kernel (the "processor"), which produces output data streams; a memory effect allows an output stream to be reused as an input.] The flow-model and the Caravela programming tools build on OpenGL and DirectX. S. Yamagiwa and L. Sousa, "Caravela: A Novel Stream-Based Distributed Computing Environment", IEEE Computer, May 2007.

GPUs Based on the Tesla Architecture. Streaming Multiprocessor (SM): 8 cores (SPs) plus shared memory. Texture/Processor Cluster (TPC): 2 SMs. SP: a unified vertex/pixel programmable processor.

GPUs Based on the Tesla Architecture. A warp is a group of 32 parallel threads; each SM manages a pool of 24 warps. SIMT (single-instruction, multiple-thread) execution controls the execution and branching behavior of each thread. The large number of hardware-scheduled threads masks memory latency. NVIDIA GT200 GPUs: a collection of up to 30 SIMD SMs; high memory bandwidth, up to 150 GB/s; connected via the PCIe bus, with 2 GB/s bandwidth.

CUDA Programming Model. Compute Unified Device Architecture (CUDA): divide a kernel into thousands of parallel threads; organize these threads into blocks, which run independently and in parallel on different SMs and share 16 KB of memory each; group blocks into grids. All threads have access to global memory. Programming interface: standard C with minimal extensions for parallel programming.
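As a rough illustration of the thread hierarchy just described, the following Python sketch (the saxpy example and all names are hypothetical, not code from the talk) emulates a one-dimensional CUDA launch, with the standard global-index computation i = blockIdx.x * blockDim.x + threadIdx.x:

```python
# Illustrative sketch, not CUDA itself: emulate how a 1-D grid of thread
# blocks covers a flat array. The saxpy kernel and all names are
# hypothetical examples.

def launch_kernel(kernel, grid_dim, block_dim, *args):
    """Run `kernel` once per (block, thread) pair, like a 1-D CUDA launch."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)

def saxpy_kernel(block_idx, thread_idx, block_dim, a, x, y, out):
    # Same index computation as CUDA: i = blockIdx.x * blockDim.x + threadIdx.x
    i = block_idx * block_dim + thread_idx
    if i < len(x):              # guard: the last block may overhang the data
        out[i] = a * x[i] + y[i]

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0] * 5
out = [0.0] * 5
launch_kernel(saxpy_kernel, 2, 3, 2.0, x, y, out)   # 2 blocks of 3 threads
# out is now [12.0, 14.0, 16.0, 18.0, 20.0]
```

The bounds guard mirrors real CUDA practice: the grid is sized to cover the data, so threads past the end must do nothing.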

Work. CHPS: Collaborative Execution Environment for Heterogeneous Parallel Systems. SDHA: Applying the Stream-Based Computing Model to Design Hardware Accelerators.

CHPS. Commodity computers are heterogeneous systems: multi-core general-purpose processors (CPUs); many-core graphics processing units (GPUs); special accelerators, co-processors, FPGAs, DSPs. They offer huge collaborative computing power, not yet fully explored: most research so far targets the usage of one of these devices at a time, mainly for domain-specific computations. Heterogeneity makes problems much more complex and raises many programming challenges.

CHPS: Heterogeneous Systems. Master-slave execution paradigm with distributed-memory programming techniques. CPU (master): global execution controller; accesses the whole global memory. Interconnection buses: reduced communication bandwidth compared to distributed-memory systems. Underlying devices (slaves): different architectures and programming models; computation performed using local memories.

CHPS: Programming Challenges. Computation partitioning: must fulfill device capabilities and limitations and achieve optimal load balancing. Data migration: significant and usually asymmetric; requires detailed modeling; a potential execution bottleneck. Synchronization: different programming models. Application optimization: per device type and vendor-specific; vendors provide high-performance libraries and software; a very large set of parameters and solutions affects performance.

CHPS: Unified Execution Model. Primitive jobs: minimal problem portions for parallel execution (1st agglomeration); balanced granularity, i.e., a partition fine enough to sustain execution on every device in the system, but coarse enough to limit data-movement overheads. Job queue: classifies the primitive jobs to support a specific parallel execution scheme; accommodates batching of primitive jobs into a single job according to individual device demands (2nd agglomeration).

CHPS: Unified Execution Model. Device query: identifies all underlying devices; holds per-device information (device type, current status, requested workload, and performance history). Scheduler: adopts a scheduling scheme according to device capabilities, availability, and restrictions (currently FCFS); forms execution pairs <device, computational_portion>. Upon execution completion, the data is transferred back to the host and the device is free to serve another request from the scheduler.
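A minimal sketch of this scheduling loop (all names are hypothetical; this only mimics the behavior described above, not the CHPS code): idle devices receive batches of primitive jobs from the job queue in FCFS order, forming <device, computational_portion> pairs:

```python
# Hypothetical sketch of the CHPS scheduling step: an FCFS scheduler pops
# primitive jobs from the job queue, batches them per device demand
# (the 2nd agglomeration), and forms <device, portion> execution pairs.
from collections import deque

class Device:
    def __init__(self, name, batch_size):
        self.name = name              # e.g. "GPU", "CPU"
        self.batch_size = batch_size  # jobs requested at once by this device
        self.busy = False

def schedule_fcfs(job_queue, devices):
    """Assign batches of primitive jobs to idle devices, first-come first-served."""
    pairs = []
    for dev in devices:
        if not dev.busy and job_queue:
            portion = [job_queue.popleft()
                       for _ in range(min(dev.batch_size, len(job_queue)))]
            dev.busy = True
            pairs.append((dev.name, portion))
    return pairs

jobs = deque(range(10))                       # 10 primitive jobs
devices = [Device("GPU", 4), Device("CPU", 2)]
pairs = schedule_fcfs(jobs, devices)
# pairs == [("GPU", [0, 1, 2, 3]), ("CPU", [4, 5])]; jobs 6..9 remain queued
```

When a device finishes, its `busy` flag would be cleared and the loop repeated until the queue empties, matching the request/completion cycle on the slide.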

CHPS Case Study: Dense Matrix Multiplication. Matrix multiplication C(M×N) = A(M×K) · B(K×N) is based on sub-matrix partitioning: each block is computed as c_{i,j} = Σ_{l=0}^{K/R-1} a_{i,l} · b_{l,j}, for 0 ≤ i < M/P and 0 ≤ j < N/Q, thereby creating a set of primitive jobs a_{i,l} · b_{l,j}. Devices can request several primitive jobs at once; individual updates of a block c_{i,j} are performed atomically. High-performance libraries: ATLAS CBLAS [CPU], CUBLAS [GPU]. Parameters: problem size M, N, K; primitive-job parameters P, Q, R; job queue size (M·N·K)/(P·Q·R).
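The sub-matrix partitioning can be sketched as follows (a toy Python version under my own naming; the real system dispatches such jobs to ATLAS/CUBLAS): each primitive job multiplies one block pair a_{i,l} · b_{l,j} and accumulates the result into block c_{i,j}:

```python
# Toy sketch of the primitive-job decomposition above (names hypothetical).

def block(mat, i, j, bs):
    """Extract the bs x bs sub-matrix at block coordinates (i, j)."""
    return [row[j*bs:(j+1)*bs] for row in mat[i*bs:(i+1)*bs]]

def matmul(a, b):
    """Plain dense multiply of two small blocks."""
    return [[sum(a[r][t] * b[t][c] for t in range(len(b)))
             for c in range(len(b[0]))] for r in range(len(a))]

def add_into(c_block, partial):
    """Accumulate a partial product; atomic in the real CHPS runtime."""
    for r in range(len(partial)):
        for c in range(len(partial[0])):
            c_block[r][c] += partial[r][c]

n, bs = 4, 2
nb = n // bs
A = [[r * n + c + 1 for c in range(n)] for r in range(n)]
B = [[1 if r == c else 0 for c in range(n)] for r in range(n)]  # identity

# One primitive job per (i, l, j): multiply a_{i,l} by b_{l,j}
C_blocks = {(i, j): [[0] * bs for _ in range(bs)]
            for i in range(nb) for j in range(nb)}
for i in range(nb):
    for j in range(nb):
        for l in range(nb):
            add_into(C_blocks[(i, j)],
                     matmul(block(A, i, l, bs), block(B, l, j, bs)))

# Reassemble C; with B = identity, C must equal A
C = [[C_blocks[(r // bs, c // bs)][r % bs][c % bs] for c in range(n)]
     for r in range(n)]
```

Because any device may contribute a partial product to the same c_{i,j}, the accumulation step is exactly the point that must be atomic.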

CHPS Case Study: 3D Fast Fourier Transform. H = FFT_1D(FFT_2D(h)). The traditional parallel implementation requires transpositions between FFTs applied on different dimensions and after executing the final FFT. In our implementation, only the portions of data assigned to each device are transposed, and the results are retrieved into the adequate positions after the final FFT. High-performance libraries: FFTW [CPU], CUFFT [GPU].

CHPS Case Study: 3D Fast Fourier Transform. Problem size: N1 × N2 × N3; primitive-job parameters: P1, P2, P3. 2D stage: N1 2D FFTs of size N2 × N3 in total; a primitive job batches P1 of them; job queue size N1/P1. 1D stage: N2 × N3 1D FFTs of size N1 in total; a primitive job batches P2 × P3 of them; job queue size (N2/P2) × (N3/P3).
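The two-stage decomposition above can be checked with a small pure-Python DFT (an illustrative sketch; the real implementation calls FFTW/CUFFT): the 2D stage runs N1 independent 2D transforms, and the 1D stage then runs N2 × N3 independent 1D transforms along the first dimension:

```python
# Illustrative check that a 3-D DFT decomposes into a batch of 2-D DFTs
# followed by a batch of 1-D DFTs, as in the CHPS case study.
import cmath

def dft(xs):
    """Naive 1-D DFT (stand-in for an FFTW/CUFFT call)."""
    n = len(xs)
    return [sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(xs)) for k in range(n)]

def dft2(plane):
    """2-D DFT via row transforms followed by column transforms."""
    rows = [dft(r) for r in plane]
    cols = [dft(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

N1 = N2 = N3 = 2
h = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]

# Stage 1: N1 independent 2-D FFTs of size N2 x N3 (the 2-D primitive jobs)
stage1 = [dft2(plane) for plane in h]

# Stage 2: N2*N3 independent 1-D FFTs of size N1 along the first dimension
H = [[[0j] * N3 for _ in range(N2)] for _ in range(N1)]
for j in range(N2):
    for k in range(N3):
        line = dft([stage1[i][j][k] for i in range(N1)])
        for i in range(N1):
            H[i][j][k] = line[i]

# H[0][0][0] is the sum of all samples (36); H[1][0][0] is the
# difference of the two planes' sums (10 - 26 = -16).
```

Gathering the column `[stage1[i][j][k] for i in range(N1)]` is where the transposition cost on the slides comes from: the 1D stage reads along the slowest-varying dimension.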

CHPS Experimental Results: Optimized Dense Matrix Multiplication. Single-precision floating-point arithmetic (the largest problem executable by the GPU). Problem size: M = N = K = 4096; primitive-job parameters: P = 4096, Q = 16, R = 4096; job queue size: 32. Exhaustive search for optimal load balancing (32×32 tests): GPU 13, CPU 1. Obtained speedups: 4.5 vs. 1 CPU core; 2.3 vs. 2 CPU cores; 1.2 vs. 1 GPU. Experimental setup: Intel Core 2 (2.33 GHz per core, 2048 MB global memory) and NVIDIA GeForce 8600GT (0.54 GHz per core, 256 MB global memory); high-performance software: ATLAS 3.8.3 and FFTW 3.2.1 [CPU], CUBLAS 2.1 and CUFFT 2.1 [GPU].

CHPS Experimental Results: Complex 3D FFT [2D FFT Batch]. Complex 3D fast Fourier transform; the whole problem cannot be executed on the GPU alone, whose maximal workload is 5 × 2D FFTs (512×512). Problem size: N1 = N2 = N3 = 512; primitive-job parameters: P1 = 16 (P2, P3 not used); job queue size: 32; primitive job: 2D FFT (512×512). Exhaustive search for optimal load balancing (160 tests): GPU 4, CPU 1. Obtained speedups: 3.2 vs. 1 CPU core; 1.6 vs. 2 CPU cores; vs. 1 GPU: not applicable.

CHPS Experimental Results: Complex 3D FFT [1D FFT Batch]. Complex 3D fast Fourier transform; the problem cannot be executed on the GPU alone, whose maximal workload is 13 × 1D FFTs (512). Results include transpose time. Problem size: N1 = N2 = N3 = 512; primitive-job parameters: P2 = 16, P3 = 16 (P1 not used); job queue size: 32 × 32; primitive job: 1D FFTs (512). Exhaustive search for optimal load balancing (416 tests): GPU 12, CPU 4. Obtained speedups: 2.2 vs. 1 CPU core; about 1 vs. 2 CPU cores; vs. 1 GPU: not applicable.

CHPS Experimental Results: Complex 3D FFT [Total]. Obtained speedups: 2.8 vs. 1 CPU core; 1.3 vs. 2 CPU cores; vs. 1 GPU: not applicable. Transposition time dominates the execution for higher workloads, and memory transfers are significant.

General Outline. CHPS: Collaborative Execution Environment for Heterogeneous Parallel Systems. SDHA: Applying the Stream-Based Computing Model to Design Hardware Accelerators, a Case Study.

SDHA: Motivation. Co-processing hardware can be a very efficient solution to accelerate certain types of applications; however, mapping algorithms into hardware is not straightforward. It is comparatively easier to program according to the stream-based computing model, and many stream-based algorithms have been proposed for different applications, e.g., for running on GPUs.

SDHA: Proposed Work. The stream-based description is suitable as an approximation to a hardware description: it decouples computation from memory accesses and exposes the maximum parallelism available in the application. The stream-based description can thus be used as an intermediate step: besides being easier and faster to implement, it also provides the hardware designer with early-stage programmable prototyping platforms.

SDHA Overview: Stream-Based Computation. A data stream is a sequence of data items. In stream computing, a kernel, comprising several operations in sequence, is applied to one or more input data streams to produce an output data stream.
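A minimal sketch of this model (all names are illustrative, not from the talk): a kernel of sequenced operations mapped over the corresponding items of the input streams:

```python
# Minimal stream-computing sketch: the kernel (a fixed sequence of
# operations) is applied to aligned items of the input streams,
# producing one output stream. All names are illustrative.

def apply_kernel(kernel, *streams):
    """Apply `kernel` to corresponding items of the input data streams."""
    return [kernel(*items) for items in zip(*streams)]

def kernel(a, b):
    t = a * b          # operation 1 of the kernel
    return t + 1.0     # operation 2 of the kernel

out = apply_kernel(kernel, [1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
# out == [5.0, 11.0, 19.0]
```

Because the kernel touches only its own stream items, every application is independent, which is what makes the model map naturally both to GPUs and to pipelined hardware.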

SDHA Overview: Hardware Design. Efficient solutions, but complex design. Customizable: optimized differently for each application. Adaptable/reconfigurable: different applications can be dynamically loaded. It is hard to automatically map these algorithms into hardware, and hard to debug hardware accelerators compared with software solutions; we try to overcome these difficulties with stream-based computing (GPUs).

SDHA Case Study: MrBayes. MrBayes is a bioinformatics application that performs Bayesian inference of evolutionary (phylogenetic) trees. Phylogenetic trees are used, for instance, in "Origins and evolutionary genomics of the 2009 swine-origin H1N1 influenza A epidemic" (http://www.nature.com/nature/journal/v459/n7250/full/nature08182.html). A current real-world phylogenomic data set study contains 1,500 genes and requires 2,000,000 CPU hours on a BlueGene/L system [1]. [1] M. Ott et al., "Large-scale Maximum Likelihood-based Phylogenetic Analysis on the IBM BlueGene/L", in SC'07, NY, USA.

SDHA Case Study: MrBayes. Phylogenetic tree: origin and evolution of H1N1 (simplified); reconstruction of the sequence of reassortment events leading up to the emergence of S-OIV. G. J. D. Smith et al., Nature 459, 1122-1125 (2009), doi:10.1038/nature08182.

SDHA Case Study: MrBayes. Profiling showed that more than 85% of the program execution time is spent in two Phylogenetic Likelihood Functions (PLFs): CondLikeDown and CondLikeRoot. Data structures: the conditional likelihood vector and the nucleotide substitution matrix (A - Adenine, C - Cytosine, G - Guanine, T - Thymine).

SDHA: MrBayes Stream-Based Parallelization. [Figure: the primitive kernel and its stream-based parallelization.]

SDHA: MrBayes Stream-Based Parallelization. The stream-based version was implemented with CUDA for GPGPU: the PLFs are executed on the GPU, while the remaining part of the program is computed on the CPU. Each thread computes one inner product, which can be seen as a reduction; 10240 threads are used. The data streams are split between blocks; 40 blocks are used. More details about the CUDA implementation and the algorithm parallelization can be found in F. Pratas et al., "Fine-grain Parallelism using Multi-core, Cell/BE, and GPU Systems: Accelerating the Phylogenetic Likelihood Function", in ICPP'09, Vienna, Austria.
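The per-thread work can be pictured as follows (a hypothetical Python sketch of the mapping just described, not the actual MrBayes/CUDA code): each inner product of the conditional-likelihood update is one reduction, and on the GPU each such reduction is assigned to one thread:

```python
# Hypothetical sketch of the per-thread reduction described above;
# not the actual MrBayes kernel. Each entry of the updated conditional
# likelihood vector is one inner product (one GPU thread's work).

def cond_like_update(cond_like, subst_matrix):
    """out[s] = sum_t P[s][t] * cl[t] over the 4 nucleotide states."""
    n = len(subst_matrix)
    out = [0.0] * n
    for s in range(n):          # on the GPU: one thread per inner product
        out[s] = sum(subst_matrix[s][t] * cond_like[t] for t in range(n))
    return out

# Identity substitution matrix leaves the likelihoods unchanged
P = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
cl = [0.1, 0.2, 0.3, 0.4]
updated = cond_like_update(cl, P)
# updated == [0.1, 0.2, 0.3, 0.4]
```

Since the inner products for different sites and states are independent, thousands of them can run concurrently, which is how the 10240-thread configuration keeps the GPU busy.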

SDHA: MrBayes, From Stream-Based to Reconfigurable Hardware. [Figure: parallel hardware obtained by deeper pipelining and by reusing operations.]

SDHA: MrBayes, From Stream-Based to Reconfigurable Hardware. [Figure: resulting hardware architecture.]

SDHA: Experimental Setup. Prototyping platform: the stream-based algorithm is programmed in CUDA; the GPU is used for optimization and debugging; the algorithm is then ported to VHDL with minimal effort. A simple control unit can be shared by several units, following a SIMD approach.

SDHA: Experimental Setup. The stream-based implementation was programmed with the CUDA API v2.1. The hardware was described in VHDL, and the results were obtained with Xilinx ISE 10.1 after place-and-route. The baseline is a general-purpose architecture, used as the reference system. MrBayes version 3.1.2 was run with input datasets obtained from Seq-Gen; datasets are named according to X_Y, where X relates to the number of PLF calls and Y to the size of the data set.

SDHA: Experimental Results. Scalability of the accelerated kernel for different numbers of FCAs on the Virtex5 FX200T. We implemented super-pipelined FCAs with 8-stage floating-point units. Multiples of 4 FCAs were used to simplify the control, due to the number of discrete rates.

SDHA: Experimental Results. The CondLikeRoot function has one extra inner product per conditional likelihood element, and therefore extra floating-point units. The obtained frequency is similar for the two FPGAs when using approximately the same number of FCAs. The frequency decreases as the number of FCAs increases, due to FPGA occupancy and the associated routing overhead. The latency is constant because the parallelism is the only factor modified across the different topologies.

SDHA: Experimental Results. Results are relative to the kernel functions; the FPGA can be reconfigured to implement one of the PLFs at a time, and it is assumed that the input data fits in the main memory. The fastest architecture is the Virtex5 FX200T with 64 FCAs, showing a speedup of 10.2x; the GPU achieves a speedup of 4.3x. Although the two Virtex5 devices differ by 8x in the number of FCAs, the speedup obtained is only 4x, because of their different operating frequencies.

Conclusions and Future Work. CHPS: Aleksandar Ilic and Leonel Sousa, "Collaborative Execution Environment for Heterogeneous Parallel Systems", submitted to the 18th Euromicro International Conference on Parallel, Distributed and Network-Based Computing (PDP), 2010. SDHA: Frederico Pratas and Leonel Sousa, "Applying the Stream-Based Computing Model to Design Hardware Accelerators: A Case Study", in International Workshop on Systems, Architectures, Modeling, and Simulation (SAMOS), Springer, July 2009.

CHPS. The proposed unified execution environment achieves significant speedups relative to single-CPU-core execution: 4.5 for dense matrix multiplication and 2.8 for the complex 3D fast Fourier transform. Future work: implementation on more heterogeneous systems (more GPUs, more CPU cores, or special-purpose accelerators); asynchronous memory transfers; adoption of advanced scheduling policies; performance prediction and application self-tuning, to identify performance limits and choose which processor to use (e.g., GPU versus CPU).

SDHA. We were able to obtain a hardware implementation of the target algorithm using the stream-based model. Due to its parallel characteristics, the stream-based description naturally leads to scalable hardware architectures, and typical folding procedures were used systematically to optimize the operations in hardware. The results show that the solution is efficient, even though it is not a completely custom solution. Future work: GPUs can be used as prototyping platforms for hardware design; development of software tools to assist hardware design.

Questions? Thank you. 8/26/2009