Evolution of Stream Computing Architectures: Performance of Dwarf Mine Algorithms on GPGPU

Transcript

1 Evolution of Stream Computing Architectures: Performance of Dwarf Mine Algorithms on GPGPU. Professor: Philippe O. A. Navaux. PhD student: Laércio Pilla. Institute of Informatics - UFRGS. Parallel and Distributed Processing Group

2 Index: Stream Processors Evolution; SISD Processing; Pipeline; SIMD Processing; Superscalar Architectures - ILP; Multithreading Architectures - TLP; Multi-core and Many-core Architectures; Stream Processing; Experimental Evaluation; Map Reduce; Spectral Methods; Sparse Linear Algebra; Architecture and Software Evolution; Conclusion

3 Stream Processing Evolution

4 Processor Evolution: Over time, there has always been a drive to extract more performance from CPUs to meet growing processing demands. At first, faster CPUs, made possible by higher clock rates, were enough to meet that demand. It soon became clear that the investment needed to keep making CPUs faster made them too costly; from that point on, solutions were sought through parallelism in instruction execution.

5 Serial Processing - SISD

6 Serial Processing - SISD: Traditional CPU architectures are SISD, meaning that conceptually only one operation is executed at a time. The accompanying figure shows one instruction and one data item entering at a time to produce one result.

7 Early Parallelism in CPUs: Looking at the evolution of CPUs, pipelining was one of the first techniques used to obtain temporal parallelism in instruction execution. The performance gain comes from overlapping the execution of instructions in time. In the same period, resource parallelism appeared with SIMD architectures, also known as Array Processors.

8 Pipeline Architecture

9 Pipeline Architecture: Five stages deep. The clock frequency depends on the slowest stage. More stages require more registers to hold the state between them.

10 Pipeline Architecture: No instruction parallelism. Hardware reuse. Ideal CPI = 1 (full pipeline).
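As a rough illustration of why the ideal CPI approaches 1 once the pipeline is full, the sketch below counts cycles for an idealized pipeline. It assumes no stalls or hazards, which the slide does not state; the function names and the five-stage default are illustrative only.

```python
def pipeline_cycles(n_instructions: int, n_stages: int = 5) -> int:
    """Cycles to run n instructions on an ideal pipeline (no stalls assumed)."""
    if n_instructions == 0:
        return 0
    # The first instruction needs n_stages cycles to traverse the pipeline;
    # each later instruction completes one cycle after the previous one.
    return n_stages + (n_instructions - 1)


def cpi(n_instructions: int, n_stages: int = 5) -> float:
    """Average cycles per instruction on the ideal pipeline."""
    return pipeline_cycles(n_instructions, n_stages) / n_instructions


if __name__ == "__main__":
    for n in (1, 10, 1000):
        print(n, pipeline_cycles(n), cpi(n))
```

With a single instruction the fill cost dominates (5 cycles for 1 instruction); with many instructions the average tends toward 1 cycle per instruction.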

11 SIMD Architecture

12 Parallel Processing - SIMD: In a SIMD architecture, the programming paradigm applies a single instruction to multiple data instances in parallel, for example the elements of a matrix. The accompanying figure shows one instruction acting on multiple data items.

13 SIMD Architecture: How a SIMD architecture works: multiple processing elements (PEs) are supervised by a common control unit. The same instruction is executed on different data. [Figure: array of processors (PE), each paired with a memory module (M).] The shared memory is divided into multiple modules. Each PE accesses its own memory module simultaneously.
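The organization described on this slide can be mimicked with a small sequential model. This is only a sketch: the class names ProcessingElement and ControlUnit are hypothetical, and the loop stands in for what the hardware would do in lockstep.

```python
from typing import Callable, List


class ProcessingElement:
    """One PE with its own private memory module (hypothetical model)."""

    def __init__(self, memory: List[float]) -> None:
        self.memory = memory

    def execute(self, op: Callable[[float], float], address: int) -> None:
        # Apply the broadcast instruction to this PE's local data.
        self.memory[address] = op(self.memory[address])


class ControlUnit:
    """Broadcasts the same instruction to every PE."""

    def __init__(self, pes: List[ProcessingElement]) -> None:
        self.pes = pes

    def broadcast(self, op: Callable[[float], float], address: int) -> None:
        # In hardware all PEs act at once; this loop only simulates that.
        for pe in self.pes:
            pe.execute(op, address)


if __name__ == "__main__":
    pes = [ProcessingElement([float(v)]) for v in (1, 2, 3, 4)]
    ControlUnit(pes).broadcast(lambda v: v * 2, 0)
    print([pe.memory[0] for pe in pes])
```

Each PE touches only its own module, so no two PEs contend for the same memory location during the broadcast step.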

14 ILLIAC IV SIMD Supercomputer: The ILLIAC IV was built by the University of Illinois in 1972, at the request of the Department of Defense Advanced Research Projects Agency. The ILLIAC IV was built as a large-scale array, parallel processing computer.

15 Superscalar Architecture

16 Superscalar Architectures - ILP: Next, to make better use of the hardware, superscalar architectures emerged, exploiting instruction-level parallelism (ILP). The main reason was that pipelined architectures had come to include several functional units that were used only one at a time, leaving the others idle. The IPC (instructions per cycle) was at most 1.

17 Superscalar Architecture - ILP: Beginning of the parallelism. Fetch, decode, and execute more than one instruction per cycle. Aggressive parallelism. Dispatch and out-of-order (OOO) execution.

18 Superscalar Architecture - ILP: Pipelining is still alive. Ideal IPC > 1. Instruction Level Parallelism - ILP.

19 Processors Evolution (Pawlowski, Intel, 2006)

20 Multithreading Architectures

21 Multithreading Architecture - TLP. PROBLEMS: Even with support for instruction-level parallelism, the maximum parallelism obtained was 2 instructions per cycle, leaving some units idle. MOTIVATION: Use of MULTITHREADING techniques. Gains from the parallel execution of more than one thread per cycle. TLP.

22 Multithreading Architecture - TLP: Many active threads

23 Multithreading Architecture - TLP: Only one physical processor. Multiple virtual processors. Beginning of TLP (Thread Level Parallelism). ILP (Instruction Level Parallelism) is still alive.

24 Multithreading Architecture - TLP: Small hardware increase on: PC, ROB, register file. Bottleneck on shared resources: functional units, cache memory, main memory.

25 Chip Multithreading - TLP

26 Multi-core and Many-core Architectures

27 Multi-Core Architecture. PROBLEMS: Although multithreading architectures improve performance, the bottleneck for data and instructions still occurs. Cache memory problem. MOTIVATION: Use of MULTI-CORE architectures. Gains from using simple cores, in order to better distribute resources and threads.

28 Multi-Core Architecture: It is also a form of multithreading. Many active threads. Can also combine techniques such as IMT, BMT and SMT.

29 Multi-Core Architecture: AMD Barcelona, Nehalem i7

30 Many-Core Architectures. PROBLEM: Chips with tens of cores are emerging. It is necessary to define new interconnections and new ways to communicate and bind threads. MOTIVATION: Use of MANY-CORE architectures. Gains from using tens of cores and adopting a NoC (Network-on-Chip). NOWADAYS

31 Evolution of Multi-Core = Many Core (Pawlowski, Intel)

32 Many-Core Architectures: Intel Tera-Scale. Teraflops scale. 80 VLIW cores. Each core has an integrated NoC router. Integrated L1 cache. NoC interconnection.

33 Stream Architectures

34 Stream Processing: Given a set of data (a stream), a series of operations (kernel functions) is applied to the elements of the stream. The figure shows the kernels being applied to the data streams.
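The stream and kernel model on this slide can be sketched with Python generators. This is a minimal illustration, not how a stream processor is programmed in practice; the names apply_kernel, run_pipeline, scale, and offset are hypothetical.

```python
from typing import Callable, Iterable, Iterator, Sequence

Kernel = Callable[[float], float]


def apply_kernel(kernel: Kernel, stream: Iterable[float]) -> Iterator[float]:
    """Apply one kernel to every element of a stream, lazily."""
    for element in stream:
        yield kernel(element)


def run_pipeline(stream: Iterable[float],
                 kernels: Sequence[Kernel]) -> Iterable[float]:
    """Chain kernels so the output stream of one feeds the next."""
    for kernel in kernels:
        stream = apply_kernel(kernel, stream)
    return stream


def scale(x: float) -> float:
    """Example kernel: double the element."""
    return 2.0 * x


def offset(x: float) -> float:
    """Example kernel: add one to the element."""
    return x + 1.0


if __name__ == "__main__":
    print(list(run_pipeline([1.0, 2.0, 3.0], [scale, offset])))
```

Because each kernel only sees one element at a time and keeps no state, the elements could in principle be processed in parallel, which is the property stream hardware exploits.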

35 Motivation for the Emergence of Stream Processors: The complexity of media processing (3D graphics, image compression, and signal processing) requires tens to hundreds of billions of computations per second. To reach this level of computation, media processors emerged, using special-purpose architectures designed for a specific application.

36 Stream Processing: Stream processing suits applications with the following characteristics. Compute intensity: the ratio of arithmetic operations to I/O operations is very high. Data locality: the data needed for processing are local, minimizing the need to wait for slower external data. Data parallelism: the same function is applied to the elements of a stream, which are processed simultaneously without waiting for results.
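To make the compute-intensity criterion concrete, the sketch below compares the ratio of arithmetic operations to memory operations for two textbook kernels. Treating memory accesses as the "I/O operations" is an assumption, as are the operation counts, which ignore caches and assume ideal data reuse for the matrix product.

```python
from typing import Tuple


def arithmetic_intensity(arithmetic_ops: int, memory_ops: int) -> float:
    """Ratio of arithmetic operations to memory operations."""
    return arithmetic_ops / memory_ops


def saxpy_counts(n: int) -> Tuple[int, int]:
    # y[i] = a * x[i] + y[i]: one multiply and one add per element,
    # with two loads and one store per element.
    return 2 * n, 3 * n


def matmul_counts(n: int) -> Tuple[int, int]:
    # Naive n x n matrix product: n^3 multiply-add pairs (2 ops each).
    # Idealized traffic: read A and B once, write C once (3 * n^2),
    # which assumes perfect on-chip reuse.
    return 2 * n ** 3, 3 * n ** 2


if __name__ == "__main__":
    print("saxpy :", arithmetic_intensity(*saxpy_counts(1024)))
    print("matmul:", arithmetic_intensity(*matmul_counts(1024)))
```

Under these assumptions the vector update stays below one operation per memory access regardless of size, while the matrix product's ratio grows with n, which is why the latter is the better fit for stream hardware.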

37 What is a Stream Processor? A processor that is optimized to execute a stream program. Features include: Exploit parallelism: TLP with multiple processors; DLP with multiple clusters within each processor; ILP with multiple ALUs within each cluster. Exploit locality with a bandwidth hierarchy: kernel locality within each cluster; producer-consumer locality within each processor. Many different possible architectures.

38 Stream Processors: Act as accelerators. Compared to the usual multi-core processors, stream processors have: higher performance, higher parallelism, better energy efficiency (flop/watt). Example: NVIDIA's CUDA architecture.

39 Stream Processors: Execution model. [Figure: execution model; labels: CPU, GPU, Process, Time, Kernel.]

40 Stream Processors: CUDA's processing hierarchy. Logical to physical mapping: Thread to Scalar Processor (SP); Block to Streaming Multiprocessor (SM); Grid (kernel) to GPU. [Figure: a grid of blocks (0,0), (1,0), (0,1), (1,1), ..., (x,0), (x,1); SP 1 ... SP x inside an SM; SM1 ... SMn inside the GPU.]
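The grid, block, and thread hierarchy can be emulated sequentially to show how each thread derives the element it works on. This is a plain-Python sketch, not CUDA code: launch is a hypothetical helper, and only the index formula mirrors the usual one-dimensional CUDA idiom (blockIdx.x * blockDim.x + threadIdx.x).

```python
from typing import Callable, List


def launch(kernel: Callable[..., None], grid_dim: int, block_dim: int,
           *args) -> None:
    """Sequentially emulate a 1-D CUDA-style kernel launch (hypothetical)."""
    for block_idx in range(grid_dim):        # blocks are scheduled onto SMs
        for thread_idx in range(block_dim):  # threads run on the SM's SPs
            kernel(block_idx, block_dim, thread_idx, *args)


def vector_add(block_idx: int, block_dim: int, thread_idx: int,
               a: List[float], b: List[float], out: List[float]) -> None:
    # Global index, as in blockIdx.x * blockDim.x + threadIdx.x.
    i = block_idx * block_dim + thread_idx
    if i < len(out):  # guard: the grid may hold more threads than elements
        out[i] = a[i] + b[i]


if __name__ == "__main__":
    n, block_dim = 10, 4
    grid_dim = (n + block_dim - 1) // block_dim  # round up to cover all n
    a, b, out = list(range(n)), [10] * n, [0] * n
    launch(vector_add, grid_dim, block_dim, a, b, out)
    print(out)
```

With 10 elements and 4 threads per block, 3 blocks are launched and the last two threads of the final block fall outside the array, which is why the bounds check is needed.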

41 Stream Processors: CUDA's memory hierarchy (older models). [Figure labels: SM 1 ... SM n, each with SP 1 ... SP 8, Threads, Registers, Shared Memory, Constants Cache, Textures Cache; GPU's Memory: Local Memory, Global Memory, Constants Memory, Textures Memory.]

42 Stream Processors: CUDA's memory hierarchy (newer models). [Figure labels: SM 1 ... SM n, each with SP 1 ... SP x, Threads, Registers, Shared Memory, L1 Cache; L2 Cache; GPU's Memory: Local Memory, Global Memory, Constants Memory, Textures Memory.]

43 Experimental Evaluation of GPGPU Processing (Laercio Pilla)

44 Experimental Evaluation: Performance comparison, stream processors vs. multi-cores. Core 2 Duo: baseline. Core 2 Duo + GTX 280: second-generation GPU, 1 GB memory. 2x Nehalem. Nehalem + GTX 480: third-generation GPU, 1 GB memory. Xeon + Tesla C1060: second-generation GPU, 4 GB memory, lower frequency.

45 Experimental Evaluation: NAS Parallel Benchmarks (NPB). Metrics: time and MOPS. Baseline: parallel time on Core 2 Duo. Use of double-precision floating-point operations. Statistical confidence (minimum of 20 runs). CUDA on the GPUs. OpenMP on the CPUs.

46 Dwarf Mine - Berkeley: 13 Dwarfs. A classification that organizes algorithmic methods according to their behavioral patterns of computation and communication, independent of implementation.

47 Computational Patterns

48 Dwarf Mine: 1. Dense Linear Algebra; 2. Sparse Linear Algebra; 3. Spectral Methods; 4. N-Body; 5. Structured Grids; 6. Unstructured Grids; 7. MapReduce; 8. Combinational Logic; 9. Graph Traversal; 10. Dynamic Programming; 11. Backtrack & Branch+Bound; 12. Construct Graphical Models; 13. Finite State Machines

49 1. Map Reduce: Map + Reduce. EP benchmark. Embarrassingly parallel. High arithmetic intensity. Regular computations. Regular memory access.

50 1. Map Reduce: [Figure: Map ... Reduce ...]
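The Map + Reduce pattern can be sketched with independent tasks whose partial results are combined at the end. This is not the NPB EP code: the Monte Carlo task, the seeds, and the names map_task and map_reduce are assumptions chosen only to show tasks that share no state.

```python
import random
from functools import reduce
from typing import Iterable


def map_task(seed: int, samples: int = 1000) -> int:
    """Independent task: count random points inside the unit quarter-circle."""
    rng = random.Random(seed)  # private generator, so tasks share no state
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def map_reduce(seeds: Iterable[int], samples: int = 1000) -> int:
    """Map every seed to a partial count, then reduce the counts by summing."""
    partial = map(lambda s: map_task(s, samples), seeds)  # Map step
    return reduce(lambda a, b: a + b, partial, 0)         # Reduce step


if __name__ == "__main__":
    tasks, samples = 8, 1000
    total = map_reduce(range(tasks), samples)
    print("pi estimate:", 4.0 * total / (tasks * samples))
```

Since no task reads another task's data, the map step could run on any number of processors; only the final sum requires communication.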

51 1. Map Reduce Results: Newer GPU up to 10x better than 2x Nehalem. [Chart annotation: bigger is better.]

52 2. Spectral Methods: FFT. FT benchmark. Low arithmetic intensity. Regular computations. Different patterns for memory access. Data dependencies. Communications.
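A textbook radix-2 FFT shows where the varying access patterns and data dependencies come from. This is a generic recursive Cooley-Tukey sketch, not the NPB FT implementation, and it assumes the input length is a power of two.

```python
import cmath
from typing import List, Sequence


def fft(x: Sequence[complex]) -> List[complex]:
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or (n & (n - 1)) != 0:
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])  # strided reads: the stride doubles at each level
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        # Butterfly: each output pair depends on results from both halves.
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


if __name__ == "__main__":
    print(fft([1, 1, 1, 1]))
```

Every level reads its input with a different stride and cannot start before the level below has finished, which matches the memory-pattern and dependency items listed on the slide.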

53 2. Spectral Methods: [Figure: Steps 1 to 3 over time, each drawn on x, y, z axes.]

54 2. Spectral Methods Results: GPU performance is affected by non-coalesced memory access. [Chart annotation: bigger is better.]
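The cost of non-coalesced access can be approximated by counting how many memory segments a group of threads touches. The model below is a deliberate simplification: the group size of 32 threads, the 8-byte elements, and the 128-byte segments are assumptions, and real coalescing rules differ between GPU generations.

```python
def transactions(stride: int, warp_size: int = 32,
                 element_bytes: int = 8, segment_bytes: int = 128) -> int:
    """Count distinct memory segments touched when thread t reads element t * stride."""
    segments = set()
    for thread in range(warp_size):
        address = thread * stride * element_bytes
        segments.add(address // segment_bytes)
    return len(segments)


if __name__ == "__main__":
    for stride in (1, 2, 16):
        print("stride", stride, "->", transactions(stride), "segments")
```

In this model, unit-stride reads by 32 threads fall into 2 segments, while a stride of 16 elements puts every thread in its own segment, so the same amount of useful data costs 16 times as many memory transactions.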

55 3. Sparse Linear Algebra: Sparse matrices. CG benchmark. Low performance. Regular computations. Irregular memory access. Data dependencies.

56 3. Sparse Linear Algebra: [Figure with labels x and y.]
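A sparse matrix-vector product, the core operation of conjugate gradient methods, illustrates the irregular memory access mentioned for this dwarf. The sketch uses the common CSR (compressed sparse row) layout as an assumption; the slides do not say which storage format the benchmark code uses.

```python
from typing import List, Sequence


def csr_matvec(values: Sequence[float], col_idx: Sequence[int],
               row_ptr: Sequence[int], x: Sequence[float]) -> List[float]:
    """Compute y = A @ x for a matrix A stored in CSR form."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        # Nonzeros of this row live in values[row_ptr[row]:row_ptr[row + 1]].
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]  # indirect, irregular read of x
        y[row] = acc
    return y


if __name__ == "__main__":
    # A = [[4, 0, 1],
    #      [0, 3, 0],
    #      [2, 0, 5]]
    values = [4.0, 1.0, 3.0, 2.0, 5.0]
    col_idx = [0, 2, 1, 0, 2]
    row_ptr = [0, 2, 3, 5]
    print(csr_matvec(values, col_idx, row_ptr, [1.0, 2.0, 3.0]))
```

The reads of x go through col_idx, so neighboring threads would generally touch scattered addresses, and rows with different numbers of nonzeros give threads uneven amounts of work.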

57 3. Sparse Linear Algebra Results: Small speedups; dependencies limit the parallelism. [Chart annotation: bigger is better.]

58 Stream Processors: Some experimental conclusions. Very good for embarrassingly parallel apps. For other apps, there is still a lot of work to be done: new FFT libraries, use of reduced precision (single/float). Newer GPUs do not automatically bring better performance.

59 Architecture and Software Evolution

60 Multiple cores / Heterogeneity: Multiple cores and customization will be the major drivers for future microprocessor performance (total chip performance). Multiple cores can increase computational throughput (for example, four cores could yield a 1x to 4x increase), and customization can reduce execution latency. Chip architects must consider more radical options of smaller cores in greater numbers, along with innovative ways to coordinate them. Heterogeneous implementations are an important part of increasing performance. (Borkar & Chien 2011)

61 Multiple cores / Heterogeneity (Borkar & Chien 2011)

62 Heterogeneity: Choices in multiple cores. Core size and number of cores, and the related choices in a heterogeneous implementation, increase performance. (Borkar & Chien 2011)

63 GPUs in HPC Top

64 Fusing CPU and GPU: Fusing CPU and GPU cores reduces data transfer overheads to a great extent. AMD Fusion, Intel Knights Ferry, and NVIDIA Tegra are all steps in the right direction.

65 Current Architecture of GPUs (Mayank Daga, Ashwin M. Aji, and Wu-chun Feng)

66 Data Transfer Overhead (Mayank Daga, Ashwin M. Aji, and Wu-chun Feng)

67 New Architecture for CPUs - GPUs (Mayank Daga, Ashwin M. Aji, and Wu-chun Feng)

68 AMD Fusion APU: A Fused CPU+GPU Thread (Mayank Daga, Ashwin M. Aji, and Wu-chun Feng)

69 Nvidia Tegra 2 Harmony development board (Mehaut)

70 Nvidia Tegra 2 (NVIDIA)

73 Mont Blanc: Project based on ARM & GPU (Mateo Valero 2011)

74 Mont Blanc: 200 PF machine on 10 MW (Mateo Valero 2011)

75 New programming standard: OpenACC

76 Conclusion

77 Conclusions - Improvements: The following points are performance bottlenecks for GPGPUs today but can improve as technology advances: cache size, memory size, transfers between CPU and GPU, double precision (already improved).

78 Conclusions - Bottlenecks: The points below are bottlenecks inherent to how GPGPU architectures work: applications with little parallelism or irregular parallelism; data starvation (the bandwidth may be high, but it does not help if the accesses are irregular); lack of data reuse (arithmetic intensity).

79 THANK YOU! navaux@inf.ufrgs.br