Modernização de código em Xeon e Xeon Phi Programando para Multi-core e Many-core. Igor Freitas Intel do Brasil igor.freitas@intel.

Transcrição

1 Modernização de código em Xeon e Xeon Phi Programando para Multi-core e Many-core Igor Freitas Intel do Brasil igor.freitas@intel.com

2 Agenda Próximo passo para a Computação Exascale Modernização de código em processadores Xeon Identificando oportunidades de otimização Vetorização / SIMD Otimizando de código com semi-autovetorização Identificando oportunidades de Paralelismo (multi-threading) Modernização de código em co-processadores Xeon Phi Otimizações feitas no Xeon aplicadas no Xeon Phi 2

3 Próximo passo para a Computação Exascale 3

4 Next Step on Intel s Path to Exascale Computing Exascale Vision Next Step: KNL Systems scalable to >100 PFlop/s I/O Resiliency Standard Programming Models Power Efficiency Up to 100 Gb/s with Storm Lake integrated fabric Over 15 GF/Watt 1 Processor Performance Memory ~500 GB/s sustained memory bandwidth with integrated onpackage memory ~3X Flops and ~3X singlethread theoretical peak performance over Knights Corner 1 1 Projections based on internal Intel analysis during early product definition, as compared to prior generation Intel Xeon Phi Coprocessors, and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.

5 The Road Ahead You need the best To get the most Transistors Fabric Memory Integration Performance Optimized Hardware Choice Unified Software ROI General Purpose Approach Standards Secure Future 5

6 A Paradigm Shift for Highly-Parallel Server Processor and Integration are Keys to Future Coprocessor Fabric Memory Knights Landing Server Processor Memory Bandwidth ~500 GB/s STREAM Memory Capacity Over 25x* KNC Resiliency Systems scalable to >100 PF Power Efficiency Over 25% better than card 1 I/O Up to 100 GB/s with int fabric Cost Less costly than discrete parts 1 Flexibility Limitless configurations Density 3+ KNL with fabric in 1U 3 *Comparison to 1 st Generation Intel Xeon Phi 7120P Coprocessor (formerly codenamed Knights Corner) 1 Results based on internal Intel analysis using estimated power consumption and projected component pricing in the 2015 timeframe. This analysis is provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance. 2 Comparison to a discrete Knights Landing processor and discrete fabric component. 3 Theoretical density for air-cooled system; other cooling solutions and configurations will enable lower or higher density.

7 Knights Landing Holistic Approach to Real Application breakthroughs Platform Memory Up to 384 GB DDR4 (6 ch) Compute Intel Xeon Processor Binary-Compatible 3+ TFLOPS 1, 3X ST 2 (single-thread) perf. vs KNC 2D Mesh Architecture... Over 60 Cores... Out-of-Order Cores On-Package Memory Over 5x STREAM vs. DDR4 3 Up to 16 GB at launch Integrated Intel Omni-Path Processor Package Omni-Path (optional) 1 st Intel processor to integrate I/O Up to 36 PCIe 3.0 lanes Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit

8 Knights Landing Details PERFORMANCE 3+ TeraFLOPS of double-precision peak theoretical performance per single socket node 0 INTEGRATION Intel Omni Scale fabric integration High-performance on-package memory (MCDRAM) Over 5x STREAM vs. DDR4 1 Over 400 GB/s Up to 16GB at launch NUMA support Over 5x Energy Efficiency vs. GDDR5 2 Over 3x Density vs. GDDR5 2 In partnership with Micron Technology Flexible memory modes including cache and flat SERVER PROCESSOR Standalone bootable processor (running host OS) and a PCIe coprocessor (PCIe end-point device) Platform memory: up to 384GB DDR4 using 6 channels Reliability ( Intel server-class reliability ) Power Efficiency (Over 25% better than discrete coprocessor) 4 Over 10 GF/W Density (3+ KNL with fabric in 1U) 5 Up to 36 lanes PCIe* Gen 3.0 FUTURE Knights Hill is the codename for the 3 rd generation of the Intel Xeon Phi product family Based on Intel s 10 nanometer manufacturing technology Integrated 2 nd generation Intel Omni-Path Fabric MICROARCHITECTURE Over 8 billion transistors per die based on Intel s 14 nanometer manufacturing technology Binary compatible with Intel Xeon Processors with support for Intel Advanced Vector Extensions 512 (Intel AVX-512) 6 3x Single-Thread Performance compared to Knights Corner cores in a 2D Mesh architecture 2 cores per tile with 2 vector processing units (VPU) per core) 1MB L2 cache shared between 2 cores in a tile (cache-coherent) Based on Intel Atom core (based on Silvermont microarchitecture) with many HPC enhancements 4 Threads / Core 2X Out-of-Order Buffer Depth 8 Gather/scatter in hardware Advanced Branch Prediction High cache bandwidth 32KB Icache, Dcache 2 x 64B Load ports in Dcache 46/48 Physical/virtual address bits Most of today s parallel optimizations carry forward to KNL Multiple NUMA domain support per socket AVAILABILITY First commercial HPC systems in 2H 15 Knights Corner to Knights Landing upgrade program available today Intel Adams Pass board (1U half-width) is custom designed for Knights Landing (KNL) and will be available to system integrators for KNL launch; the board is OCP Open Rack 1.0 compliant, features 6 ch native DDR4 (1866/2133/2400MHz) and 36 lanes of integrated PCIe* Gen 3 I/O All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice. All projections are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance. 0 Over 3 Teraflops of peak theoretical double-precision performance is preliminary and based on current expecations of cores, clock frequency and floating point operations per cycle. 1 Projected result based on internal Intel analysis of STREAM benchmark using a Knights Landing processor with 16GB of ultra high-bandwidth versus DDR4 memory with all channels populated. 2 Projected result based on internal Intel analysis comparison of 16GB of ultra high-bandwidth memory to 16GB of GDDR5 memory used in the Intel Xeon Phi coprocessor 7120P. 3 Compared to 1 st Generation Intel Xeon Phi 7120P Coprocessor (formerly codenamed Knights Corner) 4 Projected result based on internal Intel analysis using estimated performance and power consumption of a rack sized deployment of Intel Xeon processors and Knights Landing coprocessors as compared to a rack with KNL processors only 5 Projected result based on internal Intel analysis comparing a discrete Knights Landing processor with integrated fabric to a discrete Intel fabric component card. 6 Binary compatible with Intel Xeon Processors v3 (Haswell) with the exception of Intel TSX (Transactionaly Synchronization Extensions) 7 Projected peak theoretical single-thread performance relative to 1 st Generation Intel Xeon Phi Coprocessor 7120P 8 Compared to the Intel Atom core (base on Silvermont microarchitecture) Continued on next page

9 Knights Landing Details MOMENTUM Aurora is currently the largest planned system at petaflop/s, to be delivered in Intel is teaming with Cray on both projects. Aurora: 50,000 nodes: Future generation Intel Xeon Phi processors (Knights Hill); 2nd generation Intel Omni-Path fabric; New memory hierarchy composed of Intel Lustre, Burst Buffer Storage, and persistent memory through high bandwidth on-package memory; Cray s Shasta platform. Cori Supercomputer at NERSC (National Energy Research Scientific Computing Center at LBNL/DOE) became the first publically announced Knights Landing based system, with over 9,300 nodes slated to be deployed in mid-2016 Trinity Supercomputer at NNSA (National Nuclear Security Administration) is a $174 million deal awarded to Cray that will feature Haswell and Knights Landing, with acceptance phases in both late and Expecting over 50 system providers for the KNL host processor, in addition to many more PCIe-card based solutions. >100 Petaflops of committed customer deals to date More info The path to Aurora HPC Scalable System Framework Coral Program Overview All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice. All projections are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.

10 Modernização de código em processadores Xeon Identificando oportunidades de otimização - Vetorização 10

11 Legal Disclaimers Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Relative performance is calculated by assigning a baseline value of 1.0 to one benchmark result, and then dividing the actual benchmark result for the baseline platform into each of the specific benchmark results of each of the other platforms, and assigning them a relative performance number that correlates with the performance improvements reported. Intel does not control or audit the design or implementation of third party benchmarks or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmarks are reported and confirm whether the referenced benchmarks are accurate and reflect performance of systems available for purchase. Intel Hyper-Threading Technology Available on select Intel Xeon processors. Requires an Intel HT Technology-enabled system. Consult your PC manufacturer. Performance will vary depending on the specific hardware and software used. For more information including details on which processors support HT Technology, visit Intel Turbo Boost Technology requires a Platform with a processor with Intel Turbo Boost Technology capability. Intel Turbo Boost Technology performance varies depending on hardware, software and overall system configuration. Check with your platform manufacturer on whether your system delivers Intel Turbo Boost Technology. For more information, see Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor series, not across different processor sequences. See for details. Intel products are not intended for use in medical, life saving, life sustaining, critical control or safety systems, or in nuclear facility applications. All dates and products specified are for planning purposes only and are subject to change without notice Intel product plans in this presentation do not constitute Intel plan of record product roadmaps. Please contact your Intel representative to obtain Intel s current plan of record product roadmaps. Product plans, dates, and specifications are preliminary and subject to change without notice Copyright Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon and Xeon logo, Xeon Phi and Xeon Phi logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. All dates and products specified are for planning purposes only and are subject to change without notice. *Other names and brands may be claimed as the property of others. 11

12 Identificando oportunidades de otimização Foco deste seminário: Modernização de código Otimização em um único nó de processamento Vetorização & Paralelismo HPC Cluster Cluster Edition Multi-fabric MPI library MPI error checking and tuning Professional Edition MPI Messages Threading design & prototyping Parallel performance tuning Memory & thread correctness Composer Edition Vectorized & Threaded Node Intel C++ and Fortran compilers Parallel models (e.g., OpenMP*) Optimized libraries 12

13 Identificando oportunidades de otimização Foco deste seminário: Modernização do código Código C/C++ ou Fortran Paralelismo (Multithreading) Thread 0/Core0 Thread 0 / Core 0 Thread 1/ Core1 Thread 2 / Core 2... Thread 12 / Core12 Thread 1/Core1 Thread 2/Core2... Thread 244 /Core61 Vector Processor Unit por Core 128 Bits 256 Bits Vetorização Vector Processor Unit por Core 512 Bits 13

14 Identificando oportunidades de otimização Vetorização dentro do core Código de exemplo - Black-Scholes Pricing Code a mathematical model of a financial market containing certain derivative investment instruments. Exemplo retirado do livro High Performance Parallelism Pearls Código fonte: Artigo sobre otimização deste método 14

15 Identificando oportunidades de otimização Recapitulando o que é vetorização / SIMD O que é e? Capacidade de realizar uma operação matemática em dois ou mais elementos ao mesmo tempo. Por que Vetorizar? Ganho substancial em performance! for (i=0;i<=max;i++) c[i]=a[i]+b[i]; A + B C Scalar - Uma instrução - Uma operação a[i+7] a[i+6] a[i+5] a[i+4] a[i+3] a[i+2] a[i+1] a[i] + b[i+7] b[i+6] b[i+5] b[i+4] b[i+3] b[i+2] b[i+1] b[i] Vector - Uma instrução - Oito operações c[i+7] c[i+6] c[i+5] c[i+4] c[i+3] c[i+2] c[i+1] c[i] 15

16 Evolução do processamento vetorial dos processadores Intel 64 X4 Y4 X4opY4 X3 Y3 X3opY3 X2 Y2 X2opY2 X1 Y1 X1opY1 0 MMX Vector size: 64bit Data types: 8, 16 and 32 bit integers VL: 2,4,8 For sample on the left: Xi, Yi 16 bit integers X4 128 Y4 X4opY4 X3 Y3 X3opY3 X2 Y2 X2opY2 X1 Y1 X1opY1 0 Intel SSE Vector size: 128bit Data types: 8,16,32,64 bit integers 32 and 64bit floats VL: 2,4,8,16 Sample: Xi, Yi bit 32 int / float 16

17 Evolução do processamento vetorial dos processadores Intel 255 X8 Y8 X7 Y7 X6 Y6 X5 Y X4 Y4 X3 Y3 X2 Y2 X1 Y1 0 Intel AVX / AVX2 Vector size: 256bit Data types: 32 and 64 bit floats VL: 4, 8, 16 Sample: Xi, Yi 32 bit int or float X8opY8 X7opY7 X6opY6 X5opY5 X4opY4 X3opY3 X2opY2 X1opY1 511 X16... Y16 X16opY X9 X8 X7 Y9 Y8 Y7 X9opY9 X8opY8 X6 Y6... X5 Y5 X4 Y4 X3 Y3 X2 Y2 0 X1 Y1 X1opY1 Intel MIC / AVX-512 Vector size: 512bit Data types: 32 and 64 bit integers 32 and 64bit floats (some support for 16 bits floats) VL: 8,16 Sample: 32 bit float 17

18 Princípios do desempenho obtido Vários caminhos, uma única arquitetura Performance Work Time = Path Length Work Instruction x IPC Instruction Cycle x Frequency Cycle Time Não podemos mais contar somente com aumento da frequência Algoritmo eficiente mesma carga de trabalho com menos instruções Compilador reduz as instruções e melhora IPC (Instructions per cyle) Uso eficiente da Cache: melhora IPC Vetorização: mesmo trabalho com menos instruções Paralelização: mais instruções por ciclo *Other logos, brands and names are the property of their respective owners. All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice. Intel Xeon Processor generations from left to right in each chart: 64-bit, 5100 series, 5500 series, 5600 series, E5-2600, E v2 Intel Xeon Phi Product Family from left to right in each chart: Intel Xeon Phi x100 Product Family (formerly codenamed Knights Corner), Knights Landing (next-generation Intel Xeon Phi Product Family)

19 Identificando oportunidades de otimização Maneiras de vetorizar o código Facilidade de Uso Vectors Intel Math Kernel Library Auto vectorization Array Notation: Intel Cilk Plus Devemos avaliar três fatores: Necessidade de performance Disponibilidade de recursos para otimizar o código Portabilidade do código Semi-auto vectorization: #pragma (vector, ivdep, simd) C/C++ Vector Classes (F32vec16, F64vec8) Ajuste Fino 19

20 Identificando oportunidades de otimização Vetorização dentro do core O código está otimizado para rodar em uma única thread? Compilar o código com parâmetro -qopt-report[=n] no Linux ou /Qopt-report[:n] no Windows. /Qopt-report-file:vecReport.txt Analisar relatório, encontrar dicas sobre loops não vetorizados e principal causa Loop de inicialização de variáveis LOOP BEGIN at...black-scholes-ch19\02_referenceversion.cpp(93,3) remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details remark #15346: vector dependence: assumed OUTPUT dependence between pt line 95 and pk line 97 remark #25439: unrolled with remainder by 2 LOOP END Loop dentro da função GetOptionPrices LOOP BEGIN at...black-scholes-ch19\02_referenceversion.cpp(56,3) remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details remark #15346: vector dependence: assumed ANTI dependence between ps0 line 58 and pc line 62 LOOP END 20

21 Identificando oportunidades de otimização Vetorização dentro do core Rodar Intel VTune General Exploration marcar opção Analyze memory bandwidth Identificar hotspot = função que gasta mais tempo na execução Identificar se as funções estão vetorizadas 21

22 Identificando oportunidades de otimização Vetorização dentro do core Parâmetros de compilação utilizados: /GS /W3 /Gy /Zc:wchar_t /Zi /O2 /Fd"x64\Release\vc110.pdb" /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /Qipo /Zc:forScope /Oi /MD /Fa"x64\Release\" /EHsc /nologo /Fo"x64\Release\" /Qprof-dir "x64\release\" /Fp"x64\Release\Reference.pch Parâmetros de execução <número de elementos> <número de threads>

23 Identificando oportunidades de otimização Vetorização dentro do core Instruções escalares como movsd ou cvtsd2ss ( s de scalar) estão sendo utilizadas ao invés de vmovapd ; v de AVX, p de packed, d de double SSE utiliza 128 bits ; e não 256 bits igual instruções AVX 23

24 Identificando oportunidades de otimização Vetorização dentro do core O que identificamos? Código não está vetorizado, não utiliza instruções AVX/AVX2 de 256 bits Compilador apontou dependência de dados Oportunidades apontadas pelo VTune Back-end bound : baixo desempenho na execução das instruções Memory bound: aplicação dependente da troca de mensagens entre Cache RAM Instruções load (ram -> cache) e store ( cache -> ram ) L1 bound: dados não são encontrados neste nível de cache Port utilization: baixa utilização do core, non-memory issues Funções cdfnormf e GetOptionPrices são hotspots 24

25 Modernização de código com semi-autovetorização 25

26 Modernização de código Semi-autovetorização Uso do #pragma ivdep (por enquanto utilizando apenas 1 thread/core) Código não está vetorizado, não utiliza instruções AVX/AVX2 de 256 bits Compilador apontou dependência de dados na linha 56 #pragma ivdep for (i = 0; i < N; i++) { d1 = (log(ps0[i] / pk[i]) + (r + sig * sig * 0.5) * pt[i]) / (sig * sqrt(pt[i])); d2 = (log(ps0[i] / pk[i]) + (r - sig * sig * 0.5) * pt[i]) / (sig * sqrt(pt[i])); p1 = cdfnormf(d1); p2 = cdfnormf(d2); pc[i] = ps0[i] * p1 - pk[i] * exp((-1.0) * r * pt[i]) * p2; } Performance elementos #pragma ivdep time = Código original time = x de speedup Configuração Intel Core i CPU 2.5 GHZ 4GB RAM Windows 8.1 x64 Intel Compiler C

27 Modernização de código Semi-autovetorização Uso do #pragma ivdep Entenda o que mudou rodando novamente o VTune, compare as duas versões! Menos instruções executadas! Menos ciclos de clock por execução! Menos misses na cache L1 Melhor uso do core 27

28 Modernização de código Semi-autovetorização Uso do #pragma ivdep Entenda o que mudou rodando novamente o VTune, compare as duas versões! Menos instruções executadas! Menos ciclos de clock por execução! Menos misses na cache L1 Melhor uso do core 28

29 Modernização de código Semi-autovetorização Uso do #pragma ivdep Entenda o que mudou rodando novamente o VTune, compare as duas versões! Relatório do Compilador: análise do loop vetorizado e dicas de como vetorizar week\pdf-codigos-intel-lncc\paralelismo-dia-02\black-scholes-ch19\02_referenceversion.cpp(63,5) ] remark #15300: LOOP WAS VECTORIZED remark #15442: entire loop may be executed in remainder remark #15448: unmasked aligned unit stride loads: 1 remark #15450: unmasked unaligned unit stride loads: 2 remark #15451: unmasked unaligned unit stride stores: 1 remark #15475: --- begin vector loop cost summary --- remark #15476: scalar loop cost: 760 remark #15477: vector loop cost: remark #15478: estimated potential speedup: remark #15479: lightweight vector operations: 66 remark #15480: medium-overhead vector operations: 2 remark #15482: vectorized math library calls: 5 remark #15487: type converts: 13 remark #15488: --- end vector loop cost summary --- LOOP END

30 SPEEDUP Modernização de código Semi-autovetorização IVDEP Ignora dependência entre os vetores (QxAVX) Instruções AVX 256 bits OTIMIZAÇÕES DENTRO DO CORE 1 THREAD 60Mi Configuração Intel Core i CPU 2.5 GHZ 4GB RAM Windows 8.1 x64 Intel Compiler C Mi 10 5 Unroll ( n ) Desmembra o loop para instruções SIMD Link sobre unroll Requisitos para loop ser vetorizado Loop unrolling 0 BASELINE IVDEP QXAVX + IVDEP QXAVX + IVDEP + UNROLL(4)

31 Identificando oportunidades de otimização [2] Precisão numérica e alinhamento de dados 31

32 Identificando oportunidades de otimização Vetorização dentro do core Precisão e dados alinhados 2 oportunidades apontadas pelo Compilador! Redução da precisão numérica Double (64 bits) para Single (32 bits) Alinhamento de dados LOOP BEGIN at... Black-scholes-ch19\02_ReferenceVersion.cpp(58,3) remark #15389: vectorization support: reference ps0 has unaligned access [ remark #15381: vectorization support: unaligned access used inside loop body remark #15399: vectorization support: unroll factor set to 2 remark #15417: vectorization support: number of FP up converts: single precision to double precision 1 remark #15389: vectorization support: reference pk has unaligned access [... Black-scholesch19\02_ReferenceVersion.cpp(64,5) ] remark #15381: vectorization support: unaligned access used inside loop body remark #15399: vectorization support: unroll factor set to 8 remark #15417: vectorization support: number of FP up converts: single precision to double precision 1 [... Black-scholes-ch19\02_ReferenceVersion.cpp(60,5) ]

33 Identificando oportunidades de otimização Vetorização dentro do core Precisão e dados alinhados Otimizando o código para precisão simples d1 = (log(ps0[i] / pk[i]) + (r + sig * sig * 0.5) * pt[i]) / (sig * sqrt(pt[i])); d2 = (log(ps0[i] / pk[i]) + (r - sig * sig * 0.5) * pt[i]) / (sig * sqrt(pt[i])); p1 = cdfnormf(d1); p2 = cdfnormf(d2); pc[i] = ps0[i] * p1 - pk[i] * exp((-1.0) * r * pt[i]) * p2; d1 = (logf(ps0[i] / pk[i]) + (r + sig * sig * 0.5f) * pt[i]) / (sig * sqrtf(pt[i])); d2 = (logf(ps0[i] / pk[i]) + (r - sig * sig * 0.5f) * pt[i]) / (sig * sqrtf(pt[i])); p1 = cdfnormf (d1); p2 = cdfnormf (d2); pc[i] = ps0[i] * p1 - pk[i] * expf((-1.0f) * r * pt[i]) * p2; 23.6x speedup vs Código original 1.4x speedup vs AVX + Unrool + IVDEP

34 Identificando oportunidades de otimização Vetorização dentro do core Precisão e dados alinhados SPEEDUP Otimizando o código para precisão simples OTIMIZAÇÕES DENTRO DO CORE 1 THREAD Configuração Intel Core i CPU 2.5 GHZ 4GB RAM Windows 8.1 x64 Intel Compiler C Mi 120Mi BASELINE IVDEP QXAVX + IVDEP QXAVX + IVDEP + UNROLL(4) ALL + REDUÇÃO DA PRECISÃO

35 Identificando oportunidades de Paralelismo (multi-threading) Do Not Guess Measure 35

36 Identificando oportunidades de Paralelismo Multithreads Intel Advisor XE Apesar de vetorizado (paralelismo em nível de instruções), o código está rodando em apenas uma única thread/core! Antes de começar a otimização no código, podemos analisar se vale a pena paralelizalo em mais threads! Intel Advisor XE Modela e compara a performance entre vários frameworks para criação de threads tanto em processadores quanto em co-processadores OpenMP, Intel Cilk Plus, Intel Threading Bulding Blocks C, C++, Fortran (apenas OpenMP) e C# (Microsoft TPL) Prevê escalabilidade do código: relação n.º de threads/ganho de performance Identifica oportunidades de paralelismo no código Checa corretude do código (deadlocks, race condition) 36

37 Identificando oportunidades de Paralelismo Multithreads Intel Advisor XE Passos para utilizar o Intel Advisor 1º - Inclua os headers #include "advisor-annotate.h 2º - Adicionar referência ao diretório include ; linkar lib ao projeto (Windows e Linux) Windows com Visual Studio 2012 Geralmente localizado em C:\Program Files (x86)\intel\advisor XE\include Linux - Compilando / Link com Advisor icpc -O2 -openmp 02_ReferenceVersion.cpp -o 02_ReferenceVersion -I/opt/intel/advisor_xe/include/ -L/opt/intel/advisor_xe/lib64/ 37

38 Identificando oportunidades de Paralelismo Multithreads Intel Advisor XE Passos para utilizar o Intel Advisor 3º - Executando o Advisor Linux $ advixe-gui & Crie um novo projeto - Interface é a mesma para Linux e Windows - No caso do Visual Studio há a opção de roda-lo de forma integrada. 38

39 Identificando oportunidades de Paralelismo Multithreads Intel Advisor XE Passos para utilizar o Intel Advisor Advisor Workflow Survey Target: analisa o código em busca de oportunidades de paralelismo Annotate Sources: Anotações são inseridas em possíveis regiões de código paralelas Check Suitability: Analisa as regiões paralelas anotadas, entrega previsão de ganho de performance e escalabilidade do código Check correctness: Analisa possíveis problemas como race conditions e dealocks Add Parallel Framework: Passo para substituir anotações do Advisor pelo código do framework escolhido (OpenMP, Cilk Plus, TBB, etc.) 39

40 Identificando oportunidades de Paralelismo Multithreads Intel Advisor XE Identificando hotspots e quais loops podem ser paralelizados

41 Identificando oportunidades de Paralelismo Multithreads Intel Advisor XE Inserindo as anotações do Advisor para executar a próxima fase: Check Suitability

42 Identificando oportunidades de Paralelismo Multithreads Intel Advisor XE Identificando hotspots e quais loops podem ser paralelizados

43 SPEEDUP Aplicando paralelismo via OpenMP Análise de concorrência com o Intel Vtune Código otimizado IVDEP + AVX + UNROLL + FLOAT PRECISION OTIMIZAÇÃO MULTI-THREAD OpenMP - 60mi Nthreads: 1 time = Nthreads: 2 time = Nthreads: 4 time = Nthreads: 8 time = THREADS Configuração Intel Core i CPU 2.5 GHZ 4GB RAM Windows 8.1 x64 Intel Compiler C

44 Modernização de código no coprocessador Xeon Phi 44

45 Agenda Black-Scholes Pricing Code Intensidade computacional - Monte-carlo 45

46 Modernização de código no Intel Xeon Phi Recapitulando modelos de programação Modelos de programação Xeon Multi-Core Centric MIC Many Core Centric Multi-Core Hosted Aplicações Seriais e Paralelas Offload Aplicações com etapas paralelas Symmetric Load Balance Many-Core Hosted - Native Aplicações Massivamente Paralelas 46

47 Modernização de código no Intel Xeon Phi Recapitulando modelos de programação MIC Plataform Software Stack Linux/Windows - Xeon Host Intel Xeon Phi Coprocessor Offload Aplication SSH SSH Session Native Aplication Offload Aplication System Level Code MPSS System Level Code Coprocessor Communication and applaunching support PCI-e Bus PCI-e Bus Windows / Linux OS Linux uos 47

48 Modernização de código no Intel Xeon Phi Black-Scholes Pricing Code Compilando para Xeon Phi $ icc -O2 -mmic 02_ReferenceVersion.cpp -o reference.mic openmp $ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/intel/compiler/ /composerxe/lib/mic/ $ scp reference.mic mic0:/tmp $ ssh mic0 $ cd /tmp $./reference.mic micnativeloadex reference.mic -l Dependency information for reference.mic... Binary was built for Intel(R) Xeon Phi(TM) Coprocessor (codename: Knights Corner) architecture SINK_LD_LIBRARY_PATH = Dependencies Found: (none found) Dependencies Not Found Locally (but may exist already on the coprocessor): libm.so.6 libiomp5.so libstdc++.so.6 libgcc_s.so.1 libpthread.so.0 libc.so.6 libdl.so.2 Verificando dependências de biblioteca

49 Testando o código no Intel Xeon Phi Código vetorizado, porém em 1 thread Código simplesmente portado para Xeon Phi icc -O2 -openmp -fimf-precision=low -fimf-domain-exclusion=31 -mmic 11_XeonPhi.cpp -o 11_XeonPhi.mic icc -O2 -openmp -fimf-precision=low -fimf-domain-exclusion=31 -mmic 12_XeonPhiWorkInParallel.cpp -o 12_XeonPhiWorkInParallel.mic icc -O2 -openmp -fimf-precision=low -fimf-domain-exclusion=31 -mmic 13_XeonPhiStreamingStores.cpp -o 13_XeonPhiStreamingStores.mic 49

50 Modernização de código no Intel Xeon Phi Código vetorizado, porém em 1 thread #pragma vector aligned #pragma simd for (i = 0; i < N; i++) { invf = invsqrtf(sig2 * pt[i]); d1 = (logf(ps0[i] / pk[i]) + (r + sig2 * 0.5f) * pt[i]) / invf; d2 = (logf(ps0[i] / pk[i]) + (r - sig2 * 0.5f) * pt[i]) / invf; erf1 = 0.5f + 0.5f * erff(d1 * invsqrt2); erf2 = 0.5f + 0.5f * erff(d2 * invsqrt2); pc[i] = ps0[i] * erf1 - pk[i] * expf((-1.0f) * r * pt[i]) * erf2; } } time = ~ Vetorizado, porém em 1 thread time = Intel Xeon processor Intel Xeon Phi coprocessor

51 Testando o código no Intel Xeon Phi Código vetorizado e paralelizado #pragma vector aligned #pragma simd #pragma omp parallel for private(d1, d2, erf1, erf2, invf) for (i = 0; i < N; i++) { invf = invsqrtf(sig2 * pt[i]); d1 = (logf(ps0[i] / pk[i]) + (r + sig2 * 0.5f) * pt[i]) / invf; d2 = (logf(ps0[i] / pk[i]) + (r - sig2 * 0.5f) * pt[i]) / invf; erf1 = 0.5f + 0.5f * erff(d1 * invsqrt2); erf2 = 0.5f + 0.5f * erff(d2 * invsqrt2); pc[i] = ps0[i] * erf1 - pk[i] * expf((-1.0f) * r * pt[i]) * erf2; } } 4 threads time = threads time = Intel Xeon processor Intel Xeon Phi coprocessor

52 TEMPO DE EXECUÇÃO Testando o código no Intel Xeon Phi Código vetorizado e paralelizado BLACK SCHOLES Xeon E5-2697v3 MIC - Simple port MIC - Parallel MIC-StreamStore MIC - warm-up THREADS Configuração Intel(R) Xeon(R) CPU E GHz 64GB RAM Linux version el6.x86_64.crt1 Red Hat Intel Xeon Phi 7110ª MPSS

53 Testando o código no Intel Xeon Phi Análise de desempenho VTune está indicando que há muita latência, impacto direto na vetorização! 53

54 Testando o código no Intel Xeon Phi Análise de desempenho VTune está indicando que há muita latência, impacto direto na vetorização! 54

55 Testando o código no Intel Xeon Phi Análise de desempenho - Bandwidth Bandwidth DRAM << >> CACHE Versão com instruções StreamStore 55

56 Testando o código no Intel Xeon Phi Análise de desempenho - Bandwidth Bandwidth DRAM << >> CACHE Comparação StreamStore vs Non-StreamStore 56

57 Testando o código no Intel Xeon Phi Algumas conclusões Aplicação orientada a memory bandwidth Neste caso, mais threads não trará ganhos Streams-stores no Xeon Phi evitou uso de cache nos vetores somente de escrita, reduzindo consumo de bandwidth Computar X options = 3X reads + X writes + X Read-for-ownership = 5X Streamstore: 3X reads + X writes = 4X Latência, devido a cache misses impacta vetorização Overhead na criação de threads impacta 1ª execução do loop OpenMP, principalmente no Xeon Phi Versão final do código computou 2067mi opções/sec no Xeon e 9231mi opções/sec no Xeon Phi 57

58 Intensidade computacional Monte Carlo European Option Xeon Phi 58

59 Intensidade computacional Monte Carlo European Option Xeon Phi Monte Carlo European Option with Pre-generated Random Numbers for Intel Xeon Phi Coprocessor Artigo 1 Download do código utilizado Artigo 2 Detalhes de otimização Compilando para Xeon Phi icpc MonteCarlo.cpp -mmic -O3 -ipo -fno-alias -opt-threads-per-core=4 -openmp -restrict -vec-report2 -fimfprecision=low -fimf-domain-exclusion=31 -no-prec-div -no-prec-sqrt -DCOMPILER_VERSION="15" -ltbbmalloc -DFPFLOAT -o MonteCarloSP.mic Verificando dependências de biblioteca $ ssh mic0 $ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/intel/compiler/ /lib/mic:/opt/intel/compiler/ /tbb/lib/mic $ source /opt/intel/compiler/ /composerxe/bin/compilervars.sh intel64 59

60 Xeon and Xeon Phi Mesmo código, mesmo modelo de programação Xeon Phi: icpc MonteCarlo.cpp -mmic -O3 -ipo -fno-alias -opt-threads-per-core=4 -openmp - restrict -vec-report2 -fimf-precision=low -fimf-domain-exclusion=31 -no-prec-div -noprec-sqrt -DCOMPILER_VERSION="15" -ltbbmalloc DFPFLOAT -g -o MonteCarloSP.mic Xeon icpc MonteCarlo.cpp -O3 -ipo -fno-alias -opt-threads-per-core=2 -openmp -restrict - vec-report2 -fimf-precision=low -fimf-domain-exclusion=31 -no-prec-div -no-prec-sqrt -DCOMPILER_VERSION="15" -ltbbmalloc -DFPFLOAT -o MonteCarloSP.xeon 60

61 Intensidade computacional Monte Carlo European Option Xeon Phi... #pragma omp parallel for for(int opt = 0; opt < OPT_N; opt++) {... #pragma vector aligned #pragma loop count (RAND_N) #pragma simd reduction(+:v0) reduction(+:v1) #pragma unroll(unrollcount) for(const FP_TYPE *hrnd = start; hrnd < end; ++hrnd) { const FP_TYPE result = std::max(typed_zero, Y*EXP2(VBySqrtT*(*hrnd) + MuByT) - Z); v0 += result; v1 += result*result; } CallResultList[opt] += v0; CallConfidenceList[opt] += v1; } } threads time = AVX2 AVX2 AVX2 Baseline: código paralelo e vetorizado! Intel IMCI 240 threads time = AVX2 AVX2 Intel Xeon processor Intel Xeon Phi coprocessor

62 SPEEDUP Monte Carlo European Option Tempo de execução Xeon and Xeon Phi Xeon E5-2697v3 Xeon Phi THREADS Configuração Intel(R) Xeon(R) CPU E GHz 64GB RAM Linux version el6.x86_64.crt1 Red Hat Intel Xeon Phi 7110ª MPSS

63 Xeon and Xeon Phi Análise do uso da memória Aplicação é orientada a cálculo dos dados compute bound, e não memory bandwidth Alta utilização da memória cache boa localidade dos dados 63

64 Xeon and Xeon Phi Análise do uso da memória Aplicação é orientada a cálculo dos dados compute bound, e não memory bandwidth Pode haver espaço para otimizações Software prefetching Data locality 64

65 Xeon and Xeon Phi - Outros Cases Mesmo modelo de programação e otimização Joint study from Colfax International and Stanford University GOAL: building a 3D model of the Mikly Way galaxy using large volumes of 2D sky survey data App ported quickly to Xeon Phi: performance was <1/3 because we moved from >3 GHz out-of-order modern CPU to a lowpower 1 GHz in-order core limited threading, no vectorization After modernizing the code, it runs up to 125x faster (vs. Xeon) and 620x faster (vs. Xeon Phi) than baseline!!! 1 Source: Colfax International and Stanford University. Complete details can be found at:

66 Performance (voxels/second) Xeon and Xeon Phi - Outros Cases Mesmo modelo de programação e otimização 10 4 HEATCODE BENCHMARKS 1.9x 4.4x x CPU & Intel C++ 620x Xeon Phi CPU & Intel C++ Xeon Phi CPU + Xeon Phi + Xeon Phi NON-OPTIMIZED OPTIMIZED HETERO 1 Source: Colfax International and Stanford University. Complete details can be found at:

67 Conclusões Modernização de código para Xeon e Xeon Phi 1º passo: modernize seu código serial para aproveitar ao máximo o paralelismo em nível de instrução e processamento vetorial Workflow de otimização: meça, identifique, otimize, teste! Xeon Phi: Todas otimizações feitas pra Xeon serão preservadas! Explore Xeon + Xeon Phi Otimização para single-thread Vetorização / SIMD / Paralelismo em nível de instrução / Multi-threading Uso de padrões abertos Mesma técnica de programação para Xeon e Xeon Phi Várias maneiras de otimização:facilidade ou Ajuste fino 67

68 Links úteis Por favor, dê seu feedback desta palestra - Manual e referência do compilador Intel Compiler 15.0 Manual e referência - Intel Intrinsics Guia para Auto-vetorização Xeon Phi Home Page Xeon Phi CODE RECIPES Intel Developer Zone IPCCs - Centros de Computação Paralela Product Overview (IBL) 68

69 A Growing Application Catalog Over 100 apps listed as available today or in flight

70 Testes Servidor Xeon Intel(R) Xeon(R) CPU E GHz - Haswell CPU(s): 56 Thread(s) per core: 2 Core(s) per socket: 14 Socket(s): 2 L1d cache: 32K L1i cache: 32K L2 cache: 256K L3 cache: 35840K 70

71 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Intel Xeon, and Intel Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. 71