Code Modernization on Xeon and Xeon Phi: Programming for Multi-core and Many-core. Igor Freitas, Intel do Brasil, igor.freitas@intel.com


Agenda
- Next step toward Exascale computing
- Code modernization on Xeon processors
- Identifying optimization opportunities: vectorization / SIMD
- Optimizing code with semi-auto-vectorization
- Identifying opportunities for parallelism (multi-threading)
- Code modernization on Xeon Phi coprocessors
- Xeon optimizations applied to Xeon Phi

Next step toward Exascale computing

Next Step on Intel's Path to Exascale Computing
Exascale vision, next step: KNL
- Resiliency: systems scalable to >100 PFlop/s
- I/O: up to 100 Gb/s with Storm Lake integrated fabric
- Standard programming models
- Power efficiency: over 15 GF/Watt [1]
- Processor performance: ~3X Flops and ~3X single-thread theoretical peak performance over Knights Corner [1]
- Memory: ~500 GB/s sustained memory bandwidth with integrated on-package memory
[1] Projections based on internal Intel analysis during early product definition, as compared to prior generation Intel Xeon Phi Coprocessors, and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.

The Road Ahead
You need the best to get the most.
Diagram labels: Transistors, Fabric, Memory, Integration, Performance, Optimized Hardware, Choice, Unified Software, ROI, General Purpose Approach, Standards, Secure Future

A Paradigm Shift for Highly-Parallel: Server Processor and Integration are Keys to Future
Diagram: Coprocessor, Fabric and Memory converge into the Knights Landing Server Processor
- Memory bandwidth: ~500 GB/s STREAM
- Memory capacity: over 25x* KNC
- Resiliency: systems scalable to >100 PF
- Power efficiency: over 25% better than card [1]
- I/O: up to 100 Gb/s with integrated fabric
- Cost: less costly than discrete parts [1]
- Flexibility: limitless configurations
- Density: 3+ KNL with fabric in 1U [3]
* Comparison to 1st Generation Intel Xeon Phi 7120P Coprocessor (formerly codenamed Knights Corner)
[1] Results based on internal Intel analysis using estimated power consumption and projected component pricing in the 2015 timeframe. This analysis is provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.
[2] Comparison to a discrete Knights Landing processor and discrete fabric component.
[3] Theoretical density for air-cooled system; other cooling solutions and configurations will enable lower or higher density.

Knights Landing: Holistic Approach to Real Application Breakthroughs
- Platform memory: up to 384 GB DDR4 (6 channels)
- Compute: Intel Xeon processor binary-compatible; 3+ TFLOPS [1]; 3X single-thread (ST) performance vs. KNC [2]; 2D mesh architecture; over 60 cores; out-of-order cores
- On-package memory: over 5x STREAM vs. DDR4 [3]; up to 16 GB at launch
- Integrated Intel Omni-Path (optional, in the processor package): 1st Intel processor to integrate it
- I/O: up to 36 PCIe 3.0 lanes
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance.

Knights Landing Details
PERFORMANCE
- 3+ TeraFLOPS of double-precision peak theoretical performance per single-socket node [0]
INTEGRATION
- Intel Omni Scale fabric integration
- High-performance on-package memory (MCDRAM)
- Over 5x STREAM vs. DDR4 [1]: over 400 GB/s
- Up to 16 GB at launch
- NUMA support
- Over 5x energy efficiency vs. GDDR5 [2]
- Over 3x density vs. GDDR5 [2]
- In partnership with Micron Technology
- Flexible memory modes including cache and flat
SERVER PROCESSOR
- Standalone bootable processor (running host OS) and a PCIe coprocessor (PCIe end-point device)
- Platform memory: up to 384 GB DDR4 using 6 channels
- Reliability ("Intel server-class reliability")
- Power efficiency (over 25% better than discrete coprocessor) [4]: over 10 GF/W
- Density (3+ KNL with fabric in 1U) [5]
- Up to 36 lanes PCIe* Gen 3.0
FUTURE
- Knights Hill is the codename for the 3rd generation of the Intel Xeon Phi product family
- Based on Intel's 10 nanometer manufacturing technology
- Integrated 2nd generation Intel Omni-Path Fabric
MICROARCHITECTURE
- Over 8 billion transistors per die, based on Intel's 14 nanometer manufacturing technology
- Binary compatible with Intel Xeon Processors, with support for Intel Advanced Vector Extensions 512 (Intel AVX-512) [6]
- 3x single-thread performance compared to Knights Corner [7]
- 60+ cores in a 2D mesh architecture
- 2 cores per tile, with 2 vector processing units (VPU) per core
- 1 MB L2 cache shared between the 2 cores in a tile (cache-coherent)
- Based on Intel Atom core (Silvermont microarchitecture) with many HPC enhancements
- 4 threads per core
- 2X out-of-order buffer depth [8]
- Gather/scatter in hardware
- Advanced branch prediction
- High cache bandwidth
- 32 KB Icache, Dcache
- 2 x 64B load ports in Dcache
- 46/48 physical/virtual address bits
- Most of today's parallel optimizations carry forward to KNL
- Multiple NUMA domain support per socket
AVAILABILITY
- First commercial HPC systems in 2H'15
- Knights Corner to Knights Landing upgrade program available today
- Intel Adams Pass board (1U half-width) is custom designed for Knights Landing (KNL) and will be available to system integrators for KNL launch; the board is OCP Open Rack 1.0 compliant, features 6 ch native DDR4 (1866/2133/2400 MHz) and 36 lanes of integrated PCIe* Gen 3 I/O
All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice. All projections are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.
[0] Over 3 Teraflops of peak theoretical double-precision performance is preliminary and based on current expectations of cores, clock frequency and floating point operations per cycle.
[1] Projected result based on internal Intel analysis of STREAM benchmark using a Knights Landing processor with 16 GB of ultra high-bandwidth memory versus DDR4 memory with all channels populated.
[2] Projected result based on internal Intel analysis comparison of 16 GB of ultra high-bandwidth memory to 16 GB of GDDR5 memory used in the Intel Xeon Phi coprocessor 7120P.
[3] Compared to 1st Generation Intel Xeon Phi 7120P Coprocessor (formerly codenamed Knights Corner).
[4] Projected result based on internal Intel analysis using estimated performance and power consumption of a rack-sized deployment of Intel Xeon processors and Knights Landing coprocessors as compared to a rack with KNL processors only.
[5] Projected result based on internal Intel analysis comparing a discrete Knights Landing processor with integrated fabric to a discrete Intel fabric component card.
[6] Binary compatible with Intel Xeon Processors v3 (Haswell) with the exception of Intel TSX (Transactional Synchronization Extensions).
[7] Projected peak theoretical single-thread performance relative to 1st Generation Intel Xeon Phi Coprocessor 7120P.
[8] Compared to the Intel Atom core (based on Silvermont microarchitecture).

Knights Landing Details
MOMENTUM
- Aurora is currently the largest planned system at 180-450 petaflop/s, to be delivered in 2018. Intel is teaming with Cray on both projects.
- Aurora: 50,000 nodes; future generation Intel Xeon Phi processors (Knights Hill); 2nd generation Intel Omni-Path fabric; new memory hierarchy composed of Intel Lustre, Burst Buffer Storage, and persistent memory through high-bandwidth on-package memory; Cray's Shasta platform.
- Cori Supercomputer at NERSC (National Energy Research Scientific Computing Center at LBNL/DOE) became the first publicly announced Knights Landing based system, with over 9,300 nodes slated to be deployed in mid-2016.
- Trinity Supercomputer at NNSA (National Nuclear Security Administration) is a $174 million deal awarded to Cray that will feature Haswell and Knights Landing, with acceptance phases in both late 2015 and 2016.
- Expecting over 50 system providers for the KNL host processor, in addition to many more PCIe-card based solutions.
- >100 Petaflops of committed customer deals to date.
More info: The path to Aurora; HPC Scalable System Framework; Coral Program Overview
All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice. All projections are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.

Code modernization on Xeon processors. Identifying optimization opportunities: vectorization

Legal Disclaimers Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Relative performance is calculated by assigning a baseline value of 1.0 to one benchmark result, and then dividing the actual benchmark result for the baseline platform into each of the specific benchmark results of each of the other platforms, and assigning them a relative performance number that correlates with the performance improvements reported. Intel does not control or audit the design or implementation of third party benchmarks or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmarks are reported and confirm whether the referenced benchmarks are accurate and reflect performance of systems available for purchase. Intel Hyper-Threading Technology Available on select Intel Xeon processors. Requires an Intel HT Technology-enabled system. Consult your PC manufacturer. Performance will vary depending on the specific hardware and software used. For more information including details on which processors support HT Technology, visit http://www.intel.com/info/hyperthreading. Intel Turbo Boost Technology requires a Platform with a processor with Intel Turbo Boost Technology capability. Intel Turbo Boost Technology performance varies depending on hardware, software and overall system configuration. Check with your platform manufacturer on whether your system delivers Intel Turbo Boost Technology. 
For more information, see http://www.intel.com/technology/turboboost. Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor series, not across different processor sequences. See http://www.intel.com/products/processor_number for details. Intel products are not intended for use in medical, life saving, life sustaining, critical control or safety systems, or in nuclear facility applications. All dates and products specified are for planning purposes only and are subject to change without notice. Intel product plans in this presentation do not constitute Intel plan of record product roadmaps. Please contact your Intel representative to obtain Intel's current plan of record product roadmaps. Product plans, dates, and specifications are preliminary and subject to change without notice. Copyright 2012-15 Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon and Xeon logo, Xeon Phi and Xeon Phi logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. *Other names and brands may be claimed as the property of others.

Identifying optimization opportunities
Focus of this seminar: code modernization, meaning optimization within a single processing node through vectorization and parallelism.
Diagram levels: HPC Cluster, MPI Messages, Vectorized & Threaded Node
- Cluster Edition: multi-fabric MPI library; MPI error checking and tuning
- Professional Edition: threading design & prototyping; parallel performance tuning; memory & thread correctness
- Composer Edition: Intel C++ and Fortran compilers; parallel models (e.g., OpenMP*); optimized libraries

Identifying optimization opportunities
Focus of this seminar: modernizing C/C++ or Fortran code along two axes.
- Parallelism (multithreading): threads are spread across cores. The diagram shows Thread 0 / Core 0, Thread 1 / Core 1, Thread 2 / Core 2, ... Thread 12 / Core 12 on one device (Xeon) and Thread 0 / Core 0, Thread 1 / Core 1, Thread 2 / Core 2, ... Thread 244 / Core 61 on the other (Xeon Phi).
- Vectorization: each core has a vector processing unit, 128 or 256 bits wide on the first device and 512 bits wide on the second.

Identifying optimization opportunities: vectorization within the core
Example code: Black-Scholes pricing code, "a mathematical model of a financial market containing certain derivative investment instruments".
Example taken from the book High Performance Parallelism Pearls. Source code: http://lotsofcores.com/pearls.code
Article on optimizing this method: https://software.intel.com/en-us/articles/case-study-computing-black-scholes-withintel-advanced-vector-extensions

Identifying optimization opportunities: recap of vectorization / SIMD
What is it? The ability to perform one mathematical operation on two or more elements at the same time.
Why vectorize? Substantial performance gain!
    for (i=0;i<=max;i++) c[i]=a[i]+b[i];
Scalar (A + B = C): one instruction, one operation.
Vector: one instruction, eight operations: a[i..i+7] + b[i..i+7] = c[i..i+7].
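To make the scalar/vector contrast concrete, here is a minimal sketch in plain C++. The function names `add_scalar` and `add_blocked` and the constant `kLanes` are illustrative and do not come from the deck. The blocked version only mimics the shape of an 8-lane vector loop (eight independent additions per block plus a remainder loop); whether real SIMD instructions are emitted depends on the compiler and its flags.

```cpp
#include <cassert>
#include <cstddef>

// Scalar form: each iteration performs one addition.
void add_scalar(const float* a, const float* b, float* c, std::size_t n) {
    assert(n == 0 || (a && b && c));
    for (std::size_t i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}

// Number of 32-bit floats that fit in one 256-bit register (illustrative).
constexpr std::size_t kLanes = 8;

// Blocked form: the inner loop has eight independent additions, which is
// the work a single 8-lane vector instruction would do at once.
void add_blocked(const float* a, const float* b, float* c, std::size_t n) {
    assert(n == 0 || (a && b && c));
    std::size_t i = 0;
    for (; i + kLanes <= n; i += kLanes) {
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            c[i + lane] = a[i + lane] + b[i + lane];
        }
    }
    // Remainder loop: elements left over when n is not a multiple of kLanes.
    for (; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}
```

Both functions produce the same result; the second simply exposes the block structure that a vectorizing compiler looks for.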

Evolution of vector processing in Intel processors
- MMX: vector size 64 bits; data types: 8-, 16- and 32-bit integers; VL: 2, 4, 8. Sample (bits 0 to 63): Xi, Yi are 16-bit integers, giving four lanes X1..X4 op Y1..Y4.
- Intel SSE: vector size 128 bits; data types: 8-, 16-, 32- and 64-bit integers, 32- and 64-bit floats; VL: 2, 4, 8, 16. Sample (bits 0 to 127): Xi, Yi are 32-bit int / float, giving four lanes X1..X4 op Y1..Y4.

Evolution of vector processing in Intel processors (continued)
- Intel AVX / AVX2: vector size 256 bits; data types: 32- and 64-bit floats; VL: 4, 8, 16. Sample (bits 0 to 255): Xi, Yi are 32-bit int or float, giving eight lanes X1..X8 op Y1..Y8.
- Intel MIC / AVX-512: vector size 512 bits; data types: 32- and 64-bit integers, 32- and 64-bit floats (some support for 16-bit floats); VL: 8, 16. Sample (bits 0 to 511): 32-bit float, giving sixteen lanes X1..X16 op Y1..Y16.
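The vector length (VL) figures on these two slides follow from dividing the register width by the element width. The helper `lanes` below is a hypothetical name used only for this illustration; the widths are the ones listed on the slides.

```cpp
#include <cassert>
#include <cstddef>

// Lanes per register = register width in bits / element width in bits.
constexpr std::size_t lanes(std::size_t register_bits, std::size_t element_bits) {
    return register_bits / element_bits;
}

// Checked at compile time against the samples on the slides.
static_assert(lanes(64, 16) == 4, "MMX with 16-bit integers");
static_assert(lanes(128, 32) == 4, "SSE with 32-bit int/float");
static_assert(lanes(256, 32) == 8, "AVX with 32-bit float");
static_assert(lanes(256, 64) == 4, "AVX with 64-bit float");
static_assert(lanes(512, 32) == 16, "AVX-512 with 32-bit float");
static_assert(lanes(512, 64) == 8, "AVX-512 with 64-bit float");
```

Halving the element width doubles the lane count, which is one reason the precision reduction discussed later in the deck can pay off.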

Principles behind the achieved performance: several paths, a single architecture

$$\text{Performance} = \frac{\text{Work}}{\text{Time}} = \frac{\text{Work}}{\text{Instruction}} \times \frac{\text{Instruction}}{\text{Cycle}} \times \frac{\text{Cycle}}{\text{Time}}$$

The three factors are labeled on the slide as Path Length (Work/Instruction), IPC (Instruction/Cycle) and Frequency (Cycle/Time).
We can no longer rely on frequency increases alone:
- Efficient algorithm: same workload with fewer instructions
- Compiler: reduces the instruction count and improves IPC (instructions per cycle)
- Efficient cache use: improves IPC
- Vectorization: same work with fewer instructions
- Parallelization: more instructions per cycle
*Other logos, brands and names are the property of their respective owners. All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.
Intel Xeon Processor generations from left to right in each chart: 64-bit, 5100 series, 5500 series, 5600 series, E5-2600, E5-2600 v2. Intel Xeon Phi Product Family from left to right in each chart: Intel Xeon Phi x100 Product Family (formerly codenamed Knights Corner), Knights Landing (next-generation Intel Xeon Phi Product Family).

Identifying optimization opportunities: ways to vectorize code
From ease of use to fine tuning:
- Intel Math Kernel Library
- Auto-vectorization
- Array notation: Intel Cilk Plus
- Semi-auto-vectorization: #pragma (vector, ivdep, simd)
- C/C++ vector classes (F32vec16, F64vec8)
Three factors should be weighed:
- Performance requirements
- Availability of resources to optimize the code
- Code portability

Identifying optimization opportunities: vectorization within the core
Is the code optimized to run on a single thread?
- Compile with -qopt-report[=n] on Linux or /Qopt-report[:n] on Windows; /Qopt-report-file:vecReport.txt writes the report to a file.
- Read the report and look for hints about loops that were not vectorized and the main cause.
Variable-initialization loop:
    LOOP BEGIN at ...black-scholes-ch19\02_referenceversion.cpp(93,3)
    remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
    remark #15346: vector dependence: assumed OUTPUT dependence between pt line 95 and pk line 97
    remark #25439: unrolled with remainder by 2
    LOOP END
Loop inside the GetOptionPrices function:
    LOOP BEGIN at ...black-scholes-ch19\02_referenceversion.cpp(56,3)
    remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
    remark #15346: vector dependence: assumed ANTI dependence between ps0 line 58 and pc line 62
    LOOP END
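The word "assumed" in these remarks is the key: a plausible reading is that the compiler cannot prove that the pointers refer to different arrays, so it has to be conservative. The sketch below uses a hypothetical function `add_one`, not code from the deck, to show why overlap matters. When the destination overlaps the source shifted by one element, each iteration reads a value written by the previous one, so the iterations are no longer independent.

```cpp
#include <cassert>
#include <cstddef>

// Adds 1 to every element of src and stores the result in dst.
// Nothing in the signature tells the compiler that dst and src are
// different arrays, so it must assume they might overlap.
void add_one(float* dst, const float* src, std::size_t n) {
    assert(n == 0 || (dst && src));
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = src[i] + 1.0f;
    }
}
```

Called with two separate arrays of zeros, every output element becomes 1. Called as `add_one(buf + 1, buf, 4)` on a five-element array of zeros, the result is 0, 1, 2, 3, 4, because each write feeds the next read. A compiler that processed all reads of a block before any write would get the second case wrong, which is why it refuses to vectorize unless it can rule out overlap or is told to ignore it.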

Identifying optimization opportunities: vectorization within the core
- Run Intel VTune General Exploration with the "Analyze memory bandwidth" option checked.
- Identify the hotspot, i.e. the function that takes the most execution time.
- Check whether the functions are vectorized.

Identifying optimization opportunities: vectorization within the core
Compilation parameters used:
    /GS /W3 /Gy /Zc:wchar_t /Zi /O2 /Fd"x64\Release\vc110.pdb" /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /Qipo /Zc:forScope /Oi /MD /Fa"x64\Release\" /EHsc /nologo /Fo"x64\Release\" /Qprof-dir "x64\release\" /Fp"x64\Release\Reference.pch"
Execution parameters: <number of elements> <number of threads>
    60000000 1

Identifying optimization opportunities: vectorization within the core
Scalar instructions such as movsd or cvtsd2ss ("s" for scalar) are being used instead of vmovapd ("v" for AVX, "p" for packed, "d" for double).
SSE uses 128 bits, not 256 bits as AVX instructions do.

Identifying optimization opportunities: vectorization within the core
What did we find?
- The code is not vectorized: it does not use 256-bit AVX/AVX2 instructions.
- The compiler reported a data dependence.
Opportunities reported by VTune:
- Back-end bound: poor performance when executing instructions
- Memory bound: the application is limited by data traffic between cache and RAM, i.e. load (RAM -> cache) and store (cache -> RAM) instructions
- L1 bound: data is not found at this cache level
- Port utilization: low core utilization, non-memory issues
- The functions cdfnormf and GetOptionPrices are hotspots

Code modernization with semi-auto-vectorization

Code modernization: semi-auto-vectorization
Using #pragma ivdep (for now with only 1 thread/core)
- The code is not vectorized: it does not use 256-bit AVX/AVX2 instructions.
- The compiler reported a data dependence at line 56.
    #pragma ivdep
    for (i = 0; i < N; i++) {
        d1 = (log(ps0[i] / pk[i]) + (r + sig * sig * 0.5) * pt[i]) / (sig * sqrt(pt[i]));
        d2 = (log(ps0[i] / pk[i]) + (r - sig * sig * 0.5) * pt[i]) / (sig * sqrt(pt[i]));
        p1 = cdfnormf(d1);
        p2 = cdfnormf(d2);
        pc[i] = ps0[i] * p1 - pk[i] * exp((-1.0) * r * pt[i]) * p2;
    }
Performance with 60,000,000 elements:
- Original code: time = 22.904422
- With #pragma ivdep: time = 1.886624
- Speedup: 12.1x
Configuration: Intel Core i5-4300 CPU 2.5 GHz, 4 GB RAM, Windows 8.1 x64, Intel Compiler C++ 15.0
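For readers who want to try the loop outside the deck's project, here is a self-contained sketch. It is an approximation under stated assumptions: `price_calls` and `cdf_normal` are hypothetical names, `cdf_normal` stands in for the deck's `cdfnormf` (whose implementation is not shown) using the standard `std::erfc`, and the arrays are `double` for simplicity. `#pragma ivdep` is the Intel compiler spelling used on the slide; compilers that do not recognize it will normally ignore it.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical stand-in for the deck's cdfnormf: standard normal CDF.
inline double cdf_normal(double x) {
    return 0.5 * std::erfc(-x / std::sqrt(2.0));
}

// Prices n European call options. The caller promises that pc does not
// overlap ps0, pk or pt; that promise is what makes ivdep safe here.
void price_calls(const double* ps0, const double* pk, const double* pt,
                 double* pc, std::size_t n, double r, double sig) {
    assert(n == 0 || (ps0 && pk && pt && pc));
#pragma ivdep
    for (std::size_t i = 0; i < n; ++i) {
        const double vol_sqrt_t = sig * std::sqrt(pt[i]);
        const double log_ratio = std::log(ps0[i] / pk[i]);
        const double d1 = (log_ratio + (r + sig * sig * 0.5) * pt[i]) / vol_sqrt_t;
        const double d2 = (log_ratio + (r - sig * sig * 0.5) * pt[i]) / vol_sqrt_t;
        pc[i] = ps0[i] * cdf_normal(d1) - pk[i] * std::exp(-r * pt[i]) * cdf_normal(d2);
    }
}
```

The pragma removes only the assumed dependences; if the arrays really did overlap, the vectorized loop could produce wrong results, so the no-overlap promise is the programmer's responsibility.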

Code modernization: semi-auto-vectorization with #pragma ivdep
Understand what changed by running VTune again and comparing the two versions:
- Fewer instructions executed
- Fewer clock cycles per run
- Fewer L1 cache misses
- Better use of the core


Code modernization: semi-auto-vectorization with #pragma ivdep
Understand what changed by running VTune again and comparing the two versions.
Compiler report: analysis of the vectorized loop and hints on how to vectorize:
    week\pdf-codigos-intel-lncc\paralelismo-dia-02\black-scholes-ch19\02_referenceversion.cpp(63,5) ]
    remark #15300: LOOP WAS VECTORIZED
    remark #15442: entire loop may be executed in remainder
    remark #15448: unmasked aligned unit stride loads: 1
    remark #15450: unmasked unaligned unit stride loads: 2
    remark #15451: unmasked unaligned unit stride stores: 1
    remark #15475: --- begin vector loop cost summary ---
    remark #15476: scalar loop cost: 760
    remark #15477: vector loop cost: 290.250
    remark #15478: estimated potential speedup: 2.560
    remark #15479: lightweight vector operations: 66
    remark #15480: medium-overhead vector operations: 2
    remark #15482: vectorized math library calls: 5
    remark #15487: type converts: 13
    remark #15488: --- end vector loop cost summary ---
    LOOP END

Code modernization: semi-auto-vectorization
- IVDEP: tells the compiler to ignore assumed dependences between the arrays
- QxAVX: 256-bit AVX instructions
- Unroll(n): unrolls the loop for SIMD instructions
Links on the slide (targets not preserved): unroll; requirements for a loop to be vectorized; loop unrolling
[Chart: speedup of in-core optimizations, 1 thread, for 60Mi and 120Mi elements; bars: BASELINE, IVDEP, QXAVX + IVDEP, QXAVX + IVDEP + UNROLL(4); y-axis from 0 to 20]
Configuration: Intel Core i5-4300 CPU 2.5 GHz, 4 GB RAM, Windows 8.1 x64, Intel Compiler C++ 15.0
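As an illustration of what an unroll factor of 4 means, the sketch below unrolls a simple loop by hand. The functions `scale` and `scale_unrolled4` are hypothetical and not part of the deck, which requests the transformation from the compiler instead of writing it manually. The remainder loop corresponds to the "unrolled with remainder" wording seen in the compiler report earlier.

```cpp
#include <cassert>
#include <cstddef>

// Reference loop: one multiplication per iteration.
void scale(const float* in, float* out, std::size_t n, float factor) {
    assert(n == 0 || (in && out));
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = in[i] * factor;
    }
}

// Same computation unrolled by 4: fewer loop-control instructions per
// element and four independent statements the compiler can schedule together.
void scale_unrolled4(const float* in, float* out, std::size_t n, float factor) {
    assert(n == 0 || (in && out));
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        out[i]     = in[i]     * factor;
        out[i + 1] = in[i + 1] * factor;
        out[i + 2] = in[i + 2] * factor;
        out[i + 3] = in[i + 3] * factor;
    }
    // Remainder: up to three elements when n is not a multiple of 4.
    for (; i < n; ++i) {
        out[i] = in[i] * factor;
    }
}
```

In practice a pragma is usually preferable to hand unrolling, since it keeps the source readable and lets the factor be tuned per target.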

Identifying optimization opportunities [2]: numerical precision and data alignment

Identifying optimization opportunities: vectorization within the core, precision and aligned data
Two opportunities reported by the compiler:
- Reducing numerical precision from double (64 bits) to single (32 bits)
- Data alignment
    LOOP BEGIN at ... Black-scholes-ch19\02_ReferenceVersion.cpp(58,3)
    remark #15389: vectorization support: reference ps0 has unaligned access [
    remark #15381: vectorization support: unaligned access used inside loop body
    remark #15399: vectorization support: unroll factor set to 2
    remark #15417: vectorization support: number of FP up converts: single precision to double precision 1
    remark #15389: vectorization support: reference pk has unaligned access [... Black-scholes-ch19\02_ReferenceVersion.cpp(64,5) ]
    remark #15381: vectorization support: unaligned access used inside loop body
    remark #15399: vectorization support: unroll factor set to 8
    remark #15417: vectorization support: number of FP up converts: single precision to double precision 1 [... Black-scholes-ch19\02_ReferenceVersion.cpp(60,5) ]
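The "unaligned access" remarks refer to array start addresses that are not multiples of the vector width in bytes. One portable way to request alignment in standard C++ is `alignas`, sketched below. The names `g_prices`, `aligned_prices` and `is_aligned`, and the buffer size, are hypothetical; the deck does not show how its arrays are allocated, so this is only one possible approach.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// A 256-bit AVX register holds 32 bytes, so 32-byte alignment lets a
// vector load start exactly at a register-sized boundary.
constexpr std::size_t kAvxAlignment = 32;
constexpr std::size_t kCount = 1024;

// Statically allocated buffer with the requested alignment.
alignas(kAvxAlignment) static float g_prices[kCount];

float* aligned_prices() { return g_prices; }

// True when the address p is a multiple of alignment.
inline bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0);
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

A pointer one float past the start of the buffer is no longer 32-byte aligned, which is the situation the report describes for `ps0` and `pk`.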

Identifying optimization opportunities: vectorization within the core, precision and aligned data
Optimizing the code for single precision.
Before (double-precision functions and constants):
    d1 = (log(ps0[i] / pk[i]) + (r + sig * sig * 0.5) * pt[i]) / (sig * sqrt(pt[i]));
    d2 = (log(ps0[i] / pk[i]) + (r - sig * sig * 0.5) * pt[i]) / (sig * sqrt(pt[i]));
    p1 = cdfnormf(d1);
    p2 = cdfnormf(d2);
    pc[i] = ps0[i] * p1 - pk[i] * exp((-1.0) * r * pt[i]) * p2;
After (single-precision functions and constants):
    d1 = (logf(ps0[i] / pk[i]) + (r + sig * sig * 0.5f) * pt[i]) / (sig * sqrtf(pt[i]));
    d2 = (logf(ps0[i] / pk[i]) + (r - sig * sig * 0.5f) * pt[i]) / (sig * sqrtf(pt[i]));
    p1 = cdfnormf(d1);
    p2 = cdfnormf(d2);
    pc[i] = ps0[i] * p1 - pk[i] * expf((-1.0f) * r * pt[i]) * p2;
Result: 23.6x speedup vs. the original code; 1.4x speedup vs. AVX + Unroll + IVDEP.
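The reason the `f` suffixes matter can be checked at compile time: in C++, mixing a `float` with a `double` literal such as `0.5` promotes the whole expression to `double`, which matches the "FP up converts" remark in the compiler report. The sketch below is illustrative; `d1_single` is a hypothetical helper, and it uses the `float` overloads of `std::log` and `std::sqrt`, which play the same role as `logf` and `sqrtf` on the slide.

```cpp
#include <cassert>
#include <cmath>
#include <type_traits>

// A double literal drags a float expression up to double precision.
static_assert(std::is_same<decltype(1.0f * 0.5), double>::value,
              "float * double literal is computed in double");
// With the f suffix the expression stays in single precision.
static_assert(std::is_same<decltype(1.0f * 0.5f), float>::value,
              "float * float literal stays float");
// The float overloads in <cmath> also keep single precision.
static_assert(std::is_same<decltype(std::sqrt(1.0f)), float>::value,
              "std::sqrt(float) returns float");

// d1 term of the Black-Scholes formula, entirely in single precision.
float d1_single(float s0, float k, float t, float r, float sig) {
    assert(k > 0.0f && t > 0.0f && sig > 0.0f);
    return (std::log(s0 / k) + (r + sig * sig * 0.5f) * t) / (sig * std::sqrt(t));
}
```

Staying in `float` throughout avoids conversion instructions and doubles the number of elements per vector register, at the cost of reduced numerical precision that has to be acceptable for the application.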

Identifying optimization opportunities: vectorization within the core, precision and aligned data
Optimizing the code for single precision
[Chart: speedup of in-core optimizations, 1 thread, for 60Mi and 120Mi elements; bars: BASELINE, IVDEP, QXAVX + IVDEP, QXAVX + IVDEP + UNROLL(4), ALL + PRECISION REDUCTION; y-axis from 0 to 30]
Configuration: Intel Core i5-4300 CPU 2.5 GHz, 4 GB RAM, Windows 8.1 x64, Intel Compiler C++ 15.0

Identifying opportunities for parallelism (multi-threading). Do not guess: measure.

Identifying opportunities for parallelism: multithreading with Intel Advisor XE
Although the code is vectorized (SIMD parallelism within each instruction), it still runs on a single thread/core. Before changing the code, we can check whether parallelizing it across more threads is worthwhile.
Intel Advisor XE:
- Models and compares the performance of several threading frameworks on both processors and coprocessors: OpenMP, Intel Cilk Plus, Intel Threading Building Blocks
- Supports C, C++, Fortran (OpenMP only) and C# (Microsoft TPL)
- Predicts code scalability: the relation between number of threads and performance gain
- Identifies opportunities for parallelism in the code
- Checks code correctness (deadlocks, race conditions)

Identifying parallelism opportunities: multithreading with Intel Advisor XE
Steps to use Intel Advisor

Step 1: include the header
    #include "advisor-annotate.h"

Step 2: add the include directory to the project and link the library (Windows and Linux)
Windows with Visual Studio 2012: usually located at C:\Program Files (x86)\intel\advisor XE\include
Linux, compiling and linking with Advisor:
    icpc -O2 -openmp 02_ReferenceVersion.cpp -o 02_ReferenceVersion -I/opt/intel/advisor_xe/include/ -L/opt/intel/advisor_xe/lib64/

Identifying parallelism opportunities: multithreading with Intel Advisor XE
Steps to use Intel Advisor

Step 3: run Advisor
Linux:
    $ advixe-gui &
Create a new project.
- The interface is the same on Linux and Windows.
- With Visual Studio, Advisor can also be run integrated into the IDE.

Identifying parallelism opportunities: multithreading with Intel Advisor XE
Steps to use Intel Advisor: the Advisor workflow

- Survey Target: analyzes the code in search of parallelism opportunities
- Annotate Sources: annotations are inserted around code regions that could run in parallel
- Check Suitability: analyzes the annotated parallel regions and predicts the performance gain and scalability of the code
- Check Correctness: looks for potential problems such as race conditions and deadlocks
- Add Parallel Framework: replaces the Advisor annotations with code for the chosen framework (OpenMP, Cilk Plus, TBB, etc.)

Identifying parallelism opportunities: multithreading with Intel Advisor XE
Identifying hotspots and which loops can be parallelized

Identifying parallelism opportunities: multithreading with Intel Advisor XE
Inserting the Advisor annotations to run the next phase: Check Suitability
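As a rough illustration of what the annotation step looks like in source code, the sketch below marks a loop as a parallel site with one task per iteration, using the `ANNOTATE_SITE_BEGIN`, `ANNOTATE_ITERATION_TASK` and `ANNOTATE_SITE_END` macros from `advisor-annotate.h`. The function `squareAll` and the site and task names are hypothetical. The no-op fallback macros are an assumption added here so the file still builds where the Advisor header is not on the include path.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

#if defined(__has_include)
#  if __has_include("advisor-annotate.h")
#    include "advisor-annotate.h"
#    define HAVE_ADVISOR_ANNOTATE 1
#  endif
#endif
#ifndef HAVE_ADVISOR_ANNOTATE
// Assumption for this sketch: no-op stand-ins when the Advisor header is unavailable.
#  define ANNOTATE_SITE_BEGIN(site)
#  define ANNOTATE_SITE_END()
#  define ANNOTATE_ITERATION_TASK(task)
#endif

// Squares every element. Iterations are independent, so the loop is a candidate parallel site.
void squareAll(const std::vector<float>& in, std::vector<float>& out) {
    assert(&in != &out);
    out.resize(in.size());
    ANNOTATE_SITE_BEGIN(square_site);          // start of the region to be modeled
    for (std::size_t i = 0; i < in.size(); ++i) {
        ANNOTATE_ITERATION_TASK(square_task);  // each iteration is modeled as one task
        out[i] = in[i] * in[i];
    }
    ANNOTATE_SITE_END();                       // end of the region
}
```

The annotations do not change what the program computes. They only tell the Suitability and Correctness analyses which region to model.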

Applying parallelism via OpenMP
Concurrency analysis with Intel VTune
Optimized code: IVDEP + AVX + UNROLL + float precision

Multi-thread optimization with OpenMP, 60Mi:
    Nthreads: 1    time = 0.967425
    Nthreads: 2    time = 0.569371
    Nthreads: 4    time = 0.387649
    Nthreads: 8    time = 0.396282

[Chart: speedup vs. number of threads (1, 2, 4, 8)]
Configuration: Intel Core i5-4300 CPU 2.5 GHz, 4 GB RAM, Windows 8.1 x64, Intel Compiler C++ 15.0
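The speedup plotted on this slide follows directly from the listed times: speedup is the 1-thread time divided by the n-thread time, and parallel efficiency is speedup divided by n. A small sketch of that arithmetic is below. The `Scaling` struct and the `scaling` function are hypothetical helpers, not part of the talk's code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Scaling {
    int threads;
    double speedup;     // baseline time / time with this many threads
    double efficiency;  // speedup per thread, relative to the baseline thread count
};

// times[k] is the wall time measured with threads[k] threads; element 0 is the baseline.
std::vector<Scaling> scaling(const std::vector<int>& threads, const std::vector<double>& times) {
    assert(threads.size() == times.size() && !times.empty());
    std::vector<Scaling> out;
    for (std::size_t k = 0; k < times.size(); ++k) {
        const double s = times[0] / times[k];
        out.push_back({threads[k], s, s * threads[0] / threads[k]});
    }
    return out;
}
```

Fed with the four measurements above, this gives roughly 1.70x at 2 threads, 2.50x at 4 threads and 2.44x at 8 threads, so going from 4 to 8 threads brings no further gain on this machine.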

Code modernization on the Xeon Phi coprocessor

Agenda
- Black-Scholes pricing code
- Computational intensity: Monte Carlo

Code modernization on Intel Xeon Phi
Recap: programming models (from Xeon multi-core centric to MIC many-core centric)

- Multi-Core Hosted: serial and parallel applications
- Offload: applications with parallel stages
- Symmetric: load balancing
- Many-Core Hosted (Native): massively parallel applications

Code modernization on Intel Xeon Phi
Recap: programming models

[Diagram: MIC Platform Software Stack. Host side (Linux/Windows on Intel Xeon): offload application, SSH, system-level code, Windows/Linux OS. Coprocessor side (Intel Xeon Phi): SSH session, native application, offload application, system-level code, Linux uOS. MPSS provides coprocessor communication and app-launching support; the two sides are connected by the PCIe bus.]

Code modernization on Intel Xeon Phi: Black-Scholes pricing code

Compiling for Xeon Phi:
    $ icc -O2 -mmic 02_ReferenceVersion.cpp -o reference.mic -openmp
    $ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/intel/compiler/2015.2.164/composerxe/lib/mic/
    $ scp reference.mic mic0:/tmp
    $ ssh mic0
    $ cd /tmp
    $ ./reference.mic

Checking library dependencies:
    $ micnativeloadex reference.mic -l
    Dependency information for reference.mic...
    Binary was built for Intel(R) Xeon Phi(TM) Coprocessor (codename: Knights Corner) architecture
    SINK_LD_LIBRARY_PATH =
    Dependencies Found: (none found)
    Dependencies Not Found Locally (but may exist already on the coprocessor):
        libm.so.6
        libiomp5.so
        libstdc++.so.6
        libgcc_s.so.1
        libpthread.so.0
        libc.so.6
        libdl.so.2

Testing the code on Intel Xeon Phi
Vectorized code, but on 1 thread: code simply ported to Xeon Phi

    icc -O2 -openmp -fimf-precision=low -fimf-domain-exclusion=31 -mmic 11_XeonPhi.cpp -o 11_XeonPhi.mic
    icc -O2 -openmp -fimf-precision=low -fimf-domain-exclusion=31 -mmic 12_XeonPhiWorkInParallel.cpp -o 12_XeonPhiWorkInParallel.mic
    icc -O2 -openmp -fimf-precision=low -fimf-domain-exclusion=31 -mmic 13_XeonPhiStreamingStores.cpp -o 13_XeonPhiStreamingStores.mic

Code modernization on Intel Xeon Phi
Vectorized code, but on 1 thread

    #pragma vector aligned
    #pragma simd
    for (i = 0; i < N; i++)
    {
        // invsqrtf returns 1/sqrt(x), so the division by sig*sqrt(T) becomes a multiplication
        invf = invsqrtf(sig2 * pt[i]);
        d1 = (logf(ps0[i] / pk[i]) + (r + sig2 * 0.5f) * pt[i]) * invf;
        d2 = (logf(ps0[i] / pk[i]) + (r - sig2 * 0.5f) * pt[i]) * invf;
        erf1 = 0.5f + 0.5f * erff(d1 * invsqrt2);
        erf2 = 0.5f + 0.5f * erff(d2 * invsqrt2);
        pc[i] = ps0[i] * erf1 - pk[i] * expf((-1.0f) * r * pt[i]) * erf2;
    }

Times shown on the slide (vectorized, 1 thread): time = ~0.481689 and time = 0.586156. Slide labels: Intel Xeon processor, Intel Xeon Phi coprocessor.

Testing the code on Intel Xeon Phi
Vectorized and parallelized code

    #pragma vector aligned
    #pragma simd
    #pragma omp parallel for private(d1, d2, erf1, erf2, invf)
    for (i = 0; i < N; i++)
    {
        // invsqrtf returns 1/sqrt(x), so the division by sig*sqrt(T) becomes a multiplication
        invf = invsqrtf(sig2 * pt[i]);
        d1 = (logf(ps0[i] / pk[i]) + (r + sig2 * 0.5f) * pt[i]) * invf;
        d2 = (logf(ps0[i] / pk[i]) + (r - sig2 * 0.5f) * pt[i]) * invf;
        erf1 = 0.5f + 0.5f * erff(d1 * invsqrt2);
        erf2 = 0.5f + 0.5f * erff(d2 * invsqrt2);
        pc[i] = ps0[i] * erf1 - pk[i] * expf((-1.0f) * r * pt[i]) * erf2;
    }

Times shown on the slide: 4 threads, time = 0.138331; 10 threads, time = 0.174219. Slide labels: Intel Xeon processor, Intel Xeon Phi coprocessor.
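For readers without the Intel compiler, the same kernel can be sketched in portable C++. The version below is an approximation of the slide's loop under stated assumptions, not the talk's code: it swaps the Intel-specific `#pragma simd`, `#pragma vector aligned` and `invsqrtf` for the OpenMP 4.0 combined construct `#pragma omp parallel for simd` and `1.0f / std::sqrt`. The function name `priceCalls` is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Prices European calls with the Black-Scholes formula in single precision.
// ps0: spot prices, pk: strikes, pt: times to maturity, pc: output call prices.
void priceCalls(const std::vector<float>& ps0, const std::vector<float>& pk,
                const std::vector<float>& pt, std::vector<float>& pc,
                float r, float sig) {
    assert(ps0.size() == pk.size() && pk.size() == pt.size());
    pc.resize(ps0.size());
    const float sig2 = sig * sig;
    const float invsqrt2 = 1.0f / std::sqrt(2.0f);
    const long n = static_cast<long>(ps0.size());

    // Threads split the iteration space; each thread vectorizes its own chunk.
    // Without OpenMP enabled, the pragma is ignored and the loop runs serially.
    #pragma omp parallel for simd
    for (long i = 0; i < n; ++i) {
        // Variables declared inside the loop body are private to each iteration.
        const float invf = 1.0f / std::sqrt(sig2 * pt[i]);  // 1 / (sig * sqrt(T))
        const float logSK = std::log(ps0[i] / pk[i]);
        const float d1 = (logSK + (r + sig2 * 0.5f) * pt[i]) * invf;
        const float d2 = (logSK + (r - sig2 * 0.5f) * pt[i]) * invf;
        const float erf1 = 0.5f + 0.5f * std::erf(d1 * invsqrt2);  // normal CDF at d1
        const float erf2 = 0.5f + 0.5f * std::erf(d2 * invsqrt2);  // normal CDF at d2
        pc[i] = ps0[i] * erf1 - pk[i] * std::exp(-r * pt[i]) * erf2;
    }
}
```

Declaring the temporaries inside the loop removes the need for the `private(...)` clause. A quick sanity check is the textbook case S = K = 100, T = 1, r = 0.05, sig = 0.2, which should price at about 10.45.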

Testing the code on Intel Xeon Phi
Vectorized and parallelized code

[Chart: Black-Scholes execution time vs. number of threads (1, 2, 4, 8, 12, 16, 32, 48, 60, 120, 240). Series: Xeon E5-2697v3; MIC - Simple port; MIC - Parallel; MIC - StreamStore; MIC - warm-up.]
Configuration: Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz, 64 GB RAM, Linux version 2.6.32-358.6.2.el6.x86_64.crt1, Red Hat 4.4.7-3, Intel Xeon Phi 7110ª, MPSS 3.3.30726

Testing the code on Intel Xeon Phi: performance analysis
VTune indicates high latency, which directly affects vectorization.

Testing the code on Intel Xeon Phi: performance analysis, bandwidth
DRAM <-> cache bandwidth: version with streaming-store instructions

Testing the code on Intel Xeon Phi: performance analysis, bandwidth
DRAM <-> cache bandwidth: comparison of the streaming-store and non-streaming-store versions

Testing the code on Intel Xeon Phi: some conclusions

- The application is bound by memory bandwidth; in this case, adding more threads brings no gain.
- Streaming stores on Xeon Phi bypass the cache for write-only arrays, which reduces bandwidth consumption:
    Computing X options = 3X reads + X writes + X read-for-ownership = 5X
    With streaming stores: 3X reads + X writes = 4X
- Latency caused by cache misses hurts vectorization.
- Thread-creation overhead affects the first execution of the OpenMP loop, especially on Xeon Phi.
- The final version of the code computed 2,067 million options/s on Xeon and 9,231 million options/s on Xeon Phi.
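The traffic count above can be turned into an implied memory bandwidth. The sketch below does that arithmetic, assuming 4-byte single-precision elements. The element size and both function names are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Bytes moved per option: 3 reads + 1 write, plus 1 read-for-ownership
// unless streaming stores bypass the cache.
constexpr double bytesPerOption(std::size_t elemBytes, bool streamingStores) {
    return static_cast<double>(elemBytes) * (streamingStores ? 4.0 : 5.0);
}

// Memory bandwidth, in GB/s, implied by a given pricing rate.
inline double impliedBandwidthGBs(double optionsPerSecond, std::size_t elemBytes,
                                  bool streamingStores) {
    assert(optionsPerSecond >= 0.0);
    return optionsPerSecond * bytesPerOption(elemBytes, streamingStores) / 1e9;
}
```

Under these assumptions streaming stores cut traffic from 20 to 16 bytes per option, a 20% reduction. If the 9,231 million options/s figure was measured with streaming stores enabled, it corresponds to roughly 148 GB/s of memory traffic.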

Computational intensity: Monte Carlo European Option on Xeon Phi

Computational intensity: Monte Carlo European Option on Xeon Phi
Monte Carlo European Option with Pre-generated Random Numbers for Intel Xeon Phi Coprocessor
- Article 1: download of the code used
- Article 2: optimization details

Compiling for Xeon Phi:
    icpc MonteCarlo.cpp -mmic -O3 -ipo -fno-alias -opt-threads-per-core=4 -openmp -restrict -vec-report2 -fimf-precision=low -fimf-domain-exclusion=31 -no-prec-div -no-prec-sqrt -DCOMPILER_VERSION="15" -ltbbmalloc -DFPFLOAT -o MonteCarloSP.mic

Checking library dependencies:
    $ ssh mic0
    $ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/intel/compiler/2015.2.164/lib/mic:/opt/intel/compiler/2015.2.164/tbb/lib/mic
    $ source /opt/intel/compiler/2015.2.164/composerxe/bin/compilervars.sh intel64

Xeon and Xeon Phi: same code, same programming model

Xeon Phi:
    icpc MonteCarlo.cpp -mmic -O3 -ipo -fno-alias -opt-threads-per-core=4 -openmp -restrict -vec-report2 -fimf-precision=low -fimf-domain-exclusion=31 -no-prec-div -no-prec-sqrt -DCOMPILER_VERSION="15" -ltbbmalloc -DFPFLOAT -g -o MonteCarloSP.mic

Xeon:
    icpc MonteCarlo.cpp -O3 -ipo -fno-alias -opt-threads-per-core=2 -openmp -restrict -vec-report2 -fimf-precision=low -fimf-domain-exclusion=31 -no-prec-div -no-prec-sqrt -DCOMPILER_VERSION="15" -ltbbmalloc -DFPFLOAT -o MonteCarloSP.xeon

Computational intensity: Monte Carlo European Option on Xeon Phi
Baseline: the code is already parallel and vectorized.

    ...
    #pragma omp parallel for
    for(int opt = 0; opt < OPT_N; opt++)
    {
        ...
        #pragma vector aligned
        #pragma loop count (RAND_N)
        #pragma simd reduction(+:v0) reduction(+:v1)
        #pragma unroll(unrollcount)
        for(const FP_TYPE *hrnd = start; hrnd < end; ++hrnd)
        {
            const FP_TYPE result = std::max(typed_zero, Y*EXP2(VBySqrtT*(*hrnd) + MuByT) - Z);
            v0 += result;
            v1 += result*result;
        }
        CallResultList[opt] += v0;
        CallConfidenceList[opt] += v1;
    }
    }
    ...

Times shown on the slide: 60 threads, time = 480.372514; 240 threads, time = 66.528498.
Instruction sets: AVX2 on the Intel Xeon processor, Intel IMCI on the Intel Xeon Phi coprocessor.
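The inner loop above is a vectorized sum reduction over pre-generated random numbers. A self-contained sketch of the same pattern for a single option follows, in standard C++ with the OpenMP `simd` pragma. It is an illustration under assumptions, not the benchmark code: it uses `std::exp` instead of the `EXP2` macro, accumulates in double for numerical safety, and the names `makeNormals`, `priceCall` and `McResult` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

struct McResult {
    float price;     // discounted mean payoff
    float stdError;  // standard error of the estimate
};

// Pre-generates standard normal draws once so they can be reused for every option.
std::vector<float> makeNormals(std::size_t n, unsigned seed) {
    std::mt19937 gen(seed);
    std::normal_distribution<float> dist(0.0f, 1.0f);
    std::vector<float> z(n);
    for (float& x : z) x = dist(gen);
    return z;
}

// Monte Carlo price of a European call under geometric Brownian motion.
McResult priceCall(float S, float K, float T, float r, float sig,
                   const std::vector<float>& rnd) {
    assert(!rnd.empty());
    const float vBySqrtT = sig * std::sqrt(T);
    const float muByT = (r - 0.5f * sig * sig) * T;
    const float* data = rnd.data();
    const long n = static_cast<long>(rnd.size());

    double v0 = 0.0;  // sum of payoffs
    double v1 = 0.0;  // sum of squared payoffs
    #pragma omp simd reduction(+:v0) reduction(+:v1)
    for (long i = 0; i < n; ++i) {
        const float payoff = std::max(0.0f, S * std::exp(vBySqrtT * data[i] + muByT) - K);
        v0 += payoff;
        v1 += payoff * payoff;
    }

    const double mean = v0 / n;
    const double variance = std::max(0.0, v1 / n - mean * mean);
    const double discount = std::exp(-static_cast<double>(r) * T);
    return {static_cast<float>(discount * mean),
            static_cast<float>(discount * std::sqrt(variance / n))};
}
```

Because each path needs one exponential plus a few multiplies and adds per 4-byte random number read, the work per byte is much higher than in the Black-Scholes kernel, which is what makes this workload compute bound.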

Monte Carlo European Option: execution time on Xeon and Xeon Phi

[Chart: results vs. number of threads (4, 8, 12, 16, 32, 48, 60, 120, 160, 240). Series: Xeon E5-2697v3; Xeon Phi 7110. The vertical axis is labeled SPEEDUP on the slide.]
Configuration: Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz, 64 GB RAM, Linux version 2.6.32-358.6.2.el6.x86_64.crt1, Red Hat 4.4.7-3, Intel Xeon Phi 7110ª, MPSS 3.3.30726

Xeon and Xeon Phi: memory usage analysis
- The application is compute bound, not memory-bandwidth bound.
- High cache utilization: good data locality.

Xeon and Xeon Phi: memory usage analysis
- The application is compute bound, not memory-bandwidth bound.
- There may still be room for optimization: software prefetching and data locality.

Xeon and Xeon Phi: other cases
Same programming and optimization model

Joint study from Colfax International and Stanford University
- Goal: building a 3D model of the Milky Way galaxy using large volumes of 2D sky survey data.
- The app was ported quickly to Xeon Phi, but performance was <1/3 because we moved from a >3 GHz out-of-order modern CPU to a low-power 1 GHz in-order core, with limited threading and no vectorization.
- After modernizing the code, it runs up to 125x faster (vs. Xeon) and 620x faster (vs. Xeon Phi) than the baseline. [1]

[1] Source: Colfax International and Stanford University. Complete details can be found at: http://research.colfaxinternational.com/?tag=/heatcode

Xeon and Xeon Phi: other cases
Same programming and optimization model

[Chart: HEATCODE benchmarks, performance in voxels/second on a logarithmic scale. Non-optimized: CPU & Intel C++, Xeon Phi. Optimized: CPU & Intel C++ (125x), Xeon Phi (620x). Hetero: CPU + Xeon Phi, + Xeon Phi. Additional labels on the chart: 1.9x and 4.4x.]

Conclusions: code modernization for Xeon and Xeon Phi

- First step: modernize your serial code to make the most of instruction-level parallelism and vector processing.
- Optimization workflow: measure, identify, optimize, test.
- Xeon Phi: every optimization made for Xeon is preserved.
- Explore Xeon + Xeon Phi:
    - Single-thread optimization
    - Vectorization / SIMD / instruction-level parallelism / multi-threading
    - Use of open standards
    - Same programming technique for Xeon and Xeon Phi
    - Several ways to optimize: ease of use or fine tuning

Useful links
Please give your feedback on this talk: http://bit.ly/intelpesquisa

- Intel Compiler 15.0 manual and reference
- Intel Intrinsics manual and reference
- Auto-vectorization guide
- Xeon Phi home page
- Xeon Phi code recipes
- Intel Developer Zone
- IPCCs: Intel Parallel Computing Centers
- Product Overview (IBL)

A Growing Application Catalog Over 100 apps listed as available today or in flight http://software.intel.com/xeonphicatalog

Tests: Xeon server
Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz (Haswell)
- CPU(s): 56
- Thread(s) per core: 2
- Core(s) per socket: 14
- Socket(s): 2
- L1d cache: 32K
- L1i cache: 32K
- L2 cache: 256K
- L3 cache: 35840K

2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Intel Xeon, and Intel Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.