GPU Accelerated Stochastic Inversion of Deep Water Seismic Data

Tomás Ferreirinha, Rúben Nunes, Amílcar Soares, Frederico Pratas, Pedro Tomás, Nuno Roma
Pedro Tomás
INESC-ID, Instituto Superior Técnico, Universidade de Lisboa
pedro.tomas@inesc-id.pt

Outline
Motivation and objectives
Stochastic Seismic AVO Inversion Algorithm
  Algorithm Description
  Dependencies
Parallelization Strategy
  Mapping Methodology
  Main Considered Optimizations
  Multi-GPU Parallelization Approach
Experimental Results
Conclusions

Seismic Inversion Methods
Allow estimating the physical properties of the Earth's subsurface
Typically based on seismic reflection data
Allow studying well logs, e.g., for accurate drilling

Motivation
Simple seismic inversion models carry a high level of uncertainty, leading to drilling errors
Complex models, on the other hand, lead to significant execution times
Simulation times ranging from weeks to months are common
As a result, simulations are often constrained in algorithmic complexity and in the numerical precision achieved

Objective
Accelerate state-of-the-art complex Seismic Inversion algorithms
  Decrease execution time from weeks or months to days or even hours
  Allows using more complex and accurate computational models
  Can be used for larger fields
Efficiently exploit platforms composed of multiple heterogeneous accelerators, specifically GPUs
  OpenCL allows exploiting GPUs from different vendors
25/08/2014

Stochastic Seismic AVO Inversion Algorithm: Concepts
The Stochastic Seismic AVO Inversion algorithm is based on two main concepts:
1. Density, P-wave velocity and S-wave velocity models (ρ, Vp and Vs) are perturbed towards an objective function, according to the Direct Sequential Simulation (DSS) algorithm
2. A genetic algorithm, acting as a global optimizer, ensures the convergence of the solution from iteration to iteration

Stochastic Seismic AVO Inversion Algorithm: Flowchart
1. Stochastic simulation of the ρ/Vp/Vs data using the Direct Sequential Simulation (DSS) algorithm
2. Calculation of the synthetic pre-stack seismic cube with the simulated ρ/Vp/Vs models
3. Comparison between the synthetic seismic data and the real seismic data, and creation of the seismic correlation cube
4. The best ρ/Vp/Vs models are built by selecting areas of higher correlation using a genetic algorithm
5. Repeat steps 1 to 4 until the defined number of simulations per iteration has been reached
6. Estimate the models' accuracy by computing the correlation cubes (using the best ρ/Vp/Vs models)
7. Repeat the whole procedure, using information regarding the best models found, until a matching criterion is reached

Stochastic Seismic AVO Inversion: DSS Algorithm
Direct Sequential Simulation (DSS) algorithm, required for ρ/Vp/Vs: 95% of the execution time
Required for each node of the simulation cube
[Figure: simulation cube]

Direct Sequential Simulation (DSS) Algorithm: Dependencies
DSS algorithm: 95% of the execution time
Required for each node of the simulation cube, leading to 85% of the execution time
[Figure: simulation cube]

Next: Parallelization Strategy

Algorithm Parallelization: Possible strategies (1)
Data-level: each part of the algorithm is individually accelerated by simultaneously processing independent data
Few parallelization opportunities, and the sequential processing time is too short to justify the parallelization overhead
Speed-up is around 1

Algorithm Parallelization: Possible strategies (2)
Functional-level: simultaneously execute multiple independent parts of the algorithm
Create a pipeline of operations for the random sequence of nodes
Simulation is limited by the sequential execution of parts C and D, which caps the maximum achievable speed-up at 1.4
[Figure annotation: 85% of the execution time]
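One way to see why sequential parts cap a functional-level pipeline: if some parts cannot overlap between consecutive nodes, each node costs at least their combined time, so the speed-up cannot exceed the total per-node time divided by that serial time. The sketch below illustrates this bound. The part names A and B and all per-node times are hypothetical values chosen only for illustration; they are not measurements from the talk.

```python
def pipeline_speedup_bound(stage_times, serial_stages):
    """Upper bound on pipeline speed-up when some stages must run sequentially.

    stage_times: per-node time of each stage (arbitrary units).
    serial_stages: names of the stages that cannot overlap between nodes.
    """
    total = sum(stage_times.values())                      # sequential cost per node
    serial = sum(stage_times[name] for name in serial_stages)  # unavoidable cost per node
    return total / serial


# Hypothetical per-node times, NOT taken from the presentation.
times = {"A": 1.0, "B": 1.0, "C": 3.0, "D": 2.0}
bound = pipeline_speedup_bound(times, ["C", "D"])
print(round(bound, 2))  # 7 / 5
```

With these made-up times the serial parts take 5 of every 7 time units, so no amount of pipelining of the remaining parts can push the speed-up past 7/5.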

Algorithm Parallelization: Possible strategies (3)
Path-level: multiple nodes from the random path are simulated at the same time by different parallel processing threads
Must ensure that no conflicts arise between the nodes being simulated in parallel
Dividing the grid into more sub-grids breaks more dependencies, which may eventually affect algorithm convergence
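A minimal sketch of the path-level idea, assuming the grid is split into equally spaced sub-grids and that each parallel batch takes one node from every sub-grid along that sub-grid's own random path. The function name, the `divisions` parameter and the batching scheme are assumptions made for illustration; conflict checks between neighbouring sub-grids are deliberately left out.

```python
import itertools
import random


def path_level_batches(shape, divisions, seed=0):
    """Yield batches of nodes, at most one node per sub-grid, in random order."""
    rng = random.Random(seed)
    # Group every node index by the sub-grid that contains it.
    subgrids = {}
    for node in itertools.product(*(range(n) for n in shape)):
        key = tuple(i * d // n for i, d, n in zip(node, divisions, shape))
        subgrids.setdefault(key, []).append(node)
    # Each sub-grid follows its own random path.
    for nodes in subgrids.values():
        rng.shuffle(nodes)
    # One batch = the next node of every sub-grid that still has nodes left.
    while any(subgrids.values()):
        yield [nodes.pop() for nodes in subgrids.values() if nodes]


# Tiny illustrative cube: 4 x 4 x 2 nodes split into 2 x 2 x 1 sub-grids.
batches = list(path_level_batches((4, 4, 2), (2, 2, 1)))
print(len(batches), len(batches[0]))
```

Each batch could be handed to parallel threads, one per node. Increasing `divisions` gives larger batches but also ignores more dependencies between nodes, which is the trade-off the slide points to.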

Algorithm Parallelization: Optimizations
Postpone the simulation of nodes located in regions with few available data values in the neighbourhood
For example, in this case only nodes from sub-grids with at least 4 conditioning nodes (both in the sub-grid being simulated and in the neighbouring sub-grids) are simulated.
[Figure: sub-grids annotated with the values 5 6 5 4 1 4 5 4 4 2 3 4 3 3 2]
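The postponing rule can be sketched as follows, assuming a 2D layout of sub-grids and an 8-neighbour definition of "neighbouring sub-grids". Both are assumptions made for illustration, as are the function names and the input counts; only the threshold of 4 comes from the example on the slide.

```python
def neighbourhood_counts(own):
    """Sum conditioning nodes of each sub-grid and its (up to 8) neighbours."""
    rows, cols = len(own), len(own[0])
    total = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        total[r][c] += own[rr][cc]
    return total


def split_ready_postponed(own, min_conditioning=4):
    """Return (ready, postponed) sub-grid coordinates for the current step."""
    counts = neighbourhood_counts(own)
    ready, postponed = [], []
    for r, row in enumerate(counts):
        for c, count in enumerate(row):
            (ready if count >= min_conditioning else postponed).append((r, c))
    return ready, postponed


# Hypothetical number of conditioning nodes inside each sub-grid.
own_counts = [[3, 0, 0],
              [1, 0, 0],
              [0, 0, 0]]
ready, postponed = split_ready_postponed(own_counts)
print(ready)
```

Sub-grids in `postponed` would be revisited in a later step, once nearby simulated nodes have raised their counts.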

Algorithm Parallelization: Optimizations
Postpone the simulation of nodes located in regions with few available data values in the neighbourhood
Optimize data structures and their indexing in order to increase the coalescing of memory accesses
Use GPU local memory to reduce the average access time to global memory buffers
Minimize the number of OpenCL calls in order to reduce the parallelization overhead
Use the bottom-up merge sort algorithm to minimize warp divergence when executing multiple sorts in parallel
Overlap computations with communications and file write operations
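For reference, a minimal sketch of bottom-up (iterative) merge sort, written in Python rather than as an OpenCL kernel. Its pass structure depends only on the array length and not on the data, which is presumably why it suits threads that sort equal-length arrays side by side. This is a generic textbook version, not the authors' kernel.

```python
def bottom_up_merge_sort(values):
    """Sort a sequence with iterative merge sort (no recursion)."""
    src = list(values)
    dst = [None] * len(src)
    n = len(src)
    width = 1
    while width < n:
        # Merge neighbouring runs of length `width` into runs of 2 * width.
        for left in range(0, n, 2 * width):
            mid = min(left + width, n)
            right = min(left + 2 * width, n)
            i, j = left, mid
            for k in range(left, right):
                # Take from the left run while it holds the smaller element.
                if i < mid and (j >= right or src[i] <= src[j]):
                    dst[k] = src[i]
                    i += 1
                else:
                    dst[k] = src[j]
                    j += 1
        src, dst = dst, src  # the merged output feeds the next pass
        width *= 2
    return src


print(bottom_up_merge_sort([5, 2, 9, 1, 5, 6, 0]))
```

The number of passes is the base-2 logarithm of the length, rounded up, and every pass touches each element once, so parallel sorts of the same size stay in step.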

Algorithm Parallelization: Multi-GPU Approach
Requires a synchronization step at the end of each iteration
Distribute nodes among the available devices according to real-time performance measurements
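A possible reading of the load-balancing rule is a proportional split: each device receives a share of the nodes in proportion to the throughput it achieved in the previous step. The sketch below assumes that policy; the function names and the numbers are hypothetical, and the talk does not state the exact scheme used.

```python
def measured_throughput(nodes_done, seconds):
    """Nodes per second achieved by a device in the last step."""
    return nodes_done / seconds


def split_nodes(num_nodes, throughputs):
    """Split num_nodes among devices in proportion to their throughput."""
    total = sum(throughputs)
    shares = [int(num_nodes * t / total) for t in throughputs]
    # Give the nodes lost to rounding to the fastest device.
    fastest = max(range(len(throughputs)), key=lambda d: throughputs[d])
    shares[fastest] += num_nodes - sum(shares)
    return shares


# Hypothetical timings: device 0 simulated 600 nodes in 2 s, device 1 400 nodes in 4 s.
rates = [measured_throughput(600, 2.0), measured_throughput(400, 4.0)]
print(split_nodes(1000, rates))
```

Re-measuring the rates at every synchronization point lets the split follow changes in device performance over the run.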

Next: Experimental Results

Experimental Results: Computing Platforms

Component  System 1            System 2         System 3
CPU        i7 3820             Xeon E5-2609     i7 4770K
RAM        16 GB               32 GB            32 GB
GPU 1      Hawaii R9 290X      GeForce GTX 680  GeForce GTX 780
GPU 2      GeForce GTX 560 Ti  GeForce GTX 680  GeForce GTX 660 Ti

Reference: i7 3820 CPU (4 cores)
Experimental dataset with 237x197x350 nodes
5 iterations, each with 8 sets of simulations

Experimental Results: Performance
1 to 4 cores: 5.3x speedup (hyper-threading enabled)

Experimental Results: Performance
[Figure: performance with 1 GPU and with 2 GPUs]
1 GPU to 2 GPUs: 1.8x speedup for the parallelized parts (short of the ideal 2x because of device synchronization requirements)
Algorithm is now limited by the non-parallelized parts
Best performance: 22x speedup
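The remark that the algorithm becomes limited by its non-parallelized parts follows from Amdahl's law: if a fraction p of the run time is accelerated by a factor s, the overall speed-up is 1 / ((1 - p) + p / s). The sketch below evaluates that formula; the fractions used are hypothetical and are not the measured breakdown of this application.

```python
def overall_speedup(parallel_fraction, part_speedup):
    """Amdahl's law: whole-program speed-up when only a fraction is accelerated."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / part_speedup)


# Hypothetical case: half of the remaining run time is parallelized code.
print(round(overall_speedup(0.5, 1.8), 2))
# Even an infinitely fast parallel part cannot beat 1 / serial_fraction.
print(round(overall_speedup(0.9, float("inf")), 2))
```

Under these assumed fractions, a 1.8x gain on the parallelized parts translates into a noticeably smaller overall gain, which is why adding a second GPU helps less as the serial share grows.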

Experimental Results: Convergence
Convergence is achieved when the number of grid divisions is appropriately chosen

Conclusions and on-going work
Significant acceleration of a state-of-the-art Stochastic Seismic AVO Inversion algorithm
  Up to 22x speed-up versus the 4-core CPU-only version
Parallelism achieved by relaxing the algorithm dependencies regarding the random path
  No loss of accuracy in the generated models
On-going work:
  Accelerate other parts of the algorithm, such as the computation of the correlation
  Further explore the algorithm parallelism

Pedro Tomás
pedro.tomas@inesc-id.pt

Experimental Results: Performance Bottlenecks
Synchronization and communication requirements
Application is memory-bound