Jairo Panetta - Academia.edu

Papers by Jairo Panetta

The dynamic load-balancing framework in Charm++/AMPI, developed at the University of Illinois, uses processor virtualization to allow thread migration across processors. This framework has been successfully applied to many scientific applications, such as BRAMS, NAMD, and ChaNGa. Most of these applications use only CPUs. However, the use of GPUs to improve computational performance is spreading rapidly in the high-performance computing community. This paper investigates how the same Charm++/AMPI framework can be extended to balance load in a synthetic application inspired by the BRAMS numerical forecast model, running mostly on GPUs rather than on CPUs. Several major questions involving the use of GPUs with AMPI were handled in this work, including how to measure the GPU's load, how to use and share GPUs among user-level threads, and what results are obtained when applying the mandatory over-decomposition technique to a GPU-accelerated program.

Anais do XI International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 1999), Sep 29, 1999

We describe the first parallel version of CPTEC's General Circulation Model, targeting a 4-processor, shared-memory NEC SX4. This paper emphasizes techniques to parallelize vintage production code while keeping results reproducible. Measured speed-ups compare favorably with the values predicted by Amdahl's Law.
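For readers unfamiliar with the bound mentioned in the abstract above, Amdahl's Law can be sketched as follows. The function name and the parallel fractions are illustrative assumptions; the abstract does not report the fraction of the model that was parallelized, so none of the numbers below come from the paper.

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Upper bound on speed-up when only `parallel_fraction` of the run time scales."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if processors < 1:
        raise ValueError("processors must be >= 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part takes the same time; the parallel part is divided by P.
    return 1.0 / (serial_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    # Hypothetical parallel fractions, evaluated for a 4-processor machine.
    for fraction in (0.50, 0.90, 0.99):
        bound = amdahl_speedup(fraction, 4)
        print(f"parallel fraction {fraction:.2f}: speed-up bound {bound:.2f}")
```

Comparing a measured speed-up against this bound, as the abstract describes, indicates how much of the remaining gap is due to overheads rather than to serial code.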

Anais do XI International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 1999), Sep 29, 1999

This paper presents the results of a preliminary performance evaluation of parallel RAMS, a numerical weather prediction model designed to simulate atmospheric phenomena at a regional level. The main goal of our work was to study in detail the performance of the current RAMS version and to uncover aspects of its code where opportunities for optimization exist. We present our observations on both computation and communication performance of RAMS executing on a distributed-memory parallel platform, and analyze their contributions to total program performance. From the observed data, we present simulations that predict bounds on potential performance gains for possible load-balancing strategies.

Many software mechanisms for geophysics exploration in the Oil & Gas industry are based on wave propagation simulation. To perform such simulations, state-of-the-art HPC architectures are employed, generating results faster and with more accuracy at each generation. The software must evolve to support the new features of each design to keep performance scaling. Furthermore, it is important to understand the impact of each change applied to the software in order to improve performance as much as possible. In this paper, we propose several optimization strategies for a wave propagation model on five architectures: Intel Haswell, Intel Knights Corner, Intel Knights Landing, NVIDIA Kepler, and NVIDIA Maxwell. We focus on improving cache memory usage, vectorization, and locality in the memory hierarchy. We analyze the hardware impact of the optimizations, providing insights into how each strategy can improve performance. The results show that NVIDIA Maxwell improves over Intel Haswell, Intel Knights Corner, Intel Knights Landing, and NVIDIA Kepler performance by up to 17.9x.

Anais do IV Workshop em Sistemas Computacionais de Alto Desempenho (WSCAD 2003)

This work presents the methodology used and the results obtained in the two-level parallelization (vectorization and OpenMP parallelization) of the Eta numerical weather prediction model for the NEC-SX6, a vector and parallel machine with central memory. We demonstrate the main advantages of using the OpenMP standard, which enables porting serial programs to shared-memory parallel environments. The methodology used, the difficulties of accommodating two-level parallelism, and the results obtained are all part of this work.

This work is a small step in the direction of code portability across parallel and vector machines. The proposal consists of a style of programming and a set of parallel operators built over abstract data types. Objects and operators are directed to the Computational Linear Algebra area, although the principles of the proposal can be applied to any other area. A subset of the operators was implemented on a 64-processor, distributed-memory MIMD machine. The results show that computationally intensive operators achieve asymptotically optimal speed-ups, but data movement operators are inefficient, and some are even intrinsically sequential.

Anais do X Simpósio em Sistemas Computacionais de Alto Desempenho (WSCAD 2009), 2009

The constant demand to improve the quality of numerical weather forecasts requires the use of progressively more powerful computers. With the popularization of multicore processors, the use of systems with many hundreds of processors has become economically viable. This work investigates how to scale the parallelism of an operational weather forecasting application that runs efficiently on many tens of processors to machines with many hundreds of processors. The investigation determined the application's limitations and their surprising causes, and led to an efficient solution that reaches the desired scale of parallelism.

This work investigates whether a scalar advection code can be ported efficiently to multi-core and many-core architectures using OpenMP and OpenACC directives. The advection code is part of the dynamics of the regional meteorological model BRAMS, which is executed daily in production mode at CPTEC/INPE and parallelized with the Message Passing Interface (MPI) library. We demonstrate that a single code containing both classes of directives obtains acceptable performance on both architectures.

The variety of parallel programming languages poses the challenge of selecting an appropriate language for each target platform. Using classical and recent metrics, this preliminary study compares parallel programming languages with respect to programming productivity and performance portability. We used three parallel versions of the Game of Life coded in OpenMP, CUDA, and Kokkos. Results show that OpenMP and Kokkos require less coding effort than CUDA, that OpenMP is the best choice for running on the CPU, and that Kokkos is the best choice for the GPGPU. * This work was partially funded by Cooperation Agreement 0050.0102253.16.9 between Petrobras and the Universidade Federal do Rio Grande do Sul.
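The benchmark named in the abstract above, Conway's Game of Life, reduces to one stencil update per generation. The following is a minimal serial sketch in Python, not the OpenMP, CUDA, or Kokkos code from the study; the function name `life_step` and the periodic (wrap-around) boundary are assumptions made here for illustration.

```python
from typing import List

Grid = List[List[int]]


def life_step(grid: Grid) -> Grid:
    """Compute one Game of Life generation on a periodic grid of 0/1 cells."""
    rows, cols = len(grid), len(grid[0])
    new_grid = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Count the eight neighbours, wrapping around the edges (assumed).
            neighbours = sum(
                grid[(r + dr) % rows][(c + dc) % cols]
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)
            )
            # Birth on exactly 3 neighbours; survival on 2 or 3.
            alive = neighbours == 3 or (grid[r][c] == 1 and neighbours == 2)
            new_grid[r][c] = 1 if alive else 0
    return new_grid


if __name__ == "__main__":
    # A horizontal "blinker" on a 5x5 grid.
    blinker = [[0] * 5 for _ in range(5)]
    for col in (1, 2, 3):
        blinker[2][col] = 1
    for row in life_step(blinker):
        print("".join("#" if cell else "." for cell in row))
```

Each cell of the new generation depends only on the previous generation, so the two outer loops carry no dependences between iterations. That property is what makes the kernel a convenient candidate for comparing directive-based and kernel-based parallel languages.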

Hash-based signature schemes are gaining attention due to their believed quantum resistance. Their signing and verification times are comparable to those of algorithms in use today, but their key generation time is orders of magnitude greater. To speed up key generation, this paper introduces and analyzes two parallel MIMD (Multiple Instruction Multiple Data) implementations of the hash-based signature scheme XMSS (eXtended Merkle Signature Scheme).
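To illustrate why key generation in a Merkle-tree scheme lends itself to parallel execution, the sketch below builds a plain binary hash tree whose leaves are computed independently. It is a simplified stand-in and not XMSS: it omits the one-time signature keys, L-trees, and bitmasks of the real scheme, and the names `leaf_hash`, `merkle_root`, and `generate_root` are hypothetical. It also does not reproduce either of the paper's two implementations.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def leaf_hash(index: int) -> bytes:
    # Placeholder for the expensive per-leaf work. In a real scheme this
    # would derive a one-time public key; here it is a single SHA-256 call.
    return hashlib.sha256(b"leaf" + index.to_bytes(4, "big")).digest()


def merkle_root(leaves: Sequence[bytes]) -> bytes:
    """Reduce a power-of-two number of leaf hashes to a single root hash."""
    level: List[bytes] = list(leaves)
    if not level or len(level) & (len(level) - 1):
        raise ValueError("number of leaves must be a positive power of two")
    while len(level) > 1:
        # Each parent is the hash of its two children, concatenated.
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


def generate_root(height: int, workers: int = 4) -> bytes:
    """Compute all 2**height leaves with a worker pool, then reduce them."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Leaves do not depend on one another, so they can be mapped freely.
        leaves = list(pool.map(leaf_hash, range(1 << height)))
    return merkle_root(leaves)


if __name__ == "__main__":
    print(generate_root(height=4).hex())
```

The thread pool only shows the structure of the decomposition. In CPython, threads give little real speed-up for small hash inputs like these, so an actual implementation would distribute the leaf computations across processes or nodes.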

Anais da IX Escola Regional de Alto Desempenho de São Paulo (ERAD-SP 2018), 2018

Due to the growing need to protect information, new cryptographic algorithms are being developed, such as algorithms based on random reversible networks. The performance of MPI implementations of these algorithms is dominated by the amount of communication between processes, a consequence of the networks' connection structure. This work investigates how to use one-sided communication and derived datatypes to accelerate such implementations.

Anais da IX Escola Regional de Alto Desempenho de São Paulo (ERAD-SP 2018), 2018

The regional meteorological model BRAMS runs operationally at CPTEC/INPE on a supercomputer composed of nodes with multicore processors. It is parallelized with the MPI message-passing library: the model domain is divided among nodes and also within each node, generally using conventional two-sided, asynchronous, non-blocking communication. However, the recent version 3.0 of MPI provides a new one-sided shared-memory communication to optimize communication between processes running on the same compute node. This work evaluates the communication performance of this new functionality in the parallel execution of the BRAMS model.

Many software mechanisms for exploration geophysics in the Oil & Gas industry are based on wave propagation simulation. To execute such simulations, state-of-the-art HPC architectures are employed, generating results faster and with more accuracy at each generation. To keep performance scaling, the software must evolve to support the features of the new architectures. Furthermore, it is important to understand the impact of such changes to the software in order to improve performance as much as possible. In this paper, we propose several optimization strategies for a wave propagation model on two manycore systems: the Intel Xeon and Intel Xeon Phi processors. We analyze the hardware impact of the optimizations, providing insights into how each strategy improves performance. Performance was improved by up to 22.6x on Xeon and 116x on Xeon Phi.

2017 29th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 2017

Elliptic curve asymmetric cryptography has achieved increased popularity due to its capability of providing levels of security comparable to other existing cryptographic systems while requiring less computational work. Pollard Rho and Parallel Collision Search, the fastest known sequential and parallel algorithms for breaking this cryptographic system, have been successfully applied over time to break system instances of ever-increasing bit length, using implementations heavily optimized for the available hardware. This work presents portable, general implementations of a Parallel Collision Search based solution for prime elliptic curve asymmetric cryptographic systems that use publicly available big integer libraries and make no assumption on prime curve properties. It investigates which key bit lengths can be broken in reasonable time by a user who has access to state-of-the-art public HPC equipment with CPUs and GPUs. The final implementation breaks a 79-bit system in about two h...

This article explores the impact of code optimizations, with a focus on the memory hierarchy, on two GPGPUs from distinct generations. The results are compared and explained in light of the changes in the memory hierarchy between generations. * This work was partially funded by Cooperation Agreement 0050.0102253.16.9 between Petrobras and the Universidade Federal do Rio Grande do Sul.

2018 Symposium on High Performance Computing Systems (WSCAD), 2018

Oil and gas have been among the most important commodities for over a century. To improve their extraction, companies invest in new technology, which reduces extraction cost and allows new areas to be explored. Computing science has also been employed to support advances in oil and gas extraction technologies. Techniques such as computer simulation can be used to evaluate scenarios more quickly and at a lower cost. Several mathematical models that simulate oil and gas extraction are based on wave propagation. To simulate with high performance, the software must be written considering the characteristics of the underlying hardware. In this context, our work shows how thread and data mapping policies can improve the performance of a wave propagation model provided by Petrobras, a multinational corporation in the petroleum industry. In our experiments, smart mapping policies reduced the execution time by up to 48.6% on Intel's multi-core Xeon.

Journal of Computational Science, 2021

We discuss new developments of a hybrid parallel iterative sparse linear solver framework focused on petroleum reservoir flow and geomechanical simulation. It runs efficiently on several platforms, from desktop workstations to clusters of multicore nodes, with or without multiple GPUs, using a two-tier hierarchical architecture for distributed matrices and vectors. Results show good parallel scalability. Comparisons with a well-established library and a proprietary commercial solver indicate that our solver is competitive with the best available tools. We present results of the solver's application to simulations of real and synthetic reservoir models of up to billions of unknowns, running on CPUs and GPUs on up to 2,000 processes.

Imaging and Sensing for Unmanned Aircraft Systems. Volume 1: Control and Performance, 2020
