Xing Cai | University of Oslo
Papers by Xing Cai
In this paper, we discuss the implementation and performance of m2cpp: an automated translator from MATLAB code to its matching Armadillo counterpart in the C++ language. A non-invasive strategy has been adopted, meaning that the user of m2cpp does not insert annotations or additional code lines into the input serial MATLAB code. Instead, a combination of code analysis, automated preprocessing and a user-editable metainfo file ensures that m2cpp overcomes some specialties of the MATLAB language, such as implicit typing of variables and multiple return values from functions. Thread-based parallelisation, using either OpenMP or Intel’s Threading Building Blocks (TBB) library, can also be carried out by m2cpp for designated for-loops. Such an automated and non-invasive strategy allows maintaining an independent MATLAB code base that is favoured by algorithm developers, while an updated translation into the easily readable C++ counterpart can be obtained at any time. Illustrating exampl...
Frontiers in Physics, 2021
The EMI model represents excitable cells in a more accurate manner than traditional homogenized models at the price of increased computational complexity. The increased complexity of solving the EMI model stems from a significant increase in the number of computational nodes and from the form of the linear systems that need to be solved. Here, we will show that the latter problem can be solved by careful use of operator splitting of the spatially coupled equations. By using this method, the linear systems can be broken into sub-problems that are of the classical type of linear, elliptic boundary value problems. Therefore, the vast collection of methods for solving linear, elliptic partial differential equations can be used. We demonstrate that this enables us to solve the systems using shared-memory parallel computers. The computing time scales perfectly with the number of physical cells. For a collection of 512×256 cells, we solved linear systems with about 2.5×10⁸ unknowns. Since the...
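The operator-splitting idea in this abstract can be illustrated in miniature. The Python sketch below (our illustration, not the paper's code) alternates an explicit reaction step with an implicit diffusion step; the latter is exactly the kind of linear, elliptic solve the abstract refers to. The grid size, time step, and reaction term are made up for illustration.

```python
import numpy as np

# Illustrative first-order (Godunov) splitting for du/dt = D*u_xx + f(u):
# step 1 advances the reaction ODE, step 2 is a linear elliptic solve.

def split_step(u, dt, A, f):
    """One splitting step: A is the discrete diffusion operator,
    f is the pointwise reaction term."""
    # Step 1: advance the ODE part explicitly (forward Euler).
    u_star = u + dt * f(u)
    # Step 2: advance diffusion implicitly (backward Euler):
    #   (I - dt*A) u_new = u_star  -- a classical elliptic problem.
    n = len(u)
    return np.linalg.solve(np.eye(n) - dt * A, u_star)

# Tiny demo: 5 interior nodes, homogeneous Dirichlet boundaries.
n, h, dt = 5, 0.2, 0.01
A = (np.diag(-2.0 * np.ones(n)) + np.diag(np.ones(n - 1), 1)
     + np.diag(np.ones(n - 1), -1)) / h**2
u = np.ones(n)
u = split_step(u, dt, A, lambda v: -v)  # linear "reaction" f(u) = -u
```

Because each sub-problem is standard, the implicit step can be handed to any off-the-shelf elliptic solver, which is the flexibility the abstract emphasizes.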
Scientific Programming, 2019
The Unified Parallel C (UPC) programming language offers parallelism via logically partitioned shared memory, which typically spans physically disjoint memory subsystems. One convenient feature of UPC is its ability to automatically execute between-thread data movement, such that the entire content of a shared data array appears to be freely accessible by all the threads. The programmer friendliness, however, can come at the cost of substantial performance penalties. This is especially true when indirectly indexing the elements of a shared array, for which the induced between-thread data communication can be irregular and have a fine-grained pattern. In this paper, we study performance enhancement strategies specifically targeting such fine-grained irregular communication in UPC. Starting from explicit thread privatization, continuing with block-wise communication, and arriving at message condensing and consolidation, we obtained considerable performance improvement of UPC programs ...
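The message condensing and consolidation strategy mentioned above can be sketched in a language-neutral way (in Python, not UPC): instead of fetching shared-array elements one by one, group the indirect indices by their owning thread and issue one batched request per thread. The block-cyclic layout and all names below are illustrative assumptions.

```python
from collections import defaultdict

def owner(global_index, block_size, num_threads):
    """Owning thread under a block-cyclic distribution (UPC-style layout)."""
    return (global_index // block_size) % num_threads

def consolidate(indices, block_size, num_threads):
    """Map each owning thread to the sorted, de-duplicated indices it owns,
    so one bulk transfer per thread replaces many single-element fetches."""
    per_thread = defaultdict(set)
    for g in indices:
        per_thread[owner(g, block_size, num_threads)].add(g)
    return {t: sorted(ix) for t, ix in per_thread.items()}

# Eight irregular accesses collapse into at most num_threads messages.
requests = consolidate([17, 3, 42, 3, 18, 99, 40, 4],
                       block_size=8, num_threads=4)
```

The payoff is the same as in the paper's progression: fewer, larger messages turn a fine-grained irregular pattern into coarse-grained bulk communication.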
The bidomain model of electrophysiology consists of a 2×2 system of partial differential equations (PDEs) coupled nonlinearly to a system of ordinary differential equations (ODEs). Numerical algorithms for this model are normally based on an operator ...
Frontiers in Physics
A new trend in processor architecture design is the packaging of thousands of small processor cores into a single device, where there is no device-level shared memory but each core has its own local memory. Thus, both the work and data of an application code need to be carefully distributed among the small cores, also termed tiles. In this paper, we investigate how numerical computations that involve unstructured meshes can be efficiently parallelized and executed on a massively tiled architecture. Graphcore IPUs are chosen as the target hardware platform, to which we port an existing monodomain solver that simulates cardiac electrophysiology over realistic 3D irregular heart geometries. There are two computational kernels in this simulator, where a 3D diffusion equation is discretized over an unstructured mesh and numerically approximated by repeatedly executing sparse matrix-vector multiplications (SpMVs), whereas an individual system of ordinary differential equations (ODEs) i...
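The SpMV kernel named in this abstract is the standard compressed-sparse-row product. The following Python sketch shows the reference computation (not the IPU port, whose data distribution across tiles is the paper's contribution); the small matrix is made up for illustration.

```python
import numpy as np

def spmv_csr(row_ptr, col_idx, values, x):
    """y = A @ x for A stored in compressed sparse row (CSR) format:
    row i owns the entries values[row_ptr[i]:row_ptr[i+1]]."""
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y

# A = [[2, 0, 1],
#      [0, 3, 0],
#      [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values  = [2.0, 1.0, 3.0, 4.0, 5.0]
y = spmv_csr(row_ptr, col_idx, values, [1.0, 1.0, 1.0])  # -> [3.0, 3.0, 9.0]
```

On a tiled architecture without device-level shared memory, the rows (and the parts of x they reference) must be partitioned across tiles, which is what makes the unstructured-mesh case challenging.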
IEEE Transactions on Parallel and Distributed Systems
The network topology of modern parallel computing systems is inherently heterogeneous, with a variety of latency and bandwidth values. Moreover, contention for the bandwidth can exist on different levels when many processes communicate with each other. Many-pair, point-to-point MPI communication is thus characterized by heterogeneity and contention, even on a cluster of homogeneous multicore CPU nodes. To get a detailed understanding of the individual communication cost per MPI process, we propose a new modeling methodology that incorporates both heterogeneity and contention. First, we improve the standard max-rate model to better quantify the actually achievable bandwidth depending on the number of MPI processes in competition. Then, we make a further extension that models the bandwidth contention in more detail when the competing MPI processes have different numbers of neighbors, as well as non-uniform message sizes. Thereafter, we include more flexibility by considering interactions between intra-socket and inter-socket messaging. Through a series of experiments done on different processor architectures, we show that the new heterogeneous and contention-constrained performance models can adequately explain the individual communication cost associated with each MPI process. The largest test of realistic point-to-point MPI communication involves 8,192 processes and in total 2,744,632 simultaneous messages over 64 dual-socket AMD Epyc Rome compute nodes connected by InfiniBand, for which the overall prediction accuracy achieved is 84%.
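A common reading of the standard max-rate model that the paper refines can be sketched as follows: with k processes competing for a link, the per-process bandwidth is capped both by the single-process injection rate and by an equal share of the link's saturation rate. The model form is our paraphrase and every parameter value below is purely illustrative.

```python
def max_rate_time(n_bytes, k, latency, r_single, r_max):
    """Predicted time for each of k competing processes to send n_bytes:
    per-process rate is the smaller of its own injection rate (r_single)
    and an equal share of the saturated link rate (r_max / k)."""
    per_process_rate = min(r_single, r_max / k)
    return latency + n_bytes / per_process_rate

# Illustrative numbers: 1 us latency, 10 GB/s per process, 40 GB/s link.
t1 = max_rate_time(1e6, 1, 1e-6, 10e9, 40e9)  # one sender, uncontended
t8 = max_rate_time(1e6, 8, 1e-6, 10e9, 40e9)  # eight senders share the link
```

The paper's extensions go beyond this symmetric picture, distinguishing neighbors per process, non-uniform message sizes, and intra- versus inter-socket traffic.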
Modeling Excitable Tissue, 2020
We want to be able to perform accurate simulations of a large number of cardiac cells based on mathematical models where each individual cell is represented in the model. This implies that the computational mesh has to have a typical resolution of a few µm, leading to huge computational challenges. In this paper we use a certain operator splitting of the coupled equations and show that this leads to systems that can be solved in parallel. This opens up the possibility of simulating large numbers of coupled cardiac cells.
Lecture Notes in Computational Science and Engineering, 2003
Overlapping domain decomposition methods are efficient and flexible. It is also important that such methods are inherently suitable for parallel computing. In this chapter, we will first explain the mathematical formulation and algorithmic composition of the overlapping domain decomposition methods. Afterwards, we will focus on a generic implementation framework and its applications within Diffpack.
Monographs in Computational Science and Engineering, 2006
As described in the previous chapter, the human body consists of billions of cells, which may be connected by various coupling mechanisms depending on the type of tissue under consideration. When constructing mathematical models for electrical activity in the tissue, one possible approach would be to model each cell as a separate unit, and couple them together using mathematical models
Monographs in Computational Science and Engineering, 2006
The physical relevance of computations based on the model problems arising from the electrical activity in the heart depends on high accuracy of the solution. High accuracy requires the solution of large linear or nonlinear systems of ODEs and PDEs. This chapter deals with solution algorithms for the discretization of (linear) PDEs, which is a huge research field around the
Monographs in Computational Science and Engineering
The operator splitting algorithms introduced in Chapter 3 reduced the solution of the bidomain equations to solving linear PDE systems and nonlinear systems of ODEs. Techniques for discretizing the PDE system were presented in Chapter 3, while techniques for solving the resulting linear systems were discussed in Chapter 4. What remains to have a complete computational method for the bidomain
Monographs in Computational Science and Engineering, 2006
The mathematical models derived in the previous chapter give a quantitative description of the electrical activity in the heart, from the level of electrochemical reactions in the cells to body surface potentials that may be recorded as ECGs. However, the models are formulated as systems of nonlinear partial and ordinary differential equations, for which analytical solutions are not available. To be of any practical use, the equations of the models must therefore be solved with numerical methods. The choice of numerical methods that may be applied to the equations is large, see e.g. [83], but we have chosen to focus entirely on finite element methods (FEM). One reason for this is that the geometries of the heart and the body are irregular, and this is more conveniently handled by FEM than, for instance, by finite difference methods.
Our knowledge about the heart dates back more than two millennia. Already in the days of Aristotle (350 B.C.) the importance of the heart was recognized, and it was, in fact, considered to be the most important organ in the body. Other vital organs, such as the brain and lungs, were thought to exist merely to cool the blood. Over
Biomechanics and Modeling in Mechanobiology
Cardiomyocytes are the functional building blocks of the heart—yet most models developed to simulate cardiac mechanics do not represent the individual cells and their surrounding matrix. Instead, they work on a homogenized tissue level, assuming that cellular and subcellular structures and processes scale uniformly. Here we present a mathematical and numerical framework for exploring tissue-level cardiac mechanics on a microscale given an explicit three-dimensional geometrical representation of cells embedded in a matrix. We defined a mathematical model over such a geometry and parametrized our model using publicly available data from tissue stretching and shearing experiments. We then used the model to explore mechanical differences between the extracellular and the intracellular space. Through sensitivity analysis, we found the stiffness in the extracellular matrix to be most important for the intracellular stress values under contraction. Strain and stress values were observed to...
Monographs in Computational Science and Engineering
In the preceding chapters, we have discussed various numerical techniques for solving the different parts of our mathematical model problem. Now it is time to turn our attention to simulating the complete mathematical model. First, we will explain the diverse computational tasks that constitute an electrocardiac simulator. Then, we will estimate the computational resources needed to carry out high-resolution simulations.
Procedia Computer Science
This paper studies the CUDA programming challenges with using multiple GPUs inside a single machine to carry out plane-by-plane updates in parallel 3D sweeping algorithms. In particular, care must be taken to mask the overhead of various data movements between the GPUs. Multiple OpenMP threads on the CPU side should be combined with multiple CUDA streams per GPU to hide the data transfer cost related to the halo computation on each 2D plane. Moreover, the technique of peer-to-peer data motion can be used to reduce the impact of 3D volumetric data shuffles that have to be done between mandatory changes of the grid partitioning. We have investigated the performance improvement of 2- and 4-GPU implementations that are applicable to 3D anisotropic front propagation computations related to geological folding. In comparison with a straightforward multi-GPU implementation, the overall performance improvement due to masking of data movements on four GPUs of the Fermi architecture was 23%. The corresponding improvement obtained on four Kepler GPUs was 47%.