Juan Carlos Moure | Universidad Autónoma de Barcelona
Papers by Juan Carlos Moure
ACM Transactions on Computing Education / ACM Journal of Educational Resources in Computing, 2002
Modern processors increase their performance with complex microarchitectural mechanisms, which makes them more and more difficult to understand and evaluate. KScalar is a user-friendly simulation tool that facilitates the study of such processors. It allows students to analyze the performance behavior of a wide range of processor microarchitectures: from a very simple in-order, scalar pipeline to a detailed out-of-order, superscalar pipeline with non-blocking caches, speculative execution, and complex branch prediction. The simulator interprets executables for the Alpha AXP instruction set, from very short program fragments to large applications. The object program's execution may be simulated at varying levels of detail: either cycle by cycle, observing all the pipeline events that determine processor performance, or millions of cycles at once, gathering statistics on the main performance issues.
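The kind of in-order, scalar pipeline that KScalar starts from can be illustrated with a toy cycle-counting model. This is a hypothetical sketch for intuition only, not KScalar itself; the single load-use stall cycle is an assumed latency.

```python
# Minimal in-order, scalar pipeline model (illustrative toy, not KScalar).
# Each instruction issues in one cycle; an instruction that consumes the
# result of the immediately preceding load pays one load-use stall cycle.

def simulate(instrs):
    """instrs: list of (op, dest, srcs) tuples. Returns total cycle count."""
    cycles = 0
    last_load_dest = None
    for op, dest, srcs in instrs:
        # Load-use hazard: the consumer must wait one extra cycle.
        if last_load_dest is not None and last_load_dest in srcs:
            cycles += 1  # stall bubble
        cycles += 1      # issue cycle
        last_load_dest = dest if op == "load" else None
    return cycles

program = [
    ("load", "r1", ()),            # r1 <- mem
    ("add",  "r2", ("r1", "r3")),  # depends on the load: 1 stall
    ("add",  "r4", ("r2", "r2")),  # no stall
]
print(simulate(program))  # 4 cycles: 3 issues + 1 load-use stall
```

A cycle-accurate simulator like KScalar models many more events (cache misses, branch mispredictions, structural hazards), but the core loop is the same: advance state, detect hazards, and account cycles.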
Lecture Notes in Computer Science, 2003
Juan C. Moure, Dolores I. Rexachs, and Emilio Luque — Computer Architecture and Operating Systems Group, Universidad Autónoma de Barcelona, 08193 Barcelona, Spain. {JuanCarlos.Moure, Dolores.Rexachs, Emilio.Luque}@uab.es
Lecture Notes in Computer Science, 2006
Adaptive processors can exploit the different characteristics exhibited by program phases better than fixed hardware can. However, they may also significantly degrade performance and/or energy consumption. In this paper, we describe a reconfigurable cache memory that is efficiently applied to the L1 data cache of an embedded general-purpose processor. We also propose a realistic hardware/software methodology for run-time tuning and reconfiguration of the cache, based on a pattern-matching algorithm, which is used to identify the cache configuration and processor frequency when the program's data working-set changes. Considering a design scenario driven by the best execution time×energy product, we show that the power dissipation and energy consumption of a two-level cache hierarchy, and the time×energy product, can be reduced on average by 39%, 38%, and 37%, respectively, compared with a non-adaptive embedded microarchitecture.
Numerical Treatment of Applications — course title given in Spanish (Tratamiento numérico de las Aplicaciones) and Catalan (Tractament Numèric de les Aplicacions).
2010 Design, Automation & Test in Europe Conference & Exhibition (DATE 2010), 2010
The optimal size of a large on-chip cache can be different for different programs: at some point, the reduction of cache misses achieved when increasing cache size hits diminishing returns, while the higher cache latency hurts performance. This paper presents the Amorphous Cache (AC), a reconfigurable L2 on-chip cache aimed at improving performance as well as reducing energy consumption. AC is composed of heterogeneous sub-caches, as opposed to common caches, which use homogeneous sub-caches. The sub-caches are turned off depending on the application workload to conserve power and minimize latencies. A novel reconfiguration algorithm based on Basic Block Vectors is proposed to recognize program phases, and a learning mechanism is used to select the appropriate cache configuration for each program phase. We compare our reconfigurable cache with existing proposals of adaptive and non-adaptive caches. Our results show that the combination of AC and the novel reconfiguration algorithm provides the best power consumption and performance. For example, on average, it reduces the cache access latency by 55.8%, the cache dynamic energy by 46.5%, and the cache leakage power by 49.3% with respect to a non-adaptive cache.
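The phase-detection idea the abstract describes — comparing Basic Block Vectors (BBVs) across execution intervals — can be sketched as follows. The interval contents, the Manhattan-distance metric, and the threshold value are illustrative assumptions, not the paper's parameters.

```python
# Sketch of program-phase detection with Basic Block Vectors: each interval
# is summarized by per-basic-block execution counts; a large distance between
# consecutive normalized vectors signals a phase change (and, in the paper's
# setting, a point where the cache could be reconfigured).

def normalize(bbv):
    total = sum(bbv) or 1
    return [x / total for x in bbv]

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def detect_phases(intervals, threshold=0.5):
    """intervals: per-interval basic-block execution counts.
    Returns the indices where a new phase starts."""
    changes = [0]
    current = normalize(intervals[0])
    for i, bbv in enumerate(intervals[1:], start=1):
        v = normalize(bbv)
        if manhattan(current, v) > threshold:  # phase change detected
            changes.append(i)
            current = v
    return changes

trace = [[90, 10, 0], [85, 15, 0], [5, 5, 90], [4, 6, 90]]
print(detect_phases(trace))  # [0, 2]: the third interval starts a new phase
```

A learning mechanism such as the one in the paper would then remember which cache configuration worked best the last time each phase was seen.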
Proceedings. Second Euromicro Workshop on Parallel and Distributed Processing, 1994
Abstract not available.
Proceedings Euromicro Symposium on Digital Systems Design, 2001
Multithreaded processors, by simultaneously exploiting both the thread-level parallelism and the instruction-level parallelism of applications, achieve a higher instructions-per-cycle rate than single-thread processors. On a multi-thread workload, a clustered organization maximizes performance. On a single-thread workload, however, all but one of the clusters are idle, degrading single-thread performance significantly. Using clustered multi-thread performance as a baseline, we propose and analyze several mechanisms and policies to improve single-thread execution by exploiting the existing hardware without a significant multi-thread performance loss. We focus on the fetch unit, which is perhaps the most performance-critical stage. Essentially, we analyze three ways of exploiting the idle fetch clusters: allowing a single thread to access its neighbor clusters, using the idle fetch clusters to provide multiple-path execution, or using them to widen the effective single-thread fetch block.
Proceedings of the 3rd Conference on Computing Frontiers 2006, CF '06, 2006
dbenitez@dis.ulpgc.es; Juan C. Moure, Dolores I. Rexachs, Emilio Luque — Computer Architecture and Operating Systems Department, Universidad Autónoma de Barcelona, 08193 Barcelona, Spain. {JuanCarlos.Moure, Dolores.Rexachs, Emilio.Luque}@uab.es
Proceedings of the International Conference on Supercomputing, 2006
Dolores I. Rexachs (dolores.rexachs@uab.es) and Emilio Luque (emilio.luque@uab.es) — Universidad Autónoma de Barcelona, 08193 Barcelona, Spain. Abstract: High prediction bandwidth enables performance improvements and power reduction techniques.
Abstract. An open question in chip multiprocessors is how to organize large on-chip cache resources. Its answer must consider hit/miss latencies, energy consumption, and power dissipation. To handle this diversity of metrics, we propose the Amorphous Cache, an ...
Proceedings of the 2009 IEEE International Symposium on Workload Characterization, IISWC 2009, 2009
Procedia Computer Science, 2013
Fast pattern matching is a requirement for many problems, especially for bioinformatics sequence analysis such as short-read mapping applications. This work presents a variation of the FM-index method, denoted the n-step FM-index, which is applied to exact-match genome search.
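For intuition, the classic 1-step FM-index backward search that the n-step variant generalizes can be sketched as follows. This is the textbook algorithm (with a naive O(n) rank function for brevity), not the paper's implementation.

```python
# FM-index backward search for exact matching: build the Burrows-Wheeler
# transform of the text, then narrow a suffix-array range one pattern
# character at a time, right to left, using the C table and rank queries.

def bwt(text):
    text += "$"  # unique sentinel, lexicographically smallest
    rots = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(r[-1] for r in rots)

def fm_count(text, pattern):
    """Number of occurrences of pattern in text, via backward search."""
    L = bwt(text)
    # C[c]: number of characters in the text strictly smaller than c.
    C, total = {}, 0
    for c in sorted(set(L)):
        C[c] = total
        total += L.count(c)

    def rank(c, i):  # occurrences of c in L[:i] (naive; FM-indexes sample this)
        return L[:i].count(c)

    lo, hi = 0, len(L)  # current suffix-array range [lo, hi)
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + rank(c, lo)
        hi = C[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo

print(fm_count("abracadabra", "abra"))  # 2
```

An n-step variant processes several pattern characters per iteration, trading index size for fewer rank queries; the details of that trade-off are what the paper studies.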
Procedia Computer Science, 2013
MapReduce simplifies parallel programming by abstracting away programmer responsibilities such as synchronization and task management. The paradigm allows the programmer to write sequential code that is automatically parallelized. The MapReduce frameworks available today are designed for situations where all keys generated by the Map phase fit into main memory. However, certain types of workload have key distributions that cause intermediate data structures to grow beyond the available main memory. Based on the behavior of MapReduce frameworks on multi-core architectures for these types of workload, we propose an extension of the original MapReduce strategy for multi-core architectures. We present an extension across the memory hierarchy, spanning main memory and hard disk, whose objective is to reduce main-memory usage as well as the page faults caused by swapping. The main goal of our extension is to ensure acceptable MapReduce performance when intermediate data structures do not fit in main memory and it is necessary to use secondary storage.
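The spill-to-disk idea can be sketched as a map runner that flushes its in-memory key buffer to temporary files when a (toy) memory budget is exceeded, then merges the on-disk runs at reduce time. The class and method names here are invented for illustration; the framework extension the abstract describes is more elaborate.

```python
# Illustrative spill-to-disk map runner: bounded in-memory buffer, on-disk
# runs, merge at reduce time. The "memory budget" is a toy key-count limit.

import os
import pickle
import tempfile
from collections import defaultdict

class SpillingMapRunner:
    def __init__(self, max_keys=2):
        self.max_keys = max_keys        # toy memory budget: distinct keys held
        self.buffer = defaultdict(list)
        self.spill_files = []

    def emit(self, key, value):
        self.buffer[key].append(value)
        if len(self.buffer) > self.max_keys:
            self._spill()

    def _spill(self):
        f = tempfile.NamedTemporaryFile(delete=False)
        pickle.dump(dict(self.buffer), f)
        f.close()
        self.spill_files.append(f.name)
        self.buffer = defaultdict(list)

    def reduce(self, fn):
        merged = defaultdict(list)
        for name in self.spill_files:      # merge the on-disk runs...
            with open(name, "rb") as f:
                for k, vs in pickle.load(f).items():
                    merged[k].extend(vs)
            os.unlink(name)
        for k, vs in self.buffer.items():  # ...plus the in-memory remainder
            merged[k].extend(vs)
        return {k: fn(vs) for k, vs in merged.items()}

runner = SpillingMapRunner(max_keys=2)
for word in "a b c a b a".split():
    runner.emit(word, 1)
print(runner.reduce(sum))  # word counts: 'a' -> 3, 'b' -> 2, 'c' -> 1
```

A production framework would additionally sort the spilled runs and stream-merge them so the reduce phase itself stays within the memory budget.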
Procedia Computer Science, 2010
Neuroinformatics, 2014
A scheme to significantly speed up the processing of MRI with FreeSurfer (FS) is presented. The scheme is aimed at maximizing productivity (the number of subjects processed per unit time) for the use case of research projects with datasets involving many acquisitions. The scheme combines the already existing GPU-accelerated version of the FS workflow with a task-level parallel scheme supervised by a resource scheduler. This allows for optimum utilization of the computational power of a given hardware platform while avoiding shortages of platform resources. The scheme can be executed on a wide variety of platforms, as its implementation only involves the script that orchestrates the execution of the workflow components; the FS code itself requires no modifications. The scheme has been implemented and tested on a commodity platform within the reach of most research groups (a personal computer with four cores and an NVIDIA GeForce 480 GTX graphics card). Using the scheduled task-level parallel scheme, a productivity above 0.6 subjects per hour is achieved on the test platform, corresponding to a speedup of over six times compared to the default CPU-only serial FS workflow.
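The task-level parallel scheme can be sketched as a bounded-worker scheduler that keeps at most a fixed number of per-subject jobs in flight, so cores and GPU memory are not oversubscribed. `process_subject` is a hypothetical stand-in for the real FreeSurfer invocation.

```python
# Bounded-worker, task-level parallelism over subjects: the executor acts as
# a simple resource scheduler, capping the number of concurrent jobs.

from concurrent.futures import ThreadPoolExecutor, as_completed

def process_subject(subject_id):
    # Placeholder for the real per-subject workflow, e.g.
    # subprocess.run(["recon-all", "-s", subject_id, "-all"])
    return f"{subject_id}: done"

def run_study(subjects, max_workers=4):
    """Run the per-subject workflow with at most max_workers jobs at once."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_subject, s): s for s in subjects}
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()
    return results

print(run_study([f"subj{i:02d}" for i in range(8)]))
```

Because the orchestration lives entirely in the driver script, the workflow code itself needs no modification — the same property the abstract highlights for the FS scheme.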
The Journal of Supercomputing, 2012
The map-reduce paradigm has proven to be a simple and feasible way of filtering and analyzing large data sets in cloud and cluster systems. Algorithms designed for the paradigm must implement regular data distribution patterns so that appropriate use of resources is ensured. Good scalability and performance in Map-Reduce applications greatly depend on the design of regular intermediate data generation-consumption patterns at the map and reduce phases. We describe the data distribution patterns found in current Map-Reduce read-mapping bioinformatics applications and present data decomposition principles that greatly improve their scalability and performance.
Future Generation Computer Systems, 1994
Despite the availability of parallel computing over the last two decades, there is little use of these systems in production-level environments. One of the factors most commonly blamed for the slow transition to parallelism is the lack of software support. While in serial programming the performance depends basically on the algorithm designed by the user, in parallel programming there are many machine-dependent aspects that have a significant impact on the final performance. Therefore, the user must learn a great deal about machine-dependent aspects such as process grain determination, task allocation, message routing, etc. This project aims at the design and implementation of a user-friendly environment for parallel programming on a Transputer-based system. The environment frees the user from all machine-dependent aspects by means of two system services running on a host computer and a distributed kernel running on the target computer. On the one hand, the two system services consist of one tool that is responsible for obtaining clusters of program tasks and another tool that is responsible for mapping those clusters onto physical processors. On the other hand, the environment has a distributed kernel running on the target machine that hides the physical architecture. It executes user tasks with location transparency, offering a simple and efficient interface for interprocess communication, and monitors the execution of the user program. As a consequence, user productivity is increased because the user does not need to be aware of those aspects that are carried out automatically by the environment.