Antonio Gonzalez - Academia.edu (original) (raw)
Uploads
Papers by Antonio Gonzalez
The effectiveness of the memory hierarchy is critical for the performance of current processors. ... more The effectiveness of the memory hierarchy is critical for the performance of current processors. The performance of the memory hierarchy can be improved by means of program transformations such as loop tiling, which is a code transformation targeted to reduce capacity misses. This paper presents a novel systematic approach to perform nearoptimal loop tiling based on an accurate data locality analysis (Cache Miss Equations) and a powerful technique to search the solution space that is based on a genetic algorithm. The results show that this approach can remove practically all capacity misses for all considered benchmarks. The reduction of replacement misses results in a decrease of the miss ratio that can be as significant as a factor of 7 for the matrix multiply kernel. eplacement equations. Given a reference, replacement equations represent the interferences with any other reference. For each pair of references (© and © ), the following expression gives the condition that determines whether they are mapped onto the same cache set:
ABSTRACT Register files are becoming one of the critical components of current out-of-order proce... more ABSTRACT Register files are becoming one of the critical components of current out-of-order processors in terms of delay and power consumption, since their potential to exploit instruction-level parallelism is closely related to the size and number of ports of the register file. In conventional register-renaming schemes, register releasing is conservatively done only after the instruction that redefines the same register is committed. Instead, we propose a scheme that releases registers as soon as the processor knows that there will be no further use of them. We present two early releasing hardware implementations with different performance/complexity trade-offs. Detailed cycle-level simulations show either a significant speedup for a given register file size, or a reduction in register file size for a given performance level.
Narrow values that can be represented by less number of bits than the full machine width occur ve... more Narrow values that can be represented by less number of bits than the full machine width occur very frequently in programs. On the other hand, clustering mechanisms enable cost-and performance-effective scaling of processor back-end features. Those attributes can be combined synergistically to design special clusters operating on narrow values (a.k.a. Helper Cluster), potentially providing performance benefits.
The effectiveness of the memory hierarchy is critical for the performance of current processors. ... more The effectiveness of the memory hierarchy is critical for the performance of current processors. The performance of the memory hierarchy can be improved by means of program transformations such as loop tiling, which is a code transformation targeted to reduce capacity misses. This paper presents a novel systematic approach to perform nearoptimal loop tiling based on an accurate data locality analysis (Cache Miss Equations) and a powerful technique to search the solution space that is based on a genetic algorithm. The results show that this approach can remove practically all capacity misses for all considered benchmarks. The reduction of replacement misses results in a decrease of the miss ratio that can be as significant as a factor of 7 for the matrix multiply kernel. eplacement equations. Given a reference, replacement equations represent the interferences with any other reference. For each pair of references (© and © ), the following expression gives the condition that determines whether they are mapped onto the same cache set:
ABSTRACT Register files are becoming one of the critical components of current out-of-order proce... more ABSTRACT Register files are becoming one of the critical components of current out-of-order processors in terms of delay and power consumption, since their potential to exploit instruction-level parallelism is closely related to the size and number of ports of the register file. In conventional register-renaming schemes, register releasing is conservatively done only after the instruction that redefines the same register is committed. Instead, we propose a scheme that releases registers as soon as the processor knows that there will be no further use of them. We present two early releasing hardware implementations with different performance/complexity trade-offs. Detailed cycle-level simulations show either a significant speedup for a given register file size, or a reduction in register file size for a given performance level.
Narrow values that can be represented by less number of bits than the full machine width occur ve... more Narrow values that can be represented by less number of bits than the full machine width occur very frequently in programs. On the other hand, clustering mechanisms enable cost-and performance-effective scaling of processor back-end features. Those attributes can be combined synergistically to design special clusters operating on narrow values (a.k.a. Helper Cluster), potentially providing performance benefits.