Measuring the Parallelism Available for Very Long Instruction Word Architectures

Long instruction word architectures, such as attached scientific processors and horizontally microcoded CPU's, are a popular means of obtaining code speedup via fine-grained parallelism. The falling cost of hardware holds out the hope of using these architectures for much more parallelism. But this hope has been diminished by experiments measuring how much parallelism is available in the code to start with. These experiments implied that even if we had infinite hardware, long instruction word architectures could not provide a speedup of more than a factor of 2 or 3 on real programs. These experiments measured only the parallelism within basic blocks. Given the machines that prompted them, it made no sense to measure anything else. Now it does. A recently developed code compaction technique, called trace scheduling [9], could exploit parallelism in operations even hundreds of blocks apart. Does such parallelism exist?

In this paper we show that it does. We did analogous experiments, but we disregarded basic block boundaries. We found huge amounts of parallelism available. Our measurements were made on standard Fortran programs in common use. The actual programs tested averaged about a factor of 90 parallelism. It ranged from about a factor of 4 to virtually unlimited amounts, restricted only by the size of the data.

An important question is how much of this parallelism can actually be found and used by a real code generator. In the experiments, an oracle is used to resolve dynamic questions at compile time. It tells us which way jumps went and whether indirect references are to the same or different locations. Trace scheduling attempts to get the effect of the oracle at compile time with static index analysis and dynamic estimates of jump probabilities. We argue that most scientific code is so static that the oracle is fairly realistic. A real trace-scheduling code generator [7] might very well be able to find and use much of this parallelism.

Index Terms-Memory antialiasing, microcode, multiprocessors, parallelism, trace scheduling, VLIW (very long instruction word) architectures.

I. INTRODUCTION

In this paper we describe experiments we have done to empirically measure the maximum parallelism available to very long instruction word (VLIW) architectures. The most familiar examples of VLIW architectures are horizontally microcoded CPU's and some very popular specialized scientific processors, such as the Floating Point Systems AP-120b and FPS-164. Very long instruction word architectures take advantage of fine-grained parallelism to speed up execution time.
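To make the measurement concrete, the following is a minimal sketch (not taken from the paper) of the kind of oracle experiment described above: each operation in a dynamic execution trace is placed in the earliest cycle permitted by its data dependences, branches are ignored because the oracle has already resolved them, and the available parallelism is the number of operations divided by the schedule length. The function name, the trace representation, and the assumption of single-cycle operations are illustrative choices, not details from the paper.

```python
# Sketch of an oracle-style parallelism measurement: schedule each executed
# operation at the earliest cycle allowed by the values it reads, ignoring
# branch outcomes (the oracle has already told us which way every jump went).

def oracle_parallelism(trace):
    """trace: list of (dests, sources) pairs, one per executed operation,
    where dests/sources are sets of resolved register or memory locations."""
    ready_cycle = {}        # location -> cycle in which its latest value is produced
    schedule_length = 0
    for dests, sources in trace:
        # Earliest cycle permitted by flow dependences alone (unit latency assumed).
        cycle = 1 + max((ready_cycle.get(s, 0) for s in sources), default=0)
        for d in dests:
            ready_cycle[d] = cycle
        schedule_length = max(schedule_length, cycle)
    return len(trace) / schedule_length if schedule_length else 0.0

# Example: three independent operations followed by one that consumes all
# their results -> 4 operations fit in 2 cycles, a parallelism of 2.0.
trace = [({"a"}, {"x"}), ({"b"}, {"y"}), ({"c"}, {"z"}), ({"d"}, {"a", "b", "c"})]
print(oracle_parallelism(trace))
```

This sketch tracks only true (flow) dependences, which corresponds to an idealized machine with unlimited functional units and enough renaming to remove anti- and output dependences; any realistic code generator would recover only part of the parallelism such a bound reports.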