About ParallelLoopMapper pass design issues

The current mapping strategy seems to assign hardware dimensions to loop indices based on their position rather than their trip counts:

For parallel loops with more than three dimensions at any mapping level, the fourth and later dimensions are always mapped to sequential execution, regardless of their trip counts.

If I have a loop structure like scf.parallel(1,1,1,512,512), the current implementation maps the three unit dimensions to the hardware dimensions and the last two large dimensions (512, 512) to sequential execution, which seems suboptimal.
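To make the positional policy concrete, here is a minimal, self-contained C++ sketch of the behavior I am describing. It is not the actual pass source; the names (`Processor`, `hardwareIdForDim`) are approximations I made up for illustration:

```cpp
#include <cstdio>
#include <vector>

// Hypothetical stand-in for the pass's hardware-id enum.
enum class Processor { BlockX, BlockY, BlockZ, Sequential };

// Position-based assignment: dimension i gets hardware id i, and
// everything past the third dimension falls through to Sequential,
// no matter how large its trip count is.
static Processor hardwareIdForDim(int dim) {
  switch (dim) {
  case 0: return Processor::BlockX;
  case 1: return Processor::BlockY;
  case 2: return Processor::BlockZ;
  default: return Processor::Sequential;
  }
}

int main() {
  // Trip counts of the scf.parallel(1,1,1,512,512) example.
  std::vector<long> tripCounts = {1, 1, 1, 512, 512};
  const char *names[] = {"BlockX", "BlockY", "BlockZ", "Sequential"};
  for (int i = 0; i < (int)tripCounts.size(); ++i)
    std::printf("dim %d (trip count %ld) -> %s\n", i, tripCounts[i],
                names[(int)hardwareIdForDim(i)]);
  // The two 512-trip dimensions land on Sequential while the
  // unit-trip dimensions consume BlockX/BlockY/BlockZ.
  return 0;
}
```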

A simple idea would be to rank the dimensions by trip count and map the dimensions with the largest trip counts to the hardware dimensions (sketched below). But why wasn't this approach implemented? Are there other concerns that prevented it?
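Here is a minimal sketch of that idea, deliberately abstracted away from the MLIR API (a real implementation would have to recover constant trip counts from the scf.parallel bounds and steps, and fall back to the positional policy when they are dynamic):

```cpp
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

enum class Processor { BlockX, BlockY, BlockZ, Sequential };

// Rank dimensions by descending trip count and hand X, Y, Z to the
// three largest ones; everything else becomes Sequential.
std::vector<Processor> mapByTripCount(const std::vector<long> &tripCounts) {
  std::vector<int> order(tripCounts.size());
  std::iota(order.begin(), order.end(), 0);
  std::stable_sort(order.begin(), order.end(), [&](int a, int b) {
    return tripCounts[a] > tripCounts[b];
  });
  const Processor ids[] = {Processor::BlockX, Processor::BlockY,
                           Processor::BlockZ};
  std::vector<Processor> mapping(tripCounts.size(), Processor::Sequential);
  for (int rank = 0; rank < (int)order.size() && rank < 3; ++rank)
    mapping[order[rank]] = ids[rank];
  return mapping;
}

int main() {
  const char *names[] = {"BlockX", "BlockY", "BlockZ", "Sequential"};
  auto mapping = mapByTripCount({1, 1, 1, 512, 512});
  for (int i = 0; i < (int)mapping.size(); ++i)
    std::printf("dim %d -> %s\n", i, names[(int)mapping[i]]);
  // Now the two 512-trip dimensions get BlockX/BlockY, and the
  // unit-trip dimensions become Sequential, where a trip count
  // of 1 costs nothing anyway.
  return 0;
}
```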

Additionally, the TODO item in the code mentions that, ideally, the innermost distributed loop should be mapped to the X dimension, the next innermost to the Y dimension, and so on. (My understanding of the rationale: the X dimension is typically used for contiguous access, and contiguous access allows the threads of a warp to merge their accesses into fewer memory transactions, i.e., memory coalescing. Since the innermost loop usually performs the most frequent memory accesses, mapping it to X can significantly improve memory bandwidth utilization and overall performance. Is that right?)

Regarding the implementation, it seems one would only need to identify which index of the tensor being processed by the parallel loop corresponds to the innermost dimension and which to the next inner one in order to create a better mapping; a sketch of this reversed assignment follows below. This seems relatively straightforward, yet it remains a TODO item. (Sorry, I'm a novice; am I thinking about this too simplistically?)
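And here is a sketch of the reversed, innermost-first assignment from the TODO, under the same simplifying assumptions, additionally assuming the last induction variable of the scf.parallel corresponds to the innermost (fastest-varying) dimension:

```cpp
#include <cstdio>
#include <vector>

enum class Processor { BlockX, BlockY, BlockZ, Sequential };

// Walk from the innermost (last) dimension outward, handing out
// X, Y, Z in that order; remaining outer dimensions go Sequential.
std::vector<Processor> mapInnermostToX(int numDims) {
  const Processor ids[] = {Processor::BlockX, Processor::BlockY,
                           Processor::BlockZ};
  std::vector<Processor> mapping(numDims, Processor::Sequential);
  for (int k = 0; k < numDims && k < 3; ++k)
    mapping[numDims - 1 - k] = ids[k];
  return mapping;
}

int main() {
  const char *names[] = {"BlockX", "BlockY", "BlockZ", "Sequential"};
  auto mapping = mapInnermostToX(5);
  for (int i = 0; i < (int)mapping.size(); ++i)
    std::printf("dim %d -> %s\n", i, names[(int)mapping[i]]);
  // Dimensions 4, 3, 2 (innermost first) get X, Y, Z;
  // dimensions 0 and 1 go Sequential.
  return 0;
}
```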

I’m eager to learn how to optimize this mapping logic and would appreciate any advice you could offer.