SnuMAP: an Open-Source Trace Profiler for Manycore Systems

Efficient, Unified, and Scalable Performance Monitoring for Multiprocessor Operating Systems

Proceedings of the 2003 ACM/IEEE Conference on Supercomputing (SC '03), 2003

Programming, understanding, and tuning the performance of large multiprocessor systems is challenging. Experts have difficulty achieving good utilization for applications on large machines, and implementing a scalable system such as an operating system or database on such machines is more challenging still. The importance of achieving good performance on multiprocessor machines grows as the number of cores per chip and the overall size of multiprocessors increase. Crucial to achieving good performance is being able to understand the behavior of the system.

Collecting whole-system reference traces of multiprogrammed and multithreaded workloads

ACM SIGSOFT Software Engineering Notes, 2004

The simulated evaluation of memory management policies relies on reference traces: logs of memory operations performed by running processes. No existing approach to reference trace collection is applicable to a complete system, including the kernel and all processes. Specifically, none gathers sufficient information for simulating the virtual memory management, the filesystem cache management, and the scheduling of a multiprogrammed, multithreaded workload. Existing trace collectors are also difficult to maintain and to port, making them partially or wholly unusable on many modern systems. We present Laplace, a trace collector that can log every memory reference a system performs. Laplace is implemented through modifications to a simulated processor and an operating system kernel. Because it collects references at the simulated CPU layer, Laplace produces traces that are complete and minimally distorted. Its modified kernel also logs selected events that a post-processor uses to associate every virtual memory and filesystem reference with its thread. It also uses this information to reconcile all uses of shared memory, a task that is more complex than previously believed. Laplace can trace a workload without modifications to any source, library, or executable file. Finally, Laplace is relatively easy to maintain and port to different architectures and kernels.
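The post-processing step described above, attributing each raw reference to the thread that was running when it occurred, can be illustrated with a small sketch. The function name and record layouts here are hypothetical, not Laplace's actual formats: the idea is to join the reference stream with the kernel's context-switch log, per CPU.

```python
from bisect import bisect_right

def attribute_references(refs, switches):
    """Attribute each memory reference (time, cpu, addr) to the thread
    that the context-switch log (time, cpu, tid) says was running on
    that CPU at that moment."""
    # Build, per CPU, a time-sorted list of (time, tid) switch events.
    per_cpu = {}
    for t, cpu, tid in sorted(switches):
        per_cpu.setdefault(cpu, []).append((t, tid))
    out = []
    for t, cpu, addr in refs:
        events = per_cpu.get(cpu, [])
        # Find the last switch on this CPU at or before the reference.
        i = bisect_right(events, (t, float("inf"))) - 1
        tid = events[i][1] if i >= 0 else None
        out.append((tid, addr))
    return out
```

With a switch log `[(0, 0, 100), (5, 0, 200)]`, a reference on CPU 0 at time 1 is charged to thread 100 and one at time 6 to thread 200.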

Core monitors: monitoring performance in multicore processors

2009

As we reach the limits of single-core computing, we are promised more and more cores in our systems. Modern architectures include many performance counters per core, but few or no intercore counters. In fact, performance counters were not designed to be exploited by users, as they now are, but simply as aids for hardware debugging and testing during system creation. As such, they tend to be an "afterthought" in the design, with no standardization across or within platforms. Nonetheless, given access to these counters, researchers are using them to great advantage [17]. Furthermore, evaluating counters for multicore systems has become a complex and resource-consuming task. We propose a Performance Monitoring System consisting of a specialized CPU core designed to allow efficient collection and evaluation of performance data for both static and dynamic optimizations. Our system provides a transparent mechanism to change architectural features dynamically, inform the operating system of process behaviors, and assist in profiling and debugging. For instance, a piece of hardware watching snoop packets can determine when a write-update cache coherence protocol would be helpful or detrimental to the currently running program. Our system is designed to allow the hardware to feed performance statistics back to software, allowing dynamic architectural adjustments at runtime.

A framework for analyzing linux system overheads on hpc applications

2005

Linux currently plays an important role in high-end computing systems, but recent work has shown that Linux-related processing costs and variability in network processing times can limit the scalability of HPC applications. Measuring and understanding these overheads is thus key for future use of Linux in large-scale HPC systems. Unfortunately, currently available performance monitoring systems introduce large overheads, performance data is generally not available on-line or from the operating system, and the data collected by such systems is generally coarse-grained. In this paper, we present a low-overhead framework for solving one of these problems: making useful operating system performance data available to the application at runtime. Specifically, we have enhanced the Linux Trace Toolkit (LTT) to monitor the performance characteristics of individual system calls and to make per-request performance data available to the application. We demonstrate the ability of this framework to monitor individual network and disk requests, and show that the overhead of our per-request performance monitoring framework is minimal. We also present preliminary measurements of Linux system call overhead on a simple HPC application.
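As a rough illustration of per-request monitoring (a user-level toy, not the paper's kernel-side LTT instrumentation; the class and method names are invented), one can wrap individual system calls and record the latency of each invocation separately, rather than only an aggregate:

```python
import os
import time
from collections import defaultdict

class SyscallTimer:
    """Record one latency sample per individual call, not just totals."""
    def __init__(self):
        self.samples = defaultdict(list)  # name -> [latency_ns, ...]

    def timed(self, name, fn, *args):
        t0 = time.perf_counter_ns()
        result = fn(*args)
        self.samples[name].append(time.perf_counter_ns() - t0)
        return result

timer = SyscallTimer()
r, w = os.pipe()
timer.timed("write", os.write, w, b"x" * 4096)   # one timed write request
data = timer.timed("read", os.read, r, 4096)     # one timed read request
os.close(r)
os.close(w)
```

Each entry in `timer.samples` is the latency of one request, so an application can inspect the distribution (tail latencies, outliers) instead of only a mean.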

Advanced Profiling of Applications for Heterogeneous Multi-Core Platforms

The increased complexity of programming on multiprocessor platforms requires more insight into program behavior, for which programmers need increasingly sophisticated methods for profiling, instrumentation, measurement, analysis, and modeling of applications. In particular, tools to thoroughly investigate the memory access behavior of applications have become crucial due to the widening gap between memory bandwidth/latency and processing performance. To address these challenges, we developed the Q² profiling framework in the context of the Delft Workbench (DWB), a semi-automatic tool platform for integrated hardware/software co-design targeting heterogeneous computing systems that contain reconfigurable components. The profiling framework consists of two parts. The first is an advanced memory access profiling toolset that extracts information related to memory accesses during the execution of an application and provides a detailed overview of the runtime behavior of its memory access pattern; this information can be used for partitioning and mapping purposes. The second part is a statistical model that allows predictions to be made early in the design phase regarding memory and hardware usage, based on software metrics. We examine in detail a real application from the image processing domain to validate and demonstrate the potential of the Q² profiling framework.
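One simple analysis such a memory access profiling toolset might perform, sketched here with hypothetical names and no connection to Q²'s actual implementation, is a stride histogram over a recorded address trace: a single dominant stride suggests a regular, streaming access pattern that is friendly to prefetching and to mapping onto reconfigurable hardware.

```python
from collections import Counter

def stride_histogram(addrs):
    """Histogram of deltas between consecutive memory addresses."""
    return Counter(b - a for a, b in zip(addrs, addrs[1:]))

# A toy trace: a unit-stride walk over 64 eight-byte elements.
trace = [0x1000 + 8 * i for i in range(64)]
hist = stride_histogram(trace)
dominant, count = hist.most_common(1)[0]
```

Here every one of the 63 deltas is 8 bytes, so the histogram collapses to a single dominant stride.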

Performance Data Visualization of Linux Events on Multicores

Anais do XXII Simpósio em Sistemas Computacionais de Alto Desempenho (WSCAD 2021), 2021

Profiling tools are essential for understanding the behavior of parallel applications and assisting in the optimization process. However, tools such as Perf generate a large amount of data, which requires significant storage space and complicates reasoning about the results. We therefore propose VisPerf: a tool-chain and an interactive visualization dashboard for Perf data. The VisPerf tool-chain profiles the application and pre-processes the data, reducing the required storage space by a factor of about 50. Moreover, we used the visualization dashboard to quickly understand the performance of different events and to visualize specific threads and functions of a real-world application.
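The kind of pre-processing described above can be illustrated abstractly (a hypothetical sketch, not VisPerf's actual pipeline): collapsing raw per-sample records into per-(event, thread, function) counts shrinks the data by orders of magnitude while keeping exactly what a dashboard needs to render.

```python
from collections import Counter

def aggregate(samples):
    """Collapse raw (event, tid, function, timestamp, ...) sample
    records into per-(event, tid, function) counts."""
    return Counter((ev, tid, fn) for ev, tid, fn, *_ in samples)

# 1200 raw sample records ...
raw = ([("cycles", 1, "main", i) for i in range(1000)]
       + [("cache-misses", 2, "hash", i) for i in range(200)])
# ... reduce to two summary rows.
summary = aggregate(raw)
```

The reduction factor grows with trace length: per-sample timestamps are discarded, so the summary size depends only on how many distinct (event, thread, function) keys appear.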

THOR: A performance analysis tool for Java applications running on multicore systems

Ibm Journal of Research and Development, 2010

The growing scale of available commercial multicore systems has increased the need for concurrent programming, because these systems expose hardware parallelism to the programmer that must be explicitly managed in software. Commercial software developers that are traditionally skilled in sequential programming are now forced to reason about concurrency to exploit multicore parallelism. More than ever, this new group of concurrent programmers has to rely on performance tools for help. This paper introduces a new tool called THOR, which addresses the complexities introduced by multicore parallelism by organizing views in terms of Java® threads and using vertical profiling techniques to understand the state of a Java thread at all layers of the execution stack. Data collection supports visualization by tagging each trace event with a corresponding thread identifier and with a time-stamp from the machine's high-resolution timer. By exhaustively tracing fine-grain events, such as operating system context switches and Java lock contention, data collection allows a user to utilize visualization to reconstruct the precise behavior of their application. In this paper, we present the details of the THOR architecture and design. THOR is currently available from IBM alphaWorks® as part of the multicore software development kit.
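The tagging scheme described above, each trace event carrying a thread identifier and a high-resolution timestamp, can be sketched in a few lines (a toy model, not THOR's actual data collection; the logical thread ids here are supplied by the harness rather than by a JVM):

```python
import threading
import time

trace = []
trace_lock = threading.Lock()

def emit(tid, name):
    # Tag each event with a thread identifier and a monotonic,
    # high-resolution timestamp at the moment it is emitted.
    ev = (time.monotonic_ns(), tid, name)
    with trace_lock:
        trace.append(ev)

def worker(tid):
    emit(tid, "lock_acquire")
    emit(tid, "lock_release")

threads = [threading.Thread(target=worker, args=(i,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Sorting by timestamp reconstructs the global interleaving; grouping
# by thread id reconstructs each thread's own timeline.
per_thread = {}
for ts, tid, name in sorted(trace):
    per_thread.setdefault(tid, []).append(name)
```

Because every event carries both tags, the same raw trace supports both a machine-wide timeline view and a per-thread view, which is the property the visualization relies on.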

Universal Tracing Interface for Multicore Processors

Journal of Computer and Communications, 2016

Today's application developers need to produce code that is error-free and whose performance is optimized for a plethora of devices. The performance of application code is studied, e.g., by analyzing performance data obtained by executing the application with a tracing tool. Developers typically have favorite tools they prefer to use, but unfortunately target devices are based on different computing platforms with different performance probes, which makes it difficult to use the same tool across multicore platforms. The Universal Tracing Interface for Multicore Processors (UTIMP) aims to provide an unchanging tracing interface that enables developers to perform the required tracing tasks with UTIMP, using their favorite tool when possible, on different multicore platforms.

Scalable fine-grained call path tracing

2011

Applications must scale well to make efficient use of even medium-scale parallel systems. Because scaling problems are often difficult to diagnose, there is a critical need for scalable tools that guide scientists to the root causes of performance bottlenecks. Although tracing is a powerful performance-analysis technique, tools that employ it can quickly become bottlenecks themselves. Moreover, to obtain actionable performance feedback for modular parallel software systems, it is often necessary to collect and present fine-grained context-sensitive data, the very thing scalable tools avoid. While existing tracing tools can collect calling contexts, they do so only in a coarse-grained fashion; and no prior tool scalably presents both context- and time-sensitive data. This paper describes how to collect, analyze and present fine-grained call path traces for parallel programs. To scale our measurements, we use asynchronous sampling, whose granularity is controlled by a sampling frequency, and a compact representation. To present traces at multiple levels of abstraction and at arbitrary resolutions, we use sampling to render complementary slices of calling-context-sensitive trace data. Because our techniques are general, they can be used on applications that use different parallel programming models (MPI, OpenMP, PGAS). This work is implemented in HPCToolkit.
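The compact representation such a tool needs can be illustrated with a calling-context tree, where each sampled call path increments counts along its root-to-leaf chain instead of being stored verbatim (a minimal sketch, not HPCToolkit's actual data structure; the sample paths are invented):

```python
from collections import defaultdict

class CCTNode:
    """One node of a calling-context tree: identical call paths share
    their common prefix, so N samples of the same path cost one chain
    of nodes plus a counter, not N stored stacks."""
    def __init__(self):
        self.count = 0
        self.children = defaultdict(CCTNode)

    def add(self, path):
        node = self
        for frame in path:
            node = node.children[frame]
        node.count += 1  # one sample landed in this full context

root = CCTNode()
samples = [
    ["main", "solve", "mpi_wait"],
    ["main", "solve", "mpi_wait"],
    ["main", "io", "write"],
]
for s in samples:
    root.add(s)

hot = root.children["main"].children["solve"].children["mpi_wait"].count
```

Attributing cost to full paths rather than flat function names is what distinguishes "mpi_wait called from solve" from any other caller of the same function.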

Enhancing operating system support for multicore processors by using hardware performance monitoring

ACM SIGOPS Operating …, 2009

Multicore processors contain new hardware characteristics that are different from previous generation single-core systems or traditional SMP (symmetric multiprocessing) multiprocessor systems. These new characteristics provide new performance opportunities and challenges. In this paper, we show how hardware performance monitors can be used to provide a fine-grained, closely-coupled feedback loop to dynamic optimizations done by a multicore-aware operating system. These multicore optimizations are possible due to the advanced capabilities of hardware performance monitoring units currently found in commodity processors, such as execution pipeline stall breakdown and data address sampling.
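A counter-driven feedback loop of this general shape can be sketched with simulated counter values (a toy model: real systems would read stall-cycle breakdowns from the PMU, and the pairing policy below is hypothetical, not the paper's scheduler). The idea is that the OS uses per-thread stall fractions to co-schedule a memory-bound thread with a compute-bound one instead of placing two memory-bound threads together.

```python
def rebalance(stall_fractions, threshold=0.5):
    """Pair one memory-bound thread (stall fraction >= threshold) with
    one compute-bound thread for co-scheduling on a shared core pair.
    Input maps thread name -> stalled cycles / total cycles (simulated
    here, not read from real PMU hardware)."""
    memory_bound = sorted(t for t, f in stall_fractions.items() if f >= threshold)
    compute_bound = sorted(t for t, f in stall_fractions.items() if f < threshold)
    # zip drops any leftover threads when the two groups differ in size.
    return list(zip(memory_bound, compute_bound))

pairs = rebalance({"t0": 0.7, "t1": 0.1, "t2": 0.6, "t3": 0.2})
```

Running this on the sample input pairs the two high-stall threads `t0` and `t2` each with a low-stall partner, so neither shared cache/memory path serves two memory-bound threads at once.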