Cosimo Prete - Academia.edu
Papers by Cosimo Prete
Single-chip multiprocessors and multiple-thread architectures are becoming an affordable solution for high-performance general-purpose workstations and servers. On these machines, the workload typically consists of both sequential and parallel applications. A shared-bus, shared-memory multithreaded multiprocessor can be used to speed up the execution of such a workload. In this environment, the scheduler takes care of load balancing by allocating a ready process to the first available processor, thus producing process migration. Process migration and the persistence of private data in different caches produce an undesired form of sharing, named passive sharing. The copies due to passive sharing produce useless coherence traffic on the bus, and coping with this problem can be a challenging design issue for these machines. Many protocols use smart solutions to limit the overhead of maintaining coherence among shared copies. None of these studies treats passive sharing directly, although some indirect effect is present when dealing with the other kinds of sharing. Affinity scheduling can alleviate this problem, but this technique does not adapt to all load conditions, especially when the effects of migration are massive. A simple coherence protocol is presented. This protocol eliminates passive sharing using information from the compiler that is normally available in operating system kernels. The performance of this protocol has been evaluated and compared against other solutions proposed in the literature by means of enhanced trace-driven simulation. The proposed solution outperforms the other protocols, especially in the case of a multithreaded processor, thus demonstrating its effectiveness on this kind of hardware platform.
The complexity of the proposed approach has been evaluated in terms of the number of protocol states, additional bus lines, and required software support. The protocol further limits the coherence-maintaining overhead by using information about the access patterns to shared data exhibited by parallel applications.
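The passive-sharing effect described in this abstract can be illustrated with a toy model. The sketch below is not the protocol from the paper: it assumes a plain write-invalidate scheme, a single process that owns one private block, and a hypothetical function name (`passive_sharing_invalidations`). It only counts how many bus invalidations are spent on stale copies left behind by migration.

```python
def passive_sharing_invalidations(schedule):
    """Count invalidations caused only by stale copies of private data.

    `schedule` is a list of (cpu, block) pairs: each pair is a write by the
    same process to one of its private blocks while running on `cpu`.
    Assumption: write-invalidate coherence, one copy survives each write.
    """
    holders = {}        # block -> set of caches currently holding a copy
    invalidations = 0
    for cpu, block in schedule:
        stale = holders.get(block, set()) - {cpu}
        if stale:
            # No other process uses this block, so this bus
            # transaction is pure overhead (passive sharing).
            invalidations += 1
        holders[block] = {cpu}
    return invalidations


# The process writes block "x" on CPU 0, migrates to CPU 1, then back to CPU 0.
migrating = [(0, "x"), (0, "x"), (1, "x"), (1, "x"), (0, "x")]
pinned = [(0, "x")] * 5
print(passive_sharing_invalidations(migrating))
print(passive_sharing_invalidations(pinned))
```

In this model the pinned schedule generates no coherence traffic at all, which is why affinity scheduling helps when it can be applied.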
Journal of Imaging, 2021
During the production of pharmaceutical glass tubes, a machine-vision based inspection system can be used to perform the high-quality check required by the process. The need to improve detection accuracy and increase production speed calls for fast defect-detection solutions. Solutions proposed in the literature cannot be exploited efficiently because of specific factors that characterize the production process. In this work, we have derived an algorithm that does not change the detection quality compared to state-of-the-art proposals, but drastically reduces the processing time. The algorithm uses an adaptive threshold based on the Sigma Rule to detect blobs, and applies a threshold to the variation of luminous intensity along a row to detect air lines. These solutions limit the effects on detection of the tube's curvature, rotation, and vibration, which characterize glass tube production. The algorithm has been compared wi...
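The two thresholds mentioned in this abstract can be sketched roughly as follows. This is a minimal illustration, not the published algorithm: the per-row statistics, the factor `k`, the `jump` value, and both function names are assumptions chosen for the example.

```python
from statistics import mean, pstdev


def sigma_rule_mask(image, k=3.0):
    """Flag pixels that deviate from their row mean by more than k sigma.

    `image` is a list of rows of grey levels. Using per-row statistics is an
    assumption here; it makes the threshold adapt to brightness changes
    from row to row.
    """
    mask = []
    for row in image:
        mu = mean(row)
        sigma = pstdev(row)
        mask.append([sigma > 0 and abs(p - mu) > k * sigma for p in row])
    return mask


def air_line_columns(row, jump=40):
    """Return columns where intensity changes by more than `jump` between neighbours."""
    return [i for i in range(1, len(row)) if abs(row[i] - row[i - 1]) > jump]


# One dark pixel in an otherwise uniform row of 20 pixels.
row = [100] * 20
row[7] = 10
print(sigma_rule_mask([row])[0].index(True))
print(air_line_columns([100, 100, 100, 30, 100, 100]))
```

Because the threshold is derived from the local mean and standard deviation rather than fixed, a slow brightness drift across the image does not by itself trigger detections.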
IEEE Parallel and Distributed Technology, 1995
The Tracs graphical programming environment promotes a modular approach to the development of distributed applications. A few types of reusable design components make the environment both simple and powerful.
IEE Proceedings E Computers and Digital Techniques
ACM SIGARCH Computer Architecture News
The ever-increasing gap between processor and memory speed is also an issue in embedded systems, because of the increased complexity of multimedia processing and the strict resource constraints of these devices. Profile-driven code optimization techniques can be effectively employed for tuning application-cache interaction and the performance of the cache system itself. In fact, applications running on such systems are usually known in advance and do not change over time. In a previous paper, we presented a profile-based code restructuring technique (CAT) that was able to dramatically increase cache exploitation of embedded applications. However, it is well known that profile-driven optimizations can suffer from input-sensitivity problems: an application that is optimized for a particular input can perform even worse than the original one when subjected to other inputs. In this paper we consider JPEG and MPEG compressor/decompressor applications and analyze the input-sensitivity of C...
ACM SIGARCH Computer Architecture News
In this issue, we present a selection of papers from several workshops held in September 2001 in Barcelona, Spain. The workshops were hosted within the PACT (Parallel Architecture and Compilation Techniques) Conference [1], [2]. Advances in technology are improving the processing power and computing speed of systems. As the keynote speakers noted, the time has never been so propitious to explore the potential of compilers for the architecture and vice versa, given the strong demand for advances in the interaction of these two areas. The increasing interest is also shown by the record number of attendees this year, which is also due to the high-quality workshops focused on hot topics in the Compiler and Computer Architecture research areas. In 2001, five different workshops covered hot research themes: the Compilers and Operating Systems for Low Power (COLP) workshop, the European Workshop on OpenMP (EWOMP), the MEmory DEcoupling Architecture workshop (MEDEA), the Ubiquitous Computing and Communication (UCC) workshop, and the Workshop on Binary Translation (WBT). For copyright reasons, we cannot include
MELECON '98. 9th Mediterranean Electrotechnical Conference. Proceedings (Cat. No.98CH36056), 2000
We present a procedure placement method for embedded applications. We use trace-driven simulation to collect information on the use of cache lines, and then a heuristic algorithm to perform the placement. The main features of our method are a short computation time and a strong reduction of the miss ratio. Experimental results show an average miss rate reduction of 32%, but better improvements are obtained depending on the specific application.
Proceedings of the 1998 workshop on Computer architecture education - WCAE '98, 1998
Teaching how to design and tune an embedded system is a difficult task, since the student has to learn the many trade-offs that lead to the final system configuration. Existing tools are often too complex, or do not stress the basic steps in the design path. These steps are very useful during the first training sessions. The Csim2 environment, which is used at our university, lets the student become familiar with the concepts of program locality, cache structure, and performance tuning, while analyzing actual data produced by the actual software that has to be tied to the embedded system. The student can analyze program behavior by means of locality graphs, or run extensive parametric simulations in order to find the configuration that minimizes system cost, power consumption, or execution time. Further optimizations allow the designer to explore more sophisticated features such as selective caching, cache locking, scratch memory, and code mapping for better cache exploitation. In this paper we show the basic capabilities of the environment and some examples of training sessions. By means of graphs of program locality and performance metrics, the student is readily guided to learn how to select an adequate embedded system configuration.
This paper presents an approach for profiling and tracing multithreaded applications with two main objectives. First, to extend the strengths and overcome the limitations of the gprof tool when used on parallel applications. Second, to gather information that can be useful for extending the existing GCC profile-driven optimizations and for investigating new ones for parallel applications. In order to perform an insightful profiling of a multithreaded application, our approach proposes to gather intra-thread together with inter-thread information. For the latter, Operating System activity, as well as the usage of programmer-level synchronization mechanisms (e.g., semaphores, mutexes), has to be taken into account. The proposed approach exposes various kinds of per-thread information, such as the call graph, and a number of inter-thread ones, such as blocking relationships between threads, blocking time, usage patterns of synchronization mechanisms, and context switches. The approach introduces a relatively low overhead, which makes it widely applicable: less than 9% on test multithreaded benchmarks and less than 3.9x slowdown for the real MySQL executions.
1 Intro and motivation
Parallel and, in particular, multithreaded programming is very common, especially in general-purpose applications (e.g., office automation, OS services and tools, web browsers) and in special-purpose systems like web and DB servers, but it is also gaining importance in the embedded domain, due to the market demand for increasingly complex portable applications and the technological offer of increasingly powerful devices.
In addition, the trend towards on-chip parallel architectures reinforces the general interest in managing parallel applications along the entire software development process (i.e., from the design and programming phases, down to the compiling, optimizing, debugging, testing, and running phases), even if this is far more complicated than in the case of sequential applications [1]. The simple, but still very useful, profiling capabilities provided by the GNU gprof tool [11] for single-process, single-threaded applications are not applicable for gathering insightful information on multithreaded ones, for two main reasons: a) the collected information is per-process and therefore cannot reveal thread-specific behavior; b) there is no way to gather inter-thread information, which is related to both cooperation and competition for shared resources that the threads use through the Operating System (OS) synchronization primitives (e.g., semaphores). In order to tune the performance of applications through specific optimizations [5] (manually and/or automatically), each thread profile has to be available, as well as specific information on the interaction between threads. For instance, some feedback-directed optimizations for cache performance, like the Pettis and Hansen one [6], are already present in GCC and rely on the function call graph, which gprof can collect on single-threaded applications. Additional statistics for the analysis of the temporal and spatial locality of functions, which could enable more sophisticated optimizations [7][8][9], are still missing even for single-threaded applications. For multithreaded applications, the gprof tool only collects statistics on the main thread, which can account for a negligible part of the executed instructions and of the execution time of the application.
This work aims to provide a profiling framework that lays the foundations for the profiling/tracing of multithreaded applications, so that existing and, possibly, new feedback-directed optimizations can be investigated.
In embedded systems, cost, power consumption, and die size requirements push the designer to use small and simple cache memories. Such caches can provide low performance because of limited memory capacity and an inflexible placement policy. A way to increase performance is to adapt the program layout to the cache structure. This strategy requires the solution of an NP-complete problem and a very long processing time. We propose a strategy to look for a near-optimum program layout within a reasonable time by means of smart heuristics. This solution does not add code and uses standard functionalities of a linker to produce the new layout. Our approach is able to reduce misses by up to 70% in the case of a 2-kbyte direct access cache.
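To see why program layout matters for a small cache, the following sketch replays a call trace against two layouts and counts instruction misses. It is an illustration only, not the heuristic from the paper: it assumes a direct-mapped cache with 32-byte lines, assumes each call fetches the whole procedure, and uses hypothetical names (`count_misses`, `layout`, `sizes`).

```python
def count_misses(layout, sizes, trace, cache_bytes=2048, line_bytes=32):
    """Count instruction-fetch misses for a call trace in a direct-mapped cache.

    layout: procedure name -> start address in bytes
    sizes:  procedure name -> code size in bytes
    trace:  sequence of procedure names in call order
    Assumption: every call touches all memory blocks of the procedure.
    """
    n_lines = cache_bytes // line_bytes
    cached = [None] * n_lines            # memory block held by each cache line
    misses = 0
    for proc in trace:
        first = layout[proc] // line_bytes
        last = (layout[proc] + sizes[proc] - 1) // line_bytes
        for block in range(first, last + 1):
            index = block % n_lines      # direct-mapped placement
            if cached[index] != block:
                cached[index] = block
                misses += 1
    return misses


sizes = {"a": 64, "b": 64}
trace = ["a", "b"] * 10                  # a and b call each other repeatedly
conflicting = {"a": 0, "b": 2048}        # both map to cache lines 0 and 1
adjacent = {"a": 0, "b": 64}             # disjoint cache lines
print(count_misses(conflicting, sizes, trace))
print(count_misses(adjacent, sizes, trace))
```

Only the start addresses differ between the two runs, which mirrors the idea that a new layout can be produced by the linker without adding code.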
Proceedings of EUROMICRO 96. 22nd Euromicro Conference. Beyond 2000: Hardware and Software Design Strategies, 1995
This paper describes a hybrid methodology (based on both actual and synthetic reference streams) to produce traces representing significant complete workloads. By means of a software approach, we generate traces that include both user and kernel references, starting from ...
28th International Conference on Information Technology Interfaces, 2006
In this paper we analyze how the elements in the Microsoft Authenticode interface influence end users' decisions about downloading code from the Internet. Results show that users' behavior appears to be driven mostly by the code publisher name, without considering other information provided by the interface. A proposal to improve the user interface is currently under evaluation.
Many distributed applications make use of distributed object technology. In systems of this kind, modules providing services are implemented as objects spread over a network. Distributed objects are usually accessed through communication frameworks based on specific middleware solutions, such as CORBA, DCOM, and RMI. Applications of this kind might be built (or extended) by integrating different modules, possibly already coded and available on the market. Each required and available module might use a specific communication framework, hampering its prompt integration into a system that uses a different framework. A convenient way to tackle this problem is to insert a gateway module that passes service requests between two different middleware solutions. This approach allows quick integration of service modules, but it could lead to performance problems due to the communication overhead it introduces. In this paper, we report our experience in developing a simple CORBA/RMI gatew...