Phanindra Mannava - Academia.edu
Uploads
Papers by Phanindra Mannava
Abstract. In the last three years or so we at Enterprise Platforms Group at Intel Corporation have been applying formal methods to various problems that arose during the process of defining platform architectures for Intel’s processor families. In this paper we give an overview of some of the problems we have worked on, the results we have obtained, and the lessons we have learned. The last topic is addressed mainly from the perspective of platform architects. 1. Problems and Results Modern computer systems are highly complex distributed systems with many interacting components. Architecturally they are often organized like a computer network into multiple layers: physical layer, link layer, protocol layer, etc. Most of the problems to which we applied formal methods are the formal verification (FV) of intricate protocols in the protocol and link layers. In addition, we also found several novel uses of binary decision diagrams (BDDs) [3] that are worth mentioning. 1.1. Directory-bas...
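The protocol-level formal verification mentioned in this abstract can be illustrated, at toy scale, by exhaustive state-space exploration. The sketch below is not Intel's methodology or any particular tool; it is a minimal, hypothetical Python model that enumerates the reachable states of a two-cache MSI-style protocol and checks one safety property (at most one cache in the modified state).

# Minimal, hypothetical sketch of explicit-state protocol verification:
# enumerate all reachable states of a toy two-cache protocol and check a
# safety invariant. Illustrative only; not any specific tool or methodology.
STATES = ("I", "S", "M")          # invalid, shared, modified

def next_states(state):
    """All states reachable in one step: any cache may change its copy's state."""
    succ = set()
    for i in range(len(state)):
        for target in STATES:
            new = list(state)
            new[i] = target
            # model the coherence action: an upgrade to M invalidates the other copies
            if target == "M":
                new = ["I" if j != i else "M" for j in range(len(state))]
            succ.add(tuple(new))
    return succ

def check_invariant(state):
    """Safety property: at most one cache holds the line in M."""
    return sum(1 for s in state if s == "M") <= 1

def explore(initial=("I", "I")):
    seen, frontier = {initial}, [initial]
    while frontier:
        state = frontier.pop()
        assert check_invariant(state), f"violation in {state}"
        for nxt in next_states(state) - seen:
            seen.add(nxt)
            frontier.append(nxt)
    return seen

if __name__ == "__main__":
    print(f"explored {len(explore())} reachable states, invariant holds")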
The cache coherence scheme for a scalable distributed shared memory multiprocessor should be efficient in terms of memory overhead for maintaining the directories, as well as network latency for a memory request. In this paper, we propose a cache coherence scheme which minimizes the memory access delay and, at the same time, reduces the directory overhead by using a limited directory scheme. In the proposed scheme, pointer overflow is handled by an efficient invalidation mechanism that uses logically embedded rings for transmitting control messages. A single-ring architecture for a small-scale multiprocessor and a hierarchical multiple-ring architecture for a scalable multiprocessor are evaluated. In both architectures, wormhole routing, in conjunction with the use of the ring, introduces a snoopy behavior into the proposed scheme. We will show, with the help of execution-driven simulation results, that for several applications our techniques outperform the full map directory scheme, as well as the traditional implementations of limited directory schemes.
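As a rough illustration of the limited-directory idea described above, the hypothetical Python sketch below keeps at most a fixed number of sharer pointers per memory line and, on pointer overflow, falls back to sending the invalidation around a logically embedded ring so that every node snoops it. The class name, pointer count, and ring order are assumptions made for the example, not details taken from the paper.

# Hypothetical sketch of a limited directory entry: up to MAX_PTRS exact
# sharer pointers; on overflow, invalidation falls back to a ring traversal
# so every node snoops the message (details assumed, not from the paper).
MAX_PTRS = 4

class LimitedDirEntry:
    def __init__(self):
        self.sharers = set()      # exact pointers while they fit
        self.overflow = False     # set once pointer storage is exhausted

    def add_sharer(self, node_id):
        if not self.overflow and len(self.sharers) < MAX_PTRS:
            self.sharers.add(node_id)
        else:
            self.overflow = True  # stop tracking exact sharers

    def invalidate(self, ring_order, home_node):
        if not self.overflow:
            targets = sorted(self.sharers)                        # point-to-point invalidations
        else:
            targets = [n for n in ring_order if n != home_node]   # walk the whole ring
        self.sharers.clear()
        self.overflow = False
        return targets

# Example: 8 nodes on a logical ring; the 5th sharer triggers overflow.
entry = LimitedDirEntry()
for node in (1, 2, 3, 4, 5):
    entry.add_sharer(node)
print(entry.invalidate(ring_order=list(range(8)), home_node=0))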
The cache coherence scheme for a scalable distributed shared memory multiprocessor should be efficient in terms of memory overhead for maintaining the directories, as well as network latency for a memory request. In this paper, we propose a cache coherence scheme which minimizes the memory access delay and, at the same time, reduces the directory overhead by using a limited directory scheme. In the proposed scheme, pointer overflow is handled by using a logically embedded ring for transmitting control messages. Wormhole routing, in conjunction with the use of the ring, introduces a snoopy behavior into the proposed scheme. We will show, with the help of several execution-driven simulation results, that for real applications our technique outperforms the full map directory scheme, as well as the traditional implementations of limited directory schemes.
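To make the "snoopy behavior" on the embedded ring concrete, the small hypothetical sketch below passes a single invalidation message around the ring; each node snoops the address as the message goes by and drops its own copy if it matches. The node count and data structures are assumptions for illustration only, not the paper's design.

# Hypothetical illustration of an invalidation message traversing a logical
# ring: every node snoops the address as the message passes (assumed model).
def ring_invalidate(caches, ring_order, home, address):
    """Send one invalidation around the ring, starting after the home node."""
    hops = 0
    n = len(ring_order)
    start = ring_order.index(home)
    for step in range(1, n):                     # visit every other node once
        node = ring_order[(start + step) % n]
        hops += 1
        caches[node].discard(address)            # snoop: drop the line if cached
    return hops

caches = {node: set() for node in range(4)}
caches[1].add(0x40)
caches[3].add(0x40)
print(ring_invalidate(caches, ring_order=[0, 1, 2, 3], home=0, address=0x40), caches)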
Due to the ever-increasing demand for computation power, and having reached the physical speed-up limitations of uniprocessor-based computers, all the major computer vendors have started designing multiprocessors. Multiprocessors can be broadly classified into shared memory and message passing. Shared memory ones are more popular due to the ease of programming and the availability of parallelizing compilers. For small-scale systems a bus has been used successfully for connecting the processors together. For the system to be scalable, other networks such as MIN, mesh, hypercube, etc. have to be used. Among these, the hypercube has been used for many multiprocessor designs. When a point-to-point network is used for shared memory, the memory access time is reduced by using caches near each processor. In such a situation, directories are used for maintaining cache coherence. Several "limited directory" schemes have been proposed, which limit the directory storage overhead but result in incr...
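Since the abstract singles out the hypercube network, a brief sketch of the standard topology facts may help: in a d-dimensional hypercube, node IDs are d-bit numbers, two nodes are neighbors exactly when their IDs differ in one bit, and dimension-ordered (e-cube) routing fixes one differing bit per hop. The Python below is a textbook construction given for illustration, not material from the thesis.

# Standard hypercube facts, sketched for illustration (not from the thesis):
# neighbors differ in exactly one bit; dimension-ordered routing corrects one
# differing bit per hop.
def neighbors(node, d):
    return [node ^ (1 << bit) for bit in range(d)]

def ecube_route(src, dst, d):
    path, cur = [src], src
    for bit in range(d):
        if (cur ^ dst) & (1 << bit):    # correct this dimension if it differs
            cur ^= (1 << bit)
            path.append(cur)
    return path

print(neighbors(0b000, 3))             # [1, 2, 4]
print(ecube_route(0b000, 0b101, 3))    # [0, 1, 5]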
A centralised synchronising device 20 for use in a data processing system having a plurality of devices 50, 52, 54, 80, 82, 84 and an interconnect 10 interconnecting the devices. The synchronising device comprises: input/output port 25, buffer 32 storing pending system synchronising requests, arbitration circuitry 34 for selecting the next pending system synchronising request and forwarding it to synchronising request generator 37 and multicast circuitry 39. The synchronising request generator 37 and multicast circuitry 39 generate synchronising requests in response to the system synchronising request and output the requests as a multicast to at least some of the devices within the data processing system. The devices which are the target of the multicast may be specified in target lists 35. Gather circuitry 40 collects responses to the synchronising requests and is configured to output a response to the system synchronising request via response generator 45 when responses to all...
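The flow this abstract describes (buffer the request, arbitrate, multicast to a target list, gather the responses, then answer the requester) can be sketched as a few lines of hypothetical Python. The class and method names below are only a paraphrase of the abstract for illustration, not the claimed circuitry.

# Hypothetical software model of the synchroniser's flow: queue a system
# synchronising request, pick the next one, multicast to the target list,
# gather all responses, then respond to the original requester.
from collections import deque

class Synchroniser:
    def __init__(self, target_list):
        self.pending = deque()          # buffer for pending system sync requests
        self.target_list = target_list  # devices to multicast to

    def submit(self, requester):
        self.pending.append(requester)

    def service_next(self, devices):
        requester = self.pending.popleft()                          # arbitration: pick next request
        responses = [devices[t].sync() for t in self.target_list]   # multicast + gather
        assert all(responses), "a device failed to acknowledge the sync request"
        return f"sync complete for requester {requester}"

class Device:
    def sync(self):
        return True   # acknowledge the synchronising request

devices = {name: Device() for name in ("cpu0", "cpu1", "dma0")}
sync = Synchroniser(target_list=["cpu0", "cpu1", "dma0"])
sync.submit("cpu0")
print(sync.service_next(devices))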
Midwest Symposium on Circuits and Systems, 1997
Load flow, the most common analysis of power systems, requires the solution of a set of thousands of nonlinear algebraic equations. In this paper we develop a methodology to evaluate the efficiency of parallelization of load flow analysis techniques on different multiprocessor architectures. Parallelization is essential for obtaining the analysis in real time.
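As context for the "thousands of nonlinear algebraic equations", load flow is commonly solved with Newton-Raphson iterations, and the linear solve inside each iteration is the kernel that a parallelization scheme must distribute. The hypothetical sketch below runs that loop on a tiny generic two-equation system standing in for the power-flow equations; it is an illustration, not the paper's method.

# Generic Newton-Raphson loop of the kind load-flow solvers iterate; the
# two-equation system here is only a toy stand-in for the power-flow equations.
import numpy as np

def f(x):
    return np.array([x[0]**2 + x[1]**2 - 4.0,
                     x[0] * x[1] - 1.0])

def jacobian(x):
    return np.array([[2.0 * x[0], 2.0 * x[1]],
                     [x[1],       x[0]]])

x = np.array([2.0, 0.5])
for _ in range(20):
    dx = np.linalg.solve(jacobian(x), -f(x))    # solve J dx = -f, the expensive step
    x += dx
    if np.linalg.norm(dx) < 1e-10:
        break
print(x, f(x))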
ABSTRACT The cache coherence scheme for a scalable distributed shared memory multiprocessor should be efficient in terms of memory overhead for maintaining the directories, as well as network latency for a memory request. In this paper, we propose a cache coherence scheme which minimizes the memory access delay and, at the same time, reduces the directory overhead by using a limited directory scheme. In the proposed scheme, pointer overflow is handled by an efficient invalidation mechanism that uses logically embedded rings for transmitting control messages. A single-ring architecture for a small-scale multiprocessor and a hierarchical multiple-ring architecture for a scalable multiprocessor are evaluated. In both architectures, wormhole routing, in conjunction with the use of the ring, introduces a snoopy behavior into the proposed scheme. We will show, with the help of execution-driven simulation results, that for several applications our techniques outperform the full map directory scheme, as well as the traditional implementations of limited directory schemes.