David Nagle - Profile on Academia.edu
Papers by David Nagle
Operating Systems Design and Implementation, Oct 22, 2000
Sigplan Notices, Oct 1, 1998
This paper describes the Network-Attached Secure Disk (NASD) storage architecture, prototype implementations of NASD drives, array management for our architecture, and three filesystems built on our prototype. NASD provides scalable storage bandwidth without the cost of servers used primarily for transferring data from peripheral networks (e.g. SCSI) to client networks (e.g. ethernet). Increasing dataset sizes, new attachment technologies, the convergence of peripheral and interprocessor switched networks, and the increased availability of on-drive transistors motivate and enable this new architecture. NASD is based on four main principles: direct transfer to clients, secure interfaces via cryptographic support, asynchronous non-critical-path oversight, and variably-sized data objects. Measurements of our prototype system show that these services can be cost-effectively integrated into a next generation disk drive ASIC. End-to-end measurements of our prototype drive and filesystems suggest that NASD can support conventional distributed filesystems without performance degradation. More importantly, we show scalable bandwidth for NASD-specialized filesystems. Using a parallel data mining application, NASD drives deliver a linear scaling of 6.2 MB/s per client-drive pair, tested with up to eight pairs in our lab.
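A quick arithmetic check of that scaling claim, taking only the 6.2 MB/s per-pair figure from the abstract (everything else below is illustration):

```python
# Aggregate bandwidth implied by linear scaling at 6.2 MB/s per client-drive
# pair, for the one to eight pairs the prototype was tested with.
PER_PAIR_MBPS = 6.2

for pairs in range(1, 9):
    print(f"{pairs} pair(s): ~{PER_PAIR_MBPS * pairs:.1f} MB/s aggregate")
```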
Fine-grained measurements" involve the monitoring of events that can change as frequently as once... more Fine-grained measurements" involve the monitoring of events that can change as frequently as once every machine cycle.
Operating Systems Design and Implementation, Oct 8, 2018
Abstract: "The primary method for protecting networks today is to use a firewall: a boundary... more Abstract: "The primary method for protecting networks today is to use a firewall: a boundary separating the protected network from the untrusted Internet. However, these firewalls offer no protection from internal attacks, scale poorly due to limited firewall processing capacity, and do not support mobile computing. Distributing a firewall to each network host avoids many of these problems, but weakens the security guarantees of the network since it places the firewall under the control of the host OS. Leveraging the increasing capability of embedded-VLSI, including network-specific processors, we propose a Network Interface Card (NIC) based distributed firewall. Supporting the same (and more) functions as a centralized firewall, NIC-based firewalls provide significant benefits including: scalability, easier client customization, sharing application/OS state to enable application-level filtering, and the ability to block misbehaving hosts at the source, the host itself. We desc...
Proceedings of the 21st International Symposium on Computer Architecture
The allocation of die area to different processor components is a central issue in the design of single-chip microprocessors. Chip area is occupied by both core execution logic, such as ALU and FPU datapaths, and memory structures, such as caches, TLBs, and write buffers. This work focuses on the allocation of die area to memory structures through a cost/benefit analysis. The cost of memory structures with different sizes and associativities is estimated by using an established area model for on-chip memory. The performance benefits of selecting a given structure are measured through a collection of methods including on-the-fly hardware monitoring, trace-driven simulation and kernel-based analysis. Special consideration is given to operating systems that support multiple application programming interfaces (APIs), a software trend that substantially affects on-chip memory allocation decisions. Results: Small adjustments in cache and TLB design parameters can significantly impact overall performance. Operating systems that support multiple APIs, such as Mach 3.0, increase the relative importance of on-chip instruction caches and TLBs when compared against single-API systems such as Ultrix.
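As a rough illustration of the cost/benefit framing (the area and benefit numbers below are invented; the paper derives costs from an on-chip memory area model and benefits from hardware monitoring, trace-driven simulation, and kernel-based analysis):

```python
from itertools import product

# One set of options per memory structure; choose exactly one from each group.
# (size label, area in mm^2, estimated performance benefit) -- hypothetical values.
OPTIONS = {
    "I-cache": [("8KB", 2.2, 0.08), ("16KB", 4.0, 0.12), ("32KB", 7.5, 0.16)],
    "D-cache": [("8KB", 2.2, 0.07), ("16KB", 4.0, 0.10)],
    "TLB":     [("64-entry", 1.0, 0.05), ("128-entry", 1.8, 0.07)],
}
AREA_BUDGET = 12.0  # mm^2 available for on-chip memory structures (illustrative)

best_choice, best_benefit = None, -1.0
for combo in product(*OPTIONS.values()):
    area = sum(opt[1] for opt in combo)
    benefit = sum(opt[2] for opt in combo)
    if area <= AREA_BUDGET and benefit > best_benefit:
        best_choice, best_benefit = combo, benefit

print({name: opt[0] for name, opt in zip(OPTIONS, best_choice)},
      f"-> estimated benefit {best_benefit:.2f}")
```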
Lecture Notes in Computer Science, 1999
Computer security is of growing importance in the increasingly networked computing environment. This work examines the issue of high-performance network security, specifically integrity, by focusing on integrating security into network storage systems. Emphasizing the cost-constrained environment of storage, we examine how current software-based cryptography cannot support storage's Gigabit/sec transfer rates. To solve this problem, we introduce a novel message authentication code based on stored message digests. This allows storage to deliver high performance, a factor of five improvement in our prototype's integrity-protected bandwidth, without hardware acceleration for common read operations. For receivers, where precomputation cannot be done, we outline an inline message authentication code that minimizes buffering requirements.
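A minimal sketch of one way the stored-digest idea can work, assuming a hash-then-MAC construction: the block's digest is precomputed at write time, so a read only has to MAC the short stored digest rather than re-hash the data. The digest/MAC choices and nonce handling here are assumptions, not the paper's exact scheme:

```python
import hashlib, hmac, os

KEY = os.urandom(32)

def write_block(data: bytes) -> tuple[bytes, bytes]:
    """Store a block along with its precomputed digest (done once, at write time)."""
    return data, hashlib.sha256(data).digest()

def read_block(stored: tuple[bytes, bytes], nonce: bytes) -> tuple[bytes, bytes]:
    """Serve a read: MAC only the short stored digest plus a nonce, not the data."""
    data, digest = stored
    tag = hmac.new(KEY, digest + nonce, hashlib.sha256).digest()
    return data, tag

def verify(data: bytes, nonce: bytes, tag: bytes) -> bool:
    """Receiver (no precomputation possible) re-hashes the data and checks the MAC."""
    digest = hashlib.sha256(data).digest()
    expected = hmac.new(KEY, digest + nonce, hashlib.sha256).digest()
    return hmac.compare_digest(tag, expected)

block = write_block(b"some stored data")
nonce = os.urandom(16)
data, tag = read_block(block, nonce)
print(verify(data, nonce, tag))  # True
```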
Proceedings of the 36th annual ACM/IEEE Design Automation Conference, 1999
ACM Transactions on Computer Systems, 1994
An increasing number of architectures provide virtual memory support through software-managed TLBs. However, software management can impose considerable penalties that are highly dependent on the operating system's structure and its use of virtual memory. This work explores software-managed TLB design tradeoffs and their interaction with a range of monolithic and microkernel operating systems. Through hardware monitoring and simulation, we explore TLB performance for benchmarks running on a MIPS R2000-based workstation running Ultrix, OSF/1, and three versions of Mach 3.0.
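A toy model of the software-managed TLB refill path described above (the structure is illustrative and does not reproduce the MIPS R2000 handlers measured in the paper):

```python
from collections import OrderedDict

PAGE_TABLE = {vpn: vpn + 0x1000 for vpn in range(64)}   # vpn -> pfn (toy mapping)
TLB_SIZE = 4
tlb: OrderedDict[int, int] = OrderedDict()              # tiny FIFO-replacement TLB
misses = 0

def translate(vpn: int) -> int:
    """Return the physical frame for vpn, invoking the software miss handler if needed."""
    global misses
    if vpn in tlb:
        return tlb[vpn]
    # --- software miss handler: walk the page table and refill the TLB ---
    misses += 1
    pfn = PAGE_TABLE[vpn]
    if len(tlb) >= TLB_SIZE:
        tlb.popitem(last=False)   # evict the oldest entry
    tlb[vpn] = pfn
    return pfn

for vpn in [1, 2, 3, 1, 2, 5, 6, 1]:
    translate(vpn)
print("TLB misses:", misses)
```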
ACM SIGMETRICS Performance Evaluation Review, 1994
ACM SIGOPS Operating Systems Review, 1994
Tapeworm II is a software-based simulation tool that evaluates the cache and TLB performance of multiple-task and operating system intensive workloads. Tapeworm resides in an OS kernel and causes a host machine's hardware to drive simulations with kernel traps instead of with address traces, as is conventionally done. This allows Tapeworm to quickly and accurately capture complete memory referencing behavior with a limited degradation in overall system performance. This paper compares trap-driven simulation, as implemented in Tapeworm, with the more common technique of trace-driven memory simulation with respect to speed, accuracy, portability and flexibility.
ACM Transactions on Computer Systems, 2013
Minimizing floating-point power dissipation via bit-width reduction
... We compare our variable bit-width multiplier with a baseline fixed-width 24x24-bit Wallace Tree multiplier. The layout of this Wallace Tree multiplier was generated by Epoch's cell generator in the same 0.5µm process as used in the design of the digit-serial multiplier. ...
Fine-grained measurements" involve the monitoring of events that can change as frequently as once... more Fine-grained measurements" involve the monitoring of events that can change as frequently as once every machine cycle.
This paper describes the Object Storage Architecture solution for cost-effective, high bandwidth storage in High Performance Computing (HPC) environments. An HPC environment requires a storage system to scale to very large sizes and performance without sacrificing cost-effectiveness or ease of sharing and managing data. Traditional storage solutions, including disk-per-node, Storage-Area Networks (SAN), and Network-Attached Storage (NAS) implementations, fail to find a balance between performance, ease of use, and cost as the storage system scales up. In contrast, building storage systems as specialized storage clusters using commodity-off-the-shelf (COTS) components promises excellent price-performance at scale, provided that binding them into a single system image and linking them to HPC compute clusters can be done without introducing bottlenecks or management complexities. While a file interface (typified by NAS systems) at each storage cluster component is too high-level to provide scalable bandwidth and simple management across large numbers of components, and a block interface (typified by SAN systems) is too low-level to avoid synchronization bottlenecks in a shared storage cluster, an object interface (typified by the inode layer of traditional file system implementations) is at the intermediate level needed for independent, highly parallel operation at each storage cluster component under centralized, but infrequently applied, control. The Object Storage Device (OSD) interface achieves this independence by storing an unordered collection of named variable-length byte arrays, called objects, and embedding extendable attributes, fine-grain capability-based access control, and encapsulated data layout and allocation into each object. With this higher-level interface, object storage clusters are capable of highly parallel data transfers between storage and compute cluster nodes under the infrequently applied control of the out-of-band metadata managers. Object Storage Architectures support single-system-image file systems with the traditional sharing and management features of NAS systems and the resource consolidation and scalable performance of SAN systems.
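A hypothetical sketch of the object interface and device-enforced capability check described above; the capability format (an HMAC over object id and rights, keyed with a secret shared between metadata manager and device) is an illustrative assumption:

```python
import hmac, hashlib, os
from dataclasses import dataclass, field

DEVICE_SECRET = os.urandom(32)   # shared between metadata manager and storage device

@dataclass
class StorageObject:
    data: bytearray = field(default_factory=bytearray)
    attributes: dict[str, bytes] = field(default_factory=dict)   # extendable attributes

objects: dict[int, StorageObject] = {42: StorageObject(bytearray(b"hello"))}

def make_capability(obj_id: int, rights: str) -> bytes:
    """Issued (infrequently) by the out-of-band metadata manager."""
    return hmac.new(DEVICE_SECRET, f"{obj_id}:{rights}".encode(), hashlib.sha256).digest()

def read_object(obj_id: int, rights: str, capability: bytes) -> bytes:
    """Checked on the storage device for every request, with no server in the data path."""
    expected = hmac.new(DEVICE_SECRET, f"{obj_id}:{rights}".encode(), hashlib.sha256).digest()
    if "read" not in rights or not hmac.compare_digest(capability, expected):
        raise PermissionError("invalid capability")
    return bytes(objects[obj_id].data)

cap = make_capability(42, "read")
print(read_object(42, "read", cap))   # b'hello'
```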
ACM Transactions on Modeling and Computer Simulation, 1997
Trap-driven simulation is a new approach for analyzing the performance of memory-system components such as caches and translation-lookaside buffers (TLBs). Unlike the more traditional trace-driven approach to simulating memory systems, trap-driven simulation uses the hardware of a host machine to drive simulations with operating-system kernel traps instead of with address traces. As a workload runs, a trap-driven simulator dynamically modifies access to memory in such a way as to make memory traps correspond exactly to misses in a simulated cache structure. Because traps are handled inside the kernel of the host operating system, a trap-driven simulator can monitor all components of multitask workloads including the operating system itself. Compared to trace-driven simulators, a trap-driven simulator causes relatively little slowdown to the host system because traps occur only in the infrequent case of simulated cache misses. Unfortunately, because they require special forms of hardware support to cause memory-access traps, trap-driven simulators are difficult to port, and they are not as flexible as trace-driven simulators in the types of memory configurations that they can model. Several researchers have recently begun to use trap-driven techniques in their studies of memory-system design tradeoffs, but little is known about how the speed and accuracy of the technique varies with the type of simulations conducted, or about the nature of its drawbacks with respect to portability and flexibility. In this article, we use a prototype trap-driven simulator, named Tapeworm II, to explore these issues. We expose both the strengths and the weaknesses of trap-driven simulation with respect to speed, accuracy, completeness, portability, flexibility, ease-of-use, and memory overhead. Although the results are drawn from a specific implementation of trap-driven simulation, we believe that many of our results from Tapeworm hold true for trap-driven simulation in general.
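A toy model of the trap-driven control flow, assuming a simulated TLB: only accesses that would miss "trap" into the simulator, while hits run untracked, which is why the slowdown stays low. A real implementation such as Tapeworm II relies on page-protection hardware and kernel traps; this sketch only mirrors the logic:

```python
simulated_tlb: set[int] = set()   # virtual page numbers currently "mapped"
TLB_ENTRIES = 8
traps = 0
total = 0

def access(addr: int) -> None:
    """Model one memory access; only simulated misses invoke the 'trap handler'."""
    global traps, total
    total += 1
    vpn = addr >> 12
    if vpn in simulated_tlb:
        return                    # hit: no trap, simulator never runs
    traps += 1                    # miss: this access would trap into the simulator
    if len(simulated_tlb) >= TLB_ENTRIES:
        simulated_tlb.pop()       # evict an arbitrary entry
    simulated_tlb.add(vpn)

for addr in range(0, 1 << 16, 64):   # sweep 64 KB in 64-byte strides
    access(addr)
print(f"simulator invoked on {traps} of {total} accesses")
```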
Network-attached storage enables network-striped data transfers directly between client and storage to provide clients with scalable bandwidth on large transfers. Network-attached storage also decouples policy and enforcement of access control, avoiding unnecessary reverification of protection checks, reducing file manager work and increasing scalability. It eliminates the expense of a server computer devoted to copying data between peripheral network and client network. This architecture better matches storage technology's sustained data rates, now 80 Mb/s and growing at 40% per year. Finally, it enables self-managing storage to counter the increasing cost of data management. The availability of cost-effective network-attached storage depends on it becoming a storage commodity, which in turn depends on its utility to a broad segment of the storage market. Specifically, multiple distributed and parallel filesystems must benefit from network-attached storage's requirement for secure, direct access between client and storage, for reusable, asynchronous access protection checks, and for increased license to efficiently manage underlying storage media. In this paper, we describe a prototype network-attached secure disk interface and filesystems adapted to network-attached storage implementing Sun's NFS, Transarc's AFS, a network-striped NFS variant, and an informed prefetching NFS variant. Our experimental implementations demonstrate bandwidth and workload scaling and aggressive optimization of application access patterns. Our experience with applications and filesystems adapted to run on network-attached secure disks emphasizes the much greater cost of client network messaging relative to peripheral bus messaging, which offsets some of the expected scaling results.
Modeling and Scheduling of MEMS-Based Storage Devices
Performance Evaluation Review, Jun 1, 1997
By providing direct data transfer between storage and client, network-attached storage devices have the potential to improve scalability for existing distributed file systems (by removing the server as a bottleneck) and bandwidth for new parallel and distributed file systems (through network striping and more efficient data paths). Together, these advantages influence a large enough fraction of the storage market to make commodity network-attached storage feasible. Realizing the technology's full potential requires careful consideration across a wide range of file system, networking and security issues. This paper contrasts two network-attached storage architectures: (1) Networked SCSI disks (NetSCSI) are network-attached storage devices with minimal changes from the familiar SCSI interface, while (2) Network-Attached Secure Disks (NASD) are drives that support independent client access to drive object services. To estimate the potential performance benefits of these architectures, we develop an analytic model and perform trace-driven replay experiments based on AFS and NFS traces. Our results suggest that NetSCSI can reduce file server load during a burst of NFS or AFS activity by about 30%. With the NASD architecture, server load (during burst activity) can be reduced by a factor of up to five for AFS and up to ten for NFS. This research was sponsored by DARPA/ITO through ARPA Order D306 under contract N00174-96-0002 and in part by an ONR graduate fellowship. The project team is indebted to generous contributions from the member companies of the Parallel Data