Jon Weissman | University of Minnesota - Twin Cities (original) (raw)

Papers by Jon Weissman

SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, 2018

In the era of rapid experimental expansion data analysis needs are rapidly outpacing the capabili... more In the era of rapid experimental expansion data analysis needs are rapidly outpacing the capabilities of small institutional clusters and looking to integrate HPC resources into their workflow. We propose one way of reconciling ondemand needs of experimental analytics with the batch managed HPC resources within a system that dynamically moves nodes between an on-demand cluster configured with cloud technology (OpenStack) and a traditional HPC cluster managed by a batch scheduler (Torque). We evaluate this system experimentally both in the context of real-life traces representing two years of a specific institutional need, and via experiments in the context of synthetic traces that capture generalized characteristics of potential batch and on-demand workloads. Our results for the real-life scenario show that our approach could reduce the current investment in on-demand infrastructure by 82% while at the same time improving the mean batch wait time almost by an order of magnitude (8x). Index Terms-Computers and information processing, Distributed computing, Metacomputing, Grid computing.

ArXiv, 2012

This report has two objectives. First, we describe a set of the production distributed infrastruc... more This report has two objectives. First, we describe a set of the production distributed infrastructures currently available, so that the reader has a basic understanding of them. This includes explaining why each infrastructure was created and made available and how it has succeeded and failed. The set is not complete, but we believe it is representative. Second, we describe the infrastructures in terms of their use, which is a combination of how they were designed to be used and how users have found ways to use them. Applications are often designed and created with specific infrastructures in mind, with both an appreciation of the existing capabilities provided by those infrastructures and an anticipation of their future capabilities. Here, the infrastructures we discuss were often designed and created with specific applications in mind, or at least specific types of applications. The reader should understand how the interplay between the infrastructure providers and the users leads...

My thanks go to the many Mentat team members past and present that I have had the opportunity to ... more My thanks go to the many Mentat team members past and present that I have had the opportunity to work with over the years. All of you have helped build a system infrastructure from which some wonderful research has blossomed. This dissertation would not have been possible without these efforts. My examining committee, Bill Wulf, Andrew Grimshaw, James Ortega, Paul Reynolds, and James Aylor provided a careful reading of the dissertation and made many helpful suggestions. Special thanks go to my advisor Andrew Grimshaw who taught me that good research is based on commitment and hard work, but that great research is built on faith. His vision of a wide-area virtual computer has been an inspiration in my work. I am truly honored to be his first Ph.D. student. Robert Ferraro and the NASA-Jet Propulsion Laboratory supported me through a GSRP research fellowship. The fellowship provided a unique opportunity to collaborate and meet with NASA scientists. This collaboration and interaction improved the quality of this dissertation greatly. Finally, the support of my friends and family including my wife Susan, my brother Steve, and my parents, kept me feeling positive and helped me weather the tough times. My wife Susan was a constant source of motivation and understanding and it is with much love that I dedicate this dissertation to her.

CCGrid 2005. IEEE International Symposium on Cluster Computing and the Grid, 2005., 2005

This paper presents an architecture and implementation for a dynamic OGSA-based Grid service arch... more This paper presents an architecture and implementation for a dynamic OGSA-based Grid service architecture that extends GT3 to support dynamic service hosting-where to host and re-host a service within the Grid in response to service demand and resource fluctuation. Our model goes beyond current OGSI implementations in which the service is presumed to be "pre-installed" at all sites (and only service instantiation is dynamic). In dynamic virtual organizations (VOs), we believe dynamic service hosting provides an important flexibility. Our model also defines several new adaptive Grid service classes that support adaptation at multiple levels. Dynamic service deployment allows new services to be added or replaced without "taking down" a site for reconfiguration and allows a VO to respond effectively to dynamic resource availability and demand. The preliminary results suggest that the cost of dynamic installation, deployment, and invocation, is tolerable. 1

2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2016

One of the factors that limits the scale, performance, and sophistication of distributed applicat... more One of the factors that limits the scale, performance, and sophistication of distributed applications is the difficulty of concurrently executing them on multiple distributed computing resources. In part, this is due to a poor understanding of the general properties and performance of the coupling between applications and dynamic resources. This paper addresses this issue by integrating abstractions representing distributed applications, resources, and execution processes into a pilot-based middleware. The middleware provides a platform that can specify distributed applications, execute them on multiple resource and for different configurations, and is instrumented to support investigative analysis. We analyzed the execution of distributed applications using experiments that measure the benefits of using multiple resources, the late-binding of scheduling decisions, and the use of backfill scheduling.

2016 IEEE International Conference on Cloud Engineering (IC2E), 2016

Today, many organizations need to operate on data that is distributed around the globe. This is i... more Today, many organizations need to operate on data that is distributed around the globe. This is inevitable due to the nature of data that is generated in different locations such as video feeds from distributed cameras, log files from distributed servers, and many others. Although centralized cloud platforms have been widely used for data-intensive applications, such systems are not suitable for processing geo-distributed data due to high data transfer overheads. An alternative approach is to use an Edge Cloud which reduces the network cost of transferring data by distributing its computations globally. While the Edge Cloud is attractive for geo-distributed data-intensive applications, extending existing cluster computing frameworks to a wide-area environment must account for locality. We propose Awan : a new locality-aware resource manager for geo-distributed dataintensive applications. Awan allows resource sharing between multiple computing frameworks while enabling high locality scheduling within each framework. Our experiments with the Nebula Edge Cloud on PlanetLab show that Awan achieves up to a 28% increase in locality scheduling which reduces the average job turnaround time by approximately 18% compared to existing cluster management mechanisms.

Proceedings of the 2nd International Workshop on Software-Defined Ecosystems - BigSystem '15, 2015

Many Cloud applications exploit the diversity of storage options in a data center to achieve desi... more Many Cloud applications exploit the diversity of storage options in a data center to achieve desired cost, performance, and durability tradeoffs. It is common to see applications using a combination of memory, local disk, and archival storage tiers within a single data center to meet their needs. For example, hot data can be kept in memory using ElastiCache, and colder data in cheaper, slower storage such as S3, using Amazon as an example. For user-facing applications, a recent trend is to exploit multiple data centers for data placement to enable better latency of access from users to their data. The conventional wisdom is that co-location of computation and storage within the same data center is a key to application performance, so that applications running within a data center are often still limited to access local data. In this paper, using experiments on Amazon, Microsoft, and Google clouds, we show that this assumption is false, and that accessing data in nearby data centers may be faster than local access at different or even same points in the storage hierarchy. This can lead to not only better performance, but also reduced cost, simpler consistency policies and reconsidering data locality in multiple DCs environment. This argues for an expansion of cloud storage tiers to consider non-local storage options, and has interesting implications for the design of a distributed storage system.

Lecture Notes in Computer Science, 1997

Despite nearly 20 years of progress toward ubiquitous computer connectivity, distributed computin... more Despite nearly 20 years of progress toward ubiquitous computer connectivity, distributed computing systems have only recently emerged to play a serious role in industry and society. Perhaps this explains why so few distributed systems are reliable in the sense of tolerating failures automatically, guaranteeing properties such as performance or response time, or offering security against intentional threats. In many ways the engineering discipline of reliable distributed computing is still in its infancy.

Proceedings of the second international workshop on MapReduce and its applications - MapReduce '11, 2011

MapReduce is a highly-popular paradigm for high-performance computing over large data sets in lar... more MapReduce is a highly-popular paradigm for high-performance computing over large data sets in large-scale platforms. However, when the source data is widely distributed and the computing platform is also distributed, e.g. data is collected in separate data center locations, the most efficient architecture for running Hadoop jobs over the entire data set becomes non-trivial. In this paper, we show the traditional single-cluster MapReduce setup may not be suitable for situations when data and compute resources are widely distributed. Further, we provide recommendations for alternative (and even hierarchical) distributed MapReduce setup configurations, depending on the workload and data set.

2014 2nd IEEE International Conference on Mobile Cloud Computing, Services, and Engineering, 2014

In this paper, we present our vision for datadriven cloud-based mobile computing. We identify the... more In this paper, we present our vision for datadriven cloud-based mobile computing. We identify the concept of Region of interest (RoI) that reflects the profile of the user in how they access information or interact with applications. Such information enables a series of data-driven optimizations: filtering, aggregation, and speculation, that go beyond the wellresearched benefit of mobile outsourcing. These optimizations can improve performance, reliability, and energy usage. A novel aspect of our approach is to exploit the unique ability of the cloud to collect and analyze large amounts of user profile data, cache shared data, and even enable sharing of computations, across different mobile users. We implement two exemplar mobile-cloud applications on an Android/Amazon Elastic Cloud Compute (EC2)-based mobile outsourcing platform, that utilize the RoI abstraction for data-driven optimizations. We present results driven by workload traces derived from Twitter feeds and Wikipedia document editing to illustrate the opportunities of using such optimizations.

Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid '07), 2007

Scientific computing is being increasingly deployed over volunteer-based distributed computing en... more Scientific computing is being increasingly deployed over volunteer-based distributed computing environments consisting of idle resources on donated user machines. A fundamental challenge in these environments is the dissemination of data to the computation nodes, with the successful completion of jobs being driven by the efficiency of collective data download across compute nodes, and not only the individual download times. This paper considers the use of a data network consisting of data distributed across a set of data servers, and focuses on the server selection problem: how do individual nodes select a server for downloading data to minimize the communication makespan-the maximal download time for a data file. Through experiments conducted on a Pastry network running on PlanetLab, we demonstrate that nodes in a volunteer-based network are heterogeneous in terms of several metrics, such as bandwidth, load, and capacity, which impact their download behavior. We propose new server selection heuristics that incorporate these metrics, and demonstrate that these heuristics outperform traditional proximity-based server selection, reducing average makespans by at least 30%. We further show that incorporating information about download concurrency avoids overloading servers, and improves performance by about 17-43% over heuristics considering only proximity and bandwidth.

IEEE Transactions on Cloud Computing, 2016

MapReduce has proven remarkably effective for a wide variety of data-intensive applications, but ... more MapReduce has proven remarkably effective for a wide variety of data-intensive applications, but it was designed to run on large single-site homogeneous clusters. Researchers have begun to explore the extent to which the original MapReduce assumptions can be relaxed, including skewed workloads, iterative applications, and heterogeneous computing environments. This paper continues this exploration by applying MapReduce across geo-distributed data over geo-distributed computation resources. Using Hadoop, we show that network and node heterogeneity and the lack of data locality lead to poor performance, because the interaction of MapReduce phases becomes pronounced in the presence of heterogeneous network behavior. To address these problems, we take a two-pronged approach: We first develop a model-driven optimization that serves as an oracle, providing high-level insights. We then apply these insights to design cross-phase optimization techniques that we implement and demonstrate in a real-world MapReduce implementation. Experimental results in both Amazon EC2 and PlanetLab show the potential of these techniques as performance is improved by 7%-18% depending on the execution environment and application.

Proceedings of the 15th International Middleware Conference on - Middleware '14, 2014

A system that is aware of the different storage's • Performance characteris7cs • Interfaces • Dur... more A system that is aware of the different storage's • Performance characteris7cs • Interfaces • Durability characteris7cs • Cost model

2012 IEEE Fifth International Conference on Cloud Computing, 2012

Mobile devices, such as smart phones and tablets, are becoming the universal interface to online ... more Mobile devices, such as smart phones and tablets, are becoming the universal interface to online services and applications. However, such devices have limited computational power and battery life, which limits their ability to execute resource-intensive applications. Computation outsourcing to external resources has been proposed as a technique to alleviate this problem. Most existing work on mobile outsourcing has focused on either single application optimization or outsourcing to fixed, local resources, with the assumption that wide-area latency is prohibitively high. However, the opportunity of improving the outsourcing performance by utilizing the relation among multiple applications and optimizing the server provisioning is neglected. In this paper, we present the design and implementation of an Android/Amazon EC2-based mobile application outsourcing framework, leveraging the cloud for scalability, elasticity, and multiuser code/data sharing. Using this framework, we empirically demonstrate that the cloud is not only feasible but desirable as an offloading platform for latency-tolerant applications. We have proposed to use data mining techniques to detect data sharing across multiple applications, and developed novel scheduling algorithms that exploit such data sharing for better outsourcing performance. Additionally, our platform is designed to dynamically scale to support a large number of mobile users concurrently. Experiments show that our proposed techniques and algorithms substantially improve application performance, while achieving high efficiency in terms of computation resource and network usage.

2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum, 2013

Distributed data-intensive workflow applications are increasingly relying on and integrating remo... more Distributed data-intensive workflow applications are increasingly relying on and integrating remote resources including community data sources, services, and computational platforms. Increasingly, these are made available as data, SAAS, and IAAS clouds. The execution of distributed data-intensive workflow applications can expose network bottlenecks between clouds that compromise performance. In this paper, we focus on alleviating network bottlenecks by using a proxy network. In particular, we show how proxies can eliminate network bottlenecks by smart routing and perform in-network computations to boost workflow application performance. A novel aspect of our work is the inclusion of multiple proxies to accelerate different workflow stages optimizing different performance metrics. We show that the approach is effective for workflow applications and broadly applicable. Using Montage 1 as an exemplar workflow application, results obtained through experiments on PlanetLab showed how different proxies acting in a variety of roles can accelerate distinct stages of Montage. Our microbenchmarks also show that routing data through select proxies can accelerate network transfer for TCP/UDP bandwidth, delay, and jitter, in general.

2014 International Conference on Collaboration Technologies and Systems (CTS), 2014

Centralized cloud infrastructures have become the de-facto platform for data-intensive computing ... more Centralized cloud infrastructures have become the de-facto platform for data-intensive computing today. However, they suffer from inefficient data mobility due to the centralization of cloud resources, and hence, are highly unsuited for disperseddata-intensive applications, where the data may be spread at multiple geographical locations. In this paper, we present Nebula: a dispersed cloud infrastructure that uses voluntary edge resources for both computation and data storage. We describe the lightweight Nebula architecture that enables distributed dataintensive computing through a number of optimizations including location-aware data and computation placement, replication, and recovery. We evaluate Nebula's performance on an emulated volunteer platform that spans over 50 PlanetLab nodes distributed across Europe, and show how a common data-intensive computing framework, MapReduce, can be easily deployed and run on Nebula. We show Nebula MapReduce is robust to a wide array of failures and substantially outperforms other wide-area versions based on a BOINC like model.

2008 The 28th International Conference on Distributed Computing Systems, 2008

Large-scale distributed systems provide an attractive scalable infrastructure for network applica... more Large-scale distributed systems provide an attractive scalable infrastructure for network applications. However, the loosely-coupled nature of this environment can make data access unpredictable, and in the limit, unavailable. We introduce the notion of accessibility to capture both availability and performance. An increasing number of dataintensive applications require not only considerations of node computation power but also accessibility for adequate job allocations. For instance, selecting a node with intolerably slow connections can offset any benefit to running on a fast node. In this paper, we present accessibility-aware resource selection techniques by which it is possible to choose nodes that will have efficient data access to remote data sources. We show that the local data access observations collected from a node's neighbors are sufficient to characterize accessibility for that node. We then present resource selection heuristics guided by this principle, and show that they significantly outperform standard techniques. The suggested techniques are also shown to be stable even under churn despite the loss of prior observations.

Proceedings of the fourth international workshop on Data-intensive distributed computing - DIDC '11, 2011

Current cloud infrastructures are important for their ease of use and performance. However, they ... more Current cloud infrastructures are important for their ease of use and performance. However, they suffer from several shortcomings. The main problem is inefficient data mobility due to the centralization of cloud resources. We believe such clouds are highly unsuited for dispersed-data-intensive applications, where the data may be spread at multiple geographical locations (e.g., distributed user blogs). Instead, we propose a new cloud model called Nebula: a dispersed, context-aware, and cost-effective cloud. We provide experimental evidence for the need for Nebulas using a distributed blog analysis application followed by the system architecture and components of our system.

Cloud Computing for Data-Intensive Applications, 2014

MapReduce has been designed to accommodate large-scale data-intensive workloads running on large ... more MapReduce has been designed to accommodate large-scale data-intensive workloads running on large singlesite homogeneous clusters. Researchers have begun to explore the extent to which the original MapReduce assumptions can be relaxed including skewed workloads, iterative applications, and heterogeneous computing environments. Our work continues this exploration by applying MapReduce across widely distributed data over distributed computation resources. This problem arises when datasets are generated at multiple sites as is common in many scientific domains and increasingly ecommerce applications. It also occurs when multi-site resources such as geographically separated data centers are applied to the same MapReduce job. Using Hadoop, we show that the absence of network and node homogeneity and locality of data lead to poor performance. The problem is that interaction of MapReduce phases becomes pronounced in the presence of heterogeneous network behavior. In this paper, we propose new cross-phase optimization techniques that enable independent MapReduce phases to influence one another. We propose techniques that optimize the push and map phases to enable push-map overlap and to allow map behavior to feed back into push dynamics. Similarly, we propose techniques that optimize the map and reduce phases to enable shuffle cost to feed back and affect map scheduling decisions. We evaluate the benefits of our techniques in both Amazon EC2 and PlanetLab. The experimental results show the potential of these techniques as performance is improved from 7%-18% depending on the execution environment and application.

Journal of Parallel and Distributed Computing, 1994

A metasystem is a single computing resource composed of a heterogeneous group of autonomous compu... more A metasystem is a single computing resource composed of a heterogeneous group of autonomous computers linked together by a network. The interconnection network needed to construct large metasystems will soon be in place. To fully exploit these new systems, software that is easy to use, supports large degrees of parallelism, and hides the complexity of the underlying physical architecture must be developed. In this paper we describe our metasystem vision, our approach to constructing a metasystem testbed, and early experimental results. Our approach combines features from earlier work on both parallel processing systems and heterogeneous distributed computing systems. Using the testbed we have found that data coercion costs are not a serious obstacle to high performance, but that load imbalance induced by differing processor capabilities can limit performance. We then present a mechanism to overcome load imbalance that utilizes user-provided callbacks. 1