Kesheng Wu | Lawrence Berkeley National Laboratory
Papers by Kesheng Wu
Sensors, Jun 10, 2023
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
This document was prepared as an account of work sponsored by the United States Government. While this document is believed to contain correct information, neither the United States Government nor any agency thereof, nor The Regents of the University of California, nor any of their employees, makes any warranty, express or implied, or assumes any legal responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by its trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof, or The Regents of the University of California. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof, or The Regents of the University of California. This report has been reproduced directly from the best available copy.
Sensors and Actuators B: Chemical, 2023
2021 IEEE International Conference on Big Data (Big Data), Dec 15, 2021
Logistic regression has long been the gold standard for choice modeling in the transportation field. Despite the rising popularity of machine learning (ML), few studies have applied it to predicting household vehicle transactions. To address this research gap, this paper presents a first use case of applying ML to predict household vehicle transaction decisions by leveraging a newly processed national panel data set. Model performance is reported for four ML models and the traditional multinomial logit model (MNL). Instead of treating the gold standard and the ML models as competitors, this paper uses ML tools to inform the MNL model-building process. We find that the two gradient-boosting-based methods, CatBoost and LightGBM, are the best-performing ML models, and that improving logistic models with SHAP interpretation tools can achieve performance comparable to that of the best-performing ML methods.
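As an illustration of the kind of workflow described above (not the authors' actual code), the following sketch trains a LightGBM gradient-boosting classifier on hypothetical household-transaction features and ranks them by mean absolute SHAP value, the sort of output that could inform an MNL specification. The data file and all column names are assumptions, and the features are assumed to be numerically encoded.

```python
# Hedged sketch: gradient-boosting choice model with SHAP-based feature
# ranking. The data file and column names below are assumptions.
import numpy as np
import pandas as pd
import lightgbm as lgb
import shap
from sklearn.model_selection import train_test_split

df = pd.read_csv("household_panel.csv")                     # hypothetical data
features = ["income", "hh_size", "n_vehicles", "oldest_veh_age", "home_owner"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["transaction"], test_size=0.2, random_state=0)

model = lgb.LGBMClassifier(n_estimators=300, learning_rate=0.05)
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))

# Mean absolute SHAP value per feature; highly ranked features and their
# interactions are candidates for enriching the MNL specification.
sv = np.asarray(shap.TreeExplainer(model).shap_values(X_test))
axes = tuple(i for i in range(sv.ndim) if sv.shape[i] != len(features))
ranking = pd.Series(np.abs(sv).mean(axis=axes), index=features)
print(ranking.sort_values(ascending=False))
```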
In this work, we study the use of decision tree-based models to predict the transfer rates in different parts of the data pipeline that sends experiment data from the Linac Coherent Light Source (LCLS) at SLAC National Accelerator Laboratory (SLAC) to the National Energy Research Scientific Computing Center (NERSC). The system monitoring the data pipeline collects a number of characteristics, such as the file size, source file system, and start time, all of which are known at the start of the file transfer. However, these static variables do not capture dynamic information such as the current state of the network. In this work, we explore a number of different ways to capture the state of the network and other dynamic information. We find that using these dynamic features in addition to the static features can improve the transfer performance predictions by up to 10-15%. We additionally study several well-known decision-tree-based models and find that the gradient tree boosting algorithm performs best overall.
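A minimal sketch of the static-versus-dynamic feature comparison described above, assuming a hypothetical transfer log in a pandas DataFrame; the column names and the use of scikit-learn's GradientBoostingRegressor are illustrative, not the paper's implementation, and categorical fields are assumed to be numerically encoded.

```python
# Hedged sketch: compare transfer-rate predictions with and without
# dynamic (network-state) features. All column names are assumptions.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

log = pd.read_csv("transfer_log.csv")                 # hypothetical monitoring log
static = ["file_size", "source_fs_id", "start_hour"]  # known at submission time
dynamic = ["active_transfers", "recent_mean_rate"]    # network-state proxies
target = log["transfer_rate"]

def cv_r2(columns):
    """Cross-validated R^2 for a gradient-boosting model on the given columns."""
    model = GradientBoostingRegressor(n_estimators=200, max_depth=4)
    return cross_val_score(model, log[columns], target, cv=5, scoring="r2").mean()

print("static only       R^2:", cv_r2(static))
print("static + dynamic  R^2:", cv_r2(static + dynamic))
```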
Toxins, Jun 5, 2019
In this paper, a highly sensitive plasmonic enzyme-linked immunosorbent assay (pELISA) was developed for the naked-eye detection of fumonisin B1 (FB1). Glucose oxidase (GOx) was used as an alternative to horseradish peroxidase as the carrier of the competing antigen. GOx catalyzed the oxidation of glucose to produce hydrogen peroxide, which acted as a reducing agent to reduce Au3+ to Au on the surface of gold seeds (5 nm). This reaction led to a color change in the solution from colorless to purple, which was observable to the naked eye. Various parameters that could influence the detection performance of the pELISA were investigated. The developed method exhibited considerably high sensitivity for qualitative naked-eye detection of FB1, with a visible cutoff limit of 1.25 ng/mL. Moreover, the proposed pELISA showed a good linear range of 0.31-10 ng/mL with a half-maximal inhibitory concentration (IC50) of 1.86 ng/mL, which was approximately 13-fold lower than that of a conventional horseradish peroxidase (HRP)-based ELISA. Meanwhile, the proposed method was highly specific and accurate. In summary, the new pELISA exhibited acceptable accuracy and precision for sensitive naked-eye detection of FB1 in maize samples and can be applied to the detection of other chemical contaminants.
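Figures of merit such as the IC50 and linear range quoted above are typically obtained by fitting a four-parameter logistic (4PL) curve to the competitive-ELISA standards. The sketch below shows that generic calculation with scipy on invented absorbance readings; it is not the authors' data or analysis code.

```python
# Hedged sketch: fit a four-parameter logistic (4PL) curve to competitive
# ELISA standards and report the IC50. Concentrations and absorbances are
# invented placeholders for illustration only.
import numpy as np
from scipy.optimize import curve_fit

conc = np.array([0.16, 0.31, 0.63, 1.25, 2.5, 5.0, 10.0])      # ng/mL (assumed)
absorb = np.array([1.05, 0.98, 0.85, 0.63, 0.42, 0.27, 0.18])  # placeholder A450

def four_pl(x, top, bottom, ic50, slope):
    """Standard 4PL dose-response curve for a competitive assay."""
    return bottom + (top - bottom) / (1.0 + (x / ic50) ** slope)

params, _ = curve_fit(four_pl, conc, absorb, p0=[1.1, 0.1, 1.0, 1.0])
top, bottom, ic50, slope = params
print(f"fitted IC50 ~ {ic50:.2f} ng/mL")
```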
Large scientific projects are increasingly relying on analyses of data for their new discoveries, and a number of different data management systems have been developed to serve these scientific projects. In this work-in-progress paper, we describe an effort to understand the data access patterns of one of these data management systems, dCache. This particular deployment of dCache acts as a disk cache in front of a large tape storage system primarily containing high-energy physics data. Based on 15 months of dCache logs, the cache accesses the tape system only once for every 50 or more file requests, which indicates that it is effective as a disk cache. The on-disk files are repeatedly used, more than three times a day on average. We have also identified a number of unusual access patterns that are worth further investigation. CCS CONCEPTS • Information systems → Information storage technologies; • Computing methodologies → Model development and analysis.
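A minimal sketch, under assumed log fields, of the two summary statistics mentioned above: the fraction of requests that reach tape (cache misses) and the average daily re-use of on-disk files. The columns shown are not the dCache billing-log schema, just placeholders.

```python
# Hedged sketch: derive cache-miss ratio and file re-use from access logs.
# Column names (`timestamp`, `file_id`, `served_from`) are assumptions.
import pandas as pd

log = pd.read_csv("dcache_access_log.csv", parse_dates=["timestamp"])

# Fraction of requests that had to go to the tape back end (cache misses).
miss_ratio = (log["served_from"] == "tape").mean()
print(f"requests served from tape: {miss_ratio:.2%}")

# Average number of times a file is read per day while it sits on disk.
per_day = log.groupby([log["timestamp"].dt.date, "file_id"]).size()
print(f"mean reads per file per day: {per_day.mean():.2f}")
```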
Foods, Jun 13, 2022
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Sensors and Actuators B: Chemical, Jun 1, 2018
This study reports a novel enzyme-induced metallization plasmonic enzyme-linked immunosorbent assay (pELISA) for ultrasensitive detection of ochratoxin A (OTA) in rice, corn, wheat, and white wine samples. OTA-labeled urease was used as the competing antigen to hydrolyze urea into ammonia. In the presence of the ammonia, silver ions were reduced by the formyl group of glucose to generate a silver shell on the surface of gold nanoflowers (33 nm, AuNFs), and the color of the solution changed from blue to brownish red. Various parameters that influenced the sensitivity of the colorimetric pELISA were investigated and optimized. Under the optimized conditions, the colorimetric pELISA exhibited high sensitivity for qualitative detection of OTA by the naked eye, with a cutoff limit of 40 pg/mL, and a favorable linear range of 5.0-640 pg/mL for quantitative detection of OTA with a limit of detection of 8.205 pg/mL. These values are 15.6- and 14.3-fold lower than those of a horseradish peroxidase (HRP)-based ELISA. The method also showed excellent specificity against four other mycotoxins, including deoxynivalenol, zearalenone, fumonisin B1, and aflatoxin B1. Moreover, the recoveries for OTA-spiked rice, corn, wheat, and white wine samples ranged from 81.5% to 106%, with coefficients of variation ranging from 6.13% to 18.7%. These results showed good agreement with those obtained by an ultra-performance liquid chromatography-fluorescence detector (UPLC-FLD) method. Hence, the proposed method exhibits excellent robustness and reliability for quantitative detection of OTA in different food samples. This work provides a simple, sensitive, robust, and high-throughput screening method for qualitative or quantitative detection of mycotoxins and other pollutants in food safety monitoring.
Journal of Big Data, May 17, 2023
Large-scale high performance computing (HPC) systems typically consist of many thousands of CPUs and storage units used by hundreds to thousands of users simultaneously. Applications from large numbers of users have diverse characteristics, such as varying computation, communication, memory, and I/O intensity. A good understanding of the performance characteristics of each user application is important for job scheduling and resource provisioning. Among these performance characteristics, I/O performance is becoming increasingly important as data sizes rapidly increase and large-scale applications, such as simulation and model training, are widely adopted. However, predicting I/O performance is difficult because I/O systems are shared among all users and involve many layers of the software and hardware stack, including the application, network interconnect, operating system, file system, and storage devices. Furthermore, updates to these layers and changes in system management policy can significantly alter the I/O behavior of applications and the entire system. To improve the prediction of I/O performance on HPC systems, we propose integrating information from several different system logs and developing a regression-based approach to predict the I/O performance. Our proposed scheme can dynamically select the most relevant features from the log entries using various feature selection algorithms and scoring functions, and can automatically select the regression algorithm with the best accuracy for the prediction task. The evaluation results show that our proposed scheme can predict write performance with up to 90% accuracy and read performance with up to 99% accuracy using real logs from the Cori supercomputer system at NERSC.
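A hedged sketch of the feature-selection-plus-model-selection loop described above, built from generic scikit-learn pieces rather than the paper's actual pipeline. The merged-log file, its column names, and the candidate values of k are assumptions (k must not exceed the number of available features).

```python
# Hedged sketch: pick the best (scoring function, k, regressor) combination
# by cross-validation for predicting I/O bandwidth from merged system logs.
# The log file and column names are assumptions.
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression, mutual_info_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

logs = pd.read_csv("merged_system_logs.csv")          # hypothetical merged logs
X, y = logs.drop(columns=["write_bw"]), logs["write_bw"]

best = None
for score_fn in (f_regression, mutual_info_regression):
    for k in (5, 10, 20):  # assumes at least 20 candidate feature columns
        for reg in (Ridge(), RandomForestRegressor(200), GradientBoostingRegressor()):
            pipe = make_pipeline(SelectKBest(score_fn, k=k), reg)
            r2 = cross_val_score(pipe, X, y, cv=5, scoring="r2").mean()
            if best is None or r2 > best[0]:
                best = (r2, score_fn.__name__, k, type(reg).__name__)

print("best configuration (R^2, scoring fn, k, regressor):", best)
```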
2022 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid), May 1, 2022
Neural networks are powerful solutions for many scientific applications; however, they usually suffer from long model training times because both the data size and the model size are typically large. Research has focused on developing numerical optimization algorithms and parallel processing to reduce the training time. In this work, we propose a multi-resolution strategy that can reduce the training time by training the model with reduced-resolution data samples at the beginning and later switching to the original-resolution data samples. This strategy is motivated by the fact that many scientific applications run faster when using a coarse version of the problem, for example, data whose resolution has been reduced statistically. When applying this idea to neural network training, coarse data can have a similar effect on the learning curves at the early stage as the dense data but requires less time. Once the curves no longer improve significantly, our strategy switches to using the data at the original resolution. We use two real-world scientific applications, CosmoFlow and DeepCAM, to evaluate the proposed mixed-resolution training strategy. Our experimental results demonstrate that the proposed training strategy effectively reduces the end-to-end training time while achieving accuracy comparable to that of training only with the original-resolution data. While maintaining the same model accuracy, our multi-resolution training strategy reduces the end-to-end training time by up to 30% and 23% for CosmoFlow and DeepCAM, respectively.
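A minimal PyTorch sketch of the switching idea described above: train on downsampled batches until the loss plateaus, then continue at the original resolution. The model, data loader, loss, and plateau threshold are placeholders, not the CosmoFlow/DeepCAM setup; the model is assumed to accept variable input resolution (e.g., a fully convolutional network).

```python
# Hedged sketch: multi-resolution training schedule. `model`, the loader,
# and the plateau criterion are hypothetical placeholders.
import torch
import torch.nn.functional as F

def train_epoch(model, loader, optimizer, downsample=False):
    """One epoch; optionally coarsen inputs to halve the spatial resolution."""
    model.train()
    total = 0.0
    for x, y in loader:
        if downsample:
            x = F.interpolate(x, scale_factor=0.5, mode="nearest")
        optimizer.zero_grad()
        loss = F.mse_loss(model(x), y)
        loss.backward()
        optimizer.step()
        total += loss.item()
    return total / len(loader)

def multires_fit(model, loader, optimizer, epochs=50, plateau_tol=1e-3):
    """Start with coarse data; switch to full resolution once loss plateaus."""
    prev, coarse = float("inf"), True
    for epoch in range(epochs):
        loss = train_epoch(model, loader, optimizer, downsample=coarse)
        if coarse and prev - loss < plateau_tol:
            coarse = False            # curves stopped improving: go full-res
        prev = loss
        print(f"epoch {epoch}: loss={loss:.4f} coarse={coarse}")
```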
The demands of increasingly large scientific application workflows lead to the need for more powerful supercomputers. As the scale of supercomputing systems has grown, failure prediction has become an increasingly critical area of study, since predicting system failures in advance can improve performance by allowing checkpoints to be saved before a failure occurs. We propose a real-time failure detection algorithm that adopts an event-based prediction model. The prediction model is a convolutional neural network that utilizes both traditional event attributes and additional spatio-temporal features. We present a case study using our proposed method with six years of reliability, availability, and serviceability event logs recorded by Mira, a Blue Gene/Q supercomputer at Argonne National Laboratory. In the case study, we show that our failure prediction model is not limited to predicting the occurrence of failures in general; it is capable of accurately detecting specific types of critical failures, such as coolant and power problems, within reasonable lead-time ranges. Our case study shows that the proposed method can achieve an F1 score of 0.56 for general failures, 0.97 for coolant failures, and 0.86 for power failures.
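As a small illustration of how per-failure-type scores like those above can be computed (not the paper's evaluation code), the sketch below uses scikit-learn's f1_score with invented label arrays standing in for windowed predictions on RAS event logs.

```python
# Hedged sketch: per-class F1 for a failure predictor. The label arrays are
# invented placeholders, not results from the Mira logs.
from sklearn.metrics import f1_score

classes = ["general", "coolant", "power"]
y_true = ["general", "coolant", "no_failure", "power", "coolant", "general"]
y_pred = ["general", "coolant", "no_failure", "power", "coolant", "no_failure"]

scores = f1_score(y_true, y_pred, labels=classes, average=None)
for name, score in zip(classes, scores):
    print(f"F1({name}) = {score:.2f}")
```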
The Journal of Financial Data Science, Oct 31, 2019
As algorithms replace a growing number of tasks performed by humans in the markets, there have been growing concerns about an increased likelihood of cascading events, similar to the Flash Crash of May 6, 2010. To address these concerns, researchers have employed a number of scientific data analysis tools to monitor the risk of such cascading events. As an example, the authors of this article investigate the natural gas (NG) futures market in the frequency domain and the interaction between weather forecasts and NG price data. They observe that Fourier components with high frequencies have become more prominent in recent years and are much stronger than could be expected from an analytical model of the market. Additionally, a significant amount of trading activity occurs in the first few seconds of every minute, which is a tell-tale sign of time-based algorithmic trading. To illustrate the potential for cascading events, the authors further study how weather forecasts drive NG prices and show that, after separating the time series by season to account for the different mechanisms that relate temperature to NG price, the temperature forecast is indeed cointegrated with the NG price. They also show that variations in temperature forecasts contribute to a significant percentage of the average daily price fluctuations, which confirms the possibility that a forecast error could significantly affect the price of NG futures. TOPICS: Statistical methods, simulations, big data/machine learning. Key findings: high-frequency components in the trading data are stronger than expected from a model assuming uniform trading during market hours; the dominance of the high-frequency components has been increasing over the years; and relatively small changes in temperature could create large price fluctuations in natural gas futures contracts.
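A hedged sketch of the two analyses named above, applied to synthetic placeholder series: a Fourier amplitude spectrum of per-second trading volume (which surfaces a once-per-minute component like the one attributed to time-based algorithms) and an Engle-Granger cointegration test between a temperature forecast and the NG price. The paper's actual data and procedures are not reproduced.

```python
# Hedged sketch: frequency-domain view of trading activity plus a
# cointegration test. All series below are synthetic placeholders.
import numpy as np
from statsmodels.tsa.stattools import coint

rng = np.random.default_rng(0)

# Per-second trading volume for one trading day (synthetic): background noise
# plus a burst at the start of every minute, mimicking time-based algorithms.
n_seconds = int(6.5 * 3600)
volume = rng.poisson(5, n_seconds).astype(float)
volume[::60] += 50                                  # spike at the top of each minute

spectrum = np.abs(np.fft.rfft(volume - volume.mean()))
freqs = np.fft.rfftfreq(volume.size, d=1.0)         # cycles per second
print("strongest frequency (Hz):", freqs[spectrum.argmax()])   # ~1/60 Hz

# Engle-Granger cointegration test between a temperature forecast and the NG
# price (synthetic series sharing a common stochastic trend).
trend = np.cumsum(rng.normal(size=500))
temp_forecast = trend + rng.normal(scale=0.5, size=500)
ng_price = 2.0 + 0.1 * trend + rng.normal(scale=0.5, size=500)
t_stat, p_value, _ = coint(ng_price, temp_forecast)
print(f"cointegration p-value: {p_value:.3f}")
```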
For in transit processing, one of the fundamental challenges is the efficient movement of data from producers to consumers. Exploiting the flexibility offered by the SENSEI generic in situ framework, we have developed a number of different in transit data transport mechanisms. In this work, we focus on the transport mechanism that leverages the HDF5 parallel I/O library, and investigate the performance characteristics of this transport mechanism. For in transit use cases at scale on HPC platforms, one might expect that an in transit data transport mechanism that uses faster layers of the storage hierarchy, such as DRAM memory, would always outperform a transport that uses slower layers of the storage hierarchy, such as an NVRAM-based persistent storage presented as a distributed file system. However, our test results show that the performance of the transport using NVRAM is competitive with the transport that uses socket-based data movement across varying levels of producer and consumer concurrency. CCS CONCEPTS • Software and its engineering → Massively parallel systems; • Theory of computation → Parallel computing models; • Computing methodologies → Massively parallel algorithms; Massively parallel and high-performance simulations;
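To make the file-system-based transport idea concrete, here is a minimal serial h5py sketch of a producer writing an array that a separate consumer later reads back; it is a generic illustration only and does not use the SENSEI API, its HDF5 transport, or parallel HDF5. The staging path and dataset name are placeholders.

```python
# Hedged sketch: file-based data handoff through HDF5, as a stand-in for an
# in transit transport that stages data on a (possibly NVRAM-backed) file
# system. The path and dataset name are placeholders.
import numpy as np
import h5py

STAGING_PATH = "/tmp/staged_step_0001.h5"   # assumed staging location

def producer(step_data: np.ndarray) -> None:
    """Simulation side: write one time step for downstream analysis."""
    with h5py.File(STAGING_PATH, "w") as f:
        f.create_dataset("fields/density", data=step_data, compression="gzip")

def consumer() -> np.ndarray:
    """Analysis side: read the staged time step."""
    with h5py.File(STAGING_PATH, "r") as f:
        return f["fields/density"][:]

if __name__ == "__main__":
    producer(np.random.rand(64, 64, 64))
    print("consumed array with shape", consumer().shape)
```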
Accurately predicting network traffic volume is beneficial for congestion control, routing improvement, network resource allocation, and network optimization. Traffic congestion happens when a network device receives more data packets than it can process. The number of retransmissions per flow, packet duplication, and synthetic reordering can seriously degrade overall TCP performance. We propose an unsupervised/supervised technique to accurately identify TCP anomalies occurring during file transfers, based on passive measurements of TCP traffic collected using Tstat. This method will be validated on large real-world datasets collected from several data transfer nodes. The preliminary results indicate that the percentage of TCP anomalies correlates well with the average throughput in any given time window.
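A hedged sketch of the unsupervised side of such an approach: flag anomalous flows with an Isolation Forest over Tstat-like per-flow features, then correlate the per-window anomaly fraction with average throughput. The column names and the choice of Isolation Forest are assumptions, not the paper's exact method.

```python
# Hedged sketch: unsupervised TCP anomaly flagging and its correlation with
# throughput per time window. Feature and column names are assumptions.
import pandas as pd
from sklearn.ensemble import IsolationForest

flows = pd.read_csv("tstat_flows.csv", parse_dates=["start_time"])
features = ["retransmissions", "duplicate_pkts", "reordering", "rtt_avg"]

clf = IsolationForest(contamination=0.05, random_state=0)
flows["anomaly"] = clf.fit_predict(flows[features]) == -1   # -1 marks outliers

# Per 5-minute window: fraction of anomalous flows vs. mean throughput.
windows = flows.set_index("start_time").resample("5min").agg(
    {"anomaly": "mean", "throughput_mbps": "mean"})
print(windows.corr())
```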
Distributed and Parallel Databases, Feb 28, 2019