Ahsan Javed Awan | Ericsson

Papers by Ahsan Javed Awan

Research paper thumbnail of NMPO: Near-Memory Computing Profiling and Offloading

Euromicro Digital System Design, 2021

Real-world applications now process big data sets and are often bottlenecked by the data movement between the compute units and the main memory. Near-memory computing (NMC), a modern data-centric computational paradigm, can alleviate these bottlenecks and thereby improve application performance. The lack of available NMC systems makes simulators the primary evaluation tool for performance estimation. However, simulators are usually time-consuming, and methods that reduce this overhead would accelerate the early-stage design process of NMC systems. This work proposes Near-Memory computing Profiling and Offloading (NMPO), a high-level framework capable of predicting NMC offloading suitability using an ensemble machine learning model. NMPO predicts NMC suitability with an accuracy of 85.6% and, compared to prior works, can reduce the prediction time by up to three orders of magnitude by using hardware-dependent application features.
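
To make the prediction step concrete, the sketch below shows how an ensemble classifier could label kernels as NMC-offload candidates from profiling features. It is a minimal illustration under stated assumptions, not NMPO's actual feature set or model: the feature names and the labeling rule are placeholders.

```python
# Minimal sketch (not the paper's implementation): an ensemble classifier that
# labels code regions as NMC-offload candidates from profiling features.
# The feature names and the labeling rule below are illustrative placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Hypothetical per-kernel features: memory entropy, spatial locality,
# arithmetic intensity, data-level parallelism.
X = rng.random((500, 4))
# Hypothetical label: 1 = kernel benefits from NMC offloading, 0 = keep on host.
y = (X[:, 0] > 0.6).astype(int)          # stand-in rule for the example only

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("offload-suitability accuracy:", accuracy_score(y_te, model.predict(X_te)))
```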

Research paper thumbnail of The Future of Cloud Computing: Highly Distributed with Heterogeneous Hardware

Ericsson Technology Review, 2020

With a vastly distributed system (the telco network) already in place, the telecom industry has a significant advantage in the transition toward distributed cloud computing. To deliver best-in-class application performance, however, operators must also have the ability to fully leverage heterogeneous compute and storage capabilities.

Research paper thumbnail of Near Memory Acceleration on High Resolution Radio Astronomy Imaging

MECO, 2020

Modern radio telescopes like the Square Kilometer Array (SKA) will need to process exabytes of radio-astronomical signals in real time to construct a high-resolution map of the sky. Near-Memory Computing (NMC) could alleviate the performance bottlenecks due to frequent memory accesses in a state-of-the-art radio-astronomy imaging algorithm. In this paper, we show, using CPI breakdown analysis on IBM Power9, that a sub-module performing a two-dimensional fast Fourier transform (2D FFT) is memory bound. Then, we present an NMC approach on FPGA for the 2D FFT that outperforms a CPU by up to a factor of 120x and performs comparably to a high-end GPU, while using less bandwidth and memory.
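
The memory-bound nature of the 2D FFT is easy to see when it is written as two 1-D passes: the row pass is unit-stride, while the column pass strides by the full row length. The NumPy sketch below only illustrates this structure; it is not the paper's FPGA kernel.

```python
# Illustrative sketch: a 2D FFT as two 1-D passes. The column pass walks memory
# with a stride equal to the row length, which is what makes large grids
# memory-bound on cache-based CPUs. This is not the paper's FPGA kernel.
import numpy as np

def fft2d(grid: np.ndarray) -> np.ndarray:
    out = np.fft.fft(grid, axis=1)   # row pass: unit-stride accesses
    out = np.fft.fft(out, axis=0)    # column pass: strided accesses (cache-unfriendly)
    return out

grid = np.random.rand(1024, 1024)
assert np.allclose(fft2d(grid), np.fft.fft2(grid))
```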

Research paper thumbnail of K-Means Clustering on Noisy Intermediate Scale Quantum Computers

pre-print, 2019

Real-time clustering of the big performance data generated by telecommunication networks requires domain-specific high-performance compute infrastructure to detect anomalies. In this paper, we evaluate noisy intermediate-scale quantum (NISQ) computers, characterized by low decoherence times, for K-means clustering and propose three strategies to generate the shorter-depth quantum circuits needed to overcome the limitations of NISQ computers. The strategies are based on exploiting: i) quantum interference, ii) negative rotations and iii) destructive interference. By comparing our implementations on the IBMQX2 machine for representative data sets, we show that NISQ computers can solve the K-means clustering problem with the same level of accuracy as classical computers.
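
A minimal classical simulation of the "negative rotations" idea follows, assuming each normalized 2-D point is encoded as a single-qubit Ry angle: undoing the centroid's rotation and reading P(|0>) = cos^2((theta - phi)/2) gives a similarity score for cluster assignment. This is an illustration of the strategy under that encoding assumption, not the paper's circuits.

```python
# Classical simulation of the "negative rotations" idea on one qubit:
# encode a unit-norm 2-D point as Ry(theta)|0>, apply Ry(-phi) for a centroid
# with angle phi, and use P(|0>) = cos^2((theta - phi)/2) as similarity.
# Assignment goes to the centroid with the highest P(|0>). Illustrative only.
import numpy as np

def angle(v):
    v = v / np.linalg.norm(v)
    return 2 * np.arctan2(v[1], v[0])      # Ry(theta)|0> = (cos(theta/2), sin(theta/2))

def p_zero(point, centroid):
    return np.cos((angle(point) - angle(centroid)) / 2) ** 2

points = np.array([[1.0, 0.1], [0.2, 1.0], [0.9, 0.2]])
centroids = np.array([[1.0, 0.0], [0.0, 1.0]])
labels = [max(range(len(centroids)), key=lambda k: p_zero(p, centroids[k])) for p in points]
print(labels)   # expected: [0, 1, 0]
```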

Research paper thumbnail of Support Vector Machines on Noisy Intermediate Scale Quantum Computers

pre-print, 2019

Support vector machine algorithms are considered essential for the implementation of automation in a radio access network. Specifically, they are critical in predicting the quality of user experience for video streaming based on device- and network-level metrics. The quantum SVM is the quantum analogue of the classical SVM algorithm, which exploits the properties of quantum computers to speed up the algorithm exponentially. In this work, we derive an optimized preprocessing unit for a quantum SVM that allows classifying any two-dimensional dataset that is linearly separable. We further provide a result readout method for the kernel-matrix generation circuit that avoids quantum tomography and, in turn, reduces the quantum circuit depth. We also derive a quantum SVM system based on an optimized HHL quantum circuit with reduced circuit depth. Index Terms: quantum support vector machine, noisy intermediate-scale quantum computers, HHL algorithm.
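
For context, HHL-based quantum SVMs solve the least-squares SVM linear system [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y]. The sketch below solves that system classically on a toy linearly separable 2-D set; it is a baseline illustration under that standard LS-SVM formulation, not the paper's quantum circuit, and the data and gamma value are made up.

```python
# Classical least-squares SVM baseline: the linear system
# [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y] that HHL-based quantum SVMs
# target. Illustrative sketch with a linear kernel on a toy separable 2-D set.
import numpy as np

X = np.array([[1.0, 1.0], [2.0, 1.5], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
gamma = 10.0

K = X @ X.T                                    # linear kernel matrix
n = len(y)
A = np.zeros((n + 1, n + 1))
A[0, 1:] = 1.0
A[1:, 0] = 1.0
A[1:, 1:] = K + np.eye(n) / gamma
rhs = np.concatenate(([0.0], y))
sol = np.linalg.solve(A, rhs)
b, alpha = sol[0], sol[1:]

def predict(x):
    return np.sign(np.sum(alpha * (X @ x)) + b)

print([predict(p) for p in [np.array([1.5, 1.2]), np.array([-1.5, -1.2])]])  # [1.0, -1.0]
```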

Research paper thumbnail of Near-Memory Computing: Past, Present, and Future

Journal of Microprocessors and Microsystems, 2019

The conventional approach of moving data to the CPU for computation has become a significant performance bottleneck for emerging scale-out data-intensive applications due to their limited data reuse. At the same time, the advancement in 3D integration technologies has made the decade-old concept of coupling compute units close to the memory, called near-memory computing (NMC), more viable. Processing right at the "home" of the data can significantly diminish the data movement problem of data-intensive applications. In this paper, we survey the prior art on NMC across various dimensions (architecture, applications, tools, etc.) and identify the key challenges and open issues with future research directions. We also provide a glimpse of our approach to near-memory computing, which includes i) NMC-specific, microarchitecture-independent application characterization, ii) a compiler framework to offload NMC kernels onto our target NMC platform, and iii) an analytical model to evaluate the potential of NMC.
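
A back-of-the-envelope version of such an analytical model is sketched below: execution time is taken as the larger of data-movement time and compute time, with the near-memory side enjoying higher internal bandwidth but a weaker compute unit. All bandwidth, throughput and kernel numbers are illustrative assumptions, not figures from the paper.

```python
# Back-of-the-envelope analytical model (illustrative assumptions, not the
# paper's model): compare host execution, dominated by off-chip data movement,
# with near-memory execution that enjoys higher internal bandwidth.
def exec_time(bytes_moved, flops, bandwidth_gbps, compute_gflops):
    t_mem = bytes_moved / (bandwidth_gbps * 1e9)
    t_cmp = flops / (compute_gflops * 1e9)
    return max(t_mem, t_cmp)          # simple overlap (roofline-style) assumption

bytes_moved, flops = 64e9, 10e9       # a memory-intensive kernel: 64 GB touched, 10 GFLOP
t_host = exec_time(bytes_moved, flops, bandwidth_gbps=50,  compute_gflops=500)
t_nmc  = exec_time(bytes_moved, flops, bandwidth_gbps=320, compute_gflops=100)
print(f"host {t_host:.2f}s  nmc {t_nmc:.2f}s  speedup {t_host / t_nmc:.1f}x")
```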

Research paper thumbnail of Platform Independent Software Analysis for Near Memory Computing

Euromicro Conference on Digital System Design (DSD), 2019

Near-memory Computing (NMC) promises improved performance for applications that can exploit the features of emerging memory technologies such as 3D-stacked memory. However, it is not trivial to find such applications, and specialized tools are needed to identify them. In this paper, we present PISA-NMC, which extends a state-of-the-art hardware-agnostic profiling tool with metrics concerning memory and parallelism that are relevant for NMC. The metrics include memory entropy, spatial locality, and data-level and basic-block-level parallelism. By profiling a set of representative applications and correlating the metrics with the applications' performance on a simulated NMC system, we verify the importance of those metrics. Finally, we demonstrate which metrics are useful in identifying applications suitable for NMC architectures.
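
As an illustration of one such metric, the sketch below computes memory entropy as the Shannon entropy of the cache-line access histogram of an address trace; a streaming pattern concentrates its accesses on fewer distinct lines and scores lower than a scattered one. This is a simplified definition for illustration, not necessarily PISA-NMC's exact formulation.

```python
# Illustrative memory-entropy metric: Shannon entropy of the address access
# histogram at cache-line granularity. A sequential stream revisits few distinct
# lines relative to its length; a scattered stream spreads over many lines.
from collections import Counter
from math import log2
import random

def memory_entropy(addresses, line_bytes=64):
    lines = [a // line_bytes for a in addresses]
    counts = Counter(lines)
    total = len(lines)
    return -sum((c / total) * log2(c / total) for c in counts.values())

sequential = list(range(0, 64 * 1024, 8))                 # streaming access pattern
random.seed(0)
scattered = [random.randrange(0, 1 << 30) for _ in range(8192)]
print(memory_entropy(sequential), memory_entropy(scattered))  # scattered is higher
```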

Research paper thumbnail of Memory and Parallelism Analysis Using a Platform-Independent Approach

22nd ACM International Workshop on Software and Compilers for Embedded Systems (SCOPES '19), 2019

Emerging computing architectures such as near-memory computing (NMC) promise improved performance for applications by reducing the data movement between CPU and memory. However, detecting such applications is not a trivial task. In this ongoing work, we extend a state-of-the-art platform-independent software analysis tool with NMC-related metrics such as memory entropy, spatial locality, and data-level and basic-block-level parallelism. These metrics help to identify the applications more suitable for NMC architectures. CCS Concepts: Software and its engineering → Dynamic analysis.

Research paper thumbnail of A Review of Near-Memory Computing Architectures: Opportunities and Challenges

The conventional approach of moving stored data to the CPU for computation has become a major performance bottleneck for emerging scale-out data-intensive applications due to their limited data reuse. At the same time, the advancement in integration technologies has made the decade-old concept of coupling compute units close to the memory (called Near-Memory Computing) more viable. Processing right at the home of the data can greatly diminish the data movement problem of data-intensive applications. This paper focuses on analyzing and organizing the extensive body of literature on near-memory computing across various dimensions: starting from the memory level where this paradigm is applied, to the granularity of the applications that could be executed on the near-memory units. We highlight the challenges as well as the critical need for evaluation methodologies that can be employed in designing these special architectures. Using a case study, we present our methodology and also identify topics for future research to unlock the full potential of near-memory computing.

Research paper thumbnail of Large-Scale Clustering using MPI-based Canopy

Analyzing massive amounts of data and extracting value from it has become key across different disciplines. Clustering is a common technique for finding patterns in the data. Existing clustering algorithms require parameters to be set a priori. The parameters are usually determined through trial and error over several iterations or through pre-clustering algorithms, which do not scale well for massive amounts of data. In this paper, we therefore take one such pre-clustering algorithm, Canopy, and develop a parallel version based on MPI. As we show, doing so is not straightforward, and without optimization a considerable amount of time is spent waiting for synchronization, severely limiting scalability. We thus optimize our approach to spend as little time as possible on idle cores and synchronization barriers. As our experiments show, our approach scales near-linearly with increasing dataset size.
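
For reference, the core Canopy procedure uses a loose threshold T1 to form canopies and a tight threshold T2 to remove points from further consideration. The sequential sketch below shows only this core step that the MPI version distributes across ranks; it is not the paper's parallel implementation, and the data and thresholds are made up.

```python
# Sequential sketch of Canopy pre-clustering with loose/tight thresholds T1 > T2.
# The MPI version in the paper partitions the points across ranks; this sketch
# only shows the core algorithm that the parallel version distributes.
import numpy as np

def canopy(points, t1, t2):
    remaining = list(range(len(points)))
    canopies = []
    while remaining:
        center = remaining.pop(0)                      # pick the next canopy center
        dists = (np.linalg.norm(points[remaining] - points[center], axis=1)
                 if remaining else np.array([]))
        members = [center] + [remaining[i] for i in np.flatnonzero(dists < t1)]
        canopies.append(members)
        remaining = [remaining[i] for i in np.flatnonzero(dists >= t2)]  # drop tight matches
    return canopies

pts = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 8])
print(len(canopy(pts, t1=4.0, t2=1.5)), "canopies")
```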

Research paper thumbnail of Using Neuromorphic Hardware for the Scalable Execution of Massively Parallel, Communication-Intensive Algorithms

Neuromorphic hardware like SpiNNaker offers massive parallelism and efficient communication of small payloads to accelerate the simulation of spiking neurons in neural networks. In this paper, we demonstrate that this hardware is also beneficial for other applications which require massive parallelism and the large-scale exchange of small messages. More specifically, we study the scalability of PageRank on SpiNNaker and compare it to an implementation on traditional hardware. In our experiments, we show that PageRank on SpiNNaker scales better than on traditional multicore architectures.
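
PageRank maps naturally onto this model because each iteration can be expressed as one small message per edge. The plain-Python sketch below shows that message-passing formulation (dangling-node handling omitted); it is not the SpiNNaker implementation itself.

```python
# Message-passing formulation of PageRank (the style that maps onto SpiNNaker's
# small-packet communication): each vertex splits its rank among its out-edges
# and sends one small message per edge; ranks are rebuilt from received messages.
def pagerank(edges, n, d=0.85, iters=30):
    out_deg = [0] * n
    for src, _ in edges:
        out_deg[src] += 1
    rank = [1.0 / n] * n
    for _ in range(iters):
        inbox = [0.0] * n
        for src, dst in edges:                 # one "message" per edge
            inbox[dst] += rank[src] / out_deg[src]
        rank = [(1 - d) / n + d * r for r in inbox]
    return rank

edges = [(0, 1), (1, 2), (2, 0), (2, 1), (3, 2)]
print(pagerank(edges, n=4))
```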

Research paper thumbnail of Identifying the potential of Near Data Processing for Apache Spark

While cluster computing frameworks are continuously evolving to provide real-time data analysis capabilities, Apache Spark has managed to be at the forefront of big data analytics by being a unified framework for both batch and stream data processing. There is also a renewed interest in Near Data Processing (NDP) due to technological advancement in the last decade. However, it is not known whether NDP architectures can improve the performance of big data processing frameworks such as Apache Spark. In this paper, we build the case for an NDP architecture comprising programmable-logic-based hybrid 2D-integrated processing-in-memory and in-storage processing for Apache Spark, through extensive profiling of Apache Spark based workloads on an Ivy Bridge server.

Research paper thumbnail of Node Architecture Implications for In-Memory Data Analytics on Scale-in Clusters

While cluster computing frameworks are continuously evolving to provide real-time data analysis capabilities, Apache Spark has managed to be at the forefront of big data analytics. Recent studies propose scale-in clusters with in-storage processing devices to process big data analytics with Spark. However, the proposal is based solely on the memory-bandwidth characterization of in-memory data analytics and does not shed light on the specification of the host CPU and memory. Through empirical evaluation of in-memory data analytics with Apache Spark on an Ivy Bridge dual-socket server, we have found that (i) simultaneous multi-threading is effective up to 6 cores, (ii) data locality on NUMA nodes can improve the performance by 10% on average, (iii) disabling next-line L1-D prefetchers can reduce the execution time by up to 14%, (iv) DDR3 operating at 1333 MT/s is sufficient, and (v) multiple small executors can provide up to 36% speedup over a single large executor.

Research paper thumbnail of Micro-architectural Characterization of Apache Spark on Batch and Stream Processing Workloads

While cluster computing frameworks are continuously evolving to provide real-time data analysis capabilities, Apache Spark has managed to be at the forefront of big data analytics by being a unified framework for both batch and stream data processing. However, recent studies on micro-architectural characterization of in-memory data analytics are limited to batch processing workloads only. We compare the micro-architectural performance of batch processing and stream processing workloads in Apache Spark using hardware performance counters on a dual-socket server. In our evaluation experiments, we have found that batch processing and stream processing have the same micro-architectural behavior in Spark if the difference between the two implementations is micro-batching only. If the input data rates are small, stream processing workloads are front-end bound. However, the front-end-bound stalls are reduced at larger input data rates and instruction retirement is improved. Moreover, Spark workloads using DataFrames have improved instruction retirement over workloads using RDDs.

Research paper thumbnail of How Data Volume Affects Spark Based Data Analytics on a Scale-up Server

The sheer increase in the volume of data over the last decade has triggered research in cluster computing frameworks that enable web enterprises to extract big insights from big data. While Apache Spark is gaining popularity for exhibiting superior scale-out performance on commodity machines, the impact of data volume on the performance of Spark based data analytics in a scale-up configuration is not well understood. We present a deep-dive analysis of Spark based applications on a large scale-up server machine. Our analysis reveals that Spark based data analytics are DRAM bound and do not benefit from using more than 12 cores for an executor. By enlarging the input data size, application performance degrades significantly due to a substantial increase in wait time during I/O operations and garbage collection, despite a 10% better instruction retirement rate (due to lower L1 cache misses and higher core utilization). We match memory behavior with the garbage collector to improve the performance of applications by 1.6x to 3x.

Research paper thumbnail of Performance Characterization of In-Memory Data Analytics on a Modern Cloud Server

In the last decade, data analytics have rapidly progressed from traditional disk-based processing to modern in-memory processing. However, little effort has been devoted to performance enhancement at the micro-architecture level. This paper characterizes the performance of in-memory data analytics applications using the Apache Spark framework. It uses a single-node NUMA machine and identifies the bottlenecks hampering the scalability of the workloads. In doing so, it quantifies the inefficiencies at the micro-architectural level for various big data applications. Through empirical evaluation, we show that Spark workloads do not scale linearly beyond twelve threads, due to work-time inflation and thread-level load imbalance. Further, at the micro-architecture level, we observe that memory-bound latency is one of the major causes of work-time inflation.

Research paper thumbnail of Low Power High Speed Operational Amplifier Design Using Cadence

In this paper, the design space that optimizes the performance of an operational amplifier in terms of current consumption and unity-gain bandwidth product has been explored using Cadence. A two-stage, indirect-compensated, active-load cascode operational amplifier with a current consumption of 335 µA and a speed as high as 23 MHz is presented. A novel indirect-compensated multistage op-amp designed in AMS 0.35 µm technology further reduces the current to 120 µA and increases the speed to 35 MHz. The layouts of both designs incorporate the common-centroid method for improved matching among devices.
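
For context, the figure of merit being traded against current consumption follows the standard first-order two-stage op-amp relations below (textbook approximations, not results from the paper), where g_{m1} is the first-stage transconductance, C_c the compensation capacitor and I_tail the first-stage tail current.

```latex
f_{u} \approx \frac{g_{m1}}{2\pi C_{c}}, \qquad
\mathrm{SR} \approx \frac{I_{\mathrm{tail}}}{C_{c}}
```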

Research paper thumbnail of Evaluation of ANN, LDA and Decision Trees for EEG Based Brain Computer Interface

A Brain Computer Interface (BCI) is a communication system that bypasses the brain's normal output pathways of muscles and peripheral nerves and allows a patient to control the external world only by means of brain signals. For a successful implementation of a BCI, dimensionality reduction and classification are fundamental tasks. In this paper, we use a publicly available EEG dataset of upper-limb motion. First, the dimensionality of the data is reduced using Principal Component Analysis (PCA), followed by classification of the reduced-dimension dataset by well-known classifiers, i.e., Artificial Neural Networks (ANN), Linear Discriminant Analysis (LDA) and Decision Trees (DT). To identify the classifier that performs the classification task most efficiently, we compare their performance on the basis of confusion matrices and percentage accuracies. The experimental results show that ANN is the best classifier for the classification of brain signals, with an accuracy of 81.6%.
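
A minimal sketch of the described pipeline (PCA for dimensionality reduction, then LDA, an ANN and a decision tree compared via confusion matrices and accuracy) is shown below, with the EEG recordings replaced by a synthetic stand-in dataset; the hyperparameters are illustrative, not the paper's.

```python
# Minimal sketch of the described pipeline: PCA for dimensionality reduction,
# then LDA, a small neural network (ANN) and a decision tree compared by
# accuracy and confusion matrix. The EEG data is replaced by a synthetic stand-in.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 64))                       # stand-in for EEG feature vectors
y = (X[:, :8].sum(axis=1) > 0).astype(int)           # stand-in class labels

X = PCA(n_components=10).fit_transform(X)            # dimensionality reduction
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for name, clf in [("LDA", LinearDiscriminantAnalysis()),
                  ("ANN", MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0)),
                  ("DT", DecisionTreeClassifier(random_state=0))]:
    y_pred = clf.fit(X_tr, y_tr).predict(X_te)
    print(name, accuracy_score(y_te, y_pred))
    print(confusion_matrix(y_te, y_pred))
```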

Research paper thumbnail of AWG-Detector: A machine learning tool for the accurate detection of Anomalies due to Wind Gusts (AWG) in the adaptive Altitude control unit of an Aerosonde unmanned Aerial Vehicle

Thesis Chapters by Ahsan Javed Awan

Research paper thumbnail of Performance Characterization and Optimization of In-Memory Data Analytics on a Scale-up Server

The sheer increase in the volume of data over the last decade has triggered research in cluster computing frameworks that enable web enterprises to extract big insights from big data. While Apache Spark defines the state of the art in big data analytics platforms for (i) exploiting data-flow and in-memory computing and (ii) exhibiting superior scale-out performance on commodity machines, little effort has been devoted to understanding the performance of in-memory data analytics with Spark on modern scale-up servers. This thesis characterizes the performance of in-memory data analytics with Spark on scale-up servers. Through empirical evaluation of representative benchmark workloads on a dual-socket server, we have found that in-memory data analytics with Spark exhibit poor multi-core scalability beyond 12 cores due to thread-level load imbalance and work-time inflation (the additional CPU time spent by threads in a multi-threaded computation beyond the CPU time required to perform the same work in a sequential computation). We have also found that workloads are bound by the latency of frequent data accesses to memory. By enlarging the input data size, application performance degrades significantly due to the substantial increase in wait time during I/O operations and garbage collection, despite a 10% better instruction retirement rate (due to lower L1 cache misses and higher core utilization).

For data accesses, we have found that simultaneous multi-threading is effective in hiding the data latencies. We have also observed that (i) data locality on NUMA nodes can improve the performance by 10% on average and (ii) disabling next-line L1-D prefetchers can reduce the execution time by up to 14%. For the garbage collection impact, we match memory behavior with the garbage collector to improve the performance of applications by 1.6x to 3x and recommend using multiple small Spark executors, which can provide up to 36% reduction in execution time over a single large executor. Based on the characteristics of the workloads, the thesis envisions near-memory and near-storage hardware acceleration to improve the single-node performance of scale-out frameworks like Apache Spark. Using modeling techniques, it estimates a speed-up of 4x for Apache Spark on scale-up servers augmented with near-data accelerators.

Research paper thumbnail of ANFIS Based Attitude Dynamics Identification of Unmanned Aerial Vehicle

This paper presents a modified fuzzy neural network (MFNN) for the in-flight estimation of the nonlinear, highly coupled and time-varying attitude dynamics of unmanned aerial platforms. A hybrid adaptive learning algorithm combining recursive least squares and gradient descent has been used to update the linear and nonlinear parameters of the MFNN. The performance of the devised architecture as an attitude dynamics identifier is validated in real time through a test flight of the Kadet Senior UAV. A mean square error of 0.018 degrees between the actual pitch and the output of the MFNN-based pitch identifier proves the applicability of the approach.
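
The "hybrid learning" idea can be illustrated with the recursive-least-squares half of the update, which estimates the linear consequent parameters online while gradient descent (omitted here) would adjust the nonlinear membership-function parameters. The sketch below is a generic RLS estimator on synthetic data, not the MFNN implementation; the forgetting factor and dimensions are assumptions.

```python
# Sketch of the RLS part of a hybrid learning scheme: online estimation of
# linear parameters theta from streaming (x, y) pairs. Illustrative only.
import numpy as np

def rls_update(theta, P, x, y, lam=0.99):
    x = x.reshape(-1, 1)
    k = P @ x / (lam + x.T @ P @ x)          # gain vector
    err = y - (x.T @ theta).item()           # prediction error
    theta = theta + k * err
    P = (P - k @ x.T @ P) / lam              # covariance update with forgetting
    return theta, P

theta, P = np.zeros((3, 1)), np.eye(3) * 1e3
true_w = np.array([0.5, -1.2, 2.0])
rng = np.random.default_rng(1)
for _ in range(200):
    x = rng.normal(size=3)
    y = float(true_w @ x) + rng.normal(scale=0.01)
    theta, P = rls_update(theta, P, x, y)
print(theta.ravel())    # converges toward true_w
```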

Research paper thumbnail of Identifying the potential of Near Data Computing for Apache Spark

While cluster computing frameworks are continuously evolving to provide real-time data analysis capabilities, Apache Spark has managed to be at the forefront of big data analytics by being a unified framework for both batch and stream data processing. There is also a renewed interest in Near Data Computing (NDC) due to technological advancement in the last decade. However, it is not known whether NDC architectures can improve the performance of big data processing frameworks such as Apache Spark. In this position paper, we hypothesize in favour of an NDC architecture comprising programmable-logic-based hybrid 2D-integrated processing-in-memory and in-storage processing for Apache Spark, through extensive profiling of Apache Spark based workloads on an Ivy Bridge server.

Research paper thumbnail of Architectural Impact on Performance of In-memory Data Analytics: Apache Spark Case Study

While cluster computing frameworks are continuously evolving to provide real-time data analysis capabilities, Apache Spark has managed to be at the forefront of big data analytics by being a unified framework for both batch and stream data processing. However, recent studies on micro-architectural characterization of in-memory data analytics are limited to batch processing workloads only. We compare the micro-architectural performance of batch processing and stream processing workloads in Apache Spark using hardware performance counters on a dual-socket server. In our evaluation experiments, we have found that while batch processing workloads are bound by the latency of frequent data accesses to DRAM, stream processing workloads are curbed by L1 instruction cache misses. For data accesses, we have found that simultaneous multi-threading is effective in hiding the data latencies. We have also observed that (i) data locality on NUMA nodes can improve the performance by up to 12%, (ii) disabling next-line L1-D prefetchers can reduce the execution time by up to 15%, and (iii) multiple small executors can provide up to 36% speedup over a single large executor.

Research paper thumbnail of Performance Characterization of In-Memory Data Analytics on a Scale-up Server

The sheer increase in the volume of data over the last decade has triggered research in cluster computing frameworks that enable web enterprises to extract big insights from big data. While Apache Spark defines the state of the art in big data analytics platforms for (i) exploiting data-flow and in-memory computing and (ii) exhibiting superior scale-out performance on commodity machines, little effort has been devoted to understanding the performance of in-memory data analytics with Spark on modern scale-up servers. This thesis characterizes the performance of in-memory data analytics with Spark on scale-up servers.

Through empirical evaluation of representative benchmark workloads on a dual-socket server, we have found that in-memory data analytics with Spark exhibit poor multi-core scalability beyond 12 cores due to thread-level load imbalance and work-time inflation. We have also found that workloads are bound by the latency of frequent data accesses to DRAM. By enlarging the input data size, application performance degrades significantly due to a substantial increase in wait time during I/O operations and garbage collection, despite a 10% better instruction retirement rate (due to lower L1 cache misses and higher core utilization).

For data accesses, we have found that simultaneous multi-threading is effective in hiding the data latencies. We have also observed that (i) data locality on NUMA nodes can improve the performance by 10% on average and (ii) disabling next-line L1-D prefetchers can reduce the execution time by up to 14%. For the GC impact, we match memory behaviour with the garbage collector to improve the performance of applications by 1.6x to 3x and recommend using multiple small executors, which can provide up to 36% speedup over a single large executor.

Research paper thumbnail of FPGA Based Implementation of Norm Optimal Iterative Learning Control

Norm Optimal Iterative Learning Control (NOILC) is a state-of-the-art control strategy used in gantry robots and rehabilitation robotics. However, due to the memory required for storing trial data and the associated matrix manipulations, the aforementioned applications use a desktop PC, which is not an efficient solution in terms of area, power, cost and mobility. This thesis concerns the use of an FPGA as an implementation platform for NOILC. In this regard, a floating-point norm-optimal iterative learning controller for a gantry robot is developed and synthesized on a Virtex-5 XC5VLX110T FPGA. The comparison with a general-purpose processor based implementation shows that the proposed FPGA implementation reduces the execution time from 830 ms to 1.47 ms.
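
For context, the norm-optimal ILC update minimizes a weighted sum of tracking error and trial-to-trial control change. The standard formulation below (stated here for reference, not taken from the thesis) makes clear why the lifted plant matrix G over an N-sample trial drives the memory and matrix-manipulation cost, where r is the reference and e_k = r - G u_k.

```latex
u_{k+1} = \arg\min_{u}\Bigl\{\,\lVert r - G u\rVert_{Q}^{2} + \lVert u - u_{k}\rVert_{R}^{2}\,\Bigr\}
        = u_{k} + \bigl(G^{\mathsf{T}} Q\, G + R\bigr)^{-1} G^{\mathsf{T}} Q\, e_{k}
```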

Research paper thumbnail of Hardware Support for FPGA Resource Elasticity

FPGAs are increasingly being deployed in the cloud to accelerate diverse applications. They are to be shared among multiple tenants to improve the total cost of ownership. Partial reconfiguration technology enables multi-tenancy on an FPGA by partitioning it into regions, each hosting a specific application's accelerator. However, the regions' sizes cannot be changed once they are defined, resulting in underutilization of FPGA resources. This paper argues for dividing the acceleration requirements of an application into multiple small computation modules. The devised FPGA shell can reconfigure the available PR regions with those modules and enable them to communicate with each other over a crossbar interconnect with the Wishbone bus interface. For each PR region being reconfigured, it updates the register file with the valid destination addresses and the bandwidth allocation of the interconnect. Any invalid communication request originating from the Wishbone master interface is masked in the corresponding master port of the crossbar. The allocated bandwidth for the PR region is ensured by the weighted round-robin arbiter in the slave port of the crossbar. Finally, the envisioned resource manager can increase or decrease the number of PR regions allocated to an application based on its acceleration requirements and the availability of PR regions.
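
A behavioural sketch of the weighted round-robin arbitration described above follows: each requesting master receives grants in proportion to its configured weight within a round. The weights and region names are illustrative assumptions; this is a software model, not the RTL arbiter.

```python
# Behavioural sketch of weighted round-robin arbitration at a crossbar slave
# port: each master gets up to `weight` grants per round, so bandwidth shares
# follow the configured weights when all masters keep requesting.
from collections import Counter

def weighted_round_robin(weights, cycles):
    credits = dict(weights)            # remaining grants in the current round
    grants = []
    for _ in range(cycles):
        if all(c == 0 for c in credits.values()):
            credits = dict(weights)    # start a new round
        for master in weights:         # fixed search order within the round
            if credits[master] > 0:
                grants.append(master)
                credits[master] -= 1
                break
    return grants

grants = weighted_round_robin({"PR0": 3, "PR1": 1}, cycles=16)
print(Counter(grants))                 # roughly 3:1 grant ratio between PR regions
```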