Ayaz H Khan | King Fahd University of Petroleum and Minerals

Papers by Ayaz H Khan

Managing Health Treatment by Optimizing Complex Lab-Developed Test Configurations: A Health Informatics Perspective

Computers, Materials & Continua

Padding Free Bank Conflict Resolution for CUDA-Based Matrix Transpose Algorithm

International Journal of Networked and Distributed Computing, 2014

Advances in Graphics Processing Unit (GPU) technology and the introduction of the CUDA programming model facilitate the development of new solutions for sparse and dense linear algebra solvers. Matrix transpose is an important linear algebra procedure with a deep impact on various computational science and engineering applications. Several factors hinder the expected performance of large matrix transpose on GPU devices, chiefly memory access patterns such as non-coalesced access to global memory and bank conflicts in the shared memory of the streaming multiprocessors. In this paper, two matrix transpose algorithms are proposed to ensure coalesced access and conflict-free bank access. The proposed algorithms have execution times comparable to the NVIDIA SDK bank-conflict-free matrix transpose implementation. Their main advantage is that they eliminate bank conflicts while allocating shared memory exactly equal to the tile size (T x T) of the problem space, whereas, to the best of our knowledge, published approaches need to allocate an extra space of T x (T+1). We have also applied the proposed transpose algorithm to the recursive Gaussian implementation of the NVIDIA SDK and achieved about a 6% improvement in performance.
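For illustration, a minimal NumPy sketch of the tile-wise transpose that both kernels perform; the tile size, matrix shape and function name are assumptions, and the shared-memory/bank-conflict aspect the paper addresses only exists in the actual CUDA kernels (the conventional fix pads each tile row to T+1 words, while the paper's scheme keeps the buffer at exactly T x T):

```python
import numpy as np

def tiled_transpose(a: np.ndarray, tile: int = 32) -> np.ndarray:
    """Transpose `a` one TxT tile at a time (CPU illustration only).

    On a GPU, each tile is staged in shared memory; the classic bank-conflict
    fix allocates tile x (tile + 1) words per tile, whereas a padding-free
    scheme keeps the buffer at tile x tile.
    """
    rows, cols = a.shape
    assert rows % tile == 0 and cols % tile == 0, "illustration assumes exact tiling"
    out = np.empty((cols, rows), dtype=a.dtype)
    for i in range(0, rows, tile):
        for j in range(0, cols, tile):
            block = a[i:i + tile, j:j + tile]      # row-wise (coalesced-style) reads
            out[j:j + tile, i:i + tile] = block.T  # tile written to the swapped position
    return out

a = np.arange(64 * 64, dtype=np.float32).reshape(64, 64)
assert np.array_equal(tiled_transpose(a), a.T)
```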

Conflict-Free Replication Datatype using Data Distribution Service

Predictive Analytics for Product Configurations in Software Product Lines

International Journal of Computational Intelligence Systems, 2021

A Software Product Line (SPL) is a collection of software for configuring software products in which sets of features are configured by different teams of product developers. This process often leads to inconsistencies (violations of constraints) in the resulting product configurations, whose resolution consumes considerable business resources. In this paper, we aim to solve this problem by learning, or mathematically modeling, all previous patterns of feature selection by SPL developers, and then using these patterns to predict inconsistent configuration patterns at runtime. We propose and implement an informative Predictive Analytics tool called the predictive Software Product LIne Tool (p-SPLIT), which provides runtime decision support to SPL developers in three ways: 1) by identifying feature selection patterns that lead to inconsistent product configurations, 2) by identifying feature selection patterns that lead to consistent product configurations, and 3) by predicting feature inconsistencies in the product that is currently being configured (at runtime). p-SPLIT provides the first application of Predictive Analytics for the SPL feature modeling domain at the application engineering level. In experiments across representative SPL settings, p-SPLIT achieved 85% predictive accuracy and a 98% Area Under the Curve (AUC) score. We also obtained subjective feedback from practitioners, who validated the usability of p-SPLIT in providing runtime decision support to SPL developers. Our results show that p-SPLIT is a valuable addition for the global SPL product configuration community, and we further validate this by comparing p-SPLIT's characteristics with state-of-the-art SPL development solutions.
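As a hedged illustration of the kind of pattern learning described above (the abstract does not specify the learner), the sketch below trains an off-the-shelf classifier on synthetic feature-selection vectors labeled consistent/inconsistent and uses it for runtime risk prediction; the data, the toy constraint and the RandomForest choice are assumptions, not p-SPLIT's actual model:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Hypothetical data: rows are past product configurations, columns are features
# (1 = selected, 0 = deselected); y marks whether the configuration turned out
# to be inconsistent. The real encoding and learner may differ.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 40))
y = (X[:, 3] & X[:, 7]).astype(int)          # toy constraint: features 3 and 7 clash

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X[:400], y[:400])
print("AUC on held-out configurations:",
      roc_auc_score(y[400:], clf.predict_proba(X[400:])[:, 1]))

# Runtime decision support: flag a partially configured product as risky.
current = rng.integers(0, 2, size=(1, 40))
print("Predicted inconsistency risk:", clf.predict_proba(current)[0, 1])
```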

Feature Selection Optimization in Software Product Lines

IEEE Access, 2020

Feature modeling is a common approach for configuring and capturing commonalities and variations among different Software Product Line (SPL) products. This process is carried out by a set of SPL design teams, each working on a different configuration of the desired product. The integration of these configurations leads to inconsistencies in the final product design. The typical solution involves extensive deliberation and unnecessary resource usage, which makes SPL inconsistency resolution an expensive and unoptimized process. We present the first comprehensive evaluation of swarm intelligence (using Particle Swarm Optimization) on the problem of resolving inconsistencies in a configured, integrated SPL product. We call the tool o-SPLIT (optimization-based Software Product LIne Tool) and validate it with standard ERP, SPLOT (Software Product Lines Online Tools), and BeTTy (BEnchmarking and TesTing on the analYsis) product configurations across diverse feature set sizes. The results show that Particle Swarm Optimization can successfully optimize SPL product configurations. Finally, we deploy o-SPLIT as a decision-support tool in a real, local SPL setting and gather subjective feedback from SPL designers, which shows that the teams are convinced of the usability and high-level decision support provided by o-SPLIT.
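Below is a minimal sketch of global-best Particle Swarm Optimization applied to a toy feature-selection problem; the constraint function, swarm parameters and 0/1 thresholding are illustrative assumptions, not the feature models or objective used by o-SPLIT:

```python
import numpy as np

rng = np.random.default_rng(1)
N_FEATURES, SWARM, ITERS = 20, 30, 100

def violations(selection: np.ndarray) -> int:
    """Toy inconsistency count for a 0/1 feature-selection vector.

    Stand-in for real feature-model constraints (requires/excludes rules):
    feature 0 requires feature 1, and features 2 and 3 exclude each other.
    """
    v = 0
    if selection[0] and not selection[1]:
        v += 1
    if selection[2] and selection[3]:
        v += 1
    return v

# Standard global-best PSO over continuous positions, thresholded to 0/1.
pos = rng.random((SWARM, N_FEATURES))
vel = np.zeros_like(pos)
pbest, pbest_cost = pos.copy(), np.array([violations(p > 0.5) for p in pos])
gbest = pbest[pbest_cost.argmin()].copy()

for _ in range(ITERS):
    r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, 0.0, 1.0)
    cost = np.array([violations(p > 0.5) for p in pos])
    improved = cost < pbest_cost
    pbest[improved], pbest_cost[improved] = pos[improved], cost[improved]
    gbest = pbest[pbest_cost.argmin()].copy()

print("best selection:", (gbest > 0.5).astype(int), "violations:", violations(gbest > 0.5))
```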

Classification of multi-lingual tweets, into multi-class model using Naïve Bayes and semi-supervised learning

Multimedia Tools and Applications, Aug 29, 2020

Twitter is a social media platform that has proven to be a great tool for gaining insight into emotions about products, policies, etc., through 280-character messages called tweets, which carry direct and unfiltered emotions from a large user population. Twitter has attracted the attention of many researchers owing to the fact that every tweet is public by default, which is not the case with Facebook. This paper proposes a model for multi-lingual (English and Roman Urdu) classification of tweets over a diverse range of classes (non-hierarchical architecture). Previous work in tweet classification has narrowly focused either on a single language or on a uniform set of classes (Positive, Extremely Positive, Negative and Extremely Negative). The proposed model is based on semi-supervised learning, and the proposed feature selection approach makes it less dependent on labeled data and highly adaptive in capturing trending terms. This makes it a strong candidate for streaming data. Using the Naïve Bayes learning algorithm in each phase, the methodology obtained an accuracy of up to 87.16%, ahead of both KNN and SVM models, which are popular in the NLP and text classification domains.
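A small sketch of the overall idea, combining a bag-of-words Naïve Bayes classifier with one round of self-training on unlabeled tweets; the example tweets, the confidence threshold and the single-pass pseudo-labeling are assumptions and are much simpler than the paper's phase-wise pipeline:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy labeled and unlabeled tweets (English + Roman Urdu); the real classes,
# preprocessing and feature selection are richer than this.
labeled = ["great service zabardast", "bura experience very bad", "love this product"]
labels  = ["positive", "negative", "positive"]
unlabeled = ["zabardast product love it", "very bad bura service", "acha experience"]

vec = CountVectorizer()
X = vec.fit_transform(labeled + unlabeled)
X_lab, X_unlab = X[:len(labeled)], X[len(labeled):]

# Self-training: pseudo-label the unlabeled tweets the model is confident about,
# then refit on the augmented training set.
clf = MultinomialNB().fit(X_lab, labels)
proba = clf.predict_proba(X_unlab)
confident = proba.max(axis=1) > 0.6
pseudo = clf.classes_[proba.argmax(axis=1)]

X_aug = np.vstack([X_lab.toarray(), X_unlab.toarray()[confident]])
y_aug = list(labels) + list(pseudo[confident])
clf = MultinomialNB().fit(X_aug, y_aug)
print(dict(zip(unlabeled, clf.predict(X_unlab))))
```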

Parallel Implementation of Predicting RNA using LR Parsing in MPI

2019 International Symposium on Recent Advances in Electrical Engineering (RAEE)

RNA secondary structure prediction is a core task for discovering the relationship between an RNA molecule's structure and its function. Methods based on grammars, dynamic programming, matching and evolutionary algorithms have been developed for modeling and analyzing RNA secondary structure. In particular, parsing with a Stochastic Context-Free Grammar (SCFG) can be used to derive a 2-D RNA secondary structure from a 1-D RNA sequence, handling grammar ambiguity with a multithreaded model that distributes the parsing conflicts among multiple threads. This intuitive multithreaded model for RNA prediction does not scale to longer RNA sequences: it currently supports sequences of only up to 20 characters, because of the extensive memory needed to store intermediate parse trees, parsing actions, and states for handling a large number of parsing conflicts. This paper presents a parallel implementation of LR parsing for predicting RNA secondary structure using the Message Passing Interface (MPI), to be executed on a large computing cluster, which enhances the scalability and performance of finding a valid parse tree with maximum probability.
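A minimal mpi4py sketch of the distribution idea: each rank explores a different subset of conflict branches and rank 0 keeps the most probable parse. The branch-exploration function, its probability and the file name are placeholders, not the paper's SCFG-driven LR parser (run with, e.g., `mpiexec -n 4 python rna_parse.py`):

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

sequence = "GGGAAACCC"                      # toy 1-D RNA sequence

def explore_branch(seq: str, branch: int):
    """Placeholder for resolving one shift/reduce conflict branch of the LR parse."""
    prob = 1.0 / (1 + abs(branch - len(seq) // 2))   # dummy branch probability
    return prob, f"parse-tree(branch={branch})"

# Each rank takes a strided subset of the conflict branches.
local_best = max(explore_branch(sequence, b) for b in range(rank, 2 * size, size))

# Rank 0 gathers per-rank bests and reports the maximum-probability parse.
all_best = comm.gather(local_best, root=0)
if rank == 0:
    prob, tree = max(all_best)
    print(f"most probable parse: {tree} (p={prob:.3f})")
```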

ck-NN: A Clustered k-Nearest Neighbours Approach for Large-Scale Classification

ADCAIJ: Advances in Distributed Computing and Artificial Intelligence Journal

k-Nearest Neighbor (k-NN) is a non-parametric algorithm widely used for the estimation and classification of data points, especially when the dataset is distributed over several classes. It is considered a lazy machine learning algorithm, as most of the computation is done during the testing phase rather than during training. Hence it is practically inefficient, and often infeasible, when processing huge datasets, i.e., Big Data. On the other hand, the results of clustering techniques (unsupervised learning) are strongly affected by normalization or standardization, and the number of clusters "k" is difficult to determine. In this paper, some novel techniques are proposed as a pre-processing stage for the standard k-NN classification algorithm. The proposed mechanism applies an unsupervised clustering algorithm to the large dataset before running k-NN on the resulting clusters, which may reside on a single machine, multiple machines or different...
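A compact sketch of the cluster-then-classify idea using scikit-learn: K-Means partitions the training data and a separate k-NN model answers each query routed to its nearest cluster; the dataset, cluster count and k are illustrative assumptions, not the paper's configuration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Pre-processing step: partition the training data, then run k-NN only inside
# the cluster nearest to each query point.
X, y = make_classification(n_samples=5000, n_features=10, n_informative=6, random_state=0)
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)

# One small k-NN model per cluster instead of one model over the full dataset.
models = {c: KNeighborsClassifier(n_neighbors=5).fit(X[km.labels_ == c], y[km.labels_ == c])
          for c in range(8)}

def ck_nn_predict(queries: np.ndarray) -> np.ndarray:
    clusters = km.predict(queries)
    return np.array([models[c].predict(q.reshape(1, -1))[0] for c, q in zip(clusters, queries)])

print(ck_nn_predict(X[:5]), y[:5])
```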

SpMV and BiCG-Stab optimization for a class of hepta-diagonal-sparse matrices on GPU

The Journal of Supercomputing

The abundant data parallelism available in many-core GPUs has been of key interest for improving the accuracy of scientific and engineering simulation. In many cases, most of the simulation time is spent in the linear solver, which is dominated by sparse matrix–vector multiply. In forward petroleum oil and gas reservoir simulation, the application of a stencil relationship to a structured grid leads to a family of generalized hepta-diagonal solver matrices with some regularity and structural uniqueness. We present a customized storage scheme that takes advantage of the generalized hepta-diagonal sparsity pattern and stencil regularity by optimizing both storage and matrix–vector computation. We also present an in-kernel optimization for implementing the sparse matrix–vector multiply (SpMV) and the biconjugate gradient stabilized (BiCG-Stab) solver. The in-kernel approach avoids the multiple kernel invocations associated with the use of numerical library operators. To stay in-kernel, a lock-free inter-block synchronization is used in which thread blocks that complete early are assigned independent computations so that they avoid repeatedly polling global memory. Other optimizations combine reductions and collective write operations to memory. The in-kernel optimization is particularly useful for the iterative structure of BiCG-Stab, preserving vector data locality and avoiding saving vector data back to memory and reloading it on each kernel exit and re-entry. The evaluation uses generalized hepta-diagonal matrices derived from a range of structured grids used in forward reservoir simulation. Results show the benefit of the proposed generalized hepta-diagonal custom storage scheme over standard library storage formats such as compressed sparse row, hybrid, and diagonal. Using the proposed optimizations, SpMV and BiCG-Stab are noticeably accelerated compared to implementations that incur multiple kernel exits and re-entries by invoking numerical library operators.
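The storage idea can be illustrated on the CPU with NumPy: seven per-diagonal value arrays indexed by row, with SpMV computed as shifted element-wise products. The grid size and stencil offsets below are assumptions, and the actual scheme in the paper is a CUDA-side layout tuned for coalesced access:

```python
import numpy as np

# Hepta-diagonal storage for a structured nx x ny x nz grid: one value array per
# stencil diagonal, indexed by row (entries outside the valid range are ignored).
nx, ny, nz = 4, 4, 4
n = nx * ny * nz
offsets = [-nx * ny, -nx, -1, 0, 1, nx, nx * ny]

rng = np.random.default_rng(0)
diags = {d: rng.random(n) for d in offsets}   # diags[d][i] holds A[i, i + d]
x = rng.random(n)

def hepta_spmv(diags: dict, x: np.ndarray) -> np.ndarray:
    """y = A x using only the seven stored diagonals (no explicit sparse matrix)."""
    y = np.zeros_like(x)
    n = x.size
    for d, vals in diags.items():
        if d >= 0:
            y[: n - d] += vals[: n - d] * x[d:]
        else:
            y[-d:] += vals[-d:] * x[: n + d]
    return y

# Cross-check against the equivalent dense matrix.
A = np.zeros((n, n))
for d, vals in diags.items():
    rows = np.arange(max(0, -d), min(n, n - d))
    A[rows, rows + d] = vals[rows]
assert np.allclose(hepta_spmv(diags, x), A @ x)
```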

Exploration of automatic optimisation for CUDA programming

International Journal of Parallel, Emergent and Distributed Systems, 2014

Writing optimised compute unified device architecture (CUDA) programs for graphic processing units (GPUs) is complex even for experts. We present a design methodology for a restructuring tool that converts C-loops into optimised CUDA kernels based on a three-step algorithm: loop tiling, coalesced memory access and resource optimisation. A method for finding possible loop tiling solutions with coalesced memory access is developed, and a simplified algorithm for restructuring C-loops into an efficient CUDA kernel is presented. In the evaluation, we implement matrix multiply (MM), matrix transpose (M-transpose), matrix scaling (M-scaling) and matrix vector multiply (MV) using the proposed algorithm. We present an analysis of execution time and GPU throughput for these applications, which compare favourably to other proposals. The evaluation is carried out while scaling the problem size and running under a variety of kernel configurations. The obtained speedup is about 28–35% for M-transpose compared to the NVIDIA Software Development Kit, 33% for MV compared to the general-purpose computation on graphics processing units (GPGPU) compiler, and more than 80% for MM and M-scaling compared to CUDA-lite.
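As a sketch of the loop-tiling step on one of the evaluated kernels (matrix multiply), the restructured loop nest below is written in plain Python/NumPy; the tile size is an assumption, and the comments only indicate how such a nest would typically map onto CUDA blocks and shared memory:

```python
import numpy as np

def matmul_tiled(A: np.ndarray, B: np.ndarray, T: int = 16) -> np.ndarray:
    """Tiled matrix multiply: the loop shape a restructuring tool lowers to CUDA."""
    n, m = A.shape
    m2, p = B.shape
    assert m == m2 and n % T == 0 and m % T == 0 and p % T == 0
    C = np.zeros((n, p), dtype=A.dtype)
    for ii in range(0, n, T):            # -> blockIdx.y
        for jj in range(0, p, T):        # -> blockIdx.x
            for kk in range(0, m, T):    # -> shared-memory tile loop
                C[ii:ii + T, jj:jj + T] += A[ii:ii + T, kk:kk + T] @ B[kk:kk + T, jj:jj + T]
    return C

A = np.random.rand(64, 64).astype(np.float32)
B = np.random.rand(64, 64).astype(np.float32)
assert np.allclose(matmul_tiled(A, B), A @ B, atol=1e-3)
```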

A Novel Mobility-Aware Data Transfer Service (MADTS) Based on DDS Standards

Arabian Journal for Science and Engineering, 2014

Information sharing between mobile devices has gained immense popularity in recent times, owing to advances in network bandwidth and the sophistication of mobile applications. Developing such applications to facilitate seamless information sharing between heterogeneous mobile devices can be cumbersome. The Object Management Group DDS (Data Distribution Service) specification provides a standard for a range of real-time mobile systems and embedded computing environments, from small networked embedded systems up to large-scale information backbones, to communicate with each other. The service exhibits features such as asynchronous interaction, Quality of Service (QoS) support, and a dynamic discovery mechanism to support smooth hand-off during communication. In this paper, we propose a service architecture model that uses the DDS specification to facilitate uninterrupted mobile communications, minimizing disconnections during data communication between mobile nodes caused by mobility. We also introduce the application of the DDS QoS module for evaluating our mobility-aware data transfer model. Several experiments were conducted to identify the capabilities of the proposed approach in a heterogeneous environment in terms of latency and throughput, using a two-mobile-node scenario, and the results were found to be promising.

Abstractions for Large-Scale Deep Learning Models in Big Data Analytics

The goal of big data analytics is to analyze datasets with high volume, velocity, and variety for large-scale business intelligence problems. These workloads are normally processed in a distributed fashion on massively parallel analytical systems. Deep learning is part of a broader family of machine learning methods based on learning representations of data. Deep learning plays a significant role in information analysis by adding value to massive amounts of unsupervised data. A core domain of research is the development of deep learning algorithms for automatically extracting complex data formats at a higher level of abstraction from massive volumes of data. In this paper, we present the latest research trends in the development of parallel algorithms, optimization techniques, tools and libraries related to big data analytics and deep learning on various parallel architectures. The basic building blocks for deep learning such as Restricted Boltzmann Machi...

RT-CUDA: A Software Tool for CUDA Code Restructuring

International Journal of Parallel Programming, 2016

Recent developments in Graphics Processing Units (GPUs) have opened new opportunities for harnessing their computing power as a general-purpose computing paradigm through CUDA parallel programming. However, porting applications to CUDA remains a challenge for average programmers. In this work we have developed a restructuring software compiler (RT-CUDA) with the best possible kernel optimizations to bridge the gap between high-level languages and the machine-dependent CUDA environment. RT-CUDA is based upon a set of compiler optimizations. It takes a C-like program and converts it into an optimized CUDA kernel, with user directives in a configuration file guiding the compiler. While the invocation of external libraries is not possible with the commercial OpenACC compiler, RT-CUDA allows transparent invocation of the most optimized external math libraries such as cuSPARSE and cuBLAS. For this, RT-CUDA uses interfacing APIs, error handling interpretation, and user-transparent programming. This enables the efficient design of linear algebra solvers (LAS). Evaluation of RT-CUDA has been performed on a Tesla K20c GPU with a variety of basic linear algebra operators (M+, MM, MV, VV, etc.) as well as solvers of systems of linear equations such as Jacobi and Conjugate Gradient. We obtained significant speedup over other compilers such as the OpenACC and GPGPU compilers. RT-CUDA facilitates the design of efficient parallel software for developing parallel simulators (reservoir simulators, molecular dynamics, etc.) which are critical for the Oil & Gas industry in KSA. We expect RT-CUDA to be needed by many KSA industries dealing with science and engineering simulation on massively parallel computers such as NVIDIA GPUs.

Is the performance of a cricket team really unpredictable? A case study on Pakistan team using machine learning

Indian Journal of Science and Technology, 2020

Cricket is the second most popular game around the globe; it breeds a particularly high level of enthusiasm in Asia, Australia and the UK. However, it is generally known and globally mentioned that Pakistan is an "unpredictable" cricket team, which leads to extreme reactions from citizens in case of a loss, e.g., verbal anger, breaking of television sets and burning of players' effigies. Objectives: In this study, we leverage machine learning techniques to demonstrate that the use of the "unpredictable" tag for Pakistan's cricket performance is unjustified, as the match outcome can be predicted with reasonably high confidence. Method: We produce a new dataset by scraping the latest statistics from cricinfo.com, the most reliable online source. We also propose a novel feature, "consecutive wins", that incorporates the team's recent performance trend. With an extensive experimental setup, a state-of-the-art machine learning methodology was employed to demonstrate the effectiveness of the proposed tool. Findings: Pakistan's cricket performance can be predicted with 82% accuracy, i.e., it is possible to understand, in advance, the patterns which may lead to a winning or losing situation. Hence, using pre-match analysis, it is possible to avoid prejudiced opinions or potentially dangerous reactions. Novelty: We employ a state-of-the-art machine learning methodology based on the application of various algorithms, feature selection and data-splitting methods. State-of-the-art prediction accuracy is achieved by exploiting all potential avenues in a structured way.
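A toy sketch of deriving a "consecutive wins" style feature and training a classifier on it; the data frame, the other columns and the RandomForest choice are assumptions standing in for the scraped cricinfo statistics and the paper's actual methodology:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Illustrative match records; "consec_wins" counts wins in a row going into each match.
df = pd.DataFrame({
    "won":      [1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0],
    "home":     [1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0],
    "toss_won": [1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1],
})
streak = df["won"].shift(1, fill_value=0)                  # result of the previous match
df["consec_wins"] = (streak.groupby((streak == 0).cumsum()).cumcount()
                     .where(streak == 1, 0))               # reset to 0 after a loss

X, y = df[["home", "toss_won", "consec_wins"]], df["won"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
print("toy accuracy:", accuracy_score(y_te, model.predict(X_te)))
```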

A review of CUDA optimization techniques and tools for structured grid computing

Computing Journal, Springer, 2019

Recent advances in GPUs have opened new opportunities for harnessing their computing power for general-purpose computing. CUDA, an extension to C, was developed for programming NVIDIA GPUs. However, programming GPUs efficiently with CUDA is tedious and error-prone even for expert programmers: the programmer has to optimize resource occupancy and manage data transfers between the host and the GPU, and across the memory system. This paper presents the basic architectural optimizations and explores their implementations in research and industry compilers. The focus of the review is on accelerating computational science applications such as the class of structured grid computations (SGC). It also discusses the mismatch between current compiler techniques and the requirements for implementing efficient iterative linear solvers, and explores the approaches used by computational scientists to program SGCs. Finally, a set of tools with the main optimization functionalities for an integrated library is proposed to ease the process of defining complex SGC data structures and optimizing solver code using an intelligent high-level interface and domain-specific annotations.

Software Abstractions for Large-Scale Deep Learning Models in Big Data Analytics

International Journal of Advanced Computer Science and Applications (IJACSA), 2019

The goal of big data analytics is to analyze datasets with high volume, velocity, and variety for large-scale business intelligence problems. These workloads are normally processed in a distributed fashion on massively parallel analytical systems. Deep learning is part of a broader family of machine learning methods based on learning representations of data. Deep learning plays a significant role in information analysis by adding value to massive amounts of unsupervised data. A core domain of research is the development of deep learning algorithms for automatically extracting complex data formats at a higher level of abstraction from massive volumes of data. In this paper, we present the latest research trends in the development of parallel algorithms, optimization techniques, tools and libraries related to big data analytics and deep learning on various parallel architectures. The basic building blocks for deep learning, such as Restricted Boltzmann Machines (RBM) and Deep Belief Networks (DBN), are identified and analyzed for parallelization of deep learning models. We propose a parallel software API based on PyTorch, the Hadoop Distributed File System (HDFS), Apache Hadoop MapReduce and MapReduce Job (MRJob) for developing large-scale deep learning models. We obtained about a 5-30% reduction in the execution time of the deep auto-encoder model even on a single-node Hadoop cluster. Furthermore, the complexity of the code needed to create multi-layer deep learning models is significantly reduced.
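A minimal PyTorch deep auto-encoder of the kind such an API would distribute; the layer sizes, the random input standing in for an HDFS shard, and the training loop are assumptions, and the HDFS/MRJob driver described in the paper is omitted:

```python
import torch
from torch import nn, optim

class AutoEncoder(nn.Module):
    """Simple symmetric encoder/decoder; sizes are illustrative only."""
    def __init__(self, n_in: int = 784, n_hidden: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_in, 256), nn.ReLU(),
                                     nn.Linear(256, n_hidden), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(n_hidden, 256), nn.ReLU(),
                                     nn.Linear(256, n_in), nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
opt = optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.rand(128, 784)                 # stand-in for one data shard read from HDFS
for epoch in range(5):
    opt.zero_grad()
    loss = loss_fn(model(x), x)          # reconstruction objective
    loss.backward()
    opt.step()
    print(f"epoch {epoch}: reconstruction loss {loss.item():.4f}")
```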

Exploration of automatic optimization for CUDA programming

2nd IEEE International Conference on Parallel, Distributed and Grid Computing (PDGC), 2012

Graphics Processing Units (GPUs) are gaining ground in high-performance computing. CUDA (an extension to C) is the most widely used parallel programming framework for general-purpose GPU computation. However, the task of writing optimized CUDA programs is complex even for experts. We present a method for restructuring loops into optimized CUDA kernels based on a 3-step algorithm: loop tiling, coalesced memory access, and resource optimization. We establish the relationships between the influencing parameters and propose a method for finding the possible tiling solutions with coalesced memory access that best meet the identified constraints. We also present a simplified algorithm for restructuring loops and rewriting them as an efficient CUDA kernel. The execution model of the synthesized kernel uniformly distributes the kernel threads to keep all cores busy while staging a tailored data locality that is accessed in a coalesced pattern to amortize the long latency of secondary memory. In the evaluation, we implement some simple applications using the proposed restructuring strategy and evaluate the performance in terms of execution time and GPU throughput.

2012 2nd IEEE International Conference on Parallel, Distributed and Grid Computing (PDGC)

Ad-hoc Networks

This paper proposes an efficient GPU-based parallel algorithm for image reconstruction. It has been implemented on a system with a general-purpose graphics processing unit (GPU), an NVIDIA GTX-275 graphics card. Experimental results reveal that an image of size 256 x 256 with 90 projections can be reconstructed in real time with a single iteration. The results enable us to use the MART algorithm for online applications.
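As a hedged illustration, a NumPy sketch of the multiplicative ART (MART) update that a GPU implementation parallelizes across rays and pixels; the problem sizes, relaxation factor and random system are assumptions, not the paper's 256 x 256, 90-projection setup:

```python
import numpy as np

rng = np.random.default_rng(0)
n_pixels, n_rays = 64, 90
A = rng.random((n_rays, n_pixels))          # ray-pixel weight matrix (toy)
x_true = rng.random(n_pixels) + 0.1
b = A @ x_true                              # measured projections

x = np.ones(n_pixels)                       # positive initial image
lam = 0.5                                   # relaxation factor
for _ in range(20):                         # each pass over the rays is one MART sweep
    for i in range(n_rays):
        ratio = b[i] / (A[i] @ x)           # measured vs. current projection for ray i
        x *= ratio ** (lam * A[i])          # multiplicative per-pixel update
print("relative residual:", np.linalg.norm(A @ x - b) / np.linalg.norm(b))
```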
