MALAYSIAN JOURNAL OF COMPUTER SCIENCE (MJCS)
Papers by MALAYSIAN JOURNAL OF COMPUTER SCIENCE (MJCS)
Ontological lexicons are considered a rich source of knowledge for the development of various natural language processing tools and applications; however, they are expensive to build, maintain, and extend. In this paper, we present the Badea system for the semi-automated extraction of lexical relations, specifically antonyms, using a pattern-based approach to support the task of ontological lexicon enrichment. The approach is based on an ontology of "seed" pairs of antonyms in the Arabic language; we identify patterns in which the pairs occur and then use the identified patterns to find new antonym pairs in Arabic textual corpora. Experiments are conducted on Badea using texts from three Arabic corpora: KSUCCA, KACSTAC, and CAC. The system is evaluated, and the patterns' reliability and the system's performance are measured. The results from our experiments on the three Arabic corpora show that the pattern-based approach can be useful in the ontological enrichment task, as the evaluation of the system resulted in the ontology being updated with over 300 new antonym pairs, thereby enriching the lexicon and increasing its size by over 400%. Moreover, the results show important findings on the reliability of patterns in extracting antonyms for Arabic. The Badea system will facilitate the enrichment of ontological lexicons, which can be very useful in any Arabic natural language processing system that requires semantic relation extraction.
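The workflow the abstract describes (seed pairs, then surface patterns, then new candidate pairs) can be sketched as follows. This is a minimal illustration, not the Badea implementation: the seed pairs, corpus sentences and pattern template are invented English examples, whereas the real system works on Arabic seeds and the KSUCCA/KACSTAC/CAC corpora.

```python
import re

# Illustrative seed antonym pairs and a toy corpus (assumptions, not the paper's data).
seeds = [("hot", "cold"), ("rich", "poor")]
corpus = [
    "The water can be hot or cold depending on the season.",
    "Neither rich nor poor families were spared.",
    "The coffee was strong or weak depending on the barista.",
]

# Step 1: learn surface patterns from sentences containing a seed pair.
patterns = set()
for a, b in seeds:
    for sent in corpus:
        m = re.search(rf"\b{a}\b(.{{1,15}}?)\b{b}\b", sent)
        if m:
            patterns.add(m.group(1))          # e.g. " or ", " nor "

# Step 2: apply the learned patterns to find candidate new antonym pairs.
candidates = set()
for p in patterns:
    for sent in corpus:
        for x, y in re.findall(rf"\b(\w+)\b{re.escape(p)}\b(\w+)\b", sent):
            if (x, y) not in seeds:
                candidates.add((x, y))

print(patterns)
print(candidates)   # candidate pairs to be validated before enriching the ontology
```

In the real setting the candidates would still need the reliability checks the paper reports before being added to the lexicon.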
This paper presents an integrated language model to improve document relevancy for text queries. To be precise, an integrated stemming-lemmatization (S-L) model was developed and its retrieval performance was compared at three document levels, that is, at top 5, 10 and 15. A prototype search engine was developed and fifteen queries were executed. The mean average precisions revealed the S-L model to outperform the baseline (i.e. no language processing), stemming and lemmatization models at all three document levels. These results were also supported by the histogram precisions, which illustrated that the integrated model improves document relevancy. However, it should be noted that the precision differences between the various models were insignificant. Overall, the study found that when language processing techniques, that is, stemming and lemmatization, are combined, more relevant documents are retrieved.

1.0 INTRODUCTION The use of the internet all over the world has caused information size to increase, hence making it possible for large volumes of information to be retrieved by users. However, this phenomenon also makes it difficult for users to find relevant information; therefore, proper information retrieval techniques are needed. Information retrieval can be defined as "a problem-oriented discipline concerned with the problem of the effective and efficient transfer of desired information between human generator and human user" [1]. In short, information retrieval aims to provide users with those documents that will satisfy their information need. Many information retrieval algorithms have been proposed, and some of the popular ones include the traditional Boolean model (i.e. based on binary decisions), the vector space model (i.e. compares user queries with documents found in collections and computes their similarities), and the probabilistic model (i.e. based on probability theory to model uncertainties involved in retrieving data), among others. Over the years, information retrieval has evolved to include text retrieval in different languages, thus giving birth to language models. The language model is particularly concerned with identifying how likely a particular string in a specific language is to occur [2]. A popular technique used in language modelling is the N-gram model, which predicts the next word based on the previous N-1 words [3]. Other popular techniques include stemming and lemmatization.
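The paper does not spell out the exact S-L pipeline in this abstract, so the sketch below only illustrates one plausible way to combine the two steps: lemmatize each term first (dictionary-based), then stem the result (rule-based). NLTK's PorterStemmer and WordNetLemmatizer are used purely for illustration and the sample query and document are invented.

```python
# Minimal sketch of a combined stemming-lemmatization (S-L) preprocessing step.
# Requires the WordNet corpus: nltk.download('wordnet') on first use.
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def sl_normalize(tokens):
    """Lemmatize first, then stem, so query and document terms share one form."""
    return [stemmer.stem(lemmatizer.lemmatize(t.lower())) for t in tokens]

query = "retrieving studies".split()
doc = "the study retrieved relevant documents".split()
print(sl_normalize(query))   # e.g. ['retriev', 'studi']
print(sl_normalize(doc))     # matching normalized forms improve recall of relevant documents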
Nowadays, distributed storage is adopted to alleviate delay-tolerant networking (DTN) congestion, but transmission reliability during the congestion period remains an issue. In this paper, we propose a multi-custodian distributed storage (MCDS) framework that includes a set of algorithms to determine when appropriate bundles should be migrated to (or retrieved from) suitable custodians, so that we can relieve DTN congestion and improve transmission reliability simultaneously. MCDS adopts multiple custodians to temporarily store duplicates of a migrated bundle in order to release DTN congestion. Thus, MCDS has more opportunities to retrieve the migrated bundles when network congestion is mitigated. Two performance metrics are used to evaluate the simulation results: the goodput ratio (GR), which represents the QoS of data transmission, and the retrieved loss ratio (RLR), which reflects the performance of reliable transmission. We also use another distributed storage mechanism, based on single-custodian distributed storage (SCDS), to evaluate MCDS. Simulation results show that MCDS has better GR and RLR in almost all simulation cases. Across the various scenarios, the GR and RLR of MCDS are 10.6%-18.4% and 23.2%-36.8% higher, respectively, than those of SCDS.
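The abstract names the two evaluation metrics but not their formulas, so the snippet below is only a hedged reading of them: GR is assumed to be the share of sent bundles delivered as useful data, and RLR the share of migrated bundles that are later retrieved from their custodians. The counts are invented.

```python
# Assumed metric definitions for illustration only; the paper's exact formulas may differ.
def goodput_ratio(delivered_bundles, sent_bundles):
    """GR: fraction of sent bundles that reach the destination as useful data."""
    return delivered_bundles / sent_bundles

def retrieved_loss_ratio(retrieved_bundles, migrated_bundles):
    """RLR: fraction of bundles migrated to custodians that are successfully retrieved."""
    return retrieved_bundles / migrated_bundles

print(goodput_ratio(850, 1000))        # 0.85
print(retrieved_loss_ratio(230, 300))  # ~0.77
```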
Many studies have been conducted on modeling the underlying non-linear relationship between pricing attributes and property prices in order to forecast housing sales prices. In recent years, more advanced non-linear modeling techniques such as Artificial Neural Networks (ANN) and Fuzzy Inference Systems (FIS) have emerged as effective techniques for predicting house prices. In this paper, we propose a fuzzy least-squares regression-based (FLSR) model to predict the prices of real estates. A comprehensive comparative study of ANN, the Adaptive Neuro Fuzzy Inference System (ANFIS) and FLSR in terms of prediction accuracy and computational complexity has been carried out. ANN has been widely used to forecast real estate prices for many years, while ANFIS has been introduced more recently. On the other hand, FLSR is comparatively new. To the best of our knowledge, no property price prediction using FLSR had been developed until recently. Besides, a detailed comparative evaluation of the performance of FLSR against other modeling approaches for property price prediction could not be found in the existing literature. Simulation results show that FLSR provides a superior prediction function compared to ANN and FIS in capturing the functional relationship between dependent and independent real estate variables, and has the lowest computational complexity.

1.0 INTRODUCTION A real estate entity is an embodiment of the physical land and all its improvements, together with all the rights, interests, benefits and liabilities arising from ownership of the entity. The valuation of real estate is thus an exercise in providing a quantitative measure of the benefits and liabilities accruing from that ownership [1]. The conduct of professional real estate valuation is the domain of appraisers or assessors. In arriving at a value estimate, these professionals need to relate to important economic principles that underlie the operation of the real estate market. These include the principles of supply and demand, competition, substitution, anticipation and change. Common to all these principles is their direct and indirect effect on the degree of utility and productivity of a property. Consequently, it may be stated that the utility of real estate reflects the combined influences of all market forces that come to bear upon the value of a real estate parcel. In practice, the sales comparison method or market-value approach has been the traditional and by far the most common method adopted for real estate valuation, particularly for residential real estate [2]. Using this method, the value of a subject real estate is estimated on the basis of sales of other comparable properties. This is to establish an estimate of the market value of the subject real estate, which is deemed to be the estimated amount for which a real estate should exchange on the date of valuation between a willing buyer and a willing seller in an arm's length transaction after proper marketing, wherein the parties have each acted knowledgeably, prudently and without compulsion [3]. The sales comparison method rests on the notion that there exists a direct relationship between the market value of a subject real estate and the selling prices of the comparable properties, whereby the latter represent competitive investment alternatives in the market.
Value influences such as the physical characteristics and location qualities are considered in analysing the comparability, in a process that embeds the consideration of supply and demand in leading to the final opinion on the value estimate. In practice, valuers utilise historical
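To make the regression idea concrete, the sketch below fits an ordinary least-squares baseline to crisp housing attributes. It is only a stand-in: the paper's FLSR extends this by treating coefficients and observations as fuzzy numbers, and that extension is not reproduced here. All attribute values and prices are made up.

```python
import numpy as np

# Illustrative least-squares baseline for price prediction on crisp attributes.
X = np.array([   # [floor area (m^2), bedrooms, distance to city centre (km)]
    [110, 3, 5.0],
    [85, 2, 8.5],
    [150, 4, 3.0],
    [70, 2, 12.0],
])
y = np.array([520_000, 360_000, 750_000, 280_000])   # observed prices (invented)

A = np.hstack([X, np.ones((X.shape[0], 1))])         # add an intercept column
coef, *_ = np.linalg.lstsq(A, y, rcond=None)         # crisp regression coefficients

new_house = np.array([100, 3, 6.0, 1.0])
print(coef)
print(new_house @ coef)                              # predicted price for a new parcel
```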
The energy dissipated as heat at each utilization level of a data center server is empirically measured and stored as a thermal-profile. These thermal-profiles are used to predict the outlet temperatures of the corresponding servers for current and future utilization. The predicted outlet temperature is an important parameter for energy-efficient thermal-aware workload scheduling and workload migration in green data centers. This paper presents three models for outlet temperature prediction on virtualized data center servers based on the thermal-profile. The best-case scenario managed to predict the outlet temperature with a negligible error of 0.3 degrees Celsius. Monitoring systems in data centers periodically observe the environment and performance of tens to hundreds of servers. Various parameters (e.g., temperature and utilization) are monitored for each server. This data is required by data center infrastructure management (DCIM) [1] tools and workload scheduling systems [2] to achieve energy-efficient and proficient utilization of data center servers. Apart from the idle energy consumption, the total energy consumed and dissipated as heat by each server increases with every utilization level increment, and vice versa. The energy usage of a virtualized server involves the virtualized instances of operating systems called virtual machines (VMs). From now on, the word server is used interchangeably with virtualized server. As long as the hardware configuration of a server is not altered, each heterogeneous server will dissipate a different but definite amount of heat for each discrete utilization level. This characteristic heat is empirically measured and stored as a thermal-profile for each server. The thermal-profile is used to predict the outlet temperature of a server for a given value of CPU usage through thermal-prediction modeling. A thermal prediction model eliminates at least one of the parameters, e.g., outlet temperature, from the data center monitoring system without loss of performance and accuracy. Similarly, the number of thermal sensors used for outlet temperature monitoring is reduced. The ability of a monitoring system to generate accurate thermal predictions offline makes the prediction an essential parameter for thermal-aware workload scheduling and thermal-aware workload migration for load balancing. Based on the thermal-profile, this paper presents multiple prediction models to predict the outlet temperature of data center servers, along with a comparison matrix among these models. The test results show that the outlet
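A thermal-profile of this kind can be thought of as a table of (utilization, measured outlet temperature) pairs per server. The sketch below predicts the outlet temperature for an unseen utilization level by linear interpolation over such a table; the measurement values are illustrative and the paper's actual prediction models are not reproduced.

```python
import numpy as np

# Illustrative thermal-profile for one server: CPU utilization (%) vs. measured outlet temperature.
utilization = np.array([0, 25, 50, 75, 100])             # %
outlet_temp = np.array([24.0, 27.5, 31.0, 34.8, 38.5])   # degrees Celsius

def predict_outlet_temp(cpu_util):
    """Predict outlet temperature for a given CPU utilization level via interpolation."""
    return float(np.interp(cpu_util, utilization, outlet_temp))

print(predict_outlet_temp(60))   # ~32.5 C for 60% utilization
```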
Mutation testing has been neglected by researchers because of the high cost associated with the technique. To manage this issue, researchers have developed cost reduction strategies that aim to reduce the overall cost of mutation while maintaining the effectiveness and efficiency of testing. The purpose of this research paper is to present a new cost reduction strategy that cuts the cost of mutation testing by reducing the number of mutation operators used. The experimental part of the paper focuses on the implementation of this strategy on five different Java applications. The results of the experiment are used to evaluate the efficiency and quantify the savings of our approach compared to two other existing mutation testing strategies.
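The general idea behind operator-based cost reduction can be illustrated by comparing the mutation score and mutant count of a reduced operator set against the full set. The operator names, mutant counts and chosen subset below are invented for illustration and do not come from the paper's experiment.

```python
# Hypothetical mutation results per operator (illustrative numbers only).
full_set = {
    "AOR": {"mutants": 120, "killed": 100},   # arithmetic operator replacement
    "ROR": {"mutants": 150, "killed": 135},   # relational operator replacement
    "LCR": {"mutants": 60, "killed": 48},     # logical connector replacement
    "UOI": {"mutants": 200, "killed": 150},   # unary operator insertion
}
reduced_set = {"AOR", "ROR"}                  # operators kept by a reduction strategy

def mutation_score(operators):
    killed = sum(full_set[o]["killed"] for o in operators)
    mutants = sum(full_set[o]["mutants"] for o in operators)
    return killed / mutants, mutants

print(mutation_score(full_set))       # score and cost with all operators
print(mutation_score(reduced_set))    # comparable score from far fewer mutants
```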
Software architectures have become one of the most crucial aspects of software engineering. Software architectures let designers specify systems in terms of components and their relations (i.e., connectors). These components, along with their relations, can then be verified to check whether their behaviours meet designers' expectations. XCD is a novel architecture description language which promotes contractual specification of software architectures and their automated formal verification in SPIN. XCD allows designers to formally verify their system specifications for a number of properties, i.e., (i) incomplete functional behaviour of components, (ii) wrong use of services operated by system components, (iii) deadlock, (iv) race conditions, and (v) buffer overflows in the case of asynchronous (i.e., event-based) communications. In addition to these properties, designers can specify their own properties in linear temporal logic and check their correctness. In this paper, I discuss XCD and its support for formal verification of software architectures through a simple shared-data access case study.
A Distributed Virtual Environment (DVE) is a shared application consisting of many objects, which can be accessed by many users. Many methods have been used to scale DVEs, such as dividing the simulation workload, dynamic load balancing among servers, and creating alternative architectures. However, they may not accommodate many objects and users. In this paper, we explore the approaches used to scale DVEs and then determine the characteristics of the existing approaches. With those characteristics, we compared existing approaches based on three parameters: the number of simulations per region, implementation, and the number of objects managed by a simulator. The results show that all approaches use the same viewpoint, called the present viewpoint, in developing the DVE. It views the DVE as a world where all objects and activities are managed by a simulator. The results also show that this viewpoint contributes to the limitations of current DVE performance. In response to these results, we further propose a new viewpoint, called the object-based viewpoint, to generate an object-based simulator architecture. The experimental results show that our proposed architecture can provide a large-scale DVE with better performance than the previous architectures.
Automatic script identification in archives of documents is essential for searching for a specific document in order to choose an appropriate Optical Character Recognizer (OCR) for recognition. Besides, identification of one of the oldest historical scripts, such as the Indus script, is challenging and interesting because of inter-script similarities. In this work, we propose a new robust script identification system for Indian scripts that includes Indus documents and other scripts, namely English, Kannada, Tamil, Telugu, Hindi and Gujarati, which helps in selecting an appropriate OCR for recognition. The proposed system explores the spatial relationship between dominant points, namely intersection points, end points and junction points of the connected components in the documents, to extract the structure of the components. The degree of similarity between the scripts is studied by computing the variances of the proximity matrices of the dominant points of the respective scripts. The method is evaluated on 700 scanned document images. Experimental results show that the proposed system outperforms the existing methods in terms of classification rate.
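The similarity cue described above can be sketched as follows: build a proximity (pairwise distance) matrix over the dominant points of a sample and take its variance as a scalar descriptor. The point coordinates are invented, dominant-point detection is assumed to have been done already, and the paper's exact proximity definition may differ.

```python
import numpy as np

# Illustrative dominant points (e.g. intersections, end points, junctions) of one sample.
dominant_points = np.array([[10, 12], [34, 15], [22, 40], [55, 48]], dtype=float)

def proximity_variance(points):
    """Variance of the pairwise-distance (proximity) matrix of dominant points."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    iu = np.triu_indices(len(points), k=1)      # count each unordered pair once
    return d[iu].var()

print(proximity_variance(dominant_points))       # one scalar descriptor per script sample
```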
With the advent of cloud computing, many businesses prefer to store their unstructured documents in the cloud. The preference is to store the encrypted unstructured documents in the cloud for security. In most of these instances, one of the main criteria is to support fast searches without requiring any form of decryption. It is thus important to develop methods and architectures that can perform fast searches without compromising security and return ranked results for a client query. Our technique uses an enhanced version of a symmetric encryption algorithm for unstructured documents, develops a novel secure searchable hierarchical in-memory indexing scheme for each encrypted document using multiple Bloom filters, and constructs a dictionary over a large collection of encrypted unstructured documents. The paper also proposes a dynamic index construction method based on the hierarchical in-memory index to perform fast and parallel rank searches over a large collection of encrypted unstructured documents. To the best of our knowledge, this is a novel contribution that proposes a methodology for constructing a dictionary using a hierarchical in-memory index for performing fast and parallel rank searches over a large collection of encrypted unstructured documents. We introduce the concept of Q-grams for building the encrypted searchable index, and provide multiple Bloom filters for a given encrypted unstructured document or chunk to build encrypted searchable indexes, using a separate Bloom filter for each set of bytes. Our proposed construction enables fast rank searches over encrypted unstructured documents. A detailed study of 44 billion code-words is carried out using off-the-shelf servers to demonstrate the effectiveness of the Layer Indexing method.
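A minimal sketch of the Q-gram-plus-Bloom-filter idea is given below: split text into Q-grams and insert each into a per-document Bloom filter, then test membership without touching the plaintext index structure again. The parameters (Q, filter size, number of hash functions) are illustrative, and the paper's layered/hierarchical construction, ranking and index encryption are not reproduced here.

```python
import hashlib

M, K, Q = 1024, 3, 3     # bits per filter, hash functions, gram length (illustrative)

def qgrams(text, q=Q):
    text = text.lower()
    return {text[i:i + q] for i in range(len(text) - q + 1)}

def positions(gram):
    """Derive K bit positions for a Q-gram from one SHA-256 digest."""
    digest = hashlib.sha256(gram.encode()).digest()
    return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % M for i in range(K)]

def build_filter(text):
    bits = bytearray(M // 8)
    for gram in qgrams(text):
        for p in positions(gram):
            bits[p // 8] |= 1 << (p % 8)
    return bits

def maybe_contains(bits, term):
    """Bloom-filter query: False means definitely absent, True means possibly present."""
    return all(
        bits[p // 8] & (1 << (p % 8))
        for gram in qgrams(term) for p in positions(gram)
    )

index = build_filter("confidential quarterly revenue report")
print(maybe_contains(index, "revenue"))   # True (possible match)
print(maybe_contains(index, "salary"))    # almost certainly False
```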
Speech recognition is an emerging research area with its focus on human-computer interaction (HCI) and expert systems. Speech signals are often tricky to process due to the non-stationary nature of audio signals. The work in this paper presents a system for speaker-independent speech recognition, which is tested on isolated words from three oriental languages, i.e., Urdu, Persian, and Pashto. The proposed approach combines the discrete wavelet transform (DWT) and a feed-forward artificial neural network (FFANN) for the purpose of speech recognition. DWT is used for feature extraction and the FFANN is utilized for classification. The task of isolated word recognition is accomplished by capturing the speech signal, creating a code bank of speech samples, and then applying pre-processing techniques. For classifying a wave sample, a four-layer FFANN model with resilient back-propagation (Rprop) is used. The proposed system yields high accuracy for two and five classes. For the db8 level-5 DWT filter, accuracy rates of 98.40%, 95.73%, and 95.20% are achieved with 10, 15, and 20 classes, respectively. The Haar level-5 DWT filter shows accuracy rates of 97.20%, 94.40%, and 91% for 10, 15, and 20 classes, respectively. The proposed system is also compared with a baseline method, where it shows better performance. The proposed system can be utilized as a communication interface to computing and mobile devices for low-literacy regions.
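The DWT-then-network pipeline can be sketched on synthetic signals as below. Real input would be recorded isolated words, and scikit-learn's MLPClassifier (Adam solver) stands in for an Rprop-trained FFANN, so this illustrates the pipeline rather than the paper's exact training setup; the feature choice (per-band mean and standard deviation) is also an assumption.

```python
import numpy as np
import pywt
from sklearn.neural_network import MLPClassifier

def dwt_features(signal, wavelet="db8", level=5):
    """Decompose a signal with the DWT and keep compact per-band statistics."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    return np.array([f(c) for c in coeffs for f in (np.mean, np.std)])

# Three toy "word" classes: noisy sinusoids at different frequencies (illustrative data).
rng = np.random.default_rng(0)
n_per_class, n_samples = 30, 2048
signals, labels = [], []
for cls, freq in enumerate([5, 15, 40]):
    t = np.linspace(0, 1, n_samples)
    for _ in range(n_per_class):
        signals.append(np.sin(2 * np.pi * freq * t) + 0.3 * rng.standard_normal(n_samples))
        labels.append(cls)

X = np.array([dwt_features(s) for s in signals])
y = np.array(labels)

clf = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))     # training accuracy on the toy data
```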
Evidence indicates that risks in IT projects that are not effectively identified and managed during the project life cycle can contribute to project failure. Traditional risk assessment methods usually model risks with objective probabilities based on the expected frequency of repeatable events. Meanwhile, managers prefer to represent likelihoods linguistically because of the uncertainty and vagueness of risk factors. The objective of this paper is to identify risk mitigation strategies in software development projects from the perspectives of software practitioners and determine the effectiveness of these strategies. We explore the use of fuzzy methods to overcome the problems associated with probabilistic modelling through a set of questionnaire surveys conducted among 3000 IT practitioners and analysed using the Tukey-B test, Kendall's test and the post hoc Tukey HSD test. We apply a Fuzzy Membership Function (Fuzzy-MBF) as an appropriate mechanism for dealing with the subjectivity in the assessment of risk factors at different stages of a software development life cycle. The proposed Fuzzy-MBF offers a quantitative evaluation of risk factors and provides a systematic evaluation of risk and visualization of results.
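A triangular membership function is one common way to map a linguistic likelihood onto the [0, 1] scale the abstract alludes to. The sketch below uses an invented three-term scale on a 0-10 expert rating; the paper's actual Fuzzy-MBF parameters and linguistic terms are not reproduced here.

```python
def triangular(x, a, b, c):
    """Membership of x in a triangular fuzzy set with support [a, c] and peak b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

# Illustrative linguistic scale for risk likelihood on a 0-10 expert rating (assumed values).
scale = {"low": (0, 2, 4), "moderate": (3, 5, 7), "high": (6, 8, 10)}
rating = 6.0
memberships = {label: round(triangular(rating, *abc), 2) for label, abc in scale.items()}
print(memberships)   # e.g. {'low': 0.0, 'moderate': 0.5, 'high': 0.0}
```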
Customer defection or "churn" rate is critically important since it leads to serious business los... more Customer defection or "churn" rate is critically important since it leads to serious business loss. Therefore, many telecommunication companies and operators have increased their concern about churn management and investigated statistical and data mining based approaches which can help in identifying customer churn. In this paper, a churn prediction framework is proposed aiming at enhancing the predictability of churning customers. The framework is based on combining two heuristic approaches; Fast Fuzzy C-Means (FFCM) and Genetic Programming (GP). Considering the fact that GP suffers three different major problems: sensitivity towards outliers, variable results on various runs, and resource expensive training process, FFCM was first used to cluster the data set and exclude outliers, representing abnormal customers' behaviors, to reduce the GP possible sensitivity towards outliers and training resources. After that, GP is applied to develop a classification tree. For the purpose of this work, a data set was provided by a major Jordanian telecommunication mobile operator.
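The preprocessing step the framework describes, clustering records and dropping weak-membership points as outliers before training the classifier, can be sketched with plain fuzzy c-means. This is not the paper's Fast FCM variant, and the data, membership threshold and cluster count are illustrative.

```python
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, iters=100, seed=0):
    """Minimal fuzzy c-means: returns cluster centres and the membership matrix."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)                    # each row sums to 1
    for _ in range(iters):
        Um = U ** m
        centres = (Um.T @ X) / Um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)         # standard FCM membership update
    return centres, U

# Two synthetic customer-behaviour clusters plus one extreme record (invented data).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2)), [[15.0, -15.0]]])

centres, U = fuzzy_c_means(X)
keep = U.max(axis=1) > 0.6          # weak maximum membership => treated as an outlier
print((~keep).sum(), "of", len(X), "records flagged as outliers before GP training")
```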
The aim of this research is to develop and propose a single-layer semi-supervised feed-forward neural network clustering method with one-epoch training in order to solve the problems of low training speed, low accuracy, and the high time and memory complexity of clustering. A code book of non-random weights is learned directly from the input data. Then, the best match weight (BMW) vector is mined from the code book, and consequently an exclusive total threshold for each input datum is calculated based on the BMW vector. The input data are clustered based on their exclusive total thresholds. Finally, the method assigns a class label to each input datum by using a K-step activation function to compare the total thresholds of the training set and the test set. The class labels of other unlabeled and unknown input test data are predicted based on their clusters or a trial-and-error technique, and the number of clusters and the density of each cluster are updated. In order to evaluate the results, the proposed method was used to cluster five datasets, namely the breast cancer Wisconsin, Iris, Spam, Arcene and Yeast datasets from the University of California Irvine (UCI) repository and a breast cancer dataset from the University of Malaya Medical Centre (UMMC), and its results were compared with the results of several related methods. The experimental results show the superiority of the proposed method.
E-learning is becoming the new paradigm of learning and training, especially in Higher Educational Institutions (HEIs) around the globe. HEIs in developing countries are struggling to shift to this new paradigm, which would facilitate accommodating increasingly more learners in their own places and with their own time-constraint choices. E-learning has not gained as much attention in developing countries as anticipated in the last decade. Moreover, very little work has been done in this area of research in developing countries like Pakistan. This study contributes a hierarchical model of the challenges affecting the integration of information and communication technology in Pakistan's HEIs. This study also contributes strategies and recommendations to overcome these challenges by providing a roadmap for the implementation of e-learning systems in developing countries. An empirical research method was employed, with two surveys conducted among e-learning experts from different public universities. The factor analysis method was used to categorize challenges, while the Analytical Hierarchy Process (AHP) method was utilized to prioritize the identified challenges. The findings revealed 17 critical challenges, which were then categorized into 5 dimensions. The study's implications in terms of research and practice, limitations and future research directions are also discussed.
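The AHP prioritization step mentioned above boils down to deriving priority weights from a pairwise comparison matrix, usually via its principal eigenvector, and checking consistency. The 3x3 matrix below is illustrative and does not reflect the study's expert judgements.

```python
import numpy as np

# Hypothetical pairwise comparisons of three challenges on Saaty's 1-9 scale.
A = np.array([
    [1.0, 3.0, 5.0],
    [1 / 3, 1.0, 2.0],
    [1 / 5, 1 / 2, 1.0],
])

eigvals, eigvecs = np.linalg.eig(A)
k = np.argmax(eigvals.real)
weights = np.abs(eigvecs[:, k].real)
weights /= weights.sum()                     # normalized priority vector

n = A.shape[0]
ci = (eigvals.real[k] - n) / (n - 1)         # consistency index
ri = 0.58                                    # Saaty's random index for n = 3
print(weights)                               # e.g. ~[0.65, 0.23, 0.12]
print(ci / ri < 0.1)                         # consistency ratio within the usual 0.1 bound?
```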
Software Configuration Management (SCM) aims to provide a controlling mechanism for the evolution of software artifacts created during the software development process. Controlling this evolution requires many activities, such as the construction and creation of versions, the computation of mappings and differences between versions, the combining of two or more versions, and so on. Traditional SCM systems are file-based. File-based SCM systems are not adequate for performing software configuration management activities because they treat software artifacts as a set of text files, while today software development is model-driven and models are the main artifacts produced in the early phases of the software life cycle. New challenges of model mapping, differencing, merging (combining two or more versions), and conflict detection (identifying conflicting changes by multiple users) arise when applying file-based solutions to models. The goal of this work is to develop a configuration management solution for model representation, mappings and differences which overcomes the challenges faced by traditional SCM systems when the model is the central artifact. Our solution is twofold. The first part deals with model representation. While traditional SCM systems represent models as textual files at a fine-granular level, we represent models as graph structures at a fine-granular level. In the second part, we deal with the issue of model diff, i.e., calculating the mappings and differences between two versions of a model. Since our model diff solution is based on our fine-granular model representation, we overcome not only the problem of the textual representation of models but also produce efficient results for model diff in terms of accuracy, execution time, tool independence and other evaluation parameters. We performed a controlled experiment using the open-source Eclipse Modeling Framework and compared our approach with the open-source tool EMF Compare. The results demonstrate the efficiency of our approach.
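The graph-based diff idea can be sketched as follows: represent each model version as identified elements with properties plus a set of edges, then compute added, removed and changed elements by set operations. The structures below are illustrative and are not the internal representation used in the paper's EMF Compare comparison.

```python
# Two hypothetical versions of a small class model, keyed by stable element ids.
old = {
    "nodes": {"C1": {"name": "Order"}, "C2": {"name": "Customer"}},
    "edges": {("C1", "C2", "association")},
}
new = {
    "nodes": {"C1": {"name": "PurchaseOrder"}, "C3": {"name": "Invoice"}},
    "edges": {("C1", "C3", "association")},
}

def model_diff(old, new):
    added = new["nodes"].keys() - old["nodes"].keys()
    removed = old["nodes"].keys() - new["nodes"].keys()
    changed = {k for k in old["nodes"].keys() & new["nodes"].keys()
               if old["nodes"][k] != new["nodes"][k]}
    edges_added = new["edges"] - old["edges"]
    edges_removed = old["edges"] - new["edges"]
    return added, removed, changed, edges_added, edges_removed

print(model_diff(old, new))
# -> C3 added, C2 removed, C1 changed (renamed), one edge added, one edge removed
```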
In the early stages of learning computer programming, Computer Science (CS) minors share a misconception of what programming is. In order to address this problem, FMAS, a flowchart-based multi-agent system, was developed to familiarize students who have no prior knowledge of programming with the initial stages of learning programming. The aim is to improve students' problem-solving skills and to introduce them to basic programming algorithms prior to surface structure, using an automatic text-to-flowchart conversion approach. Therefore, students can focus less on language and syntax and more on designing solutions through flowchart development. The way text-to-flowchart conversion, as a visualization-based approach, is employed in FMAS to engage students in flowchart development for subsequent programming stages is discussed in this paper. Finally, an experimental study is devised to assess the success of FMAS, and positive feedback is achieved. Therefore, the use of FMAS in practice is supported, as the results indicate considerable gains for the experimental group over the control group. The results also show that the automatic text-to-flowchart conversion approach applied in FMAS successfully motivated nearly all participants in problem-solving activities. Consequently, the results suggest further development of our proposed approach in the form of an Intelligent Tutoring System (ITS) to make the early stages of learning programming more encouraging for students.
One of the most exigent features of a risk is risk alteration, which can exacerbate its consequences and make its management difficult. Therefore, good risk management models should be able to identify risks and monitor changes to the risks as the project progresses. This feature is not emphasized in current risk management models, and this has resulted in a high rate of failure in software risk management. This paper discusses the development of a software risk management model that uses features of an embedded audit component as a verifier core. Special emphasis is on managing the risks of the risk management process itself, which is done by re-monitoring the risks and activities through the verifier core. The model includes four main phases: risk identification; measurement; assessment; and mitigation and contingency planning. In order to evaluate the model, a six-month case study was conducted using the customer relationship management system of an industrial design company. The use of the proposed model produced the following results: more accurate risk classification (phase 1); a more exact definition of the deviation rate from the established schedule (phase 2); the model adapts well to changes in the risk factors and makes better assessments of the consequences (phase 3); and in implementing the mitigation and contingency plan, the dynamic verifier core successfully uncovers ignorable mistakes and also helps to reduce or lessen the consequences (phase 4). The proposed model has proven to be effective in reducing unforeseen risks. This will improve the success rates of software projects.
This paper presents a comparative study of feature selection methods for Urdu text categorization. Five well-known feature selection methods were analyzed by means of six recognized classification algorithms: support vector machines (with linear, polynomial and radial basis kernels), naive Bayes, k-nearest neighbour (KNN), and decision tree (i.e. J48). Experiments are performed on two test collections, including a standard EMILLE collection and a naive collection. We found that the information gain, chi-square statistics, and symmetrical uncertainty feature selection methods performed uniformly well in most cases. We also found that no single feature selection technique is best for every classifier. That is, naive Bayes and J48 have an advantage with gain ratio over other feature selection methods. Similarly, the support vector machine (SVM) and KNN classifiers showed top performance with information gain. Generally, linear SVM with any of the feature selection methods outperformed the other classifiers on the moderate-size naive collection. Conversely, naive Bayes with any of the feature selection techniques has an advantage over the other classifiers for the small-size EMILLE corpus.
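To make the feature scoring step concrete, the sketch below ranks terms with the chi-square statistic, one of the methods compared above, using scikit-learn. A tiny English toy corpus stands in for the Urdu collections; with real data the same call ranks thousands of terms against the class labels.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Illustrative two-class toy corpus (0 = sports, 1 = politics).
docs = [
    "the striker scored a late goal in the football match",
    "the goalkeeper saved the penalty during the match",
    "parliament passed the new budget bill today",
    "the senate debated the election bill this week",
]
labels = [0, 0, 1, 1]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)

selector = SelectKBest(chi2, k=4).fit(X, labels)      # keep the 4 highest-scoring terms
terms = vec.get_feature_names_out()
kept = [terms[i] for i in selector.get_support(indices=True)]
print(kept)    # the terms most associated with a class, e.g. 'bill' and 'match'
```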
This study was conducted to investigate the moderating effect of health professionals' working experience on the relationships between factors of the Health Information System Security Policies Compliance Behaviour (HISSPC) model. A survey (n = 454) was conducted to test the differences between high-experience and low-experience health professionals who were Health Information System (HIS) users. The HISSPC model was tested using the partial least squares (PLS) approach, with results indicating the coefficient of determination (R²) for the high-experience group (63 percent) to be slightly higher than that for the low-experience group (60 percent). Statistical differences were noted for the relationship between management support and users' compliance behaviour in both groups, with a stronger relationship for low-experience HIS users compared to high-experience HIS users. In contrast, perceived susceptibility was found to significantly influence highly experienced users to comply with HIS security policies, whereas it had no significant effect for the low-experience group. The overall moderating effect size for high-experience users was approximately 0.07 (i.e. small) and no moderating effect was observed for the low-experience group (f² = 0.01). It is believed that the findings will provide better guidelines to fellow researchers and policy makers in improving information security behaviour among health professionals in hospitals, particularly those with varying working experience.
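The effect sizes quoted above are of the kind usually computed with Cohen's f-squared, f² = (R² with the effect − R² without it) / (1 − R² with the effect). The worked example below uses illustrative R² inputs chosen only to reproduce magnitudes of about 0.07 and 0.01; they are not the study's exact figures.

```python
# Cohen's f-squared effect size (standard formula; inputs below are illustrative).
def f_squared(r2_with, r2_without):
    return (r2_with - r2_without) / (1 - r2_with)

print(round(f_squared(0.63, 0.604), 3))   # ~0.07, a small moderating effect
print(round(f_squared(0.60, 0.596), 3))   # ~0.01, effectively no moderation
```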
Ontological lexicons are considered a rich source of knowledge for the development of various nat... more Ontological lexicons are considered a rich source of knowledge for the development of various natural language processing tools and applications; however, they are expensive to build, maintain, and extend. In this paper, we present the Badea system for the semi-automated extraction of lexical relations, specifically antonyms using a pattern-based approach to support the task of ontological lexicon enrichment. The approach is based on an ontology of " seed " pairs of antonyms in the Arabic language; we identify patterns in which the pairs occur and then use the patterns identified to find new antonym pairs in an Arabic textual corpora. Experiments are conducted on Badea using texts from three Arabic textual corpuses: KSUCCA, KACSTAC, and CAC. The system is evaluated and the patterns' reliability and system performance is measured. The results from our experiments on the three Arabic corpora show that the pattern-based approach can be useful in the ontological enrichment task, as the evaluation of the system resulted in the ontology being updated with over 300 new antonym pairs, thereby enriching the lexicon and increasing its size by over 400%. Moreover, the results show important findings on the reliability of patterns in extracting antonyms for Arabic. The Badea system will facilitate the enrichment of ontological lexicons that can be very useful in any Arabic natural language processing system that requires semantic relation extraction.
This paper presents an integrated language model to improve document relevancy for text-queries. ... more This paper presents an integrated language model to improve document relevancy for text-queries. To be precise, an integrated stemming-lemmatization (S-L) model was developed and its retrieval performance was compared at three document levels, that is, at top 5, 10 and 15. A prototype search engine was developed and fifteen queries were executed. The mean average precisions revealed the S-L model to outperform the baseline (i.e. no language processing), stemming and also the lemmatization models at all three levels of the documents. These results were also supported by the histogram precisions which illustrated the integrated model to improve the document relevancy. However, it is to note that the precision differences between the various models were insignificant. Overall the study found that when language processing techniques, that is, stemming and lemmatization are combined, more relevant documents are retrieved. 1.0 INTRODUCTION The use of internet all over the world has caused information size to increase, hence making it possible for large volumes of information to be retrieved by the users. However, this phenomenon also makes it difficult for users to find relevant information, therefore proper information retrieval techniques are needed. Information retrieval can be defined as " a problem-oriented discipline concerned with the problem of the effective and efficient transfer of desired information between human generator and human user " [1]. In short, information retrieval aims to provide users with those documents that will satisfy their information need. Many information retrieval algorithms were proposed, and some of the popular ones include the traditional Boolean model (i.e. based on binary decisions), vector space model (i.e. compares user queries with documents found in collections and computes their similarities), and probabilistic model (i.e. based on the probability theory to model uncertainties involved in retrieving data), among others. Over the years, information retrieval has evolved to include text retrieval in different languages, and thus giving birth to language models. The language model is particularly concerned with identifying how likely it is for a particular string in a specific language to be repeated [2]. A popular technique used in the language model is the N-gram model which predicts a preceding word based on previous N-1 words [3]. Other popular techniques include stemming and lemmatization.
Nowadays, distributed storage is adopted to alleviate Delay-tolerant networking (DTN) congestion,... more Nowadays, distributed storage is adopted to alleviate Delay-tolerant networking (DTN) congestion, but reliability transmission during the congestion period remains an issue. In this paper, we propose a multi-custodians distributed storage (MCDS) framework that includes a set of algorithms to determine when should appropriate bundles be migrated (or retrieved) to (or from) the suitable custodians, so that we can solve DTN congestion and improve reliability transmission simultaneously. MCDS adopts multiple custodians to temporarily store the duplications of migrated bundle for releasing DTN congestion. Thus, MCDS has more opportunities to retrieve the migrated bundles when network congestion is mitigated. Two performance metrics are used to evaluate the simulation results: the goodput ratio (GR) that represents QoS of data transmission, and the retrieved loss ratio (RLR) that reflects the performance of reliability transmission. We also use another distributed storage mechanism based on single-custodian distributed storage (SCDS) to evaluate MCDS. Simulation results show that MCDS has better GR and RLR in almost all simulation cases. For various scenarios, the GR and RLR of MCDS are in the range of 10.6%-18.4% and 23.2%-36.8%, respectively, which are higher than those of SCDS.
Many studies have been conducted for modeling the underlying non-linear relationship between pric... more Many studies have been conducted for modeling the underlying non-linear relationship between pricing attributes and price of property to forecast the housing sales prices. In recent years, more advanced non-linear modeling techniques such as Artificial Neural Networks (ANN) and Fuzzy Inference Systems (FIS) have emerged as effective techniques to predict the house prices. In this paper, we propose a fuzzy least-squares regression-based (FLSR) model to predict the prices of real estates. A comprehensive comparison studies in terms of prediction accuracy and computational complexity of ANN, Adaptive Neuro Fuzzy Inference System (ANFIS) and FLSR has been carried out. ANN has been widely used to forecast the price of real estates for many years while ANFIS has been introduced recently. On the other hand, FLSR is comparatively new. To the best of our knowledge, no property prices prediction using FLSR was developed until recently. Besides, a detailed comparative evaluation on the performance of FLSR with other modeling approaches on property price prediction could not be found in the existing literature. Simulation results show that FLSR provides a superior prediction function as compared to ANN and FIS in capturing the functional relationship between dependent and independent real estate variables and has the lowest computational complexity. 1.0 INTRODUCTION A real estate entity is an embodiment of the physical land and all its improvements together with all the rights, interests, benefits and liabilities arising from ownership of the entity. The valuation of real estate is thus an exercise in providing a quantitative measure of the benefits and liabilities accruing from the ownership [1]. The conduct of professional real estate valuation is the domain of the appraisers or assessors. In arriving at a value estimate, the professionals need to relate to important economic principles that underlie the operation of the real estate market. These include the principles of supply and demand, competition, substitution, anticipation and change. Common to all these principles is their direct and indirect effect on the degree of utility and productivity of a property. Consequently, it may be stated that the utility of real estate reflects the combined influences of all market forces that come to bear upon the value of a real estate parcel. In practice, the sales comparison method or market-value approach has been the traditional and by far the most common method adopted for real estate valuation, particularly for residential real estate [2]. Using this method, the value of a subject real estate is estimated on the basis of sales of other comparable properties. This is to establish an estimate of the market value of the subject real estate, which is deemed to be the estimated amount for which a real estate should exchange on the date of valuation between a willing buyer and a willing seller in an arm's length transaction after proper marketing wherein the parties have each acted knowledgeably, prudently and without compulsion [3]. The sales comparison method rests on the notion that there exists a direct relationship between the market value of a subject real estate and the selling prices of the comparable properties whereby the latter represent competitive investment alternatives in the market. 
Value influences such as the physical characteristics and location qualities are considered in analysing the comparability, in a process that embeds the consideration of supply and demand in leading to the final opinion on the value estimate. In practice, valuers utilise historical
The energy dissipated as heat for each utilization level of a data center server is empirically m... more The energy dissipated as heat for each utilization level of a data center server is empirically measured and stored as the thermal-profile. These thermal-profiles are used to predict the outlet temperatures of the related servers for current and future utilization. The predicted outlet temperature is an important parameter for energy efficient thermal-aware workload scheduling and workload migration in green data centers. This paper presents three models for outlet temperature prediction on virtualized data center servers based on thermal-profile. The best case scenario managed to predict the outlet temperature with a negligible error of 0.3 degree Celsius. Monitoring systems in data centers look-over the environment and performance of tens and hundreds of servers periodically. There are various parameters (e.g., temperature and utilization) monitored for each server. This data is required by data center infrastructure management (DCIM) [1] tools and workload scheduling systems [2] to achieve energy efficient and proficient utilization on data center servers. Apart from the idle energy consumption, the total energy consumed and dissipated as heat by each server is increased for every utilization level increment and vice versa. The energy usage of a virtualized server involves the virtualized instances of operating systems called virtual machines (VMs). From now on, the word server is used interchangeably with virtualized server. As long as the hardware configuration of a server is not altered, each heterogeneous server will dissipate a different but certain amount of heat for each discrete utilization level. This symbolic heat is empirically measured and stored as a thermal-profile for each server. The thermal-profile is used to predict the outlet temperature of a server by a given value of CPU usage through thermal-prediction modeling. A thermal prediction model eliminates at least one of the parameters e.g., outlet temperature, from the data center monitoring system without the loss of performance and accuracy. Similarly, the number of thermal sensors used for outlet temperature monitoring is reduced. The ability of a monitoring system to generate accurate thermal predictions offline, makes the prediction to become an essential parameter for thermal-aware workload scheduling and thermal-aware workload migration for load balancing. Based upon thermal-profile, this paper presents multiple prediction models to predict outlet temperature of data center servers along with a matrix comparison among these models. The test results show that the outlet
Mutation testing has been neglected by researchers because of the high cost associated with the t... more Mutation testing has been neglected by researchers because of the high cost associated with the technique. To manage this issue, researchers have developed cost reduction strategies that aim to reduce the overall cost of mutation, while maintaining the effectiveness and the efficiency of testing. The purpose of this research paper is to present a new cost reduction strategy that cuts the cost of mutation testing through reducing the number of mutation operators used. The experimental part of the paper focuses on the implementation of this strategy on five different java applications. The results of the experiment areused to evaluate the efficiency and quantify the savings of our approach compared to two other existing mutation testing strategies.
Software architectures have become one of the most crucial aspects of software engineering. Softw... more Software architectures have become one of the most crucial aspects of software engineering. Software architec-tures let designers specify systems in terms of components and their relations (i.e., connectors). These components along with their relations can then be verified to check whether their behaviours meet designers' expectations. XCD is a novel architecture description language, which promotes contractual specification of software archi-tectures and their automated formal verifications in SPIN. XCD allows designers to formally verify their system specifications for a number of properties, i.e., (i) incomplete functional behaviour of components, (ii) wrong use of services operated by system components, (iii) deadlock, (iv) race-conditions, and (v) buffer overflows in the case of asynchronous (i.e., event-based) communications. In addition to these properties, designers can specify their own properties in linear temporal logic and check their correctness. In this paper, I discuss XCD and its support for formal verification of software architectures through a simple shared-data access case study.
Distributed Virtual Environment (DVE) is a shared application consisting many objects, which can ... more Distributed Virtual Environment (DVE) is a shared application consisting many objects, which can be accessed by many users. There have been many methods used to scale the DVE such as dividing simulation workload, dynamic load balancing among servers, and creating alternative architectures. However, they may not accommodate many objects and users. In this paper, we explore all approaches used to scale the DVE and then determine the characteristics of the existing approaches. With those characteristics, we compared existing approaches based on three parameters: the number of simulation per region, implementation, and the number of objects managed by simulator. The results show that all approaches use the same viewpoint, called present viewpoint, in developing the DVE. It views DVE as a world where all objects and activities are managed by a simulator. The results also show that this viewpoint contributes in terms of limitations of the current DVEs performance. In response to these results, we further propose a new viewpoint, called object-based viewpoint, to generate object-based simulators architecture. The experiment results show that our proposed architecture can provide a large scale DVE with better performances than the previous architectures.
Automatic script identification in archives of documents is essential for searching a specific do... more Automatic script identification in archives of documents is essential for searching a specific document in order to choose an appropriate Optical Character Recognizer (OCR) for recognition. Besides, identification of one of the oldest historical documents such as Indus scripts is challenging and interesting because of inter script similarities. In this work, we propose a new robust script identification system for Indian scripts that includes Indus documents and other scripts, namely, English, Kannada, Tamil, Telugu, Hindi and Gujarati which helps in selecting an appropriate OCR for recognition. The proposed system explores the spatial relationship between dominant points,namely, intersection points, end points and junction points of the connected components in the documents to extract the structure of the components. The degree of similarity between the scripts is studied by computing the variances of the proximity matrices of dominant points of the respective scripts. The method is evaluated on 700 scanned document images. Experimentalresults show that the proposed system outperforms the existing methods in terms of classification rate.
With the advent of cloud computing, many businesses prefer to store their unstructured documents ... more With the advent of cloud computing, many businesses prefer to store their unstructured documents over the cloud. The preference is to store the encrypted unstructured document over the cloud for security. In most of these instances, one of the main criteria is to support fast searches without requiring any form of decryption. It is thus important to develop methods and architectures that can perform fast searches without compromising security and return the rank results for a client query. Our technique uses the enhanced version of the symmetric encryption algorithm for unstructured documents and develops a novel secure searchable hierarchical in-memory indexing scheme for each encrypted document using multiple Bloom filters and construct a dictionary over a large collection of encrypted unstructured documents. The paper also proposes a dynamic index construction method based on hierarchical in-memory index to perform fast and parallel rank searches over a large collection of encrypted unstructured documents. To the best of our knowledge, this is a novel contribution that propose methodology of constructing a dictionary using hierarchical in-memory index for performing fast and parallel rank searches over a large collection of encrypted unstructured documents. We introduce the concept of Q-gram for building the encrypted searchable index, and provide multiple Bloom filters for a given encrypted unstructured document or a chunk to build encrypted searchable indexes using separate Bloom filter for a set of bytes. Our proposed construction enables fast rank searches over encrypted unstructured documents. A detailed study of 44 billion code-words is worked out using off the shelf serves to demonstrate the effectiveness of Layer Indexing method.
Speech recognition is an emerging research area having its focus on human computer interactions (... more Speech recognition is an emerging research area having its focus on human computer interactions (HCI) and expert systems. Analyzing speech signals are often tricky for processing, due to the non-stationary nature of audio signals. The work in this paper presents a system for speaker independent speech recognition, which is tested on isolated words from three oriental languages, i.e., Urdu,Persian, and Pashto. The proposed approach combines discrete wavelet transform (DWT) and feed-forward artificial neural network (FFANN) for the purpose of speech recognition. DWT is used for feature extraction and the FFANN is utilized for the classification purpose. The task of isolated word recognition is accomplished with speech signal capturing, creating a code bank of speech samples, and then by applying pre-processing techniques.For classifying a wave sample, four layered FFANN model is used with resilient back-propagation (Rprop). The proposed system yields high accuracy for two and five classes.For db-8 level-5 DWT filter 98.40%, 95.73%, and 95.20% accuracy rate is achieved with 10, 15, and 20 classes, respectively. Haar level-5 DWT filter shows 97.20%, 94.40%, and 91% accuracy ratefor 10, 15, and 20 classes, respectively. The proposed system is also compared with a baseline method where it shows better performance. The proposed system can be utilized as a communication interface to computing and mobile devices for low literacy regions.
Evidence indicates that risks in IT projects which are not effectively managed and lack of identi... more Evidence indicates that risks in IT projects which are not effectively managed and lack of identification and management during the life cycle of a project can contribute to their failures. Traditional risk assessment methods usually model risks with objective probabilities based on the expected frequency of repeatable events. Meanwhile, managers prefer to linguistically represent likelihoods because of the uncertainty and vagueness of risk factors. The objective of this paper is to identify risk mitigation strategies in software development projects from the perspectives of software practitioners and determine the effectiveness of these strategies. We explore the use of fuzzy methods to overcome the problems associated with probabilistic modelling through a set of questionnaire surveys which was conducted among 3000 IT practitioners using Tukey-B test, Kendall's test and Post Hoc Tukey HSD test. We apply Fuzzy Membership Function (Fuzzy-MBF) as an appropriate mechanism in dealing with the subjectivity in the assessment of risk factors in different stages of a software development life cycle. The proposed Fuzzy-MBF offers a quantitative evaluation of risk factors and provides a systemic evaluation of risk and visualization of results.
Customer defection or "churn" rate is critically important since it leads to serious business los... more Customer defection or "churn" rate is critically important since it leads to serious business loss. Therefore, many telecommunication companies and operators have increased their concern about churn management and investigated statistical and data mining based approaches which can help in identifying customer churn. In this paper, a churn prediction framework is proposed aiming at enhancing the predictability of churning customers. The framework is based on combining two heuristic approaches; Fast Fuzzy C-Means (FFCM) and Genetic Programming (GP). Considering the fact that GP suffers three different major problems: sensitivity towards outliers, variable results on various runs, and resource expensive training process, FFCM was first used to cluster the data set and exclude outliers, representing abnormal customers' behaviors, to reduce the GP possible sensitivity towards outliers and training resources. After that, GP is applied to develop a classification tree. For the purpose of this work, a data set was provided by a major Jordanian telecommunication mobile operator.
The aim of this research is to develop and propose a single-layer semi-supervised feed forward ne... more The aim of this research is to develop and propose a single-layer semi-supervised feed forward neural network clustering method with one epoch training in order to solve the problems of low training speed, accuracy and high time and memory complexities of clustering. A code book of non-random weights is learned through the input data directly. Then, the best match weight (BMW) vector is mined from the code book, and consequently an exclusive total threshold of each input data is calculated based on the BMW vector. The input data are clustered based on their exclusive total thresholds. Finally, the method assigns a class label to each input data by using a K-step activation function for comparing the total thresholds of the training set and the test set. The class label of other unlabeled and unknown input test data are predicted based on their clusters or trial and error technique, and the number of clusters and density of each cluster are updated. In order to evaluate the results , the proposed method was used to cluster five datasets, namely the breast cancer Wisconsin, Iris, Spam, Arcene and Yeast from the University of California Irvin (UCI) repository and a breast cancer dataset from the University of Malaya Medical center (UMMC), and their results were compared with the results of the several related methods. The experimental results show the superiority of the proposed method.
E-learning is becoming the new paradigm of learning and training, especially in Higher Educational Institutions (HEIs) around the globe. HEIs in developing countries are struggling to shift to this new paradigm, which would allow them to accommodate increasingly more learners in their own places and within their own time constraints. E-learning has not gained as much attention in developing countries as anticipated over the last decade, and very little research has been done in this area in developing countries such as Pakistan. This study contributes a hierarchical model of the challenges affecting the integration of information and communication technology in Pakistan's HEIs. It also devises strategies and recommendations to overcome these challenges, providing a roadmap for the implementation of e-learning systems in developing countries. An empirical research method was employed, with two surveys conducted among e-learning experts from different public universities. Factor analysis was used to categorize the challenges, while the Analytical Hierarchy Process (AHP) was utilized to prioritize them. The findings revealed 17 critical challenges, which were categorized into 5 dimensions. The study's implications for research and practice, its limitations and future research directions are also discussed.
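The abstract does not reproduce the expert judgements, so the snippet below only illustrates the AHP prioritisation mechanics on three hypothetical challenge dimensions: a Saaty-scale pairwise comparison matrix is reduced to a normalised priority vector via its principal eigenvector, followed by the usual consistency-ratio check. The matrix values and dimension names are illustrative assumptions.

```python
# Worked sketch of the AHP prioritisation step on hypothetical dimensions.
import numpy as np

dimensions = ["infrastructure", "pedagogy", "funding"]
# A[i, j] = how much more important dimension i is than dimension j (Saaty 1-9 scale).
A = np.array([
    [1.0, 3.0, 5.0],
    [1/3, 1.0, 2.0],
    [1/5, 1/2, 1.0],
])

eigvals, eigvecs = np.linalg.eig(A)
idx = np.argmax(np.real(eigvals))
principal = np.real(eigvecs[:, idx])
weights = principal / principal.sum()             # normalised priority vector

# Consistency ratio (random index RI = 0.58 for a 3x3 matrix).
lam_max = float(np.real(eigvals[idx]))
ci = (lam_max - len(A)) / (len(A) - 1)
print(dict(zip(dimensions, weights.round(3))), "CR =", round(ci / 0.58, 3))
```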
Software Configuration Management (SCM) aims to provide a controlling mechanism for the evolution of software artifacts created during the software development process. Controlling this evolution requires many activities, such as the construction and creation of versions, the computation of mappings and differences between versions, and the combination of two or more versions. Traditional SCM systems are file-based, and file-based SCM systems are not adequate for these activities because they treat software artifacts as sets of text files, while today's software development is model-driven and models are the main artifacts produced in the early phases of the software life cycle. New challenges of model mapping, differencing, merging (combining two or more versions) and conflict detection (identifying conflicting changes by multiple users) arise when file-based solutions are applied to models. The goal of this work is to develop a configuration management solution for model representation, mapping and differencing that overcomes the challenges faced by traditional SCM systems when the model is the central artifact. Our solution is twofold. The first part deals with model representation: whereas traditional SCM systems represent models as textual files, we represent models as a fine-granular graph structure. The second part deals with model diff, i.e., calculating the mappings and differences between two versions of a model. Since our model diff solution is based on this fine-granular model representation, we not only overcome the problems of textual model representation but also produce efficient results for model diff in terms of accuracy, execution time, tool independence and other evaluation parameters. We performed a controlled experiment using the open-source Eclipse Modeling Framework and compared our approach with the open-source tool EMF Compare. The results demonstrate the efficiency of our approach.
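To make the fine-granular, graph-style representation concrete, here is a minimal sketch of element-level model differencing over two hand-written model versions; the element ids, attributes and three-way classification (added/removed/changed) are illustrative assumptions and not the paper's or EMF Compare's algorithm.

```python
# Minimal sketch of element-level model differencing over a graph-style
# representation: each model element has a stable id, attributes and references.

ModelGraph = dict[str, dict]   # element id -> {"attrs": {...}, "refs": [ids]}

def diff_models(v1: ModelGraph, v2: ModelGraph) -> dict:
    """Classify elements of two model versions as added, removed or changed."""
    added   = [e for e in v2 if e not in v1]
    removed = [e for e in v1 if e not in v2]
    changed = [e for e in v1.keys() & v2.keys() if v1[e] != v2[e]]
    return {"added": added, "removed": removed, "changed": changed}

old = {"Class:Order":   {"attrs": {"abstract": False}, "refs": ["Class:Item"]},
       "Class:Item":    {"attrs": {"abstract": False}, "refs": []}}
new = {"Class:Order":   {"attrs": {"abstract": True},  "refs": ["Class:Item"]},
       "Class:Invoice": {"attrs": {"abstract": False}, "refs": ["Class:Order"]}}

print(diff_models(old, new))
# {'added': ['Class:Invoice'], 'removed': ['Class:Item'], 'changed': ['Class:Order']}
```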
In the early stages of learning computer programming, Computer Science (CS) minors share a misconception of what programming is. To address this problem, FMAS, a flowchart-based multi-agent system, was developed to familiarize students with no prior programming knowledge with the initial stages of learning programming. The aim is to improve students' problem-solving skills and to introduce them to basic programming algorithms prior to surface structure, using an automatic text-to-flowchart conversion approach. Students can therefore focus less on language and syntax and more on designing solutions through flowchart development. This paper discusses how text-to-flowchart conversion, as a visualization-based approach, is employed in FMAS to engage students in flowchart development for subsequent programming stages. Finally, an experimental study was devised to assess the success of FMAS, and positive feedback was obtained. The use of FMAS in practice is therefore supported, as the results indicate considerable gains for the experimental group over the control group. The results also show that the automatic text-to-flowchart conversion approach applied in FMAS successfully motivated nearly all participants in problem-solving activities. Consequently, the results suggest further development of the proposed approach in the form of an Intelligent Tutoring System (ITS) to make the early stages of learning programming more encouraging for students.
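As a loose illustration of the text-to-flowchart idea, the sketch below maps natural-language steps to flowchart node types with simple keyword rules; FMAS's actual multi-agent conversion pipeline is not described in the abstract, so the rule set and node types here are stand-ins.

```python
# Loose sketch of a text-to-flowchart conversion step: natural-language steps
# are mapped to flowchart node types by a trivial keyword rule. This only
# illustrates the idea, not FMAS's actual pipeline.

def to_flowchart(steps: list[str]) -> list[tuple[str, str]]:
    """Return an ordered list of (node_type, text) flowchart nodes."""
    nodes = [("start", "Start")]
    for step in steps:
        kind = "decision" if step.lower().startswith("if ") else "process"
        nodes.append((kind, step))
    nodes.append(("end", "End"))
    return nodes

problem = ["Read the two numbers",
           "If the first is larger, print it",
           "Otherwise print the second"]
for kind, text in to_flowchart(problem):
    print(f"[{kind:8s}] {text}")
```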
One of the most exigent features of a risk is risk alteration, which can exacerbate its consequences and make its management difficult. Good risk management models should therefore be able to identify risks and monitor changes to them as the project progresses. This feature is not emphasized in current risk management models, which has resulted in a high rate of failure in software risk management. This paper discusses the development of a software risk management model that uses the features of an embedded audit component as a verifier core. Special emphasis is placed on managing the risks of the risk management process itself, which is done by re-monitoring the risks and activities through the verifier core. The model includes four main phases: risk identification; measurement; assessment; and mitigation and contingency planning. To evaluate the model, a six-month case study was conducted on the customer relationship management system of an industrial design company. The use of the proposed model produced the following results: more accurate risk classification (phase 1); a more exact definition of the deviation rate from the established schedule (phase 2); good adaptation to changes in the risk factors and better assessment of their consequences (phase 3); and, in implementing the mitigation and contingency plan, successful detection by the dynamic verifier core of mistakes that would otherwise be overlooked, helping to reduce or lessen their consequences (phase 4). The proposed model proved effective in reducing unforeseen risks, which will improve the success rates of software projects.
This paper presents a comparative study of feature selection methods for Urdu text categorization. Five well-known feature selection methods were analyzed by means of six recognized classification algorithms: support vector machines (with linear, polynomial and radial basis kernels), naive Bayes, k-nearest neighbour (KNN), and decision tree (i.e. J48). Experiments were performed on two test collections: the standard EMILLE collection and a naive collection. We found that the information gain, chi-square statistic, and symmetrical uncertainty feature selection methods performed uniformly in most cases. We also found that no single feature selection technique is best for every classifier: naive Bayes and J48 perform better with gain ratio than with other feature selection methods, while support vector machines (SVM) and KNN show top performance with information gain. In general, linear SVM with any of the feature selection methods outperformed the other classifiers on the moderate-size naive collection. Conversely, naive Bayes with any of the feature selection techniques had an advantage over the other classifiers on the small-size EMILLE corpus.
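The shape of such a comparison can be sketched generically: below, scikit-learn's chi-square and mutual-information scorers (the latter used as an information-gain-like stand-in) are paired with a linear SVM and naive Bayes. The Urdu EMILLE and naive collections are not available here, so a small 20newsgroups subset is used purely to make the example runnable, and the feature count k=1000 is an arbitrary assumption.

```python
# Hedged sketch of a feature-selection-by-classifier comparison, not the
# paper's experimental setup: English 20newsgroups data stands in for the
# Urdu corpora, and mutual information stands in for information gain.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

data = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])

for name, scorer in [("chi2", chi2), ("mutual_info", mutual_info_classif)]:
    for clf in (LinearSVC(), MultinomialNB()):
        pipe = make_pipeline(TfidfVectorizer(), SelectKBest(scorer, k=1000), clf)
        acc = cross_val_score(pipe, data.data, data.target, cv=3).mean()
        print(f"{name:12s} {clf.__class__.__name__:13s} acc={acc:.3f}")
```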
This study investigated the moderating effect of health professionals' working experience on the relationships between the factors of the Health Information System Security Policies Compliance Behaviour (HISSPC) model. A survey (n = 454) was conducted to test the differences between high-experience and low-experience health professionals who were Health Information System (HIS) users. The HISSPC model was tested using the partial least squares (PLS) approach, with the coefficient of determination (R²) for the high-experience group (63 percent) slightly higher than that of the low-experience group (60 percent). Statistically significant differences were noted for the relationship between management support and users' compliance behaviour in both groups, with a stronger relationship for low-experience HIS users than for high-experience HIS users. In contrast, perceived susceptibility was found to significantly influence highly experienced users to comply with HIS security policies, whereas it had no significant effect for the low-experience group. The overall moderating effect size for high-experience users was approximately 0.07 (small), and no moderating effect was observed for the low-experience group (f² = 0.01). The findings are expected to provide better guidelines for researchers and policy makers in improving information security behaviour among health professionals in hospitals, particularly those with varying working experience.
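The reported moderating effect sizes follow Cohen's f-squared convention; the snippet below shows the calculation with hypothetical R-squared inputs chosen only to land near the reported 0.07, since the exact with- and without-interaction values are not given in the abstract.

```python
# Cohen's f-squared effect size for a moderating (interaction) effect:
# f2 = (R2_with - R2_without) / (1 - R2_with), where R2_with includes the
# interaction term in the structural model.

def cohens_f2(r2_with: float, r2_without: float) -> float:
    return (r2_with - r2_without) / (1 - r2_with)

# Hypothetical inputs for illustration only; the paper's actual values may differ.
print(round(cohens_f2(0.63, 0.604), 3))   # ~0.07 -> a "small" effect by convention
```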