Dataset Research Papers - Academia.edu

In today's society, an enormous amount of data is created and stored in various databases. Since the data is in many cases spread across different databases, organizations with large amounts of data demand the ability to merge separated data and extract value from this resource. Extract, Transform and Load (ETL) systems are a solution that makes it possible to easily merge different databases. However, the ETL market has been dominated by large actors such as Amazon and Microsoft, and the solutions offered are completely owned by these actors, leaving the consumer with little ownership of the solution. Therefore, this thesis proposes a framework for creating a component-based ETL that gives consumers an opportunity to own and develop their own ETL solution, customized to their own needs. The result of the thesis is a prototype ETL solution built around configurability and customization, which it accomplishes by being indep...
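
As a hedged illustration of what such a component-based design can look like (the component names and CSV input below are invented for this sketch, not taken from the thesis prototype), each pipeline stage can be an interchangeable object behind one small interface:

```python
from abc import ABC, abstractmethod
from typing import Iterable, Iterator

class Component(ABC):
    """One interchangeable stage of the pipeline (extract, transform or load)."""
    @abstractmethod
    def process(self, rows: Iterable[dict]) -> Iterator[dict]:
        ...

class CsvExtractor(Component):
    """Extract: read rows from a CSV file, ignoring upstream input."""
    def __init__(self, path: str):
        self.path = path
    def process(self, rows):
        import csv
        with open(self.path, newline="") as f:
            yield from csv.DictReader(f)

class RenameTransformer(Component):
    """Transform: rename columns so two sources can be merged."""
    def __init__(self, mapping: dict):
        self.mapping = mapping
    def process(self, rows):
        for row in rows:
            yield {self.mapping.get(k, k): v for k, v in row.items()}

class PrintLoader(Component):
    """Load: here we just print; a real loader would write to a database."""
    def process(self, rows):
        for row in rows:
            print(row)
            yield row

def run_pipeline(components: list[Component]) -> None:
    rows: Iterable[dict] = ()
    for component in components:   # chain the generators stage by stage
        rows = component.process(rows)
    for _ in rows:                 # drain the chain so every stage runs
        pass

run_pipeline([
    CsvExtractor("customers.csv"),                  # hypothetical input file
    RenameTransformer({"cust_id": "customer_id"}),
    PrintLoader(),
])
```

Because every stage implements the same interface, a consumer can swap in their own extractors or loaders without touching the rest of the pipeline, which is the ownership argument the thesis makes.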

Risk modelling and multi-objective optimization problems have been at the epicenter of attention for supply chain managers. In this paper, we introduce a dataset for risk modelling in sophisticated supply chain networks based on formal mathematical models. We discuss the methodology and simulation tools used to synthesize the dataset. Additionally, the underlying mathematical models are discussed in granular detail, along with directions for conducting statistical analyses or training machine learning models. The simulation is performed using MATLAB™ Simulink, and the models are illustrated as well.
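
The abstract does not reproduce the underlying models; purely as a hedged illustration, a bi-objective, risk-aware network-flow formulation of the general kind described might read (all symbols hypothetical):

```latex
\min_{x \ge 0} \;
\Bigl( \sum_{(i,j)} c_{ij}\, x_{ij}, \;\; \sum_{(i,j)} r_{ij}\, x_{ij} \Bigr)
\quad \text{s.t.} \quad
\sum_{j} x_{ij} - \sum_{j} x_{ji} = d_i \quad \forall i,
```

where x_ij is the flow shipped on arc (i, j), c_ij its unit cost, r_ij its unit risk exposure, and d_i the net demand at node i; solutions trade the two objectives off along a Pareto front, which is what makes the problem multi-objective.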

In this paper, we introduce EVALution 1.0, a dataset designed for the training and evaluation of Distributional Semantic Models (DSMs). This version consists of almost 7.5K tuples, instantiating several semantic relations between word pairs (including hypernymy, synonymy, antonymy and meronymy). The dataset is enriched with a large amount of additional information (i.e. relation domain, word frequency, word POS, word semantic field, etc.) that can be used either for filtering the pairs or for performing an in-depth analysis of the results. The tuples were extracted from a combination of ConceptNet 5.0 and WordNet 4.0, and subsequently filtered through automatic methods and crowdsourcing to ensure their quality. The dataset is freely downloadable. An extension in RDF format, also including scripts for data processing, is under development.
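
A sketch of how the additional information could drive filtering; the tuple layout and field names below are illustrative, not the released schema:

```python
# Illustrative tuple format: (word1, relation, word2, metadata dict).
tuples = [
    ("dog", "hypernym", "animal", {"domain": "biology", "freq1": 12000, "pos1": "N"}),
    ("big", "antonym", "small", {"domain": "general", "freq1": 90000, "pos1": "J"}),
    ("car", "meronym", "wheel", {"domain": "general", "freq1": 45000, "pos1": "N"}),
]

def filter_tuples(data, relation=None, min_freq=0, pos=None):
    """Keep only pairs matching the requested relation, frequency and POS."""
    for w1, rel, w2, meta in data:
        if relation and rel != relation:
            continue
        if meta["freq1"] < min_freq:
            continue
        if pos and meta["pos1"] != pos:
            continue
        yield w1, rel, w2

print(list(filter_tuples(tuples, relation="hypernym", min_freq=1000, pos="N")))
```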

The social investment perspective has become a reference framework for comparative welfare state analysis, and a powerful idea influencing the European social dimension since the Lisbon Strategy. A number of empirical studies in the field have focused on the budgetary side of welfare state change, tracking the dynamics of “new” social investment versus “old” social protection spending. Still, many data limitations (e.g. scarce country/year coverage) and the prevailing use of rough spending-over-GDP indicators have hindered the progress of our empirical knowledge of social investment in Europe. This working paper presents a new data set and methodology for the comparative analysis of welfare state budgets from the perspective of social investment. Based on various Eurostat data sources, the Social Investment Welfare Expenditure data set (SIWE) includes social spending data finely disaggregated into welfare functions for 29 countries (EU-28 less Croatia, plus Norway and Switzerland), covering the years 1995 to 2014. Building on previous contributions, I develop a new methodology for measuring “budgetary welfare effort” (BWE), that is, the effort governments effectively put into selected welfare programmes, net of interference from economic and demographic oscillations. I also construct two composite BWE indices that allow researchers to directly compare the social investment and social protection dimensions of welfare state budgets as a whole. This provides researchers with a fresh tool for empirical analyses of the dynamics, causes and consequences of welfare state change from the perspective of social investment. The SIWE data set can be requested from the author’s web page.
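
The paper's exact operationalization is not given above; purely as a hedged sketch, one plausible form of a spending measure net of economic and demographic oscillations is spending per potential beneficiary relative to per-capita output:

```latex
\mathrm{BWE}_{it} \;=\; \frac{S_{it} / B_{it}}{\mathrm{GDP}_{it} / P_{it}},
```

where S_it is spending on the programme in country i and year t, B_it the number of potential beneficiaries (netting out demographic swings), and GDP_it / P_it per-capita output (netting out the business cycle). The paper's actual BWE formula may differ; this only illustrates the idea of purging both interferences from raw spending-over-GDP shares.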

Diabetes datasets contain noisy features, which hamper classification and prediction in Artificial Intelligence (AI) systems. The optimization of a diabetes dataset using a Genetic Algorithm (GA), exploring its fundamentals, is the focus of this research. The dataset, obtained from Biostat, comprises fifteen (15) random samples and five parameter variables: cholesterol, high-density lipoprotein, age, height and weight. The simulation was performed in Matrix Laboratory (MATLAB). The optimized dataset was validated using a standard optimization equation, resulting in a score of forty-one percent (41%). This dataset will be used in a fuzzy classification system.
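
A minimal sketch of GA-based dataset optimization of the kind described, in Python; the toy data, fitness function and GA settings are all illustrative assumptions, not the paper's setup:

```python
import random

random.seed(0)

# Toy stand-in for the Biostat sample: 15 rows of
# (cholesterol, HDL, age, height, weight) plus a 0/1 label.
data = [([random.gauss(200, 30), random.gauss(50, 8), random.uniform(20, 70),
          random.gauss(170, 10), random.gauss(75, 12)], random.randint(0, 1))
        for _ in range(15)]

def fitness(mask):
    """Illustrative fitness: reward feature subsets that separate the classes."""
    cols = [i for i, m in enumerate(mask) if m]
    if not cols:
        return 0.0
    score = 0.0
    for c in cols:
        pos = [x[c] for x, y in data if y == 1]
        neg = [x[c] for x, y in data if y == 0]
        if pos and neg:
            score += abs(sum(pos) / len(pos) - sum(neg) / len(neg))
    return score / len(cols)  # dividing by subset size penalizes bloat

def evolve(pop_size=20, generations=30, n_features=5):
    pop = [[random.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]          # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, n_features)  # single-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < 0.1:              # bit-flip mutation
                i = random.randrange(n_features)
                child[i] ^= 1
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

print("best feature mask:", evolve())
```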

Network forensics is a sub-branch of digital forensics relating to the monitoring and analysis of computer network traffic for the purposes of information gathering and legal evidence. Unlike other areas of digital forensics, network investigations deal with volatile and dynamic information: network traffic is transmitted and then lost, so network forensics is often a pro-active investigation. Network forensics generally has two uses. The first, relating to security, involves monitoring a network for anomalous traffic and identifying intrusions. The second relates to law enforcement; in this case, analysis of captured network traffic can include tasks such as reassembling transferred files, searching for keywords and parsing human communication such as emails or chat sessions. Nowadays, people use mobile apps not only to communicate with friends but also to obtain information about sensitive topics such as diseases, sexual or religious preferences, etc. Numerous concerns have been raised about the capability of these portable devices to invade users' privacy, effectively becoming "tracking devices". These problems motivated our work to find a solution using machine learning techniques, even though encryption is used to protect the content of packets. Our framework analyzes network communications and leverages information available in TCP/IP packets, such as IP addresses and ports, together with other information such as the size, direction and timing of packets. For each app, our system first pre-processes a dataset of network packets labeled with the user actions that originated them, clusters them into flow typologies that represent recurrent network flows, and finally analyzes them to create a training set used to feed a classifier. The trained classifier is then able to classify new traffic traces. Our results show accuracy and precision above 95% for most of the considered actions.
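
A hedged sketch of the classification step: summarize each flow by size, direction and timing features and train a classifier (a random forest stands in here for the paper's unspecified model; the flows are synthetic):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def flow_features(sizes, directions, gaps):
    """Summarize one flow: packet sizes, directions (+1/-1), inter-arrival gaps."""
    return [np.mean(sizes), np.std(sizes), np.sum(directions),
            np.mean(gaps), len(sizes)]

# Synthetic flows standing in for labeled captures
# (action 0 = "browse feed", action 1 = "send message", say).
X, y = [], []
for action in (0, 1):
    for _ in range(200):
        n = rng.integers(5, 40)
        sizes = rng.normal(300 + 400 * action, 80, n)
        dirs = rng.choice([-1, 1], n, p=[0.3 + 0.4 * action, 0.7 - 0.4 * action])
        gaps = rng.exponential(0.05 + 0.1 * action, n)
        X.append(flow_features(sizes, dirs, gaps))
        y.append(action)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```

The point of the sketch is that no payload is inspected: everything the classifier sees survives encryption, which is what makes such traffic analysis a privacy concern.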

Agriculture plays a major role in the Indian economy. Rainfall is important for agriculture, but rainfall prediction has become a major challenge these days. Good rainfall prediction gives farmers advance knowledge so they can take precautions and develop better strategies for their crops. Global warming is having a severe effect on nature as well as mankind, and it accelerates changes in climatic conditions: the air is getting warmer and the level of the ocean is rising, leading to floods, while cultivated fields turn to drought. Such adverse climatic change leads to unseasonal and erratic rainfall. Rainfall prediction is thus one of the best techniques for understanding rainfall and climate. The main aim of this study is to provide a correct climate description to clients from various perspectives, such as agriculture, research and power generation, to grasp the need for transformation in climate and its parameters, like temperature, humidity, precipitation and wind speed, which eventually direct the projection of rainfall. Rainfall also depends on geographic location, which makes it an arduous task to predict. Machine Learning is an evolving subset of AI that helps in predicting rainfall. In this paper, we use a UCI repository dataset with multiple attributes for predicting rainfall. The main aim of this study is to develop a rainfall prediction system that predicts rainfall with better accuracy using Machine Learning classification algorithms.
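
A sketch of the kind of pipeline the study describes, assuming a CSV export of a UCI weather dataset; the file and column names are assumptions to adapt:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical column names; adapt to the actual UCI export.
df = pd.read_csv("weather.csv").dropna()
X = df[["temperature", "humidity", "pressure", "wind_speed"]]
y = df["rain_tomorrow"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)
model = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, model.predict(X_te)))
```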

The study presents the Autism Spectrum Disorder Question Answering Dataset (ASD QA), a new Russian dataset based on the structure of the Stanford Question Answering Dataset (SQuAD), a machine reading comprehension dataset. The ASD QA dataset is a work in progress. The dataset version described in the paper consists of 1,134 question-answer pairs compiled by the author of the paper from the information website for individuals with autism spectrum disorders (ASD) and Asperger’s syndrome and their parents. The paper also describes several question-answering models built to analyze the dataset.
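
For readers unfamiliar with the SQuAD structure the dataset follows, the sketch below shows a minimal SQuAD-style record in Python; the Russian sentence and question are invented placeholders, not items from ASD QA:

```python
import json

context = "Расстройство аутистического спектра (РАС) — это нарушение развития."
answer = "нарушение развития"

# SQuAD convention: each answer stores its character offset into the context,
# so extractive QA models can be trained to predict start/end positions.
record = {
    "context": context,
    "qas": [{
        "id": "q1",
        "question": "Что такое РАС?",
        "answers": [{"text": answer, "answer_start": context.index(answer)}],
    }],
}
print(json.dumps(record, ensure_ascii=False, indent=2))
```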

In the past few years, we have developed a research paper recommender system for our reference management software Docear. In this paper, we introduce the architecture of the recommender system and four datasets. The architecture comprises multiple components, e.g. for crawling PDFs, generating user models, and calculating content-based recommendations. It supports researchers and developers in building their own research paper recommender systems, and is, to the best of our knowledge, the most comprehensive architecture that has been released in this field. The four datasets contain metadata of 9.4 million academic articles, including 1.8 million articles publicly available on the Web; the articles' citation network; anonymized information on 8,059 Docear users; information about the users' 52,202 mind-maps and personal libraries; and details on the 308,146 recommendations that the recommender system delivered. The datasets are a unique source of information to enable, for instance, research on collaborative filtering, content-based filtering, and the use of reference-management and mind-mapping software.
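
A simplified sketch of the content-based step such an architecture performs: build a user model from mind-map terms and rank articles by TF-IDF cosine similarity (toy data; not Docear's actual code):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy article metadata and a user model built from mind-map node texts.
articles = [
    "collaborative filtering for research paper recommendation",
    "mind mapping software usability study",
    "content based filtering with tf idf user models",
]
user_model = "research paper recommender content based filtering"

vec = TfidfVectorizer()
doc_matrix = vec.fit_transform(articles + [user_model])
scores = cosine_similarity(doc_matrix[-1], doc_matrix[:-1]).ravel()

for score, title in sorted(zip(scores, articles), reverse=True):
    print(f"{score:.2f}  {title}")
```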

In today's life, every human is in a hurry to reach their destination, like home, office, college, shopping mall or restaurant, as quickly as possible. To reach their destinations quickly, people drive their vehicles faster, which results in road accidents. It is a common belief that if a person's behavior is monitored, driving would be relatively safer. Driver behavior is a major cause of road accidents. To address this problem, driver behavior analysis and prediction models need to be developed.

The ability of Minkowski Functionals to characterize local structure in different biological tissue types has been demonstrated in a variety of medical image processing tasks. We introduce anisotropic Minkowski Functionals (AMFs) as a novel variant that captures the inherent anisotropy of the underlying gray-level structures. To quantify the anisotropy characterized by our approach, we further introduce a method to compute a quantitative measure motivated by a technique utilized in MR diffusion tensor imaging, namely fractional anisotropy. We showcase the applicability of our method in the research context of characterizing the local structure properties of trabecular bone micro-architecture in the proximal femur as visualized on multi-detector CT. To this end, AMFs were computed locally for each pixel of ROIs extracted from the head, neck and trochanter regions. Fractional anisotropy was then used to quantify the local anisotropy of the trabecular structures found in these ROIs and to compare its distribution in different anatomical regions. Our results suggest a significantly greater concentration of anisotropic trabecular structures in the head and neck regions when compared to the trochanter region (p < 10^-4). We also evaluated the ability of such AMFs to predict bone strength in the femoral head of proximal femur specimens obtained from 50 donors. Our results suggest that such AMFs, when used in conjunction with multi-regression models, can outperform more conventional features such as BMD in predicting failure load. We conclude that such anisotropic Minkowski Functionals can capture valuable information regarding directional attributes of local structure, which may be useful in a wide scope of biomedical imaging applications.
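
Fractional anisotropy, the DTI-motivated measure named above, is computed from the eigenvalues λ1, λ2, λ3 of the local tensor and ranges from 0 (isotropic) to 1 (fully anisotropic):

```latex
\mathrm{FA} = \sqrt{\tfrac{3}{2}}\;
\frac{\sqrt{(\lambda_1-\bar{\lambda})^2 + (\lambda_2-\bar{\lambda})^2 + (\lambda_3-\bar{\lambda})^2}}
     {\sqrt{\lambda_1^2 + \lambda_2^2 + \lambda_3^2}},
\qquad
\bar{\lambda} = \tfrac{1}{3}(\lambda_1 + \lambda_2 + \lambda_3).
```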

Data mining is defined as a search through large amounts of data for valuable information. Association rules, classification, clustering, prediction and sequence modeling are some of the essential and most general strategies for data extraction. Data processing plays a major role in disease detection in the healthcare industry. A variety of examinations are required to diagnose a patient; however, using data mining strategies, the number of examinations can be decreased, which is crucial in terms of time and results. Heart disease is a life-threatening disorder, and health issues have become immense in recent times given the variety of conditions and situations encountered. Hidden information is now important in the healthcare industry for making decisions. For the prediction of cardiovascular problems, Weka 3.8.3 is used in this analysis to run data mining algorithms: sequential minimal optimization (SMO), multilayer perceptron (MLP), random forest and Bayes net. The results combine prediction accuracy, the receiver operating characteristic (ROC) curve, and the PRC value. Bayes net (94.5%) and random forest (94%) show better performance than the sequential minimal optimization (SMO) and multilayer perceptron (MLP) methods.
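
A rough scikit-learn analogue of the Weka comparison (a linear-kernel SVM standing in for SMO, and Gaussian naive Bayes standing in for the Bayes net), on a placeholder CSV whose "target" column is an assumption:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder dataset; "target" marks presence of heart disease.
df = pd.read_csv("heart.csv")
X, y = df.drop(columns="target"), df["target"]

models = {
    "SMO (linear SVM here)": make_pipeline(StandardScaler(), SVC(kernel="linear")),
    "MLP": make_pipeline(StandardScaler(), MLPClassifier(max_iter=1000)),
    "Random forest": RandomForestClassifier(random_state=0),
    "Naive Bayes (for Bayes net)": GaussianNB(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10, scoring="roc_auc")
    print(f"{name}: mean ROC AUC = {scores.mean():.3f}")
```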

Sales forecasting has become crucial for industries in recent decades with rapid globalization, the widespread adoption of information technology for e-business, understanding market fluctuations, meeting business plans, and avoiding loss of sales. This research predicts automotive industry sales using a bag of multiple machine learning and time series algorithms coupled with historical sales and auxiliary features. Three years of historical sales data (2017 to 2020) were used for model building or training, and one-year (2020-2021) predictions were computed for 900 unique SKUs (stock-keeping units). In the present study, an SKU is a combination of sales office, core business field, and material customer group. Various data cleaning and exploratory data analysis algorithms were applied to the raw datasets before modeling. The mean absolute percentage error (MAPE) was estimated for individual predictions from the time series and machine learning models, and the best model was selected for each unique SKU based on the lowest MAPE value.
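
MAPE, the selection criterion used above, for actuals y_t and forecasts ŷ_t over n periods (undefined whenever some y_t = 0):

```latex
\mathrm{MAPE} \;=\; \frac{100\%}{n} \sum_{t=1}^{n} \left| \frac{y_t - \hat{y}_t}{y_t} \right|.
```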

Datasets originating from the natural sciences, and in particular from the geological sciences, are increasingly large, requiring a wide variety of visualization tools for their analysis and exploration; such is the case for topographic datasets, cartographic projections, geophysical data, etc., which require adequate visual support for their exploration. This line of research proposes the study and implementation of interactive visualization systems for geological data and of Augmented Reality systems that provide adequate support for the efficient exploration of the data, both in the laboratory and in the field.

The rapid growth of social networking is supplementing the progression of cyberbullying activities. Most of the individuals involved in these activities belong to the younger generations, especially teenagers, who in the worst scenario are at greater risk of suicide attempts. This paper proposes an effective approach to detect cyberbullying messages from social media using an SVM classifier. It also presents a ranking algorithm to identify the most visited links, and provides age verification before access to the particular social media platform. The experiments show the effectiveness of our approach.

One of the most common malignant tumors in the world today is lung cancer, and it is the primary cause of death from cancer. With the continuous advancement of urbanization and industrialization, the problem of air pollution has become more and more serious. The best treatment period for lung cancer is the early stage; however, early-stage lung cancer often has no clinical symptoms and is difficult to detect. In this paper, lung nodule classification has been performed; the CT image data used are from SPIE AAPM-Lung. In recent years, deep learning (DL) has been a popular approach for classification. One DL approach used here is transfer learning (TL), which eliminates the cost of training from scratch and makes it possible to train deep models with small training data. Nowadays, researchers are trying various deep learning techniques to improve the efficiency of computer-aided diagnosis (CAD) with computed tomography in lung cancer screening. In this work, we i...
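
A sketch of the transfer-learning recipe described, assuming Keras, a frozen ImageNet backbone, and a directory of two-class nodule patches; the backbone choice and paths are illustrative:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Frozen ImageNet backbone + small trainable head: the usual TL recipe
# when labeled medical data is scarce.
base = tf.keras.applications.ResNet50(
    weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid"),  # benign vs. malignant nodule
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Hypothetical directory of CT nodule patches, one subfolder per class.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "nodule_patches/train", image_size=(224, 224), batch_size=32)
model.fit(train_ds, epochs=5)
```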

Social media has opened new avenues and opportunities for financial banking institutions to improve the quality of their products and services and to understand and adapt to their customers' needs. By directly analyzing the feedback of its customers, a financial banking institution can provide personalized products and services tailored to customer needs. This paper presents a research framework for the creation of a financial banking dataset to be used for sentiment classification with various machine learning methods and techniques. The dataset contains 2234 Romanian financial banking comments from social media, collected via web scraping.

This paper introduces a new image-based handwritten historical digit dataset named Arkiv Digital Sweden (ARDIS). The images in the ARDIS dataset are extracted from 15,000 Swedish church records written by different priests with various handwriting styles in the nineteenth and twentieth centuries. The constructed dataset consists of three single-digit datasets and one digit-string dataset. The digit-string dataset includes 10,000 samples in red-green-blue color space, whereas the other datasets contain 7,600 single-digit images in different color spaces. An extensive analysis of machine learning methods on several digit datasets is carried out. Additionally, the correlation between ARDIS and the existing digit datasets Modified National Institute of Standards and Technology (MNIST) and US Postal Service (USPS) is investigated. Experimental results show that machine learning algorithms, including deep learning methods, provide low recognition accuracy when trained on existing datasets and tested on the ARDIS dataset: a convolutional neural network trained on MNIST and USPS and tested on ARDIS provides the highest accuracies, 58.80% and 35.44%, respectively. Consequently, the results reveal that machine learning methods trained on existing datasets have difficulty recognizing digits in our dataset, which shows that the ARDIS dataset has unique characteristics. This dataset is publicly available for the research community to further advance handwritten digit recognition algorithms.
The data sets are freely available from:
https://ardisdataset.github.io/ARDIS/
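
A sketch of the cross-dataset experiment reported above, assuming TensorFlow/Keras: train a small CNN on MNIST, then evaluate on ARDIS digits (loading ARDIS itself is left as a placeholder, since it must be downloaded separately):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Train a small CNN on MNIST, as in the cross-dataset experiment above.
(x_tr, y_tr), (x_te, y_te) = tf.keras.datasets.mnist.load_data()
x_tr, x_te = x_tr[..., None] / 255.0, x_te[..., None] / 255.0

model = models.Sequential([
    layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_tr, y_tr, epochs=1, batch_size=128)

# To reproduce the reported drop, replace (x_te, y_te) with ARDIS digits
# resized to 28x28 grayscale, loaded from the download page above.
print(model.evaluate(x_te, y_te))
```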

The COVID-19 outbreak compelled people from all walks of life to self-quarantine in their houses in order to prevent the virus from spreading. As a result of adhering to the exceedingly strict guidelines, many people developed mental illnesses. Because educational institutions were closed at the time, students remained at home in self-quarantine, so it is necessary to identify the students who developed mental illnesses during that period. To develop AiPsych, a mobile application-based artificial psychiatrist, we train supervised and deep learning algorithms to predict the mental illness of students during the COVID-19 situation. Our experiments reveal that supervised learning outperforms deep learning, with a 97% accuracy of the Support Vector Machine (SVM) for mental illness prediction. Random Forest (RF) achieves the best accuracy of 91% for the recovery suggestion prediction. Our Android application can be used by parents, educational institutes, or the government to obtain the predicted mental illness status of a student and take proper measures to overcome the situation.

360° videos and Head-Mounted Displays (HMDs) are getting increasingly popular. However, streaming 360° videos to HMDs is challenging. This is because only the video content in viewers' Field-of-Views (FoVs) is rendered, and thus sending complete 360° videos wastes resources, including network bandwidth, storage space, and processing power. Optimizing 360° video streaming to HMDs is, however, highly data- and viewer-dependent, and thus dictates real datasets. However, to the best of our knowledge, such datasets are not available in the literature. In this paper, we present our datasets of both content data (such as image saliency maps and motion maps derived from 360° videos) and sensory data (such as viewer head positions and orientations derived from HMD sensors). We put extra effort into aligning the content and sensory data using the timestamps in the raw log files. The resulting datasets can be used by researchers, engineers, and hobbyists to either optimize existing 360° video streaming applications (like rate-distortion optimization) or develop novel applications (like crowd-driven camera movements). We believe that our datasets will stimulate more research activities along this exciting new research direction.

Developing hardware, algorithms and protocols, as well as collecting data in sensor networks, are all important challenges in building good systems. We describe a vertical system integration of a sensor node and a toolkit of machine learning algorithms. Based on a dataset that combines sensor data with additional introduced data, we predict the number of persons in a closed space.

With the growth of the Internet and its potential, more and more people are getting connected to the Internet every day to take advantage of e-Commerce. On one side, the Internet brings tremendous potential to business in terms of reaching end users; at the same time, it also brings a lot of security risk to business over the network. With the growth of cyber-attacks, information safety has become an important issue all over the world. Intrusion detection systems (IDSs) are an essential element of network security infrastructure and play a very important role in detecting a large number of attacks. This survey paper introduces a detailed analysis of network security problems and presents a review of the current research. The main aim of the paper is to identify the problems associated with network security; to that end, various existing approaches to intrusion detection and prevention are discussed. This survey focuses on presenting the different issues that must be addressed to build fully functional and practically usable intrusion detection systems (IDSs). It points out the state of the art in each area and suggests important open research issues.

Edge computing is a paradigm that distributes the complexity of predictive analysis into smaller pieces, physically placed at the source of contextual information. This allows processing large amounts of data where using a centralized cloud is impractical. Edge computing makes this possible by taking control of data and services away from central hubs, which reduces computational latency on servers. Humidity is one of the main factors that maintain life on the surface. This article explains how to perform computational analysis at the "edge" using humidity datasets, and also shows that the most recent data is sufficient for the analysis. Linear Regression and Random Forest Regression algorithms are utilized for the data analysis. In addition, this article illustrates the importance of data series for predicting humidity by comparing the analysis of unshuffled and shuffled data. Metrics are used to assess accuracy and point out the importance of sequential data feeds for analysis.
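
A hedged sketch of the shuffled-versus-unshuffled comparison on a synthetic humidity series: with a chronological split the model must extrapolate, while shuffling leaks future readings into training, which inflates the metric:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)

# Synthetic hourly humidity: daily cycle + slow drift + noise.
t = np.arange(24 * 60)
humidity = 60 + 10 * np.sin(2 * np.pi * t / 24) + 0.01 * t + rng.normal(0, 2, t.size)

# Lag features: predict each reading from the three previous ones.
lags = 3
X = np.column_stack([humidity[i:len(humidity) - lags + i] for i in range(lags)])
y = humidity[lags:]

def evaluate(shuffle):
    idx = np.arange(len(y))
    if shuffle:
        rng.shuffle(idx)          # breaks the temporal order
    split = int(0.8 * len(idx))
    tr, te = idx[:split], idx[split:]
    model = RandomForestRegressor(random_state=0).fit(X[tr], y[tr])
    return r2_score(y[te], model.predict(X[te]))

print("chronological split R^2:", round(evaluate(False), 3))
print("shuffled split R^2:    ", round(evaluate(True), 3))
```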

Lung cancer is one of the most common types of cancer worldwide. It is also one of the deadliest. Using computed tomography scans of the human lungs, radiologists can detect dangerous nodules in early stages, but as more people opt for these scans, the workload on radiologists rises. Systems that automatically detect these nodules can support radiologists and reduce their workload. Histopathological analysis of whole-slide images is one of the most widely used techniques for the diagnosis of lung cancers. In this study, a fully automated pipeline was developed to detect cancer from histopathology slides of lung tissue. We obtained a dataset of 220,023 images, consisting of 89,116 histopathologic cancer images and 130,907 images not affected by histopathologic cancer, and used them to test the proposed methodology. At the preprocessing step, we trained a classification model to detect clinically relevant patches of images using statistical measurements. In the next step, cells and their nuclei were segmented, and various texture and morphology features were extracted from the images and segmented objects. At the final step, different classification models were applied to distinguish between malignant tissues and adjacent normal cells. The results indicate that using machine learning algorithms at the preprocessing step to detect relevant sections of whole-slide images substantially improves the performance of automated cancer detection systems.

The aim of sentiment analysis is to automatically extract the opinions from a certain text and decide its sentiment. In this paper, we introduce the first publicly available Twitter dataset on Sunnah and Shia (SSTD), as part of religious hate speech, a subproblem of general hate speech. We further provide a detailed review of the data collection process and our annotation guidelines, such that reliable dataset annotation is guaranteed. We employed many stand-alone classification algorithms on the Twitter hate speech dataset, including Random Forest, Complement NB, Decision Tree, and SVM, and two deep learning methods, CNN and RNN. We further study the influence of word embedding dimensions for FastText and word2vec. In all our experiments, all classification algorithms are trained using a random split of data (66% for training and 34% for testing), the two splits being stratified samples of the original dataset. The CNN-FastText achieves the highest F-measure (52.0%), followed by the CNN-Word2vec (49.0%), showing that neural models with FastText word embeddings outperform classical feature-based models.

Forged documents, specifically passports, driving licences and visa stickers, are used for fraud purposes including robbery, theft and many more, so detecting forged characters in documents is a significantly important and challenging task in digital forensic imaging. Forged character detection has two big challenges. The first challenge is that data for forged character detection is extremely difficult to get, for several reasons including limited access to data, unlabeled data, or work done on private data. The second challenge is that deep learning (DL) algorithms require labeled data, and obtaining labels is tedious, time-consuming, expensive and requires domain expertise. To address these issues, in this paper we propose a novel algorithm that generates three datasets, namely forged character detection for passports (FCD-P), forged character detection for driving licences (FCD-D) and forged character detection for visa stickers (FCD-V). To the best of our knowledge, we are the first to release these datasets. The proposed algorithm starts by reading plain document images and simulating forging tasks on five different countries' passports, driving licences and visa stickers. It then keeps the bounding boxes of the forged characters as labels. Furthermore, considering real-world scenarios, we performed selected data augmentation accordingly. Each dataset consists of 15,000 images, each of size 950 x 550. For further research purposes, we release our algorithm code and datasets.
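
A toy sketch of the generation idea under stated assumptions (Pillow only, an invented MRZ-like string, and a crude "overdraw" forgery), showing how a forged character's bounding box can be kept as the label:

```python
from PIL import Image, ImageDraw

# Render a plain "document" line, forge one character, record its box.
img = Image.new("RGB", (950, 550), "white")
draw = ImageDraw.Draw(img)

text, x, y = "P<GBRDOE<<JOHN", 40, 260   # invented MRZ-like sample text
forged_index, labels = 5, []

for i, ch in enumerate(text):
    left, top, right, bottom = draw.textbbox((x, y), ch)
    if i == forged_index:
        # Simulated forgery: overdraw the glyph slightly offset, then keep
        # its bounding box as the ground-truth label for this character.
        draw.text((x + 1, y - 1), ch, fill="black")
        labels.append((left, top, right, bottom, "forged"))
    draw.text((x, y), ch, fill="black")
    x = right + 2

img.save("fcd_sample.png")
print(labels)
```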

Intrusion Detection Systems (IDS) have been an effective way to achieve higher security in detecting malicious activities for the past couple of years. Anomaly detection is one type of intrusion detection. Current anomaly detection is often associated with high false alarm rates and only moderate accuracy and detection rates, because it is unable to detect all types of attacks correctly. An experiment is carried out to evaluate the performance of different machine learning algorithms using the KDD-99 Cup and NSL-KDD datasets. Results show which approach performs better in terms of accuracy and detection rate with a reasonable false alarm rate.

Rainfall prediction is one of the most demanding operational responsibilities carried out by meteorological services worldwide. Planning is referred to as the roadmap to success, as failing to plan implies planning for failure, and information about future happenings is instrumental to efficient and effective planning. The effects of natural disasters such as flooding and drought can only be prevented with effective planning. The different techniques that can be applied include decision trees, clustering, K-means and fuzzy logic. In this paper, we use the fuzzy logic technique to predict rainfall, given the temperature of a particular geographical location.
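
A minimal sketch of the fuzzy-logic idea: fuzzify temperature with triangular membership functions, apply a small rule base, and defuzzify with a weighted average. All breakpoints and rule outputs are invented for illustration:

```python
def tri(x, a, b, c):
    """Triangular membership function rising from a, peaking at b, falling to c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def predict_rainfall(temp_c):
    # Fuzzify temperature (breakpoints are illustrative, in deg C).
    cool = tri(temp_c, 5, 15, 25)
    warm = tri(temp_c, 20, 28, 36)
    hot = tri(temp_c, 32, 40, 48)

    # Rule base: cool -> heavy rain, warm -> moderate, hot -> light.
    # Defuzzify with a weighted average of representative outputs (mm).
    weights = [(cool, 120), (warm, 60), (hot, 10)]
    total = sum(w for w, _ in weights)
    return sum(w * mm for w, mm in weights) / total if total else 0.0

for t in (12, 26, 38):
    print(f"{t} degC -> {predict_rainfall(t):.0f} mm expected rainfall")
```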

Big data refers to sets of data so large that their size is beyond the ability of most software tools and people to capture, manage, process and store, and it is ever increasing day by day. In most enterprise scenarios the data is too big, or it moves too fast, far exceeding current processing capacity. As used by vendors, the term big data may also refer to the technology, including the tools and processes, that an organization requires to handle large amounts of data and storage facilities. This advancement in technology makes relationship marketing a reality in today's competitive world. At the same time, this huge amount of data cannot be analyzed in a traditional manner using manual data analysis. For this reason, technologies such as data warehousing and data mining have made customer relationship management a new area where business firms can gain a competitive advantage by identifying their customers' behaviors and needs. This paper mainly focuses on data mining techniques that extract hidden predictive information from large databases, with which organizations can identify valuable customers and predict future user behaviors. This enables organizations to make proactive, knowledge-driven decisions. Data mining tools answer business questions that in the past were too time-consuming to resolve, which makes customer relationship management possible. In this paper, we explain the use of data mining techniques to accomplish the goals of today's customer relationship management and decision making for companies that deal with big data.

The development of the Linked Open Data project has enabled the publication and linking of open data under the Linked Data principles on the Web of Data. At present, there are datasets from different domains linked to one another...

Disease detection through technology has become the need of the hour; lack of time and ignorance cause the problems to increase substantially. Historical medical records can be used to analyse patterns and discover any existing disease, or predict future disease outcomes for a person. This paper presents a comprehensive review of pattern mining techniques used to discover distinct patterns in a given dataset. In addition, sequential pattern mining is considered the basis for predicting diseases, and techniques like pattern growth, incremental mining and PrefixSpan are comparatively analysed, giving the advantages and disadvantages of each. In other words, Apriori-based algorithms are analysed using the reviewed literature. Future enhancements are also suggested on the basis of the reviewed literature.

The focus of this research study was the analysis of a diabetes dataset and how different machine learning algorithms perform when predicting diabetes. We used the original dataset from the National Institute of Diabetes and Digestive and Kidney Diseases. The dataset can be used to predict whether or not a patient has diabetes, based on certain diagnostics. For the analysis we used Amazon Web Services: AWS S3 to store the dataset, and Amazon SageMaker to perform the analysis. For the given dataset we applied three classification models, Logistic Regression, K-nearest Neighbors and Support Vector Machines, and measured the performance of each. Comparing all the results, Support Vector Machines had the best performance. Insights and recommendations are provided.
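
The same three-model comparison can be sketched locally with scikit-learn before moving to SageMaker; the file name and the "Outcome" label column below are assumptions based on the common Pima-style export of this dataset:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

df = pd.read_csv("diabetes.csv")  # hypothetical export, e.g. pulled from S3
X, y = df.drop(columns="Outcome"), df["Outcome"]

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-nearest Neighbors": KNeighborsClassifier(),
    "Support Vector Machine": SVC(),
}
for name, clf in models.items():
    # Scaling matters for KNN and SVM, so each model gets the same pipeline.
    pipe = make_pipeline(StandardScaler(), clf)
    acc = cross_val_score(pipe, X, y, cv=5).mean()
    print(f"{name}: mean accuracy = {acc:.3f}")
```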

In this paper, we discuss how we built a complete dataset of Bangla homographs, and we present a methodology for the disambiguation of homograph words. We read through the entire Bangla-to-Bangla dictionary and collected all 1156 homograph words, and for the collected words we used crowdsourcing to gather 50-60 sentences for each word. We then applied TF-IDF with Naive Bayes for the disambiguation of the homograph words.
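
A hedged sketch of the TF-IDF plus Naive Bayes step: treat each sense of a homograph as a class and classify a sentence by its context. The English "bank" sentences are placeholders standing in for the Bangla crowdsourced sentences:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Placeholder training sentences for one homograph, one class per sense.
sentences = [
    "the river bank flooded after the rain",
    "we walked along the bank of the river",
    "she deposited money at the bank",
    "the bank approved the loan application",
]
senses = ["riverside", "riverside", "finance", "finance"]

# TF-IDF turns each context into a weighted bag of words; Naive Bayes
# then picks the most probable sense given those context words.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(sentences, senses)
print(model.predict(["he opened an account at the bank"]))
```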

The lack of publicly accessible datasets with a reliable ground truth has prevented in the past a fair and coherent comparison of the different methods proposed in the mobile robot Simultaneous Localization and Mapping (SLAM) literature. Providing such a ground truth becomes especially challenging in the case of visual SLAM, where the world model is 3-dimensional and the robot path is 6-dimensional. This work addresses both the practical and theoretical issues found while building a collection of six outdoor datasets. It is discussed how to estimate the 6-d vehicle path from the readings of a set of three Real Time Kinematics (RTK) GPS receivers, as well as the associated uncertainty bounds that can be employed to evaluate the performance of SLAM methods. The vehicle was also equipped with several laser scanners, from which reference point clouds are built as a testbed for other algorithms such as segmentation or surface fitting. All the datasets, calibration information and associated software tools are available for download at http://babel.isa.uma.es/mrpt/papers/dataset2009/.

The Digital Imaging and Communications in Medicine (DICOM) standard is an image archive system that serves as an image manager controlling the acquisition, retrieval and distribution of medical images within entire picture archiving and communication systems (PACS). DICOM technology is suitable for sending images between different departments within a hospital, between hospitals, and to consultants. However, some hospitals lack a DICOM system. This paper proposes an algorithm that views and converts .dcm image files into standard BMP, PNG and JPEG images, so that the images are viewable with common image-viewing programs and small in size.
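
A minimal sketch of such a conversion, assuming pydicom and Pillow and simplifying windowing to min-max normalization (file names are hypothetical):

```python
import numpy as np
import pydicom
from PIL import Image

def dicom_to_png(dcm_path: str, png_path: str) -> None:
    """Convert a .dcm file to an 8-bit PNG via min-max normalization."""
    ds = pydicom.dcmread(dcm_path)
    pixels = ds.pixel_array.astype(np.float32)
    lo, hi = pixels.min(), pixels.max()
    scaled = (pixels - lo) / (hi - lo or 1) * 255.0  # guard flat images
    Image.fromarray(scaled.astype(np.uint8)).save(png_path)

dicom_to_png("study.dcm", "study.png")  # hypothetical file names
```

A production converter would also honor the RescaleSlope/RescaleIntercept and window-center tags rather than raw min-max scaling; this sketch only shows the file-format step.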

The significance of labeled datasets is not obscure to artificial intelligence practitioners. We have seen much phenomenal work in natural language processing for many languages (like English, Chinese, and Arabic, etc.) due to the availability of substantial data. For the Urdu language, despite it being the third largest spoken language in the world, very little research work has been shown; hence, it is adjudged a ‘morphologically rich’ but ‘resource-poor’ language. Further, researchers working on Urdu natural language processing are in a quandary due to the lack of labeled/annotated datasets. This paper shares the data of the “Urdu Sentiment Corpus” (USC), and insights therein, of Urdu tweets for sentiment analysis and polarity detection. The dataset consists of tweets with a political cast, reflecting a competitive environment between two separate political parties and the government of Pakistan. Overall, the dataset comprises over 17,185 tokens, with 52% of records positive and 48% negative. This paper shares visual insights (from document level to word level) into textual similarities, manifold learning, etc. In addition, this paper also presents a part-of-speech-wise analysis and an unpretentious technique for the extraction of sentiment lexicons from the corpus.

Aluminium 6082, a lightweight structural material with high specific strength, good damping properties and machinability, is one of the most attractive materials for transportation and mobile electronics applications where weight must be reduced effectively. In 1991, a solid-state joining process named Friction Stir Welding was developed, and this technique has attracted considerable interest from the aerospace and automotive industries, since it is able to produce defect-free joints, particularly for light metals, i.e. magnesium and aluminium alloys. In this investigation, an attempt has been made to study the effect of friction stir welding parameters on the mechanical and metallurgical properties of Aluminium 6082 alloy. The selected material was welded using combinations of single- and double-pass friction stir welding at a tool rotational speed of 1600 rpm, welding speeds of 40 mm/min, 60 mm/min and 80 mm/min, and doping with Silicon Carbide particles. Doping with Silicon Carbide particles improves the tensile strength and microhardness compared to the base material as received. It was observed that microhardness up to two times higher was achieved by doping SiC particles during friction stir welding; the maximum microhardness of 137 HV is achieved with double-pass friction stir welding. A high tensile strength of 300 N/m2 is obtained at a welding speed of 60 mm/min with double-pass friction stir welding. Fine and equiaxed grains were obtained due to dynamic recrystallization in the stir zone (SZ) of the friction stir welded joints. Double-pass friction stir welding causes homogeneous dispersion of the SiC particles. It was also observed that excessive heat generation and insufficient flow of plasticized material at higher tool rotational speeds lead to the formation of defects, which ultimately results in failure of weld joints between the SZ and the thermo-mechanically affected zone (TMAZ).