Data Dimensionality Reduction Techniques: Review

Dimensionality Reduction in EH&S Data Analysis

Academia Letters, 2022

Multiple occupational sectors have successfully employed tools from big data analytics, which deal with large, complex data sets characterized by volume, variety, velocity, veracity, value, and complexity, to assist in problem-solving and decision-making. Big data includes unstructured data (text-heavy and unorganized) and multi-structured data (including human-machine interactions). With ever-increasing volumes of data generated, the quantity is challenging to handle from the standpoints of analysis, storage, and the sustainability of the data. One such tool is dimensionality reduction, the transformation of data from a high-dimensional space to a lower-dimensional space that retains meaningful properties of the original data (Van der Maaten, Postma & Van den Herik, 2009). Stated more simply, it is a method of simplifying data in order to extract as much useful information as possible from as little data as appropriate. In big data analytics, dimensionality reduction is performed after data has been collected, and uses a variety of mathematical and statistical methods to determine which data to keep and which to disregard as irrelevant. The techniques used all accomplish the same goal: reducing a vast amount of data into something more manageable. The principles of dimensionality reduction can also be applied ahead of and during data acquisition/collection, allowing for a pragmatic approach to collecting data that serves the needs of the individual collecting or analyzing it, with the hope of streamlining the process of collecting EHS data and making decisions from it. To illustrate the application of dimensionality reduction principles, let us examine a case study of a laboratory facility with a varied set of operations trying to assess the prevalence and scope of incidents and injuries.
This facility has been collecting data for a period of 10 years, and in that time has used an instrument or form to collect information about each incident.
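As a concrete sketch of what such a reduction might look like for incident data of this kind (the record count, the number of fields, the synthetic values, and the use of scikit-learn are all illustrative assumptions, not details from the facility's actual data):

```python
# Illustrative only: reduce a hypothetical incident-report table with many
# numeric fields (severity scores, days lost, exposure readings, ...) to the
# few components that explain most of the variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical: 200 incident records, each described by 15 numeric fields.
X = rng.normal(size=(200, 15))

# Keep just enough principal components to explain 90% of the variance.
pca = PCA(n_components=0.9)
X_reduced = pca.fit_transform(X)
print(f"{X_reduced.shape[1]} components retained out of 15")
```

On real incident data, correlated fields (e.g., severity and days lost) would typically collapse into far fewer components than on the independent random values used here.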

Data Rationalisation

High-dimensional data is difficult to visualize, and as the dimensionality increases, the data starts behaving in unexpected ways. Dimensionality reduction is a technique used to summarize a large set of input parameters into a smaller set with little or no redundancy, and to analyse the reduced form of the high-dimensional data. Redundancy means that some parameters can be characterized in terms of others and are therefore not independent; the parameters that can be replaced by others are removed and the data set is made smaller. These factors have increased the demand for dimensionality reduction techniques in industries such as healthcare. This paper discusses existing linear and non-linear techniques in dimensionality reduction, and aims to find the best technique for performing dimensionality reduction across industries such as engineering, health and medicine, banking, marketing, and finance. The paper first discusses a linear method, Principal Component Analysis (PCA). PCA models have been created in Python to demonstrate that they improve the performance and efficiency of machine learning algorithms. In certain cases PCA fails to deliver results, which motivates better techniques such as Kernel Principal Component Analysis, Linear Discriminant Analysis and t-Distributed Stochastic Neighbour Embedding (t-SNE). These techniques are more flexible with the data available. The second part of the paper therefore discusses these three non-linear techniques in detail and compares their performance with that of PCA. Enhancements are made to the applied models to provide better results on four different data sets. All the techniques used have been judged on their accuracy scores, time taken for operation, computational complexity and applicability in different scenarios. The uncertainty involved with all the experiments is also illustrated in this report.
The paper provides mathematical derivations, design for the experiment, Python codes developed for experimentation, applications of dimensionality reduction techniques across different industries, results of the experiments and analysis of the results for all the methodologies discussed, thus providing the best techniques for dimensionality reduction.
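A minimal sketch of the kind of case the paper describes, where linear PCA fails and a kernel method succeeds (this uses scikit-learn and a synthetic two-circles data set as assumptions; it is not the paper's own code or data):

```python
# Two concentric circles cannot be separated by any linear 1-D projection,
# so plain PCA fails; Kernel PCA with an RBF kernel maps the data into a
# space where the circles do separate.
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Linear PCA: the inner and outer circles stay entangled.
X_pca = PCA(n_components=1).fit_transform(X)

# Kernel PCA (RBF): the first component tends to separate the two circles.
X_kpca = KernelPCA(n_components=1, kernel="rbf", gamma=10).fit_transform(X)
print(X_pca.shape, X_kpca.shape)
```

The same shape of comparison (accuracy of a downstream classifier, runtime, complexity) is what the paper reports across its four data sets.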

A Study on Issues, Challenges and Application in Data Science

Data science, also known as data-driven science, is an interdisciplinary field about scientific methods, processes, and systems to extract knowledge or insights from data in various forms, either structured or unstructured, similar to data mining. Data science is about dealing with large quantities of data for the purpose of extracting meaningful and logical results, conclusions, and patterns. It is a newly emerging field that encompasses a number of activities, such as data mining and data analysis. It employs techniques ranging across mathematics, statistics, information technology, computer programming, data engineering, pattern recognition and learning, visualization, and high-performance computing. This paper gives a clear idea of the different data science technologies used in big data analytics. Data science is a "concept to unify statistics, data analysis and their related methods" in order to "understand and analyze actual phenomena" with data. It employs techniques and theories drawn from many fields within the broad areas of mathematics, statistics, information science, and computer science, in particular from the subdomains of machine learning, classification, cluster analysis, data mining, databases, and visualization. Data science is much more than simply analysing data. There are many people who enjoy analysing data and could happily spend all day looking at histograms and averages, but for those who prefer other activities, data science offers a range of roles and requires a range of skills. Data science includes data analysis as an important component of the skill set required for many jobs in the area, but it is not the only skill. In this paper the authors' effort is concentrated on exploring the different issues, implementations and challenges in data science.
Mukul Varshney, Shivani Garg, Jyotsna and Abha Kiran Rajpoot, "A Study on Issues, Challenges and Application in Data Science", published in International Journal of Trend in Scientific Research and Development (IJTSRD), ISSN: 2456-6470, Volume 1, Issue 5, August 2017. URL: http://www.ijtsrd.com/papers/ijtsrd2340.pdf Article URL: http://www.ijtsrd.com/computer-science/other/2340/a-study-on-issues-challenges-and-application-in-data-science-/mukul-varshney

Recent Dimensions of Data Science: A Survey

2020

Nowadays, a huge amount of data is generated and collected at every instant of time, and analyzing it is a difficult task. Data are generated and collected in huge amounts from diverse sources such as social media, business transactions, public data, etc. This data may be structured, semi-structured, or unstructured. The data on which analysis is performed these days is not only massive in volume but also varies in type, in the speed at which it is generated, and in its value and other characteristics, which is why it is termed big data. Examining this vast amount of data and extracting relevant information from it requires analysis, and analyzing data at this scale is a major challenge. To do so, we need the help of several data analytics tools and methods that make it easier to deal with. This survey paper talks about different tools and techniques used ...

Data Science: The Way How to Use Data

International Journal of Innovative Research in Science,Engineering and Technology

Data science, a new discovery paradigm, is potentially one of the most significant advances of the early 21st century. Derived from scientific discovery, it is being applied to every human endeavour for which there is sufficient data. Significant and remarkable successes have been achieved; even greater claims have been made. Along with the benefits, challenges and risks abound. The science underlying data science has yet to emerge. This claim is based on observing the centuries-long development of its predecessor paradigms: empirical, theoretical, and Jim Gray's Fourth Paradigm of Scientific Discovery (Hey, Tansley & Tolle, 2009) (aka eScience, data-intensive, computational, procedural). This paper mainly focuses on essential questions for data science: what data science is, the importance of data science in our everyday life, data wrangling, and data science algorithms.

Dimension reduction

2008

When data objects that are the subject of analysis using machine learning techniques are described by a large number of features (i.e., the data are high-dimensional), it is often beneficial to reduce the dimension of the data. Dimension reduction can be beneficial not only for reasons of computational efficiency but also because it can improve the accuracy of the analysis.

Dimensionality Reduction

2008

A dimension refers to a measurement of a certain aspect of an object. Dimensionality reduction is the study of methods for reducing the number of dimensions describing the object. Its general objectives are to remove irrelevant and redundant data, in order to reduce the computational cost and avoid over-fitting (1), and to improve the quality of data for efficient data-intensive processing tasks such as pattern recognition and data mining.
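The "remove irrelevant and redundant data" objective can be sketched with the simplest possible case, a constant feature that carries no information (the data and the use of scikit-learn's `VarianceThreshold` here are illustrative assumptions, not from the entry above):

```python
# Drop features whose variance is zero: a constant column cannot help
# distinguish one object from another, so it is irrelevant by construction.
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([
    [1.0, 0.0, 3.1],
    [2.0, 0.0, 2.9],
    [3.0, 0.0, 3.0],
])  # the middle column is constant across all rows

selector = VarianceThreshold(threshold=0.0)  # remove zero-variance features
X_reduced = selector.fit_transform(X)
print(X_reduced.shape)  # (3, 2): the constant column is gone
```

Redundant (highly correlated) features need more work than this, e.g., correlation filtering or the projection methods discussed elsewhere in this review, but the objective is the same: fewer dimensions, the same usable information.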

Data Reduction Techniques: A Comparative Study

Journal of Kufa for Mathematics and Computer, 2022

Data preprocessing in general, and data reduction in particular, represent the main steps in data mining techniques and algorithms, since real-world data is so vast that analysis can take a long time to complete. Almost all mining techniques, including classification, clustering, association and others, have high time and space complexities due to the huge amount of data and the behaviour of the algorithms themselves. That is why data reduction represents an important phase in the Knowledge Discovery in Databases (KDD) process. Many researchers have introduced important solutions in this field. This paper presents a comparative study of about 22 research papers in the data reduction field, covering different data reduction techniques such as dimensionality reduction, numerosity reduction, sampling, clustering, data cube aggregation and other techniques. From the conducted study, it can be concluded that the appropriate data reduction technique is highly dependent on the data type, the dataset size, the application goal, the presence of noise and outliers, and the compromise between the reduced data and the knowledge required from the analysis.
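Numerosity reduction by sampling, one of the techniques the study compares, can be sketched in a few lines (the sampling fraction and the synthetic data are assumptions for illustration, not figures from the paper):

```python
# Numerosity reduction: replace a large data set with a small random sample
# whose summary statistics stay close to those of the original.
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=50.0, scale=5.0, size=100_000)  # hypothetical measurements

# Keep a 1% simple random sample without replacement.
sample = rng.choice(data, size=1_000, replace=False)
print(round(float(data.mean()), 1), round(float(sample.mean()), 1))
```

This illustrates the study's conclusion in miniature: how aggressively one can sample depends on the data's variability, the presence of outliers, and how much fidelity the downstream analysis requires.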

An Analysis of Data Science and its Applications

International Journal of Mechanical Engineering, 2021

Data science is a combination of multiple disciplines, such as statistics, data analysis and machine learning, that are used to perform data analysis and to extract knowledge from it. It is used to find patterns in data through data analysis and thereby make decisions. Through data science, organizations are able to make better decisions, perform predictive analysis and discover patterns. This paper explains several data science applications created with the PyCharm software, showing how each application reads data from CSV, Excel, JSON and MongoDB sources to produce patterns in the data, and provides an explanation of several applications of data science.
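A minimal sketch of one such application, reading a CSV and summarizing it with pandas (the file contents, column names and the "pattern" computed are hypothetical, not taken from the paper):

```python
# Read CSV data and surface a simple pattern: totals grouped by category.
import io
import pandas as pd

# Stand-in for a file such as "sales.csv"; read_csv accepts any file-like object.
csv_text = "region,units\nNorth,120\nSouth,95\nNorth,130\n"
df = pd.read_csv(io.StringIO(csv_text))

# A simple "pattern of data": total units per region.
totals = df.groupby("region")["units"].sum()
print(totals.to_dict())  # {'North': 250, 'South': 95}
```

The same shape of pipeline applies to the other sources the paper lists: `pd.read_excel` and `pd.read_json` for Excel and JSON files, and a cursor query for MongoDB, all converging on a DataFrame for analysis.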