Using linear regression and ANN techniques in determining variable importance

The use of neural networks in chemical engineering is well documented. Research into the explanatory capacity of neural networks has also increased, although it has been hindered by the perception of Artificial Neural Networks (ANNs) as a black-box technology. Determining variable importance in complex systems with many variables, as found in ecology, water treatment, petrochemical production, and metallurgy, would reduce the number of variables used in optimisation exercises, easing model complexity and ultimately saving money. In process engineering, the use of data to optimise processes is limited if some degree of process understanding is not present. The objective of this project is to develop a methodology that uses ANN technology and Multiple Linear Regression (MLR) to identify the explanatory variables in a dataset and their importance to process outputs. The methodology is tested on data that exhibit defined and well-known numeric relationships, which are expressed as four equations. The project assesses the relative importance of the independent variables by applying the “dropping method” to a regression model and to ANNs. Regression, traditionally used to determine variable contribution, can be unsuccessful if a highly non-linear relationship exists; ANNs could address this shortcoming. To distinguish them, explanatory variables that do not contribute significantly to the output are termed “suspect variables”. Ultimately, the suspect variables identified by the regression model and by the ANN should be the same, assuming a good regression model and network. The dummy variables introduced into the four equations are successfully identified as suspect variables. Furthermore, the degree of variable importance was determined using the linear regression and ANN models.
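The dropping method described above can be illustrated with a minimal sketch for the linear regression case. It assumes that importance is measured as the loss in R² when one explanatory variable is removed and the model is refitted; the abstract does not state the exact score used, so this is one plausible reading. The test data, the coefficients, and the function names `r_squared` and `dropping_importance` are hypothetical and are not taken from the four equations of the study.

```python
import numpy as np

def r_squared(X, y):
    """Fit ordinary least squares with an intercept and return R^2."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot

def dropping_importance(X, y):
    """Return the loss in R^2 caused by dropping each column in turn."""
    full = r_squared(X, y)
    drops = []
    for j in range(X.shape[1]):
        reduced = np.delete(X, j, axis=1)  # refit without variable j
        drops.append(full - r_squared(reduced, y))
    return np.array(drops)

# Hypothetical data (an assumption): x0 and x1 drive the output,
# x2 is a dummy variable with no influence on y.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(500, 3))
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(0.0, 0.05, size=500)

importance = dropping_importance(X, y)
print(importance)  # one R^2 loss per explanatory variable
```

The same loop could in principle be applied to an ANN by retraining the network without each input in turn, although the cost of retraining and the variation between training runs would need to be considered.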
As the complexity of the equations increased, the accuracy of the linear regression models decreased, so suspect variables were not correctly identified. The complexity of the equations does not affect the accuracy of the ANN model, and the suspect variables are correctly identified. The use of R² and average error in establishing a criterion for identifying suspect variables is explored. It is established that the cumulative variable importance percentage (additive percentage) has to be below 5% for an explanatory variable to be considered a suspect variable. Combining linear regression and ANNs provides insight into the importance of explanatory variables, and suspect variables and their contribution can be determined. Once identified, suspect variables can be eliminated from the model, simplifying it and increasing its accuracy.
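The 5% criterion can be sketched as follows. The abstract does not define how the additive percentage is computed, so this example assumes that each variable's raw importance score is expressed as a share of the total across all variables. The function name `flag_suspects`, the variable names, and the scores are hypothetical.

```python
import numpy as np

def flag_suspects(scores, names, threshold_pct=5.0):
    """Convert raw importance scores to percentages and flag low contributors."""
    # Negative scores (possible through numerical noise) are treated as zero.
    scores = np.clip(np.asarray(scores, dtype=float), 0.0, None)
    total = scores.sum()
    if total == 0.0:
        raise ValueError("all importance scores are zero")
    pct = 100.0 * scores / total
    suspects = [n for n, p in zip(names, pct) if p < threshold_pct]
    return pct, suspects

# Hypothetical scores, e.g. the loss in R^2 when each variable is dropped.
pct, suspects = flag_suspects([0.90, 0.10, 0.001], ["x0", "x1", "dummy"])
print(pct, suspects)
```

Under this assumed definition, only variables contributing less than 5% of the total importance are flagged, which matches the threshold stated above but not necessarily the exact calculation used in the study.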