Spectrophotometric variable selection by mutual information
Related papers
Mutual information for the selection of relevant variables in spectrometric nonlinear modelling
Computing Research Repository, 2007
Data from spectrophotometers form vectors of a large number of exploitable variables. Building quantitative models from these variables most often requires using a smaller set of variables than the initial one. Indeed, too many input variables give a model too many parameters, leading to overfitting and poor generalization. In this paper, we suggest using the mutual information measure to select variables from the initial set. Mutual information measures the information content of the input variables with respect to the model output, without making any assumption about the model that will be used; it is thus suitable for nonlinear modelling. In addition, it selects variables from the initial set rather than linear or nonlinear combinations of them. It therefore allows greater interpretability of the results, without decreasing model performance compared to other variable projection methods.
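As an illustration of the idea in this abstract, the following is a minimal sketch of ranking spectral variables by their estimated mutual information with the output. It is not the paper's estimator: it assumes a simple histogram (plug-in) estimate, and the function names `mutual_information` and `rank_variables`, the bin count, and the synthetic data are all hypothetical choices made for the example.

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Histogram (plug-in) estimate of I(x; y) in nats. Assumed estimator, for illustration."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal distribution of x
    py = pxy.sum(axis=0, keepdims=True)   # marginal distribution of y
    nz = pxy > 0                          # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

def rank_variables(X, y, bins=8):
    """Return column indices of X sorted by decreasing estimated MI with y, plus the scores."""
    scores = np.array([mutual_information(X[:, j], y, bins)
                       for j in range(X.shape[1])])
    return np.argsort(scores)[::-1], scores

# Synthetic data (not from the paper): y depends nonlinearly on column 2 only.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = X[:, 2] ** 2 + 0.05 * rng.normal(size=500)

order, scores = rank_variables(X, y)
print("variables ranked by estimated MI:", order)
```

The quadratic dependence is chosen because it has essentially no linear correlation with the output, so it shows why a model-free measure can pick up variables that a correlation ranking would miss.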
The existence of a local maximum of the information obtained in univariate analysis was discovered for the first time. The information content was shown to be maximal at the point where the predicted concentration is close to the average of the calibration set concentrations. This result was obtained using a new expression for the systematic error of the measured concentration caused by systematic errors in the preparation of standard samples. The information content of the analysis of multicomponent mixtures was calculated numerically using simulated Gaussian spectra and the spectrum of Excedrin tablet components. It was shown that the informational approach allows balancing the ratio between random and systematic errors. The use of information measures for calculating the optimal number of components for gasoline octane number determination by the Partial Least Squares method was found not to be effective.
Chemometrics and Intelligent Laboratory Systems, 2004
Data from spectrophotometers form spectra that are sets of a great number of exploitable variables in quantitative chemical analysis; calibration models using chemometric methods must be established to exploit these variables. In order to design these calibration models which are specific to each analyzed parameter, it is advisable to select a reduced number of spectral variables. This paper presents a new incremental method (step by step) for the selection of spectral variables, using linear regression or neural networks, and based on an objective validation (external) of the calibration model; this validation is carried out on data that are independent from those used during calibration. The advantages of the method are discussed and highlighted, in comparison to the current calibration methods used in quantitative chemical analysis by spectrophotometry.
Bayesian approach to spectrophotometric analysis of multicomponent substances
2003
The spectrophotometric analysis of a chemical substance is based on the interpretation of the measurement data acquired by means of a spectrophotometer, i.e., on estimation of the concentrations of its components. In this paper, a Bayesian approach to the estimation of those concentrations is proposed. Its effective application requires a considerable amount of statistical a priori information, viz., the probability density functions characterizing the distributions of the concentrations, of the errors in the data, and of the residual components in the analyzed substance whose concentrations are not estimated. The proposed approach is studied using synthetic data generated on the basis of some real-world reference spectra. The results of the study are compared with those obtained by means of the currently used method for estimation of concentrations, viz., constrained least-squares curve fitting.
A review of information theory in analytical chemometrics
Journal of Chemometrics, 1990
Information theory makes it possible to judge and evaluate methods and results in chemical analysis. The obtained information can be expressed in different ways. One way is to define information as the decrease of uncertainty after analysis. Conditional probabilities are therefore considered when evaluating the information provided by qualitative analyses. However, the use of other information measures, such as the information gain, is often preferable. In multicomponent analysis the translation of information from signals to the amounts of the analytes has been investigated along with the relevance of individual components. Information theory can also be applied to find the optimum experimental conditions. The evaluation of the properties of analytical methods by information theory has been proposed.
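The notion of information as a decrease of uncertainty can be made concrete with a small numerical sketch. It assumes Shannon entropy as the uncertainty measure and takes "information gain" to mean the Kullback-Leibler divergence of the posterior from the prior; both are assumptions for this example, not definitions quoted from the review. The probabilities are made up: four candidate identities for an unknown, equally likely before a qualitative analysis.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy of a discrete distribution, in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # 0 * log(0) is treated as 0
    return float(-np.sum(p * np.log2(p)))

def kl_bits(p, q):
    """Kullback-Leibler divergence D(p || q) in bits (assumed 'information gain')."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    nz = p > 0
    return float(np.sum(p[nz] * np.log2(p[nz] / q[nz])))

# Hypothetical probabilities before and after the analysis.
prior = [0.25, 0.25, 0.25, 0.25]
posterior = [0.70, 0.10, 0.10, 0.10]

decrease = entropy_bits(prior) - entropy_bits(posterior)
gain = kl_bits(posterior, prior)
print(f"decrease of uncertainty: {decrease:.3f} bits")
print(f"information gain (KL):   {gain:.3f} bits")
```

With a uniform prior the two measures coincide; they differ once the prior is not uniform, which is one reason a choice between them has to be made.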
Chemometric Calibration of Infrared Spectrometers: Selection and Validation Of Variables by
2004
Data from spectrophotometers form spectra that are sets of a great number of exploitable variables in quantitative chemical analysis; calibration models using chemometric methods must be established to exploit these variables. In order to design these calibration models which are specific to each analyzed parameter, it is advisable to select a reduced number of spectral variables. This paper presents a new incremental method (step by step) for the selection of spectral variables, using linear regression or neural networks, and based on an objective validation (external) of the calibration model; this validation is carried out on data that are independent from those used during calibration. The advantages of the method are discussed and highlighted, in comparison to the current calibration methods used in quantitative chemical analysis by spectrophotometry.
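The incremental selection with external validation described in this abstract can be sketched as a greedy forward search. This is only a sketch under assumptions, not the authors' algorithm: it uses ordinary least squares as the calibration model, validation RMSE as the criterion, and a "stop when no candidate improves" rule. The names `validation_rmse` and `forward_select` and the synthetic data are hypothetical.

```python
import numpy as np

def validation_rmse(X_cal, y_cal, X_val, y_val, cols):
    """Fit a linear model on calibration data, return RMSE on independent validation data."""
    A_cal = np.column_stack([np.ones(len(y_cal)), X_cal[:, cols]])
    A_val = np.column_stack([np.ones(len(y_val)), X_val[:, cols]])
    coef, *_ = np.linalg.lstsq(A_cal, y_cal, rcond=None)
    return float(np.sqrt(np.mean((A_val @ coef - y_val) ** 2)))

def forward_select(X_cal, y_cal, X_val, y_val, max_vars=10):
    """Add one variable at a time, keeping the one that most lowers validation RMSE."""
    selected, best = [], np.inf
    remaining = list(range(X_cal.shape[1]))
    while remaining and len(selected) < max_vars:
        errors = [validation_rmse(X_cal, y_cal, X_val, y_val, selected + [j])
                  for j in remaining]
        k = int(np.argmin(errors))
        if errors[k] >= best:         # no candidate improves the external validation error
            break
        best = errors[k]
        selected.append(remaining.pop(k))
    return selected, best

# Synthetic "spectra" (not from the paper): only columns 3 and 11 carry signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 20))
y = 2.0 * X[:, 3] - 1.5 * X[:, 11] + 0.01 * rng.normal(size=120)
X_cal, y_cal, X_val, y_val = X[:80], y[:80], X[80:], y[80:]

selected, best = forward_select(X_cal, y_cal, X_val, y_val)
print("selected variables:", selected, "validation RMSE:", round(best, 4))
```

Because the criterion is computed on data not used for fitting, a variable that only fits calibration noise tends not to lower it, which is the point of validating externally.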
Analytica Chimica Acta, 2004
The UV spectrophotometric analysis of a multicomponent mixture containing paracetamol, caffeine, tripelenamine and salicylamide by using multivariate calibration methods, such as principal component regression (PCR) and partial least-squares regression (PLS), was described. The calibration set was based on 47 reference samples, consisting of quaternary, ternary, binary and single-component mixtures, with the aim to develop models able to predict the concentrations of unknown samples containing as many as one-to-four components. The calibration models were optimized by an appropriate selection of the number of factors as well as wavelength ranges to be used for building up the data matrix and excluding any information about the interfering excipients included in pharmaceutics. The PCR and PLS models were compared and their predictive performance was inferred by a successful application to the assays of synthetic mixtures and pharmaceutical formulations.
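For readers unfamiliar with principal component regression, the following minimal sketch shows the technique on synthetic data. It is not the calibration from this paper: the pure-component spectra are made-up Gaussian bands, the mixtures follow an assumed linear (Beer-Lambert) model, and the names `pcr_fit` and `pcr_predict` are hypothetical. Only the count of 47 calibration samples and the four components mirror the abstract.

```python
import numpy as np

def pcr_fit(X, Y, n_factors):
    """Principal component regression: regress Y on the first n_factors PCA scores of X."""
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - x_mean, full_matrices=False)
    V = Vt[:n_factors].T                          # loadings, shape (wavelengths, factors)
    T = (X - x_mean) @ V                          # scores, shape (samples, factors)
    Q, *_ = np.linalg.lstsq(T, Y - y_mean, rcond=None)
    return x_mean, y_mean, V @ Q                  # regression matrix in wavelength space

def pcr_predict(model, X):
    x_mean, y_mean, B = model
    return (X - x_mean) @ B + y_mean

# Synthetic four-component mixtures with overlapping Gaussian bands (illustrative only).
rng = np.random.default_rng(2)
wavelengths = np.arange(60)
centers = np.array([12, 24, 36, 48])
S = np.exp(-((wavelengths[None, :] - centers[:, None]) ** 2) / (2 * 6.0 ** 2))

C_cal = rng.uniform(0, 1, size=(47, 4))           # calibration concentrations
C_test = rng.uniform(0, 1, size=(20, 4))          # independent test concentrations
X_cal = C_cal @ S + 1e-3 * rng.normal(size=(47, 60))
X_test = C_test @ S + 1e-3 * rng.normal(size=(20, 60))

model = pcr_fit(X_cal, C_cal, n_factors=4)
rmse = float(np.sqrt(np.mean((pcr_predict(model, X_test) - C_test) ** 2)))
print("test RMSE with 4 factors:", rmse)
```

The number of factors plays the role that the abstract describes as a quantity to be optimized; here it is simply set to the known number of components.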