Large Data Sets Research Papers

In this paper we consider the question of uncertainty in patterns detected by data mining. In particular, we develop statistical tests for patterns found in continuous data, indicating the significance of these patterns in terms of the probability that they occurred by chance. We examine the performance of these tests on patterns detected in several large data sets.
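
This abstract does not spell out how the tests are constructed. As a minimal, hedged sketch of the general idea (assessing the probability that a pattern arose by chance), the snippet below runs a generic permutation test; the `pattern_score` function and the shuffling scheme are illustrative assumptions, not the authors' method.

```python
import numpy as np

def permutation_p_value(data, pattern_score, n_perm=1000, rng=None):
    """Estimate the probability that a pattern's score arises by chance.

    data          : 1-D array of continuous observations
    pattern_score : callable mapping an array to a scalar pattern strength
                    (hypothetical; stands in for whatever detector is used)
    """
    rng = np.random.default_rng(rng)
    observed = pattern_score(data)
    # Re-score random shufflings of the data to build a null distribution.
    null = np.array([pattern_score(rng.permutation(data))
                     for _ in range(n_perm)])
    # One-sided p-value, with +1 correction so we never report exactly zero.
    return (1 + np.sum(null >= observed)) / (n_perm + 1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    series = np.cumsum(rng.normal(size=500))  # toy continuous data
    trend_score = lambda x: abs(np.corrcoef(np.arange(len(x)), x)[0, 1])
    print(permutation_p_value(series, trend_score, rng=1))
```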

Abstract. Models based on neural and neuro-fuzzy structures are developed to represent knowledge about a large data set containing chemical descriptors of organic compounds commonly used in industrial processes. The neuro-fuzzy models proposed here include ...
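
The abstract is cut off before the model structure is described. As a hedged sketch of the neural half of such an approach only, the snippet below fits a small feed-forward network to a table of descriptor vectors; the descriptors, target property, and network size are invented placeholders.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Hypothetical descriptor matrix: rows are compounds, columns are
# descriptors (e.g. molecular weight, logP, polar surface area).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 0.5 * X[:, 0] - 1.2 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Standardize the descriptors, then fit a small feed-forward network.
model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(16, 8), max_iter=2000, random_state=0),
)
model.fit(X, y)
print("R^2 on training data:", model.score(X, y))
```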

... and small scale fading in indoor wireless communication channels. Reinaldo A. Valenzuela, Dmitry Chizhik and Jonathan Ling. ... References: [1] R. A. Valenzuela, O. Landron, D. L. Jacobs, "Estimating Local Mean Signal Strength of Indoor Multipath Propagation", IEEE Trans. ...

Improving students' academic performance is not an easy task for the academic community of higher learning. The academic performance of engineering and science students during their first year at university is a turning point in their educational path and usually weighs on their Grade Point Average (GPA) in a decisive manner. Student evaluation factors such as class quizzes, mid-term and final exams, assignments, and lab work are studied. It is recommended that all of this correlated information be conveyed to the class teacher before the final exam is conducted. This study will help teachers reduce the dropout ratio to a significant level and improve student performance. In this paper, we present a hybrid procedure based on the Decision Tree data mining method and data clustering that enables academicians to predict students' GPA, so that instructors can take the necessary steps to improve student academic performance. Grade Point Average (GPA) is a co...
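
The abstract names the two ingredients (a decision tree and data clustering) but not the exact pipeline. One plausible, hedged reading is to cluster students on their assessment scores and feed the cluster label to a decision tree as an extra feature, as sketched below; the feature names, weights, and GPA bands are invented for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Hypothetical per-student scores: quizzes, mid-term, final, assignments, lab work.
scores = rng.uniform(0, 100, size=(300, 5))
# Hypothetical GPA bands (0 = at risk, 1 = average, 2 = good), derived here
# from a weighted total as a stand-in for real transcript labels.
gpa_band = np.digitize(scores @ [0.1, 0.25, 0.4, 0.15, 0.1], [50, 75])

# Step 1: cluster students into performance groups.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scores)

# Step 2: train a decision tree on the raw scores plus the cluster label.
features = np.column_stack([scores, clusters])
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(features, gpa_band)
print("training accuracy:", tree.score(features, gpa_band))
```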

For the past two decades, the single-index model, a special case of projection pursuit regression, has proven to be an efficient way of coping with the high-dimensional problem in nonparametric regression. In this paper, based on a weakly dependent sample, we investigate a robust single-index model, where the single index is identified by the best approximation to the multivariate prediction function of the response variable, regardless of whether the prediction function is a genuine single-index function. A polynomial spline estimator is proposed for the single-index coefficients, and is shown to be root-n consistent and asymptotically normal. An iterative optimization routine is used that is sufficiently fast for the user to analyze large data sets of high dimension within seconds. Simulation experiments have provided strong evidence corroborating the asymptotic theory. Application of the proposed procedure to the river flow data of Iceland has yielded superior out-of-sample r...
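
The paper's estimator is a polynomial spline fitted by an iterative routine; as a rough, hedged stand-in for that idea, the sketch below profiles out the link function with a smoothing spline and searches over the index coefficients directly. It is a toy illustration of single-index estimation, not the authors' algorithm.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline
from scipy.optimize import minimize

def profile_rss(beta, X, y):
    """Residual sum of squares after fitting a spline link along X @ beta."""
    beta = beta / np.linalg.norm(beta)  # identifiability: unit-norm index
    u = X @ beta
    order = np.argsort(u)
    # Smoothing spline as a stand-in for the paper's polynomial spline link.
    link = UnivariateSpline(u[order], y[order], k=3, s=len(y))
    return np.sum((y - link(u)) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
true_beta = np.array([0.6, -0.8, 0.0, 0.0])
y = np.sin(X @ true_beta) + 0.1 * rng.normal(size=500)

res = minimize(profile_rss, x0=np.ones(4), args=(X, y), method="Nelder-Mead")
# The index direction is only identified up to sign in this toy setup.
beta_hat = res.x / np.linalg.norm(res.x)
print("estimated index coefficients:", np.round(beta_hat, 2))
```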

Racial and ethnic achievement gaps narrowed substantially in the 1970s and 1980s. When some of the gaps widened in the 1990s, the nation's progress toward racial and ethnic equity suffered setbacks. This article offers a look below the surface at Black-White and Hispanic-White achievement gap trends over the past 30 years. The literature review and data analysis identify the key factors that seem to have contributed to bifurcated patterns in achievement gaps. Conventional measures of socioeconomic and family conditions, youth culture and student behavior, and schooling conditions and practices might account for some of the achievement gap trends for a limited time period or for a particular racial and ethnic group. However, they do not fully capture the variations. This preliminary analysis of covariations in racial and ethnic gap patterns across several large data sets has implications for future research on the achievement of minority groups.

The organophosphorus compound soman is an acetylcholinesterase inhibitor that causes damage to the brain. Exposure to soman causes neuropathology as a result of prolonged and recurrent seizures. In the present study, long-term recordings of cortical EEG were used to develop an unbiased means to quantify measures of seizure activity in a large data set while excluding other signal types. Rats were implanted with telemetry transmitters and exposed to soman, followed by treatment with therapeutics similar to those administered in the field after nerve agent exposure. EEG, activity, and temperature were recorded continuously for a minimum of 2 days pre-exposure and 15 days post-exposure. A set of automatic MATLAB algorithms has been developed to remove artifacts and measure the characteristics of long-term EEG recordings. The algorithms use short-time Fourier transforms to compute the power spectrum of the signal for 2-s intervals. The spectrum is then divided into the delta, theta, alpha, and beta frequency bands. A linear fit to the power spectrum is used to distinguish normal EEG activity from artifacts and high-amplitude spike-wave activity. Changes in time spent in seizure over a prolonged period are a powerful indicator of the effects of novel therapeutics against seizures. A graphical user interface has been created that simultaneously plots the raw EEG in the time domain, the power spectrum, and the wavelet transform. Motor activity and temperature are associated with EEG changes. The accuracy of this algorithm is also verified against visual inspection of video recordings up to 3 days after exposure.
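
The abstract outlines the signal pipeline: a short-time power spectrum for each 2-s interval, power in the delta, theta, alpha, and beta bands, and a linear fit to the spectrum. The sketch below reproduces that outline in Python rather than the authors' MATLAB, with an assumed sampling rate and invented data, so it is a hedged approximation rather than the published algorithm.

```python
import numpy as np
from scipy.signal import welch

FS = 250  # assumed sampling rate in Hz; the abstract does not state one
BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 13), "beta": (13, 30)}

def analyze_epoch(epoch, fs=FS):
    """Band powers and spectral slope for one 2-s EEG epoch."""
    freqs, psd = welch(epoch, fs=fs, nperseg=len(epoch))
    powers = {
        name: np.trapz(psd[(freqs >= lo) & (freqs < hi)],
                       freqs[(freqs >= lo) & (freqs < hi)])
        for name, (lo, hi) in BANDS.items()
    }
    # Linear fit to the log power spectrum; the abstract uses such a fit to
    # separate spike-wave activity and artifacts from normal EEG.
    keep = (freqs >= 1) & (freqs <= 30)
    slope, intercept = np.polyfit(freqs[keep], np.log10(psd[keep] + 1e-12), 1)
    return powers, slope, intercept

rng = np.random.default_rng(0)
epoch = rng.normal(size=2 * FS)  # 2 s of toy "EEG"
powers, slope, intercept = analyze_epoch(epoch)
print({k: round(v, 4) for k, v in powers.items()}, round(slope, 4))
```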

Abstract: A great deal has previously been written about the use of skeletal morphological changes in estimating ages-at-death. This article looks in particular at the pubic symphysis, as it was historically one of the first regions to be described in the literature on age estimation. Despite the lengthy history, the value of the pubic symphysis in estimating ages and in providing evidence for putative identifications remains unclear. This lack of clarity primarily stems from the fact that rather ad hoc statistical methods have been applied in previous studies. This article presents a statistical analysis of a large data set (n = 1766) of pubic symphyseal scores from multiple contexts, including anatomical collections, war dead, and victims of genocide. The emphasis is on finding statistical methods that will have the correct “coverage.” “Coverage” means that if a method has a stated coverage of 50%, then approximately 50% of the individuals in a particular pubic symphyseal stage should have ages between the stated age limits, approximately 25% should be below the bottom age limit, and approximately 25% above the top age limit. In a number of applications it is shown that if an appropriate prior age-at-death distribution is used, then “transition analysis” will provide accurate “coverages,” while percentile methods, range methods, and means (± standard deviations) will not. Even in cases where there are significant differences in the mean ages-to-transition between populations, the effects on the stated age limits for particular “coverages” are minimal. As a consequence, more emphasis needs to be placed on collecting data on age changes in large samples, rather than focusing on the possibility of inter-population variation in rates of aging.
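
Transition analysis as described here combines a prior age-at-death distribution with the probability of observing a given symphyseal stage at each age, and reads coverage intervals off the resulting posterior. The sketch below illustrates that logic under toy assumptions: the cumulative-probit stage model, its transition parameters, and the prior are all invented for illustration.

```python
import numpy as np
from scipy.stats import norm

ages = np.linspace(15, 90, 751)

# Toy prior age-at-death distribution (an assumption, not from the paper).
prior = norm.pdf(ages, loc=45, scale=18)

# Toy cumulative-probit stage model: probability of observing "stage 3",
# with invented transition means and SD on a log-age scale.
def p_stage3(age, mu_23=np.log(30), mu_34=np.log(45), sd=0.25):
    z = np.log(age)
    return norm.cdf((z - mu_23) / sd) - norm.cdf((z - mu_34) / sd)

# Posterior over age given the observed stage, via Bayes' rule.
posterior = prior * p_stage3(ages)
posterior /= np.trapz(posterior, ages)

# 50% coverage interval: 25th and 75th percentiles of the posterior.
cdf = np.cumsum(posterior) * (ages[1] - ages[0])
lo, hi = np.interp([0.25, 0.75], cdf, ages)
print(f"50% coverage interval: {lo:.1f}-{hi:.1f} years")
```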

As technologies for acquiring 3D data and algorithms for constructing integrated models evolve, very large data sets representing objects or environments are emerging in various application areas. As a result, significant research in computer graphics has aimed to render such models interactively on affordable commodity computers. Interest is growing in the possibility of integrating real-time analysis and transformation tools into interactive visualization environments as such tools become more widely available.

Summary. I discuss the production of low rank smoothers for d ≥ 1 dimensional data, which can be fitted by regression or penalized regression methods. The smoothers are constructed by a simple transformation and truncation of the basis that arises from the solution of the thin plate spline smoothing problem, and are optimal in the sense that the truncation is designed to result in the minimum possible perturbation of the thin plate spline smoothing problem given the dimension of the basis used to construct the smoother. By making use of Lanczos iteration the basis change and truncation are computationally efficient. The smoothers allow the use of approximate thin plate spline models with large data sets; avoid the ‘knot placement’ problems that usually complicate modelling with regression splines or penalized regression splines; provide a sensible way of modelling interaction terms in generalized additive models; provide low rank approximations to generalized smoothing spline models, appropriate for use with large data sets; provide a means for incorporating smooth functions of more than one variable into non-linear models; and improve the computational efficiency of penalized likelihood models incorporating thin plate splines. Given that the approach produces spline-like models with a sparse basis, it also provides a natural way of incorporating unpenalized spline-like terms in linear and generalized linear models, and these can be treated just like any other model terms from the point of view of model selection, inference and diagnostics.
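
The construction described (truncating the eigenbasis of the thin plate spline problem, computed by Lanczos iteration) can be illustrated compactly. The sketch below shows the core idea for d = 2 under simplifying assumptions: build the radial basis matrix, extract its leading eigenpairs with scipy's Lanczos-based solver, and fit a penalized regression in the truncated basis. It omits the polynomial null-space terms and constraints of the full method, so it is an illustrative reduction, not the paper's algorithm.

```python
import numpy as np
from scipy.sparse.linalg import eigsh  # Lanczos-based partial eigensolver

def tps_kernel(r):
    """Thin plate spline radial basis for d = 2: eta(r) = r^2 log r."""
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(r > 0, r**2 * np.log(r), 0.0)

rng = np.random.default_rng(0)
X = rng.uniform(size=(400, 2))
y = np.sin(4 * X[:, 0]) * np.cos(3 * X[:, 1]) + 0.1 * rng.normal(size=400)

# Full radial basis matrix E, then a rank-k truncation via Lanczos iteration.
r = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
E = tps_kernel(r)
k = 30
vals, vecs = eigsh(E, k=k, which="LM")  # k largest-magnitude eigenpairs
U = vecs                                 # truncated basis, n x k

# Penalized regression in the truncated basis; the penalty weights come
# from the retained eigenvalues (absolute values, since E is indefinite).
lam = 1e-3
coef = np.linalg.solve(U.T @ U + lam * np.diag(np.abs(vals)), U.T @ y)
fitted = U @ coef
print("RMSE:", np.sqrt(np.mean((y - fitted) ** 2)))
```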