Large Data Sets Research Papers
- Treatment Outcome, Adolescent, Stents, Humans
In this paper we consider the question of uncertainty of detected patterns in data mining. In particular, we develop statistical tests for patterns found in continuous data, indicating the significance of these patterns in terms of the probability that they have occurred by chance. We examine the performance of these tests on patterns detected in several large data sets, including
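The abstract does not spell out how the tests are constructed. As a hedged illustration of assessing whether a detected pattern could have occurred by chance, the sketch below runs a simple permutation test on a correlation pattern in continuous data; the function name and the choice of correlation as the pattern statistic are assumptions for illustration, not the authors' method.

```python
import numpy as np

def permutation_p_value(x, y, n_perm=10_000, seed=0):
    """Estimate the probability that an observed correlation 'pattern'
    between x and y would occur by chance, via a permutation test."""
    rng = np.random.default_rng(seed)
    observed = abs(np.corrcoef(x, y)[0, 1])
    count = 0
    for _ in range(n_perm):
        permuted = rng.permutation(y)  # break any real association
        if abs(np.corrcoef(x, permuted)[0, 1]) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)  # small-sample correction

# Usage: a weak linear pattern plus noise
rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 0.2 * x + rng.normal(size=500)
print(f"p = {permutation_p_value(x, y):.4f}")
```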
Models based on neural and neuro-fuzzy structures are developed to represent knowledge about a large data set containing chemical descriptors of organic compounds commonly used in industrial processes. The neuro-fuzzy models proposed here include ...
... and small-scale fading in indoor wireless communication channels. Reinaldo A. Valenzuela, Dmitry Chizhik and Jonathan Ling ... References [1] R. A. Valenzuela, O. Landron, D. L. Jacobs, "Estimating Local Mean Signal Strength of Indoor Multipath Propagation", IEEE Trans. ...
Improving students' academic performance is not an easy task for the academic community of higher learning. The academic performance of engineering and science students during their first year at university is a turning point in their educational path and usually affects their Grade Point Average (GPA) in a decisive manner. Student evaluation factors such as class quizzes, mid-term and final exams, assignments, and lab work are studied. It is recommended that all of this correlated information be conveyed to the class teacher before the final exam is conducted. This study will help teachers reduce the dropout ratio to a significant level and improve the performance of students. In this paper, we present a hybrid procedure based on the decision tree method of data mining combined with data clustering that enables academicians to predict students' GPA, on the basis of which the instructor can take the necessary steps to improve student academic performance. Grade Point Average (GPA) is a co...
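The abstract gives no implementation details. As a hedged sketch of one way such a decision-tree-plus-clustering hybrid could look, the snippet below clusters students on assessment scores and feeds the cluster label to a tree that predicts GPA; the synthetic data, feature layout, and scikit-learn pipeline are illustrative assumptions, not the authors' procedure.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

# Illustrative data: quiz, mid-term, final, assignment, lab scores (0-100)
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(300, 5))
gpa = (X.mean(axis=1) / 100) * 4.0 + rng.normal(0, 0.2, size=300)  # synthetic GPA

# Step 1: cluster students into performance groups
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Step 2: feed the cluster label to a decision tree alongside the raw scores
X_hybrid = np.column_stack([X, clusters])
X_tr, X_te, y_tr, y_te = train_test_split(X_hybrid, gpa, random_state=0)
tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_tr, y_tr)
print(f"Held-out R^2: {tree.score(X_te, y_te):.2f}")
```

A real pipeline would cross-validate the tree depth and the number of clusters rather than fixing them as above.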
For the past two decades, the single-index model, a special case of projection pursuit regression, has proven to be an efficient way of coping with the high-dimensional problem in nonparametric regression. In this paper, based on a weakly dependent sample, we investigate a robust single-index model, where the single index is identified by the best approximation to the multivariate prediction function of the response variable, regardless of whether the prediction function is a genuine single-index function. A polynomial spline estimator is proposed for the single-index coefficients, and is shown to be root-n consistent and asymptotically normal. An iterative optimization routine is used that is sufficiently fast for the user to analyze large data sets of high dimension within seconds. Simulation experiments have provided strong evidence corroborating the asymptotic theory. Application of the proposed procedure to the river flow data of Iceland has yielded superior out-of-sample r...
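One standard formalization of the identification described above, with notation assumed here since the abstract gives none: the index direction is the best single-index approximation to the prediction function m(x) = E[Y | X = x], whether or not m is genuinely single-index.

```latex
\[
  \theta_0 \;=\; \arg\min_{\|\theta\| = 1}\;
    \mathbb{E}\bigl[\{\, m(X) - g_{\theta}(\theta^{\top}X) \,\}^{2}\bigr],
  \qquad
  g_{\theta}(u) \;=\; \mathbb{E}\bigl[\, m(X) \mid \theta^{\top}X = u \,\bigr].
\]
```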
- by Alexander Kowarik and +1
- Econometrics, Statistics, Large Data Sets, Missing Values
Racial and ethnic achievement gaps narrowed substantially in the 1970s and 1980s. As some of the gaps widened in the 1990s, the nation's progress toward racial and ethnic equity suffered setbacks. This article offers a look below the surface at Black-White and Hispanic-White achievement gap trends over the past 30 years. The literature review and data analysis identify the key factors that seem to have contributed to bifurcated patterns in achievement gaps. The conventional measures of socioeconomic and family conditions, youth culture and student behavior, and schooling conditions and practices might account for some of the achievement gap trends for a limited time period or for a particular racial or ethnic group. However, they do not fully capture the variations. This preliminary analysis of covariations in racial and ethnic gap patterns across several large data sets has implications for future research on the achievement of minority groups.
The organophosphorous compound soman is an acetylcholinesterase inhibitor that causes damage to the brain. Exposure to soman causes neuropathology as a result of prolonged and recurrent seizures. In the present study, long-term recordings of cortical EEG were used to develop an unbiased means of quantifying measures of seizure activity in a large data set while excluding other signal types. Rats were implanted with telemetry transmitters and exposed to soman, followed by treatment with therapeutics similar to those administered in the field after nerve agent exposure. EEG, activity, and temperature were recorded continuously for a minimum of 2 days pre-exposure and 15 days post-exposure. A set of automatic MATLAB algorithms has been developed to remove artifacts and measure the characteristics of long-term EEG recordings. The algorithms use short-time Fourier transforms to compute the power spectrum of the signal for 2-s intervals. The spectrum is then divided into the delta, theta, alpha, and beta frequency bands. A linear fit to the power spectrum is used to distinguish normal EEG activity from artifacts and high-amplitude spike-wave activity. Changes in time spent in seizure over a prolonged period are a powerful indicator of the effects of novel therapeutics against seizures. A graphical user interface has been created that simultaneously plots the raw EEG in the time domain, the power spectrum, and the wavelet transform. Motor activity and temperature are associated with EEG changes. The accuracy of the algorithm was also verified against visual inspection of video recordings up to 3 days after exposure.
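The abstract names the processing steps (2-s short-time Fourier transforms, band powers, a linear fit to the spectrum) but the MATLAB implementation is not shown. The Python sketch below reproduces that pipeline under assumed parameters; the sampling rate and band edges are illustrative choices, not the authors' values.

```python
import numpy as np
from scipy.signal import spectrogram

FS = 250  # assumed EEG sampling rate (Hz)
BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 13), "beta": (13, 30)}

def band_powers_and_slope(eeg):
    """Power per band in 2-s windows, plus the slope of a linear fit to the
    log power spectrum (used, as in the abstract, to separate normal EEG
    from artifacts and high-amplitude spike-wave activity)."""
    f, t, Sxx = spectrogram(eeg, fs=FS, nperseg=2 * FS, noverlap=0)
    powers = {name: Sxx[(f >= lo) & (f < hi)].sum(axis=0)
              for name, (lo, hi) in BANDS.items()}
    keep = (f >= 1) & (f <= 30)
    slopes = np.array([np.polyfit(f[keep], np.log(Sxx[keep, i] + 1e-12), 1)[0]
                       for i in range(Sxx.shape[1])])
    return t, powers, slopes

# Usage on a synthetic 60-s trace
eeg = np.random.default_rng(0).normal(size=60 * FS)
t, powers, slopes = band_powers_and_slope(eeg)
print(t.shape, powers["theta"].shape, slopes[:3])
```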
A great deal has previously been written about the use of skeletal morphological changes in estimating ages-at-death. This article looks in particular at the pubic symphysis, as it was historically one of the first regions to be described in the literature on age estimation. Despite this lengthy history, the value of the pubic symphysis in estimating ages and in providing evidence for putative identifications remains unclear. This lack of clarity primarily stems from the fact that rather ad hoc statistical methods have been applied in previous studies. This article presents a statistical analysis of a large data set (n = 1766) of pubic symphyseal scores from multiple contexts, including anatomical collections, war dead, and victims of genocide. The emphasis is on finding statistical methods that will have the correct “coverage.” “Coverage” means that if a method has a stated coverage of 50%, then approximately 50% of the individuals in a particular pubic symphyseal stage should have ages between the stated age limits, approximately 25% should be below the bottom age limit, and approximately 25% above the top age limit. In a number of applications it is shown that if an appropriate prior age-at-death distribution is used, then “transition analysis” will provide accurate “coverages,” while percentile methods, range methods, and means (± standard deviations) will not. Even in cases where there are significant differences in the mean ages-to-transition between populations, the effects on the stated age limits for particular “coverages” are minimal. As a consequence, more emphasis needs to be placed on collecting data on age changes in large samples, rather than focusing on the possibility of inter-population variation in rates of aging.
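"Coverage" as defined here can be checked empirically. The sketch below shows the computation a 50% coverage claim implies; the simulated ages and interval are made-up assumptions purely for illustration, not the article's data or method.

```python
import numpy as np

def empirical_coverage(ages, lo, hi):
    """Fraction of individuals inside [lo, hi], below lo, and above hi.
    A method with correct 50% coverage should give roughly
    (0.50, 0.25, 0.25) when [lo, hi] is its stated 50% interval."""
    ages = np.asarray(ages)
    inside = np.mean((ages >= lo) & (ages <= hi))
    below = np.mean(ages < lo)
    above = np.mean(ages > hi)
    return inside, below, above

# Simulated ages-at-death for one symphyseal stage (illustrative only)
ages = np.random.default_rng(0).normal(45, 10, size=1766)
lo, hi = np.percentile(ages, [25, 75])   # a correct 50% interval by construction
print(empirical_coverage(ages, lo, hi))  # ~ (0.50, 0.25, 0.25)
```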
- by Bradley Erickson and +1
- Human Factors, Health Care, Digital imaging, Quality of Care
As technologies for acquiring 3D data and algorithms for constructing integrated models evolve, very large data sets representing objects or environments are emerging in various application areas. As a result, significant research in computer graphics has aimed to render such models interactively on affordable commodity computers. Interest is growing in integrating real-time analysis and transformation tools into interactive visualization environments as these environments become more widely available.
I discuss the production of low rank smoothers for d ≥ 1 dimensional data, which can be fitted by regression or penalized regression methods. The smoothers are constructed by a simple transformation and truncation of the basis that arises from the solution of the thin plate spline smoothing problem and are optimal in the sense that the truncation is designed to result in the minimum possible perturbation of the thin plate spline smoothing problem given the dimension of the basis used to construct the smoother. By making use of Lanczos iteration the basis change and truncation are computationally efficient. The smoothers allow the use of approximate thin plate spline models with large data sets, avoid the problems that are associated with ‘knot placement’ that usually complicate modelling with regression splines or penalized regression splines, provide a sensible way of modelling interaction terms in generalized additive models, provide low rank approximations to generalized smoothing spline models, appropriate for use with large data sets, provide a means for incorporating smooth functions of more than one variable into non-linear models and improve the computational efficiency of penalized likelihood models incorporating thin plate splines. Given that the approach produces spline-like models with a sparse basis, it also provides a natural way of incorporating unpenalized spline-like terms in linear and generalized linear models, and these can be treated just like any other model terms from the point of view of model selection, inference and diagnostics.
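This construction (thin plate regression splines) is implemented in R's mgcv package as the `bs="tp"` smooth. The Python sketch below illustrates only the core truncation idea in one dimension, with the null-space constraint and penalty handling simplified, so every parameter choice here is an assumption for illustration rather than the paper's algorithm.

```python
import numpy as np
from scipy.sparse.linalg import eigsh

rng = np.random.default_rng(0)
n, k, lam = 500, 20, 1e-2  # n data points, rank-k basis, smoothing parameter

x = np.sort(rng.uniform(0, 1, n))
y = np.sin(4 * np.pi * x) + rng.normal(0, 0.3, n)

# Full thin-plate-type radial basis in 1D: eta(r) = r^3. The n x n matrix is
# what makes the full problem expensive for large n; truncation is the point.
E = np.abs(x[:, None] - x[None, :]) ** 3

# Lanczos iteration: keep the k eigenpairs of largest magnitude, which gives
# the minimum-perturbation rank-k approximation to the full smoothing problem.
vals, vecs = eigsh(E, k=k, which="LM")

# Reduced model: y ~ [1, x] beta + (E @ vecs) z, penalty z' diag(|vals|) z
# (|vals| is a simplification here; E itself is not positive semi-definite).
T = np.column_stack([np.ones(n), x])  # unpenalized null-space terms
B = E @ vecs
Xd = np.column_stack([T, B])
P = np.zeros((Xd.shape[1], Xd.shape[1]))
P[2:, 2:] = np.diag(np.abs(vals))     # penalize only the spline part
coef = np.linalg.solve(Xd.T @ Xd + lam * P, Xd.T @ y)
fit = Xd @ coef
print(f"residual sd: {np.std(y - fit):.3f}")
```

The design choice the paper motivates is visible here: the fit involves only a rank-k basis, so no knot placement is needed and the cost is driven by k rather than n.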