Sparse Data Research Papers - Academia.edu

Conditional logistic regression was developed to avoid "sparse-data" biases that can arise in ordinary logistic regression analysis. Nonetheless, it is a large-sample method that can exhibit considerable bias when certain types of matched sets are infrequent or when the model contains too many parameters. Sparse-data bias can cause misleading inferences about confounding, effect modification, dose response, and induction periods, and can interact with other biases. In this paper, the authors describe these problems in the context of matched case-control analysis and provide examples from a study of electrical wiring and childhood leukemia and a study of diet and glioma. The same problems can arise in any likelihood-based analysis, including ordinary logistic regression. The problems can be detected by careful inspection of data and by examining the sensitivity of estimates to category boundaries, variables in the model, and transformations of those variables. One can also apply various bias corrections or turn to methods less sensitive to sparse data than conditional likelihood, such as Bayesian and empirical-Bayes (hierarchical regression) methods.
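
The sparse-data problem and the kind of correction the abstract mentions can be seen in miniature on a single 2×2 stratum (the counts below are invented for illustration): with a near-zero cell the crude odds ratio is badly inflated, and undefined when the cell is exactly zero, while a simple shrinkage device such as the Haldane-Anscombe +0.5 correction, used here as a crude stand-in for the Bayesian and hierarchical-regression methods the authors discuss, pulls the estimate toward the null.

```python
def odds_ratio(a, b, c, d, correction=0.0):
    """Odds ratio for a 2x2 table (a = exposed cases, b = exposed controls,
    c = unexposed cases, d = unexposed controls), with an optional
    continuity correction added to every cell."""
    a, b, c, d = (x + correction for x in (a, b, c, d))
    return (a * d) / (b * c)

# A sparse stratum: only one exposed control.
a, b, c, d = 8, 1, 12, 20

print(odds_ratio(a, b, c, d))       # crude estimate, inflated by the sparse cell
print(odds_ratio(a, b, c, d, 0.5))  # corrected estimate, pulled toward the null
```

The same shrinkage intuition, applied per parameter rather than per cell, is what the hierarchical-regression methods formalize.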

The 87Sr/86Sr values based on brachiopods and conodonts define a nearly continuous record for the Late Permian and Triassic intervals. Minor gaps in measurements exist only for the uppermost Brahmanian, lower part of the Upper Olenekian, and Middle Norian, and ...

Recently, it has been claimed [1] that the worldwide climate over the past million years follows a low-dimensional strange attractor. Contrary to that claim, I report here that there is no sign of such an attractor. This holds both for the worldwide climate of the past 1–2 Myr (averaged ...
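
Dimension estimates behind claims of this sort typically rest on the Grassberger-Procaccia correlation integral. The following pure-Python sketch (illustrative only, not the paper's actual analysis) shows the signature of "no low-dimensional attractor": for a noise series, the estimated scaling exponent keeps growing with the embedding dimension instead of saturating at a small value.

```python
import math
import random

def embed(series, m, tau=1):
    """Time-delay embedding of a scalar series into m dimensions."""
    return [tuple(series[i + k * tau] for k in range(m))
            for i in range(len(series) - (m - 1) * tau)]

def correlation_integral(points, r):
    """C(r): fraction of distinct point pairs closer than r."""
    n = len(points)
    close = sum(1 for i in range(n) for j in range(i + 1, n)
                if math.dist(points[i], points[j]) < r)
    return 2 * close / (n * (n - 1))

random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(300)]

# Estimated scaling exponent (slope of log C(r) vs log r) per embedding
# dimension m: for noise it grows with m rather than saturating.
for m in (2, 4, 6):
    pts = embed(noise, m)
    c1, c2 = correlation_integral(pts, 1.0), correlation_integral(pts, 2.0)
    slope = math.log(c2 / c1) / math.log(2.0)
    print(m, round(slope, 2))
```

A genuine low-dimensional attractor would instead make the slope level off at roughly the attractor's dimension as m increases.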

... Again, the standard MCF network-programming technique is applied, and in this case, we use the “costs” of the previous solutions (i.e., achieved in the T × B⊥ plane) to set the weights associated ... (Pepe and Lanari, © 2006 IEEE)
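
The snippet refers to the standard minimum-cost-flow (MCF) formulation used in network-programming approaches to phase unwrapping. As an illustration of that formulation only (not Pepe and Lanari's solver, and with a toy network invented for the example), a minimal successive-shortest-path MCF in pure Python:

```python
def min_cost_flow(n, edges, s, t, flow_target):
    """Successive-shortest-path minimum-cost flow on n nodes.
    edges: (u, v, capacity, cost) tuples. Returns the total cost of
    sending flow_target units from s to t, or None if infeasible."""
    E = []  # residual edges as mutable [u, v, remaining_capacity, cost]
    for u, v, cap, cost in edges:
        E.append([u, v, cap, cost])
        E.append([v, u, 0, -cost])  # paired reverse edge at index i ^ 1
    total_cost, flow = 0, 0
    while flow < flow_target:
        # Bellman-Ford: cheapest augmenting path in the residual graph
        # (handles the negative-cost reverse edges).
        INF = float("inf")
        dist = [INF] * n
        prev_edge = [-1] * n
        dist[s] = 0
        for _ in range(n - 1):
            for i, (u, v, cap, cost) in enumerate(E):
                if cap > 0 and dist[u] + cost < dist[v]:
                    dist[v] = dist[u] + cost
                    prev_edge[v] = i
        if dist[t] == INF:
            return None  # the requested flow cannot be routed
        # Bottleneck capacity along the path, then augment.
        push, v = flow_target - flow, t
        while v != s:
            push = min(push, E[prev_edge[v]][2])
            v = E[prev_edge[v]][0]
        v = t
        while v != s:
            i = prev_edge[v]
            E[i][2] -= push
            E[i ^ 1][2] += push
            v = E[i][0]
        flow += push
        total_cost += push * dist[t]
    return total_cost

# Toy network: route 2 units from node 0 to node 3 at minimum cost.
edges = [(0, 1, 2, 1), (0, 2, 1, 2), (1, 3, 1, 3), (1, 2, 1, 1), (2, 3, 2, 1)]
print(min_cost_flow(4, edges, 0, 3, 2))  # 6: two unit paths of cost 3 each
```

In the unwrapping setting, the edge costs would be the weights derived from previous solutions that the snippet describes.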

Spectral clustering (SC) methods have been successfully applied to many real-world applications. The success of these SC methods is largely based on the manifold assumption, namely, that two nearby data points in the high-density region of a low-dimensional data manifold have the same cluster label. However, such an assumption might not always hold on high-dimensional data. When the data do not exhibit a clear low-dimensional manifold structure (e.g., high-dimensional and sparse data), the clustering performance of SC will be degraded and become even worse than K-means clustering. In this paper, motivated by the observation that the true cluster assignment matrix for high-dimensional data can always be embedded in a linear space spanned by the data, we propose the spectral embedded clustering (SEC) framework, in which a linearity regularization is explicitly added into the objective function of SC methods. More importantly, the proposed SEC framework can naturally deal with out-of-sample data. We also present a new Laplacian matrix constructed from a local regression of each pattern and incorporate it into our SEC framework to capture both local and global discriminative information for clustering. Comprehensive experiments on eight real-world high-dimensional datasets demonstrate the effectiveness and advantages of our SEC framework over existing SC methods and K-means-based clustering methods. Our SEC framework significantly outperforms SC using the Nyström algorithm on unseen data.
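
The spectral step all SC variants share can be sketched in miniature (this illustrates plain spectral bisection on a toy graph, not the SEC framework itself, which adds the linearity regularizer and the local-regression Laplacian described above): take the graph Laplacian, extract its second-smallest eigenvector (the Fiedler vector) by power iteration, and split on its sign.

```python
import random

# Two triangles {0,1,2} and {3,4,5} joined by a single bridge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n = 6

# Graph Laplacian L = D - A as a dense list of lists.
L = [[0.0] * n for _ in range(n)]
for u, v in edges:
    L[u][u] += 1; L[v][v] += 1
    L[u][v] -= 1; L[v][u] -= 1

def matvec(M, x):
    return [sum(M[i][j] * x[j] for j in range(len(x))) for i in range(len(M))]

# Fiedler vector via power iteration on (c*I - L), projecting out the
# all-ones eigenvector (eigenvalue 0 of L) at every step.
c = 2 * max(L[i][i] for i in range(n))
random.seed(1)
x = [random.random() for _ in range(n)]
for _ in range(500):
    mean = sum(x) / n
    x = [xi - mean for xi in x]              # deflate the ones vector
    y = [c * x[i] - v for i, v in enumerate(matvec(L, x))]
    norm = sum(v * v for v in y) ** 0.5
    x = [v / norm for v in y]

labels = [0 if v < 0 else 1 for v in x]
print(labels)  # the sign pattern separates the two triangles
```

SEC's contribution, in these terms, is to regularize the embedding x so that it is (approximately) a linear function of the input features, which is what makes out-of-sample assignment natural.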

Techniques that exploit knowledge of distributional similarity between words have been proposed in many areas of Natural Language Processing. For example, in language modeling, the sparse data problem can be alleviated by estimating the probabilities of unseen co-occurrences ...
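
The similarity-based estimation idea can be illustrated on a toy corpus (an illustrative sketch under simplified assumptions, not any specific published smoothing scheme): an unseen bigram receives a nonzero probability by averaging the maximum-likelihood estimates of history words that are distributionally similar to the observed one.

```python
import math
from collections import Counter

# Toy corpus: bigram counts are sparse, so ("quick", "dog") is unseen.
corpus = ("the quick fox . the quick cat . the fast dog . "
          "the fast fox . the quick fox .").split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def mle(w1, w2):
    """Maximum-likelihood estimate of P(w2 | w1): zero if unseen."""
    return bigrams[(w1, w2)] / unigrams[w1]

def similarity(w1, w1p):
    """Cosine similarity of right-context count vectors -- one simple
    notion of distributional similarity between history words."""
    v1 = {w: c for (a, w), c in bigrams.items() if a == w1}
    v2 = {w: c for (a, w), c in bigrams.items() if a == w1p}
    dot = sum(v1[w] * v2.get(w, 0) for w in v1)
    n1 = math.sqrt(sum(c * c for c in v1.values()))
    n2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def similarity_smoothed(w1, w2):
    """Estimate P(w2 | w1) as a similarity-weighted average of the
    MLE estimates for words distributionally similar to w1."""
    sims = {w1p: similarity(w1, w1p) for w1p in unigrams if w1p != w1}
    total = sum(sims.values())
    return sum(s * mle(w1p, w2) for w1p, s in sims.items()) / total

print(mle("quick", "dog"))                            # 0.0 -- unseen bigram
print(round(similarity_smoothed("quick", "dog"), 3))  # positive, via "fast"
```

Here "quick" and "fast" share the right-context word "fox", so the co-occurrence evidence for "fast dog" is transferred to the unseen pair, which is the alleviation of the sparse-data problem the abstract describes.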