Sparse Data Research Papers - Academia.edu
2008, International Journal of Climatology
Spatial climate data sets of 1971-2000 mean monthly precipitation and minimum and maximum temperature were developed for the conterminous United States. These 30-arcsec (∼800-m) grids are the official spatial climate data sets of the U.S. Department of Agriculture. The PRISM (Parameter-elevation Relationships on Independent Slopes Model) interpolation method was used to develop data sets that reflected, as closely as possible, the current state of knowledge of spatial climate patterns in the United States. PRISM calculates a climate-elevation regression for each digital elevation model (DEM) grid cell, and stations entering the regression are assigned weights based primarily on the physiographic similarity of the station to the grid cell. Factors considered are location, elevation, coastal proximity, topographic facet orientation, vertical atmospheric layer, topographic position, and orographic effectiveness of the terrain. Surface stations used in the analysis numbered nearly 13 000 for precipitation and 10 000 for temperature. Station data were spatially quality controlled, and short-period-of-record averages adjusted to better reflect the 1971-2000 period.
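For readers who want to see the regression step in miniature: the sketch below fits a weighted climate-elevation regression for a single grid cell, in the spirit of PRISM but with invented station values and a simplistic distance-based weight standing in for PRISM's physiographic-similarity weighting.

```python
import numpy as np

def predict_cell_climate(stn_elev, stn_value, stn_weight, cell_elev):
    """Weighted linear climate-elevation regression for one grid cell.

    stn_elev, stn_value, stn_weight: 1-D arrays for nearby stations.
    Returns the regression prediction at the cell's DEM elevation.
    """
    X = np.column_stack([np.ones_like(stn_elev), stn_elev])
    W = np.diag(stn_weight)
    # Weighted least squares: beta = (X' W X)^-1 X' W y
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ stn_value)
    return beta[0] + beta[1] * cell_elev

# Hypothetical stations; weights here come from horizontal distance only
# (PRISM also weights by coastal proximity, facet, layer, etc.).
elev = np.array([250.0, 900.0, 1400.0, 600.0])   # m
precip = np.array([55.0, 95.0, 140.0, 70.0])     # mm, monthly mean
dist = np.array([5.0, 12.0, 20.0, 8.0])          # km to grid cell
w = 1.0 / (1.0 + dist**2)
print(predict_cell_climate(elev, precip, w, cell_elev=1100.0))
```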
1983, Pattern Analysis and Machine …
2010, Reviews of Geophysics
Precipitation downscaling improves the coarse resolution and poor representation of precipitation in global climate models, and helps end users to assess the likely hydrological impacts of climate change. This paper integrates perspectives from meteorologists, climatologists, statisticians and hydrologists, to identify generic end user (in particular impact modeler) needs, and to discuss downscaling capabilities and gaps. End users need a reliable representation of precipitation intensities, temporal and spatial variability, as well as physical consistency, independent of region and season. In addition to presenting dynamical downscaling, we review perfect prognosis statistical downscaling ...
2003, IEEE Transactions on …
2006, Trends in Cognitive Sciences
Inductive inference allows humans to make powerful generalizations from sparse data when learning about word meanings, unobserved properties, causal relationships, and many other aspects of the world. Traditional accounts of induction emphasize either the power of statistical learning, or the importance of strong constraints from structured domain knowledge, intuitive theories or schemas. We argue that both components are necessary to explain the nature, use and acquisition of human knowledge, and we introduce a theory-based Bayesian framework for modeling inductive learning and reasoning as statistical inferences over structured knowledge representations.
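A toy illustration of Bayesian generalization from sparse positive examples (the "size principle"); the hypothesis space, prior, and data below are invented for illustration and are far simpler than the structured theories the article discusses.

```python
# Generalizing the concept behind the examples {16, 8, 2} over a toy
# hypothesis space. The size principle: smaller hypotheses that still
# contain the data gain exponentially more weight as examples accrue.
hypotheses = {
    "powers of two": {1, 2, 4, 8, 16, 32, 64},
    "even numbers": set(range(2, 101, 2)),
    "numbers < 20": set(range(1, 20)),
    "all numbers 1-100": set(range(1, 101)),
}
prior = {h: 1.0 / len(hypotheses) for h in hypotheses}
data = [16, 8, 2]

posterior = {}
for h, extension in hypotheses.items():
    if all(x in extension for x in data):
        likelihood = (1.0 / len(extension)) ** len(data)
    else:
        likelihood = 0.0
    posterior[h] = prior[h] * likelihood
z = sum(posterior.values())
posterior = {h: p / z for h, p in posterior.items()}
print(posterior)   # "powers of two" dominates after only three examples
```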
2008, Journal of Clinical Densitometry
The International Society for Clinical Densitometry Official Positions on reporting of densitometry results in children represent an effort to consolidate opinions to assist healthcare providers determine which skeletal sites should be assessed, which adjustments should be made in these assessments, appropriate pediatric reference databases, and elements to include in a dual energy X-ray absorptiometry (DXA) report. Skeletal sites recommended for assessment are the lumbar spine and total body less head, the latter being valuable as it provides information on soft tissue, as well as bone. Interpretation of DXA findings in children with growth or maturational delay requires special consideration; adjustments are required to prevent erroneous interpretation. Normative databases used as a reference should be based on a large sample of healthy children that characterizes the variability in bone measures relative to gender, age, and race/ethnicity, and should be specific for each manufacturer and model of densitometer and software. Pediatric DXA reports should provide relevant demographic and health information, technical details of the scan, Z-scores, and should not include T-scores. The rationale and evidence for development of the Official Positions are provided. Given the sparse data currently available in many of these areas, it is likely that these positions will change over time as new data become available.
2002, Data Mining and Knowledge Discovery
Web usage mining, possibly used in conjunction with standard approaches to personalization such as collaborative filtering, can help address some of the shortcomings of these techniques, including reliance on subjective user ratings, lack of scalability, and poor performance in the face of high-dimensional and sparse data. However, the discovery of patterns from usage data by itself is not sufficient for performing the personalization tasks. The critical step is the effective derivation of good quality and useful (i.e., actionable) "aggregate usage profiles" from these patterns. In this paper we present and experimentally evaluate two techniques, based on clustering of user transactions and clustering of pageviews, in order to discover overlapping aggregate profiles that can be effectively used by recommender systems for real-time Web personalization. We evaluate these techniques both in terms of the quality of the individual profiles generated, as well as in the context of providing recommendations as an integrated part of a personalization engine. In particular, our results indicate that using the generated aggregate profiles, we can achieve effective personalization at early stages of users' visits to a site, based only on anonymous clickstream data and without the benefit of explicit input by these users or deeper knowledge about them.
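A minimal sketch of the profile-derivation idea: cluster session-pageview vectors and keep, for each cluster, the pages whose mean weight clears a threshold. The data, the use of scikit-learn k-means, and the 0.5 threshold are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy transaction matrix: rows = user sessions, columns = pageviews (binary).
sessions = np.array([
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [1, 0, 0, 0, 1],
], dtype=float)
pages = ["home", "products", "support", "faq", "checkout"]

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(sessions)

# An aggregate usage profile: pages whose mean weight within a cluster
# exceeds a significance threshold, kept with that weight.
threshold = 0.5
for c in range(km.n_clusters):
    members = sessions[km.labels_ == c]
    weights = members.mean(axis=0)
    profile = {p: round(w, 2) for p, w in zip(pages, weights) if w >= threshold}
    print(f"profile {c}: {profile}")
```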
2000, IEEE Transactions on Signal Processing
The aim of this paper is to propose diffusion strategies for distributed estimation over adaptive networks, assuming the presence of spatially correlated measurements distributed according to a Gaussian Markov random field (GMRF) model. The proposed methods incorporate prior information about the statistical dependency among observations, while at the same time processing data in real-time and in a fully decentralized manner. A detailed mean-square analysis is carried out in order to prove stability and evaluate the steady-state performance of the proposed strategies. Finally, we also illustrate how the proposed techniques can be easily extended in order to incorporate thresholding operators for sparsity recovery applications. Numerical results show the potential advantages of using such techniques for distributed learning in adaptive networks deployed over GMRF.
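A bare-bones adapt-then-combine diffusion LMS loop, to make the "adaptation plus combination" structure concrete; the ring network, uniform combination weights, and data model are synthetic, and the GMRF-aware weighting of the paper is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, T = 10, 3, 2000            # nodes, parameter length, iterations
w_true = rng.standard_normal(M)

# Ring topology with self-loops; rows of A are the combination weights.
A = np.zeros((N, N))
for k in range(N):
    for n in (k - 1, k, (k + 1) % N):
        A[k, n] = 1.0
A /= A.sum(axis=1, keepdims=True)

mu = 0.01
w = np.zeros((N, M))             # local estimates
for _ in range(T):
    psi = np.empty_like(w)
    for k in range(N):           # adaptation step: local LMS update
        u = rng.standard_normal(M)
        d = u @ w_true + 0.1 * rng.standard_normal()
        psi[k] = w[k] + mu * u * (d - u @ w[k])
    w = A @ psi                  # combination step: diffuse with neighbors

print(np.linalg.norm(w.mean(axis=0) - w_true))   # small steady-state error
```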
2006, American Journal of Medicine
There are sparse data on the frequency of venous thromboembolism in patients with various types of cancer. We sought to determine the incidence and relative risk of venous thromboembolism, pulmonary embolism, and deep venous thrombosis in patients with malignancies.
2006, Journal of Clinical Epidemiology
2008, BMJ
Objective To analyse the benefits and harms of statins in patients with chronic kidney disease (pre-dialysis, dialysis, and transplant populations). Design Meta-analysis. Data sources Cochrane Central Register of Controlled Trials, Medline, Embase, and Renal Health Library (July 2006). Study selection Randomised and quasi-randomised controlled trials of statins compared with placebo or other statins in chronic kidney disease. Data extraction and analysis Two reviewers independently assessed trials for inclusion, extracted data, and assessed trial quality. Differences were resolved by consensus. Treatment effects were summarised as relative risks or weighted mean differences with 95% confidence intervals by using a random effects model.
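For reference, a minimal random-effects pooling routine in the DerSimonian-Laird style, run on made-up log relative risks; the actual review pooled real trial data and reported considerably more than this.

```python
import numpy as np

def dersimonian_laird(log_rr, se):
    """Pool study-level log relative risks with a random-effects model."""
    log_rr, se = np.asarray(log_rr, float), np.asarray(se, float)
    w_fixed = 1.0 / se**2
    theta_fixed = np.sum(w_fixed * log_rr) / np.sum(w_fixed)
    q = np.sum(w_fixed * (log_rr - theta_fixed) ** 2)
    df = len(log_rr) - 1
    c = np.sum(w_fixed) - np.sum(w_fixed**2) / np.sum(w_fixed)
    tau2 = max(0.0, (q - df) / c)            # between-study variance
    w = 1.0 / (se**2 + tau2)
    theta = np.sum(w * log_rr) / np.sum(w)
    se_theta = np.sqrt(1.0 / np.sum(w))
    ci = np.exp([theta - 1.96 * se_theta, theta + 1.96 * se_theta])
    return np.exp(theta), ci, tau2

# Hypothetical trials (log RR and its standard error):
print(dersimonian_laird([-0.22, -0.10, -0.35, 0.05], [0.12, 0.20, 0.15, 0.25]))
```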
1990, Environmental Management
An extensive review of the published literature identified more than 150 case studies in which some aspect of resilience in freshwater systems was reported. Approximately 79% of systems studied were lotic and the remainder lentic. Most of the stressor types were chemical with DDT (N = 29) and rotenone (N = 15) the most common. The most common nonchemical stressors were logging activity (N = 16), flooding (N = 8), dredging (N = 3), and drought (N = 7).
2006, The Annals of Statistics
The use of principal component methods to analyze functional data is appropriate in a wide range of different settings. In studies of "functional data analysis," it has often been assumed that a sample of random functions is observed precisely, in the continuum and without noise. While this has been the traditional setting for functional data analysis, in the context of longitudinal data analysis a random function typically represents a patient, or subject, who is observed at only a small number of randomly distributed points, with nonnegligible measurement error. Nevertheless, essentially the same methods can be used in both these cases, as well as in the vast number of settings that lie between them. How is performance affected by the sampling plan? In this paper we answer that question. We show that if there is a sample of n functions, or subjects, then estimation of eigenvalues is a semiparametric problem, with root-n consistent estimators, even if only a few observations are made of each function, and if each observation is encumbered by noise. However, estimation of eigenfunctions becomes a nonparametric problem when observations are sparse. The optimal convergence rates in this case are those which pertain to more familiar function-estimation settings. We also describe the effects of sampling at regularly spaced points, as opposed to random points. In particular, it is shown that there are often advantages in sampling randomly. However, even in the case of noisy data there is a threshold sampling rate (depending on the number of functions treated) above which the rate of sampling (either randomly or regularly) has negligible impact on estimator performance, no matter whether eigenfunctions or eigenvalues are being estimated.
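The sparse, noisy observation model behind these results, written out in generic FPCA notation (the paper's own symbols may differ):

```latex
% Sparse/noisy functional-data model: subject i is observed only at a few
% random times T_{ij}, with measurement error.
Y_{ij} = X_i(T_{ij}) + \varepsilon_{ij}, \qquad
X_i(t) = \mu(t) + \sum_{k \ge 1} \xi_{ik}\,\psi_k(t), \qquad
\operatorname{Cov}\{X_i(s), X_i(t)\} = \sum_{k \ge 1} \theta_k\,\psi_k(s)\,\psi_k(t).
% The abstract's dichotomy: the eigenvalues \theta_k admit root-n consistent
% estimators even with few noisy observations per subject, whereas the
% eigenfunctions \psi_k can only be estimated at nonparametric rates.
```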
2003, Gastroenterology
Sparse data exist about the prognosis of childhood constipation and its possible persistence into adulthood. Methods: A total of 418 constipated patients older than 5 years at intake (279 boys; median age, 8.0 yr) participated in studies evaluating therapeutic modalities for constipation. All children subsequently were enrolled in this follow-up study with prospective data collection after an initial 6-week intensive treatment protocol, at 6 months, and thereafter annually, using a standardized questionnaire. Results: Follow-up was obtained in more than 95% of the children. The median duration of the follow-up period was 5 years (range, 1-8 yr). The cumulative percentage of children who were treated successfully during follow-up was 60% at 1 year, increasing to 80% at 8 years. Successful treatment was more frequent in children without encopresis and in children with an age of onset of defecation difficulty older than 4 years. In the group of children treated successfully, 50% experienced at least one period of relapse. Relapses occurred more frequently in boys than in girls (relative risk 1.73; 95% confidence interval, 1.15-2.62). In the subset of children aged 16 years and older, constipation still was present in 30%. Conclusions: After intensive initial medical and behavioral treatment, 60% of all children referred to a tertiary medical center for chronic constipation were treated successfully at 1 year of follow-up. One third of the children followed-up beyond puberty continued to have severe complaints of constipation. This finding contradicts the general belief that childhood constipation gradually disappears before or during puberty.
2012
This work introduces the use of compressed sensing (CS) algorithms for data compression in wireless sensors to address the energy and telemetry bandwidth constraints common to wireless sensor nodes. Circuit models of both analog and digital implementations of the CS system are presented that enable analysis of the power/performance costs associated with the design space for any potential CS application, including analog-to-information converters (AIC). Results of the analysis show that a digital implementation is significantly more energy-efficient for the wireless sensor space where signals require high gain and medium to high resolutions. The resulting circuit architecture is implemented in a 90 nm CMOS process. Measured power results correlate well with the circuit models, and the test system demonstrates continuous, on-the-fly data processing, resulting in more than an order of magnitude compression for electroencephalography (EEG) signals while consuming only 1.9 µW at 0.6 V for sub-20 kS/s sampling rates. The design and measurement of the proposed architecture is presented in the context of medical sensors; however, the tools and insights are generally applicable to any sparse data acquisition.
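A software-only sketch of the CS encode/decode path this kind of system relies on: random ±1 measurements (cheap to apply on the node) and greedy reconstruction off the node. Signal length, measurement count, and sparsity below are arbitrary and unrelated to the reported hardware figures.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, k = 256, 64, 5                       # signal length, measurements, sparsity

x = np.zeros(n)                            # synthetic k-sparse signal
x[rng.choice(n, k, replace=False)] = rng.standard_normal(k)

Phi = rng.choice([-1.0, 1.0], size=(m, n)) / np.sqrt(m)   # ±1 sensing matrix
y = Phi @ x                                # "on-sensor" compression: 256 -> 64 values

def omp(Phi, y, k):
    """Orthogonal matching pursuit: greedy sparse reconstruction."""
    residual, support = y.copy(), []
    for _ in range(k):
        corr = np.abs(Phi.T @ residual)
        corr[support] = 0.0
        support.append(int(np.argmax(corr)))
        coef, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        residual = y - Phi[:, support] @ coef
    x_hat = np.zeros(Phi.shape[1])
    x_hat[support] = coef
    return x_hat

x_hat = omp(Phi, y, k)
print(np.linalg.norm(x - x_hat) / np.linalg.norm(x))   # near-zero recovery error
```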
2000, Antimicrobial Agents and Chemotherapy
The objective of this study was to conduct a prospective population pharmacokinetic and pharmacodynamic evaluation of lumefantrine during blinded comparisons of artemether-lumefantrine treatment regimens in uncomplicated multidrug-resistant falciparum malaria. Three combination regimens containing an average adult lumefantrine dose of 1,920 mg over 3 days (four doses) (regimen A) or 2,780 mg over 3 or 5 days (six doses) (regimen B or C, respectively) were given to 266 Thai patients. Detailed observations were obtained for 51 hospitalized adults, and sparse data were collected for 215 patients of all ages in a community setting. The population absorption half-life of lumefantrine was 4.5 h. The model-based median (5th and 95th percentiles) peak plasma lumefantrine concentrations were 6.2 (0.25 and 14.8) µg/ml after regimen A, 9.0 (1.1 and 19.8) µg/ml after regimen B, and 8 (1.4 and 17.4) µg/ml after regimen C. During acute malaria, there was marked variability in the fraction of drug absorbed by patients (coefficient of variation, 150%). The fraction increased considerably and variability fell with clinical recovery, largely because food intake was resumed; taking a normal meal close to drug administration increased oral bioavailability by 108% (90% confidence interval, 64 to 164) (P < 0.0001). The higher-dose regimens (B and C) gave 60 and 100% higher areas under the concentration-time curves (AUC), respectively, and thus longer durations for which plasma lumefantrine concentrations exceeded the putative in vivo MIC of 280 ng/ml (median for regimen B, 252 h; that for regimen C, 298 h; that for regimen A, 204 h [P < 0.0001]) and higher cure rates. Lumefantrine oral bioavailability is very dependent on food and is consequently poor in acute malaria but improves markedly with recovery. The high cure rates with the two six-dose regimens resulted from increased AUC and increased time at which lumefantrine concentrations were above the in vivo MIC.
1986, Proceedings of the National Academy of Sciences
Many problems in early vision can be formulated in terms of minimizing a cost function. Examples are shape from shading, edge detection, motion analysis, structure from motion, and surface interpolation. As shown by Poggio and Koch [Poggio, T. & Koch, C. (1985) Proc. R. Soc. London, Ser. B 226, 303-323], quadratic variational problems, an important subset of early vision tasks, can be "solved" by linear, analog electrical, or chemical networks. However, in the presence of discontinuities, the cost function is nonquadratic, raising the question of designing efficient algorithms for computing the optimal solution. Recently, Hopfield and Tank [Hopfield, J. J. & Tank, D. W. (1985) Biol. Cybern. 52, 141-152] have shown that networks of nonlinear analog "neurons" can be effective in computing the solution of optimization problems. We show how these networks can be generalized to solve the nonconvex energy functionals of early vision. We illustrate this approach by implementing a specific analog network, solving the problem of reconstructing a smooth surface from sparse data while preserving its discontinuities. These results suggest a novel computational strategy for solving early vision problems in both biological and real-time artificial vision systems. This study addresses the use of simple analog networks to implement and solve problems in early vision, such as computing depth from two stereoscopic images, reconstructing and smoothing images from sparsely sampled data, and computing motion. Within the last years, computational studies have provided promising theories of the computations necessary for early vision (for partial reviews, see refs. 1-5). A number of early vision tasks can be described within the framework of standard regularization theory (5). Standard regularization analysis can be used to solve these problems in terms of quadratic energy functionals that must be minimized. Previous work by Poggio and Koch (6) showed how to design linear, analog networks for solving regularization problems with quadratic energy functions. The domain of applicability of standard regularization theory is limited, however, by the convexity of the energy functions, which makes it impossible to deal with problems involving true discontinuities. Such problems can be described by nonconvex energy functions involving binary line processes (7-10). More recently Marroquin (11) has proposed an approach to early vision based on the use of Markov random-field models and Bayes estimation theory (11, 33). We will show how these algorithms map naturally onto very simple resistive networks. There has been considerable interest in the computational properties and capabilities of networks of simple, neuron-like elements (12-15). Recently, Hopfield and Tank (16) have shown that analog neuronal networks can provide fast, next-to-optimal solutions to a well-characterized but difficult optimization problem, the "traveling salesman problem." In this paper we show that networks of simple, analog, or hybrid processing elements can be used to give fast solutions to a number of early vision problems.
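A schematic 1-D version of the kind of nonconvex energy with a binary line process that such networks are built to minimize (the paper's exact functional and network mapping differ):

```latex
% Schematic 1-D surface reconstruction energy with a binary line process l_i:
E(f, l) = \sum_{i \in \mathcal{D}} (f_i - d_i)^2
        + \lambda \sum_i (f_{i+1} - f_i)^2\,(1 - l_i)
        + \alpha \sum_i l_i .
% d_i are the sparse data (the first sum runs only over observed sites);
% setting l_i = 1 switches off smoothing across a discontinuity at cost
% \alpha, and the coupling between f and the binary l makes E nonconvex.
```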
2000, American Journal of Epidemiology
Conditional logistic regression was developed to avoid "sparse-data" biases that can arise in ordinary logistic regression analysis. Nonetheless, it is a large-sample method that can exhibit considerable bias when certain types of matched sets are infrequent or when the model contains too many parameters. Sparse-data bias can cause misleading inferences about confounding, effect modification, dose response, and induction periods, and can interact with other biases. In this paper, the authors describe these problems in the context of matched case-control analysis and provide examples from a study of electrical wiring and childhood leukemia and a study of diet and glioma. The same problems can arise in any likelihood-based analysis, including ordinary logistic regression. The problems can be detected by careful inspection of data and by examining the sensitivity of estimates to category boundaries, variables in the model, and transformations of those variables. One can also apply various bias corrections or turn to methods less sensitive to sparse data than conditional likelihood, such as Bayesian and empirical-Bayes (hierarchical regression) methods.
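A small simulation of the flavor of the problem: with rare outcomes and small arms, the crude odds ratio is unstable or undefined, and a Haldane-Anscombe 0.5 continuity correction is one simple fallback (the paper's preferred remedies are Bayesian and hierarchical-regression methods). All numbers below are invented.

```python
import numpy as np

rng = np.random.default_rng(7)
true_or, n, reps = 2.0, 30, 20000
p0 = 0.05                                    # rare outcome -> sparse cells
p1 = true_or * p0 / (1 - p0 + true_or * p0)  # so that odds(p1)/odds(p0) = 2

raw, corrected = [], []
for _ in range(reps):
    a, c = rng.binomial(n, p1), rng.binomial(n, p0)   # exposed / unexposed cases
    b, d = n - a, n - c
    if min(a, b, c, d) > 0:                  # crude OR undefined with a zero cell
        raw.append(np.log(a * d / (b * c)))
    # Haldane-Anscombe: add 0.5 to every cell so the estimate always exists
    corrected.append(np.log((a + .5) * (d + .5) / ((b + .5) * (c + .5))))

print("true log OR               :", round(np.log(true_or), 2))
print("crude (zero cells dropped):", round(np.mean(raw), 2), "+/-", round(np.std(raw), 2))
print("0.5-corrected             :", round(np.mean(corrected), 2), "+/-", round(np.std(corrected), 2))
```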
2009, Mathematical Geosciences
Building a 3D geological model from field and subsurface data is a typical task in geological studies involving natural resource evaluation and hazard assessment. However, there is quite often a gap between research papers presenting case studies or specific innovations in 3D modeling and the objectives of a typical class in 3D structural modeling, as more and more is implemented at universities. In this paper, we present general procedures and guidelines to effectively build a structural model made of faults and horizons from typical sparse data. Then we describe a typical 3D structural modeling workflow based on triangulated surfaces. Our goal is not to replace software user guides, but to provide key concepts, principles, and procedures to be applied during geomodeling tasks, with a specific focus on quality control.
2003, … et Cosmochimica Acta
The 87Sr/86Sr values based on brachiopods and conodonts define a nearly continuous record for the Late Permian and Triassic intervals. Minor gaps in measurements exist only for the uppermost Brahmanian, lower part of the Upper Olenekian, and Middle Norian, and ...
1986, Nature
Recently, it has been claimed 1 that the worldwide climate over the past million years follows a low-dimensional strange attractor. Contrary to that claim, I report here that there is no sign of such an attractor. This holds both for the worldwide climate of the past 12 Myr (averaged ...
2006
... Again, the standard MCF network-programming technique is applied, and in this case, we use the costs of the previous solutions (i.e., achieved in the T × B⊥ plane) to set the weights associated ...
2011, IEEE Transactions on Neural Networks
Spectral clustering (SC) methods have been successfully applied to many real-world applications. The success of these SC methods is largely based on the manifold assumption, namely, that two nearby data points in the high-density region of a low-dimensional data manifold have the same cluster label. However, such an assumption might not always hold on high-dimensional data. When the data do not exhibit a clear low-dimensional manifold structure (e.g., high-dimensional and sparse data), the clustering performance of SC will be degraded and become even worse than K-means clustering. In this paper, motivated by the observation that the true cluster assignment matrix for high-dimensional data can always be embedded in a linear space spanned by the data, we propose the spectral embedded clustering (SEC) framework, in which a linearity regularization is explicitly added into the objective function of SC methods. More importantly, the proposed SEC framework can naturally deal with out-of-sample data. We also present a new Laplacian matrix constructed from a local regression of each pattern and incorporate it into our SEC framework to capture both local and global discriminative information for clustering. Comprehensive experiments on eight real-world high-dimensional datasets demonstrate the effectiveness and advantages of our SEC framework over existing SC methods and K-means-based clustering methods. Our SEC framework significantly outperforms SC using the Nyström algorithm on unseen data.
2002, Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '02
Relational Markov models (RMMs) are a generalization of Markov models where states can be of different types, with each type described by a different set of variables. The domain of each variable can be hierarchically structured, and shrinkage is carried out over the cross product of these hierarchies. RMMs make effective learning possible in domains with very large and heterogeneous state spaces, given only sparse data. We apply them to modeling the behavior of web site users, improving prediction in our PROTEUS architecture for personalizing web sites. We present experiments on an e-commerce and an academic web site showing that RMMs are substantially more accurate than alternative methods, and make good predictions even when applied to previously-unvisited parts of the site.
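A tiny illustration of shrinkage across a state-abstraction hierarchy, the mechanism that lets such models cope with sparse data; the two-level hierarchy, counts, and mixing rule below are invented rather than the paper's cross-product scheme.

```python
from collections import Counter

# Transition counts at two abstraction levels of a product page:
# the exact page (sparse counts) and its category (denser counts).
leaf_counts   = Counter({("page/tv-4k-55", "add_to_cart"): 1})
leaf_totals   = Counter({"page/tv-4k-55": 2})
parent_counts = Counter({("cat/televisions", "add_to_cart"): 40,
                         ("cat/televisions", "back_to_search"): 60})
parent_totals = Counter({"cat/televisions": 100})

def shrunk_prob(next_state, leaf, parent):
    """Blend leaf and parent MLEs; weight the leaf by how much data it has."""
    n_leaf = leaf_totals[leaf]
    lam = n_leaf / (n_leaf + 5.0)            # 5.0 sets the shrinkage strength
    p_leaf = leaf_counts[(leaf, next_state)] / n_leaf if n_leaf else 0.0
    p_parent = parent_counts[(parent, next_state)] / parent_totals[parent]
    return lam * p_leaf + (1 - lam) * p_parent

print(shrunk_prob("add_to_cart", "page/tv-4k-55", "cat/televisions"))
```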
2001
A model rendered at real-time rates (approximately half the performance of the standard per-vertex lighting model on an NVIDIA GeForce 3) with several BRDFs approximated using the technique in this paper. From left to right: satin (anisotropic Poulin-Fournier model), krylon blue, garnet red, cayman, mystique (Cornell measured data), leather, and velvet (CUReT measured data).
1998, Toxicology and Applied Pharmacology
It is well known that young animals are generally more sensitive to lethal effects of cholinesterase-inhibiting pesticides, but there are sparse data comparing less-than-lethal effects. We compared the behavioral and biochemical toxicity of chlorpyrifos in young (postnatal Day 17; PND17) and adult (about 70 days old) rats. First, we established that the magnitude of the age-related differences decreased as the rat matures. Next, we evaluated the time course of a single oral dose of chlorpyrifos in adult and PND17 male and female rats. Behavioral changes were assessed using a functional observational battery (with age-appropriate modifications for pre-weanling rats) and an evaluation of motor activity. Cholinesterase (ChE) activity was measured in brain and peripheral tissues and muscarinic receptor binding assays were conducted on selected tissues. Rats received either vehicle (corn oil) or chlorpyrifos (adult dose: 80 mg/kg; PND17 dose: 15 mg/kg); these doses were equally effective in inhibiting ChE. The rats were tested, and tissues were then taken at 1, 2, 3.5, 6.5, 24, 72, 168, or 336 h after dosing. In adult rats, peak behavioral changes and ChE inhibition occurred in males at 3.5 h after dosing, while in females the onset of functional changes was sooner, the time course was more protracted and recovery was slower. In PND17 rats, maximal behavioral effects and ChE inhibition occurred at 6.5 h after dosing, and there were no gender-related differences. Behavioral changes showed partial to full recovery at 24 to 72 h, whereas ChE inhibition recovered markedly slower. Blood and brain ChE activity in young rats had nearly recovered by 1 week after dosing, whereas brain ChE in adults had not recovered at 2 weeks. Muscarinic-receptor binding assays revealed apparent down-regulation in some brain areas, mostly at 24 and 72 h. PND17 rats generally showed more receptor down-regulation than adults, whereas only adult female rats showed receptor changes in striatal tissue that persisted for 2 weeks. Thus, compared to adults (1) PND17 rats show similar behavioral changes and ChE inhibition although at a five-fold lower dose; (2) the onset of maximal effects is somewhat delayed in the young rats; (3) ChE activity tended to recover more quickly in the young rats; (4) young rats appear to have more extensive muscarinic receptor down-regulation, and (5) young rats show no gender-related differences.
2001
Gap models are perhaps the most widely used class of individual-based tree models used in ecology and climate change research. However, most gap models emphasize, in terms of process detail, computer code, and validation effort, tree growth with little attention to the simulation of plant death or mortality. Mortality algorithms have been mostly limited to general relationships because of sparse data on the causal mechanisms of mortality. If gap models are to be used to explore community dynamics under changing climates, the limitations and shortcomings of these mortality algorithms must be identified and the simulation of mortality must be improved. In this paper, we review the treatment of mortality in gap models, evaluate the relationships used to represent mortality in the current generation of gap models, and then assess the prospects for making improvements, especially for applications involving global climate change. Three needs are identified to improve mortality simulations in gap models: (1) process-based empirical analyses are needed to create more climate-sensitive stochastic mortality functions, (2) fundamental research is required to quantify the biophysical relationships between mortality and plant dynamics, and (3) extensive field data are needed to quantify, parameterize, and validate existing and future gap model mortality functions.
2000
Web usage mining, possibly used in conjunction with standard approaches to personalization such as collaborative filtering, can help address some of the shortcomings of these techniques, including reliance on subjective user ratings, lack of scalability, and poor performance in the face of high-dimensional and sparse data. However, the discovery of patterns from usage data by itself is not sufficient for performing the personalization tasks. The critical step is the effective derivation of good quality and useful (i.e., actionable) "aggregate usage profiles" from these patterns. In this paper we present and experimentally evaluate two techniques, based on clustering of user transactions and clustering of pageviews, in order to discover overlapping aggregate profiles that can be effectively used by recommender systems for real-time personalization. We evaluate these techniques both in terms of the quality of the individual profiles generated, as well as in the context of providing recommendations as an integrated part of a personalization engine.
2008, Proceedings of the 2008 ACM conference on Recommender systems - RecSys '08
Collaborative Filtering is one of the most widely used approaches in recommendation systems which predicts user preferences by learning past user-item relationships. In recent years, item-oriented collaborative filtering methods came into prominence as they are more scalable compared to user-oriented methods. Item-oriented methods discover item-item relationships from the training data and use these relations to compute predictions. In this paper, we propose a novel item-oriented algorithm, Random Walk Recommender, that first infers transition probabilities between items based on their similarities and models finite length random walks on the item space to compute predictions. This method is especially useful when training data is less than plentiful, namely when typical similarity measures fail to capture actual relationships between items. Aside from the proposed prediction algorithm, the final transition probability matrix computed in one of the intermediate steps can be used as an item similarity matrix in typical item-oriented approaches. Thus, this paper suggests a method to enhance similarity matrices under sparse data as well. Experiments on MovieLens data show that Random Walk Recommender algorithm outperforms two other item-oriented methods in different sparsity levels while having the best performance difference in sparse datasets.
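A bare-bones rendition of the random-walk idea: row-normalize an item-item similarity matrix into transition probabilities, accumulate damped finite-length walks, and score unrated items from a user's known ratings. Matrix values, walk length, and damping below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

# Toy item-item similarity matrix (5 items).
S = np.array([
    [1.0, 0.8, 0.1, 0.0, 0.2],
    [0.8, 1.0, 0.2, 0.1, 0.1],
    [0.1, 0.2, 1.0, 0.7, 0.0],
    [0.0, 0.1, 0.7, 1.0, 0.3],
    [0.2, 0.1, 0.0, 0.3, 1.0],
])
P = S / S.sum(axis=1, keepdims=True)       # row-normalize -> transition matrix

# Finite-length walks: accumulate alpha^k P^k so near neighbors dominate but
# multi-hop links still contribute (useful when direct co-ratings are sparse).
alpha, max_len = 0.6, 3
W = np.zeros_like(P)
Pk = np.eye(len(P))
for _ in range(max_len):
    Pk = Pk @ P
    W = alpha * (W + Pk) if False else W + (alpha ** (_ + 1)) * Pk

user_ratings = np.array([5.0, 0.0, 0.0, 4.0, 0.0])   # 0 = unrated
scores = user_ratings @ W
scores[user_ratings > 0] = -np.inf                   # do not re-recommend
print("recommend item", int(np.argmax(scores)))
```

(The accumulation line simply adds alpha^k P^k for k = 1..3; the walk-length sum W can also serve as a smoothed item-item similarity matrix, as the abstract notes.)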
1995, Computer Speech & Language
In recent years there is much interest in word cooccurrence relations, such as n-grams, verb-object combinations, or cooccurrence within a limited context. This paper discusses how to estimate the probability of cooccurrences that do not occur in the training data. We present a method that makes local analogies between each specific unobserved cooccurrence and other cooccurrences that contain similar words, as determined by an appropriate word similarity metric. Our evaluation suggests that this method performs better than existing smoothing methods, and may provide an alternative to class-based models.
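A compact version of the analogy step: estimate the probability of an unseen verb-object pair by averaging the conditional probabilities of similar verbs, weighted by similarity. Counts and similarity weights below are made up for illustration.

```python
# Estimate P(object | verb) for an unseen pair by analogy with similar verbs.
cooc = {                       # observed (verb, object) counts
    ("drink", "water"): 20, ("drink", "tea"): 10,
    ("sip", "tea"): 5, ("sip", "coffee"): 3,
}
verb_totals = {"drink": 30, "sip": 8}
similar = {"gulp": {"drink": 0.7, "sip": 0.3}}   # similarity weights, sum to 1

def p_sim(obj, verb):
    """Unseen 'gulp tea' borrows mass from 'drink tea' and 'sip tea'."""
    return sum(w * cooc.get((v, obj), 0) / verb_totals[v]
               for v, w in similar[verb].items())

print(p_sim("tea", "gulp"))    # 0.7*(10/30) + 0.3*(5/8) ~= 0.42
```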
2004, Psychological Medicine
Background. The processing of facial emotion involves a distributed network of limbic and paralimbic brain structures. Many of these regions are also implicated in the pathophysiology of mood disorders. Behavioural data indicate that depressed subjects show a state-related positive recognition bias for faces displaying negative emotions. There are sparse data to suggest there may be an analogous, state-related negative recognition bias for negative emotions in mania. We used functional magnetic resonance imaging (fMRI) to investigate the behavioural and neurocognitive correlates of happy and sad facial affect recognition in patients with mania.
2007, Global Ecology and Biogeography
Aim Distribution modelling relates sparse data on species occurrence or abundance to environmental information to predict the population of a species at any point in space. Recently, the importance of spatial autocorrelation in distributions has been recognized. Spatial autocorrelation can be categorized as exogenous (stemming from autocorrelation in the underlying variables) or endogenous (stemming from activities of the organism itself, such as dispersal). Typically, one asks whether spatial models explain additional variability (endogenous) in comparison to a fully specified habitat model. We turned this question around and asked: can habitat models explain additional variation when spatial structure is accounted for in a fully specified spatially explicit model? The aim was to find out to what degree habitat models may be inadvertently capturing spatial structure rather than true explanatory mechanisms.
2006, IEEE Transactions on Pattern Analysis and Machine Intelligence
In order to optimize the accuracy of the Nearest-Neighbor classification rule, a weighted distance is proposed, along with algorithms to automatically learn the corresponding weights. These weights may be specific for each class and feature, for each individual prototype, or for both. The learning algorithms are derived by (approximately) minimizing the Leaving-One-Out classification error of the given training set. The proposed approach is assessed through a series of experiments with UCI/STATLOG corpora, as well as with a more specific task of text classification which entails very sparse data representation and huge dimensionality. In all these experiments, the proposed approach shows a uniformly good behavior, with results comparable to or better than state-of-the-art results published with the same data so far.
2005, Computational Linguistics
Techniques that exploit knowledge of distributional similarity between words have been proposed in many areas of Natural Language Processing. For example, in language modeling, the sparse data problem can be alleviated by estimating the probabilities of unseen co-occurrences ...
2007, Hydrological Sciences Journal
Event-based runoff coefficients can provide information on watershed response. They are useful for catchment comparison to understand how different landscapes "filter" rainfall into event-based runoff and to explain the observed differences with catchment characteristics and related runoff mechanisms. However, the big drawback of this important parameter is the lack of a standard hydrograph separation method preceding its calculation. Event-based runoff coefficients determined with four well-established separation methods, as well as a newly developed separation method, are compared and are shown to differ considerably. This signifies that runoff coefficients reported in the literature often convey less information than required to allow for catchment classification. The new separation technique (constant-k method) is based on the theory of linear storage. Its advantages are that it is theoretically based in determining the end point of an event and that it can also be applied to events with multiple peaks. Furthermore, it is shown that event-based runoff coefficients in combination with simple statistical models improve our understanding of rainfall-runoff response of catchments with sparse data.
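A rough sketch of turning an event into a runoff coefficient once a separation rule has fixed the event window; the recession-constant test below loosely mirrors the linear-storage reasoning behind a constant-k style method, but the data, constant-baseflow treatment, and thresholds are invented.

```python
import numpy as np

# Hourly rainfall (mm) and streamflow (m^3/s) for a hypothetical event.
rain = np.array([0, 4, 12, 8, 2, 0, 0, 0, 0, 0, 0, 0], dtype=float)
q    = np.array([1.0, 1.2, 3.5, 6.0, 5.1, 4.0, 3.2, 2.6, 2.1, 1.8, 1.6, 1.5])
area_km2, dt_s = 25.0, 3600.0
baseflow = q[0]                                   # crude constant baseflow

# Linear storage implies Q(t) = Q0*exp(-k*t) on the recession limb, so
# k(t) = -d ln Q / dt should level off once quickflow has drained.
k = -np.diff(np.log(q))                           # per time step
peak = int(np.argmax(q))
stable = np.where(np.abs(np.diff(k[peak:])) < 0.01)[0]
end = peak + int(stable[0]) + 1 if stable.size else len(q) - 1

quick_mm = np.sum((q[:end + 1] - baseflow).clip(min=0)) * dt_s / (area_km2 * 1e6) * 1000.0
rc = quick_mm / rain.sum()
print(f"event ends at step {end}, runoff coefficient ~ {rc:.2f}")
```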
2006, … Laboratory, Office Of …
2001
We present a generalization of frequent itemsets allowing the notion of errors in the itemset definition. We motivate the problem and present an efficient algorithm that identifies error-tolerant frequent clusters of items in transactional data (customer-purchase data, web browsing data, text, etc.). This efficient algorithm exploits sparsity of the underlying data to find large groups of items that are correlated over database records (rows). The notion of transaction coverage allows us to extend the algorithm and view it as a fast clustering algorithm for discovering segments of similar transactions in binary sparse data. We evaluate the new algorithm on three real-world applications: clustering high-dimensional data, query selectivity estimation and collaborative filtering. Results show that we consistently uncover structure in large sparse databases that other more traditional clustering algorithms in data mining fail to find.
2011, Engineering Fracture Mechanics
This paper presents a methodology for uncertainty quantification and model validation in fatigue crack growth analysis. Several models -finite element model, crack growth model, surrogate model, etc. -are connected through a Bayes network that aids in model calibration, uncertainty quantification, and model validation. Three types of uncertainty are included in both uncertainty quantification and model validation: (1) natural variability in loading and material properties;
2009, 2009 IEEE 12th International Conference on Computer Vision
This paper investigates a new learning formulation called dynamic group sparsity. It is a natural extension of the standard sparsity concept in compressive sensing, and is motivated by the observation that in some practical sparse data the nonzero coefficients are often not random but tend to be clustered. Intuitively, better results can be achieved in these cases by reasonably utilizing both clustering and sparsity priors. Motivated by this idea, we have developed a new greedy sparse recovery algorithm, which prunes data residues in the iterative process according to both sparsity and group clustering priors rather than only sparsity as in previous methods. The proposed algorithm can recover stably sparse data with clustering trends using far fewer measurements and computations than current state-of-the-art algorithms with provable guarantees. Moreover, our algorithm can adaptively learn the dynamic group structure and the sparsity number if they are not available in the practical applications. We have applied the algorithm to sparse recovery and background subtraction in videos. Numerous experiments with improved performance over previous methods further validate our theoretical proofs and the effectiveness of the proposed algorithm.
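A toy illustration of the selection rule that distinguishes group-aware pruning from plain sparsity: rank coefficients by their own energy plus their neighbors', so clustered nonzeros are favored over isolated spikes. This shows only the pruning step, not the paper's full greedy recovery algorithm.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 100, 10
x = np.zeros(n)
x[40:50] = rng.standard_normal(10) + 3.0      # clustered nonzeros
noisy = x + 0.8 * rng.standard_normal(n)      # a rough mid-iteration estimate

def prune_plain(v, k):
    return set(np.argsort(np.abs(v))[-k:].tolist())

def prune_group_aware(v, k, tau=1.0):
    # score each entry by its own energy plus (weighted) neighbor energy
    e = v ** 2
    score = e + tau * (np.roll(e, 1) + np.roll(e, -1))
    return set(np.argsort(score)[-k:].tolist())

true_support = set(range(40, 50))
print("plain sparsity hits:", len(prune_plain(noisy, k) & true_support))
print("group-aware hits   :", len(prune_group_aware(noisy, k) & true_support))
```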
2003, Sigplan Notices
Many important applications, such as those using sparse data structures, have memory reference patterns that are unknown at compile-time. Prior work has developed runtime reorderings of data and computation that enhance locality in such applications.
2008, PLoS ONE
Background: There are sparse data on whether non-pharmaceutical interventions can reduce the spread of influenza. We implemented a study of the feasibility and efficacy of face masks and hand hygiene to reduce influenza transmission among Hong Kong household members.
2010, Cognitive Technologies
In this chapter, we focus on the automatic recognition of emotional states using acoustic and linguistic parameters as features and classifiers as tools to predict the 'correct' emotional states. We first sketch history and state of the art in this field; then we describe the process of 'corpus engineering', i.e. the design and the recording of databases, the annotation of emotional states, and further processing such as manual or automatic segmentation. Next, we present an overview of acoustic and linguistic features that are extracted automatically or manually. In the section on classifiers, we deal with topics such as the curse of dimensionality and the sparse data problem, classifiers, and evaluation. At the end of each section, we point out important aspects that should be taken into account for the planning or the assessment of studies. The subject area of this chapter is not emotions in some narrow sense but in a wider sense encompassing emotion-related states such as moods, attitudes, or interpersonal stances as well. We do not aim at an in-depth treatise of some specific aspects or algorithms but at an overview of approaches and strategies that have been used or should be used.
2010, Neurocomputing
The task of discovering natural groupings of input patterns, or clustering, is an important aspect of machine learning and pattern analysis. In this paper, we study the widely-used spectral clustering algorithm which clusters data using eigenvectors of a similarity/affinity matrix derived from a data set. In particular, we aim to solve two critical issues in spectral clustering: (1) How to automatically determine the number of clusters? and (2) How to perform effective clustering given noisy and sparse data? An analysis of the characteristics of eigenspace is carried out which shows that (a) Not every eigenvector of a data affinity matrix is informative and relevant for clustering; (b) Eigenvector selection is critical because using uninformative/irrelevant eigenvectors could lead to poor clustering results; and (c) The corresponding eigenvalues cannot be used for relevant eigenvector selection given a realistic data set.
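One common answer to the "how many clusters?" question is the eigengap heuristic on the normalized graph Laplacian, sketched below on synthetic blobs; the paper's relevance-based eigenvector selection goes further than this.

```python
import numpy as np

def estimate_num_clusters(X, sigma=1.0, k_max=10):
    """Eigengap heuristic: build an RBF affinity, form the normalized
    Laplacian, and pick k at the largest gap in its smallest eigenvalues."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    A = np.exp(-sq / (2 * sigma**2))
    np.fill_diagonal(A, 0.0)
    d = A.sum(1)
    L = np.eye(len(X)) - A / np.sqrt(d)[:, None] / np.sqrt(d)[None, :]
    vals = np.sort(np.linalg.eigvalsh(L))[:k_max]
    return int(np.argmax(np.diff(vals)) + 1)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.15, size=(30, 2)) for c in ((0, 0), (3, 0), (0, 3))])
print(estimate_num_clusters(X))              # expect 3 for three tight blobs
```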
2006, American Journal of Transplantation
Background. Everolimus is a proliferation-signal inhibitor which was introduced for heart transplant recipients in 2004. To date, there are only sparse data about long-term calcineurin inhibitor (CNI)-free immunosuppression using everolimus. Methods. After heart transplantation, patients receiving everolimus were consecutively enrolled. Reasons for switching to everolimus were side effects of CNI immunosuppression, such as deterioration of kidney function and recurrent rejection episodes. All 60 patients underwent standardized switching protocols, 42 patients completed 24-month follow-up. Blood was sampled for lipid status, renal function, routine controls, and levels of immunosuppressive agents. On days 0, 14, and 28, and then every 3 months, echocardiography and physical examination were performed. Results. After switching to everolimus, most patients recovered from the side effects. Renal function improved significantly after 24 months (creatinine, 2.1 ± 0.6 vs 1.8 ± 1 mg/dL; P < .001; creatinine clearance, 41.8 ± 22 vs 48.6 ± 21.8 mL/min; P < .001). Median blood pressure increased from 120.0/75.0 mm Hg at baseline to 123.8/80.0 mm Hg at month 24 (P values .008 and .003 for systolic and diastolic pressures, respectively). Tremor, peripheral edema, hirsutism, and gingival hyperplasia markedly improved. Levels of interleukin-6 were stable between baseline and 24-month levels. Temporary adverse events occurred in 8 patients [13.3%: interstitial pneumonia (n = 2), skin disorders (n = 2); reactivated hepatitis B (n = 1), and fever of unknown origin (n = 3)]. Conclusion. CNI-free immunosuppression using everolimus is safe, with excellent efficacy in maintenance of heart transplant recipients. Arterial hypertension and renal function significantly improved. CNI-induced side effects, such as tremor, peripheral edema, hirsutism, and gingival hyperplasia, markedly improved in most patients.
2003, Engineering Geology
Differential Synthetic Aperture Radar (SAR) interferometry (DiffSAR) allows, in principle, to measure very small movements of the ground and to cover in continuity large areas, so that it can be considered as a potentially ideal tool to investigate landslides and other slope instability. In this paper, we explore the use of this technique to improve our knowledge of the slope instability of a well-investigated area (the Maratea Valley), affected by continuous slow movements, producing an impressive ''Sackung''-type phenomenon, which poses several unanswered questions.
1997, Proceedings of the eighth conference on …
Decoding algorithm is a crucial part in statistical machine translation. We describe a stack decoding algorithm in this paper. We present the hypothesis scoring method and the heuristics used in our algorithm. We report several techniques deployed to improve the performance of the decoder. We also introduce a simplified model to moderate the sparse data problem and to speed up the decoding process. We evaluate and compare these techniques/models in our statistical machine translation system.
2008, Ecological Applications
A fundamental challenge to estimating population size with mark-recapture methods is heterogeneous capture probabilities and subsequent bias of population estimates. Confronting this problem usually requires substantial sampling effort that can be difficult to achieve for some species, such as carnivores. We developed a methodology that uses two data sources to deal with heterogeneity and applied this to DNA mark-recapture data from grizzly bears (Ursus arctos). We improved population estimates by incorporating additional DNA ''captures'' of grizzly bears obtained by collecting hair from unbaited bear rub trees concurrently with baited, grid-based, hair snag sampling. We consider a Lincoln-Petersen estimator with hair snag captures as the initial session and rub tree captures as the recapture session and develop an estimator in program MARK that treats hair snag and rub tree samples as successive sessions. Using empirical data from a large-scale project in the greater Glacier National Park, Montana, USA, area and simulation modeling we evaluate these methods and compare the results to hair-snag-only estimates. Empirical results indicate that, compared with hair-snag-only data, the joint hair-snag-rub-tree methods produce similar but more precise estimates if capture and recapture rates are reasonably high for both methods. Simulation results suggest that estimators are potentially affected by correlation of capture probabilities between sample types in the presence of heterogeneity. Overall, closed population Huggins-Pledger estimators showed the highest precision and were most robust to sparse data, heterogeneity, and capture probability correlation among sampling types. Results also indicate that these estimators can be used when a segment of the population has zero capture probability for one of the methods. We propose that this general methodology may be useful for other species in which mark-recapture data are available from multiple sources.
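The two-sample estimator underlying the hair-snag/rub-tree design, in the bias-corrected Chapman form with its usual variance; the detection counts below are invented, and the paper's multi-session estimators in program MARK go well beyond this.

```python
import math

def chapman_estimate(n1, n2, m):
    """Bias-corrected Lincoln-Petersen (Chapman) estimator.
    n1: animals detected by hair snags (session 1)
    n2: animals detected at rub trees (session 2)
    m : animals detected by both methods."""
    n_hat = (n1 + 1) * (n2 + 1) / (m + 1) - 1
    var = ((n1 + 1) * (n2 + 1) * (n1 - m) * (n2 - m)) / ((m + 1) ** 2 * (m + 2))
    se = math.sqrt(var)
    return n_hat, (n_hat - 1.96 * se, n_hat + 1.96 * se)

# Hypothetical detections: 120 bears at hair snags, 90 at rub trees, 35 at both.
print(chapman_estimate(120, 90, 35))
```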
2011, International Journal of Multiphase Flow
Received 23 September 2010; accepted 13 December 2010
2001, Canadian Journal of Zoology
The use of indices to evaluate small-mammal populations has been heavily criticized, yet a review of small-mammal studies published from 1996 through 2000 indicated that indices are still the primary methods employed for measuring populations. The literature review also found that 98% of the samples collected in these studies were too small for reliable selection among population-estimation models. Researchers therefore generally have a choice between using a default estimator or an index, a choice for which the consequences have not been critically evaluated. We examined the use of a closed-population enumeration index, the number of unique individuals captured (Mt+1), and 3 population estimators for estimating simulated small populations (N = 50) under variable effects of time, trap-induced behavior, individual heterogeneity in trapping probabilities, and detection probabilities. Simulation results indicated that the estimators produced population estimates with low bias and high precision when the estimator reflected the underlying sources of variation in capture probability. However, when the underlying sources of variation deviated from model assumptions, bias was often high and results were inconsistent. In our simulations, Mt+1 generally exhibited lower variance and less sensitivity to the sources of variation in capture probabilities than the estimators.
1999, Global Biogeochemical Cycles
Iron occurs at very low concentrations in seawater and seems to be a limiting factor for primary production in the equatorial Pacific and the Southern Ocean. The global distribution of iron is still not well understood because of a lack of data and the complex chemistry of iron. We develop a 10-box model to study the oceanic distribution of iron and its effect on atmospheric CO2 concentration. Subject to our assumptions, we find that a lack of interocean fractionation of deep sea iron concentrations, as suggested by Johnson et al.