Local Distance-Based Generalized Linear Models using the dbstats package for R

Distance-based regression for non-normal data

THE 4TH INNOVATION AND ANALYTICS CONFERENCE & EXHIBITION (IACE 2019)

Distance-based regression (DBR) is a good alternative for estimating the unknown parameters in regression modeling when the explanatory variables are of mixed type. The concept of DBR is similar to classical linear regression (LR), but the explanatory variables are represented by distances instead of raw values. This study extends the earlier work by Cuadras, which investigated DBR on normal data, to non-normal data. We also propose a new DBR approach that focuses on categorical explanatory variables and investigates binomial, nominal and ordinal data separately. The investigation was set up as a Monte Carlo study that compares the performance of DBR with bootstrapping regression (nonparametric) based on R-squared (R²), mean square error (MSE) and the Bayesian information criterion (BIC). The findings indicate that both DBR and the new DBR outperformed LR for both numerical explanatory variables and mixed-type explanatory variables.
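The abstract gives no formulas. As a hedged sketch only, the classical distance-based regression usually attributed to Cuadras can be written as follows. All symbols are assumptions introduced here: $\Delta^{(2)}$ is the $n \times n$ matrix of squared distances between observations, $J$ is the centering matrix, and $k$ is the number of principal coordinates retained. The new DBR variant proposed in the paper may differ from this.

```latex
\begin{aligned}
J &= I_n - \tfrac{1}{n}\,\mathbf{1}\mathbf{1}^{\top}, \\
G &= -\tfrac{1}{2}\, J\, \Delta^{(2)} J = U \Lambda U^{\top}, \\
X_k &= U_k \Lambda_k^{1/2}, \\
\hat{y} &= \bar{y}\,\mathbf{1} + X_k \left( X_k^{\top} X_k \right)^{-1} X_k^{\top} y
        = \bar{y}\,\mathbf{1} + U_k U_k^{\top} y .
\end{aligned}
```

Here $U_k$ and $\Lambda_k$ hold the $k$ leading eigenvectors and (positive) eigenvalues of $G$. The last equality holds because $X_k^{\top} X_k = \Lambda_k$, so the fitted values depend on the distances only through $U_k$.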

Distance models for analysis of multivariate binary data

2018

In this dissertation we developed statistical tools for analyzing multivariate binary data. In many disciplines multivariate binary data, in which there are multiple binary dependent variables and one or more independent variables, are often collected. In this dissertation we further developed the IPC model for analyzing multivariate binary data. The IPC model is a probabilistic multidimensional unfolding model and closely related to the Ideal Point Discriminant Analysis (IPDA). In Chapter 4, we proposed a Multivariate Logistic Distance (MLD) model for analyzing multivariate binary data. The MLD model unifies two domains of statistical methods, i.e., Multidimensional Scaling (MDS) and Generalized Linear Model (GLM). The MLD model can be used to simultaneously assess the dimensional structure of the data and to study the effect of the predictor variables on the response variables. For the NESDA data, for example, a researcher can use the MLD model to determine the dimensional structu...

Fitting a distance model to homogeneous subsets of variables: Points of view analysis of categorical data

Journal of Classification, 1996

An approach is presented for analyzing a heterogeneous set of categorical variables assumed to form a limited number of homogeneous subsets. The variables generate a particular set of proximities between the objects in the data matrix, and the objective of the analysis is to represent the objects in low-dimensional Euclidean spaces, where the distances approximate these proximities. A least squares loss function is minimized that involves three major components: a) the partitioning of the heterogeneous variables into homogeneous subsets; b) the optimal quantification of the categories of the variables; and c) the representation of the objects through multiple multidimensional scaling tasks performed simultaneously. An important aspect from an algorithmic point of view is the use of majorization. The use of the procedure is demonstrated by a typical example of possible application, i.e., the analysis of categorical data obtained in a free-sort task. The results of points of view analysis are contrasted with a standard homogeneity analysis, and the stability is studied through a jackknife analysis.

Distance-Based Estimation Methods for Models for Discrete and Mixed-Scale Data

2021

Pearson residuals aid the task of identifying model misspecification because they compare the model estimated from the data with the model assumed under the null hypothesis. We present different formulations of the Pearson residual system that account for the measurement scale of the data and study their properties. We further concentrate on the case of mixed-scale data, that is, data measured on both categorical and interval scales. We study the asymptotic properties and the robustness of minimum disparity estimators obtained in the case of mixed-scale data and exemplify the performance of the methods via simulation.

Distance metric choice can both reduce and induce collinearity in geographically weighted regression

Environment and Planning B: Urban Analytics and City Science, 2018

This paper explores the impact of different distance metrics on collinearity in local regression models such as Geographically Weighted Regression (GWR). Using a case study of house price data collected in Hà Nội, Vietnam, and by fully varying both power and rotation parameters to create different Minkowski distances, the analysis shows that local collinearity can be both negatively and positively affected by distance metric choice. The Minkowski distance that maximised collinearity in GWR was approximate to a Manhattan distance (power = 0.70) with a rotation of 30°, and the one that minimised collinearity was parameterised with power = 0.05 and a rotation of 70°. The results indicate that distance metric choice can provide a useful extra tuning component to address local collinearity issues in spatially varying coefficient modelling and that understanding the interaction of distance metric and collinearity can provide insight into the nature and structure of the data relationships. The discussion first considers the exploration and selection of different distance metrics to minimise collinearity as an alternative to localised ridge regression, lasso and elastic net approaches. Second, it discusses how distance metric choice could extend the methods that additionally optimise local model fit (lasso and elastic net) by selecting a distance metric that further helps minimise local collinearity. Third, it identifies the need to investigate the relationship between kernel bandwidth, distance metrics and collinearity as an area of further work.
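To make the two tuning parameters concrete, here is a minimal sketch of a Minkowski distance with a power and a rotation parameter for 2-D coordinates. The function name `rotated_minkowski` and its signature are hypothetical, and the rotation convention (rotating the coordinate axes counter-clockwise by the given angle) is an assumption; the paper's own implementation may differ.

```python
import math


def rotated_minkowski(a, b, power=2.0, rotation_deg=0.0):
    """Minkowski distance between 2-D points a and b after rotating the axes.

    Hypothetical helper for illustration. For power < 1 the result is not a
    true metric (the triangle inequality can fail), but it is still computable.
    """
    if power <= 0:
        raise ValueError("power must be positive")
    theta = math.radians(rotation_deg)
    dx, dy = b[0] - a[0], b[1] - a[1]
    # Express the difference vector in the rotated coordinate system.
    du = dx * math.cos(theta) + dy * math.sin(theta)
    dv = -dx * math.sin(theta) + dy * math.cos(theta)
    return (abs(du) ** power + abs(dv) ** power) ** (1.0 / power)


# power = 2 is Euclidean distance and is unaffected by rotation;
# power = 1 is Manhattan distance and does change with rotation.
euclid = rotated_minkowski((0, 0), (3, 4), power=2, rotation_deg=30)
manhattan = rotated_minkowski((0, 0), (3, 4), power=1)
manhattan_rot = rotated_minkowski((0, 0), (3, 4), power=1, rotation_deg=45)
```

Because only the non-Euclidean cases respond to rotation, the rotation parameter matters only in combination with a power different from 2.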

A survey of distance measures for mixed variables

International Journal of Chemical Studies, 2020

Distance measures are the basis for many statistical and data science methods and are applicable in various fields of science. Mixed-variable data, which combine continuous and categorical variables, occur frequently in fields such as medicine, agriculture, remote sensing, biology, marketing and ecology, but little work has been done on evaluating distances for this type of data. Because not much literature is available on distance measures for mixed data, this paper studies and reviews the fundamental sources that give a comprehensive account of a particular measure for mixed-variable data.
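The abstract does not name the measures it reviews. As one illustration of a distance for mixed data, the sketch below implements Gower's dissimilarity, a widely used choice: numeric variables contribute a range-scaled absolute difference, categorical variables contribute 0 on a match and 1 otherwise, and the contributions are averaged. The function name, the `numeric_ranges` argument and the example records are assumptions made for this example.

```python
def gower_distance(x, y, numeric_ranges):
    """Gower dissimilarity between two records of mixed type.

    numeric_ranges maps the index of each numeric column to its range
    (max - min over the data set); all other columns are treated as
    categorical. Hypothetical helper for illustration.
    """
    if len(x) != len(y):
        raise ValueError("records must have the same length")
    total = 0.0
    for k, (xv, yv) in enumerate(zip(x, y)):
        if k in numeric_ranges:
            col_range = numeric_ranges[k]
            # A constant column carries no information, so it contributes 0.
            total += abs(xv - yv) / col_range if col_range > 0 else 0.0
        else:
            total += 0.0 if xv == yv else 1.0
    return total / len(x)


# Made-up records: (age, area type, smoker)
records = [(30, "urban", "yes"), (50, "rural", "yes"), (40, "urban", "no")]
ages = [r[0] for r in records]
ranges = {0: max(ages) - min(ages)}  # only column 0 is numeric

d01 = gower_distance(records[0], records[1], ranges)
d02 = gower_distance(records[0], records[2], ranges)
```

Every contribution lies in [0, 1], so the result does too, which keeps numeric and categorical variables on a comparable footing.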

Managing distance and covariate information with point-based clustering

BMC Medical Research Methodology, 2016

Background: Geographic perspectives of disease and the human condition often involve point-based observations and questions of clustering or dispersion within a spatial context. These problems involve a finite set of point observations and are constrained by a larger, but finite, set of locations where the observations could occur. Developing a rigorous method for pattern analysis in this context requires handling spatial covariates, a method for constrained finite spatial clustering, and addressing bias in geographic distance measures. An approach, based on Ripley's K and applied to the problem of clustering with deliberate self-harm (DSH), is presented. Methods: Point-based Monte-Carlo simulation of Ripley's K, accounting for socioeconomic deprivation and sources of distance measurement bias, was developed to estimate clustering of DSH at a range of spatial scales. A rotated Minkowski L1 distance metric allowed variation in physical distance and clustering to be assessed. Self-harm data were derived from an audit of 2 years' emergency hospital presentations (n = 136) in a New Zealand town (population ~50,000). The study area was defined by residential (housing) land parcels representing a finite set of possible point addresses. Results: Area-based deprivation was spatially correlated. Accounting for deprivation and distance bias showed evidence for clustering of DSH for spatial scales up to 500 m with a one-sided 95% CI, suggesting that social contagion may be present for this urban cohort. Conclusions: Many problems involve finite locations in geographic space that require estimates of distance-based clustering at many scales. A Monte-Carlo approach to Ripley's K, incorporating covariates and models for distance bias, is crucial when assessing health-related clustering. The case study showed that social network structure defined at the neighbourhood level may account for aspects of neighbourhood clustering of DSH. Accounting for covariate measures that exhibit spatial clustering, such as deprivation, is crucial when assessing point-based clustering.
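The following is a simplified sketch of the general idea of a Monte-Carlo test on Ripley's K over a finite set of candidate locations. It is not the paper's method: it uses plain Euclidean distance, applies no edge correction, and ignores deprivation and distance bias. The function names, the grid of candidate locations and the example points are all assumptions made for illustration.

```python
import math
import random


def ripley_k(points, radius, area):
    """Naive Ripley's K estimate at one radius (no edge correction)."""
    n = len(points)
    # Count ordered pairs (i, j), i != j, that lie within the radius.
    pairs = sum(
        1
        for i in range(n)
        for j in range(n)
        if i != j and math.dist(points[i], points[j]) <= radius
    )
    return area * pairs / (n * (n - 1))


def monte_carlo_p_value(observed, candidates, radius, area, n_sim=199, seed=0):
    """One-sided Monte-Carlo p-value for clustering.

    Null model (an assumption of this sketch): the observed points are a
    random sample, without replacement, from the finite candidate locations.
    """
    rng = random.Random(seed)
    k_obs = ripley_k(observed, radius, area)
    k_sims = [
        ripley_k(rng.sample(candidates, len(observed)), radius, area)
        for _ in range(n_sim)
    ]
    exceed = sum(1 for k in k_sims if k >= k_obs)
    return k_obs, (exceed + 1) / (n_sim + 1)


# Made-up example: 100 candidate addresses on a grid, 8 tightly grouped events.
candidates = [(x, y) for x in range(10) for y in range(10)]
observed = [(0, 0), (0, 1), (1, 0), (1, 1), (0, 2), (2, 0), (1, 2), (2, 1)]
k_obs, p_value = monte_carlo_p_value(observed, candidates, radius=1.5, area=100.0)
```

Sampling the null pattern from the candidate locations, rather than from continuous space, is what restricts the simulated events to places where an observation could actually occur.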

A Note on the Mixed Geographically Weighted Regression Model*

Journal of Regional Science, 2004

A mixed geographically weighted regression (GWR) model is useful in the situation where certain explanatory variables influencing the response are global while others are local. Identifying these two types of explanatory variables is essential for building such a model. Nevertheless, it seems that there has not been a formal way to achieve this task. Based on some work on the GWR technique and the distribution theory of quadratic forms in normal variables, a statistical test approach is suggested here to identify a mixed GWR model. This note then mainly focuses on simulation studies to examine the performance of the test and to provide some guidelines for performing the test in practice. The simulation studies demonstrate that the test works quite well and provides a feasible way to choose an appropriate mixed GWR model for a given data set. *This research was supported by the 863 Project of China (No. 2001AA111301) and China Postdoctoral Science Foundation. The authors would like to thank the three anonymous referees and Co-Editor Marlon G. Boarnet for their valuable comments and suggestions which led to significant improvements in the paper.