Sujit Ghosh - Profile on Academia.edu

Papers by Sujit Ghosh

Scalable Resampling in Massive Generalized Linear Models via Subsampled Residual Bootstrap

arXiv (Cornell University), Jul 13, 2023

Residual bootstrap is a classical method for statistical inference in regression settings. With massive data sets becoming increasingly common, there is a demand for computationally efficient alternatives to residual bootstrap. We propose a simple and versatile scalable algorithm called subsampled residual bootstrap (SRB) for generalized linear models (GLMs), a large class of regression models that includes the classical linear regression model as well as other widely used models such as logistic, Poisson and probit regression. We prove consistency and distributional results that establish that the SRB has the same theoretical guarantees under the GLM framework as the classical residual bootstrap, while being computationally much faster. We demonstrate the empirical performance of SRB via simulation studies and a real data analysis of the Forest Covertype data from the UCI Machine Learning Repository.
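
As a rough illustration of the subsampling idea described above, the sketch below applies a subsampled residual bootstrap to ordinary linear regression (the simplest member of the GLM family). The function name `srb_linear`, the subsample-size rule, and the omission of the scale correction needed for full-sample standard errors are assumptions made here for illustration, not the paper's algorithm.

```python
import numpy as np

def srb_linear(X, y, n_boot=200, m=None, rng=None):
    """Sketch of a subsampled residual bootstrap for linear regression.

    Fit once on the full data, then resample only m << n residuals per
    replicate and refit on the corresponding m rows; the small refits are
    what make the scheme scalable. The subsample size rule below is an
    illustrative assumption.
    """
    rng = np.random.default_rng(rng)
    n, p = X.shape
    m = m or max(10 * p, int(np.sqrt(n)))          # assumed rule of thumb
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    boots = np.empty((n_boot, p))
    for b in range(n_boot):
        idx = rng.choice(n, size=m, replace=True)   # subsample of rows
        y_star = X[idx] @ beta_hat + rng.choice(resid, size=m, replace=True)
        boots[b], *_ = np.linalg.lstsq(X[idx], y_star, rcond=None)
    # note: translating subsample-scale variability into full-sample standard
    # errors generally needs a scale factor (roughly sqrt(m/n)); that
    # correction follows the paper and is omitted from this sketch.
    return beta_hat, boots

# toy usage
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(5000), rng.normal(size=(5000, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=5000)
beta_hat, boots = srb_linear(X, y)
print(beta_hat, boots.std(axis=0))   # subsample-scale spread of the estimates
```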

Bayesian Unit-Root Testing in Stochastic Volatility Models with Correlated Errors

Quality Engineering, 2013

A series of returns is often modeled using stochastic volatility models. Many observed financial series exhibit unit-root non-stationary behavior in the latent AR(1) volatility process, and tests for a unit root become necessary, especially when the error process of the returns is correlated with the error terms of the AR(1) process. In this paper, we develop a class of priors that assigns positive prior probability to the non-stationary region, employ a credible interval for the test, and show that Markov chain Monte Carlo methods can be implemented using standard software. Several practical scenarios and real examples are explored to investigate the performance of our method.
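
A minimal sketch of the credible-interval decision step described above, assuming posterior draws of the latent AR(1) coefficient are already available from an MCMC fit; the draws below are simulated placeholders and the 95% level is an illustrative choice, not the paper's prior or sampler.

```python
import numpy as np

def unit_root_credible_check(phi_draws, level=0.95):
    """Decide on a unit root from posterior draws of the AR(1) coefficient.

    If the equal-tailed credible interval lies entirely below 1, the
    stationary region is supported; otherwise the unit-root (non-stationary)
    region cannot be ruled out. This mirrors only the credible-interval idea.
    """
    lo, hi = np.quantile(phi_draws, [(1 - level) / 2, 1 - (1 - level) / 2])
    return {"interval": (lo, hi), "reject_unit_root": hi < 1.0}

# placeholder posterior draws (in practice these come from the MCMC fit)
draws = np.random.default_rng(0).normal(loc=0.97, scale=0.02, size=4000)
print(unit_root_credible_check(draws))
```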

Robust and Accurate Inference via a Mixture of Gaussian and Student's t Errors

Journal of Computational and Graphical Statistics, Feb 27, 2019

A Gaussian measurement error assumption, i.e., an assumption that the data are observed up to Gaussian noise, can bias any parameter estimation in the presence of outliers. A heavy-tailed error assumption based on Student's t distribution helps reduce the bias. However, it may be less efficient in estimating parameters if the heavy-tailed assumption is uniformly applied to all of the data when most of them are normally observed. We propose a mixture error assumption that selectively converts Gaussian errors into Student's t errors according to latent outlier indicators, leveraging the best of the Gaussian and Student's t errors, so that parameter estimation is not only robust but also accurate. Using simulated hospital profiling data and astronomical time series of brightness data, we demonstrate the potential for the proposed mixture error assumption to estimate parameters accurately in the presence of outliers. Supplementary materials are available online.
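
To make the latent-indicator idea concrete, here is a minimal sketch that computes, for fixed values of the scale, degrees of freedom, and prior outlier rate, the posterior probability that each residual came from the Student's t (outlier) component rather than the Gaussian one. In the paper these quantities are estimated jointly with the other parameters; the fixed values below are assumptions.

```python
import numpy as np
from scipy.stats import norm, t

def outlier_probabilities(resid, sigma=1.0, df=4, p_outlier=0.05):
    """Probability that each residual belongs to the Student's t (outlier)
    component of a two-component Gaussian/Student's t error mixture,
    for fixed sigma, df, and prior outlier rate (illustrative values)."""
    dens_t = t.pdf(resid / sigma, df) / sigma
    dens_n = norm.pdf(resid, scale=sigma)
    w = p_outlier * dens_t
    return w / (w + (1 - p_outlier) * dens_n)

resid = np.array([0.3, -0.8, 0.1, 6.5, -0.4])   # one obvious outlier
print(outlier_probabilities(resid).round(3))
```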

Bayesian Statistical Methods

Chapman and Hall/CRC eBooks, Apr 12, 2019

Disentangling Time-series Spectra with Gaussian Processes: Applications to Radial Velocity Analysis

The Astrophysical Journal, May 4, 2017

Measurements of radial velocity variations from the spectroscopic monitoring of stars and their companions are essential for a broad swath of astrophysics, providing access to the fundamental physical properties that dictate all phases of stellar evolution and facilitating the quantitative study of planetary systems. The conversion of those measurements into both constraints on the orbital architecture and individual component spectra can be a serious challenge, however, especially for extreme flux ratio systems and observations with relatively low sensitivity. Gaussian processes define sampling distributions of flexible, continuous functions that are well-motivated for modeling stellar spectra, enabling proficient search for companion lines in time-series spectra. We introduce a new technique for spectral disentangling, where the posterior distributions of the orbital parameters and intrinsic, rest-frame stellar spectra are explored simultaneously without needing to invoke cross-correlation templates.

Transporting survival of an HIV clinical trial to the external target populations

arXiv (Cornell University), Oct 5, 2022

Due to the heterogeneity of the randomized controlled trial (RCT) and external target populations, the estimated treatment effect from the RCT is not directly applicable to the target population. For example, the patient characteristics of the ACTG 175 HIV trial are significantly different from those of the three external target populations of interest: US early-stage HIV patients, Thailand HIV patients, and southern Ethiopia HIV patients. This paper considers several methods to transport the treatment effect from the ACTG 175 HIV trial to target populations beyond the trial population. Most transport methods focus on continuous and binary outcomes; in contrast, we derive and discuss several transport methods for survival outcomes: an outcome regression method based on a Cox proportional hazards (PH) model, an inverse probability weighting method based on models for treatment assignment, sampling score, and censoring, and a doubly robust method that combines both, called the augmented calibration weighting (ACW) method. However, as the PH assumption was found to be incorrect for the ACTG 175 trial, methods that depend on the PH assumption may lead to biased quantification of the treatment effect. To account for the violation of the PH assumption, we extend the ACW method with a linear spline-based hazard regression model that does not require the PH assumption. Applying the aforementioned methods for transportability, we explore the effect of the PH assumption, or the violation thereof, on transporting the survival results from the ACTG 175 trial to various external populations.

Assessing Biosimilarity using Functional Metrics

arXiv (Cornell University), Feb 15, 2019

In recent years there has been a lot of interest in testing for similarity between biological drug products, commonly known as biologics. Biologics are large and complex molecule drugs that are produced by living cells and hence are sensitive to environmental changes. In addition, biologics usually induce antibodies, which raises safety and efficacy issues. The manufacturing process is also much more complicated and often costlier than that of small-molecule generic drugs. Because of these complexities and the inherent variability of biologics, the testing paradigm of traditional generic drugs cannot be directly used to test for biosimilarity. Taking into account some of these concerns, we propose a functional distance based methodology that takes into consideration the entire time course of the study and is based on a class of flexible semi-parametric models. The empirical results show that the proposed approach is more sensitive than the classical equivalence tests approach, which is usually based on an arbitrarily chosen time point. Bootstrap-based methodologies are also presented for statistical inference.
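
A minimal sketch of a functional-distance comparison in the spirit of the abstract: each arm's mean time-course is compared through an L2 distance over the whole study period, with a subject-level bootstrap interval. The simple mean curves and equally spaced grid are stand-in assumptions; the paper's semi-parametric models are not reproduced here.

```python
import numpy as np

def l2_distance(tgrid, curves_ref, curves_test):
    """L2 distance between the arm-wise mean time-courses on an
    equally spaced time grid (simple Riemann approximation)."""
    dt = tgrid[1] - tgrid[0]
    diff = curves_ref.mean(axis=0) - curves_test.mean(axis=0)
    return np.sqrt(np.sum(diff ** 2) * dt)

def bootstrap_interval(tgrid, curves_ref, curves_test, n_boot=500, rng=None):
    """Bootstrap subjects within each arm to get an interval for the distance."""
    rng = np.random.default_rng(rng)
    n_r, n_t = len(curves_ref), len(curves_test)
    stats = [l2_distance(tgrid,
                         curves_ref[rng.integers(0, n_r, n_r)],
                         curves_test[rng.integers(0, n_t, n_t)])
             for _ in range(n_boot)]
    return np.quantile(stats, [0.025, 0.975])

# toy usage: 20 subjects per arm observed on 50 common time points
rng = np.random.default_rng(2)
tgrid = np.linspace(0, 10, 50)
ref = np.sin(tgrid) + rng.normal(0, 0.2, size=(20, 50))
test = np.sin(tgrid) + 0.1 + rng.normal(0, 0.2, size=(20, 50))
print(l2_distance(tgrid, ref, test), bootstrap_interval(tgrid, ref, test))
```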

Probabilistic Detection and Estimation of Conic Sections From Noisy Data

Journal of Computational and Graphical Statistics, Apr 6, 2020

Inferring unknown conic sections on the basis of noisy data is a challenging problem with applications in computer vision. A major limitation of currently available methods for conic sections is that estimation relies on the underlying shape of the conic being known (ellipse, parabola, or hyperbola). A general-purpose Bayesian hierarchical model is proposed for conic sections, and the corresponding estimation method based on noisy data is shown to work even when the specific nature of the conic section is unknown. The model thus provides probabilistic detection of the underlying conic section and inference about the associated parameters of the conic section. Through extensive simulation studies where the true conics may not be known, the methodology is demonstrated to have practical and methodological advantages relative to many existing techniques. In addition, the proposed method provides probabilistic measures of uncertainty of the estimated parameters. Furthermore, we observe high fidelity to the true conics even in challenging situations, such as data arising from partial conics in arbitrarily rotated and non-standard form, and where a visual inspection is unable to correctly identify the type of conic section underlying the data.
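
The sketch below shows only the detection step in a deliberately simple form: an algebraic least-squares conic fit followed by classification via the discriminant B^2 - 4AC of the quadratic part. It is a stand-in for intuition, not the paper's Bayesian hierarchical model, and the tolerance used for the parabola case is an arbitrary assumption.

```python
import numpy as np

def fit_conic(points):
    """Algebraic least-squares fit of a general conic
    A x^2 + B xy + C y^2 + D x + E y + F = 0 to noisy points, via the
    right singular vector of the smallest singular value of the design
    matrix (a simple stand-in for the paper's Bayesian model)."""
    x, y = points[:, 0], points[:, 1]
    M = np.column_stack([x**2, x*y, y**2, x, y, np.ones_like(x)])
    _, _, vt = np.linalg.svd(M)
    return vt[-1]                      # coefficients (A, B, C, D, E, F)

def conic_type(coeffs, tol=1e-8):
    """Classify by the discriminant B^2 - 4AC of the quadratic part."""
    A, B, C = coeffs[:3]
    disc = B**2 - 4*A*C
    if disc < -tol:
        return "ellipse"
    if disc > tol:
        return "hyperbola"
    return "parabola"

# toy usage: noisy points on the ellipse x^2/4 + y^2 = 1
rng = np.random.default_rng(3)
th = rng.uniform(0, 2*np.pi, 200)
pts = np.column_stack([2*np.cos(th), np.sin(th)]) + rng.normal(0, 0.02, (200, 2))
print(conic_type(fit_conic(pts)))
```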

Imputing for Missing Data in the ARMS Household Section: A Multivariate Imputation Approach

2015 AAEA & WAEA Joint Annual Meeting, July 26-28, San Francisco, California, 2015

This study proposes a new method to impute for ordinal missing data found in the household section of the Agricultural Resource Management Survey (ARMS). We extend a multivariate imputation method known as Iterative Sequential Regression (ISR) and make use of cut points to transform these ordinal variables into continuous variables for imputation. The household section contains important economic information on the well-being of the farm operator's household, asking respondents for information on off-farm income, household expenditures, and off-farm debt and assets. Currently, the USDA's Economic Research Service (ERS) uses conditional mean imputation in the household section, a method known to bias the variance of imputed variables downward and to distort multivariate relationships. The new transformation of these variables allows them to be jointly modeled with other ARMS variables using a Gaussian copula. A conditional linear model for imputation is then built using correlation analysis and economic theory. Finally, we discuss a Monte Carlo study which will randomly poke holes in the ARMS data to test the robustness of our proposed method. This will allow us to assess how well the adapted ISR imputation method works in comparison with two other missing data strategies, conditional mean imputation and a complete case analysis.
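
A minimal illustration of the cut-point idea described above, under the assumption that cut points are implied by the observed category frequencies: ordinal codes are mapped to a latent Gaussian scale at the probability midpoint of their category. The joint Gaussian-copula/ISR modeling with other ARMS variables is not shown; the function name and the midpoint placement rule are illustrative choices.

```python
import numpy as np
from scipy.stats import norm

def ordinal_to_latent(codes, n_levels):
    """Map ordinal codes 0..n_levels-1 to a latent Gaussian scale.

    Cut points are the normal quantiles of the cumulative category
    frequencies; each observation is placed at the probability midpoint
    of its category. An illustration of the cut-point idea only, not the
    paper's joint copula model (assumes every category is observed)."""
    counts = np.bincount(codes, minlength=n_levels)
    cum = np.concatenate([[0.0], np.cumsum(counts) / counts.sum()])
    mid_probs = (cum[:-1] + cum[1:]) / 2          # midpoint of each band
    return norm.ppf(mid_probs)[codes], norm.ppf(cum[1:-1])

codes = np.array([0, 1, 1, 2, 2, 2, 3, 1, 0, 2])
latent, cuts = ordinal_to_latent(codes, n_levels=4)
print(latent.round(3), cuts.round(3))   # latent values and interior cut points
```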

Modeling Censored Data Using Mixture Regression Models with an Application to Cattle Production Yields

RePEc: Research Papers in Economics, 2008

This research develops a mixture regression model that is shown to have advantages over the classical Tobit model in model fit and predictive tests when data are generated from a two-step process. Additionally, the model is shown to allow for flexibility in distributional assumptions while nesting the classic Tobit model. A simulated data set is utilized to assess the potential loss in efficiency from model misspecification, assuming the Tobit and a zero-inflated log-normal distribution, which is derived from the generalized mixture model. Results from simulations key on the finding that the proposed zero-inflated log-normal model clearly outperforms the Tobit model when data are generated from a two-step process. When data are generated from a Tobit model, forecasts are more accurate when utilizing the Tobit model. However, the Tobit model will be shown to be a special case of the generalized mixture model. The empirical model is then applied to evaluating mortality rates in commercial cattle feedlots, both independently and as part of a system including other performance and health factors. This particular application is hypothesized to be more appropriate for the proposed model due to the high degree of censoring and the skewed nature of mortality rates. The zero-inflated log-normal model clearly models and predicts with more accuracy than the Tobit model.
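
For concreteness, here is a sketch of the zero-inflated log-normal likelihood that the abstract contrasts with the Tobit model; covariate links for the zero probability and the log-normal mean are omitted, and the parameter values in the usage line are arbitrary assumptions.

```python
import numpy as np
from scipy.stats import lognorm

def zi_lognormal_loglik(y, pi_zero, mu, sigma):
    """Log-likelihood of a zero-inflated log-normal model:
    P(Y = 0) = pi_zero, and Y | Y > 0 ~ LogNormal(mu, sigma).
    In a regression setting, covariates would enter pi_zero and mu
    through link functions; that extension is omitted here."""
    y = np.asarray(y, dtype=float)
    zero = (y == 0)
    ll_zero = zero.sum() * np.log(pi_zero)
    ll_pos = (np.log1p(-pi_zero)
              + lognorm.logpdf(y[~zero], s=sigma, scale=np.exp(mu))).sum()
    return ll_zero + ll_pos

# toy usage: a sample with a point mass at zero and skewed positive values
y = np.array([0.0, 0.0, 0.8, 1.5, 0.0, 2.3, 0.4])
print(zi_lognormal_loglik(y, pi_zero=0.4, mu=0.0, sigma=0.7))
```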

Probabilistic Detection and Estimation of Conic Sections from Noisy Data

arXiv (Cornell University), Oct 30, 2019

Distributional outcome regression via quantile functions and its application to modelling continuously monitored heart rate and physical activity

arXiv (Cornell University), Jan 26, 2023

Modern clinical and epidemiological studies widely employ wearables to record parallel streams of real-time data on human physiology and behavior. With recent advances in distributional data analysis, these high-frequency data are now often treated as distributional observations, resulting in novel regression settings. Motivated by these modelling setups, we develop a distributional outcome regression via quantile functions (DORQF) that expands the existing literature with three key contributions: i) handling both scalar and distributional predictors, ii) ensuring a jointly monotone regression structure without enforcing monotonicity on individual functional regression coefficients, and iii) providing statistical inference via asymptotic projection-based joint confidence bands and a statistical test of global significance to quantify uncertainty of the estimated functional regression coefficients. The method is motivated by and applied to the Actiheart component of the Baltimore Longitudinal Study of Aging, which collected one week of minute-level heart rate (HR) and physical activity (PA) data on 781 older adults, to gain a deeper understanding of age-related changes in daily life heart rate reserve, defined as a distribution of daily HR, while accounting for the daily distribution of physical activity, age, gender, and body composition. Intriguingly, the results provide novel insights into the epidemiology of daily life heart rate reserve.

Dual Efficient Forecasting Framework for Time Series Data

arXiv (Cornell University), Oct 27, 2022

Time series forecasting has been a quintessential topic in data science, but traditionally, forecasting models have relied on extensive historical data. In this paper, we address a practical question: how much recent historical data is required to attain a targeted percentage of statistical prediction efficiency compared to the full time series? We propose the Pareto-Efficient Backsubsampling (PaEBack) method to estimate the percentage of the most recent data needed to achieve the desired level of prediction accuracy. We provide a theoretical justification based on asymptotic prediction theory for autoregressive (AR) models. In particular, through several numerical illustrations, we show the application of PaEBack to some recently developed machine learning forecasting methods, even when the models might be misspecified. The main conclusion is that only a fraction of the most recent historical data provides near-optimal or even better relative predictive accuracy for a broad class of forecasting methods.
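
A rough sketch of the backsubsampling idea: fit a simple AR forecaster on increasing fractions of the most recent training data and report the smallest fraction whose one-step-ahead error is within a chosen factor of the full-data error. The AR(2) forecaster, the 5% tolerance, and the fraction grid are illustrative assumptions, not the PaEBack procedure itself.

```python
import numpy as np

def fit_ar(z, p):
    """Least-squares fit of an AR(p) with intercept."""
    y = z[p:]
    X = np.column_stack([np.ones(len(y))] +
                        [z[p - k: len(z) - k] for k in range(1, p + 1)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def one_step_errors(history, test, coef, p):
    """One-step-ahead forecast errors over `test`, rolling the history forward."""
    z = list(history)
    errs = []
    for actual in test:
        lags = z[-1:-p - 1:-1]                  # z[t-1], ..., z[t-p]
        pred = coef[0] + np.dot(coef[1:], lags)
        errs.append(actual - pred)
        z.append(actual)
    return np.asarray(errs)

def paeback_fraction(series, p=2, test_size=50, target=1.05,
                     fractions=np.arange(0.1, 1.01, 0.1)):
    """Smallest fraction of the most recent training data whose one-step MSE
    is within `target` times the full-data MSE (illustrative thresholds)."""
    train, test = series[:-test_size], series[-test_size:]
    full_mse = np.mean(one_step_errors(train, test, fit_ar(train, p), p) ** 2)
    for f in fractions:
        recent = train[-max(int(f * len(train)), 5 * p):]
        mse = np.mean(one_step_errors(recent, test, fit_ar(recent, p), p) ** 2)
        if mse <= target * full_mse:
            return f, mse, full_mse
    return 1.0, full_mse, full_mse

# toy usage: a long AR(1) series where a small recent window already suffices
rng = np.random.default_rng(4)
z = np.zeros(2000)
for t in range(1, 2000):
    z[t] = 0.7 * z[t - 1] + rng.normal()
print(paeback_fraction(z))
```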

Shape-constrained estimation in functional regression with Bernstein polynomials

Computational Statistics & Data Analysis

Shape restrictions on functional regression coefficients such as non-negativity, monotonicity, convexity or concavity are often available in the form of prior knowledge or required to maintain structural consistency in functional regression models. A new estimation method is developed for shape-constrained functional regression models using Bernstein polynomials. Specifically, estimation approaches from nonparametric regression are extended to functional data, properly accounting for shape constraints in a large class of functional regression models such as scalar-on-function regression (SOFR), function-on-scalar regression (FOSR), and function-on-function regression (FOFR). Theoretical results establish the asymptotic consistency of the constrained estimators under standard regularity conditions. A projection-based approach provides point-wise asymptotic confidence intervals for the constrained estimators. A bootstrap test is developed to facilitate testing of the shape constraints. Numerical analysis using simulations illustrates improvement in efficiency of the estimators from the use of the proposed method under shape constraints. Two applications include i) modeling a drug effect in a mental health study via shape-restricted FOSR and ii) modeling subject-specific quantile functions of accelerometry-estimated physical activity in the Baltimore Longitudinal Study of Aging (BLSA) as outcomes via shape-restricted quantile-function-on-scalar regression (QFOSR). An R software implementation and illustration of the proposed estimation method and the test are provided.
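
To illustrate how Bernstein coefficients encode a shape constraint, the sketch below fits a monotone (non-decreasing) curve by reparameterizing the coefficients as an intercept plus non-negative increments and solving a bound-constrained least-squares problem. This shows the constraint mechanism only; the functional-regression estimators (SOFR/FOSR/FOFR), inference bands, and bootstrap test from the paper are not reproduced, and the degree used is an arbitrary assumption.

```python
import numpy as np
from scipy.optimize import lsq_linear
from scipy.special import comb

def bernstein_basis(t, degree):
    """Bernstein basis functions B_{k,N}(t) on [0, 1], shape (len(t), N+1)."""
    k = np.arange(degree + 1)
    return comb(degree, k) * t[:, None] ** k * (1 - t[:, None]) ** (degree - k)

def fit_monotone(t, y, degree=8):
    """Monotone (non-decreasing) curve fit via Bernstein coefficients.

    Monotonicity holds when the coefficients are non-decreasing, so they are
    reparameterized as an intercept plus non-negative increments and the fit
    becomes a bound-constrained least-squares problem."""
    B = bernstein_basis(t, degree)
    # columns: intercept (the basis functions sum to 1) and cumulative tails
    A = np.column_stack([np.ones(len(t))] +
                        [B[:, j:].sum(axis=1) for j in range(1, degree + 1)])
    lb = np.r_[-np.inf, np.zeros(degree)]
    res = lsq_linear(A, y, bounds=(lb, np.full(degree + 1, np.inf)))
    coef = np.concatenate([[res.x[0]], res.x[0] + np.cumsum(res.x[1:])])
    return coef, B @ coef           # Bernstein coefficients and fitted values

# toy usage: noisy monotone signal
rng = np.random.default_rng(5)
t = np.sort(rng.uniform(0, 1, 200))
y = np.log1p(5 * t) + rng.normal(0, 0.1, 200)
coef, fitted = fit_monotone(t, y)
print(np.all(np.diff(coef) >= -1e-10))   # coefficients are non-decreasing
```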

A Model for Overdispersion and Underdispersion using Latent Markov Processes

A new model for both overdispersion and underdispersion using latent Markov processes modeled as stationary processes is proposed. The parameters in this model can be estimated by Bayesian methods. The performance of the proposed method for the new model, evaluated in terms of bias, MSE, and coverage probability, has been explored using numerical methods based on simulated and real data.

Dynamic correlation multivariate stochastic volatility with latent factors

Statistica Neerlandica, 2017

Modeling the correlation structure of returns is essential in many financial applications. Considerable evidence from empirical studies has shown that the correlation among asset returns is not stable over time. A recent development in the multivariate stochastic volatility literature is the application of inverse Wishart processes to characterize the evolution of return correlation matrices. Within the inverse Wishart multivariate stochastic volatility framework, we propose a flexible correlated latent factor model to achieve dimension reduction and capture the stylized fact of ‘correlation breakdown’ simultaneously. The parameter estimation is based on existing Markov chain Monte Carlo methods. We illustrate the proposed model with several empirical studies. In particular, we use high‐dimensional stock return data to compare our model with competing models based on multiple performance metrics and tests. The results show that the proposed model not only describes historic stylized...

Data transforming augmentation for heteroscedastic models

Journal of Computational and Graphical Statistics, 2020

Data augmentation (DA) turns seemingly intractable computational problems into simple ones by augmenting latent missing data. In addition to computational simplicity, it is now well-established that DA equipped with a deterministic transformation can improve the convergence speed of iterative algorithms such as an EM algorithm or Gibbs sampler. In this article, we outline a framework for the transformation-based DA, which we call data transforming augmentation (DTA), allowing augmented data to be a deterministic function of latent and observed data, and unknown parameters. Under this framework, we investigate a novel DTA scheme that turns heteroscedastic models into homoscedastic ones to take advantage of simpler computations typically available in homoscedastic cases. Applying this DTA scheme to fitting linear mixed models, we demonstrate simpler computations and faster convergence rates of resulting iterative algorithms, compared with those under a non-transformation-based DA scheme. We also fit a Beta-Binomial model using the proposed DTA scheme, which enables sampling approximate marginal posterior distributions that are available only under homoscedasticity. An R package Rdta is publicly available at CRAN.
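
As a simple point of reference for the transform-to-homoscedastic idea, the sketch below rescales a heteroscedastic linear model with known error standard deviations so that ordinary least squares on the transformed data coincides with weighted least squares. This illustrates only the general idea; the paper's DTA scheme embeds such deterministic transformations inside EM/Gibbs updates for mixed and Beta-Binomial models and is not shown here.

```python
import numpy as np

def rescale_to_homoscedastic(X, y, sigma):
    """Rescale rows of a heteroscedastic linear model y_i = x_i'b + e_i,
    e_i ~ N(0, sigma_i^2), by 1/sigma_i so the transformed model has
    constant unit error variance (general idea only, not the paper's DTA)."""
    w = 1.0 / np.asarray(sigma)
    return X * w[:, None], y * w

# toy usage: OLS on the rescaled data equals weighted least squares
rng = np.random.default_rng(6)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
sigma = rng.uniform(0.5, 3.0, n)
y = X @ np.array([1.0, 2.0]) + sigma * rng.normal(size=n)
Xt, yt = rescale_to_homoscedastic(X, y, sigma)
beta_wls, *_ = np.linalg.lstsq(Xt, yt, rcond=None)
print(beta_wls)    # close to the true coefficients (1.0, 2.0)
```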

A comparative study of the dose-response analysis with application to the target dose estimation

Journal of statistical theory and practice, Nov 18, 2016

With a quantal response, the dose-response relation is summarized by the response probability function (RPF), which gives the probability of a response as a function of dose level. In dose-response analysis (DRA), it is often of primary interest to find a dose at which a targeted response probability is attained, which we call the target dose (TD). The estimation of the TD clearly depends on the underlying RPF structure. In this article, we provide a comparative analysis of some of the existing and newly proposed RPF estimation methods, with particular emphasis on TD estimation. Empirical performances based on simulated data are presented to compare the existing and newly proposed methods. Nonparametric models based on a sequence of Bernstein polynomials are found to be robust against model misspecification. The methods are also illustrated using data obtained from a toxicological study.
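
A minimal sketch of target-dose estimation with one of the simplest parametric RPF choices: a two-parameter logistic fitted by maximum likelihood and then inverted at the targeted response probability. The logistic form, the optimizer, and the simulated doses are illustrative assumptions; the Bernstein-polynomial RPF estimators studied in the paper are not shown.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit, logit

def fit_logistic_rpf(dose, resp):
    """Fit a two-parameter logistic RPF, P(response | d) = expit(a + b*d),
    by maximum likelihood (a simple parametric stand-in)."""
    def nll(par):
        eta = par[0] + par[1] * dose
        return -np.sum(resp * eta - np.log1p(np.exp(eta)))
    return minimize(nll, x0=np.zeros(2), method="BFGS").x

def target_dose(params, gamma):
    """Invert the fitted RPF: dose at which P(response) = gamma."""
    a, b = params
    return (logit(gamma) - a) / b

# toy usage: quantal responses generated from a known logistic RPF
rng = np.random.default_rng(7)
dose = rng.uniform(0, 10, 300)
resp = rng.binomial(1, expit(-3 + 0.8 * dose))
params = fit_logistic_rpf(dose, resp)
print(params, target_dose(params, gamma=0.5))   # true TD50 is 3.75
```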

Generalized Linear Models

CRC Press eBooks, May 25, 2000

The term "generalized linear models" encompasses both a class of models and a style of thinking a... more The term "generalized linear models" encompasses both a class of models and a style of thinking about building models that is applicable to a wide variety of distributions and types of responses. Key elements are a distributional assumption for the data and a transformation of the mean which is assumed to create a linear model for the predictors. The history of generalized linear models 1s traced, current work is reviewed and some predictions are made.

Model Validation of a Single Degree-of-Freedom Oscillator: A Case Study

Stats

In this paper, we investigate a validation process in order to assess the predictive capabilities of a single degree-of-freedom oscillator. Model validation is understood here as the process of determining the accuracy with which a model can predict observed physical events or important features of the physical system. Therefore, assessment of the model needs to be performed with respect to the conditions under which the model is used in actual simulations of the system and to specific quantities of interest used for decision-making. Model validation also supposes that the model be trained and tested against experimental data. In this work, virtual data are produced from a non-linear single degree-of-freedom oscillator, the so-called oracle model, which is supposed to provide an accurate representation of reality. The mathematical model to be validated is derived from the oracle model by simply neglecting the non-linear term. The model parameters are identified via Bayesian updating...
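
A rough sketch of the oracle-versus-model comparison described above: virtual data come from a nonlinear single degree-of-freedom oscillator (a cubic stiffness term plays the role of the neglected nonlinearity), the candidate model sets that term to zero, and the two are compared on a quantity of interest (steady-state peak displacement). All parameter values and the choice of quantity of interest are illustrative assumptions; the Bayesian updating of the model parameters is not shown.

```python
import numpy as np
from scipy.integrate import solve_ivp

def sdof_response(eps, zeta=0.05, omega=1.0, force=0.5, om_f=1.2, t_end=200.0):
    """Steady-state peak displacement of
    x'' + 2*zeta*omega*x' + omega^2*x + eps*x^3 = force*cos(om_f*t).
    eps > 0 acts as the nonlinear 'oracle'; eps = 0 is the simplified model."""
    def rhs(t, s):
        x, v = s
        return [v, force * np.cos(om_f * t) - 2 * zeta * omega * v
                - omega**2 * x - eps * x**3]
    sol = solve_ivp(rhs, (0.0, t_end), [0.0, 0.0], max_step=0.05)
    steady = sol.y[0][sol.t > 0.75 * t_end]      # discard the transient
    return np.max(np.abs(steady))

q_oracle = sdof_response(eps=0.4)    # "reality" (nonlinear oracle)
q_model = sdof_response(eps=0.0)     # simplified linear model to be validated
print(q_oracle, q_model, abs(q_model - q_oracle) / q_oracle)  # relative error in the QoI
```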
