Sourish Das - Profile on Academia.edu

Papers by Sourish Das

2017 International Conference on Computational Intelligence in Data Science(ICCIDS), 2017

Yield curve forecasting is an important problem in finance. In this work we explore the use of Gaussian Processes in conjunction with a dynamic modeling strategy, much like the Kalman Filter, to model the yield curve. Gaussian Processes have been successfully applied to model functional data in a variety of applications. A Gaussian Process is used to model the yield curve. The hyper-parameters of the Gaussian Process model are updated as the algorithm receives yield curve data. Yield curve data is typically available as a time series with a frequency of one day. We compare existing methods to forecast the yield curve with the proposed method. The results of this study showed that while a competing method (a multivariate time series method) performed well in forecasting the yields at the short term structure region of the yield curve, Gaussian Processes perform well in the medium and long term structure regions of the yield curve. Accuracy in the long term structure region of the yield curve has important practical implications. The Gaussian Process framework yields uncertainty and probability estimates directly, in contrast to other competing methods. Analysts are frequently interested in this information. In this study, the proposed method has been applied to yield curve forecasting; however, it can be applied to model high frequency time series data or data streams in other domains.

ArXiv, 2017

We present an algorithm for classification tasks on big data. Experiments conducted as part of this study indicate that the algorithm can be as accurate as ensemble methods such as random forests or gradient boosted trees. Unlike ensemble methods, the models produced by the algorithm can be easily interpreted. The algorithm is based on a divide and conquer strategy and consists of two steps. The first step consists of using a decision tree to segment the large dataset. By construction, decision trees attempt to create homogeneous class distributions in their leaf nodes. However, non-homogeneous leaf nodes are usually produced. The second step of the algorithm consists of using a suitable classifier to determine the class labels for the non-homogeneous leaf nodes. The decision tree segment provides a coarse segment profile while the leaf level classifier can provide information about the attributes that affect the label within a segment.

Computational Statistics, 2020

Statistical Machine Learning (SML) refers to a body of algorithms and methods by which computers are allowed to discover important features of input data sets which are often very large in size. The very task of feature discovery from data is essentially the meaning of the keyword 'learning' in SML. Theoretical justifications for the effectiveness of the SML algorithms are underpinned by sound principles from different disciplines, such as Computer Science and Statistics. The theoretical underpinnings particularly justified by statistical inference methods are together termed as statistical learning theory. This paper provides a review of SML from a Bayesian decision theoretic point of view, where we argue that many SML techniques are closely connected to making inference by using the so-called Bayesian paradigm. We discuss many important SML techniques such as supervised and unsupervised learning, deep learning, online learning and Gaussian processes, especially in the context of very large data sets where these are often employed. We present a dictionary which maps the key concepts of SML from Computer Science and Statistics. We illustrate the SML techniques with three moderately large data sets where we also discuss many practical implementation issues. Thus the review is especially targeted at statisticians and computer scientists who are aspiring to understand and apply SML for moderately large to big data sets.

Calcutta Statistical Association Bulletin, 2018

Datasets with a mixture of numerical and categorical attributes are routinely encountered in many application domains. Such datasets do not have a direct representation in Euclidean space. As a consequence, dissimilarity measures such as the Gower distance are used when partitioning clustering approaches are used with such datasets. Homogeneity analysis (HA) can be used to determine a Euclidean representation of mixed datasets. Such a representation can be analysed by leveraging the large body of tools and techniques for data with a Euclidean representation. The utility of the representation obtained from HA is not limited to clustering. This representation can be used to visualize mixed datasets and generate succinct numerical summaries. Such summaries can yield clues about associations between variables which may be difficult to discover otherwise. AMS Classification Code: 62-07
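The Gower distance mentioned in this abstract can be illustrated with a short sketch. This is not code from the paper: it assumes the common textbook form of the measure (range-scaled absolute difference for numerical attributes, simple mismatch for categorical ones, averaged with equal weights), and the function name, records, and attribute ranges are made up for illustration.

```python
def gower_distance(a, b, numeric_ranges):
    """Gower dissimilarity between two mixed-type records.

    numeric_ranges[i] is the observed range (max - min) of attribute i when
    it is numerical, or None when the attribute is categorical.
    Equal attribute weights are an assumption of this sketch.
    """
    if not (len(a) == len(b) == len(numeric_ranges)) or len(a) == 0:
        raise ValueError("records and ranges must be non-empty and equal in length")
    total = 0.0
    for x, y, rng in zip(a, b, numeric_ranges):
        if rng is None:
            # Categorical attribute: 0 on a match, 1 on a mismatch.
            total += 0.0 if x == y else 1.0
        elif rng == 0:
            # Constant numerical attribute contributes no dissimilarity.
            total += 0.0
        else:
            # Numerical attribute: absolute difference scaled by its range.
            total += abs(x - y) / rng
    return total / len(a)


# Hypothetical records: (age, income, colour, owns_home).
ranges = [50.0, 100000.0, None, None]
r1 = (30, 40000.0, "red", True)
r2 = (40, 90000.0, "blue", True)
d = gower_distance(r1, r2, ranges)
```

Each attribute contributes a value between 0 and 1, so the result also lies between 0 and 1; it is a dissimilarity rather than a coordinate system, which is the gap the abstract says homogeneity analysis is meant to fill.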

Big Data Research, 2018

Gaussian Processes are widely used for regression tasks. A known limitation in the application of Gaussian Processes to regression tasks is that the computation of the solution requires performing a matrix inversion. The solution also requires the storage of a large matrix in memory. These factors restrict the application of Gaussian Process regression to small and moderate size data sets. We present an algorithm that combines estimates from models developed using subsets of the data obtained in a manner similar to the bootstrap. The sample size is a critical parameter for this algorithm. Guidelines for reasonable choices of algorithm parameters, based on detailed experimental study, are provided. Various techniques have been proposed to scale Gaussian Processes to large scale regression tasks. The most appropriate choice depends on the problem context. The proposed method is most appropriate for problems where an additive model works well and the response depends on a small number of features. The minimax rate of convergence for such problems is attractive and we can build effective models with a small subset of the data. The Stochastic Variational Gaussian Process and the Sparse Gaussian Process are also appropriate choices for such problems. These methods pick a subset of data based on theoretical considerations. The proposed algorithm uses bagging and random sampling. Results from experiments conducted as part of this study indicate that the algorithm presented in this work can be as effective as these methods.

New Economic Windows, 2019

Yield curve modeling is an essential problem in finance. In this work, we explore the use of Bayesian statistical methods in conjunction with the Nelson-Siegel model. We present the hierarchical Bayesian model for the parameters of the Nelson-Siegel yield function. We implement the MAP estimates via the BFGS algorithm in rstan. The Bayesian analysis relies on the Monte Carlo simulation method. We perform the Hamiltonian Monte Carlo (HMC), using the rstan package. As a by-product of the HMC, we can simulate the Monte Carlo price of a bond, and it helps us to identify if the bond is over-valued or under-valued. We demonstrate the process with an experiment and US Treasury's yield curve data. One interesting observation of the experiment is that there is a strong negative correlation between the price and the long-term effect of yield. However, the relationship between the short-term interest rate effect and the value of the bond is weakly positive. This is because posterior analysis shows that the short-term effect and the long-term effect are negatively correlated.
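As a rough illustration of the Nelson-Siegel yield function named in this abstract, the sketch below evaluates the curve for given parameters. It assumes the widely used decay-rate parametrization, y(tau) = beta0 + beta1 * (1 - exp(-lam*tau)) / (lam*tau) + beta2 * ((1 - exp(-lam*tau)) / (lam*tau) - exp(-lam*tau)); the paper may use a different but equivalent form. The zero-coupon pricing helper, the continuous-compounding convention, and all parameter values are assumptions made here for illustration, not results from the paper.

```python
import math


def nelson_siegel_yield(tau, beta0, beta1, beta2, lam):
    """Nelson-Siegel yield at maturity tau (in years), decay-rate form.

    beta0 acts as the long-term level, beta1 as the short-term component,
    beta2 as the medium-term hump, and lam as the decay rate.
    """
    if tau <= 0 or lam <= 0:
        raise ValueError("tau and lam must be positive")
    x = lam * tau
    # (1 - exp(-x)) / x, computed with expm1 for accuracy when x is small.
    loading = -math.expm1(-x) / x
    return beta0 + beta1 * loading + beta2 * (loading - math.exp(-x))


def zero_coupon_price(face, tau, beta0, beta1, beta2, lam):
    """Price of a zero-coupon bond discounted at the Nelson-Siegel yield.

    Continuous compounding is an assumption of this sketch.
    """
    y = nelson_siegel_yield(tau, beta0, beta1, beta2, lam)
    return face * math.exp(-y * tau)


# Hypothetical parameter values, chosen only to exercise the functions.
params = dict(beta0=0.04, beta1=-0.02, beta2=0.01, lam=0.5)
long_end = nelson_siegel_yield(1e6, **params)
short_end = nelson_siegel_yield(1e-6, **params)
price_10y = zero_coupon_price(100.0, 10.0, **params)
```

In this form the yield tends to beta0 at very long maturities and to beta0 + beta1 at very short ones, which is why these two parameters are commonly read as the long-term and short-term effects. Applying the pricing helper to each posterior draw of the parameters would give a Monte Carlo distribution of the price, in the spirit of the by-product the abstract describes.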

Efficacy of endoscopic ultrasound (EUS) guided celiac plexus neurolysis (CPN) for managing abdominal pain associated with pancreas cancer: a meta-analysis

Gastrointestinal Endoscopy, 2009

Efficacy of Endoscopic Ultrasound-guided Celiac Plexus Block and Celiac Plexus Neurolysis for Managing Abdominal Pain Associated With Chronic Pancreatitis and Pancreatic Cancer

Journal of Clinical Gastroenterology, 2010

Endoscopic ultrasound (EUS)-guided celiac plexus block (CPB) and celiac plexus neurolysis (CPN) have become important interventions in the management of pain due to chronic pancreatitis and pancreatic cancer. However, only a few well-structured studies have been performed to evaluate their efficacy. Given limited data, their use remains controversial. Herein, we evaluate the efficacy of EUS-guided CPB and CPN in alleviating chronic abdominal pain due to chronic pancreatitis and pancreatic cancer, respectively. Using Medline, PubMed, and Embase databases from January 1966 through December 2007, a thorough search of the English-language literature for studies evaluating the efficacy of EUS-guided CPB and CPN for the management of chronic abdominal pain due to chronic pancreatitis and pancreatic cancer was conducted, along with a hand search of reference lists. Studies that involved fewer than 10 patients were excluded. Data on pain relief was extracted, pooled, and analyzed. A total of 9 studies were included in the final analysis. For chronic pancreatitis, 6 relevant studies were identified, comprising a total of 221 patients. EUS-guided CPB was effective in alleviating abdominal pain in 51.46% of patients. For pancreatic cancer, 5 relevant studies were identified with a total of 119 patients. EUS-guided CPN was effective in alleviating abdominal pain in 72.54% of patients. EUS-guided CPB was 51.46% effective in managing chronic abdominal pain in patients with chronic pancreatitis, but warrants improvement in patient selection and refinement of technique, whereas EUS-guided CPN was 72.54% effective in managing pain due to pancreatic cancer and is a reasonable option for patients with tolerance to narcotic analgesics.

S1332 Efficacy of Endoscopic Ultrasound (EUS) Guided Celiac Plexus Block (CPB) for Managing Abdominal Pain Associated with Chronic Pancreatitis (CP): A Meta-Analysis

Gastroenterology, 2008

received chemotherapy with low dose gemcitabine (800-1000mg/body) on days 1, 8, and 15 every 4 weeks, or observation. Results: Twenty-one patients had locally advanced cancer, and 38 had metastatic pancreatic cancer. Thirty-nine patients had Eastern Cooperative Oncology Group performance status of 0 to 2, and 19 had the performance status of 3. Thirty-five patients received the chemotherapy. Five could not complete the chemotherapy of one cycle. Of 30 patients receiving the chemotherapy of more than 1 cycle, 1 achieved partial response and 11 disease stabilization; however, progressive disease was noted in 18 patients. Median survival time was 8.0 months and 3.0 months, respectively, in patients receiving the chemotherapy and observation. In 58 patients, chemotherapy with low dose gemcitabine (odds ratio 6.84, 95% confidence interval 1.04-44.8, P = 0.04) and no metastasis (odds ratio 13.0, 95% confidence interval 2.50-67.4, P = 0.002) were associated with 6-month survival by using multivariate logistic regression analysis, and performance status was not selected. Patients with disease stabilization had better survival than those with progressive disease (median survival time 13.2 months versus 6.7 months, P = 0.02, Breslow-Gehan-Wilcoxon). Furthermore, patients with progressive disease had better survival than those receiving observation (P = 0.02, Breslow-Gehan-Wilcoxon). Conclusion: This study indicates that the chemotherapy with low dose gemcitabine may improve the prognosis of elderly patients with unresectable advanced pancreatic cancer. We consider that, even in elderly patients, the effect of gemcitabine is worth investigating in future studies.

In this article, we obtain an estimator of the regression parameters for generalized linear models, using the Jacobian technique. We restrict ourselves to the natural exponential family for the response variable and choose the conjugate prior for the natural parameter. Using the Jacobian of transformation, we obtain the posterior distribution for the canonical link function and thereby obtain the posterior mode for the link. Under the full rank assumption for the covariate matrix, we then find an estimator for the regression parameters for the natural exponential family. Then the proposed estimator is specially derived for the Poisson model with log link function, and the binomial response model with the logit link function. We also discuss extensions to the binomial response model when covariates are all positive. Finally, an illustrative real-life example is given for the Poisson model with log link. In order to estimate the standard error of our estimators, we use the Bernstein-von Mises theorem. Finally, we compare the results using our Jacobian technique with the maximum likelihood estimates for the regression parameters.

Statistics & Probability Letters, 2010

In this paper we define a generalized multivariate gamma (MG) distribution and develop various properties of this distribution. Then we consider a Bayesian decision theoretic approach to develop the inference technique for the related scale matrix Σ. We show that the maximum a posteriori (MAP) estimate is a Bayes estimator. We also develop the testing problem for Σ using the Bayes factor. This approach provides a mathematically closed form solution for Σ. The only other approach to Bayesian inference for the MG distribution is given in Tsionas (2004), which is based on the Markov Chain Monte Carlo (MCMC) technique. The Tsionas (2004) technique involves costly matrix inversion whose computational complexity increases in cubic order, making inference for Σ infeasible for large dimensions. In this paper, we provide an elegant closed form Bayes factor for Σ.

Alcoholism-clinical and Experimental Research, 2007

Background: Previous studies demonstrated, and replicated, an association between single nucleotide polymorphisms (SNPs) within the GABRA2 gene and risk for alcohol dependence. The present study examines the association of a GABRA2 SNP with another definition of alcohol involvement and with the effects of psychosocial treatment. Methods: European-American subjects (n = 812, 73.4% male) provided DNA samples for the analysis. All were participants in Project Matching Alcoholism Treatment to Client Heterogeneity (MATCH), a multi-center randomized clinical trial evaluating the efficacy of 3 types of psychosocial treatment for alcoholism: Cognitive Behavioral Therapy (CBT), Motivational Enhancement Therapy (MET), or twelve-step facilitation (TSF). The daily probabilities of drinking and heavy drinking were estimated during the 12-week treatment and 12-month post-treatment periods. Results: Subjects homozygous for the allele associated with low risk for alcohol dependence in previous studies had lower daily probabilities of drinking and heavy drinking in the present study. This low-risk allele was also associated with a greater difference in drinking outcomes between the treatments. In addition, it enhanced the relative superiority of TSF over CBT and MET. Population stratification was excluded as a confound using ancestry informative marker analysis. Conclusions: The assessment of genetic vulnerability may be relevant to studies of the efficacy of psychosocial treatment: GABRA2 genotype modifies the variance in drinking and can therefore moderate power for resolving differences between treatments.

Statistics in Medicine, 2010

We developed a novel Pareto regression model with unknown shape parameter to analyze extreme drinking in patients with Alcohol Dependence (AD). We used a generalized linear model (GLM) framework and a log-link between the shape parameter of the random and systematic components and a Monte Carlo based Bayesian method to implement the analysis. We examined two issues of importance in the study of AD: first, we tested whether a single nucleotide polymorphism within the GABRA2 gene, which encodes a subunit of the GABA-A receptor and has been associated with AD, influenced extreme alcohol intake; and second, the efficacy of three psychotherapies for alcoholism in treating extreme drinking behavior. European-American participants (n = 812, 73.4% male) from Project MATCH, a multi-center randomized clinical trial of the psychotherapeutic treatment of alcoholism, provided DNA samples for this study.
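The log-link between the Pareto shape parameter and the covariates can be sketched as a log-likelihood function. This is only an illustration under stated assumptions, not the paper's model: it assumes the standard Pareto density f(y) = alpha * y_min^alpha / y^(alpha + 1) for y >= y_min, a known scale y_min, and shape alpha_i = exp(x_i . beta). The function name and data are hypothetical, and the Bayesian Monte Carlo step described in the abstract is not shown.

```python
import math


def pareto_loglik(beta, X, y, y_min):
    """Log-likelihood of Pareto responses with a log-link on the shape.

    For observation i the shape is alpha_i = exp(x_i . beta), and
    log f(y_i) = log(alpha_i) + alpha_i * log(y_min) - (alpha_i + 1) * log(y_i).
    The scale y_min is treated as known (an assumption of this sketch).
    """
    if y_min <= 0:
        raise ValueError("y_min must be positive")
    total = 0.0
    for x_row, y_i in zip(X, y):
        if y_i < y_min:
            raise ValueError("every response must be at least y_min")
        eta = sum(b * x for b, x in zip(beta, x_row))  # linear predictor
        alpha = math.exp(eta)                          # inverse log-link
        total += eta + alpha * math.log(y_min) - (alpha + 1.0) * math.log(y_i)
    return total


# Hypothetical intercept-only data, where the maximiser is known in closed
# form: alpha_hat = n / sum(log(y_i / y_min)).
y = [1.5, 2.0, 4.0, 8.0]
X = [[1.0] for _ in y]
y_min = 1.0
alpha_hat = len(y) / sum(math.log(v / y_min) for v in y)
beta_hat = [math.log(alpha_hat)]
ll_at_mle = pareto_loglik(beta_hat, X, y, y_min)
```

A sampler such as random-walk Metropolis could add a log-prior for beta to this function and use the sum as its target, which is one way a Monte Carlo based Bayesian analysis of such a model might proceed.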

American Statistician, 2006