Changyi Park - Academia.edu

Papers by Changyi Park

Kernel variable selection for multicategory support vector machines

Journal of Multivariate Analysis

Comparison of nonlinear classification methods for image data

Journal of the Korean Data and Information Science Society

Nonparametric matrix regression function estimation over symmetric positive definite matrices

Journal of the Korean Statistical Society

Comparison study of classification methods for image data

Journal of the Korean Data and Information Science Society

Comparison study of K-nearest neighborhood classification algorithms

Journal of the Korean Data and Information Science Society

Regularized boxplot via convex clustering

Journal of Statistical Computation and Simulation

Network analysis for count data with excess zeros

BMC Genetics, Nov 6, 2017

Undirected graphical models or Markov random fields have been a popular class of models for representing conditional dependence relationships between nodes. In particular, Markov networks help us to understand complex interactions between genes in the biological processes of a cell. Local Poisson models seem to be promising in modeling positive as well as negative dependencies for count data. Furthermore, when zero counts are more frequent than expected, excess zeros should be considered in the model. We present a penalized Poisson graphical model for zero-inflated count data and derive an expectation-maximization (EM) algorithm built on coordinate descent. Our method is shown to be effective through simulated and real data analysis. Results from the simulated data indicate that our method outperforms the local Poisson graphical model in the presence of excess zeros. In an application to RNA sequencing data, we also investigate the gender effect by comparing the estimated networks...

A robust support vector machine for labeling errors

Communications in Statistics - Simulation and Computation

Categorical Variable Selection in Naïve Bayes Classification

Korean Journal of Applied Statistics

Naïve Bayes classification is based on the assumption that the input variables are conditionally independent given the output variable. The Naïve Bayes assumption is unrealistic but simplifies the problem of high-dimensional joint probability estimation into a series of univariate probability estimations. Thus the Naïve Bayes classifier is often adopted in the analysis of massive data sets, such as in spam e-mail filtering and recommendation systems. In this paper, we propose a variable selection method based on the χ² statistic computed on input and output variables. The proposed method retains the simplicity of the Naïve Bayes classifier in terms of data processing and computation; however, it can select relevant variables. It is expected that our method can be useful in classification problems for ultra-high-dimensional or big data, such as the classification of diseases based on single nucleotide polymorphisms (SNPs).
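As a rough illustration of the idea in this abstract, the sketch below scores each categorical input by the Pearson χ² statistic of its contingency table against the class label and keeps the top-scoring inputs. This is not the authors' procedure, which the abstract does not spell out; the function names, the toy data, and the "keep the k largest statistics" rule are all assumptions made for illustration. When inputs have different numbers of categories, comparing p-values rather than raw statistics would likely be more appropriate.

```python
from collections import Counter


def chi_square_statistic(xs, ys):
    """Pearson chi-square statistic for the contingency table of xs against ys."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # observed cell counts
    x_counts = Counter(xs)         # row totals
    y_counts = Counter(ys)         # column totals
    stat = 0.0
    for x_val, n_x in x_counts.items():
        for y_val, n_y in y_counts.items():
            expected = n_x * n_y / n           # count expected under independence
            observed = joint.get((x_val, y_val), 0)
            stat += (observed - expected) ** 2 / expected
    return stat


def select_variables(columns, ys, k):
    """Return the k column names with the largest statistic, plus all scores."""
    scores = {name: chi_square_statistic(xs, ys) for name, xs in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Hypothetical toy data: one input tied to the label, one unrelated to it.
labels = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]
columns = {
    "has_link": ["y", "y", "y", "y", "n", "n", "n", "n"],
    "weekday": ["mon", "tue", "mon", "tue", "mon", "tue", "mon", "tue"],
}
selected, scores = select_variables(columns, labels, k=1)
print(selected, scores)
```

Each input is scored independently of the others, so the screening step costs one pass over the data per variable, which matches the abstract's point about keeping the computation as simple as Naïve Bayes itself.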

Classification of ratings in online reviews

Journal of the Korean Data and Information Science Society

Sentiment analysis or opinion mining is a text mining technique employed to identify subjective information or opinions of an individual from documents in blogs, reviews, articles, or social networks. In the literature, only the problem of binary classification of ratings based on review texts in online reviews has been considered. However, because there can be positive or negative reviews as well as neutral reviews, a multi-class classification will be more appropriate than the binary classification. To this end, we consider the multi-class classification of ratings based on review texts. In the preprocessing stage, we extract words related to ratings using the chi-square statistic. Then the extracted words are used as input variables to multi-class classifiers, such as support vector machines and the proportional odds model, to compare their predictive performances.

Improving Disease Prediction by Incorporating Family Disease History in Risk Prediction Models with Large-Scale Genetic Data

Genetics

Despite the many successes of genome-wide association studies (GWAS), the known susceptibility variants identified by GWAS have modest effect sizes, leading to notable skepticism about the effectiveness of building a risk prediction model from large-scale genetic data. However, in contrast to genetic variants, the family history of diseases has been largely accepted as an important risk factor in clinical diagnosis and risk prediction. Nevertheless, the complicated structures of the family history of diseases have limited their application in clinical practice. Here, we developed a new method that enables incorporation of the general family history of diseases with a liability threshold model, and propose a new analysis strategy for risk prediction with penalized regression analysis that incorporates both large numbers of genetic variants and clinical risk factors. Application of our model to type 2 diabetes in the Korean population (1846 cases and 1846 controls) demonstrated that single-nucleotide polymorphisms accounted for 32.5% of the variation explained by the predicted risk scores in the test data set, and incorporation of family history led to an additional 6.3% improvement in prediction. Our results illustrate that family medical history provides valuable information on the variation of complex diseases and improves prediction performance.

Evaluation of Penalized and Nonpenalized Methods for Disease Prediction with Large-Scale Genetic Data

BioMed Research International, 2015

Owing to recent improvements in genotyping technology, large-scale genetic data can be utilized to identify disease susceptibility loci, and these findings have substantially improved our understanding of complex diseases. However, in spite of these successes, most of the genetic effects for many complex diseases were found to be very small, which has been a major hurdle in building disease prediction models. Recently, many statistical methods based on penalized regressions have been proposed to tackle the so-called “large P and small N” problem. Penalized regressions, including the least absolute shrinkage and selection operator (LASSO) and ridge regression, limit the space of parameters, and this constraint enables the estimation of effects for a very large number of SNPs. Various extensions have been suggested, and, in this report, we compare their accuracy by applying them to several complex diseases. Our results show that penalized regressions are usually robust and provide better acc...

A Bahadur Representation of the Linear Support Vector Machine

Journal of Machine Learning Research

The support vector machine has been successful in a variety of applications. On the theoretical front as well, statistical properties of the support vector machine have been studied quite extensively, with particular attention to its Bayes risk consistency under some conditions. In this paper, we study somewhat basic statistical properties of the support vector machine yet to be investigated, namely the asymptotic behavior of the coefficients of the linear support vector machine. A Bahadur-type representation of the coefficients is established under appropriate conditions, and their asymptotic normality and statistical variability are derived on the basis of the representation. These asymptotic results not only further our understanding of the support vector machine but can also be useful for related statistical inferences.

Support vector machines for big data analysis

Big data, which has recently attracted attention in industry and academia, cannot be analyzed by the batch processing algorithms developed in data mining because, by definition, it cannot be loaded and processed in the memory of a single system. An imminent issue is therefore to develop learning algorithms that can be applied to big data. In this paper, we review various algorithms for support vector machines in the literature. In particular, we introduce online and parallel processing algorithms that are expected to be useful in big data classification, and we compare the strengths, weaknesses, and performance of those algorithms through simulations for linear classification.

Comparison of model selection criteria for the graphical LASSO

Journal of the Korean Data and Information Science Society, 2014

Abstract: Because a graphical model can represent the conditional independence among random variables as a visual network, it can serve as an intuitive tool for complex stochastic systems in which many variables are interconnected, such as in bioinformatics or social networks. The graphical LASSO (graphical least absolute shrinkage and selection operator) is a method known to be effective in preventing overfitting when estimating a Gaussian graphical model for high-dimensional data. In this paper, we consider model selection, a very important problem in graphical LASSO estimation. In particular, we compare several model selection criteria through simulations and analyze real financial data.

Support vector machines for big data analysis

Journal of the Korean Data and Information Science Society, 2013

Fused least absolute shrinkage and selection operator for credit scoring

Journal of Statistical Computation and Simulation, 2014

Comparison of variable selection methods for support vector machines

Korean Journal of Applied Statistics, 2013

Exploiting Correlations Between Link Flows to Improve Estimation of Average Annual Daily Traffic on Coverage Count Segments: Methodology and Numerical Study

Transportation Research Record, 2005

A method is developed for exploiting correlations among segment flows that result from common origin–destination (OD) path flows when average annual daily traffic (AADT) is being estimated on highway segments sampled with coverage counts. ...

Oracle properties of SCAD-penalized support vector machine

Journal of Statistical Planning and Inference, 2012

In many scientific investigations, a large number of input variables are given at the early stage of modeling, and identifying the variables predictive of the response is often a main purpose of such investigations. Recently, the support vector machine has become an important tool for classification problems in many fields. Several variants of the support vector machine adopting different penalties in its objective function have been proposed. This paper deals with the Fisher consistency and the oracle property of support vector machines in the setting where the dimension of inputs is fixed. First, we study the Fisher consistency of the support vector machine over the class of affine functions. It is shown that the function class for decision functions is crucial for the Fisher consistency. Second, we study the oracle property of the penalized support vector machines with the smoothly clipped absolute deviation penalty. Once we have addressed the Fisher consistency of the support vector machine over the class of affine functions, the oracle property appears to be meaningful in the context of classification. A simulation study is provided in order to show small-sample properties of the penalized support vector machines with the smoothly clipped absolute deviation penalty.
