E. Bura - Academia.edu
Papers by E. Bura
Statistics in Medicine, 2011
We propose a method to combine several predictors (markers) that are measured repeatedly over time into a composite marker score without assuming a model and only requiring a mild condition on the predictor distribution. Assuming that the first and second moments of the predictors can be decomposed into a time and a marker component via a Kronecker product structure that accommodates the longitudinal nature of the predictors, we develop first-moment sufficient dimension reduction techniques to replace the original markers with linear transformations that contain sufficient information for the regression of the predictors on the outcome. These linear combinations can then be combined into a score that has better predictive performance than a score built under a general model that ignores the longitudinal structure of the data. Our methods can be applied to either continuous or categorical outcome measures. In simulations, we focus on binary outcomes and show that our method outperforms existing alternatives by using the AUC, the area under the receiver-operator characteristics (ROC) curve, as a summary measure of the discriminatory ability of a single continuous diagnostic marker for binary disease outcomes.
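As a rough illustration of the summary measure named in this abstract, the empirical AUC of a single continuous marker score can be computed as the proportion of case-control pairs in which the case scores higher, with ties counted as one half. This is a minimal sketch and not the authors' code; the function name `auc` and the toy scores are assumptions made for the example.

```python
def auc(scores_cases, scores_controls):
    """Empirical AUC: fraction of (case, control) pairs where the case
    has the higher score; tied pairs contribute 1/2."""
    if not scores_cases or not scores_controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in scores_cases:
        for d in scores_controls:
            if c > d:
                wins += 1.0
            elif c == d:
                wins += 0.5
    return wins / (len(scores_cases) * len(scores_controls))


# Hypothetical composite marker scores for diseased and non-diseased subjects.
cases = [0.9, 0.8, 0.4]
controls = [0.7, 0.3, 0.2]
print(auc(cases, controls))
```

An AUC of 0.5 corresponds to a score with no discriminatory ability, and 1.0 to perfect separation of cases from controls.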
Statistics in Medicine, 2002
A new technique, denaturing high-performance liquid chromatography (dHPLC), allows for detection of any heterozygous sequence variation in a gene without prior knowledge of the precise location of the sequence change. The results of a dHPLC analysis are recorded in real-time in the form of a chromatogram that is sequence-specific. In this paper we present methods to classify an individual, based on the observed chromatogram, as a homozygous wild-type or a carrier of a specific variant for the given DNA segment by comparison to representative chromatograms that are obtained from the training set of individuals with known variant status. The first approach consists of finding a parsimonious parametric model and then classifying each newly observed curve based on comparing the most discriminating characteristic, the main mode, to the main mode of the training curves. The second approach consists of finding empirical estimates of the modes of each chromatogram and using a bootstrap test for equality with the corresponding estimates of the training curves. We apply both methods to data on the breast cancer susceptibility gene BRCA1 and test the performance of the methods on independent samples.
Statistics & Probability Letters, 2008
In several dimension reduction techniques, the original variables are replaced by a smaller number of linear combinations. The coefficients of these linear combinations are typically the elements of the left singular vectors of a random matrix. We derive the asymptotic distribution of the left singular vectors of a random matrix that has a normal limit distribution. This result is then used to develop a Wald-type test for testing variable importance in Sliced Inverse Regression (SIR) and Sliced Average Variance Estimation (SAVE), two popular sufficient dimension reduction methods.
Law, Probability and Risk, 2011
Law, Probability and Risk, 2009
In two related decisions in Chamber of Commerce of the United States of America v. Securities and Exchange Commission (SEC), the District of Columbia Federal Court of Appeals ruled that the SEC had not fully complied with some provisions of the Administrative Procedures Act when it required that the boards of investment companies managing mutual funds have at least 75% of their membership and the Chairman be independent directors. In preparation for renewed rule making, the Office of Economic Analysis of the SEC prepared a Power Study to respond to an industry-sponsored report claiming that the returns of funds with independent boards and chairs are not superior to funds with boards dominated by management. The Power Study concluded that the available studies on the effectiveness of independent board members do not have sufficient statistical power to detect a meaningful difference in the returns of the two types of funds. This paper demonstrates that the method used by the SEC in their power calculation is not correct, unless a very restrictive condition that rarely occurs in practice holds. When the appropriate power formulas are used, the expected power of studies of the same size as the ones examined by the SEC is actually lower than the corresponding results of the SEC. Thus, the results in the paper actually strengthen the argument that the SEC is advocating. The relevance of both the SEC and the industry studies to the main issue in the case is also questioned in the discussion.
Journal of the American Statistical Association, 2014
There are two general approaches based on inverse regression for estimating the linear sufficient reductions for the regression of Y on X: the moment-based approach such as SIR, PIR, SAVE, and DR, and the likelihood-based approach such as Principal Fitted Components (PFC) and Likelihood Acquired Directions (LAD) when the inverse predictors, X|Y, are normal. Because by construction these methods extract information from the first two conditional moments of X|Y, they can only estimate linear reductions and thus form the linear Sufficient Dimension Reduction (SDR) methodology. When var(X|Y) is constant, E(X|Y) contains the reduction and it can be estimated using PFC. When var(X|Y) is non-constant, PFC misses the information in the variance and second-moment-based methods (SAVE, DR, LAD) are used instead, resulting in efficiency loss in the estimation of the mean-based reduction. In this paper we prove that (a) if X|Y is elliptically contoured with parameters (μ_Y, Δ) and density g_Y, there is no linear non-trivial sufficient reduction except if g_Y is the normal density with constant variance; (b) for non-normal elliptically contoured data, all existing linear SDR methods only estimate part of the reduction; (c) a sufficient reduction of X for the regression of Y on X comprises a linear and a non-linear component.
Journal of Multivariate Analysis, 2011
Journal of Chemical Ecology, 1994
The cuticular chemicals of 124 individual wasps (foundresses and workers) from 23 colonies of Polistes fuscatus were analyzed. The compounds identified, all of which were hydrocarbons, were similar to those of other vespid wasps in that the bulk of the hydrocarbons were 23-33 carbons in chain length. However, the hydrocarbon profile of P. fuscatus differed from those of its congeners in its proportions of straight-chain alkanes, methylalkanes, and alkenes. Three of the 20 identified hydrocarbons, 13- and 15-MeC31, 11,15- and 13,17-diMeC31, and 13-, 15-, and 17-MeC33, had properties postulated for recognition pheromones: colony specificity, efficacy in assigning wasps to the appropriate colony, heritability, lack of differences between foundresses and workers, and distinctive stereochemistry.
Bioinformatics, 2003
We introduce simple graphical classification and prediction tools for tumor status using gene expression profiles. They are based on two dimension estimation techniques, sliced average variance estimation (SAVE) and sliced inverse regression (SIR). Both SAVE and SIR are used to infer the dimension of the classification problem and obtain linear combinations of genes that contain sufficient information to predict class membership, such as tumor type. Plots of the estimated directions as well as numerical thresholds estimated from the plots are used to predict tumor classes in cDNA microarrays, and the performance of the class predictors is assessed by cross-validation. A microarray simulation study is carried out to compare the power and predictive accuracy of the two methods. Results: The methods are applied to cDNA microarray data on BRCA1 and BRCA2 mutation carriers as well as sporadic tumors from Hedenfalk et al. (2001). All samples are correctly classified.
In cases involving possible discrimination in hiring or promotion, plaintiffs allege that they were treated differently than similarly qualified majority individuals. The data are typically analyzed using logistic regression with a minority indicator variable. Alternatively, the Peters-Belson (PB) regression method, which fits a regression model to the majority data and compares the status of each minority member to its prediction obtained from the majority equation, has also been accepted by courts. The average difference estimates the disparity in treatment accounting for job-related covariates. The appropriateness of these parametric models depends on whether they reflect the process generating the data. To lessen the dependence of the ultimate inference on the assumed parametric model, the majority equation is fit by local linear logistic regression and the response of each minority is predicted from it. Large sample properties of this PB-type procedure are obtained and a simulation study shows that the method loses little power relative to parametric methods even when the assumed parametric method is correct. Moreover, it yields more reliable estimates of the disparity when the data do not follow the assumed model. Data from the Berger v. Iron Workers Local 201 case are used to illustrate the method.
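The Peters-Belson idea described in this abstract can be sketched in a few lines: fit a model to the majority group only, predict each minority member's outcome from that majority equation, and average the observed-minus-predicted differences. The sketch below is an assumption-laden illustration, not the paper's procedure: it uses an ordinary parametric logistic regression with a single covariate in place of the local linear logistic fit the paper proposes, and the function names and toy data are hypothetical.

```python
import math


def fit_logistic(x, y, iters=50, tol=1e-10):
    """Fit P(y=1 | x) = 1 / (1 + exp(-(a + b*x))) by Newton-Raphson."""
    a, b = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = p * (1.0 - p)
            g0 += yi - p              # score for the intercept
            g1 += (yi - p) * xi       # score for the slope
            h00 += w                  # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det < 1e-12:               # singular information: stop updating
            break
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a, b = a + da, b + db
        if abs(da) + abs(db) < tol:
            break
    return a, b


def pb_disparity(x_maj, y_maj, x_min, y_min):
    """Average of (observed minority outcome - prediction from the
    majority equation); negative values suggest a minority shortfall."""
    a, b = fit_logistic(x_maj, y_maj)
    diffs = [yi - 1.0 / (1.0 + math.exp(-(a + b * xi)))
             for xi, yi in zip(x_min, y_min)]
    return sum(diffs) / len(diffs)


# Hypothetical data: x is a job-related covariate, y is 1 if promoted.
x_maj = [1, 2, 3, 4, 5, 6, 7, 8]
y_maj = [0, 0, 1, 0, 1, 1, 0, 1]
x_min = [3, 5, 6, 7]
y_min = [0, 0, 1, 0]
print(pb_disparity(x_maj, y_maj, x_min, y_min))
```

Applying `pb_disparity` to the majority group itself returns approximately zero, because the fitted logistic residuals sum to zero when the model includes an intercept; this gives a quick sanity check on the fit.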
Biophysical Journal, 2008
Statistical analyses of forced unfolding data for protein tandems, i.e., unfolding forces (force-ramp) and unfolding times (force-clamp), used in single-molecule dynamic force spectroscopy rely on the assumption that the unfolding transitions of individual protein domains are independent (uncorrelated) and characterized, respectively, by identically distributed unfolding forces and unfolding times. In our previous work, we showed that in the experimentally accessible