Carlos Daniel Paulino | Universidade de Lisboa

Papers by Carlos Daniel Paulino

Modelos de interacção genética de dois genes em fenótipos (Models of genetic interaction of two genes in phenotypes)

In previous work, several statistical models for penetrance were proposed in order to infer the interaction of two diallelic genes in the construction of complex binary phenotypes: independent action models, inhibition models and models of ...

Optimal sample size for estimating the mean concentration of invasive organisms in ballast water via a semiparametric Bayesian analysis

Statistical Methods & Applications

Non-homogeneous Poisson models with a change-point: an application to ozone peaks in Mexico City

Environmental and Ecological Statistics, 2009

In this paper, we use some non-homogeneous Poisson models in order to study the behavior of ozone measurements in Mexico City. We assume that the number of ozone peaks follows a non-homogeneous Poisson process. We consider four types of rate ...

Verifying compliance with ballast water standards: a decision-theoretic approach

We construct credible intervals to estimate the mean organism (zooplankton and phytoplankton) concentration in ballast water via a decision-theoretic approach. To obtain the required optimal sample size, we use a total cost minimization criterion defined as the sum of the sampling cost and the Bayes risk either under a Poisson or a negative binomial model for organism counts, both with a gamma prior distribution. Such credible intervals may be employed to verify whether the ballast water discharged from a ship is in compliance with international standards. We also conduct a simulation study to evaluate the credible interval lengths associated with the proposed optimal sample sizes.
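As a rough illustration of the Poisson model with a gamma prior mentioned in this abstract, the sketch below performs the standard conjugate update and approximates an equal-tailed credible interval by Monte Carlo. It is not the paper's decision-theoretic procedure and does not compute an optimal sample size. The function names, the equal-volume aliquots, the shape/rate parameterization of the prior and the Monte Carlo interval are all assumptions made for this example.

```python
import random


def gamma_poisson_posterior(counts, volume_per_aliquot, prior_shape, prior_rate):
    """Conjugate update: counts ~ Poisson(concentration * volume), concentration ~ Gamma(shape, rate)."""
    shape = prior_shape + sum(counts)
    rate = prior_rate + len(counts) * volume_per_aliquot
    return shape, rate


def credible_interval(shape, rate, level=0.95, draws=20000, seed=0):
    """Equal-tailed credible interval approximated from sorted posterior draws (illustrative only)."""
    rng = random.Random(seed)
    # random.gammavariate takes a shape and a SCALE parameter, hence 1 / rate.
    samples = sorted(rng.gammavariate(shape, 1.0 / rate) for _ in range(draws))
    lo_idx = int((1.0 - level) / 2.0 * draws)
    hi_idx = int((1.0 + level) / 2.0 * draws) - 1
    return samples[lo_idx], samples[hi_idx]


# Hypothetical data: organism counts in three aliquots of unit volume.
shape, rate = gamma_poisson_posterior([3, 5, 4], 1.0, prior_shape=1.0, prior_rate=1.0)
lower, upper = credible_interval(shape, rate)
posterior_mean = shape / rate
```

Under these assumptions, a compliance check could compare the upper limit of the interval with a regulatory threshold; the threshold and the decision rule would have to be taken from the relevant standard, which is not reproduced here.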

Computational Bayesian Statistics

Meaningful use of advanced Bayesian methods requires a good understanding of the fundamentals. This engaging book explains the ideas that underpin the construction and analysis of Bayesian models, with particular focus on computational methods and schemes. The unique features of the text are the extensive discussion of available software packages combined with a brief but complete and mathematically rigorous introduction to Bayesian inference. The text introduces Monte Carlo methods, Markov chain Monte Carlo methods, and Bayesian software, with additional material on model validation and comparison, transdimensional MCMC, and conditionally Gaussian models. The inclusion of problems makes the book suitable as a textbook for a first graduate-level course in Bayesian computation with a focus on Monte Carlo methods. The extensive discussion of Bayesian software (R/R-INLA, OpenBUGS, JAGS, STAN, and BayesX) makes it useful also for researchers and graduate students from beyond statistics.

Sample size for estimating organism concentration in ballast water: A Bayesian approach

Estimation of microorganism concentration in ballast water tanks is important to evaluate and possibly to prevent the introduction of invasive species in stable ecosystems. For such purpose, the number of organisms in ballast water aliquots must be counted and used to estimate their concentration with some precision requirement. Poisson and negative binomial models have been employed to describe the organism distribution in the tank, but determination of sample sizes required to generate estimates with pre-specified precision is still not well established. A Bayesian approach is a flexible alternative to accommodate adequate models that account for the heterogeneous distribution of the organisms and may provide a sequential way of enhancing the estimation procedure by updating the prior distribution along the ballast water discharging process. We adopt such an approach to compute sample sizes required to construct credible intervals obtained via two optimality criteria that have not...

Catdata: Software for Analysis of Categorical Data with Complete or Missing Responses

We present a collection of computational routines written in the R language (R Development Core Team, 2007) for the analysis of categorical data with complete or missing responses under a product-multinomial scenario. For complete data or incomplete data generated by an ignorable missingness mechanism as defined in Little and Rubin (2002, Wiley), linear and log-linear models may be fitted via maximum likelihood (ML). Weighted least squares (WLS) methodology may as well be used to fit more general functional linear models for complete data or for incomplete data if a missing completely at random (MCAR) mechanism is assumed. The software also allows a hybrid approach, where ML is used in a first stage, and the estimated marginal probabilities of categorization and their covariance matrix are used in a second stage to fit the model via WLS, in the spirit of functional asymptotic regression methodology described by Imrey, Koch, Stokes et al. (1981, 1982, International Statistical Review...

Análise bayesiana semiparamétrica de resposta binária com covariável contínua sujeita a omissão não aleatória (Semiparametric Bayesian analysis of binary response with a continuous covariate subject to non-random missingness)

The financial support granted to this research is gratefully acknowledged: Frederico Z. Poleto and Julio M. Singer, by Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES), Brazil, Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP), Brazil, and Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), Brazil; Carlos Daniel Paulino, by Fundação para a Ciência e Tecnologia (FCT) through the CEAUL-FCUL unit, Portugal, and projects Pest-OE/MAT/UI0006 of 2011 and 2014; Geert Molenberghs, by IAP research network P6/03 of the Belgian Government (Belgian Science Policy). The authors thank Dr. Arnaud Perrier and Dr. Henri Bounameaux of the University Hospital of Geneva for providing the data set.

A fair comparison of credible and confidence intervals: an example with binomial proportions

Bayesian comparison of diagnostic tests with largely non-informative missing data

Journal of Statistical Computation and Simulation

This work was motivated by a real problem of comparing binary diagnostic tests based upon a gold standard, where the collected data showed that the large majority of classifications were incomplete and the feedback received from the medical doctors allowed us to consider the missingness as non-informative. Taking into account the degree of data incompleteness, we used a Bayesian approach via MCMC methods for drawing inferences of interest on accuracy measures. Its direct implementation by well-known software demonstrated serious problems of chain convergence. The difficulties were overcome by the proposal of a simple, efficient and easily adaptable data augmentation algorithm, performed through an ad hoc computer program.

A product-multinomial framework for categorical data analysis with missing responses

Brazilian Journal of Probability and Statistics, 2014

With the objective of analysing categorical data with missing responses, we extend the multinomial modelling scenario described by Paulino (Braz. J. Probab. Stat. 5 (1991) 1-42) to a product-multinomial framework that allows the inclusion of explanatory variables. We consider maximum likelihood (ML) and weighted least squares (WLS) as well as a hybrid ML/WLS approach to fit linear, log-linear and more general functional linear models under ignorable and nonignorable missing data mechanisms. We express the results in a unified matrix notation that may be easily used for their computational implementation and develop such a set of subroutines in R. We illustrate the procedures with the analysis of two data sets, and perform simulations to assess the properties of the estimators.

Sample size and power calculations for detecting changes in malaria transmission using antibody seroconversion rate

Malaria Journal, 2015

Several studies have highlighted the use of serological data in detecting a reduction in malaria transmission intensity. These studies have typically used serology as an adjunct measure and no formal examination of sample size calculations for this approach has been conducted. A sample size calculator is proposed for cross-sectional surveys using data simulation from a reverse catalytic model assuming a reduction in seroconversion rate (SCR) at a given change point before sampling. This calculator is based on logistic approximations for the underlying power curves to detect a reduction in SCR in relation to the hypothesis of a stable SCR for the same data. Sample sizes are illustrated for a hypothetical cross-sectional survey from an African population assuming a known or unknown change point. Overall, data simulation demonstrates that power is strongly affected by assuming a known or unknown change point. Small sample sizes are sufficient to detect strong reductions in SCR, but invariantly lead to poor precision of estimates for current SCR. In this situation, sample size is better determined by controlling the precision of SCR estimates. Conversely, larger sample sizes are required for detecting more subtle reductions in malaria transmission, but those invariantly increase precision whilst reducing putative estimation bias. The proposed sample size calculator, although based on data simulation, shows promise of being easily applicable to a range of populations and survey types. Since the change point is a major source of uncertainty, obtaining or assuming prior information about this parameter might reduce both the sample size and the chance of generating biased SCR estimates.
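To make the reverse catalytic model with a change point more concrete, the sketch below evaluates the expected seroprevalence by age under the usual formulation dP/da = SCR * (1 - P) - SRR * P. This is only an illustration of the model named in the abstract, not the authors' calculator: the seroreversion rate (srr), the function names and the parameter values are assumptions introduced here.

```python
import math


def seroprevalence_stable(age, scr, srr):
    """Expected seroprevalence at a given age under a constant seroconversion rate."""
    total = scr + srr
    return scr / total * (1.0 - math.exp(-total * age))


def seroprevalence_change_point(age, scr_before, scr_after, srr, years_since_change):
    """Expected seroprevalence when the SCR changed `years_since_change` years before sampling."""
    if age <= years_since_change:
        # Individuals born after the change only experienced the new rate.
        return seroprevalence_stable(age, scr_after, srr)
    # Prevalence reached at the change point, under the old rate.
    p_at_change = seroprevalence_stable(age - years_since_change, scr_before, srr)
    total = scr_after + srr
    decay = math.exp(-total * years_since_change)
    # Relax from p_at_change towards the new equilibrium scr_after / total.
    return p_at_change * decay + scr_after / total * (1.0 - decay)


# Hypothetical values: SCR drops from 0.10 to 0.02 per year, 10 years before the survey.
p_30_stable = seroprevalence_stable(30, scr=0.10, srr=0.01)
p_30_reduced = seroprevalence_change_point(30, 0.10, 0.02, 0.01, years_since_change=10)
```

Simulating binomial seropositivity counts by age group from these curves, and refitting both the stable and the change-point model, is one way a simulation-based power study of this kind could be set up.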

Bayesian Genetic Mapping of Binary Trait Loci

Advances in Regression, Survival Analysis, Extreme Values, Markov Processes and Other Statistical Applications, 2013

New advances in statistical modeling and applications

Estimation of T-cell repertoire diversity and clonal size distribution by Poisson abundance models

Journal of Immunological Methods, Feb 28, 2010

The answer to many fundamental questions in Immunology requires the quantitative characterization of the T-cell repertoire, namely T cell receptor (TCR) diversity and clonal size distribution. An increasing number of repertoire studies are based on sequencing of the TCR variable regions in T-cell samples from which one tries to estimate the diversity of the original T-cell populations. Hitherto, estimation of TCR diversity was tackled either by a "standard" method that assumes a homogeneous clonal size distribution, or by non-parametric methods, such as the abundance-coverage and incidence-coverage estimators. However, both methods show caveats. On the one hand, the samples exhibit clonal size distributions with heavy right tails, a feature that is incompatible with the assumption of an equal frequency of every TCR sequence in the repertoire. Thus, this "standard" method produces inaccurate estimates. On the other hand, non-parametric estimators are robust in a wide range of situations, but per se provide no information about the clonal size distribution. This paper redeploys Poisson abundance models from Ecology to overcome the limitations of the above inferential procedures. These models assume that each TCR variant is sampled according to a Poisson distribution with a specific sampling rate, itself varying according to some Exponential, Gamma, or Lognormal distribution, or still an appropriate mixture of Exponential distributions. With these models, one can estimate the clonal size distribution in addition to TCR diversity of the repertoire. A procedure is suggested to evaluate robustness of diversity estimates with respect to the most abundant sampled TCR sequences. For illustrative purposes, previously published data on mice with limited TCR diversity are analyzed. Two of the presented models are more consistent with the data and give the most robust TCR diversity estimates. They suggest that clonal sizes follow either a Lognormal or an appropriate mixture of Exponential distributions. According to the ecological interpretation of these models, the T-cell repertoire would be divided into several T-cell niches, themselves created in a series of steps. Definitive conclusions, however, would require larger samples. It is shown here that samples 100-fold larger than hitherto available ones would be sufficient to discriminate candidate models. These large sample sizes are currently affordable using massively parallel sequencing technology. Foreseeing this, we provide the package PAM for the R software that will facilitate T-cell repertoire data analysis based on Poisson abundance models.
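The simplest case described in this abstract, a Poisson count whose sampling rate follows an Exponential distribution, has a closed-form marginal distribution, which the sketch below uses to illustrate how a diversity estimate can account for variants that were never sampled. This is a simplified illustration and not the interface of the PAM package: the function names are hypothetical, and the Exponential rate parameter theta is assumed known here, whereas in practice it would be estimated from the observed counts.

```python
def exp_mixed_poisson_pmf(k, theta):
    """P(count = k) when count ~ Poisson(rate) and rate ~ Exponential(theta).

    Integrating the Poisson probability over the Exponential density gives
    theta / (1 + theta) ** (k + 1), a geometric distribution.
    """
    return (theta / (1.0 + theta)) * (1.0 / (1.0 + theta)) ** k


def estimated_total_variants(observed_variants, theta):
    """Scale the number of observed variants by the probability of being seen at least once."""
    p_unseen = exp_mixed_poisson_pmf(0, theta)
    return observed_variants / (1.0 - p_unseen)


# Hypothetical example: 100 distinct TCR sequences observed and theta = 1,
# so half of the variants in the repertoire are expected to be missed.
total = estimated_total_variants(100, theta=1.0)
```

The Gamma, Lognormal and Exponential-mixture cases follow the same idea with a different mixing density, although the Lognormal case has no closed-form marginal and needs numerical integration.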

Analyzing categorical data with complete or missing responses using the Catdata package

The objective of this document is to introduce the reader to the functions of the Catdata package and to show how they may be used to perform analyses of categorical data with missing or complete responses.

CATDATA: SOFTWARE FOR ANALYSIS OF CATEGORICAL DATA WITH COMPLETE OR MISSING RESPONSES

We present a collection of computational routines written in the R language for the analysis of categorical data with complete or missing responses under a product-multinomial scenario. For complete data or incomplete data generated by an ignorable missingness mechanism as defined in Little and Rubin (2002, Wiley), linear and log-linear models may be fitted via maximum likelihood (ML). Weighted least squares (WLS) methodology may as well be used to fit more general functional linear models for complete data or for incomplete data if a missing completely at random (MCAR) mechanism is assumed. The software also allows a hybrid approach, where ML is used in a first stage, and the estimated marginal probabilities of categorization and their covariance matrix are used in a second stage to fit the model via WLS, in the spirit of functional asymptotic regression methodology described by Imrey, Koch, Stokes et al. (1981, 1982, International Statistical Review) for complete data. The required computations are automatically conducted for complete data or for incomplete data when missing at random (MAR) or MCAR mechanisms are considered. For missing not at random (MNAR) mechanisms, the first step must be programmed by the user via one of the built-in optimization functions in the R software. Model formulation and use of the functions are similar to GENCAT, a program developed by Landis, Stanish, Freeman and Koch (1976, Computer Programs in Biomedicine), or to SAS' PROC CATMOD. We illustrate the procedures with three examples in the field of Biostatistics extracted from Paulino and Singer (2006, Blücher). The first involves fitting a regular loglinear model to a problem with complete data, the second deals with longitudinal data and the third is focused on incomplete data.

Semi-parametric Bayesian analysis of binary responses with a continuous covariate subject to non-random missingness

Statistical Modelling, 2014

Missingness in explanatory variables requires a model for the covariates even if the interest lies only in a model for the outcomes given the covariates. An incorrect specification of the models for the covariates or for the missingness mechanism may lead to biased inferences for the parameters of interest. Previously published articles either use semi-/non-parametric flexible distributions for the covariates and identify the model via a missing at random assumption, or employ parametric distributions for the covariates and allow a more general non-random missingness mechanism. We consider the analysis of binary responses, combining a missing not at random mechanism with a nonparametric model based on a Dirichlet process mixture for the continuous covariates. We illustrate the proposal with simulations and the analysis of a dataset.

New Advances in Statistical Modeling and Applications

Analysis of rates in incomplete Poisson data

Journal of the Royal Statistical Society: Series D (The Statistician), 2003

Discrete data assumed to be generated by independent Poisson distributions are subject to censoring processes which determine the incomplete classification that is reported in several practical situations. The paper describes a probabilistic model that can explain incomplete data referring to partitions of the original set of categories. We overcome the model's lack of identifiability by assuming a missingness at random censoring mechanism. On the basis of the subsequent statistical model the results required to fit further non-informative censoring models by maximum likelihood methodology are obtained. This preliminary analysis opens the way to the analysis of structural models for Poisson expected rates. The paper describes how to test the fit of structural models for the Poisson rates, whether taken individually or simultaneously with special censoring models. Then, the maximum likelihood methodology is specialized to the analysis of strictly linear and log-linear models in such a way that its computational implementation opens up for any count table. The methods developed throughout the work are illustrated with a breast cancer data set.

Research paper thumbnail of Modelos de interacção genética de dois genes em fenótipos

Em trabalhos anteriores foram propostos diversos modelos estatísticos para a penetrância de forma... more Em trabalhos anteriores foram propostos diversos modelos estatísticos para a penetrância de forma a inferir a interacção de dois genes dial´ elicos na construção de fenótipos binários complexos: modelos de acção independente, modelos de inibição e modelos de ...

Research paper thumbnail of Optimal sample size for estimating the mean concentration of invasive organisms in ballast water via a semiparametric Bayesian analysis

Statistical Methods & Applications

Research paper thumbnail of Non-homogeneous Poisson models with a change-point: an application to ozone peaks in Mexico city

Environmental and Ecological Statistics, 2009

Abstract In this paper, we use some non-homogeneous Poisson models in order to study the behavior... more Abstract In this paper, we use some non-homogeneous Poisson models in order to study the behavior of ozone measurements in Mexico City. We assume that the number of ozone peaks follows a non-homogeneous Poisson process. We consider four types of rate ...

Research paper thumbnail of Verifying compliance with ballast water standards : a decision-theoretic approach

We construct credible intervals to estimate the mean organism (zooplankton and phytoplankton) con... more We construct credible intervals to estimate the mean organism (zooplankton and phytoplankton) concentration in ballast water via a decision-theoretic approach. To obtain the required optimal sample size, we use a total cost minimization criterion defined as the sum of the sampling cost and the Bayes risk either under a Poisson or a negative binomial model for organism counts, both with a gamma prior distribution. Such credible intervals may be employed to verify whether the ballast water discharged from a ship is in compliance with international standards. We also conduct a simulation study to evaluate the credible interval lengths associated with the proposed optimal sample sizes

Research paper thumbnail of Computational Bayesian Statistics

Meaningful use of advanced Bayesian methods requires a good understanding of the fundamentals. Th... more Meaningful use of advanced Bayesian methods requires a good understanding of the fundamentals. This engaging book explains the ideas that underpin the construction and analysis of Bayesian models, with particular focus on computational methods and schemes. The unique features of the text are the extensive discussion of available software packages combined with a brief but complete and mathematically rigorous introduction to Bayesian inference. The text introduces Monte Carlo methods, Markov chain Monte Carlo methods, and Bayesian software, with additional material on model validation and comparison, transdimensional MCMC, and conditionally Gaussian models. The inclusion of problems makes the book suitable as a textbook for a first graduate-level course in Bayesian computation with a focus on Monte Carlo methods. The extensive discussion of Bayesian software - R/R-INLA, OpenBUGS, JAGS, STAN, and BayesX - makes it useful also for researchers and graduate students from beyond statistics.

Research paper thumbnail of Sample size for estimating organism concentration in ballast water: A Bayesian approach

Estimation of microorganism concentration in ballast water tanks is important to evaluate and pos... more Estimation of microorganism concentration in ballast water tanks is important to evaluate and possibly to prevent the introduction of invasive species in stable ecosystems. For such purpose, the number of organisms in ballast water aliquots must be counted and used to estimate their concentration with some precision requirement. Poisson and negative binomial models have been employed to describe the organism distribution in the tank, but determination of sample sizes required to generate estimates with pre-specified precision is still not well established. A Bayesian approach is a flexible alternative to accommodate adequate models that account for the heterogeneous distribution of the organisms and may provide a sequential way of enhancing the estimation procedure by updating the prior distribution along the ballast water discharging process. We adopt such an approach to compute sample sizes required to construct credible intervals obtained via two optimality criteria that have not...

Research paper thumbnail of Catdata: Software for Analysis of Categorical Data with Complete or Missing Responses

We present a collection of computational routines written in the R language (R Development Core T... more We present a collection of computational routines written in the R language (R Development Core Team, 2007) for the analysis of categorical data with complete or missing responses under a product-multinomial scenario. For complete data or incomplete data generated by an ignorable missingness mechanism as defined in Little and Rubin (2002, Wiley), linear and log-linear models may be fitted via maximum likelihood (ML). Weighted least squares (WLS) methodology may as well be used to fit more general functional linear models for complete data or for incomplete data if a missing completely at random (MCAR) mechanism is assumed. The software also allows a hybrid approach, where ML is used in a first stage, and the estimated marginal probabilities of categorization and their covariance matrix are used in a second stage to fit the model via WLS, in the spirit of functional asymptotic regression methodology described by Imrey, Koch, Stokes et al. (1981, 1982, International Statistical Review...

Research paper thumbnail of Análise bayesiana semiparamétrica de resposta binária com covariável contínua sujeita a omissão não aleatória

Expressa-se com gratid˜ao os apoios financeiros concedidos a este trabalho de investiga¸c˜ao: Fre... more Expressa-se com gratid˜ao os apoios financeiros concedidos a este trabalho de investiga¸c˜ao: Frederico Z. Poleto e Julio M. Singer, pela Coordena¸c˜ao de Aperfei¸coamento de Pessoal de N´ivel Superior (CAPES), Brasil, Funda¸c˜ao de Amparo `a Pesquisa do Estado de S˜ao Paulo (FAPESP), Brasil, e Conselho Nacional de Desenvolvimento Cient´ifico e Tecnol´ogico (CNPq), Brasil; Carlos Daniel Paulino, pela Funda¸c˜ao para a Ciˆencia e Tecnologia (FCT) atrav´es da unidade CEAUL-FCUL, Portugal e projetos Pest-OE/MAT/UI0006 de 2011 e 2014; Geert Molenberghs, por IAP research network P6/03 do Governo Belga (Belgian Science Policy). Os autores agradecem ao Dr. Arnaud Perrier e ao Dr. Henri Bounameaux do Hospital Universit´ario de Genebra por fornecerem o conjunto de dados.

Research paper thumbnail of A fair comparison of credible and confidence intervals: an example with binomial proportions

Research paper thumbnail of Bayesian comparison of diagnostic tests with largely non-informative missing data

Journal of Statistical Computation and Simulation

ABSTRACT This work was motivated by a real problem of comparing binary diagnostic tests based upo... more ABSTRACT This work was motivated by a real problem of comparing binary diagnostic tests based upon a gold standard, where the collected data showed that the large majority of classifications were incomplete and the feedback received from the medical doctors allowed us to consider the missingness as non-informative. Taking into account the degree of data incompleteness, we used a Bayesian approach via MCMC methods for drawing inferences of interest on accuracy measures. Its direct implementation by well-known software demonstrated serious problems of chain convergence. The difficulties were overcome by the proposal of a simple, efficient and easily adaptable data augmentation algorithm, performed through an ad hoc computer program.

Research paper thumbnail of A product-multinomial framework for categorical data analysis with missing responses

Brazilian Journal of Probability and Statistics, 2014

With the objective of analysing categorical data with missing responses, we extend the multinomia... more With the objective of analysing categorical data with missing responses, we extend the multinomial modelling scenario described by Paulino (Braz. J. Probab. Stat. 5 (1991) 1-42) to a product-multinomial framework that allows the inclusion of explanatory variables. We consider maximum likelihood (ML) and weighted least squares (WLS) as well as a hybrid ML/WLS approach to fit linear, log-linear and more general functional linear models under ignorable and nonignorable missing data mechanisms. We express the results in an unified matrix notation that may be easily used for their computational implementation and develop such a set of subroutines in R. We illustrate the procedures with the analysis of two data sets, and perform simulations to assess the properties of the estimators.

Research paper thumbnail of Sample size and power calculations for detecting changes in malaria transmission using antibody seroconversion rate

Malaria Journal, 2015

Several studies have highlighted the use of serological data in detecting a reduction in malaria ... more Several studies have highlighted the use of serological data in detecting a reduction in malaria transmission intensity. These studies have typically used serology as an adjunct measure and no formal examination of sample size calculations for this approach has been conducted. A sample size calculator is proposed for cross-sectional surveys using data simulation from a reverse catalytic model assuming a reduction in seroconversion rate (SCR) at a given change point before sampling. This calculator is based on logistic approximations for the underlying power curves to detect a reduction in SCR in relation to the hypothesis of a stable SCR for the same data. Sample sizes are illustrated for a hypothetical cross-sectional survey from an African population assuming a known or unknown change point. Overall, data simulation demonstrates that power is strongly affected by assuming a known or unknown change point. Small sample sizes are sufficient to detect strong reductions in SCR, but invariantly lead to poor precision of estimates for current SCR. In this situation, sample size is better determined by controlling the precision of SCR estimates. Conversely larger sample sizes are required for detecting more subtle reductions in malaria transmission but those invariantly increase precision whilst reducing putative estimation bias. The proposed sample size calculator, although based on data simulation, shows promise of being easily applicable to a range of populations and survey types. Since the change point is a major source of uncertainty, obtaining or assuming prior information about this parameter might reduce both the sample size and the chance of generating biased SCR estimates.

Research paper thumbnail of Bayesian Genetic Mapping of Binary Trait Loci

Advances in Regression, Survival Analysis, Extreme Values, Markov Processes and Other Statistical Applications, 2013

Research paper thumbnail of New advances in statistical modeling and applications

Research paper thumbnail of Estimation of T-cell repertoire diversity and clonal size distribution by Poisson abundance models

Journal of Immunological Methods, Feb 28, 2010

The answer to many fundamental questions in Immunology requires the quantitative characterization... more The answer to many fundamental questions in Immunology requires the quantitative characterization of the T-cell repertoire, namely T cell receptor (TCR) diversity and clonal size distribution. An increasing number of repertoire studies are based on sequencing of the TCR variable regions in T-cell samples from which one tries to estimate the diversity of the original T-cell populations. Hitherto, estimation of TCR diversity was tackled either by a "standard" method that assumes a homogeneous clonal size distribution, or by non-parametric methods, such as the abundance-coverage and incidence-coverage estimators. However, both methods show caveats. On the one hand, the samples exhibit clonal size distributions with heavy right tails, a feature that is incompatible with the assumption of an equal frequency of every TCR sequence in the repertoire. Thus, this "standard" method produces inaccurate estimates. On the other hand, non-parametric estimators are robust in a wide range of situations, but per se provide no information about the clonal size distribution. This paper redeploys Poisson abundance models from Ecology to overcome the limitations of the above inferential procedures. These models assume that each TCR variant is sampled according to a Poisson distribution with a specific sampling rate, itself varying according to some Exponential, Gamma, or Lognormal distribution, or still an appropriate mixture of Exponential distributions. With these models, one can estimate the clonal size distribution in addition to TCR diversity of the repertoire. A procedure is suggested to evaluate robustness of diversity estimates with respect to the most abundant sampled TCR sequences. For illustrative purposes, previously published data on mice with limited TCR diversity are analyzed. 
Two of the presented models are more consistent with the data and give the most robust TCR diversity estimates. They suggest that clonal sizes follow either a Lognormal or an appropriate mixture of Exponential distributions. According to the ecological interpretation of these models, the T-cell repertoire would be divided into several T-cell niches, themselves created in a series of steps. Definitive conclusions, however, would require larger samples. It is shown here that samples 100-fold larger than hitherto available ones would be sufficient to discriminate candidate models. These large sample sizes are currently affordable using massively parallel sequencing technology. Foreseeing this, we provide the package PAM for the R software, which will facilitate T-cell repertoire data analysis based on Poisson abundance models.
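As a rough illustration of the idea behind a Poisson abundance model, the sketch below works through the simplest case named in the abstract, an Exponential distribution for the sampling rates. It is not the PAM package or the authors' procedure; the function names and the example counts are made up for illustration. Under this assumption the count of each variant is geometric, so the probability that a variant goes unseen has a closed form, and the observed number of distinct sequences can be scaled up to a diversity estimate.

```python
def poisson_exponential_pmf(k, mean_rate):
    """P(count = k) when count ~ Poisson(rate) and rate ~ Exponential with the given mean.

    Integrating the Poisson pmf over the Exponential density yields a geometric
    distribution on k = 0, 1, 2, ... with success probability p = 1 / (1 + mean_rate).
    """
    p = 1.0 / (1.0 + mean_rate)
    return p * (1.0 - p) ** k


def estimate_diversity(counts):
    """Estimate total diversity from the observed (non-zero) counts per distinct sequence.

    Hypothetical helper: fits the Exponential-mixed model by maximum likelihood
    on the zero-truncated counts, then divides the observed richness by the
    estimated probability of being sampled at least once.
    """
    observed = [c for c in counts if c > 0]
    s_obs = len(observed)   # number of distinct sequences seen
    n_total = sum(observed)  # total number of sequences read
    if s_obs == 0:
        raise ValueError("no observed sequences")
    if n_total == s_obs:
        raise ValueError("all sequences are singletons; the estimate is unbounded")

    # MLE for a zero-truncated geometric distribution: p = 1 / (mean observed count).
    p_hat = s_obs / n_total
    mean_rate = (1.0 - p_hat) / p_hat

    # Probability that a variant present in the repertoire is missed by the sample.
    p_unseen = poisson_exponential_pmf(0, mean_rate)
    diversity = s_obs / (1.0 - p_unseen)
    return diversity, mean_rate


# Made-up example: 8 distinct sequences observed, 25 reads in total.
example_counts = [1, 1, 1, 2, 2, 3, 5, 10]
diversity_hat, rate_hat = estimate_diversity(example_counts)
print(f"estimated diversity: {diversity_hat:.2f}, mean sampling rate: {rate_hat:.3f}")
```

The Gamma, Lognormal, and Exponential-mixture variants mentioned in the abstract have no such simple closed form for the truncated likelihood and would need numerical optimization.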

Research paper thumbnail of Analyzing categorical data with complete or missing responses using the Catdata package

The objective of this document is to introduce the reader to the functions of the Catdata package and to show how they may be used to perform analyses of categorical data with missing or complete responses.

Research paper thumbnail of CATDATA: SOFTWARE FOR ANALYSIS OF CATEGORICAL DATA WITH COMPLETE OR MISSING RESPONSES

We present a collection of computational routines written in the R language for the analysis of categorical data with complete or missing responses under a product-multinomial scenario. For complete data or incomplete data generated by an ignorable missingness mechanism as defined in Little and Rubin (2002, Wiley), linear and log-linear models may be fitted via maximum likelihood (ML). Weighted least squares (WLS) methodology may also be used to fit more general functional linear models for complete data or for incomplete data if a missing completely at random (MCAR) mechanism is assumed. The software also allows a hybrid approach, where ML is used in a first stage, and the estimated marginal probabilities of categorization and their covariance matrix are used in a second stage to fit the model via WLS, in the spirit of the functional asymptotic regression methodology described by Imrey, Koch, Stokes et al. (1981, 1982, International Statistical Review) for complete data. The required computations are automatically conducted for complete data or for incomplete data when missing at random (MAR) or MCAR mechanisms are considered. For missing not at random (MNAR) mechanisms, the first step must be programmed by the user via one of the built-in optimization functions in the R software. Model formulation and use of the functions are similar to those of GENCAT, a program developed by Landis, Stanish, Freeman and Koch (1976, Computer Programs in Biomedicine), or of SAS PROC CATMOD. We illustrate the procedures with three examples in the field of Biostatistics extracted from Paulino and Singer (2006, Blücher). The first involves fitting a regular log-linear model to a problem with complete data, the second deals with longitudinal data and the third is focused on incomplete data.

Research paper thumbnail of Semi-parametric Bayesian analysis of binary responses with a continuous covariate subject to non-random missingness

Statistical Modelling, 2014

Missingness in explanatory variables requires a model for the covariates even if the interest lies only in a model for the outcomes given the covariates. An incorrect specification of the models for the covariates or for the missingness mechanism may lead to biased inferences for the parameters of interest. Previously published articles either use semi-/non-parametric flexible distributions for the covariates and identify the model via a missing at random assumption, or employ parametric distributions for the covariates and allow a more general non-random missingness mechanism. We consider the analysis of binary responses, combining a missing not at random mechanism with a nonparametric model based on a Dirichlet process mixture for the continuous covariates. We illustrate the proposal with simulations and the analysis of a dataset.

Research paper thumbnail of New Advances in Statistical Modeling and Applications

Research paper thumbnail of Analysis of rates in incomplete Poisson data

Journal of the Royal Statistical Society: Series D (The Statistician), 2003

Discrete data assumed to be generated by independent Poisson distributions are subject to censoring processes which determine the incomplete classification that is reported in several practical situations. The paper describes a probabilistic model that can explain incomplete data referring to partitions of the original set of categories. We overcome the model's lack of identifiability by assuming a missingness at random censoring mechanism. On the basis of the subsequent statistical model the results required to fit further non-informative censoring models by maximum likelihood methodology are obtained. This preliminary analysis opens the way to the analysis of structural models for Poisson expected rates. The paper describes how to test the fit of structural models for the Poisson rates, whether taken individually or simultaneously with special censoring models. Then, the maximum likelihood methodology is specialized to the analysis of strictly linear and log-linear models in such a way that its computational implementation opens up for any count table. The methods developed throughout the work are illustrated with a breast cancer data set.