Add an unknown discrete parameter following a Bernoulli distribution
March 11, 2025, 10:08am 1
Hello everyone,
I'd like to model the intra-host viral load of each individual in my population. However, my population contains both infected and uninfected individuals, and I can't distinguish between them. So I want to estimate the infection status of each individual and the probability of being infected in my population. To do this, I give the probability of being infected a prior, Pinf ~ beta(2, 2), and I draw the infection status of each individual with to_vector(infected) ~ bernoulli(Pinf).
However, I notice that Stan can't sample discrete parameters and that you have to 'marginalize discrete unknowns out of the posterior distribution', but I don't know how to do that for a process as simple as mine (all the posts I've been able to find deal with Poisson models, which are much more complex than my problem here).
Thanks in advance for the help!
Maxime
functions {
// Ctfun returns dCt, the delta Ct below Vinf:
real CtVinffun(real t, real tinf, int lod, int Vinf, real tp, real Tp, real Tc, real Vp){
if (t <= tinf)
return(Vinf); // Ct = 0
// Viral load rises before peak:
else if (t >= tinf && t <= tp)
return(Vp*(t-tinf)/Tp); // Tp = tp-tinf
// Viral load falls after peak:
else //(t > tp)
return(Vp+(lod-Vp)*(t-tp)/Tc); // Tc = tc-tp
}
}
data{
int<lower=0> N; // Number of concatenated data points
int<lower=0> n_id; // Number of individuals
int lod; // Limit of detection of Ct (Ct=10)
int Vinf; // Ct delta from lod at infection (Ct=0)
int<lower=0> id[N]; // Vector marking which datum belongs to which id
real t[N]; // Vector marking the time for each data point
real<lower=10, upper=50> y[N]; // Concatenated data vector
int censor[N]; // vector of censor indication
real tSS[n_id]; // Vector of individual time of symptom onset
int nb_random; // Total number of parameters with a random effect
}
transformed data {
real log_is_lod[N]; // log indicator: is the datum at the LOD value? (log(0) or log(1))
real log_is_obs[N]; // log indicator: is the datum below the LOD value (observed)? (log(0) or log(1))
for(i in 1:N){
if(censor[i]==1){
log_is_obs[i]=log(0);
log_is_lod[i]=log(1);
}
else{
log_is_obs[i]=log(1);
log_is_lod[i]=log(0);
}
}
}
parameters{
// parameters of Vp
real<lower=10, upper=50> mu_Vp;
// parameters of proliferation phase
real<lower=0> mu_Tp;
// parameters of clearance phase
real<lower=0> mu_Tc;
// parameters incubation period
real<lower=0, upper=14> mu_Tincub;
real<lower=0> sigma;
matrix[n_id,nb_random] eta_tilde; // random effect for Vp, Tp, Tc, and Tincub
real<lower=0> eta_omega[nb_random];
// Proportion of infected individuals in the population
real Pinf;
// Infection status for each individual
int infected[n_id];
}
transformed parameters{
real<lower=0, upper=14> Tincub[n_id]; // Incubation period cannot exceed 14 days
matrix[n_id,nb_random] eta; // matrix of random effects for Vp, Tp, Tc, and Tincub
real<lower=10, upper=50> Vp[n_id];
real<lower=0> Tp[n_id];
real<lower=0> Tc[n_id];
real tp[n_id];
real tinf[n_id];
real<upper=50> Ct[N];
real num_arg[N,2];
for(j in 1:nb_random){
eta[,j] = eta_tilde[,j]*eta_omega[j];
}
for(i in 1:n_id){
Vp[i] = mu_Vp*exp(eta[i,1]);
Tp[i] = mu_Tp*exp(eta[i,2]);
Tc[i] = mu_Tc*exp(eta[i,3]);
Tincub[i] = mu_Tincub*exp(eta[i,4]);
tinf[i] = tSS[i] - Tincub[i];
tp[i] = tinf[i] + Tp[i];
}
// Looping on each observation of the data set
for(i in 1:N){
// Prediction of the model
Ct[i] = CtVinffun(t[i], tinf[id[i]], lod, Vinf, tp[id[i]], Tp[id[i]], Tc[id[i]], Vp[id[i]]);
// LL for infected individuals //
if (infected[id[i]] == 1){
if (t[i] < tinf[id[i]]){ // Observations before the start of infection
num_arg[i,1] = log_is_obs[i] + log(0.0002); // likelihood that Ctobs value < LOD (false positive PCR test)
num_arg[i,2] = log_is_lod[i] + log(0.9998); // likelihood that Ctobs value == LOD (true negative PCR test)
} else{
num_arg[i,1] = log_is_obs[i] + normal_lpdf(y[i] | Ct[i], sigma); // likelihood that Ctobs value < LOD (positive PCR test)
num_arg[i,2] = log_is_lod[i] + normal_lcdf(10 | Ct[i], sigma); // likelihood that Ctobs value == LOD (negative PCR test)
}
// If the observation is not at the lod value, num_arg[i,2] is equal to -inf and will not be taken in the likelihood : target += log_sum_exp(num_arg[i]);
// Because log(exp(log(0)) + exp(num_arg[i,1])) = num_arg[i,1]
// LL for non-infected individuals //
} else{
num_arg[i,1] = log_is_obs[i] + log(0.0002); // likelihood that Ctobs value < LOD (false positive PCR test)
num_arg[i,2] = log_is_lod[i] + log(0.9998); // likelihood that Ctobs value == LOD (negative PCR test)
}
}
}
model{
// Priors //
mu_Vp ~ normal(25, 0.5); // hierarchical mean (mu)
mu_Tp ~ normal(6, 0.25); // hierarchical mean (mu)
mu_Tc ~ normal(15, 0.5); // hierarchical mean (mu)
mu_Tincub ~ normal(5, 0.25); // hierarchical mean (mu)
Pinf ~ beta(2,2); // probability to be infected => proportion of infected individuals in the population
to_vector(infected) ~ bernoulli(Pinf); // is the individual infected?
sigma ~ std_normal(); // measurement error
to_vector(eta_tilde) ~ std_normal();
to_vector(eta_omega) ~ normal(0.15,0.05); // variance of random effect 95% of density [0.05;0.25]
// Likelihood (looped on each observation) //
for(i in 1:N){
target += log_sum_exp(num_arg[i]);
}
}
Much like your current code, you can use log_sum_exp() to achieve this. The code below shows a very simplified version of it, where I assume you've calculated the person-specific log-likelihoods assuming they are infected and not infected.
for(i in 1:N){
vector[2] lp;
lp[1] = log(Pinf) + ll_if_infected[i];
lp[2] = log(1 - Pinf) + ll_if_not_infected[i];
target += log_sum_exp(lp);
}
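If you also want each individual's infection probability back (which was part of your original goal), you can recover it after marginalization. A minimal sketch, assuming ll_if_infected[i] and ll_if_not_infected[i] hold individual i's total log-likelihood under each status and are visible to generated quantities:
generated quantities {
  // posterior probability that individual i is infected, per draw
  vector[n_id] prob_infected;
  for (i in 1:n_id) {
    real lp1 = log(Pinf) + ll_if_infected[i];       // infected component
    real lp0 = log1m(Pinf) + ll_if_not_infected[i]; // uninfected component
    prob_infected[i] = exp(lp1 - log_sum_exp(lp1, lp0));
  }
}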
And a few additional, unconnected thoughts…
You will likely want to use the updated array syntax. Rather than
real num_arg[N,2];
use
array[N,2] real num_arg;
I don't have a perfect understanding of your model (so ignore this if I'm missing something crucial), but it looks like you can avoid a lot of unnecessary computation by calculating one value per person in num_arg. The code below is equivalent to an if statement: if censor[i] == 1, then it uses the log-likelihood associated with log_is_lod, and it uses the likelihood for log_is_obs if censor[i] != 1. So you don't need to calculate two values for each n of num_arg. You can also pre-calculate log(0.0002) and log(0.9998) in the transformed data block.
So it might look something like this. (note that I’ve removed the marginalization over lod/obs but added the marginalization over infected status).
transformed data{
real log_0002 = log(0.0002);
real log_9998 = log(0.9998);
array[N] int use_obs_ll; // 0 = lod; 1 = obs
for(i in 1:N){
if(censor[i] == 1){
use_obs_ll[i] = 0;
} else{
use_obs_ll[i] = 1;
}
}
}
transformed parameters{
// first row stores log-likelihoods if infected
// second row stores log-likelihoods if not infected
matrix[2,N] num_arg;
num_arg[1,] = rep_row_vector(log(Pinf), N);
num_arg[2,] = rep_row_vector(log(1 - Pinf), N);
for(n in 1:N){
if(t[n] < tinf[id[n]]){
// Observations before the start of infection
if(use_obs_ll[n] == 1){
// Data are below the LOD value
num_arg[1,n] += log_0002;
num_arg[2,n] += log_0002;
} else{
// Data are at the LOD value
num_arg[1,n] += log_9998;
num_arg[2,n] += log_9998;
}
} else{
// Observations after the start of infection
if(use_obs_ll[n] == 1){
// Data are below the LOD value
num_arg[1,n] += normal_lpdf(y[n] | Ct[n], sigma);
num_arg[2,n] += log_0002;
} else{
// Data are at the LOD value
num_arg[1,n] += normal_lcdf(10 | Ct[n], sigma);
num_arg[2,n] += log_9998;
}
}
}
}
model{
for(n in 1:N){
target += log_sum_exp(num_arg[,n]);
}
}
I’ll just add that you can simplify the code further and get a bit more performance using the log_mix builtin, i.e. @simonbrauer’s code is equivalent to:
for(i in 1:N){
target += log_mix(Pinf, ll_if_infected[i], ll_if_not_infected[i]);
}
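(For reference, log_mix(theta, lp1, lp2) computes log_sum_exp(log(theta) + lp1, log1m(theta) + lp2) in a single call.)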
Always use log1m(Pinf) in these cases; it's more stable for small values of Pinf.
You can simplify the increments to
num_arg[ , n] += log_0002;
And then vectorize all of this:
for (i in 1:n_id) {
Vp[i] = mu_Vp*exp(eta[i,1]);
Tp[i] = mu_Tp*exp(eta[i,2]);
Tc[i] = mu_Tc*exp(eta[i,3]);
Tincub[i] = mu_Tincub*exp(eta[i,4]);
tinf[i] = tSS[i] - Tincub[i];
tp[i] = tinf[i] + Tp[i];
}
to much more efficient code:
Vp = mu_Vp * exp(eta[ , 1]);
...
tp = tinf + Tp;
You can vectorize CtVinffun so it's a one-liner, which will probably let you vectorize within the function. Anything that can get converted to vectorized operations will be a win if it involves parameters, as we can usually speed up the autodiff.
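For illustration, here is one possible (untested) vectorized sketch, where every argument is a per-observation vector (e.g. built outside with to_vector(tinf)[id]) and only the branch selection remains a loop:
vector CtVinf_vec(vector t, vector tinf, real lod, real Vinf,
                  vector tp, vector Tp, vector Tc, vector Vp) {
  int N = num_elements(t);
  vector[N] rise = Vp .* (t - tinf) ./ Tp;            // tinf <= t <= tp
  vector[N] fall = Vp + (lod - Vp) .* (t - tp) ./ Tc; // t > tp
  vector[N] out;
  for (n in 1:N) {
    if (t[n] <= tinf[n]) out[n] = Vinf;
    else if (t[n] <= tp[n]) out[n] = rise[n];
    else out[n] = fall[n];
  }
  return out;
}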
If you precompute your tests, like (infected[id[i]] == 1) (or equivalently just infected[id[i]] if it can only take values 0 and 1), you can vectorize all of the loops that branch on these conditions by just operating on the indices. For instance if tin
If you're not doing custom inits and you have this:
mu_Vp ~ normal(25, 0.5);
then you want to declare mu_Vp as
real<offset=25, multiplier=0.5> mu_Vp;
which will standardize the scale and make our default inits and adaptation work much better in early iterations. This is even more important for the narrowly constrained ones like eta_omega.
You can move a bunch of things you declared as transformed parameters, like num_arg, into local variables in the model block to save output space and make the posterior analysis faster.
P.S. I'd recommend more horizontal space around operators and braces, and less vertical space between lines. We have a style guide at the end of the User's Guide.
Germante March 14, 2025, 9:10pm 6
Thank you all for your answers! I’ve now understood how to marginalise latent classes in the posterior with Stan! ;)
The model works very well even with a rather sparse dataset (very fast run, well-mixed chains, unbiased estimates). However, it doesn't correspond exactly to what I'd like to model. I'd like each individual to have a single probability of infection (Pinf) that applies to all of their observations. With the model you've proposed, the mixture is applied observation by observation, so the infection status can effectively differ between observations of the same individual, which isn't logical: either the individual is really infected or they're not.
So I wrote the model below to use a single Pinf across all the tests of an individual. It runs, but very slowly on the same datasets with the same priors, and I get rather unstable results with more or less significant biases. Not all individuals have the same number of observations; this information is contained in the vector Ntest[n_id], so I only loop over each individual's own data. Could you have a look? It's likely that I've written the model in a way that isn't idiomatic Stan, which could explain the current situation.
functions {
// Ctfun returns dCt, the delta Ct below Vinf:
real CtVinffun(real t, real tinf, int lod, int Vinf, real tp, real Tp, real Tc, real Vp){
if (t <= tinf)
return(Vinf); // Ct = 0
else if (t >= tinf && t <= tp)
return(Vp*(t-tinf)/Tp); // Viral load rises before peak: Tp = tp-tinf
else // Viral load falls after peak: (t > tp)
return(Vp+(lod-Vp)*(t-tp)/Tc); // Tc = tc-tp
}
}
data{
int<lower=0> N; // Number of concatenated data points
int<lower=0> n_id; // Number of individuals
int lod; // Limit of detection of Ct (Ct=10)
int Vinf; // Ct delta from lod at infection (Ct=0)
int<lower=0> id[N]; // Vector marking which datum belongs to which id
real tSS[n_id]; // Vector of individual time of symptom onset
int nb_random; // Total number of parameters with a random effect
int Ntest[n_id]; // Number of observations for each individual
int max_Ntest; // Number of maximum of repeated test in the population
matrix[n_id,max_Ntest] time; // array of dim n_id,max_Ntest containing time of observations
matrix[n_id, max_Ntest] y; // array of dim n_id,max_Ntest containing Ct of observations
matrix[n_id,max_Ntest] censor; // array of dim n_id,max_Ntest containing censor indication
}
transformed data{
real log_0002 = log(0.0002); // log proba of false positive
real log_9998 = log(0.9998); // log proba of true negative
matrix[n_id,max_Ntest] use_obs_ll; // 0 = lod; 1 = obs
for(i in 1:n_id){
for (n in 1:Ntest[i]){
if(censor[i,n] == 1){
use_obs_ll[i,n] = 0;
} else{
use_obs_ll[i,n] = 1;
}
}
}
}
parameters{
real mu_Vp; // fixed effect of Vp
real<lower=0> mu_Tp; // fixed effect of Tp
real<lower=0> mu_Tc; // fixed effect of Tc
real<lower=0, upper=14> mu_Tincub; // fixed effect of Tincub
real<lower=0> sigma; // additive measurement error
matrix[n_id,nb_random] eta_tilde; // Gaussian to center random effect to 0
real<lower=0> eta_omega[nb_random]; // Variance of random effect for Vp, Tp, Tc, and Tincub
real<lower=0, upper=1> Pinf; // Proportion of infected individuals in the population
}
transformed parameters{
real<lower=0, upper=14> Tincub[n_id]; // Incubation period cannot exceed 14 days
matrix[n_id,nb_random] eta; // matrix of random effects for Vp, Tp, Tc, and Tincub
real Vp[n_id]; // Vector of individual Vp
real<lower=0> Tp[n_id]; // Vector of individual Tp
real<lower=0> Tc[n_id]; // Vector of individual Tc
row_vector[n_id] tp; // Vector of individual tp
row_vector[n_id] tinf; // Vector of individual tinf
matrix<upper=50>[n_id,max_Ntest] Ct; // array of dim n_id,N_max_Tests containing Ct pred of each observation
real num_arg[n_id, max_Ntest, 2]; // array containing individual LL contribution
num_arg = rep_array(0, n_id, max_Ntest, 2); // Initializing num_arg to 0 (default was NaN)
Ct = to_matrix(rep_array(0, n_id, max_Ntest)); // Initializing Ct to 0 (default was NaN)
matrix[n_id,2] ll_Pinf; // matrix of two columns : Pinf and 1-Pinf for each individual
ll_Pinf[,1] = to_vector(rep_array(log(Pinf), n_id)); // filling the matrix
ll_Pinf[,2] = to_vector(rep_array(log1m(Pinf), n_id)); // filling the matrix
for(j in 1:nb_random){
eta[,j] = eta_tilde[,j]*eta_omega[j];
}
Vp = to_array_1d(mu_Vp * exp(eta[ ,1]));
Tp = to_array_1d(mu_Tp * exp(eta[ ,2]));
Tc = to_array_1d(mu_Tc * exp(eta[ ,3]));
Tincub = to_array_1d(mu_Tincub * exp(eta[ ,4]));
tinf = to_row_vector(tSS) + to_row_vector(Tincub);
tp = to_row_vector(tinf) + to_row_vector(Tp);
for(i in 1:n_id){ // Looping on each individual
for (n in 1:Ntest[i]){ // Each observation of the individual
Ct[i,n] = CtVinffun(time[i,n], tinf[i], lod, Vinf, tp[i], Tp[i], Tc[i], Vp[i]); // Prediction of the model
if(time[i,n] < tinf[i]){ // Observations before the start of infection
if(use_obs_ll[i,n] == 1){ // Data are uncensored
num_arg[i,n,1] += log_0002 + ll_Pinf[i,1]; // proba of false positive
num_arg[i,n,2] += log_0002 + ll_Pinf[i,2]; // proba of false positive
} else{// Data are censored
num_arg[i,n,1] += log_9998 + ll_Pinf[i,1]; // proba of true negative
num_arg[i,n,2] += log_9998 + ll_Pinf[i,2]; // proba of true negative
}
} else{ // Observations after the start of infection
if(use_obs_ll[i,n] == 1){ // Data are uncensored
num_arg[i,n,1] += normal_lpdf(y[i,n] | Ct[i,n], sigma) + ll_Pinf[i,1]; // If infected
num_arg[i,n,2] += log_0002 + ll_Pinf[i,2]; // If not infected, proba of false positive
} else{ // Data are censored
num_arg[i,n,1] += normal_lcdf(10 | Ct[i,n], sigma) + ll_Pinf[i,1]; // If infected
num_arg[i,n,2] += log_9998 + ll_Pinf[i,2]; // If not infected, proba of true negative
}
}
}
}
}
model{
// Priors //
mu_Vp ~ normal(25, 0.5); // hierarchical mean (mu)
mu_Tp ~ normal(6, 0.25); // hierarchical mean (mu)
mu_Tc ~ normal(15, 0.5); // hierarchical mean (mu)
mu_Tincub ~ normal(5, 0.25); // hierarchical mean (mu)
Pinf ~ beta(2,2); // probability to be infected => proportion of infected individuals in the population
sigma ~ std_normal(); // measurement error
to_vector(eta_tilde) ~ std_normal();
to_vector(eta_omega) ~ normal(0.15,0.05); // variance of random effect 95% of density [0.05;0.25]
for(i in 1:n_id){ // Likelihood (looped on each observation of each individual)
for (n in 1:Ntest[i]){
target += log_sum_exp(num_arg[i,n,1], num_arg[i,n,2]);
}
}
}
PS: @Bob_Carpenter I'd like to thank you for all your advice and tips on optimising my code. I'll look into it, but only once I've got a model that works the way I want it to.
Thank you in advance,
Maxime
Having things like your CtVinffun can be problematic for sampling unless you can guarantee the results are continuously differentiable.
I would try to write censoring so it doesn't use dummy values. You don't need use_obs_ll as it's just 1 - censor.
That’s the right thing to do. There’s a chapter in the User’s Guide on ragged data—we keep meaning to add it as a built-in, but it’s hard to design, code, and plumb through all of our I/O.
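The basic idiom from that chapter is to keep all the observations in one flat vector plus a per-individual count and slice inside the loop. A minimal sketch with hypothetical names (y_flat holds every observation back to back; mu and sigma are placeholders):
model {
  int pos = 1;
  for (i in 1:n_id) {
    // individual i's observations are y_flat[pos : pos + Ntest[i] - 1]
    vector[Ntest[i]] y_i = segment(y_flat, pos, Ntest[i]);
    y_i ~ normal(mu[i], sigma); // placeholder likelihood, just to show the slicing
    pos += Ntest[i];
  }
}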
Finally, it seems that the problem stemmed from the definition of the infection start time, tinf: tinf = tSS - Tincub (and not with a +), since the infection comes before the symptoms, not after.
After correcting this typo in the script, the model runs very well. I get unbiased results, with well-mixed chains, on simulated datasets of 100 infected and 100 uninfected individuals. The setting is very sparse, with 75% of individuals having 1 observation, 20% having 2 observations, and 5% having 3 or 4 observations. Despite this sparsity, the inference is satisfactory.
My final concern is the feasibility of using the model in my context, particularly in terms of computation time. I would like to use this model on a dataset of 1 million individuals (~1.3 million data points), as sparse as the simulations above. For 200 individuals, the current script takes between 15 and 25 minutes (see histogram) to run on a computing cluster (200 warm-up iterations, 200 sampling iterations, adapt_delta=0.95 and max_treedepth=15). The 200 warm-up iterations take 80% of the computation time and the sampling takes the remaining 20%; is this normal? Now I'm running on a dataset of 10,000 infected and 10,000 uninfected individuals and the model seems to run forever (after 15 hours, it still hasn't completed 40 warm-up iterations). So I can't imagine how long it would take for 1 million individuals.
Apart from vectorising the code and intra-chain parallelisation (I only have a maximum of 16 cores available for 4 chains, so divided by 4), is my model still feasible on a dataset of 1 million in Stan?
I am adding @charlesm93 to the discussion, with whom I had already chatted a few months ago on this subject. Thanks again @Bob_Carpenter for your valuable advice.
This all depends on your model. I would say, though, that the max_treedepth=15 is going to be problematic. That's going to allow 32K leapfrog iterations, each requiring a log density and gradient. In most cases, the root cause is some bad geometry in hierarchical models which a non-centered parameterization can fix. In other cases, it's much more complicated.
I would recommend trying the PyMC sampler Nutpie; it's specifically faster than Stan in warmup. The other thing to try would be to run Pathfinder, our variational inference, and use draws from that to initialize MCMC. Pathfinder can fail during optimization if the geometry's too tricky.
At the very least, I would parameterize things like mu_Vp, which has a normal(25, 0.5) distribution, as
real<offset=25, multiplier=0.5> mu_Vp;
This will cause initialization to be in the range (24, 26) instead of (-2, 2). This can help immensely with initialization if the prior's reasonable.
How did you derive the 0.0002 and 0.9998 numbers?
The other thing to do with 1M individuals is to try to group them somehow and reduce everything to sufficient statistics. This may or may not be possible depending on the size of your model.
Sounds like that’s a hard no unless you can reparameterize and do some data reduction. For example, you can do some data reduction this way:
to_vector(infected) ~ bernoulli(Pinf);
is equivalent to
sum(infected) ~ binomial(n_id, Pinf);
But I’m really confused how you have integer parameters!
Germante April 22, 2025, 3:56pm 10
Thank you Bob for your clarifications and suggestions. We have since opted to reduce our dataset to sufficient statistics, because we have so many individuals with only one observation, with the same value at the same time (so their contribution to the likelihood is strictly the same conditional on the random effects).
I implemented a new model by including a ‘weight’ in my likelihood corresponding to the number of individuals with the same data at the same time.
I wrote it like this :
data{
int weight[n_id]; // Likelihood weight for each individual (number of individuals with the same observations)
}
model{
for(i in 1:n_id){ // Likelihood (looped on each observation of each individual)
for (n in 1:Ntest[i]){
target += weight[i] * log_sum_exp(num_arg[i,n,1], num_arg[i,n,2]); // Weighting LL with the number of individuals with the same observations
}
}
}
Since then, I've had major fitting problems, with chains that don't mix at all and each stagnates in its own region.
As I'm working with a hierarchical model, I'm wondering about the impact of this weighting, and of the reduction in the number of unique individuals in my dataset, on the estimate of the standard deviation of my random effects (eta_omega). Should I change the definition of the individual parameters to take account of the weighting, like this?
for (i in 1:n_id){
for (j in 1:nb_random){
eta[i,j] = eta_tilde[i,j] * eta_omega[j] / weight[i];
}
}
You might be interested @charlesm93.
Thank you in advance for your suggestions !
Maxime
I can't tell much from just looking at a fragment of your model. I would suggest starting with an explicit likelihood for each item, then aggregating ones that are exactly the same. For example, the following four models are identical in terms of the effect on theta.
1.
1 ~ bernoulli(theta);
0 ~ bernoulli(theta);
1 ~ bernoulli(theta);
2.
2 ~ binomial(3, theta);
3.
target += bernoulli_lupmf(1 | theta);
target += bernoulli_lupmf(0 | theta);
target += bernoulli_lupmf(1 | theta);
4.
target += 2 * bernoulli_lupmf(1 | theta);
target += 1 * bernoulli_lupmf(0 | theta);
The u in lupmf indicates that constants should be dropped to match what happens in the ~ notation.
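(One caveat: the _lupdf/_lupmf forms can only be used in the model block or inside user-defined _lpdf/_lpmf functions, so log-likelihood terms computed in transformed parameters, as with your num_arg, have to use the normalized _lpdf versions.)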
Germante April 25, 2025, 3:43pm 12
Sorry Bob, the title of the post no longer corresponds exactly to my current problem. I do have a model in which I reconstruct the viral load and take into account whether or not an individual is infected (in particular by comparing the observed viral load with that predicted by the model).
It's a linear mixed-effects model, and it works very well. However, it already takes time to run on 1,000 individuals, and my aim is to run this model on as many individuals as possible from my dataset of 700,000 individuals. As you suggested, I have a huge number of individuals (500,000) with only 1 observation, at the same time and with the same viral load value, so I'd like to use sufficient statistics to reduce my data. Data reduction is crucial: I could reduce the number of individuals by a factor of 20 and the number of observations by a factor of 7 overall.
I've tried to implement it in the way shown below, but my sigma error term shrinks to 0 (its simulated value is 2).
However, without data reduction, on a small dataset of 1,000 individuals (i.e. each observation of each individual has weight 1 in its contribution to the LL), the model still runs well and I get unbiased estimates as well as chains that mix.
The questions that come to mind:
- What is the impact on the estimation of the standard deviations of the random effects eta_omega? Because with data reduction, we aggregate a huge number of individuals. Do we need to do something specific to control all this?
- I haven't really understood why it's important to use a u function (in my case normal_lupdf); could it be the source of my inference issues? (Update: using normal_lupdf() did not solve my problem.)
Again, data reduction is a must-have to model the observations in my dataset. I thank you, or anyone else, in advance for your help!
The model is written as follows:
functions {
// Ctfun returns dCt, the delta Ct below Vinf:
real CtVinffun(real t, real tinf, int lod, int Vinf, real tp, real Tp, real Tc, real Vp){
if (t <= tinf)
return(Vinf); // Ct = 0
else if (t >= tinf && t <= tp)
return(Vp*(t-tinf)/Tp); // Viral load rises before peak: Tp = tp-tinf
else // Viral load falls after peak: (t > tp)
return(Vp+(lod-Vp)*(t-tp)/Tc); // Tc = tc-tp
}
}
data{
int<lower=0> N; // Number of concatenated data points
int<lower=0> n_id; // Number of individuals / patterns
int lod; // Limit of detection of Ct (Ct=10)
int Vinf; // Ct delta from lod at infection (Ct=0)
int<lower=0> id[N]; // Vector marking which datum belongs to which id
real tSS[n_id]; // Vector of individual time of symptom onset
int nb_random; // Total number of parameters with a random effect
int Ntest[n_id]; // Number of observations for each individual
int weight[n_id]; // Likelihood weight for each individual (pattern)
int max_Ntest; // Number of maximum of repeated test in the population
matrix[n_id, max_Ntest] time; // array of dim n_id,max_Ntest containing time of observations
matrix[n_id, max_Ntest] y; // array of dim n_id,max_Ntest containing Ct of observations
matrix[n_id, max_Ntest] censor; // array of dim n_id,max_Ntest containing censor indication
}
transformed data{
real log_0002 = log(0.0002); // log proba of false positive
real log_9998 = log(0.9998); // log proba of true negative
matrix[n_id,max_Ntest] use_obs_ll; // 0 = lod; 1 = obs
for(i in 1:n_id){
for (n in 1:Ntest[i]){
if(censor[i,n] == 1){
use_obs_ll[i,n] = 0;
} else{
use_obs_ll[i,n] = 1;
}
}
}
}
parameters{
real<upper=50> mu_Vp; // fixed effect of Vp
real<lower=0> mu_Tp; // fixed effect of Tp
real<lower=0> mu_Tc; // fixed effect of Tc
real<lower=0, upper=14> mu_Tincub; // fixed effect of Tincub
real<lower=0> sigma; // additive measurement error
matrix[n_id,nb_random] eta_tilde; // Gaussian to center random effect to 0
real<lower=0> eta_omega[nb_random]; // Variance of random effect for Vp, Tp, Tc, and Tincub
real<lower=0, upper=1> Pinf; // Proportion of infected individuals in the population
}
transformed parameters{
real<lower=0, upper=14> Tincub[n_id]; // Incubation period cannot exceed 14 days
matrix[n_id,nb_random] eta; // matrix of random effects for Vp, Tp, Tc, and Tincub
real<upper=50> Vp[n_id]; // Vector of individual Vp
real<lower=0> Tp[n_id]; // Vector of individual Tp
real<lower=0> Tc[n_id]; // Vector of individual Tc
row_vector[n_id] tp; // Vector of individual tp
row_vector[n_id] tinf; // Vector of individual tinf
matrix<upper=50>[n_id,max_Ntest] Ct; // array of dim n_id,N_max_Tests containing Ct pred of each observation
real num_arg[n_id, max_Ntest, 2]; // array containing individual LL contribution
num_arg = rep_array(0, n_id, max_Ntest, 2); // Initializing num_arg to 0 (default was NaN)
Ct = to_matrix(rep_array(0, n_id, max_Ntest)); // Initializing Ct to 0 (default was NaN)
matrix[n_id,2] ll_Pinf; // matrix of two columns : Pinf and 1-Pinf for each individual
ll_Pinf[,1] = to_vector(rep_array(log(Pinf), n_id)); // filling the matrix with Pinf
ll_Pinf[,2] = to_vector(rep_array(log1m(Pinf), n_id)); // filling the matrix with 1-Pinf
for(j in 1:nb_random){
eta[,j] = eta_tilde[,j]*eta_omega[j];
}
Vp = to_array_1d(mu_Vp * exp(eta[ ,1]));
Tp = to_array_1d(mu_Tp * exp(eta[ ,2]));
Tc = to_array_1d(mu_Tc * exp(eta[ ,3]));
Tincub = to_array_1d(mu_Tincub * exp(eta[ ,4]));
tinf = to_row_vector(tSS) - to_row_vector(Tincub);
tp = to_row_vector(tinf) + to_row_vector(Tp);
for(i in 1:n_id){ // Looping on each individual
for (n in 1:Ntest[i]){ // Each observation of the individual
Ct[i,n] = CtVinffun(time[i,n], tinf[i], lod, Vinf, tp[i], Tp[i], Tc[i], Vp[i]); // Prediction of the model
if(time[i,n] < tinf[i]){ // Observations before the start of infection
if(use_obs_ll[i,n] == 1){ // Data are uncensored
num_arg[i,n,1] += log_0002 + ll_Pinf[i,1]; // proba of false positive
num_arg[i,n,2] += log_0002 + ll_Pinf[i,2]; // proba of false positive
} else{// Data are censored
num_arg[i,n,1] += log_9998 + ll_Pinf[i,1]; // proba of true negative
num_arg[i,n,2] += log_9998 + ll_Pinf[i,2]; // proba of true negative
}
} else{ // Observations after the start of infection
if(use_obs_ll[i,n] == 1){ // Data are uncensored
num_arg[i,n,1] += normal_lpdf(y[i,n] | Ct[i,n], sigma) + ll_Pinf[i,1]; // If infected
num_arg[i,n,2] += log_0002 + ll_Pinf[i,2]; // If not infected, proba of false positive
} else{ // Data are censored
num_arg[i,n,1] += normal_lcdf(10 | Ct[i,n], sigma) + ll_Pinf[i,1]; // If infected
num_arg[i,n,2] += log_9998 + ll_Pinf[i,2]; // If not infected, proba of true negative
}
}
}
}
}
model{
// Priors //
mu_Vp ~ normal(25, 2); // hierarchical mean (mu)
mu_Tp ~ normal(6, 1); // hierarchical mean (mu)
mu_Tc ~ normal(15, 2); // hierarchical mean (mu)
mu_Tincub ~ normal(5, 1); // hierarchical mean (mu)
Pinf ~ beta(2,2); // probability to be infected => proportion of infected individuals in the population
sigma ~ normal(0,3); // measurement error
to_vector(eta_tilde) ~ std_normal();
to_vector(eta_omega) ~ normal(0.15,0.2); // prior on sd of random effects
for(i in 1:n_id){ // Likelihood (looped on each observation of each individual)
for (n in 1:Ntest[i]){
target += weight[i] * log_sum_exp(num_arg[i,n,1], num_arg[i,n,2]); // Weighting LL with the number of individuals with the same pattern
}
}
}
Sorry about not understanding the statistical question you’re asking. Since I still don’t get it, let me answer what I can.
Top level, even without understanding what you’re specifically trying to do with all this, I would still strongly urge you to go back to a test case where you have a working version that doesn’t aggregate weights, then you can aggregate weights to make sure you get the same answer. You can do this with simulated data. Then you can convince yourself that the aggregation is or isn’t the problem.
Stan jointly estimates everything, so it’s not like we estimate the standard deviations first and then the effects.
Are you using “data reduction” to mean your use of weighted counts here? That doesn’t actually reduce the amount of information in the data, it just compresses it.
Control all of what?
The difference is that normal_lupdf(y | mu, sigma) drops the normalizing constants when they don't depend on parameters. For instance, if sigma is data rather than a parameter, we drop the -log(sigma) term in the log density, and we always drop the -log(sqrt(2 * pi)) term. Without the u, you get all the normalizing terms.
If censor truly takes on the integer values 0 or 1, then it'd be better to code it this way:
array[n_id, max_Ntest] int<lower=0, upper=1> censor;
That will do the error checking on data load and it also documents the possible values. Then you can define use_obs_ll by setting
use_obs_ll[m, n] = 1 - censor[m, n];
Germante April 26, 2025, 6:54pm 14
Thanks Bob for your very quick response! I've probably expressed my problem in the wrong way.
To clarify, I'm currently working with a simulated dataset of 500 infected individuals (with uncensored and/or censored data) and 500 uninfected individuals (they only have censored data). My previous model (a little higher up in the thread, in my post of 14 March), which we'll call M1, is strictly identical to the new one apart from the addition of the weight in the data block and of the weight in the target += line.
My M1 model (without the weight) works very well. Based on this observation, I tried using the same dataset, without aggregating the identical rows (so the weight of every observation is 1), but with the weighted model (M2), and it still worked perfectly well.
This brings us to my current problem: I'd like to use my M2 model, which works well without reducing the number of rows (all weights equal to 1), on the same dataset after aggregating the individuals who share a single identical data point (same observation time relative to symptom onset and same Ct value), and who therefore contribute exactly the same amount to the likelihood (I'm sure about that). However, when I do this, the model struggles to run and each chain stagnates in its own region. So I'm sure the problem arises when there are weights larger than 1 in my dataset.
I'm wondering whether this is due to the random effects: by reducing the number of rows I've reduced the number of distinct individuals in my dataset, so I only compute one random effect for an individual whose observation occurs N times in the dataset, with a weight of N in the likelihood.
Yes I do, because by using weighting in the likelihood, I reduce the number of individuals in my dataset, as well as the number of observations (so the dataset I input has a smaller number of rows than the unweighted dataset), but I’m aware that the amount of information present in the dataset is the same.
By weighting (and aggregating observations), I reduce the heterogeneity in my data. If I have a weight of 50 for an observation, during an iteration I only sample 1 random effect to determine the model prediction corresponding to this observation, whereas without weighting I would have sampled 50 in the same iteration.
Okay, that's very clear. In my case, as sigma is estimated, I can't use lupdf; I have to use normal_lpdf.
Thank you for the tips concerning censoring.
Have you understood the question behind my post?
jsocolar April 27, 2025, 2:02am 15
From your latest post, it sounds completely right that you have fundamentally changed the model by collapsing over individuals that formerly had separate per-individual random effects. This means that, if by "likelihood" we mean the likelihood conditional on the random effect parameters, it is not true that these aggregated individuals contribute exactly the same amount to the likelihood.
In your M1 model, these individuals have (at any given iteration) different numerical contributions to the likelihood, because they take on different values for the random effect parameter.
Given that you need to collapse across individuals for computational reasons, it might be worth looking into whether you can integrate out the random effect so that you can write down the shared likelihood just in terms of the hyperparameters (without needing to condition on the actual random effect parameters). I haven't waded deeply enough into your model to understand how easy it would be to integrate out the random effects for these single-observation individuals. But, for example, if the likelihood were normal with linear predictor l, residual standard deviation sigma_resid, and random effect standard deviation sigma_ranef, then the shared likelihood for all individuals sharing a value for l and a single response value y would simply be normal_lpdf(y | l, sqrt(sigma_ranef^2 + sigma_resid^2)).
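In that simplified normal case, the weighted, collapsed contribution of the single-observation patterns might look like the sketch below. The names n_pattern, y_pattern, l, and weight are hypothetical, and this assumes a purely additive Gaussian random effect, which is not exactly true of your multiplicative exp(eta) effects:
real sd_marg = sqrt(square(sigma_ranef) + square(sigma_resid));
for (g in 1:n_pattern)
  target += weight[g] * normal_lpdf(y_pattern[g] | l[g], sd_marg);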
This is tying a bunch of random effects into one, which will reduce the uncertainty in that random effect and change the posterior in the way @jsocolar noted.