Help for package pECV
| Type: | Package |
|---|---|
| Title: | Entrywise Splitting Cross-Validation for Factor Models |
| Version: | 1.0.1 |
| Description: | Implements entrywise splitting cross-validation (ECV) and its penalized variant (pECV) for selecting the number of factors in generalized factor models. |
| License: | GPL-3 |
| Encoding: | UTF-8 |
| Language: | en-US |
| Depends: | R (≥ 3.5.0) |
| Imports: | stats, Rcpp (≥ 1.0.0), irlba |
| Suggests: | mirtjml, testthat (≥ 3.0.0) |
| LinkingTo: | Rcpp, RcppArmadillo |
| URL: | https://github.com/wangATsu/ECV |
| BugReports: | https://github.com/wangATsu/ECV/issues |
| RoxygenNote: | 7.3.2 |
| Config/testthat/edition: | 3 |
| ByteCompile: | true |
| NeedsCompilation: | yes |
| Packaged: | 2025-08-23 02:29:40 UTC; clswt-wangzhijing |
| Author: | Zhijing Wang [aut, cre] |
| Maintainer: | Zhijing Wang <wangzhijing@sjtu.edu.cn> |
| Repository: | CRAN |
| Date/Publication: | 2025-08-28 08:50:07 UTC |
Estimate constraint constant C for continuous data
Description
Data-driven estimation of the constraint constant C used in the alternating maximization algorithm for continuous data, based on a truncated SVD. The function decomposes the data matrix and estimates C from the maximum row norm of the resulting factor matrices.
Usage
estimate_C(X, qmax = 8, safety = 1.2)
Arguments
| X | n x p continuous data matrix |
|---|---|
| qmax | Rank for truncated SVD (default 8) |
| safety | Safety parameter for conservative estimation (default 1.2) |
Details
The function performs the following steps:
1. Computes truncated SVD of X with rank qmax
2. Constructs factor matrices A = U * sqrt(D) and B = V * sqrt(D)
3. Calculates row 2-norms for matrices A and B
4. Takes the maximum norm and multiplies by safety parameter
For count data, it is recommended to transform the data using log(X + 1) before applying this function.
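The four steps above can be sketched in base R. This is an illustration rather than the package's source: it uses base `svd()` as a stand-in for whatever truncated-SVD routine the package calls (irlba is listed under Imports), and the helper name `sketch_estimate_C` is hypothetical.

```r
# Hypothetical helper mirroring the documented steps; not the package's code.
sketch_estimate_C <- function(X, qmax = 8, safety = 1.2) {
  # Step 1: rank-qmax truncated SVD (base svd() used as a stand-in)
  s <- svd(X, nu = qmax, nv = qmax)
  d_sqrt <- sqrt(s$d[seq_len(qmax)])

  # Step 2: A = U * sqrt(D), B = V * sqrt(D) (column k scaled by sqrt(d_k))
  A <- sweep(s$u, 2, d_sqrt, `*`)
  B <- sweep(s$v, 2, d_sqrt, `*`)

  # Step 3: Euclidean norm of every row of A and B
  a_norms <- sqrt(rowSums(A^2))
  b_norms <- sqrt(rowSums(B^2))

  # Step 4: largest row norm, inflated by the safety factor
  C_norm_hat <- max(a_norms, b_norms)
  list(C_norm_hat = C_norm_hat, C_est = safety * C_norm_hat,
       a_norms = a_norms, b_norms = b_norms)
}

set.seed(123)
X <- matrix(rnorm(100 * 50), 100, 50)
out <- sketch_estimate_C(X, qmax = 5)
out$C_est
```

Because A and B share the factor sqrt(D), the product A %*% t(B) is the rank-qmax approximation of X, so the row norms bound the size of the factor and loading vectors on a common scale.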
Value
A list containing:
| qmax | Truncation rank used |
|---|---|
| safety | Safety parameter applied |
| C_norm_hat | Original maximum row norm |
| C_est | Final conservative estimate of C |
| a_norms | Row norms of factor matrix A |
| b_norms | Row norms of factor matrix B |
Examples
# Example 1: Continuous data
set.seed(123)
n <- 100; p <- 50; q <- 3
theta_true <- matrix(runif(n * q), n, q)
A_true <- matrix(runif(p * q), p, q)
X <- theta_true %*% t(A_true) + matrix(rnorm(n * p, sd = 0.5), n, p)
# Estimate C
C_result <- estimate_C(X, qmax = 5)
print(C_result$C_est)
# Example 2: Count data (apply log transformation)
lambda <- exp(theta_true %*% t(A_true))
X_count <- matrix(rpois(n * p, lambda = as.vector(lambda)), n, p)
X_transformed <- log(X_count + 1)
C_count <- estimate_C(X_transformed, qmax = 5)
print(C_count$C_est)
Estimate constraint constant C for binary data
Description
Data-driven estimation of the constraint constant C for binary data, using cross-window smoothing followed by an empirical logit transformation.
Usage
estimate_C_binary(X, qmax = 8, safety = 1.5, eps = 1e-12, radius = 1)
Arguments
| X | n x p binary data matrix (0/1 values) |
|---|---|
| qmax | Rank for truncated SVD (default 8) |
| safety | Safety parameter for conservative estimation (default 1.5) |
| eps | Small constant that keeps the logit finite when a smoothed probability equals 0 or 1 (default 1e-12) |
| radius | Radius for cross-window smoothing (default 1) |
Details
The function performs the following steps:
1. Applies cross-window smoothing to estimate probabilities
2. Performs empirical logit transformation with smoothing
3. Computes truncated SVD of the transformed matrix
4. Constructs matrices A and B and calculates row norms
5. Estimates C as the maximum norm times safety parameter
The cross-window smoothing helps stabilize probability estimates, especially for sparse binary data.
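Steps 1 and 2 might look roughly like the base R sketch below. Two things here are assumptions, not documented behaviour: the window is taken to be plus-shaped (the entries within `radius` of cell (i, j) along its row and along its column), and `eps` is applied by clipping probabilities to [eps, 1 - eps]. The helper name `sketch_smooth_logit` is hypothetical.

```r
# Hypothetical helper; the window shape and the clipping rule are assumptions.
sketch_smooth_logit <- function(X, radius = 1, eps = 1e-12) {
  n <- nrow(X); p <- ncol(X)
  S <- matrix(0, n, p)   # sum of 0/1 values in each window
  N <- matrix(0, n, p)   # number of values in each window
  for (i in seq_len(n)) {
    for (j in seq_len(p)) {
      rows <- max(1, i - radius):min(n, i + radius)   # vertical arm
      cols <- max(1, j - radius):min(p, j + radius)   # horizontal arm
      vals <- c(X[rows, j], X[i, cols[cols != j]])    # centre counted once
      S[i, j] <- sum(vals)
      N[i, j] <- length(vals)
    }
  }
  P_smooth <- S / N
  P_clip <- pmin(pmax(P_smooth, eps), 1 - eps)        # keep the logit finite
  list(P_smooth = P_smooth, N_counts = N,
       Mhat = log(P_clip / (1 - P_clip)))
}

set.seed(1)
X <- matrix(rbinom(20 * 10, 1, 0.3), 20, 10)
out <- sketch_smooth_logit(X)
range(out$N_counts)
```

Under these assumptions, windows at the matrix border contain fewer cells than interior windows, which is why a count matrix such as `N_counts` is worth returning alongside the smoothed probabilities.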
Value
A list containing:
| radius | Cross-window radius used |
|---|---|
| qmax | Truncation rank used |
| safety | Safety parameter applied |
| C0 | Original maximum row norm |
| C_est | Final conservative estimate of C |
| a_norms | Row norms of factor matrix A |
| b_norms | Row norms of factor matrix B |
| Mhat | Logit-transformed matrix |
| P_smooth | Smoothed probability matrix |
| N_counts | Count of values in each smoothing window |
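A possible usage example, combining this function with generate_binary_data() from the same package. The calls use only the arguments and return components listed in this manual; the block is guarded so that it does nothing when pECV is not installed. Passing `C_est` on as the `C` argument of pECV() is an assumption about the intended workflow and is left commented out.

```r
if (requireNamespace("pECV", quietly = TRUE)) {
  set.seed(123)
  sim <- pECV::generate_binary_data(n = 100, p = 50, q = 3)

  # Estimate the constraint constant from the 0/1 response matrix
  C_bin <- pECV::estimate_C_binary(sim$resp, qmax = 5)
  print(C_bin$C_est)

  # Assumed follow-up step (may take several seconds):
  # result <- pECV::pECV(sim$resp, C = C_bin$C_est, qmax = 5, data_type = "binary")
}
```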
Generate binary data example
Description
Generate simulated data from a binary (logistic) factor model.
Usage
generate_binary_data(n = 100, p = 50, q = 3)
Arguments
| n | Integer. Number of observations. |
|---|---|
| p | Integer. Number of variables. |
| q | Integer. True number of latent factors. |
Value
A named list with components:
resp
Binary matrix (n x p). Generated 0/1 responses.
true_q
Integer. True number of factors used in simulation.
theta_true
Numeric matrix (n x q). True latent factor scores.
A_true
Numeric matrix (p x q). True factor loadings.
d_true
Numeric vector (length p). Item intercepts.
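For orientation, a logistic factor model with the shapes listed above can be simulated in a few lines of base R. The sampling distributions for `theta`, `A`, and `d` below are assumptions chosen for illustration; the package may draw them differently.

```r
set.seed(1)
n <- 100; p <- 50; q <- 3

# Assumed sampling distributions; the package may use different ones.
theta <- matrix(runif(n * q, -2, 2), n, q)   # factor scores, n x q
A     <- matrix(runif(p * q, -2, 2), p, q)   # loadings, p x q
d     <- runif(p, -1, 1)                     # item intercepts, length p

# Linear predictor: theta A^T plus intercept d_j added to column j
eta  <- sweep(theta %*% t(A), 2, d, `+`)
prob <- plogis(eta)                          # inverse logit

# Independent Bernoulli draw for each entry
resp <- matrix(rbinom(n * p, size = 1, prob = as.vector(prob)), n, p)
dim(resp)
```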
Generate binary data with missing values
Description
Generate simulated data from a binary (logistic) factor model with missing values.
Usage
generate_binary_data_miss(n = 100, p = 50, q = 3, miss_prop = 0.05)
Arguments
| n | Integer. Number of observations. |
|---|---|
| p | Integer. Number of variables. |
| q | Integer. True number of latent factors. |
| miss_prop | Numeric in (0,1). Proportion of missing values (default 0.05). |
Value
A named list with components:
resp
Binary matrix (n x p). Generated 0/1 responses with missing values (NA).
resp_complete
Binary matrix (n x p). Complete data before missingness.
true_q
Integer. True number of factors used in simulation.
theta_true
Numeric matrix. True latent factor scores.
A_true
Numeric matrix. True factor loadings.
d_true
Numeric vector (length p). Item intercepts.
miss_prop
Numeric. Proportion of entries set to missing.
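The relation between `resp_complete`, `resp`, and `miss_prop` can be illustrated as follows. The sketch assumes that cells are removed uniformly at random, which matches the pECV.miss example in this manual but is not confirmed for this generator.

```r
set.seed(1)
resp_complete <- matrix(rbinom(100 * 50, 1, 0.5), 100, 50)
miss_prop <- 0.05

# Pick round(miss_prop * n * p) cells uniformly at random (assumed mechanism)
n_miss   <- round(miss_prop * length(resp_complete))
miss_idx <- sample(length(resp_complete), n_miss)

# Blank out the selected cells; all other entries are unchanged
resp <- resp_complete
resp[miss_idx] <- NA

mean(is.na(resp))
```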
Generate continuous data example
Description
Generate simulated data from a Gaussian factor model.
Usage
generate_continuous_data(n = 100, p = 50, q = 3, noise_sd = 1)
Arguments
| n | Integer. Number of observations. |
|---|---|
| p | Integer. Number of variables. |
| q | Integer. True number of latent factors. |
| noise_sd | Numeric. Standard deviation of Gaussian noise. |
Value
A named list with components:
resp
Numeric matrix (n x p). Generated observed data.
true_q
Integer. True number of factors used in simulation.
theta_true
Numeric matrix (n x (q+1)). True latent factor scores with intercept.
A_true
Numeric matrix (p x (q+1)). True factor loadings.
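The (q+1)-column shapes suggest that the intercept is carried by an extra column, as in the pECV example in this manual where `theta_true` starts with a column of ones. A minimal base R sketch under that assumption, with standard normal draws chosen only for illustration:

```r
set.seed(1)
n <- 100; p <- 50; q <- 3; noise_sd <- 1

# A leading column of ones carries the intercept (assumed layout),
# so both matrices have q + 1 columns.
theta <- cbind(1, matrix(rnorm(n * q), n, q))    # n x (q + 1)
A     <- matrix(rnorm(p * (q + 1)), p, q + 1)    # p x (q + 1)

# Observed data: low-rank signal plus Gaussian noise
resp <- theta %*% t(A) + matrix(rnorm(n * p, sd = noise_sd), n, p)
dim(resp)
```

With this layout the first column of `A` plays the role of the per-variable intercept.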
Generate continuous data with missing values
Description
Generate simulated data from a Gaussian factor model with missing values.
Usage
generate_continuous_data_miss(
n = 100,
p = 50,
q = 3,
noise_sd = 1,
miss_prop = 0.05
)
Arguments
| n | Integer. Number of observations. |
|---|---|
| p | Integer. Number of variables. |
| q | Integer. True number of latent factors. |
| noise_sd | Numeric. Standard deviation of Gaussian noise. |
| miss_prop | Numeric in (0,1). Proportion of missing values (default 0.05). |
Value
A named list with components:
resp
Numeric matrix (n x p). Generated data with missing values (NA).
resp_complete
Numeric matrix (n x p). Complete data before missingness.
true_q
Integer. True number of factors used in simulation.
theta_true
Numeric matrix (n x (q+1)). True latent factor scores with intercept.
A_true
Numeric matrix (p x (q+1)). True factor loadings.
miss_prop
Numeric. Proportion of entries set to missing.
Generate count data example
Description
Generate simulated data from a Poisson factor model.
Usage
generate_count_data(n = 100, p = 50, q = 3)
Arguments
| n | Integer. Number of observations. |
|---|---|
| p | Integer. Number of variables. |
| q | Integer. True number of latent factors. |
Value
A named list with components:
resp
Integer matrix (n x p). Generated Poisson observations.
true_q
Integer. True number of factors used in simulation.
theta_true
Numeric matrix (n x (q+1)). True latent factor scores with intercept.
A_true
Numeric matrix (p x (q+1)). True factor loadings.
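A possible usage example that feeds simulated counts into estimate_C() after the log(X + 1) transformation recommended in that function's Details. Only functions, arguments, and return components listed in this manual are used, and the block is skipped when pECV is not installed.

```r
if (requireNamespace("pECV", quietly = TRUE)) {
  set.seed(123)
  sim <- pECV::generate_count_data(n = 100, p = 50, q = 3)
  str(sim, max.level = 1)            # resp, true_q, theta_true, A_true

  # Counts are log-transformed before estimating the constraint constant
  C_count <- pECV::estimate_C(log(sim$resp + 1), qmax = 5)
  print(C_count$C_est)
}
```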
Generate count data with missing values
Description
Generate simulated data from a Poisson factor model with missing values.
Usage
generate_count_data_miss(n = 100, p = 50, q = 3, miss_prop = 0.05)
Arguments
| n | Integer. Number of observations. |
|---|---|
| p | Integer. Number of variables. |
| q | Integer. True number of latent factors. |
| miss_prop | Numeric in (0,1). Proportion of missing values (default 0.05). |
Value
A named list with components:
resp
Integer matrix (n x p). Generated data with missing values (NA).
resp_complete
Integer matrix (n x p). Complete data before missingness.
true_q
Integer. True number of factors used in simulation.
theta_true
Numeric matrix (n x (q+1)). True latent factor scores with intercept.
A_true
Numeric matrix (p x (q+1)). True factor loadings.
miss_prop
Numeric. Proportion of entries set to missing.
Entrywise Splitting Cross-Validation for Factor Models
Description
Uses (Penalized) Entrywise Splitting Cross-Validation (ECV / pECV) to estimate the number of latent factors in generalized factor models.
Usage
pECV(
resp,
C = 5,
qmax = 8,
fold = 5,
tol_val = 0.01,
theta0 = NULL,
A0 = NULL,
seed = 1,
data_type = NULL
)
Arguments
| resp | Observation data matrix (n x p); can be continuous, count, or binary. |
|---|---|
| C | Constraint constant, default is 5. |
| qmax | Maximum number of factors to consider, default is 8. |
| fold | Number of folds in cross-validation, default is 5. |
| tol_val | Convergence tolerance, default is 0.01 (interpreted as 0.01 / number of estimated elements). |
| theta0 | Optional initial matrix for factors; sampled from a uniform distribution if not provided. |
| A0 | Optional initial matrix for loadings; sampled from a uniform distribution if not provided. |
| seed | Random seed, default is 1. |
| data_type | Data type, one of "continuous", "count", "binary". If not specified, it is auto-detected. |
Details
The example below may take more than 5 seconds on some machines and is therefore not run during routine checks.
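This manual does not describe how `data_type` is auto-detected when it is left as NULL. One plausible rule of thumb is sketched below; it is a guess for illustration only, and the helper name `guess_data_type` is hypothetical.

```r
# Hypothetical rule of thumb; the package's actual detection logic may differ.
guess_data_type <- function(resp) {
  v <- resp[!is.na(resp)]
  if (all(v %in% c(0, 1))) {
    "binary"
  } else if (all(v >= 0) && all(v == round(v))) {
    "count"
  } else {
    "continuous"
  }
}

guess_data_type(matrix(c(0, 1, 1, 0), 2))
guess_data_type(matrix(c(0, 3, 7, 2), 2))
guess_data_type(matrix(c(0.5, -1.2, 3, 4), 2))
```

Any rule of this kind is ambiguous for edge cases, such as count data that happens to contain only zeros and ones, so passing `data_type` explicitly is the safer choice when the type is known.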
Value
A named list with components:
ECV
Integer. Number of factors selected by standard ECV.
p1ECV
Integer. Number of factors selected by ECV with penalty 1.
p2ECV
Integer. Number of factors selected by ECV with penalty 2.
p3ECV
Integer. Number of factors selected by ECV with penalty 3.
p4ECV
Integer. Number of factors selected by ECV with penalty 4.
ECV_loss
Numeric vector. Cross-validation loss for each candidate factor number (typically of length qmax).
data_type
Character. The detected/used data type: "continuous", "count", or "binary".
The return value has base R types (no special S3/S4 class).
Examples
set.seed(123)
# Generate count data
n <- 50; p <- 50; q <- 2
theta_true <- cbind(1, matrix(runif(n * q, -2, 2), n, q))
A_true <- matrix(runif(p * (q + 1), -2, 2), p, (q + 1))
lambda <- exp(theta_true %*% t(A_true))
resp <- matrix(
rpois(length(lambda), lambda = as.vector(lambda)),
nrow = nrow(lambda), ncol = ncol(lambda)
)
result <- pECV(resp, C = 4, qmax = 4, fold = 5)
print(result)
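The idea behind entrywise splitting can be illustrated on Gaussian data with base R alone. The sketch below is not the package's algorithm: it replaces the constrained alternating maximization with iterative SVD imputation, covers only the unpenalized criterion, and all names (`fold_id`, `fit_rank_q`, `cv_loss`) are hypothetical.

```r
set.seed(1)
n <- 60; p <- 40; q_true <- 2; qmax <- 4; fold <- 5

# Toy Gaussian factor data (not produced by the package generators)
X <- matrix(rnorm(n * q_true), n, q_true) %*%
  t(matrix(rnorm(p * q_true), p, q_true)) +
  matrix(rnorm(n * p, sd = 0.5), n, p)

# Entrywise split: every cell of X gets its own fold label
fold_id <- matrix(sample(rep_len(seq_len(fold), n * p)), n, p)

# Rank-q fit to the training cells by iterative SVD imputation (stand-in)
fit_rank_q <- function(X, train, q, n_iter = 30) {
  M <- X
  M[!train] <- mean(X[train])                 # initial fill for held-out cells
  for (it in seq_len(n_iter)) {
    s <- svd(M, nu = q, nv = q)
    low_rank <- s$u %*% (s$d[seq_len(q)] * t(s$v))
    M[!train] <- low_rank[!train]             # refresh held-out cells only
  }
  low_rank
}

# Held-out squared error, averaged over folds, for each candidate q
cv_loss <- sapply(seq_len(qmax), function(q) {
  mean(sapply(seq_len(fold), function(k) {
    train <- fold_id != k
    low_rank <- fit_rank_q(X, train, q)
    mean((X[!train] - low_rank[!train])^2)
  }))
})

q_hat <- which.min(cv_loss)
print(round(cv_loss, 3))
print(q_hat)
```

Splitting by entry rather than by row means every observation and every variable contributes training cells in each fold, so factor scores and loadings can both be estimated before predicting the held-out cells.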
Entrywise Splitting Cross-Validation with Missing Data
Description
Uses (Penalized) Entrywise Splitting Cross-Validation to estimate the number of latent factors in generalized factor models when the data contain missing values.
Usage
pECV.miss(
resp,
C = 5,
qmax = 8,
fold = 5,
tol_val = 0.01,
theta0 = NULL,
A0 = NULL,
seed = 1,
data_type = NULL
)
Arguments
| resp | Observation data matrix (n x p) with missing values as NA; can be continuous, count, or binary. |
|---|---|
| C | Constraint constant, default is 5. |
| qmax | Maximum number of factors to consider, default is 8. |
| fold | Number of folds in cross-validation, default is 5. |
| tol_val | Convergence tolerance, default is 0.01 (interpreted as 0.01 / number of estimated elements). |
| theta0 | Optional initial matrix for factors; sampled from a uniform distribution if not provided. |
| A0 | Optional initial matrix for loadings; sampled from a uniform distribution if not provided. |
| seed | Random seed, default is 1. |
| data_type | Data type, one of "continuous", "count", "binary". If not specified, it is auto-detected. |
Details
The example below may take more than 5 seconds on some machines and is therefore not run during routine checks.
Value
A named list with components:
ECV
Integer. Number of factors selected by standard ECV.
p1ECV
Integer. Number of factors selected by ECV with penalty 1.
p2ECV
Integer. Number of factors selected by ECV with penalty 2.
p3ECV
Integer. Number of factors selected by ECV with penalty 3.
p4ECV
Integer. Number of factors selected by ECV with penalty 4.
ECV_loss
Numeric vector. Cross-validation loss for each candidate factor number (typically of length qmax).
data_type
Character. The detected/used data type: "continuous", "count", or "binary".
miss_percent
Numeric scalar. Percentage of missing entries in resp.
The return value uses base R types (no special S3/S4 class).
Examples
set.seed(123)
# Generate count data with missing values
n <- 50; p <- 50; q <- 2
theta_true <- cbind(1, matrix(runif(n * q, -2, 2), n, q))
A_true <- matrix(runif(p * (q + 1), -2, 2), p, (q + 1))
lambda <- exp(theta_true %*% t(A_true))
resp <- matrix(
rpois(length(lambda), lambda = as.vector(lambda)),
nrow = nrow(lambda), ncol = ncol(lambda)
)
# Introduce 5% missing values
miss_idx <- sample(1:(n * p), size = 0.05 * n * p)
resp[miss_idx] <- NA
result <- pECV.miss(resp, C = 4, qmax = 4, fold = 5)
print(result)
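With missing values, two bookkeeping steps are natural: reporting the share of missing cells and restricting the cross-validation folds to observed cells. The base R sketch below illustrates both. It assumes that `miss_percent` is on a 0 to 100 scale and that folds are drawn over observed cells only; neither detail is confirmed by this manual.

```r
set.seed(1)
n <- 50; p <- 50; fold <- 5
resp <- matrix(rpois(n * p, lambda = 3), n, p)
resp[sample(n * p, 125)] <- NA

# Share of missing cells, expressed as a percentage (assumed scale)
miss_percent <- 100 * mean(is.na(resp))

# Assign fold labels to observed cells only; missing cells stay NA
obs <- which(!is.na(resp))
fold_id <- matrix(NA_integer_, n, p)
fold_id[obs] <- sample(rep_len(seq_len(fold), length(obs)))

miss_percent
table(fold_id, useNA = "ifany")
```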