Compute a single dataset-level MNAR index using logistic regression
mnar_global_score.RdSummarises missingness structure across the whole dataset into a single value by fitting two pooled logistic regressions across all (feature x sample) observations simultaneously:
Usage
mnar_global_score(
obj,
i,
group_cols,
min_observed = 2,
min_missing = 1,
subset_features = TRUE,
log_transform = FALSE
)Arguments
- obj
A QFeatures object.
- i
Integer or character. Index or name of the assay to analyse.
- group_cols
Character vector. One or more column names from colData(obj) that define experimental groups (e.g. c("condition", "treatment")). Multiple columns are combined into an interaction term.
- min_observed
Integer. Minimum number of observed (non-missing) values required for a feature to be included in the model. Default: 2.
- min_missing
Integer. Minimum number of missing values required for a feature to be included in the model. Default: 1.
- subset_features
Logical. If TRUE (default), features with fewer than
min_observedobserved values or fewer thanmin_missingmissing values are excluded, as they are uninformative and would dilute the intensity and group effect estimates.- log_transform
Logical. If TRUE, apply log2 to the assay intensities before computing the per-feature mean intensity covariate. Set to TRUE if your data has not already been log-transformed. Default: FALSE.
Value
A list with the following elements:
- tjur_intensity_only
Numeric. Tjur's R² of the intensity-only model. The discrimination between missing and observed explained by feature abundance alone — the universal intensity-dependent baseline present in all proteomics data.
- tjur_incremental
Numeric. The increase in Tjur's R² when condition is added on top of intensity. This is the primary quantity of interest: condition-specific missingness beyond what intensity already explains. Near zero means missingness is purely intensity-driven; positive values indicate condition-specific depletion of particular features.
- tjur_condition_fraction
Numeric. The fraction of total explainable missingness (tjur_full) that is condition-specific rather than intensity-driven. Ranges from 0 (all intensity-driven) to 1 (all condition-specific).
- tjur_full
Numeric. Tjur's R² of the full model (intensity + condition). Equal to tjur_intensity_only + tjur_incremental.
- lrt_chi2
Numeric. Chi-squared statistic from the likelihood ratio test of the full vs intensity-only model.
- lrt_df
Integer. Degrees of freedom for the LRT.
- lrt_pvalue
Numeric. P-value for the LRT — tests whether condition adds significant explanatory power over intensity alone.
- model_intensity
The fitted intensity-only
glmobject.- model_full
The fitted full
glmobject.- interpretation
Character string. Human-readable summary of all key quantities.
- n_features
Integer. Number of features included in the models.
- n_obs_total
Integer. Total number of (feature x sample) rows.
Details
mod_intensity : missing ~ intensity mod_full : missing ~ intensity + group
The incremental Tjur's R² (full - intensity) is the primary quantity of interest: it measures condition-specific missingness beyond what feature abundance alone would predict.
Tjur's R² (discrimination coefficient) is computed as:
mean(P(missing | truly missing)) - mean(P(missing | truly observed))
It ranges from 0 (no discrimination) to 1 (perfect discrimination) and is a natural measure of how well the model separates missing from observed.
Pooled logistic regression (no random effects) is used in preference to a linear mixed model. A per-feature random intercept would absorb the very variance that intensity is meant to explain (low-abundance features have both low intensity and high baseline missingness), suppressing the intensity fixed effect. The pooled approach is analogous to the detection probability curve (DPC) fitted by the limpa package.
The intensity covariate is the per-feature mean observed intensity, scaled
to mean 0 and sd 1. If your data has not been log-transformed upstream,
set log_transform = TRUE so that the sigmoidal intensity-missingness
relationship is approximately linear on the log scale, consistent with how
DPC curves are typically modelled.
The decomposition is:
tjur_full = tjur_intensity_only + tjur_incremental
A large tjur_intensity_only with small tjur_incremental means missingness is driven by abundance uniformly across conditions — the typical baseline. A substantial tjur_incremental means certain features are specifically depleted in particular conditions beyond what their overall abundance would predict.
Important limitation — cancellation of opposing condition effects:
Because the group coefficients are estimated once across all features simultaneously, this function measures whether one condition tends to have more missingness than another on average across all features. It captures asymmetric, directional missingness (e.g. a condition that is globally depleted), but it cannot detect condition effects that cancel in aggregation.
Consider a dataset where peptide A is missing entirely in condition X and
peptide B is missing entirely in condition Y. Both features are strongly
MNAR, but their opposing condition associations cancel in the pooled model:
condition X sees elevated missingness from peptide A and suppressed
missingness from peptide B, and vice versa for condition Y. The net group
coefficients may be near zero, and tjur_incremental can be close to
zero even when every missing value in the dataset is condition-structured.
If your experiment may contain features with opposing condition-missingness
patterns — which is common when comparing multiple biological conditions —
use mnar_index on the output of mnar_score
instead. mnar_index aggregates absolute per-feature Tjur R² scores
and is immune to this cancellation problem. mnar_global_score is
best suited to experiments where you expect missingness to be directionally
consistent across features (e.g. one condition is globally lower abundance).
References
Tjur T (2009). Coefficients of determination in logistic regression models — a new proposal: the coefficient of discrimination. The American Statistician, 63(4), 366-372.
Examples
if (FALSE) { # \dontrun{
idx <- mnar_global_score(obj, i = "peptides", group_cols = "condition")
# Primary quantities
cat("Intensity-only Tjur R²: ", round(idx$tjur_intensity_only, 3), "\n")
cat("Incremental (condition): ", round(idx$tjur_incremental, 3), "\n")
cat("Condition fraction: ", round(idx$tjur_condition_fraction, 3), "\n")
cat("LRT p-value: ", format.pval(idx$lrt_pvalue), "\n")
cat(idx$interpretation, "\n")
# If cancellation is a concern, prefer mnar_index() instead
mnar_res <- mnar_score(obj, i = "peptides", group_cols = "condition")
idx2 <- mnar_index(mnar_res$summary)
} # }