Perform independent filtering to threshold the over-representation testing on the number of features in each category to limit the multiple testing burden — add_independent_filtering

When testing for enrichment of feature annotations, we frequently test thousands of annotations simultaneously, which increases the multiple testing burden considerably. Because many annotations have very few annotated features, tests on those annotations have too little power to detect over-representation. Inspired by DESeq2, which was in turn inspired by Bourgon, Gentleman, and Huber (2010), this function applies the principle of 'independent filtering' to choose the minimum number of features per annotation (in the background) that optimally limits the multiple testing burden. This is statistically valid because the filtering criterion (number of features per annotation) is statistically independent of the test statistic under the null hypothesis. Independent filtering increases power because the filtering criterion is correlated with the test statistic under the alternative hypothesis. For further justification and discussion of independent filtering, see Bourgon, Gentleman, and Huber (2010) and the DESeq2 vignette.
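The procedure described above can be sketched in base R. This is a simplified illustration, not the function's actual code: the name independent_filtering_sketch and the toy data are hypothetical, Benjamini-Hochberg adjustment is assumed, and the threshold is chosen as the raw maximum of the rejection counts, whereas DESeq2 additionally smooths the rejection curve before choosing.

```r
# Simplified sketch of independent filtering (hypothetical helper, base R only).
# Assumption: BH adjustment and a plain "most rejections wins" rule.
independent_filtering_sketch <- function(pvalues, filter_stat,
                                         alpha = 0.1,
                                         theta = seq(0, 1, 0.01)) {
  # Translate each fraction in theta into a cutoff on the filter statistic
  cutoffs <- quantile(filter_stat, probs = theta, names = FALSE)

  # For each cutoff, adjust only the p-values that pass the filter;
  # filtered-out tests get NA (one column per cutoff)
  padj_matrix <- vapply(cutoffs, function(cutoff) {
    keep <- filter_stat >= cutoff
    padj <- rep(NA_real_, length(pvalues))
    padj[keep] <- p.adjust(pvalues[keep], method = "BH")
    padj
  }, FUN.VALUE = numeric(length(pvalues)))

  # Count rejections at each threshold and keep the first one with the most
  n_rejections <- colSums(padj_matrix < alpha, na.rm = TRUE)
  best <- which.max(n_rejections)

  list(padj = padj_matrix[, best],
       theta = theta[best],
       cutoff = cutoffs[best],
       n_rejections = n_rejections)
}

# Toy data: large categories carry signal, small ones are pure noise
set.seed(1)
n <- 500
num_in_cat <- sample(1:200, n, replace = TRUE)
pvals <- runif(n)
enriched <- num_in_cat > 150
pvals[enriched] <- runif(sum(enriched), 0, 0.001)

res <- independent_filtering_sketch(pvals, num_in_cat)
res$cutoff                      # chosen minimum number of features per category
sum(res$padj < 0.1, na.rm = TRUE)  # rejections after filtering
```

Because theta = 0 corresponds to no filtering, the chosen threshold never yields fewer rejections than adjusting all p-values together.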

The code is largely adapted from DESeq2:::pvalueAdjustment.

Please cite Bourgon, Gentleman, and Huber (2010) when using this function.

Usage

add_independent_filtering_padj(
  obj,
  alpha = 0.1,
  theta = seq(0, 1, 0.01),
  p_value_col = "over_represented_pvalue",
  filter_col = "numInCat",
  plot = TRUE
)

Arguments

obj: data.frame containing goseq results as generated by get_enriched_go. See below for an example.
alpha: numeric alpha value to use for rejection of the null hypothesis. Note that this is only used for optimising the threshold; any alternative value for alpha can be used downstream.
theta: numeric vector of thresholds (fractions of the data to remove)
p_value_col: character Column with p-values
filter_col: character Column with filtering criteria values
plot: logical Plot the relationship between the

Value

Returns the data.frame with an added column denoting the optimised adjusted p-value ('padj_if')