Towards a post-clustering test for differential expression

New Results

doi: https://doi.org/10.1101/463265

Abstract

Single-cell technologies have seen widespread adoption in recent years. The datasets generated by these technologies provide information on up to millions of individual cells; however, the identities of the cells are often determined only computationally. Single-cell computational pipelines involve two critical steps: organizing the cells in a biologically meaningful way (clustering) and identifying the markers that drive this organization (differential expression analysis). Because clustering algorithms force separation, performing differential expression analysis on the same dataset that was used for clustering generates artificially low _p_-values, potentially resulting in false discoveries. In this work, we introduce the truncated normal (TN) test, a test based on the truncated normal distribution that substantially corrects for this problem. We present a data-splitting-based framework that leverages the TN test to return reasonable _p_-values for arbitrary clustering schemes. We demonstrate the efficacy of our solution on simulated and real datasets, and we provide our code at https://github.com/jessemzhang/tn_test.
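The selection effect described in the abstract can be illustrated with a small simulation. This is a minimal sketch, not the TN test and not the code from the linked repository. It assumes a single gene drawn from one homogeneous normal population, uses a median split as a stand-in for a clustering algorithm, and uses Welch's _t_-test as the naive differential expression test. The function names and parameters are hypothetical.

```python
import numpy as np
from scipy import stats


def naive_post_clustering_pvalue(n_cells=200, seed=0):
    """p-value from testing the same data that defined the clusters."""
    rng = np.random.default_rng(seed)
    # One gene, one homogeneous population: no true clusters exist.
    expression = rng.normal(loc=0.0, scale=1.0, size=n_cells)
    # Stand-in "clustering" (an assumption): split cells at the median.
    labels = expression > np.median(expression)
    # Naive Welch t-test between the two forced clusters.
    result = stats.ttest_ind(
        expression[labels], expression[~labels], equal_var=False
    )
    return result.pvalue


def random_label_pvalue(n_cells=200, seed=0):
    """p-value when the labels are assigned independently of the data."""
    rng = np.random.default_rng(seed)
    expression = rng.normal(loc=0.0, scale=1.0, size=n_cells)
    # Random half/half labelling that ignores the expression values.
    labels = rng.permutation(n_cells) < n_cells // 2
    result = stats.ttest_ind(
        expression[labels], expression[~labels], equal_var=False
    )
    return result.pvalue


if __name__ == "__main__":
    naive = [naive_post_clustering_pvalue(seed=s) for s in range(200)]
    control = [random_label_pvalue(seed=s) for s in range(200)]
    print("fraction of p < 0.05, clustered labels:",
          np.mean(np.array(naive) < 0.05))
    print("fraction of p < 0.05, random labels:",
          np.mean(np.array(control) < 0.05))
```

Under these assumptions, the clustered labels are expected to reject the null in essentially every run even though no real difference exists, while the random labels reject at roughly the nominal rate. This is the kind of miscalibration a post-clustering test needs to correct.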

Copyright

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.