Incomplete human reference genomes can drive false sex biases and expose patient-identifying information in metagenomic data
doi: 10.1038/s41467-025-56077-5.
Caitlin Guccione, Lucas Patel, Yoshihiko Tomofuji, Antonio Gonzalez, Gregory D Sepich-Poore, Kyuto Sonehara, Mohsen Zakeri, Yang Chen, Amanda Hazel Dilmore, Neil Damle, Sergio E Baranzini, George Hightower, Teruaki Nakatsuji, Richard L Gallo, Ben Langmead, Yukinori Okada, Kit Curtius, Rob Knight
- PMID: 39827261
- PMCID: PMC11742726
- DOI: 10.1038/s41467-025-56077-5
Caitlin Guccione et al. Nat Commun. 2025.
Abstract
As next-generation sequencing technologies produce deeper genome coverages at lower costs, there is a critical need for reliable computational host DNA removal in metagenomic data. We find that insufficient host filtration using prior human genome references can introduce false sex biases and inadvertently permit flow-through of host-specific DNA during bioinformatic analyses, which could be exploited for individual identification. To address these issues, we introduce and benchmark three host filtration methods of varying throughput, with concomitant applications across low biomass samples such as skin and high microbial biomass datasets including fecal samples. We find that these methods are important for obtaining accurate results in low biomass samples (e.g., tissue, skin). Overall, we demonstrate that rigorous host filtration is a key component of privacy-minded analyses of patient microbiomes and provide computationally efficient pipelines for accomplishing this task on large-scale datasets.
© 2025. The Author(s).
Conflict of interest statement
Competing interests: D.M. is a consultant for BiomeSense, Inc., has equity, and receives income. The terms of these arrangements have been reviewed and approved by the University of California, San Diego in accordance with its conflict of interest policies. G.H. is the recipient of the Robert A. Winn Diversity in Clinical Trials: Career Development Award, which is partly funded by the Bristol Myers Squibb Foundation. B.L. is the owner of InOrder Labs LLC. K.C. has research grant support from Phathom Pharmaceuticals. R.K. is a scientific advisory board member and consultant for BiomeSense, Inc., has equity, and receives income. He is a scientific advisory board member and has equity in GenCirq. He is a consultant for DayTwo and receives income. He has equity in and acts as a consultant for Cybele. He is a co-founder of Biota, Inc., and has equity. He is a co-founder of Micronoma, has equity, and is a scientific advisory board member. The terms of these arrangements have been reviewed and approved by the University of California, San Diego in accordance with its conflict of interest policies. The remaining authors declare no competing interests.
Figures
Fig. 1. Sex biases identified in inadequately host-filtered human tumor tissue data.
a RPCA of microbial relative abundances from tumor samples in the Hartwig Medical Foundation database, which were originally host-filtered with GRCh38.p7 only. Male and female groups differed significantly (PERMANOVA; pseudo-F = 65.4, p = 0.00025). b The same dataset and pre-processing steps as in a, but with the T2T-CHM13v2.0 reference genome added to host filtration. Male and female groups did not differ significantly (PERMANOVA; pseudo-F = 1.23, p = 0.29).
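For readers unfamiliar with the PERMANOVA statistic reported above, the following is a minimal sketch of how a pseudo-F value and a permutation p value can be computed from a distance matrix and group labels. It is not the authors' code: the function names `pseudo_f` and `permanova`, the toy distance matrix, and the permutation count are all assumptions made for illustration.

```python
import random
from itertools import combinations


def pseudo_f(dist, labels):
    """Pseudo-F statistic from a square distance matrix and group labels."""
    n = len(labels)
    groups = sorted(set(labels))
    a = len(groups)
    # Total sum of squares: squared pairwise distances divided by sample count.
    ss_total = sum(dist[i][j] ** 2 for i, j in combinations(range(n), 2)) / n
    # Within-group sum of squares, computed the same way inside each group.
    ss_within = 0.0
    for g in groups:
        idx = [i for i, lab in enumerate(labels) if lab == g]
        ss_within += sum(dist[i][j] ** 2 for i, j in combinations(idx, 2)) / len(idx)
    ss_among = ss_total - ss_within
    return (ss_among / (a - 1)) / (ss_within / (n - a))


def permanova(dist, labels, permutations=999, seed=0):
    """Return (observed pseudo-F, permutation p value) by shuffling labels."""
    rng = random.Random(seed)
    observed = pseudo_f(dist, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if pseudo_f(dist, shuffled) >= observed:
            hits += 1
    # Count the observed labeling itself so the p value is never zero.
    return observed, (hits + 1) / (permutations + 1)


# Toy example (hypothetical data): four samples on a line, two per group.
points = [0.0, 1.0, 10.0, 11.0]
toy_dist = [[abs(x - y) for y in points] for x in points]
toy_labels = ["female", "female", "male", "male"]
f_stat, p_value = permanova(toy_dist, toy_labels)
```

With Euclidean distances on one-dimensional data, as in the toy example, the pseudo-F reduces to the classical one-way ANOVA F statistic, which makes the function easy to check by hand.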
Fig. 2. Host filtration pipeline and runtime evaluation.
a Pipeline of host filtration methods. b Using simulated data with a 50/50 mix of human data from HPRC and microbial data from FDA-ARGOS, we applied the three host filtration methods at three different sample sizes. Runtimes were averaged across 10 runs per sample size. HG38: GRCh38.p14, T2T: T2T-CHM13v2.0, HPRC: Human Pangenome Reference Consortium 2024 release.
Fig. 3. Host filtration pipeline simulated data validation.
Using the 10 simulated datasets of 1 million reads described in Fig. 2b, we calculated a the number of human reads remaining and b the number of microbial reads remaining after host filtration with Methods 1–3 (HPRC host filtration was performed excluding the 10 genomes used for data simulation). HG38: GRCh38.p14, T2T: T2T-CHM13v2.0, HPRC: Human Pangenome Reference Consortium 2024 release. Box plots show the median (center line), interquartile range (IQR; Q1–Q3; box), whiskers extending to Q1 − 1.5 × IQR and Q3 + 1.5 × IQR, minimum and maximum values at whisker ends, and points representing individual observations both within and beyond the whisker range.
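The box plot conventions described in the caption above can be made concrete with a short sketch. It computes the median, quartiles, IQR, and the Q1 − 1.5 × IQR and Q3 + 1.5 × IQR bounds for a list of read counts. The function name `box_stats` and the example values are hypothetical, and the "inclusive" quantile method is an assumption: plotting libraries differ in how they interpolate quartiles, so the figures may use slightly different values.

```python
from statistics import quantiles


def box_stats(values):
    """Summary statistics matching the box plot description in the caption."""
    # Quartiles via linear interpolation over the observed range (assumed method).
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    # Observations beyond the whisker bounds.
    beyond = [v for v in values if v < lower or v > upper]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "lower_bound": lower,
        "upper_bound": upper,
        "beyond_whiskers": beyond,
    }


# Hypothetical counts of reads remaining after host filtration.
example = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9])
with_outlier = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
```

In the second call, the value 100 falls above Q3 + 1.5 × IQR and is reported in `beyond_whiskers`, which corresponds to a point drawn beyond the whisker range.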
Fig. 4. Comparing human exome and tumor tissue samples across host filtration methods.
a Number of reads remaining after host-filtering 30 human exomes, each subsampled to 1 million reads, with each method. b 100 metastatic colorectal cancer tissue samples were selected from HMF, and read counts were calculated after applying the improved host filtration methods. HG38: GRCh38.p14, T2T: T2T-CHM13v2.0, HPRC: Human Pangenome Reference Consortium 2024 release. Box plots show the median (center line), interquartile range (IQR; Q1–Q3; box), whiskers extending to Q1 − 1.5 × IQR and Q3 + 1.5 × IQR, minimum and maximum values at whisker ends, and points representing individual observations both within and beyond the whisker range.
Fig. 5. Comparing human skin and fecal samples across host filtration methods.
a 87 human skin samples were host-filtered with the improved methods, and the percentage of reads remaining was calculated. b The percentage of reads remaining was calculated per sample for each of the 50 human fecal samples examined. HG38: GRCh38.p14, T2T: T2T-CHM13v2.0, HPRC: Human Pangenome Reference Consortium 2024 release. Box plots show the median (center line), interquartile range (IQR; Q1–Q3; box), whiskers extending to Q1 − 1.5 × IQR and Q3 + 1.5 × IQR, minimum and maximum values at whisker ends, and points representing individual observations both within and beyond the whisker range.
Fig. 6. Re-identification from a set of genotype data based on the human reads in fecal samples prevented with improved host filtration.
The 343 fecal samples from Tomofuji et al. Nature Microbiology 2023, with paired genotype data, were re-analyzed with various combinations of updated host filtration methods (GRCh38.p14, T2T-CHM13v2.0, Human Pangenome Reference Consortium 2024 release), resolving host data leakage. The x-axis of each plot indicates the number of bases used to calculate the likelihood scores. The y-axis indicates the two-sided P values calculated using a standard normal distribution based on the standardized likelihood scores. The red and blue dashed lines indicate p = 4.3 × 10⁻⁷ (0.05/117,649 tests) and p = 1.5 × 10⁻⁴ (0.05/343 tests), respectively. The results of the 117,649 tests (343 genotype data × 343 metagenome data) are indicated by the colors of the points. Some samples could not be used for the re-identification analysis because too few reads remained after filtering, hence the fewer dots shown across host filtration methods. A full description of the calculation of P values can be found in the Methods.
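The two significance thresholds in the caption above follow from a Bonferroni correction at a significance level of 0.05, and the P values come from a standard normal distribution. The sketch below reproduces that arithmetic. The function names are hypothetical, and the standardization of the likelihood scores themselves is described in the paper's Methods and is not reproduced here.

```python
from statistics import NormalDist


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


def two_sided_p(z):
    """Two-sided P value for a standardized score under a standard normal."""
    # Using the lower tail of -|z| avoids precision loss for large |z|.
    return 2.0 * NormalDist().cdf(-abs(z))


n_samples = 343
all_pairs = n_samples * n_samples  # 343 genotype data x 343 metagenome data

red_line = bonferroni_threshold(0.05, all_pairs)   # about 4.3e-7
blue_line = bonferroni_threshold(0.05, n_samples)  # about 1.5e-4
```

A standardized likelihood score would be called significant at the stricter threshold only if `two_sided_p(z) < red_line`.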
Update of
- Incomplete human reference genomes can drive false sex biases and expose patient-identifying information in metagenomic data.
Guccione C, Patel L, Tomofuji Y, McDonald D, Gonzalez A, Sepich-Poore GD, Sonehara K, Zakeri M, Chen Y, Dilmore AH, Damle N, Baranzini SE, Nakatsuji T, Gallo RL, Langmead B, Okada Y, Curtius K, Knight R. Res Sq [Preprint]. 2024 Oct 23:rs.3.rs-4721159. doi: 10.21203/rs.3.rs-4721159/v1. PMID: 39502785.
Grants and funding
- R01 CA241728/CA/NCI NIH HHS/United States
- DP1 AT010885/AT/NCCIH NIH HHS/United States
- R01 CA270235/CA/NCI NIH HHS/United States
- AGA Research Scholar Award AGA2022-13-05/AGA Research Foundation
- NIH/NIGMS T32GM007198/U.S. Department of Health & Human Services | National Institutes of Health (NIH)
- R21 HG013433/HG/NHGRI NIH HHS/United States
- T32 GM007198/GM/NIGMS NIH HHS/United States
- CDC award 75D301-22-C-14717/U.S. Department of Health & Human Services | Centers for Disease Control and Prevention (CDC)
- NIH Pioneer DP1AT010885/U.S. Department of Health & Human Services | National Institutes of Health (NIH)
- U19 AG063744/AG/NIA NIH HHS/United States
- NCI U24CA248454/U.S. Department of Health & Human Services | NIH | National Cancer Institute (NCI)
- P30 DK120515/DK/NIDDK NIH HHS/United States
- P30 CA023100/CA/NCI NIH HHS/United States
- U24 CA248454/CA/NCI NIH HHS/United States