Transfer learning improves outcome predictions for ASD from gene expression in blood

2021, bioRxiv (Cold Spring Harbor Laboratory)

Background Predicting outcomes in human genetic studies is difficult because the number of variables (genes) is often much larger than the number of observations (human subject tissue samples). We investigated ways to improve model performance on the under-constrained problems that are typical in human genetics, where the number of strongly correlated genes (features) may exceed 10,000 and the number of study participants (observations) may be fewer than 1,000.

Methods We created 'train', 'validate' and 'test' datasets from 240 microarray observations from 127 subjects diagnosed with autism spectrum disorder (ASD) and 113 'typically developing' (TD) subjects. We trained a neural network model (the 'naive' model) on 10,422 genes using the 'train' dataset, composed of 70 ASD and 65 TD subjects. To minimize the number of trainable parameters, we restricted the model to a single fully connected hidden layer, and we included a dropout layer to help prevent overfitting. We experimented with alternative network architectures and tuned the hyperparameters using the 'validate' dataset, then performed a single, final evaluation using the holdout 'test' dataset. Next, we trained a neural network model with the identical architecture and identical genes to predict tissue type in GTEx data. We transferred that learning by replacing the top layer of the GTEx model with a layer to predict ASD outcome, and we retrained the new layer on the ASD dataset, again using the identical 10,422 genes.
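To make the parameter-count argument concrete, the following is a minimal sketch in plain Python that tallies trainable and frozen parameters for the naive model and for the transferred model. It is not the authors' implementation. The gene count (10,422) and the training-set size (70 ASD + 65 TD) come from the text. The hidden-layer width (`HIDDEN_UNITS`), the number of GTEx tissue classes (`N_TISSUES`), the single-unit ASD output, and the choice to freeze the hidden layer during retraining are all assumptions made for illustration. The `Dense`, `count_params` and `transfer` helpers are hypothetical names, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class Dense:
    """A fully connected layer, described only by its shape."""
    n_in: int
    n_out: int
    trainable: bool = True

    @property
    def n_params(self) -> int:
        # One weight per input-output pair, plus one bias per output unit.
        return self.n_in * self.n_out + self.n_out


def count_params(layers):
    """Return (trainable, frozen) parameter counts for a list of layers."""
    trainable = sum(layer.n_params for layer in layers if layer.trainable)
    frozen = sum(layer.n_params for layer in layers if not layer.trainable)
    return trainable, frozen


def transfer(pretrained, n_outputs):
    """Freeze every layer except the top one, then swap in a new top layer."""
    body = [Dense(l.n_in, l.n_out, trainable=False) for l in pretrained[:-1]]
    head = Dense(pretrained[-1].n_in, n_outputs, trainable=True)
    return body + [head]


N_GENES = 10_422     # from the text
N_TRAIN = 70 + 65    # ASD + TD subjects in the 'train' dataset, from the text
HIDDEN_UNITS = 32    # assumption: the text does not state the layer width
N_TISSUES = 30       # assumption: the text does not state the class count

# Naive model: genes -> hidden layer -> ASD/TD output (dropout adds no parameters).
naive = [Dense(N_GENES, HIDDEN_UNITS), Dense(HIDDEN_UNITS, 1)]

# GTEx model: same hidden layer shape, but the top layer predicts tissue type.
gtex = [Dense(N_GENES, HIDDEN_UNITS), Dense(HIDDEN_UNITS, N_TISSUES)]

# Transferred model: reuse the GTEx hidden layer, retrain only a new ASD layer.
transferred = transfer(gtex, n_outputs=1)

for name, model in [("naive", naive), ("transferred", transferred)]:
    trainable, frozen = count_params(model)
    print(f"{name}: {trainable} trainable, {frozen} frozen, "
          f"{trainable / N_TRAIN:.1f} trainable parameters per training subject")
```

Under these assumptions, the naive model must fit several hundred thousand weights from 135 training subjects, while the transferred model fits only the few dozen weights of the new top layer. This is one way to see why pretraining on a larger dataset can help with an under-constrained problem.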