What is Machine Learning? A Primer for the Epidemiologist
Correspondence to Dr. Justin Lessler, Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, 615 N. Wolfe Street, Room E6545, Baltimore, MD 21231 (e-mail: justin@jhu.edu).
Received: 16 November 2018; Revision received: 29 July 2019; Published: 21 October 2019
Qifang Bi, Katherine E Goodman, Joshua Kaminsky, Justin Lessler, What is Machine Learning? A Primer for the Epidemiologist, American Journal of Epidemiology, Volume 188, Issue 12, December 2019, Pages 2222–2239, https://doi.org/10.1093/aje/kwz189
Abstract
Machine learning is a branch of computer science that has the potential to transform epidemiologic sciences. Amid a growing focus on “Big Data,” it offers epidemiologists new tools to tackle problems for which classical methods are not well-suited. In order to critically evaluate the value of integrating machine learning algorithms and existing methods, however, it is essential to address language and technical barriers between the two fields that can make it difficult for epidemiologists to read and assess machine learning studies. Here, we provide an overview of the concepts and terminology used in machine learning literature, which encompasses a diverse set of tools with goals ranging from prediction to classification to clustering. We provide a brief introduction to 5 common machine learning algorithms and 4 ensemble-based approaches. We then summarize epidemiologic applications of machine learning techniques in the published literature. We recommend approaches to incorporate machine learning in epidemiologic research and discuss opportunities and challenges for integrating machine learning and existing epidemiologic research methods.
Machine learning is a branch of computer science that broadly aims to enable computers to “learn” without being directly programmed (1). It has origins in the artificial intelligence movement of the 1950s and emphasizes practical objectives and applications, particularly prediction and optimization. Computers “learn” in machine learning by improving their performance at tasks through “experience” (2, p. xv). In practice, “experience” usually means fitting to data; hence, there is not a clear boundary between machine learning and statistical approaches. Indeed, whether a given methodology is considered “machine learning” or “statistical” often reflects its history as much as genuine differences, and many algorithms (e.g., least absolute shrinkage and selection operator (LASSO), stepwise regression) may or may not be considered machine learning depending on whom you ask. Still, despite methodological similarities, machine learning is philosophically and practically distinguishable. At the risk of (considerable) oversimplification, machine learning generally emphasizes predictive accuracy over hypothesis-driven inference, usually focusing on large, high-dimensional (i.e., having many covariates) data sets (3, 4). Regardless of the precise distinction between approaches, in practice, machine learning offers epidemiologists important tools. In particular, a growing focus on “Big Data” emphasizes problems and data sets for which machine learning algorithms excel while more commonly used statistical approaches struggle.
This primer provides a basic introduction to machine learning with the aim of giving readers a foundation for critically reading studies based on these methods and a jumping-off point for those interested in using machine learning techniques in epidemiologic research. The “Concepts and Terminology” section of this paper presents concepts and terminology used in the machine learning literature. The “Machine Learning Algorithms” section provides a brief introduction to 5 common machine learning algorithms: artificial neural networks, decision trees, support vector machines, naive Bayes, and _k_-means clustering. These are important and commonly used algorithms that epidemiologists are likely to encounter in practice, but they are by no means a comprehensive survey of this large and highly diverse field. The following two sections, “Ensemble Methods” and “Epidemiologic Applications,” extend this examination to ensemble-based approaches and epidemiologic applications in the published literature. “Brief Recommendations” provides some recommendations for incorporating machine learning into epidemiologic practice, and the last section discusses opportunities and challenges.
CONCEPTS AND TERMINOLOGY
For epidemiologists seeking to integrate machine learning techniques into their research, language and technical barriers between the two fields can make reading source materials and studies challenging. Some machine learning concepts lack statistical or epidemiologic parallels, and machine learning terminology often differs even where the underlying concepts are the same. Here we briefly review basic machine learning principles and provide a glossary of machine learning terms and their statistical/epidemiologic equivalents (Table 1).
Table 1.
Glossary of Machine Learning and Epidemiology Terminology
Machine Learning Term(s) | Epidemiology Term(s) | Definition and Notes | Example |
---|---|---|---|
Attribute, feature, predictor, or field | Independent variable | Machine learning uses various terms to reference what epidemiologists would consider an “independent variable,” including attribute, feature, predictor, and field. | In a data set with 4 independent variables (BMIa, age, race, and SES) and a dependent variable (diabetes mellitus), BMI, age, race, and SES are attributes. |
Domain | Range of possible variable values | The domain is the set of possible values of an attribute. It can be continuous or categorical/binary. | If race is recorded in a data set as “1 = Caucasian, 2 = African-American, and 3 = other,” its domain is categorical and includes only the 3 referenced categories. |
Input and output | Independent (exposure) and dependent (outcome) variables | In machine learning, “input” refers to all of the predictors or independent variables that enter the model, and “output” generally refers to the predicted value (whether a number, classification, etc.) of the dependent variable or outcome. | BMI, age, race, and SES are model input. In a binary classification algorithm, the model output is a prediction of whether a subject does (D = 1) or does not (D = 0) have diabetes. |
Classifier, estimator | Model | “Classifiers” or “estimators” are used generally in the machine learning literature to refer to algorithms that perform a prediction or classification of interest. Their less common, though more technical, usage specifically refers to fully parameterized models that are used to predict or classify. | A decision tree is one type of machine learning classifier (general usage). The more specific usage of this term would refer only to a parameterized decision tree that has been fit in a data set (e.g., that predicts diabetes outcomes from BMI, age, sex, and SES). |
Learner | Model-fitting algorithm | A learner inputs a training set and outputs a classifier. Usually, but not always, learner refers to the fitting algorithm, while classifier refers to the fitted model. | In decision tree learning, the classification and regression trees (CART) algorithm, developed by Breiman et al. (27) in 1984, is one of multiple available learners for developing a decision tree classifier. |
Dimensionality | No. of covariates | No. of independent variables under consideration in a model. | A data set with 4 independent variables (BMI, age, race, and SES) and a dependent variable (diabetes) has 4 dimensions. |
Label | Value of dependent variables, outcomes | A variable’s label is its value for each observation (e.g., 0 or 1). Although labels can technically describe any variable, common shorthand is that “labeled data” refers to data in which the dependent variable assumes a value for all observations. | In a data set for which an investigator has collected information on diabetes status (outcome) for all subjects, this is “labeled” data. The label for diabetes is 0 or 1. Partially labeled data would have diabetes status missing for some subjects. |
Imbalanced data | Data set in which some cases or risk categories occur much less frequently than the others | In imbalanced machine learning data sets, the outcome or another risk category of interest occurs much less frequently, either because of the intrinsic nature of the problem (e.g., a rare disease in a database of medical records) or because of the sampling strategy (e.g., prevalence of cases in the study population is much lower than that in the target/source population). Heavily imbalanced data may pose challenges in some classification algorithms and require tuning parameters in order to correct for or otherwise address this imbalance. One method for addressing imbalanced data sets is to “balance” them artificially, either by oversampling instances of the minority class or undersampling instances of the majority class. | Assume a hypothetical data set of pediatric, normal-weight patients in which the prevalence of diabetes is 2%. This data set is imbalanced because the outcome is very rare, which can lead to poor sensitivity of classification algorithms without parameter tuning or other corrective methods. This imbalance is due to the intrinsic nature of the population we are evaluating (i.e., healthy children) and not due to the sampling strategy or other bias. |
Loss function | Error measure | In machine learning, a loss function is generally considered a penalty for misclassification when assessing a model’s predictive performance. | A simple loss function may be the absolute value of (predicted value minus true value). If a model predicts that a subject has diabetes (D = 1) and the subject does not (D = 0), the value of the loss function for this prediction is “1.” |
Abbreviations: BMI, body mass index; SES, socioeconomic status.
a Weight (kg)/height (m)2.
Supervised, unsupervised, and semisupervised learning
Machine learning is broadly classifiable by whether the computer’s learning (i.e., model-fitting) is “supervised” or “unsupervised.” Supervised learning is akin to the type of model-fitting that is standard in epidemiologic practice: The value of the outcome (i.e., the dependent variable), often called its “label” in machine learning, is known for each observation. Data with specified outcome values are called “labeled data.” Common supervised learning techniques include standard epidemiologic approaches such as linear and logistic regression, as well as many of the most popular machine learning algorithms (e.g., decision trees, support vector machines).
In unsupervised learning, the algorithm attempts to identify natural relationships and groupings within the data without reference to any outcome or the “right answer” (5, p. 517). Unsupervised learning approaches share similarities in goals and structure with statistical approaches that attempt to identify unspecified subgroups with similar characteristics (e.g., “latent” variables or classes) (6). Clustering algorithms, which group observations on the basis of similar data characteristics (e.g., both oranges and beach balls are round), are common unsupervised learning implementations. Examples may include _k_-means clustering and expectation-maximization clustering using Gaussian mixture models (7, 8).
Semisupervised learning fits models to both labeled and unlabeled data. Labeling data (outcomes) is often time-consuming and expensive, particularly for large data sets. Semisupervised learning supplements limited labeled data with an abundance of unlabeled data with the goal of improving model performance (studies show that unlabeled data can help build a better classifier, but appropriate model selection is critical) (9). For example, in a study of Web page classification, Nigam et al. (10) fit a naive Bayes classifier to labeled data and then used the same classifier to probabilistically label unlabeled observations (i.e., fill in missing outcome data). They then retrained a new classifier on the resulting, fully labeled data set, thereby achieving a 30% increase in Web page classification accuracy on data outside of the training set. Semisupervised learning can bear some similarity to statistical approaches for missing data and censoring (e.g., multiple imputation), though it focuses on imputing missing outcomes rather than missing covariates.
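To make the self-training idea concrete, the following is a minimal sketch using scikit-learn's SelfTrainingClassifier with a naive Bayes base model; the data, variable meanings, and confidence threshold are all invented for illustration.

```python
# A minimal self-training sketch (synthetic data): fit naive Bayes on the
# labeled subset, pseudo-label the most confident unlabeled observations,
# and refit. scikit-learn codes unlabeled outcomes as -1.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.semi_supervised import SelfTrainingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))                 # 4 hypothetical covariates
y_true = (X[:, 0] + X[:, 1] > 0).astype(int)   # invented binary outcome

y = y_true.copy()
y[200:] = -1                                   # only the first 200 outcomes are labeled

model = SelfTrainingClassifier(GaussianNB(), threshold=0.9)
model.fit(X, y)
print("accuracy on all observations:", (model.predict(X) == y_true).mean())
```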
Classification versus regression algorithms
Within the domain of supervised learning, machine learning algorithms can be further divided into classification or regression applications, depending upon the nature of the response variable. In general, in the machine learning literature, classification refers to prediction of categorical outcomes, while regression refers to prediction of continuous outcomes. We use this terminology throughout this primer and are explicit when referring to specific regression algorithms (e.g., logistic regression). Many machine learning algorithms that were developed to perform classification have been adapted to also address regression problems, and vice versa.
Generative versus discriminative algorithms
Machine learning algorithms, both supervised and unsupervised, can be discriminative or generative (11, 12). Discriminative algorithms directly model the conditional probability of an outcome, Pr(y|x) (the probability of y given x), in a set of observed data—for example, the probability that a subject has type 2 diabetes mellitus given a certain body mass index (BMI; weight (kg)/height (m)2). Most statistical approaches familiar to epidemiologists (e.g., linear and logistic regression) are discriminative, as are most of the algorithms discussed in this primer.
In contrast, while generative algorithms can also compute the conditional probability of an outcome, this computation occurs indirectly. Generative algorithms first model the joint probability distribution, Pr(x, y) (the probabilities associated with all possible combinations of x and y), or, continuing our example, a probabilistic model that accounts for all observed combinations of BMIs and diabetes outcomes (Table 2). This joint probability distribution can be transformed into a conditional probability distribution in order to classify data, as Pr(y|x) = Pr(x, y)/Pr(x). Because the joint probability distribution models the underlying data-generating process, generative models can also be used, as their name suggests, for directly generating new simulated data points reflecting the distribution of the covariates and outcome in the modeled population (11). However, because they model the full joint distribution of outcomes and covariates, generative models are generally more complex and require more assumptions to fit than discriminative algorithms (12, 13). Examples of generative algorithms include naive Bayes and hidden Markov models (11).
Table 2.
Matrix of Joint Probabilities for Body Mass Indexa (x) and Diabetes Mellitus (y) in a Data Set With 4 Dichotomized Observations: (0, 1), (1, 0), (1, 0), and (0, 0)
Diabetes Status | Overweight (BMI = 1) | Not Overweight (BMI = 0) |
---|---|---|
D = 1 | 0/4 | 1/4 |
D = 0 | 2/4 | 1/4 |
Abbreviation: BMI, body mass index.
a Weight (kg)/height (m)2.
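As a worked illustration of the Pr(y|x) = Pr(x, y)/Pr(x) identity, the short sketch below tabulates the joint distribution from Table 2 and converts it to conditional probabilities; it assumes the (x, y) pairs listed in the caption.

```python
# A worked version of Table 2 (assuming the (x, y) pairs in the caption):
# tabulate the joint distribution Pr(x, y), then convert to the
# conditional Pr(y | x) = Pr(x, y) / Pr(x).
import numpy as np

# rows: D = 1, D = 0; columns: overweight (BMI = 1), not overweight (BMI = 0)
joint = np.array([[0 / 4, 1 / 4],
                  [2 / 4, 1 / 4]])

pr_x = joint.sum(axis=0)      # marginal Pr(x) for each BMI column
conditional = joint / pr_x    # Pr(y | x), computed column-wise
print(conditional)            # Pr(D = 1 | BMI = 1) = 0.0; Pr(D = 1 | BMI = 0) = 0.5
```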
Reinforcement learning
In reinforcement learning, systems learn to excel at a task over time through trial and error (14). Reinforcement learning techniques take an iterative approach to learning by obtaining positive or negative feedback based on performance of a given task on some data (whether prediction, classification, or another action) and then self-adapting and attempting the task again on new data (though old data may be reencountered) (15). Depending on how it is implemented, this approach can be akin to supervised learning, or it may represent a semisupervised approach (as in generative adversarial neural networks (16)). Reinforcement learning algorithms often optimize the use of early, “exploratory” versions of a model—that is, task attempts—that perform poorly to gain information to perform better on future attempts, and then become less labile as the model “learns” more (15). Medical and epidemiologic applications of reinforcement learning have included modeling the effect of sequential clinical treatment decisions on disease progression (17) (e.g., optimizing first- and second-line therapy decisions for schizophrenia management (18)) and personalized, adaptive medication dosing strategies. For example, Nemati et al. (19) used reinforcement learning with artificial neural networks in a cohort of intensive-care-unit patients to develop individualized heparin dosing strategies that evolve as a patient’s clinical phenotype changes, in order to maximize the amount of time that blood drug levels remain within the therapeutic window.
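To give a flavor of the mechanics (this toy example is ours, not drawn from the cited studies), the sketch below applies tabular Q-learning, a basic reinforcement learning method, to an invented 3-state dosing problem: the agent receives positive feedback whenever the drug level stays in the therapeutic window and gradually learns a dose/hold policy.

```python
# A toy tabular Q-learning sketch (our invention, not from the cited
# studies): 3 drug-level states (0 = low, 1 = therapeutic, 2 = high),
# 2 actions (0 = hold, 1 = dose), reward 1 while the level is therapeutic.
import numpy as np

rng = np.random.default_rng(0)
Q = np.zeros((3, 2))                    # state-action value table
alpha, gamma, epsilon = 0.1, 0.9, 0.1   # learning rate, discount, exploration

def step(s, a):
    """Invented dynamics: dosing raises the level, holding lowers it."""
    s_next = min(s + 1, 2) if a == 1 else max(s - 1, 0)
    return s_next, 1.0 if s_next == 1 else 0.0

s = 0
for _ in range(5000):
    a = rng.integers(2) if rng.random() < epsilon else int(Q[s].argmax())
    s_next, r = step(s, a)
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])  # Q-learning update
    s = s_next

print(Q.argmax(axis=1))  # learned policy, e.g., dose when low, hold when high
```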
MACHINE LEARNING ALGORITHMS
In this section, we introduce 5 common machine learning algorithms: artificial neural networks, decision trees, support vector machines, naive Bayes, and _k_-means clustering. For each, we include a brief description, summarize strengths and limitations, and highlight implementations available on common statistical computing platforms. This section is intended to provide a high-level introduction to these algorithms, and we refer interested readers to the cited references for further information.
Artificial neural networks
Artificial neural networks (ANNs) are inspired by the signaling behavior of neurons in biological neural networks. ANNs, which consist of a population of neurons interconnected through complex signaling pathways, use this structure to analyze complex interactions between a group of measurable covariates in order to predict an outcome. ANNs possess layers of “neurons” connected by “axons” (20) (Figure 1A). These layers are grouped into 1) an input layer, 2) one or more middle “hidden” layers, and 3) an output layer. The neurons in the input and output layers correspond to the independent and dependent variables, respectively. Neurons in adjacent layers communicate with each other through activation functions, which convert the weighted sum of a neuron’s inputs into an output (Figure 1B). Depending on the type of activation function, the output can be dichotomous (“1” when the weighted sum exceeds a given threshold and “0” otherwise) or continuous. The weights applied to a neuron’s inputs are somewhat analogous to coefficients in linear or logistic regression.
Figure 1.
A single artificial neuron, also called a perceptron (panel A), and a feed-forward neural network (a collection of multiple neurons organized in layers) that examines the hypothetical relationship between clinical and demographic predictors and a numerical outcome, fasting blood sugar level (panel B). Line (axon) thickness reflects input weight, and line type indicates direction of effect (solid = excitatory or positive; dashed = inhibitory or negative). Lack of a line (e.g., connecting “sex” to neuron C) indicates no input. Connections between input and output layers are exclusively mediated through the hidden layer (more complex artificial neural networks can have multiple hidden layers). At hidden layer neuron A, we observe that both body mass index (BMI; weight (kg)/height (m)2) and age exert positive inputs, and they demonstrate interactive effects with each other and race (the latter’s input is negative, as indicated by the dashed line). The weighted sum of these inputs results in activation of neuron A and positive output. In contrast, neuron B converts inputs from age, socioeconomic status (SES), and race into negative output (inversely correlated with fasting blood sugar), while neuron C’s inputs fail to surpass the activation function threshold; that is, there is no effect on the outcome mediated through neuron C.
Figure 1 illustrates a simple neural network with a single hidden layer and a feed-forward structure (i.e., signals progress unidirectionally from input to output layers). For supervised learning applications, once the numbers of layers and neurons are selected, the connection weights of the ANN are fit on a training set of labeled data through an iterative, error-driven training process. Initial connection weights are generally selected randomly, and network output is compared with the correct output (class labels) using a loss function, which is based on the difference between the predicted and true values of the outcome. The goal is to reduce the loss function to zero—that is, to make the ANN’s predicted output match truth as closely as possible, albeit while also protecting against overfitting. In response, 1) resulting error values are distributed backwards through the network, from output to input, in order to assign an error value contribution to each hidden and input layer neuron (called “back-propagation”; for additional technical information on this process, see, for example, Rumelhart et al. (21)), and 2) connection weights are updated in order to minimize the loss function (“weight adjustment”). This 2-fold optimization process repeats for a number of “epochs” or iterations until the network meets a prespecified stopping rule or error rate threshold (22, 23).
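A minimal sketch of such a network, assuming synthetic data and the single 3-neuron hidden layer of Figure 1, might look like the following with scikit-learn's multilayer perceptron, which fits its weights by back-propagation internally; variable meanings are hypothetical.

```python
# A minimal feed-forward ANN sketch in the spirit of Figure 1: 4 inputs,
# one hidden layer of 3 logistic-activation neurons, and a continuous
# output (fasting blood sugar). Data are synthetic, with an invented
# BMI-by-age interaction that the hidden layer must capture.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                       # standardized BMI, age, race, SES
y = 90 + 5 * X[:, 0] * X[:, 1] + rng.normal(size=500)

ann = MLPRegressor(hidden_layer_sizes=(3,),         # one hidden layer, 3 neurons
                   activation="logistic",
                   max_iter=5000, random_state=0)
ann.fit(X, y)                                       # weights fit by back-propagation
print("training R^2:", round(ann.score(X, y), 2))
```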
Strengths and limitations
Strengths of ANNs include their ability to accommodate variable interactions and nonlinear associations without user specification (22). The primary limitation of ANNs is that, although it is arguably not completely a “black box” (23, p. 1112), the underlying model nevertheless remains largely opaque. Effects are mediated exclusively through hidden layer(s), making interpreting relationships between input and output layers challenging, especially for “deep” ANNs, which include multiple hidden layers. This lack of transparency complicates commonsense or etiological interpretation of individual variable effects and connection weights, although there are continuing efforts to enhance ANN interpretability (20, 24, 25). ANN training parameters can also be complex, and setting and tuning these parameters generally necessitates technical expertise. Moreover, complex ANNs, including deep networks, can require large data sets (potentially in the tens or hundreds of thousands, although there is no hard-and-fast rule) in order to achieve optimal model performance, which may be prohibitive for some epidemiologic applications (26).
Sample statistical packages and modules
Available software includes neuralnet, nnet, deepnet, and TensorFlow in R (R Foundation for Statistical Computing, Vienna, Austria); Enterprise Miner Neural Network and AutoNeural in SAS (SAS Institute, Inc., Cary, North Carolina); and sklearn and TensorFlow in Python (Python Software Foundation, Wilmington, Delaware).
Decision trees
Decision trees (i.e., classification and regression trees (CART)) create a series of decision rules based on continuous and/or categorical input variables to predict an outcome (5, 27). Classification trees predict categorical outcomes, and regression trees predict continuous outcomes. CART analysis has been popularized as an umbrella term for any decision tree learning method (27). However, “CART” is also a common implementation algorithm in the epidemiologic and medical literature, although a number of other decision tree algorithms have also been developed (e.g., ID3, CHAID) (28–30).
Figure 2 presents a hypothetical classification tree for a binary outcome, diabetes. To derive a decision tree, the algorithm applies a splitting rule on successively smaller partitions of data, with each partition being a node on the tree. The partition consisting of all data is the root node; in Figure 2 this node is split on the basis of BMI. Splits are selected to minimize some measure of node impurity (i.e., diversity of classes) or heterogeneity (i.e., variance) in each resulting partition (the “daughter nodes”) (5, 27). The splitting process repeats on each branch of the tree until additional splits yield no further reductions in node impurity, or some other stopping criterion is reached (e.g., a specified minimum number of observations in terminal nodes or the value at which error is minimized in cross-validation (31)). In many algorithms, this splitting is often followed by a “pruning” step in which partitions are remerged (i.e., some bottom nodes are removed, making the final tree smaller) based on some criterion designed to increase generalizability (32).
Figure 2.
A hypothetical classification decision tree for predicting a binary outcome, type 2 diabetes mellitus. Body mass index (BMI; weight (kg)/height (m)2) occupies the root node (the most discriminatory variable in the data set); age, consumption of sweetened beverages, and physical activity occupy daughter nodes; and predicted diabetes status (yes/no) is reflected in the terminal or “leaf” nodes. Terminal node predictions proceed on the basis of simple majority rule (e.g., if 60% of patients in a terminal node are diabetes-positive, the entire terminal node will be classified as “Diabetes”). The cutpoints for the continuous variables, BMI and age, are algorithm-derived. The presence of age at different cutpoints in 2 different daughter nodes reflects likely interaction effects: The relationship between age and diabetes differs in patients with BMI ≤32 compared with patients with BMI >32 who do not routinely consume sweetened beverages.
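A tree like the one in Figure 2 can be fit and inspected in a few lines; the sketch below uses scikit-learn's CART-style implementation on synthetic data with an invented decision boundary, and export_text prints the learned rules.

```python
# A minimal CART-style classification tree loosely mirroring Figure 2.
# Data, cutpoints, and the decision boundary are invented; export_text
# prints the learned splitting rules.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
bmi = rng.uniform(18, 45, 500)
age = rng.uniform(20, 80, 500)
diabetes = ((bmi > 32) & (age > 55)).astype(int)    # hypothetical outcome rule

tree = DecisionTreeClassifier(max_depth=3,          # simple stopping criterion
                              min_samples_leaf=20)  # guards against overfitting
tree.fit(np.column_stack([bmi, age]), diabetes)
print(export_text(tree, feature_names=["BMI", "age"]))
```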
Strengths and limitations
Decision trees are generally easy to understand—it having been said that “[o]n interpretability, trees rate an A+” (4, p. 206)—making their output ideal for a range of target audiences. They are also flexible to nonlinear covariate effects and can incorporate higher-order interactions between covariates (27, 33). Trees may lose information by dichotomizing or categorizing variables where associations are continuous, and they can be unstable to even small data changes. Because most decision tree algorithms are “greedy” (splitting decisions are locally optimized at nodes), through a domino effect, dramatically different trees can result if even a single higher-level node shifts to a different variable (34). Hence, decision trees can be highly sensitive to small perturbations in data. Perhaps most fundamentally, decision trees are prone to overfitting, and their ultimate utility depends heavily on appropriately implemented pruning and/or stopping criteria. Ensemble-based decision trees (e.g., random forests) can address some of these concerns (see “Ensemble Methods” section), but they do not produce a single, easily interpretable tree.
Sample statistical packages and modules
Available software includes rpart, caret, ctree, and randomForest (ensemble decision trees) in R; CART (failure-time data only), CHAID, and CHAIDFOREST (ensemble decision trees) in Stata (StataCorp LLC, College Station, Texas); Enterprise Miner Decision Tree in SAS; and sklearn in Python.
Support vector machines
Support vector machines (SVMs) are a set of supervised learning methods used for classification and regression problems (35, 36). SVMs construct an optimal boundary, called a hyperplane, that best separates observations of different classes. In 1 dimension, this boundary is a point; in 2 dimensions, a line; and in 3, a plane (Figure 3). However, observations often need to be transformed before they can be separated by a hyperplane. SVMs address this problem by applying a data transformation called a “kernel function” to the data (3). Kernel functions project the data into a higher-dimensional space where the input variables are separable (Figure 3). The optimal kernel function is usually chosen from a set of commonly used kernel functions through cross-validation. Popular kernel functions include the polynomial, Gaussian, and sigmoid kernels. Following kernel function transformation, the best hyperplane maximizes the separation between the different classes (i.e., the margin, defined as the distance from the hyperplane to the closest data point), while tolerating a specified level of misclassification. SVMs are traditionally used for binary classification, but multiple pairwise comparisons can be applied for multiclass classification (36). Extensions to SVM techniques have also been developed that can be used to predict continuous outcomes (called support vector regression) (37).
Figure 3.
An illustration of data transformation with a support vector machine for predicting diabetes status. A) Hypothetical age and body mass index (BMI; weight (kg)/height (m)2) distribution of diabetic (black dots) and nondiabetic (gray dots) patients in 2-dimensional space. a and b are fixed parameters estimated from the data (see text). B) After transformation, these dots/patients who are not linearly separable in 2-dimensional space become linearly separable in 3-dimensional space. A hyperplane in 3-dimensional space is shown as a surface.
In Figure 3, persons with and without diabetes cannot be separated by a line in the 2-dimensional space based upon the predictors, age and BMI (Figure 3A). However, when we project the data into a 3-dimensional space by applying a kernel given by φ(age, BMI) = (age, BMI, (BMI − a) × (age − b)), where a and b are fixed parameters estimated from the data, the data are now separable in the 3-dimensional space by a plane (Figure 3B).
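The kernel-selection-by-cross-validation step described above can be expressed compactly; the following sketch compares standard kernels on synthetic, non-linearly-separable data (the variable names and the data-generating rule are invented).

```python
# Kernel selection by cross-validation on synthetic data that, like
# Figure 3, is not linearly separable; the kernel list and the
# data-generating rule are illustrative assumptions.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))            # standardized age and BMI
y = (X[:, 0] * X[:, 1] > 0).astype(int)  # separable only after transformation

search = GridSearchCV(SVC(), {"kernel": ["linear", "poly", "rbf", "sigmoid"]}, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 2))  # a nonlinear kernel should win
```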
Strengths and limitations
SVMs generally demonstrate low misclassification error and scale well to high-dimensional data (38). SVMs have reasonable interpretability, especially when a kernel function is not used. Where a kernel function is necessary, however, selecting the optimal kernel function typically requires experimenting with a set of standard functions. This approach can be time-consuming and does not guarantee that the optimal function was among those evaluated, and in some cases hand-crafted kernel functions are used instead.
Sample statistical packages and modules
Available software includes e1071, kernlab, and caret in R; svmachines (39) in Stata; PROC SVM in SAS; and sklearn in Python.
Naive Bayes algorithms
A naive Bayes algorithm is a simple probabilistic classification algorithm based upon Bayes’ theorem that makes the “naive” assumption of independence between predictive variables (40). Naive Bayes calculates the probability associated with each possible class conditional on a set of covariates—a quantity proportional to the product of the prior probability and the likelihood function. The classifier then selects the class with the highest probability as the “correct” class (Figure 4). The prior probability typically reflects one’s belief about the outcome, either based on the study itself or from other published literature. The independence assumption in naive Bayes greatly simplifies the calculation by decomposing the likelihood function into a product of likelihood functions, one for each covariate. Though adaptations of naive Bayes for regression exist (41), the algorithm is most commonly used for classification.
Figure 4.
A hypothetical naive Bayes algorithm for predicting a binary outcome, type 2 diabetes mellitus, in the subpopulation whose body mass index (BMI; weight (kg)/height (m)2) is over 32, whose age is over 55 years, and who are female. The prior probability of the class (e.g., diabetes status) and a product of the likelihood functions, one for each patient characteristic, determine the class assignment. If the posterior probability of being diabetic (D+) in this population, Pr(D+|BMI > 32, age > 55, female), is larger than the posterior probability of not being diabetic (D−) in this population, Pr(D−|BMI > 32, age > 55, female), then this population would be classified as having diabetes. The prior probability of being diabetic, Pr(D+), approximates the overall diabetes prevalence. Because of the independence assumption, the likelihood of observing people with this set of attributes—BMI > 32, age > 55 years, and female sex—among the persons with diabetes (i.e., Pr(BMI > 32, age > 55, female|D+)) can be approximated by the product of the likelihood of observing each attribute among persons with diabetes (i.e., Pr(BMI > 32|D+) × Pr(age > 55|D+) × Pr(female|D+)). For example, Pr(BMI > 32|D+) represents, among persons with diabetes, the likelihood of observing people with BMI >32.
Continuing our diabetes example, a naive Bayes classifier would calculate the likelihood of each observation (e.g., BMI > 32, age > 55 years, and female sex) among people who are and are not diabetic (Figure 4). Assuming equal prior probability for diabetes, an individual would be assigned to the class (i.e., diabetic vs. not diabetic) that had the highest likelihood of independently producing each observation.
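The Figure 4 calculation can be reproduced mechanically; the sketch below fits a Bernoulli naive Bayes classifier to synthetic binary attributes (BMI > 32, age > 55, female) and reports the posterior for the subgroup in the figure. The prevalence, data-generating rule, and prior are all assumptions for illustration.

```python
# A Bernoulli naive Bayes sketch of the Figure 4 setup. The 3 binary
# attributes (BMI > 32, age > 55, female), the outcome rule, and the
# assumed 10% prior prevalence are all invented for illustration.
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 3))        # columns: BMI > 32, age > 55, female
logit = 2 * X[:, 0] + X[:, 1] - 1.5
y = (rng.random(1000) < 1 / (1 + np.exp(-logit))).astype(int)

nb = BernoulliNB(class_prior=[0.9, 0.1])      # assumed Pr(D-) = 0.9, Pr(D+) = 0.1
nb.fit(X, y)
print(nb.predict_proba([[1, 1, 1]]))          # posterior for BMI > 32, age > 55, female
```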
Strengths and limitations
The simplicity of the naive Bayes approach contributes to the popularity of these algorithms. It has been shown to perform relatively well in the presence of noise, missing data, and irrelevant features (42). Because of the independence assumption, naive Bayes requires estimation of fewer parameters, and thus a smaller training set, than more complex algorithms (43, 44).
Arguably the most important limitation of naive Bayes is that its independence assumption is often violated in the real world. In addition, the most probable class may depend heavily on the chosen prior. Thus, proper adjustment for underlying class frequencies is necessary when prior probability in the training set is not representative of the general population. In addition, when data are correlated, naive Bayes gives more influence to the likelihood function of highly correlated features and may bias the prediction (43). These limitations will not affect classification performance, however, so long as the ordering of the biased probabilities is the same as that of the correct ones. Naive Bayes probability outputs nevertheless should never be interpreted as actual probabilities of class membership.
Sample statistical packages and modules
Available software includes e1071, klaR, bnlearn, H2O, and naivebayes in R; multinomial mixture models in StataStan (45); PROC HPBNET in SAS; and sklearn in Python.
_K_-means clustering
_K_-means clustering is one of the simplest unsupervised learning algorithms (46). It partitions observations into a prespecified number of distinct clusters (k), such that within-cluster variation (e.g., squared Euclidean distance) is as small as possible (47). _K_-means clustering first randomly selects k initial centroids, with each centroid defining 1 cluster. The algorithm then iteratively alternates between 2 steps until cluster assignments no longer change: 1) assign each observation to its nearest centroid, typically defined by squared Euclidean distance, and 2) move the location of each centroid to the mean of all data points assigned to that centroid’s cluster (Figure 5). There are a variety of methods for selecting k. Often investigators prespecify k based on background knowledge or visual examination of the data; however, likelihood and error-based approaches to selecting k have been developed (48).
Figure 5.
A hypothetical _k_-means algorithm for dichotomizing (k = 2) patients on the basis of their age and body mass index (BMI; weight (kg)/height (m)2). Each unclassified observation (hollow dots) is assigned to a diabetes classification (solid dots), with black and gray representing the predicted diabetes classifications (black, diabetic; gray, nondiabetic) at each step. Squares are centroids, with a single centroid per cluster. The steps of a _k_-means algorithm classifying hypothetical data are: A) obtain unclassified data; B) randomly select k = 2 centroids; C) assign each observation to its nearest centroid and predict its diabetes status (black dots are closer to the black square and gray dots are closer to the gray square); D) move the black centroid to the mean of all black dots, and similarly for the gray dots, as represented by centroid arrows; E) reclassify observations to the nearest, updated centroid; F) repeat step D; G) repeat step E; and H) perform final classifications, assuming that clusters have stabilized.
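The procedure illustrated in Figure 5 reduces to a few lines in practice; the sketch below clusters synthetic age/BMI data into k = 2 groups (the cluster locations are invented), with scikit-learn handling the assignment and centroid-update iterations internally.

```python
# A k-means sketch paralleling Figure 5: two synthetic age/BMI clusters,
# with the assign-and-update iterations handled internally. Cluster
# locations are invented.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
younger = rng.normal([30, 24], 2, size=(100, 2))   # (age, BMI) cluster 1
older = rng.normal([60, 34], 2, size=(100, 2))     # (age, BMI) cluster 2
X = np.vstack([younger, older])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)               # final centroid locations
print(km.labels_[:3], km.labels_[-3:])   # cluster assignments for sample points
```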
Strengths and limitations
_K_-means clustering is simple, easy to interpret, and computationally efficient. However, one important limitation is that the number of clusters needs to be prespecified. A slight difference in k can produce very different results, and methods for estimating k (49) do not necessarily agree with each other (50). In addition, when the distance between observations and cluster centroids is calculated with Euclidean metrics, the algorithm assumes that clusters have the same within-cluster variance. If some clusters are much larger than others, _k_-means can produce nonintuitive results (50) (Figure 6).
Figure 6.
One limitation of the _k_-means algorithm, as illustrated with simulated data on age and body mass index (BMI; weight (kg)/height (m)2). When one cluster (upper right) is much larger than the other (lower left), _k_-means can produce counterintuitive classifications (A). The more intuitive classification is shown in the right-hand panel (B).
Sample statistical packages and modules
Available software includes ClusterR, fpc, akmeans, and kmeans in base R; cluster kmeans in Stata; FASTCLUS and HPCLUS in SAS; and sklearn in Python.
ENSEMBLE METHODS
Ensemble methods utilize information from multiple models to improve predictive performance in comparison with a single model. The idea is that even though any individual model within an ensemble is not adequate to capture the characteristics of the entire phenomenon, so long as the models perform better than they would with random class assignment, once combined they can borrow strength from each other and achieve high predictive accuracy. Broadly, ensemble methods improve performance by creating a population of models through either 1) training the same underlying algorithm on different versions of a data set (e.g., bagging and boosting) or 2) training qualitatively different models on the same data set (e.g., Bayesian model averaging, Super Learner) (see Web Figure 1, available at https://academic.oup.com/aje) and then combining results across these models on the basis of a defined algorithm. While the primary objective of bagging and boosting is to minimize overfitting, multiple algorithm ensembles capitalize on different models’ strengths and avoid the need for model preselection. These alternative ensemble approaches are often used in combination, either as part of the same algorithm or through nested approaches.
Bagging
Bagging (or bootstrap aggregating) fits the same underlying algorithm to each bootstrapped copy of the original training data and then creates a final prediction based on outputs from the resulting parameterized models (51). The final prediction for a quantitative outcome is obtained by averaging the predictions. For a qualitative outcome, the final prediction either takes the majority vote among the classifiers or averages probabilities across the bootstrap fits. Bagging reduces model variance significantly without affecting bias (52, 53).
Feature bagging attempts to further reduce overfitting. It trains models on random subsets of variables/features instead of all variables in an attempt to reduce correlation between models in an ensemble. When applied to tree-based methods, the resulting models are called random forests, which force each split to consider a random subset of predictors (54), giving other weak predictors a greater chance to be selected as split candidates. Otherwise, when there is a strong predictor for the outcome, many trees would choose to first split on that predictor, creating highly correlated predictions regardless of the variables chosen at the subsequent splits.
Aside from _k_-fold cross-validation, one way to estimate prediction errors specifically for random forests is to compute out-of-bag error (55). Out-of-bag error is the mean prediction error for each observation, using only the models that did not include the observation in their bootstrapped samples. Variable importance rankings summarize the relative importance of each predictor across all fitted trees. These rankings reflect the importance of a variable for predicting outcomes by averaging the impurity decrease for all nodes where the variable is used across all trees in the forest (51). Impurity decrease measures changes in the accuracy of a tree and can be described by, for example, Gini impurity (a measure of the probability of mistaken categorization within a node) for bagging classification trees or the residual sum of squares for bagging regression trees. “Important” variables change the accuracy of the trees the most. Importance rankings can be used to assess the relative impact of individual predictors, as well as the interaction between predictors, in predicting the outcome (56, 57).
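These ideas (feature bagging, out-of-bag error, and impurity-based importance rankings) are all exposed by standard implementations; a minimal sketch on synthetic data follows, in which only the first 2 of 6 covariates are truly predictive by construction.

```python
# A random forest sketch on synthetic data: feature bagging via
# max_features, out-of-bag error via oob_score, and impurity-based
# variable importances.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

rf = RandomForestClassifier(n_estimators=500,
                            max_features="sqrt",  # random covariate subset per split
                            oob_score=True,       # out-of-bag performance estimate
                            random_state=0).fit(X, y)
print("out-of-bag accuracy:", round(rf.oob_score_, 2))
print("variable importances:", np.round(rf.feature_importances_, 2))
```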
Available software includes caret, randomForest, and adabag in R; CHAIDFOREST for random forests in Stata; SAS (58); and sklearn.ensemble in Python.
Boosting
Like bagging, boosting also trains models on subsets of data, but it does so in a sequential fashion and improves the classifiers by analyzing prediction errors (59, 60). AdaBoost is a well-known boosting method that sets weights on both observations and classifiers (61, 62). Observations are given weights, initially equal, that increase if incorrectly classified by the last iteration of the classifier; hence, subsequent iterations will prioritize correctly classifying these observations. The final output classifier is a weighted average of the classifiers built in each iteration, with higher weight given to classifiers with higher predictive accuracy (i.e., lower error rates on training data). Gradient boosting is a generalization of AdaBoost that uses gradient descent to optimize any differentiable loss function (i.e., a measure of classifier performance other than simple classification error) (63, 64).
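A minimal sketch comparing the two variants on the same synthetic data follows; the defaults (decision-stump and shallow-tree base learners, respectively) are scikit-learn's, and the data-generating rule is invented.

```python
# AdaBoost versus gradient boosting on the same synthetic data; both use
# scikit-learn defaults, and performance is compared by cross-validation.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (X[:, 0] * X[:, 1] + X[:, 2] > 0).astype(int)

for model in (AdaBoostClassifier(n_estimators=200, random_state=0),
              GradientBoostingClassifier(n_estimators=200, random_state=0)):
    score = cross_val_score(model, X, y, cv=5).mean()   # cross-validated accuracy
    print(type(model).__name__, round(score, 2))
```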
Available software includes gbm, adabag, fastAdaboost, xgboost, ada, and caret in R; Stata (65); SAS (58); and sklearn.ensemble in Python.
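A comparable sketch for boosting, again using sklearn.ensemble with simulated placeholder data, fits both an AdaBoost and a gradient boosting classifier:

```python
# A minimal boosting sketch with sklearn.ensemble; the simulated data
# are a hypothetical stand-in for a real epidemiologic data set.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# AdaBoost: observations misclassified at one iteration are up-weighted
# at the next; the final classifier is an accuracy-weighted vote.
ada = AdaBoostClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Gradient boosting: each new tree is fit to the gradient of a
# differentiable loss (the logistic loss, by default), generalizing AdaBoost.
gbt = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                 random_state=0).fit(X_tr, y_tr)
print(ada.score(X_te, y_te), gbt.score(X_te, y_te))
```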
Bayesian model averaging
Bayesian model averaging (BMA) estimates the posterior distribution of a predicted value (or the parameters defining a parametric relationship) by calculating the weighted average of model-specific estimates, where the weights are driven by how much the data support each competing model (66). BMA has been applied to many statistical models, including linear regression, generalized linear models, and Cox proportional hazards models, and it provides better predictive ability than using any single model (66). Its variants, such as Bayesian model combination, have emerged to further tackle the issue of overfitting, as BMA has a tendency to place too much weight on the most probable model (67). Bayesian model combination creates a set of ensembles, each representing a combination of individual models, and weights the ensemble-specific estimate of the effect size (as opposed to estimates based on the most probable model in BMA) by the probability that the ensemble is correct given the data (67, 68).
Available software includes BMS/BAS/BMA in R; SAS (69); and pyBMA in Python.
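To make the weighting idea concrete, the rough sketch below approximates BMA weights with the Bayesian Information Criterion (BIC) over a few hypothetical candidate linear models; it illustrates the principle only and is not a substitute for the dedicated packages listed above:

```python
# A rough BMA sketch: BIC approximates -2 log(marginal likelihood), so
# normalized exp(-BIC/2) values approximate posterior model weights.
# The candidate models and data are hypothetical placeholders.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 0.5 * x1 + rng.normal(size=n)  # x2 is a noise covariate

# Candidate models: different covariate subsets, each with an intercept.
designs = [sm.add_constant(np.column_stack(cols))
           for cols in ([x1], [x2], [x1, x2])]
fits = [sm.OLS(y, X).fit() for X in designs]

# Convert BICs into approximate posterior model probabilities.
bic = np.array([f.bic for f in fits])
w = np.exp(-0.5 * (bic - bic.min()))
w /= w.sum()

# The BMA estimate is the weight-averaged prediction across models.
y_bma = sum(wi * f.fittedvalues for wi, f in zip(w, fits))
```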
Super Learner
Super Learner is a prediction algorithm that uses cross-validation to determine the optimal weighted combination of predictions from a group of candidate learners (70–72). Building on the “stacked generalization” approach proposed by Wolpert (72), this approach allows the use of machine learning algorithms (e.g., random forests) in addition to standard parametric algorithms (e.g., logistic regression). _K_-fold cross-validation (Web Figure 1) is used to assign weights to each of a user-defined pool of component algorithms based on out-of-training set performance, and then the component models are fit to the entire data set. Model outputs are based on the predictions of these candidate models weighted by the cross-validation–derived weights. It has been applied to predict the drug susceptibility of human immunodeficiency virus as a function of its mutations (71), and it has been used as part of procedures to estimate causal effects (see “Epidemiologic Applications” section).
Available software includes SuperLearner in R and scikit-learn in Python.
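As a minimal illustration of the stacked-generalization idea underlying Super Learner, the scikit-learn sketch below (using two hypothetical candidate learners) derives combining weights from K-fold cross-validated predictions and then refits each candidate on the full data set:

```python
# A minimal stacking sketch in the spirit of Super Learner, using
# scikit-learn; the candidate pool and data are illustrative choices.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# A user-defined pool of candidate algorithms: one machine learning
# method and one standard parametric method.
candidates = [("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
              ("lr", LogisticRegression(max_iter=1000))]

# cv=10 sets the K-fold split used to generate out-of-fold predictions;
# the final logistic regression learns how to weight the candidates.
sl = StackingClassifier(estimators=candidates,
                        final_estimator=LogisticRegression(), cv=10)
sl.fit(X, y)
print(sl.predict_proba(X[:5]))
```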
EPIDEMIOLOGIC APPLICATIONS
In this section, we give an overview of the way in which machine learning algorithms have been used in various applications related to epidemiologic practice. While this is not a comprehensive review and we do not intend to discuss every limitation and nuance of these approaches, we hope to direct readers to areas of active research in the literature.
Causal inference
Relative to classical statistical or epidemiologic approaches, machine learning algorithms have historically placed less emphasis on causal inference. Indeed, machine learning has been described as a “black box” method because it is difficult to draw etiological inferences from the output of some algorithms (e.g., ANNs). However, machine learning techniques can still be an important component of approaches to estimating causal effects in observational studies, with sometimes superior performance for reducing bias and controlling for confounding (73).
Propensity score weighting is a common approach for estimating causal effects in observational studies (74). Propensity scores have traditionally been estimated with logistic regression, but this approach requires assumptions that, if unmet, may yield biased effect estimates despite propensity score conditioning. Machine learning algorithms often handle interactions and nonlinearities implicitly, whereas such high-order terms must be explicitly specified (and are commonly ignored) in logistic regression. Machine learning algorithms also perform well in estimating propensity scores in the presence of high-dimensional data and can reduce underlying model misspecification (75). Although these benefits may come at the expense of easy interpretability, that concern is not pertinent here: a propensity score need only predict treatment assignment well, not be interpretable. Multiple studies have empirically demonstrated bias reductions when propensity scores are generated with machine learning methods, particularly ensemble-based approaches (75–78). Under certain conditions, however, bias may persist or be exacerbated by machine learning methods (79–81). Studies in which researchers calculated propensity scores with machine learning approaches have included those assessing the effects of early sexual initiation on young adult health (82), vaccination on birth outcomes (83), and combination antibiotic treatment on Gram-negative bacteremia (84).
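A hedged sketch of this workflow, with simulated data standing in for a real observational cohort, estimates propensity scores with gradient boosting and converts them into inverse-probability-of-treatment weights:

```python
# A sketch of machine learning propensity scores: boosted trees estimate
# Pr(treatment | covariates), which feed inverse-probability weights.
# X and treatment are placeholders for a real observational data set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, treatment = make_classification(n_samples=1000, n_features=15,
                                   random_state=0)  # simulated stand-in

# Interactions and nonlinearities are handled implicitly by the trees,
# with no need to specify high-order terms as in logistic regression.
ps_model = GradientBoostingClassifier(random_state=0).fit(X, treatment)
ps = ps_model.predict_proba(X)[:, 1]

# Inverse-probability-of-treatment weights; in a real analysis,
# stabilization, trimming, and balance diagnostics would be essential.
iptw = np.where(treatment == 1, 1 / ps, 1 / (1 - ps))
```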
Likewise, machine learning algorithms can be used as a component of any causal inference framework where an estimate of the likelihood (or distribution) of an outcome is an important component of the inferential process but need not be directly interpretable. For example, the Super Learner algorithm has been used as a component of targeted maximum likelihood estimation and marginal structural model approaches (85). Examples include investigations of the relationship between alcohol outlet density and alcohol consumption patterns (86) and the relationship between childhood adversity and mental disorders by race/ethnicity (87).
Machine learning methods have also been used more directly to understand heterogeneity in treatment effects across subpopulations. For example, Athey et al. (88) developed an approach to building "causal trees": decision trees in which groupings are based on treatment effect and which provide principled estimates of treatment effects within these strata using an approach they call "honest estimation." This approach has been extended by applying the random forest algorithm to these trees, creating so-called "causal forests" that can be used to estimate treatment effects in persons with particular covariate profiles (89).
Another application of machine learning more directly related to problems of causal inference is causal structure learning, which has grown as a distinct branch of machine learning. Causal structure learning encompasses a group of exploratory techniques that identify an optimal directed acyclic graph consistent with the conditional independence relationships in the data and provided background knowledge. Approaches to causal structure learning include Bayesian network approaches (see Scutari and Denis (90) for an overview) and linear non-Gaussian acyclic models (LiNGAMs) (91–93). The former have been applied to derive causal influences in cellular signaling pathways (94) and to infer causal associations between gene expression and disease (95), while the latter have been used to estimate causal directionality between sleep disorders and depression (96) and to explore causality between television viewing habits and weight change (97).
Diagnostics, prognostic predictive models, and other clinical decision support tools
Disease diagnosis and prognosis are perhaps the oldest clinical utilizations of machine learning techniques (98) and remain common applications in the epidemiologic literature. Machine learning is particularly well-suited to certain diagnostic questions (e.g., those that involve imaging and/or high-dimensional data), and it can enhance prognostic models and clinical decision support tools through, for example, automation and ease of use.
Diagnostics that involve imaging, where each pixel can be conceptualized as a feature, and other high-dimensional data are problems well suited for machine learning approaches. SVMs have been utilized extensively in oncology for diagnosis and disease staging from radiological and tissue data (99–107). They have also been utilized for tumor typing from tissue microarray gene expression data, which, because of their high dimensionality, can be problematic for traditional statistical models (108–111). Outside of oncology, SVMs have shown promise for neuroimaging diagnostics, including for dementia (112) and autism spectrum disorder (113–115).
Machine learning techniques are also well-suited to prognostic models and other clinical decision support tools where accurate diagnosis (i.e., low classification error) is the primary objective, or where automation is desired. For example, Palaniappan and Awang (116) employed a combination of methods (ANNs, naive Bayes, and decision trees) in order to develop an automated, Web-based prediction tool, the Intelligent Heart Disease Prediction System. Incorporation into hospital and emergency room operations research is also common. ANNs have been used in emergency room populations, for example, to predict death among sepsis patients (117) and prolonged hospital stays among the elderly (118), and random forests have been used to build electronic triage models for risk-stratifying patients (119, 120). Because many machine learning methods can accommodate complex variable interactions without a priori specification, they may also uncover previously unknown prognostic subgroups (121). For example, Brims et al.’s (121) application of CART algorithms to a data set of malignant pleural mesothelioma cases revealed 4 distinct prognostic groups based upon clinical characteristics.
Conversely, machine learning methodologies can also be helpful where manual use, rather than automation, is contemplated. In particular, decision trees are popular clinical prediction tools for both diagnosis and prognosis due to their simplicity and interpretability. Because their output uses branching logic rather than calculations, decision trees are generally user-friendly for clinicians to apply at the bedside (e.g., to predict the likelihood that an infection is drug-resistant while awaiting microbiological confirmation (122, 123)).
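For illustration, the sketch below fits a shallow decision tree and prints its branching logic in a bedside-readable form; the feature names are invented placeholders, not variables from the cited studies:

```python
# A minimal sketch of an interpretable clinical decision tree; outcome
# and feature names are hypothetical, not from the cited studies.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)
features = ["prior_resistant_culture", "recent_hospitalization",
            "age_over_65", "indwelling_catheter"]  # illustrative names

# A shallow tree keeps the branching logic simple enough to apply manually.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=features))
```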
Genome-wide association studies
Genome-wide association studies seek to identify genetic variants that influence disease risk. Genome data sets generally contain large numbers of genes and single nucleotide polymorphisms of interest but, because of sequencing costs and other practical constraints, are of limited sample size. These high-dimensional data are the types of data on which machine learning algorithms perform well. Hence, ensemble machine learning approaches such as random forests are commonly used. Random forests can rank the most important single nucleotide polymorphisms for a disease outcome. For example, they have been used to predict drug response in epilepsy patients based on clinical and genetic information (124); to identify genetic variants associated with Parkinson disease and other neurological disorders (125); and to “data-mine” high-density genetic data to predict Alzheimer disease risk (126).
Other, nonensemble algorithms are also popular in genome-wide association studies. Researchers have applied SVMs with Bayesian model averaging to genome-wide data to predict late-onset Alzheimer disease (127) and _k_-nearest neighbors (a relatively simple nonparametric classification algorithm (128)) to predict the heritable genetic susceptibility of common cancers (129). Microbiome studies, which also involve high-dimensional (albeit bacterial) genetic data, have likewise utilized machine learning to identify disease risk factors among microbiota/microbiome signatures (130). Moreover, because interactions do not require a priori specification under many machine learning algorithms, machine learning approaches are well-suited to identification of complex gene-gene (131) and gene-environment interactions that may modulate disease risk (e.g., use of ANNs to explore interactions between nutrient intake and metabolic pathway polymorphisms in breast cancer susceptibility (132)).
Geospatial applications
Machine learning can help to predict and map disease occurrence and health indicators in areas where data are limited. Its ability to efficiently process high-dimensional data sets from heterogeneous contexts and multiple geographic scales makes it particularly suitable for this task. One major effort is the WorldPop Project (www.worldpop.org), an open-source archive of demographic parameters on fine spatial scales (133). It uses random forests to map the global population distribution on a per-pixel scale by combining remote sensing data (e.g., satellite imagery) across multiple geospatial scales (133). Beyond WorldPop's use of random forests, another type of ensemble machine learning algorithm, boosted regression trees, has been widely used to map environmental suitability for disease transmission, including dengue (134), leishmaniasis (135), Ebola (136), Crimean-Congo hemorrhagic fever (137), and Zika virus (138, 139). In general, investigators in these studies 1) chose a set of known or proposed environmental and socioeconomic covariates, 2) incorporated global assessments of whether the disease(s) of interest was circulating in a country or region, and 3) with these data, built boosted regression tree models. The resulting models were then used to predict infection probabilities on a pixel-by-pixel scale.
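The following schematic mirrors those 3 steps on simulated stand-ins for the covariates and occurrence labels, using scikit-learn's gradient boosting as a stand-in for the boosted-regression-tree software used in the cited studies:

```python
# A schematic of the 3-step suitability-mapping workflow; all data here
# are simulated placeholders for gridded environmental covariates.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n_pixels = 5000
# Step 1: covariates per pixel (e.g., temperature, rainfall, land cover).
covariates = rng.normal(size=(n_pixels, 3))
# Step 2: binary occurrence labels from regional circulation assessments.
occurrence = (covariates[:, 0] + rng.normal(size=n_pixels) > 0).astype(int)

# Step 3: fit boosted trees and predict per-pixel probabilities, which
# would then be rendered as an environmental suitability map.
brt = GradientBoostingClassifier(random_state=0).fit(covariates, occurrence)
suitability = brt.predict_proba(covariates)[:, 1]
```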
Text mining
Electronic health records provide an unprecedented amount of clinical information for research, but effectively utilizing these data sources in studies or for surveillance is generally cost-prohibitive without some form of automated data extraction. Machine learning offers automated tools for extracting unstructured information from textual clinical documents. For example, the i2b2 (Informatics for Integrating Biology and the Bedside) Challenges address a range of projects aiming to develop and evaluate information extraction methods for clinical text (140). The 2009 Medication Challenge focused on providing a schema with which to extract information including medications, dosages, modes (routes) of administration, frequencies, durations, and reasons for administration from discharge summaries (141).
Other applications include deidentifying personal health information, research subject recruitment, coding, and surveillance. Machine learning has been used to remove personal health information from clinical records, such that deidentified records may be made public for research purposes without obtaining individual informed consent (142). Studies have also used textual data and machine learning algorithms to identify patients who may qualify for and benefit from participation in clinical studies (143). Furthermore, text mining can improve the efficiency of systematic reviews by facilitating the identification, rapid categorization, and summarization of relevant literature (144). Finally, natural language processing of clinical documents can supplement manual surveillance and has been used to identify a range of reportable postoperative complications (145).
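As a minimal example of the kind of supervised text classification that underlies many such pipelines, the sketch below labels invented clinical snippets using a bag-of-words (TF-IDF) representation; real systems are considerably more sophisticated:

```python
# A toy text-classification sketch; the two training notes and labels
# are invented for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

notes = ["patient discharged on oral antibiotics twice daily",
         "no postoperative complications noted at follow-up"]
labels = [1, 0]  # e.g., mentions a medication regimen vs. does not

# TF-IDF features feed a simple linear classifier.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(notes, labels)
print(clf.predict(["started on intravenous antibiotics every 8 hours"]))
```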
In addition to clinical settings, text mining algorithms have been incorporated into automated infectious disease surveillance systems that acquire, classify, and process Web-accessible data. These algorithms can improve detection of early outbreaks and complement traditional surveillance efforts performed by government and international organizations. For example, HealthMap graphically displays areas where diseases are circulating by combining search query data, social media data, validated official reports, and expert-curated accounts (e.g., ProMED-mail) (146, 147). Similarly, the BioCaster system tracks infectious disease outbreaks on Google maps (Google, Inc., Mountain View, California) on the basis of RSS news feeds (148).
Prediction and forecasting of infectious disease
Machine learning methods have been incorporated into prediction and forecasting models for infectious disease. For example, SVMs have been used to predict whether dengue incidence exceeded a chosen threshold using Google search terms (149). Researchers have also used SVMs to predict levels of influenza-like illness from Twitter data (Twitter, Inc., San Francisco, California) 1–2 weeks before official reports (150). In addition, infectious disease forecasters have adopted ensemble-based methods traditionally used for meteorological and oceanographic predictions. For example, climate forecasting from multimodel ensembles has been adapted to produce early malaria warning systems (151). Moreover, ensemble-based forecasting methods based on sequential data assimilation approaches are increasingly common infectious disease forecasting tools, because of their ability to correct for various sources of uncertainty in mathematical simulations as compared with traditional linear time-series models such as negative binomial models and autoregressive integrated moving average (ARIMA) models. One type of sequential ensemble filtering, the ensemble adjustment Kalman filter, has been used to forecast seasonal outbreaks of influenza (152), to reconstruct the transmission network of the 2014–2015 Ebola epidemic in Sierra Leone (153), and to retrospectively “forecast” cases of West Nile virus (154) and respiratory syncytial virus (155).
BRIEF RECOMMENDATIONS
In this primer, we have discussed several important algorithms, but this is only the tip of the iceberg. We refer readers to the machine learning textbooks referenced herein for a more comprehensive review (2, 5, 31). The choice of an algorithm depends strongly on the research goals associated with its use, and there is no single recommendation for all projects. However, epidemiologists interested in adopting machine learning methodologies will often be most interested in accurate prediction in the context of a large number of covariates. In these cases, we encourage them to start with ensemble-based boosting or bagging approaches. By refitting the same underlying model on different versions of a data set, these ensembles are less susceptible to overfitting and less sensitive to tuning parameters. They are also easy to implement with many commonly available tools and packages, with random forests being a popular choice. The Super Learner approach, which fits many different models to a data set, is also attractive, since it allows simultaneous consideration of multiple algorithms and automates many of the best practices for fitting and validating machine learning models. However, as with traditional epidemiologic or statistical approaches, a rigorous approach to assessing performance and appropriate matching of a model to its use are more important than the specific algorithm used.
Despite the benefits of boosting and bagging, as a general rule, these ensemble approaches add another stage to modeling, making their results harder to interpret. Investigators should carefully consider their primary objective: Is it predictive accuracy or interpretability? Where interpretability is important, as in many clinical applications, researchers might consider single, more easily understandable algorithms such as decision trees. However, many machine learning algorithms, particularly nonensemble approaches, are prone to overfitting.
Measures of fit alone (e.g., _R_2) should be interpreted with caution, as they can be effectively meaningless for some machine learning applications (156). Without a likelihood function, techniques such as the Akaike Information Criterion are not available for assessing the generalizability of machine learning models; hence, cross-validation (whether _k_-fold, leave-one-out, or another approach) is a critical tool for evaluating model performance. These methods must be used appropriately, however, or they can fool the researcher. The testing and validation plan should be specified a priori and must be applied to the full algorithm: For example, if there is a data-based variable selection step, it should be executed in each data partition used in cross-validation, not in the full data set prior to the cross-validation. Researchers should understand that these cross-validation approaches give the expected out-of-data-set performance of the algorithm used, not an assessment of the particular fitted model. They should also recognize that the quality of these measures depends on the representativeness of the population and on the correlation between observations in the training and testing sets (i.e., if there is high correlation, cross-validated performance will be deceptively high).
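The sketch below makes this point concrete in scikit-learn: placing a (hypothetical) variable selection step inside a Pipeline ensures that selection is refit within every cross-validation fold, rather than run once on the full data set:

```python
# Correctly nested cross-validation: the variable selection step sits
# inside the Pipeline, so sklearn re-runs it on each training partition.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=300, n_features=50, random_state=0)

pipe = make_pipeline(SelectKBest(f_classif, k=10), LogisticRegression())

# The resulting CV estimate reflects the full algorithm's expected
# out-of-sample performance, selection step included.
scores = cross_val_score(pipe, X, y, cv=10)
print(scores.mean())
```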
OPPORTUNITIES AND CHALLENGES
The field of machine learning is rapidly developing and can make any technical review seem obsolete within months. Growing interest in the field from the general public, as reflected in extensive coverage of self-driving cars and AlphaGo (Alphabet, Inc., Mountain View, California) in the mainstream media, is accompanied by efforts from the machine learning community to make advanced machine learning technologies more accessible. Educational companies such as Udacity (Udacity, Inc., Mountain View, California) and Coursera (Coursera, Inc., Mountain View, California) have partnered with companies like Google and academic institutions to create online and freely available courses on machine learning and deep learning.
In addition to the growing educational resources, large technological companies, including Google, IBM (International Business Machines Corporation, Armonk, New York), and Amazon Web Services (Amazon Web Services, Inc., Seattle, Washington), are heavily investing in open-source machine learning that uses data-flow graphs to build models (e.g., TensorFlow (Google, Inc.) (157)). The use of data-flow graphs in TensorFlow enables developers and data scientists to focus on the high-level overall logic of the algorithms rather than the technical coding details, which greatly increases the reproducibility and optimizability of the models. Models built with TensorFlow can be integrated into mobile devices, making on-device/bedside diagnosis practical when combined with mobile sensors. The ability of TensorFlow to build and run models on the cloud also dramatically increases processing power and storage ability, which is particularly helpful for analyzing large data sets with complex algorithms. These machine learning developments continue to ease the entry barriers for epidemiologists interested in using advanced machine learning technologies, and they have the potential to transform epidemiologic research.
Yet, there continue to be challenges that impede greater integration of machine learning into epidemiologic research. Classically trained epidemiologists often lack the skills to take full advantage of machine learning technologies, partly because of the continued popularity of closed-source programming languages (e.g., SAS, Stata) in epidemiology. In addition, despite the promise of “Big Data,” logistical roadblocks to sharing deidentified patient data and amassing large health-care data sets can make it challenging for epidemiologists to leverage these opportunities, particularly compared with the private sector. Even when data are available, epidemiologists should be mindful of the class-imbalance issue (see Table 1) often inherent in health-care and surveillance data, which can pose challenges for many standard algorithms (158). Most importantly, a general lack of working knowledge on machine learning algorithms, despite their substantial methodological overlap with statistical methods, reduces the practical uptake of these techniques in the epidemiologic literature.
Ultimately, advanced machine learning algorithms offer epidemiologists new tools for tackling problems that classical methods are not well-suited for, but they are by no means a cure-all for poor study design or poor data quality. Further eroding the cultural and language barriers between machine learning and epidemiology is an essential first step toward understanding the value of machine learning and achieving greater integration between it and existing epidemiologic research methods.
ACKNOWLEDGMENTS
Author affiliations: Department of Epidemiology, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, Maryland (Qifang Bi, Katherine E. Goodman, Joshua Kaminsky, Justin Lessler).
Q.B. and K.E.G. contributed equally to this paper.
We thank Dr. Maria Glymour for her very helpful suggestions.
Conflict of interest: none declared.
Abbreviations
- ANN: artificial neural network
- BMA: Bayesian model averaging
- BMI: body mass index
- CART: classification and regression trees
- SVM: support vector machine
REFERENCES
1. Samuel AL. Some studies in machine learning using the game of checkers. IBM J Res Dev. 1959;3(3):210–229.
2. Mitchell TM. Machine Learning. 1st ed. New York, NY: McGraw-Hill Education; 1997.
3. Rasmussen CE, Williams CKI. Gaussian Processes for Machine Learning. 1st ed. Cambridge, MA: MIT Press; 2006.
4. Breiman L. Statistical modeling: the two cultures (with comments and a rejoinder by the author). Stat Sci. 2001;16(3):199–231.
5. Duda RO, Hart PE, Stork DG. Pattern Classification. 2nd ed. Hoboken, NJ: John Wiley & Sons, Inc.; 2012:517.
6. Bartholomew DJ, Knott M, Moustaki I. Latent Variable Models and Factor Analysis: A Unified Approach. 3rd ed. Hoboken, NJ: John Wiley & Sons, Inc.; 2011.
7. Hennig C, Meila M, Murtagh F, et al. Handbook of Cluster Analysis. 1st ed. Boca Raton, FL: CRC Press; 2015:34.
8. Bishop CM. Pattern Recognition and Machine Learning. 1st ed. New York, NY: Springer Publishing Company; 2006:424.
9. Zhu X, Goldberg AB. Introduction to Semi-Supervised Learning. 1st ed. San Rafael, CA: Morgan & Claypool Publishers; 2009:11.
10. Nigam K, McCallum AK, Thrun S, et al. Text classification from labeled and unlabeled documents using EM. Mach Learn. 2000;39(2):103–134.
11. Ng AY, Jordan MI. On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes. In: Dietterich TG, Becker S, Ghahramani Z, eds. Advances in Neural Information Processing Systems 14. Cambridge, MA: MIT Press; 2002:841–848.
12. Vapnik VN. Statistical Learning Theory. 1st ed. Hoboken, NJ: Wiley-Interscience; 1998:12–21.
13. Pernkopf F, Bilmes J. Discriminative versus generative parameter and structure learning of Bayesian network classifiers. In: Proceedings of the 22nd International Conference on Machine Learning—ICML '05. New York, NY: Association for Computing Machinery; 2005:657–664. https://dl.acm.org/citation.cfm?id=1102434. Accessed July 4, 2019.
14. Reinforcement Learning: An Introduction—Richard S. Sutton and Andrew G. Bartow [book review]. IEEE Trans Neural Netw. 1998;9(5):1054.
15. Sutton RS, Barto AG. Reinforcement Learning: An Introduction. 2nd ed. Cambridge, MA: MIT Press; 2018.
16. Ganguly K. Learning Generative Adversarial Networks. 1st ed. Birmingham, United Kingdom: Packt Publishing; 2017.
17. Asoh H, Shiro M, Akaho S, et al. An application of inverse reinforcement learning to medical records of diabetes treatment. Presented at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), Prague, Czech Republic, September 23–27, 2013. https://pdfs.semanticscholar.org/5f44/548a3cd1932fc5035236b9be9018df5103c5.pdf. Accessed July 3, 2019.
18. Shortreed SM, Laber E, Lizotte DJ, et al. Informing sequential clinical decision-making through reinforcement learning: an empirical study. Mach Learn. 2011;84(1-2):109–136.
19. Nemati S, Ghassemi MM, Clifford GD. Optimal medication dosing from suboptimal clinical examples: a deep reinforcement learning approach. Conf Proc IEEE Eng Med Biol Soc. 2016;2016:2978–2981.
20. Olden JD, Jackson DA. Illuminating the "black box": a randomization approach for understanding variable contributions in artificial neural networks. Ecol Modell. 2002;154(1-2):135–150.
21. Rumelhart DE, Hinton GE, Williams RJ. Learning representations by back-propagating errors. Nature. 1986;323(6088):533–536.
22. McCulloch WS, Pitts W. A logical calculus of the ideas immanent in nervous activity. Bull Math Biol. 1990;52(1-2):99–115. [Reprinted from Bull Math Biophys. 1943;5:115–133].
23. Duh MS, Walker AM, Ayanian JZ. Epidemiologic interpretation of artificial neural networks. Am J Epidemiol. 1998;147(12):1112–1122.
24. Papadokonstantakis S, Lygeros A, Jacobsson SP. Comparison of recent methods for inference of variable influence in neural networks. Neural Netw. 2006;19(4):500–513.
26. Hershey S, Chaudhuri S, Ellis DPW, et al. CNN architectures for large-scale audio classification. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2017). Piscataway, NJ: Institute of Electrical and Electronics Engineers; 2017:131–135. https://ai.google/research/pubs/pub45611. Accessed July 15, 2019.
27. Breiman L, Friedman J, Stone CJ, et al. Classification and Regression Trees. 1st ed. London, United Kingdom: Chapman & Hall Ltd.; 1984.
28. Kass GV. An exploratory technique for investigating large quantities of categorical data. Appl Stat. 1980;29(2):119–127.
29. Biggs D, De Ville B, Suen E. A method of choosing multiway partitions for classification and decision trees. J Appl Stat. 1991;18(1):49–62.
30. Quinlan JR. Induction of decision trees. Mach Learn. 1986;1(1):81–106.
31. James G, Witten D, Hastie T, et al. An Introduction to Statistical Learning: With Applications in R. New York, NY: Springer Publishing Company; 2013.
32. Almuallim H. An efficient algorithm for optimal pruning of decision trees. Artif Intell. 1996;83(2):347–362.
33. Boulesteix AL, Janitza S, Hapfelmeier A, et al. Letter to the editor: on the term "interaction" and related phrases in the literature on random forests [letter]. Brief Bioinform. 2015;16(2):338–345.
34. Aluja-Banet T, Nafria E. Stability and scalability in decision trees. Comput Stat. 2003;18(3):505–520.
35. Boser BE, Guyon IM, Vapnik VN. A training algorithm for optimal margin classifiers. In: Proceedings of the Fifth Annual Workshop on Computational Learning Theory. New York, NY: Association for Computing Machinery; 1992:144–152. http://www.svms.org/training/BOGV92.pdf. Accessed July 3, 2019.
36. Noble WS. What is a support vector machine? Nat Biotechnol. 2006;24(12):1565–1567.
37. Smola AJ, Schölkopf B. A tutorial on support vector regression. Stat Comput. 2004;14(3):199–222.
38. Belousov AI, Verzakov SA, von Frese J. A flexible classification approach with optimal generalisation performance: support vector machines. Chemometr Intell Lab Syst. 2002;64(1):15–25.
39. Guenther N, Schonlau M. Support vector machines. Stata J. 2016;16(4):917–937.
40. Lewis DD. Naive (Bayes) at forty: the independence assumption in information retrieval. In: Nédellec C, Rouveirol C, eds. Machine Learning: ECML-98. Berlin, Germany: Springer Berlin-Heidelberg; 1998:4–15.
41. Frank E, Trigg L, Holmes G, et al. Technical note: naive Bayes for regression. Mach Learn. 2000;41(1):5–25.
43. Russek E, Kronmal RA, Fisher LD. The effect of assuming independence in applying Bayes' theorem to risk estimation and classification in diagnosis. Comput Biomed Res. 1983;16(6):537–552.
44. Hand DJ. Statistical methods in diagnosis. Stat Methods Med Res. 1992;1(1):49–67.
46. Jain AK. Data clustering: 50 years beyond K-means. Pattern Recognit Lett. 2010;31(8):651–666.
47. Lloyd S. Least squares quantization in PCM. IEEE Trans Inf Theory. 1982;28(2):129–137.
48. Pham DT, Dimov SS, Nguyen CD. Selection of K in K-means clustering. Proc Inst Mech Eng Pt C J Mechan Eng Sci. 2005;219(1):103–119.
49. Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a data set via the gap statistic. J R Stat Soc Series B Stat Methodol. 2001;63(2):411–423.
50. Raykov YP, Boukouvalas A, Baig F, et al. What to do when K-means clustering fails: a simple yet principled alternative algorithm. PLoS One. 2016;11(9):e0162259.
51. Breiman L. Bagging predictors. Mach Learn. 1996;24(2):123–140.
52. Schapire RE, Freund Y, Bartlett P, et al. Boosting the margin: a new explanation for the effectiveness of voting methods. Ann Stat. 1998;26(5):1651–1686.
54. Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.
56. Friedman J, Hastie T, Tibshirani R. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York, NY: Springer Publishing Company; 2009.
57. van der Laan MJ. Statistical inference for variable importance. Int J Biostat. 2006;2(1):1557–4679.
59. Schapire RE. The strength of weak learnability. Mach Learn. 1990;5(2):197–227.
60. Freund Y. Boosting a weak learning algorithm by majority. Inf Comput. 1995;121(2):256–285.
61. Schapire RE. A brief introduction to boosting. In: Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence. Stanford, CA: International Joint Conferences on Artificial Intelligence Organization; 1999:1401–1406. http://rob.schapire.net/papers/Schapire99c.pdf. Accessed July 3, 2019.
62. Freund Y, Schapire RE. A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci. 1997;55(1):119–139.
63. Friedman JH. Greedy function approximation: a gradient boosting machine. Ann Stat. 2001;29(5):1189–1232.
64. Friedman JH. Stochastic gradient boosting. Comput Stat Data Anal. 2002;38(4):367–378.
65. Schonlau M. Boosted regression (boosting): an introductory tutorial and a Stata plugin. Stata J. 2005;5(3):330–354.
66. Hoeting JA, Madigan D, Raftery AE, et al. Bayesian model averaging: a tutorial. Stat Sci. 1999;14(4):382–417.
68. Monteith K, Carroll JL, Seppi K, et al. Turning Bayesian model averaging into Bayesian model combination. In: The 2011 International Joint Conference on Neural Networks (IJCNN 2011—San Jose). Piscataway, NJ: Institute of Electrical and Electronics Engineers; 2011:2657–2663. http://axon.cs.byu.edu/papers/Kristine.ijcnn2011.pdf. Accessed July 3, 2019.
69. Whitney M, Ngo L. Bayesian model averaging using SAS® software. (Paper 203-29). In: Proceedings of the Twenty-Ninth Annual SAS® Users Group International Conference. Cary, NC: SAS Institute, Inc.; 2004:203–229. http://www2.sas.com/proceedings/sugi29/203-29.pdf. Accessed July 3, 2019.
70. van der Laan MJ, Polley EC, Hubbard AE. Super Learner. Stat Appl Genet Mol Biol. 2007;6:Article 25.
71. Sinisi SE, Polley EC, Petersen ML, et al. Super learning: an application to the prediction of HIV-1 drug resistance. Stat Appl Genet Mol Biol. 2007;6:Article 7.
72. Wolpert DH. Stacked generalization. Neural Netw. 1992;5(2):241–259.
73. Hernán MA, Robins JM. Using big data to emulate a target trial when a randomized trial is not available. Am J Epidemiol. 2016;183(8):758–764.
74. Rosenbaum PR, Rubin DB. Reducing bias in observational studies using subclassification on the propensity score. J Am Stat Assoc. 1984;79(387):516–524.
75. Westreich D, Lessler J, Funk MJ. Propensity score estimation: neural networks, support vector machines, decision trees (CART), and meta-classifiers as alternatives to logistic regression. J Clin Epidemiol. 2010;63(8):826–833.
76. Lee BK, Lessler J, Stuart EA. Improving propensity score weighting using machine learning. Stat Med. 2010;29(3):337–346.
77. Pirracchio R, Petersen ML, van der Laan M. Improving propensity score estimators' robustness to model misspecification using Super Learner. Am J Epidemiol. 2015;181(2):108–119.
78. Watkins S, Jonsson-Funk M, Brookhart MA, et al. An empirical comparison of tree-based methods for propensity score estimation. Health Serv Res. 2013;48(5):1798–1817.
79. Schnitzer ME, Lok JJ, Gruber S. Variable selection for confounder control, flexible modeling and collaborative targeted minimum loss-based estimation in causal inference. Int J Biostat. 2016;12(1):97–115.
80. Moodie EEM, Stephens DA. Treatment prediction, balance, and propensity score adjustment. Epidemiology. 2017;28(5):e51–e53.
81. Bahamyirou A, Blais L, Forget A, et al. Understanding and diagnosing the potential for bias when using machine learning methods with doubly robust causal estimators. Stat Methods Med Res. 2019;28(6):1637–1650.
82. Kugler KC, Vasilenko SA, Butera NM, et al. Long-term consequences of early sexual initiation on young adult health: a causal inference approach. J Early Adolesc. 2017;37(5):662–676.
83. Oppermann M, Fritzsche J, Weber-Schoendorfer C, et al. A(H1N1)v2009: a controlled observational prospective cohort study on vaccine safety in pregnancy. Vaccine. 2012;30(30):4445–4452.
84. Tamma PD, Turnbull AE, Harris AD, et al. Less is more: combination antibiotic therapy for the treatment of gram-negative bacteremia in pediatric patients. JAMA Pediatr. 2013;167(10):903–910.
85. Schuler MS, Rose S. Targeted maximum likelihood estimation for causal inference in observational studies. Am J Epidemiol. 2017;185(1):65–73.
86. Ahern J, Balzer L, Galea S. The roles of outlet density and norms in alcohol use disorder. Drug Alcohol Depend. 2015;151:144–150.
87. Ahern J, Karasek D, Luedtke AR, et al. Racial/ethnic differences in the role of childhood adversities for mental disorders among a nationally representative sample of adolescents. Epidemiology. 2016;27(5):697–704.
88. Athey S, Imbens G. Recursive partitioning for heterogeneous causal effects. Proc Natl Acad Sci U S A. 2016;113(27):7353–7360.
89. Wager S, Athey S. Estimation and inference of heterogeneous treatment effects using random forests. J Am Stat Assoc. 2018;113(523):1228–1242.
90. Scutari M, Denis JB. Bayesian Networks: With Examples in R. 1st ed. London, United Kingdom: Chapman & Hall Ltd.; 2014.
91. Shimizu S, Hoyer PO, Hyvärinen A, et al. A linear non-gaussian acyclic model for causal discovery. J Mach Learn Res. 2006;7:2003–2030.
92. Hoyer PO, Shimizu S, Kerminen AJ, et al. Estimation of causal effects using linear non-Gaussian causal models with hidden variables. Int J Approx Reason. 2008;49(2):362–378.
93. Shimizu S. LiNGAM: non-Gaussian methods for estimating causal structures. Behaviormetrika. 2014;41(1):65–98.
94. Sachs K, Perez O, Pe'er D, et al. Causal protein-signaling networks derived from multiparameter single-cell data. Science. 2005;308(5721):523–529.
95. Schadt EE, Lamb J, Yang X, et al. An integrative genomics approach to infer causal associations between gene expression and disease. Nat Genet. 2005;37(7):710–717.
96. Rosenström T, Jokela M, Puttonen S, et al. Pairwise measures of causal direction in the epidemiology of sleep problems and depression. PLoS One. 2012;7(11):e50841.
97. Helajärvi H, Rosenström T, Pahkala K, et al. Exploring causality between TV viewing and weight change in young and middle-aged adults. The Cardiovascular Risk in Young Finns Study. PLoS One. 2014;9(7):e101860.
98. Warner HR, Toronto AF, Veasey LG, et al. A mathematical approach to medical diagnosis. Application to congenital heart disease. JAMA. 1961;177:177–183.
99. Blumenthal DT, Artzi M, Liberman G, et al. Classification of high-grade glioma into tumor and nontumor components using support vector machine. AJNR Am J Neuroradiol. 2017;38(5):908–914.
100. Artzi M, Liberman G, Nadav G, et al. Differentiation between treatment-related changes and progressive disease in patients with high grade brain tumors using support vector machine classification based on DCE MRI. J Neurooncol. 2016;127(3):515–524.
101. Zarinabad N, Abernethy LJ, Avula S, et al. Application of pattern recognition techniques for classification of pediatric brain tumors by in vivo 3T 1H-MR spectroscopy—a multi-center study. Magn Reson Med. 2018;79(4):2359–2366.
102. Chang Y, Paul AK, Kim N, et al. Computer-aided diagnosis for classifying benign versus malignant thyroid nodules based on ultrasound images: a comparison with radiologist-based assessments. Med Phys. 2016;43(1):554–567.
103. El-Naqa I, Yang Y, Wernick MN, et al. A support vector machine approach for detection of microcalcifications. IEEE Trans Med Imaging. 2002;21(12):1552–1563.
104. Polat K, Güneş S. Breast cancer diagnosis using least square support vector machine. Digit Signal Process. 2007;17(4):694–701.
105. Akay MF. Support vector machines combined with feature selection for breast cancer diagnosis. Expert Syst Appl. 2009;36(2):3240–3247.
106. Wang ZL, Zhou ZG, Chen Y, et al. Support vector machines model of computed tomography for assessing lymph node metastasis in esophageal cancer with neoadjuvant chemotherapy. J Comput Assist Tomogr. 2017;41(3):455–460.
107. Zhang XP, Wang ZL, Tang L, et al. Support vector machine model for diagnosis of lymph node metastasis in gastric cancer with multidetector computed tomography: a preliminary study. BMC Cancer. 2011;11:Article 10.
108. Brown MP, Grundy WN, Lin D, et al. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc Natl Acad Sci U S A. 2000;97(1):262–267.
109. Furey TS, Cristianini N, Duffy N, et al. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics. 2000;16(10):906–914.
110. Golub TR, Slonim DK, Tamayo P, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999;286(5439):531–537.
111. Pomeroy SL, Tamayo P, Gaasenbeek M, et al. Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature. 2002;415(6870):436–442.
112. Orrù G, Pettersson-Yeo W, Marquand AF, et al. Using Support Vector Machine to identify imaging biomarkers of neurological and psychiatric disease: a critical review. Neurosci Biobehav Rev. 2012;36(4):1140–1152.
113. Costafreda SG, Chu C, Ashburner J, et al. Prognostic and diagnostic potential of the structural neuroanatomy of depression. PLoS One. 2009;4(7):e6353.
114. Costafreda SG, Khanna A, Mourao-Miranda J, et al. Neural correlates of sad faces predict clinical remission to cognitive behavioural therapy in depression. Neuroreport. 2009;20(7):637–641.
115. Gong Q, Wu Q, Scarpazza C, et al. Prognostic prediction of therapeutic response in depression using high-field MR imaging. Neuroimage. 2011;55(4):1497–1503.
116. Palaniappan S, Awang R. Intelligent heart disease prediction system using data mining techniques. IJCSNS Int J Comput Sci Netw Secur. 2008;8(8):343–350.
117. Jaimes F, Farbiarz J, Alvarez D, et al. Comparison between logistic regression and neural networks to predict death in patients with suspected sepsis in the emergency room. Crit Care. 2005;9(2):R150–R156.
118. Launay CP, Rivière H, Kabeshova A, et al. Predicting prolonged length of hospital stay in older emergency department users: use of a novel analysis method, the artificial neural network. Eur J Intern Med. 2015;26(7):478–482.
119. Demšar J, Zupan B, Aoki N, et al. Feature mining and predictive model construction from severe trauma patient's data. Int J Med Inform. 2001;63(1-2):41–50.
120. Levin S, Toerper M, Hamrock E, et al. Machine-learning-based electronic triage more accurately differentiates patients with respect to clinical outcomes compared with the emergency severity index. Ann Emerg Med. 2018;71(5):565–574.e2.
121. Brims FJH, Meniawy TM, Duffus I, et al. A novel clinical prediction model for prognosis in malignant pleural mesothelioma using decision tree analysis. J Thorac Oncol. 2016;11(4):573–582.
122. Goodman KE, Lessler J, Cosgrove SE, et al. A clinical decision tree to predict whether a bacteremic patient is infected with an extended-spectrum β-lactamase-producing organism. Clin Infect Dis. 2016;63(7):896–903.
123. Dias CC, Pereira Rodrigues P, Fernandes S, et al. The risk of disabling, surgery and reoperation in Crohn's disease—a decision tree-based approach to prognosis. PLoS One. 2017;12(2):e0172165.
124. Silva-Alves MS, Secolin R, Carvalho BS, et al. A prediction algorithm for drug response in patients with mesial temporal lobe epilepsy based on clinical and genetic information. PLoS One. 2017;12(1):e0169214.
125. Nguyen TT, Huang J, Wu Q, et al. Genome-wide association data classification and SNPs selection using two-stage quality-based random forests. BMC Genomics. 2015;16(suppl 2):Article S5.
126. Briones N, Dinu V. Data mining of high density genomic variant data for prediction of Alzheimer's disease risk. BMC Med Genet. 2012;13:Article 7.
127. Wei W, Visweswaran S, Cooper GF. The application of naive Bayes model averaging to predict Alzheimer's disease from genome-wide data. J Am Med Inform Assoc. 2011;18(4):370–375.
128. Altman NS. An introduction to kernel and nearest-neighbor nonparametric regression. Am Stat. 1992;46(3):175–185.
129. Kim BJ, Kim SH. Prediction of inherited genomic susceptibility to 20 common cancer types by a supervised machine-learning method. Proc Natl Acad Sci U S A. 2018;115(6):1322–1327.
130. Montassier E, Al-Ghalith GA, Ward T, et al. Pretreatment gut microbiome predicts chemotherapy-related bloodstream infection. Genome Med. 2016;8:Article 49.
131. Upstill-Goddard R, Eccles D, Fliege J, et al. Machine learning approaches for the discovery of gene-gene interactions in disease data. Brief Bioinform. 2013;14(2):251–260.
132. Naushad SM, Janaki Ramaiah M, Pavithrakumari M, et al. Artificial neural network-based exploration of gene-nutrient interactions in folate and xenobiotic metabolic pathways that modulate susceptibility to breast cancer. Gene. 2016;580(2):159–168.
133. Stevens FR, Gaughan AE, Linard C, et al. Disaggregating census data for population mapping using random forests with remotely-sensed and ancillary data. PLoS One. 2015;10(2):e0107042.
134. Bhatt S, Gething PW, Brady OJ, et al. The global distribution and burden of dengue. Nature. 2013;496(7446):504–507.
135. Pigott DM, Bhatt S, Golding N, et al. Global distribution maps of the leishmaniases. ELife. 2014;3:e02851.
136. Pigott DM, Golding N, Mylne A, et al. Mapping the zoonotic niche of Ebola virus disease in Africa. ELife. 2014;3:e04395.
137. Messina JP, Pigott DM, Golding N, et al. The global distribution of Crimean-Congo hemorrhagic fever. Trans R Soc Trop Med Hyg. 2015;109(8):503–513.
138. Messina JP, Kraemer MU, Brady OJ, et al. Mapping global environmental suitability for Zika virus. ELife. 2016;5:pii:15272.
139. Perkins TA, Siraj AS, Ruktanonchai CW, et al. Model-based projections of Zika virus infections in childbearing women in the Americas. Nat Microbiol. 2016;1(9):16126.
141. Uzuner O, Solti I, Cadag E. Extracting medication information from clinical text. J Am Med Inform Assoc. 2010;17(5):514–518.
142. Meystre SM, Friedlin FJ, South BR, et al. Automatic de-identification of textual documents in the electronic health record: a review of recent research. BMC Med Res Methodol. 2010;10:Article 70.
143. Pakhomov S, Weston SA, Jacobsen SJ, et al. Electronic medical records for clinical research: application to the identification of heart failure. Am J Manag Care. 2007;13(6):281–288.
144. Thomas J, McNaught J, Ananiadou S. Applications of text mining within systematic reviews. Res Synth Methods. 2011;2(1):1–14.
145. Murff HJ, FitzHenry F, Matheny ME, et al. Automated identification of postoperative complications within an electronic medical record using natural language processing. JAMA. 2011;306(8):848–855.
146. Brownstein JS, Freifeld CC, Reis BY, et al. Surveillance sans frontières: Internet-based emerging infectious disease intelligence and the HealthMap project. PLoS Med. 2008;5(7):e151.
147. Freifeld CC, Mandl KD, Reis BY, et al. HealthMap: global infectious disease monitoring through automated classification and visualization of Internet media reports. J Am Med Inform Assoc. 2008;15(2):150–157.
148. Collier N, Doan S, Kawazoe A, et al. BioCaster: detecting public health rumors with a Web-based text mining system. Bioinformatics. 2008;24(24):2940–2941.
149. Althouse BM, Ng YY, Cummings DA. Prediction of dengue incidence using search query surveillance. PLoS Negl Trop Dis. 2011;5(8):e1258.
150. Signorini A, Segre AM, Polgreen PM. The use of Twitter to track levels of disease activity and public concern in the US during the influenza A H1N1 pandemic. PLoS One. 2011;6(5):e19467.
151. Thomson MC, Doblas-Reyes FJ, Mason SJ, et al. Malaria early warnings based on seasonal climate forecasts from multi-model ensembles. Nature. 2006;439(7076):576–579.
152. Shaman J, Karspeck A. Forecasting seasonal outbreaks of influenza. Proc Natl Acad Sci U S A. 2012;109(50):20425–20430.
153. Yang W, Zhang W, Kargbo D, et al. Transmission network of the 2014–2015 Ebola epidemic in Sierra Leone. J R Soc Interface. 2015;12(112):20150536.
154. DeFelice NB, Little E, Campbell SR, et al. Ensemble forecast of human West Nile virus cases and mosquito infection rates. Nat Commun. 2017;8:14592.
155. Reis J, Shaman J. Retrospective parameter estimation and forecast of respiratory syncytial virus in the United States. PLoS Comput Biol. 2016;12(10):e1005133.
156. Mountford MD. Principles and Procedures of Statistics with Special Reference to the Biological Sciences by R. G. D. Steel, J. H. Torrie [book review]. Biometrics. 1962;18(1):127.
157. Abadi M, Agarwal A, Barham P, et al. TensorFlow: large-scale machine learning on heterogeneous distributed systems. arXiv. 2015. Accessed September 10, 2019.
158. Khalilia M, Chakraborty S, Popescu M. Predicting disease risks from highly imbalanced data using random forest. BMC Med Inform Decis Mak. 2011;11:Article 51.
159. Jain AK, Mao J, Mohiuddin KM. Artificial neural networks: a tutorial. Computer. 1996;29(3):31–44.
160. Olden JD, Lawler JJ, Poff NL. Machine learning methods without tears: a primer for ecologists. Q Rev Biol. 2008;83(2):171–193.
162. Polley EC, van der Laan MJ. Super Learner in Prediction. (U.C. Berkeley Division of Biostatistics Working Paper Series, working paper 266). Berkeley, CA: Division of Biostatistics, University of California, Berkeley; 2010. http://biostats.bepress.com/ucbbiostat/paper266/. Accessed May 20, 2018.
163. Opitz D, Maclin R. Popular ensemble methods: an empirical study. J Artif Intell Res. 1999;11:169–198.
APPENDIX
Further Reading: Machine Learning Resources for Epidemiologists
Many machine learning articles and textbooks are written for an audience with a computer science background, and as a consequence, the language and terminology can be unfamiliar to epidemiologists. In order to help interested readers further explore these topics, we have selected a sample of relatively easily accessible articles that introduce the algorithms and ensemble models reviewed in this primer in greater detail:
- Artificial neural networks: Jain et al. (159); Olden et al. (160)
- Decision trees: Atkinson and Therneau (161); Olden et al. (160)
- Support vector machines: Noble (36)
- Naive Bayes: Lewis (40)
- _K_-means clustering: Jain (46)
- Bayesian model averaging: Hoeting et al. (66)
- Super Learner: Polley and van der Laan (162)
- Boosting and bagging: Opitz and Maclin (163)
In addition, An Introduction to Statistical Learning by James et al. (31) provides an accessible overview of popular machine learning algorithms and discusses them in parallel with traditional statistical approaches. A supplemental 15-hour online tutorial by Markham (164) discusses much of the same material in further detail and offers an alternative learning format. It is available at https://www.r-bloggers.com/in-depth-introduction-to-machine-learning-in-15-hours-of-expert-videos/. Both resources are open-access.
© The Author(s) 2019. Published by Oxford University Press on behalf of the Johns Hopkins Bloomberg School of Public Health. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.