An Interactive Resource to Probe Genetic Diversity and Estimated Ancestry in Cancer Cell Lines

Author manuscript; available in PMC: 2020 Apr 1.

Abstract

Recent work points to a lack of diversity in genomics studies from genome-wide association studies to somatic (tumor) genome analyses. Yet, population-specific genetic variation has been shown to contribute to health disparities in cancer risk and outcomes. Immortalized cancer cell lines are widely used in cancer research, from mechanistic studies to drug screening. Larger collections of cancer cell lines better represent the genomic heterogeneity found in primary tumors. Yet, the genetic ancestral origin of cancer cell lines is rarely acknowledged and often unknown. Using genome-wide genotyping data from 1,393 cancer cell lines from the Catalogue of Somatic Mutations in Cancer (COSMIC) and Cancer Cell Line Encyclopedia (CCLE), we estimated the genetic ancestral origin for each cell line. Our data indicate that cancer cell line collections are not representative of the diverse ancestry and admixture characterizing human populations. We discuss the implications of genetic ancestry and diversity of cellular models for cancer research and present an interactive tool, Estimated Cell Line Ancestry (ECLA), where ancestry can be visualized with reference populations of the 1000 Genomes project. Cancer researchers can use this resource to identify cell line models for their studies by taking ancestral origins into consideration.

The diverse origins of cancer health disparities

In the US, the incidence of certain cancers varies significantly by race and ethnicity, including some of the most common cancers such as breast, colorectal and prostate cancers (1). Wide disparities have also been reported in treatment outcomes and survival (1). As a first step towards addressing disparities, the National Institutes of Health Revitalization Act of 1993 resulted in the establishment of the Office of Research on Minority Health, with the mandate to conduct and support research that would be inclusive of minority populations (2). Continued efforts, including the 2010 Patient Protection and Affordable Care Act (PPACA), sought to address cancer care disparities (3). Despite these efforts, health disparities still exist (1) and exclusion of minority populations from health-related studies remains a concern (4–7).

Cancer disparities result in differences in risk and outcomes that are likely to be the result of a complex interplay between genetic (8,9), socioeconomic (10–12) and environmental factors (13), and even receipt of treatment (14). The American Society of Clinical Oncology has proposed strategies for reducing disparities through insurance reform, access to care, quality of care, prevention and wellness, research on health care disparities, and diversity in the health care workforce (3). While these strategies will reduce disparities, they do not address biological factors. Evidence is accumulating that the cancer discoveries driving progress in prevention, screening strategies and treatment derive disproportionately from populations of European descent. This review focuses on research indicating variation in biological and molecular aspects of cancers across populations.

Genetic-based studies have identified differences among ancestral populations in tumor biology and clinical response (15). However, closely associated with these findings are the rather imprecise social terms of ethnicity and race (16,17). In this paper, we have followed the convention of referring to genetic ancestry, and only secondarily comparing to self-reported race and/or ethnicity (18,19). However, this area remains controversial (20). The use of genetic ancestry as a basis for scientific studies may help understand disease prevention and intervention (21,22) although this is only one factor among many (23). Assessing the role of ancestry-associated genetic variations in disease etiology is further complicated by the recent admixture that characterizes various populations of the world (24). Hence, an individual’s ancestry can be described by quantifying the proportion of the genome derived from each contributing population (global ancestry). Heterogeneity is also observed locally in the genome, as variability is observed in the ancestral origins of any particular segment of chromosomes (local ancestry) (25). Ultimately, genetics plays a role in the biological characteristics of a cancer in the form of both germline variation and somatic alterations. Further research is needed to determine the extent to which genetic differences align with ancestral genetic changes (26).
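The relationship between local and global ancestry described above can be made concrete with a small calculation. The following is a minimal sketch, assuming a hypothetical input format in which each genomic segment has already been assigned to a single ancestral population; the segment coordinates and labels are invented for illustration. It is not the procedure used to produce the admixture proportions reported in this paper, which were estimated directly from genotypes.

```python
from collections import defaultdict


def global_ancestry(segments):
    """Length-weighted share of the genome assigned to each ancestry.

    segments: iterable of (start_bp, end_bp, ancestry_label) tuples.
    This tuple layout is a hypothetical format used only for illustration.
    """
    totals = defaultdict(float)
    for start, end, label in segments:
        if end <= start:
            raise ValueError("segment end must be greater than start")
        totals[label] += end - start
    genome_length = sum(totals.values())
    if genome_length == 0:
        raise ValueError("no segments supplied")
    return {label: length / genome_length for label, length in totals.items()}


# Hypothetical local-ancestry calls for two haplotypes of one chromosome
# (not real data).
example_segments = [
    (0, 60_000_000, "EUR"),
    (60_000_000, 100_000_000, "AFR"),
    (0, 50_000_000, "AFR"),
    (50_000_000, 100_000_000, "EUR"),
]
proportions = global_ancestry(example_segments)
print(proportions)
```

In this toy case the two labels cover 110 Mb and 90 Mb of the 200 Mb considered, so the global proportions are simply those lengths divided by the total.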

Limited cancer research in diverse populations

Cancer Genome-Wide Association Studies (GWAS) have advanced our understanding of the inherited genetic factors that influence cancer risk. Despite recent progress, however, this understanding is mostly from data obtained from populations of European ancestry (27–29). Specifically, cancer GWAS have pinpointed over 700 risk loci (29) but, remarkably, 80% were first discovered in European ancestry populations, approximately 15% in East Asians, and less than 1% in African and Latin American populations (29). Population structure, which may result from ancestry variations in a cohort, has been regarded as a confounder that can lead to spurious signals or hide true associations (30–32), and it is only recently that multiethnic cohorts have emerged as a solution to identify risk loci in more diverse populations. Despite the challenges associated with the use of multiethnic cohorts, such as admixture, genetic heterogeneity, variations in the linkage disequilibrium structure around causative variants, and imputation (27), there is a demonstrated benefit to adopting a more inclusive approach. Evidence is accumulating that relying solely on populations of European descent results in an incomplete or inaccurate representation of the genetic susceptibility to cancers (27). For example, replication of risk loci found in European populations through GWAS in multiethnic cohorts has revealed that risk factors may differ in their nature and magnitude of effect (33). The recent increase in the inclusion of non-European populations in GWAS has been mostly attributed to an increase in representation of Asian populations; collectively, African, Hispanic/Latino, and native or indigenous populations represented less than 4% of the 35 million samples included in 2,500 studies reported in the GWAS Catalog (34).

Such lack of diversity has also been observed in areas of cancer research that will have direct consequences on treatment strategies of cancer patients. For instance, the identification of actionable driver somatic (tumor) mutations has been the basis of the development of targeted cancer therapies and identification of molecular tumor subtypes. In the Cancer Genome Atlas (TCGA) exome sequencing dataset it was estimated that recurrent somatic mutations with 5% frequency would be detectable in whites, but not in populations of any other ethnic origin due to the paucity of samples from those populations (35). With only 33% of all samples identified as non-white (35), the TCGA dataset provides limited opportunities to study the relationship between disparities associated with race and cancer genomes (36). Cancer-related clinical trials also remain limited in ethnic and racial composition, limiting the applicability of trial findings (4–6,37). In 2014, less than 2% of the NCI’s clinical trials focused on non-European populations and only 20% of the randomized control studies published in higher tier journals analyzed data by race and ethnicity (7). Despite significant advances in precision medicine, we risk implementing a standard of care for only a limited segment of the population without appropriate inclusion of all groups in this type of research (38). We note that this paper addresses the use of genetic ancestry within cell line studies and is not a comprehensive review of ancestry-related contributions to health disparities; more comprehensive reviews of this topic can be found in, for example, (15,39–42). To illustrate the research that indicates ancestral-based disparities exist related to cancer risk, tumor biology, therapeutic options or outcomes, we have focused on the example of breast cancer below.

The 6q25 breast cancer risk locus clearly illustrates the variability of risk variants across populations. A GWAS of Chinese women identified rs2046210 at 6q25.1 (centromeric to ESR1, which codes for estrogen receptor alpha) associated with breast cancer risk and validated the association in an independent European ancestry cohort (43). Further replication confirmed the finding among Chinese, Japanese and European-descent American women, but not among African American women (44). Other studies have similarly failed to identify this association in African American women (45–48). In an African American replication study, only 27% of the known GWAS hits reached statistical significance, an observation that was partly explained by differences in linkage disequilibrium architecture around the causative variants as well as statistical power (49). Interestingly, a Latina breast cancer GWAS identified a protective variant of Indigenous American origin at the 6q25 locus, which acts independently of the previously known risk variants at this locus (50). Thus variants associated with risk may not validate in other populations, or may even change the direction of risk association (33). Importantly, polygenic risk scores for stratifying women based on their inherited risk of developing breast cancer, which have been developed using data derived largely from European population GWAS, perform poorly in African-American populations as a consequence of inverse directionality of 30–40% of the susceptibility loci (33).

The BRCA1 and BRCA2 genes, susceptibility genes for hereditary breast cancer, also illustrate the impact of ancestral heterogeneity (51,52). In a study of 4,835 Hispanic/Latino individuals with breast cancer from 13 countries in Latin America and the Caribbean, and Hispanic/Latino individuals in the United States (52), different frequencies of BRCA1 and BRCA2 variants were observed. The authors report that in the Bahamas, it was estimated that 27.1% of breast cancer cases had BRCA pathogenic variants compared to other regions (typically 1–5% BRCA variants observed) (52). Further, BRCA1 variant p.A1708E was observed in the top 10 most frequent pathogenic variants from Hispanic/Latino breast cancer individuals, yet this variant is not reported among the top 20 most frequent BRCA1 variants (52). Higher frequencies of BRCA pathogenic variants have also been observed in young black women (53) and Hispanics in the Southwestern United States (54).

Triple negative breast cancer (TNBC) has been shown to be more frequent in women of West African ancestry (55). This has significant clinical relevance as TNBC tumors are aggressive and often have limited specific therapies available (56). Several studies have identified an increased proportion of basal-like breast cancers in populations of African ancestry (57–61). Increased frequency of TNBC has also been observed in the Hispanic/Latino population (62–68), American Indian/Alaska Native population (64), and women from the Indian subcontinent (69). Interestingly, Filipino women were least likely to have TNBC (69), suggesting a broad range of variability.

Transcriptional signatures of proliferation and VEGF-activated gene expression were significantly higher in African-American TNBC tumors compared to tumors from European Americans (60). Importantly, higher tumor vascularization in African-American patients may consequently suggest potential VEGF/angiogenesis-related therapeutic options for this population (60). A similar study identified that breast tumors from African-American women are more likely to present with TP53 mutations, less likely to be mutated at the PIK3CA locus and show greater tumor heterogeneity, a pattern consistent with the aggressive behavior of tumors in African-Americans (61). Research has also suggested that the presence of breast cancer stem cells (as determined by ALDH1 expression) is more prevalent in tumors from women of African ancestry compared to European/White-American populations (57–59).

The recent pan-TCGA cancer study of the immune landscape of cancer identified relationships between ancestry and immune response (70). PD-L1 expression was lower in tumors from African ancestral populations across most cancer types including breast and colorectal cancers. Estimated lymphocyte fractions were lower in Asian ancestry in uterine and bladder cancers (UCEC, BLCA). Based on these findings, the authors suggested the hypothesis that checkpoint inhibitors could demonstrate ancestry-related efficacy (70).

Cellular models in cancer research

In vitro cultures of immortalized cell lines isolated from tumors have been used as model systems in cancer for at least 65 years. Cell lines have been developed from a variety of cancers including lung (71,72), breast (73,74), and ovarian (75,76) cancer. The National Cancer Institute assembled a panel of 60 cell lines representing a number of cancers including leukemia and many solid tumor types (non-small-cell lung, colon, ovarian, renal, prostate, breast, melanoma, CNS) (77–79). However, in the era of precision medicine, 60 cell lines represent only a small number of the over 100 histologies of cancer (79). Some of the notable data panels include the Genomics of Drug Sensitivity in Cancer (GDSC) (80), the Cancer Cell Line Encyclopedia (CCLE) (81), the Catalogue of Somatic Mutations in Cancer (COSMIC) (82,83), the Cancer Therapeutic Response Portal (CTRP) (84) and CMT1000 (85) (see Supplemental Table S1 for a detailed list). These efforts have greatly expanded the number of cell line models and the data on these models available for cancer research.

The development and availability of cell line panels was driven by varied interests in the research community, governmental agencies and pharmaceutical companies, predominantly as a method for screening compounds for potential efficacy (86–88). At the very early stages of the drug development pipeline, drug toxicity and efficacy can be quickly assessed in collections of cell lines derived from various cancer types. The NCI-60 panel of cell lines led to many innovations including the measurements of compound activity (89), data analytics (90–92) and screening automation (86,93,94). The broad diversity of cell types in the NCI-60 has led to a large number of compounds being screened, approximately 150,000 in 2010 (95). Drug response in cell line panels has also been correlated with data from the wealth of molecular profiling tools available, such as gene expression (96–99), genetics (85,100–102), proteomics (103–105), and others (92). In the Connectivity Map (106), 164 small molecules were used to perturb MCF7 (breast cancer), HL60 (leukemia), SKMEL5 (melanoma) and PC3 (prostate cancer). This was vastly expanded in (107) to 19,811 compounds and 9 cell lines. Cell line panels have also been used for radiation therapy modeling (108–111) and metabolite profiling (112). In fact, cell line panels have been used to compare the applicability of cell lines with tumors (113–115).

Although cancer cell lines represent a valuable cancer research model system, issues such as misidentification and cross-contamination of cell lines (116–120) have been reported. Moreover, cell lines represent immortalized cancer cells and are often viewed with skepticism as models of in vivo tumor development (71,114,121–124). Recently, genetic drift within an individual cell line, the breast cancer cell line MCF7, was shown to result in highly disparate drug responses in different laboratory isolates (125). Finally, concerns over adequate patient consent for creating cell lines have arisen, most notably from HeLa cells (126–130).

Leveraging cell line models in health disparities research

While the NCI-60 provides a well-characterized resource of cell line models, the personalized medicine era challenged the paradigm of a single representative for an entire disease category (131,132). A broader representation of cancer was introduced through larger cell line panels such as the CCLE, although, as we demonstrate, large gaps still remain. Compounding this under-representation in cell line models is the lack of diversity in large molecular studies (28,35). Thus the ability to adequately address precision medicine with respect to genetic ancestry is severely limited.

When a scientist chooses a cell line model, considerations should include the disease (e.g. breast cancer), molecular classification (e.g. triple-negative breast cancer) and genetic ancestry (e.g. ancestral components of a relevant population), as well as practical laboratory considerations. Understanding the underpinnings of cancer risk associated with different genomic loci in GWAS follow-up studies requires researchers to identify cancer as well as normal tissue cell lines that reflect the population in which the association was identified. Additionally, when drug response correlations with molecular information are considered, the variable of estimated genetic ancestry should be included. For the reasons described above, genetic ancestry can impact the aggressiveness of disease (as in prostate cancer in African American (AA) men), type of disease (as in TNBC in Hispanic/Latinos) or response to therapy. Thus, having accurate cell line ancestry information available supports experimental conclusions relevant to the population studied but not necessarily applicable to other populations. Further, actively selecting cell line models reflective of a study population allows for directed conclusions and actions in this population from gene perturbation (knock-down) functional studies or drug treatment response/resistance experiments.

Several research studies have addressed these considerations. For example, in (133) the authors examined the ancestry of several commonly used prostate cancer cell lines (including 22Rv1, PC3, DU145). In a larger study, germline variants in 993 cell lines were examined for associations with response to 265 drugs (134). While not explicitly examining ancestry, this result clearly indicates that the genetic background of cells can impact drug response.

Ancestral composition of cancer cell line models

We have identified a lack of research aids for determining genetic diversity in existing cell line databases. As an aid to cancer researchers and to support disparities studies, we have estimated the genetic ancestral components in existing cell line databases. First, we identify genetic ancestral populations that do not currently have representative cell line models. Secondly, we provide the admixture of genetic populations such that representative models can be identified for populations being studied. Future scientific studies can benefit from using this information on admixture of estimated ancestry within the cell line models when evaluating in vitro molecular biology endpoints and therapeutic responses. We also expect this resource to guide future efforts to generate cell lines in specific cancers in which disparities have been identified.

Using available genome-wide genotyping data (see Supplemental Material and Methods), we have determined the admixture proportions of 1,393 cancer cell lines (Supplemental Table S2) representing various cancer types (Supplemental Table S3) from the COSMIC and CCLE cell line panels using Admixture 1.3 (135). Excess genetic similarity was noted in 91 cell line pairs (Supplemental Table S4). Cell line Single Nucleotide Polymorphism (SNP) data was combined with population SNP data from The 1000 Genomes Project Consortium (24) (1kG, http://www.internationalgenome.org). This combined dataset was filtered (709,034 single nucleotide variants) and visualized using t-Distributed Stochastic Neighbor Embedding (t-SNE) (136) (Figure 1A) and principal components analysis (Figure 1B). Cell lines and 1kG populations were grouped based on the Infomap approach of detecting community structure from the adjacency graph of each sample’s 30 nearest neighbors (in Principal Component space) (137). Cell line associations were made based on the most common 1kG population in the corresponding cluster: African (AFR), African American (AMR_AA), East Asian (EAS), European (EUR), Hispanic/Latino (AMR_HL) or South Asian (SAS). Admixture proportions for each cell line are presented in Supplemental Table S5.
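The final assignment step described above (labeling each cell line with the most common 1kG population in its cluster) can be sketched as a majority vote. This is an illustrative sketch only: the sample names, cluster identifiers and the function name `assign_cell_lines` are hypothetical, and the cluster memberships are assumed to have been computed beforehand by a community detection method.

```python
from collections import Counter


def assign_cell_lines(cluster_of, reference_population):
    """Label each non-reference sample with the most common reference
    population found in its cluster (None if the cluster has no references).

    cluster_of: {sample_name: cluster_id} for all samples.
    reference_population: {sample_name: population_label} for reference
    samples only; any sample missing from it is treated as a cell line.
    """
    ref_counts = {}
    for sample, cluster in cluster_of.items():
        population = reference_population.get(sample)
        if population is not None:
            ref_counts.setdefault(cluster, Counter())[population] += 1

    labels = {}
    for sample, cluster in cluster_of.items():
        if sample in reference_population:
            continue
        counts = ref_counts.get(cluster)
        labels[sample] = counts.most_common(1)[0][0] if counts else None
    return labels


# Hypothetical cluster memberships (not real data).
cluster_of = {
    "ref1": 0, "ref2": 0, "ref3": 0, "ref4": 1, "ref5": 1,
    "cellA": 0, "cellB": 1, "cellC": 2,
}
reference_population = {
    "ref1": "EUR", "ref2": "EUR", "ref3": "AMR_HL",
    "ref4": "EAS", "ref5": "EAS",
}
labels = assign_cell_lines(cluster_of, reference_population)
print(labels)
```

A cluster containing no reference samples yields no label, and ties between populations are broken arbitrarily here (by first occurrence), so a real analysis would need an explicit rule for both situations.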

Figure 1.

Estimated genetic ancestry of cell lines within key cell line panels with the 1000 Genomes Project (1kG) reference populations. (A) t-SNE plot of SNP data for cell line panels and 1kG reference populations where each reference population is labeled with the 1kG label (see Table S8 for abbreviation definitions) and the cell lines are labeled as small purple circles primarily clustered in the JPT (Japan), GBR (Great Britain) and CEU (Utah residents with Northern and Western European Ancestry) clusters indicating the majority of cell lines are limited to a few major genetic ancestral groups. (B) Principal Component Analysis (PCA) plot of the cell line panels with the 1kG reference populations. (C) Panel of t-SNE plots showing specific estimated admixture component of ancestral populations estimated through an Admixture analysis with 1kG references and cell lines (7 populations, Q1-Q7 – see Table S5 for Admixture proportions). Shown are samples with majority admixture (Q1–7 color) for the specific population. Waterfall plots show the relative component fraction in each cell line and 1kG sample.

Comparing reported ethnicity to measured genetic ancestry

There is ample literature assessing the correspondence between genetic ancestry and self-identified race and ethnicity. While the former can be described and quantified through molecular genetic analysis, one’s perceived race and ethnicity is influenced by subjective variables. This perception stems from the complex interaction between physical characteristics and sociocultural factors. For more than half of the cell lines studied, self-reported ethnicity information could be obtained from one of the commonly used cell line databases, including Cellosaurus (138), COSMIC (139), Biosample (140) and ATCC (https://www.atcc.org), among others. In the remaining 46.3%, information regarding the ethnicity of the individual from which the cell line was derived could not be easily recovered. In 64 of the cell lines, the reported ethnicity did not correspond to the ancestry as measured by genetic markers. Cell lines reported as ‘African’ or ‘Black’ clustered with African American populations in 81.6% of the cases, emphasizing the ambiguity of the existing nomenclature. In fact, the proportion of the genome inferred to be of European origin in these cell lines averaged 18.32% (ranging from 0% to 95.09%). Another type of ambiguity concerns the cell line Hs 698.T, labeled as originating from an ‘American Indian’, which clusters with populations of South Asia, suggesting an origin in India rather than from a Native/Indigenous American individual. A total of 26 cell lines were reported as Caucasian but clustered genetically with other populations including African (n=2), African American (n=6), East Asian (n=1), Hispanic/Latino (n=16), and South Asian (n=1). Interestingly, 89% of the cell lines identified as Hispanic/Latino from admixture patterns and clustering are reported as ‘Caucasian’. Several groups have reported a concordance between genetic clustering and self- or observer-reported membership in major racial/ethnic groups (141–143). However, these categories do not capture the inherent heterogeneity of admixed populations (144–147). What appear to be inconsistencies between self-report and genetic data may result from individuals having limited knowledge of their ancestral origins, or culturally identifying with an ethnic group that is not representative of their admixture proportions (18). Sociological, behavioral and biological factors that underlie race, ethnicity and ancestry are likely to interact (148). Consequently, from a biomedical research perspective, both self-reports of race/ethnicity group as well as genetically determined clustering and admixture are expected to be relevant in understanding disease susceptibility, and ultimately, the causes of health disparities (18,148,149).
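Comparisons of this kind amount to a cross-tabulation of the reported label against the genetically inferred cluster. The sketch below, with invented records and hypothetical function names, shows one way such a table and the row-wise fractions quoted above could be computed. It deliberately avoids mapping reported terms onto genetic clusters, since that mapping is itself ambiguous.

```python
from collections import Counter


def cross_tabulate(records):
    """Count (reported ethnicity, genetic cluster) pairs.

    records: iterable of (reported_label_or_None, cluster_label) tuples.
    Missing reports are grouped under 'NA'.
    """
    table = Counter()
    for reported, cluster in records:
        table[(reported if reported else "NA", cluster)] += 1
    return table


def fraction_in_cluster(table, reported, cluster):
    """Fraction of samples with a given reported label that fall in a cluster."""
    row_total = sum(n for (r, _), n in table.items() if r == reported)
    if row_total == 0:
        return None
    return table[(reported, cluster)] / row_total


# Hypothetical records (not real data).
records = [
    ("Caucasian", "EUR"),
    ("Caucasian", "EUR"),
    ("Caucasian", "AMR_HL"),
    ("Black", "AMR_AA"),
    (None, "EAS"),
]
table = cross_tabulate(records)
share = fraction_in_cluster(table, "Caucasian", "EUR")
print(table)
print(share)
```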

Distribution of genetic ancestry of cancer cell lines

Ancestry distribution of the cell lines is shown in Figure 1C and summarized in Supplemental Table S6. Across all cell lines, there was a clear bias in the representation of ancestry, with the majority of the cancer cell lines studied determined to be of European or East Asian origin (62.46% and 29.18%, respectively). All other reference populations were represented by less than 10% of the cell lines, with cell lines of African origin accounting for 5.26%, African American 0.86%, Hispanic/Latino 1.95% and South Asian 0.29%. These overall distributions were similar for subsets of cell lines representing the COSMIC and CCLE collections. However, the NCI60 panel stood out with the majority of the cell lines originating from individuals of European descent (over 94%).

Proportions of cell lines associated with ancestral groups also varied across cancer types as detailed in Figure 2 and Supplemental Table S7. While breast and lung cancer cell lines have the highest proportion of African descent cell lines (17.19% and 19.83% respectively), breast cancer had the lowest proportion of cell lines of Asian origin (6.25%). Below we describe several significant limitations by cancer types known to exhibit disparities.

Figure 2.

Stacked barplots of the proportion of cell lines within population by disease type. For each annotated disease type, the cell lines are summarized by cell line panel. Each bar represents the proportion of cells within the group with the majority admixture belonging to one of 6 groups (AA: African American, AFR: African, EAS: East Asian, EUR: European, H/L: Hispanic/Latino, SAS: South Asian). The results clearly indicate the overwhelming proportion of European-ancestry cell lines within the panels.

In prostate cancer, risk alleles at the 17q21 susceptibility locus have been shown to be rare in European and Asian populations but may contribute to up to 10% of the prostate cancer risk in men of African descent (150). In a large multi-ethnic replication study of prostate cancer risk GWAS hits, the magnitude of the association of known risk loci also varied substantially across cohorts of different ethnicities (151). Novel signals unique to men of African ancestry were recently identified on chromosomes 13q34 and 22q12, further supporting the contribution of population-specific variants to prostate cancer risk (152). Recent work indicates that beyond inherited risk variants, somatic driver mutations also differ in the African population compared to European-derived tumors (153). African American men are diagnosed with prostate cancer at a younger age, have different treatment profiles, and have a higher risk of prostate cancer specific mortality even after adjusting for other factors (154). Ten prostate cell lines (seven carcinoma, one hyperplasia, two normal) are reported in CCLE and NCI-60. Despite widely acknowledged differences in the incidence and severity of prostate cancer in men of African descent, African ancestral genetic factors are represented in only 1 out of 10 cell lines (Q7 > 5%). This single cell line was MDAPCA2B, consisting of an estimated 90% African component (Q7=90% AFR/AMR-AA). Most cell lines have a majority European (Q1+Q6) ancestry component. Interestingly, BPH-1, while reported as “Japanese”, has a European component of 95%, and an Asian component of 4%.

Cell lines of East Asian origin made up the vast majority of stomach cancer cell lines (86.05%). This might reflect the higher incidence of these cancers in Asian populations. However, the increased burden of gastric cancer in Latin America (155,156) suggests that better representation outside of East Asian origin will be important.

Asian/Pacific Islander men and women experience a 70% and 95% higher incidence rate of liver cancer, respectively, than European American men and women. Hispanic men and women have a similarly elevated incidence of liver cancers (157). Liver cancer cell lines appear to be more representative when considering Asian ancestry: of the 27 listed cell lines, 16 have a reported ethnicity consistent with Asian ancestry. However, we note that 1000 Genomes does not include Pacific Islander populations, and so we are currently unable to distinguish this ancestral component. Twenty-two of the 27 cell lines have East Asian (Q3+Q4) components of >80%. Two cell lines have African (Q7) components >70%. However, only two cell lines have Native American (Q2) components >5% (C3A, HEPG2).

Lung cancer is highly prevalent in Hispanic/Latino (HL) men and women, and is the leading cause of cancer death in HL men (158). Recent studies have shown a difference in mutation prevalence among common oncogenic driver genes: EGFR is more highly mutated in Asian (159) and HL (160,161) populations, whereas KRAS is more highly mutated in Non-Hispanic Whites (NHW) (160). This difference may have a direct impact on treatment and outcomes, as EGFR and KRAS mutation status affects choice of treatment. Again, the majority of 230 lung cancer cell lines (including adenocarcinoma, squamous cell carcinoma, and small cell carcinoma) have majority European ancestry. Only four cell lines have Native American (Q2) components >5% (COLO668: 16.6%, HS618T: 21.6%, NCI-H716: 14.7%, NCI-H1435: 15.6%), 75 cell lines have Asian ancestral components (Q3, Q4, Q5) >5%, and 31 cell lines have African ancestral components (Q7) >5%.
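The counts in the preceding paragraphs come from thresholding summed admixture components (for example Q1+Q6 for European, Q2 for Native American, Q3 to Q5 for Asian and Q7 for African). A minimal sketch of that filtering is given below. The component groupings follow the text, but the cell line names and Q values are hypothetical placeholders rather than entries from Supplemental Table S5, and the function name is an assumption.

```python
# Component groupings as described in the text.
COMPONENT_GROUPS = {
    "European": ("Q1", "Q6"),
    "Native American": ("Q2",),
    "Asian": ("Q3", "Q4", "Q5"),
    "African": ("Q7",),
}


def lines_above_threshold(admixture, group, threshold=0.05):
    """Names of cell lines whose summed components for `group` exceed `threshold`.

    admixture: {cell_line: {"Q1": fraction, ..., "Q7": fraction}};
    components that are absent are treated as zero.
    """
    components = COMPONENT_GROUPS[group]
    return sorted(
        name
        for name, q in admixture.items()
        if sum(q.get(c, 0.0) for c in components) > threshold
    )


# Hypothetical admixture proportions (not real data).
admixture = {
    "LINE_A": {"Q1": 0.70, "Q6": 0.25, "Q2": 0.03, "Q7": 0.02},
    "LINE_B": {"Q1": 0.40, "Q6": 0.40, "Q2": 0.16, "Q7": 0.04},
    "LINE_C": {"Q3": 0.50, "Q4": 0.45, "Q5": 0.03, "Q1": 0.02},
    "LINE_D": {"Q7": 0.90, "Q1": 0.06, "Q6": 0.04},
}
native = lines_above_threshold(admixture, "Native American")
african = lines_above_threshold(admixture, "African")
asian = lines_above_threshold(admixture, "Asian")
print(native, african, asian)
```

Applied to a table of admixture proportions, the same function could be used with a higher threshold (such as 0.8) to find lines dominated by a single ancestral group.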

Estimated Cell Line Ancestry

Using the estimated ancestry from the cell line panels and the 1000 Genomes populations (described above), we have developed an online, interactive and searchable web-based tool that allows visualizing and exporting of publication-quality figures for the estimated genetic ancestry and population structure of cancer cell lines in relation to reference populations of the 1000 Genomes project. For all samples, the contribution of each inferred ancestral population to the genome is quantified and available via tooltips. The tool can be accessed at http://ecla.moffitt.org/.

The application visualizes a t-Distributed Stochastic Neighbor Embedding (t-SNE) (136) plot (Figure 1) of the genotype data for both the 1kG populations and the cell lines. A mouse-over tooltip provides detailed information on the sample. For all samples, the sample name is indicated as well as Q1-Q7 admixture proportions. The 1kG population sample detail includes the population and super-population codes. The cell line detail includes whether it is in CCLE and/or COSMIC, as well as the reported tissue type. The reported ethnicity of the cell line is also included (or NA if not available). All available annotation information on the cell lines and 1kG reference samples is presented in table form in the ‘Table: Cell Line’ or ‘Table: Ref’ tabs of the application.

The 1kG clusters can be visually annotated by 1kG population or 1kG super-population. Cell lines are not assigned to clusters by default but are indicated by small, purple circles. Several options exist to categorize cell lines. Cell lines can be annotated by the reported ethnicity from the cell line panel (although a large proportion are missing ethnicity annotation); by admixture score (Q1-Q7); or from cluster association using a graph-based clustering approach. The graph-based clustering approach, Infomap (137), is used to detect community structure from the adjacency graph of each sample’s 30 nearest neighbors (in Principal Component space).
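The input to the graph-based clustering is a nearest-neighbor adjacency graph built in principal component space. The sketch below shows only that graph construction step, using brute-force Euclidean distances on invented two-dimensional coordinates with k=2 rather than the 30 neighbors used here; the community detection step itself is not shown, and the function name `knn_graph` is an assumption made for illustration.

```python
import math


def knn_graph(coords, k):
    """Return {sample: [names of its k nearest neighbours]}.

    coords: {sample_name: tuple of principal component coordinates}.
    Distances are Euclidean; neighbours are listed from nearest to farthest.
    """
    if not 0 < k < len(coords):
        raise ValueError("k must be between 1 and len(coords) - 1")
    graph = {}
    for name, point in coords.items():
        others = sorted(
            (math.dist(point, other_point), other_name)
            for other_name, other_point in coords.items()
            if other_name != name
        )
        graph[name] = [other_name for _, other_name in others[:k]]
    return graph


# Hypothetical principal component coordinates (not real data).
coords = {
    "s1": (0.0, 0.0), "s2": (0.1, 0.0), "s3": (0.0, 0.2),
    "s4": (5.0, 5.0), "s5": (5.1, 5.0), "s6": (5.0, 5.3),
}
neighbours = knn_graph(coords, k=2)
print(neighbours)
```

The brute-force search is quadratic in the number of samples, which is acceptable for a toy example; a dataset with thousands of samples would more likely rely on a spatial index or an approximate nearest-neighbor method.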

Search functionality is built into the application so that a cell line (e.g. A-549) or all cell lines (‘cell’) can be highlighted. Reference 1kG populations and super-populations (e.g. MXL) can also be searched and highlighted. This functionality allows a researcher to quickly identify the estimated genetic ancestry of the cell line being considered, with respect to reference 1kG populations. The tool also allows searching and highlighting of cell lines by the “Reported Ethnicity” terms or by cell line tissues of origin.

Additional views in this tool include the 2-dimensional principal components (PCA) plot with the same functionality as the t-SNE clustering. Side-by-side plots of t-SNE and PCA can also be selected to visualize particular populations or cell lines in both visualizations simultaneously. Given the complexity of the data being represented, a 3-dimensional t-SNE clustering is also available interactively so that the view can be rotated in three dimensions to see additional structure. Finally, the t-SNE plot can be annotated with the admixture memberships (Q1-Q7) as a further method of exploring additional structure in this clustering.

This tool enables a researcher to explore the CCLE and COSMIC cell line panels with respect to 1kG reference populations. A researcher can use this tool to select cancer cell lines for study that better represent the population under examination. Further, when researchers perform cancer drug screenings or mechanistic studies, the effect of genetic ancestry can be considered in the analysis. Further descriptions of the tool and methods for generating the data are available in Supplemental Data and Methods.

Concluding remarks

In summary, we identify an important gap in our knowledge and understanding of genetic-based disparities within cancer research. Most cancer studies have not systematically taken into consideration the ancestry composition in the cell lines used to model the disease in vitro. To mitigate this problem we present an interactive tool that allows the investigation of specific global ancestry in cell line models. We expect this resource to allow a direct examination of ancestry in cell line models and to direct efforts to redress the underrepresentation in cancer types with clear disparities. Incorporating estimated genetic ancestry within cell line molecular biology and drug discovery studies can significantly improve the rigor and reproducibility of cancer research activities, not just those explicitly examining the role of genetic ancestry in cancer biology.

Supplementary Material

1

2

Acknowledgements

This work was supported by the PHSU-MCC Partnership (NCI U54 CA163071 and U54 CA163068) under the Developmental Grant program and the Quantitative Sciences Core; by NCI 1SC1CA182845; and by the Cancer Informatics Shared Resource at the H. Lee Moffitt Cancer Center & Research Institute, an NCI designated Comprehensive Cancer Center (P30-CA076292).

References
