PEDSTATS: descriptive statistics, graphics and quality assessment for gene mapping data
Journal Article
Center for Statistical Genetics, Department of Biostatistics, School of Public Health, University of Michigan Ann Arbor, MI 48103, USA
*To whom correspondence should be addressed.
Revision received:
05 June 2005
Abstract
Summary: We describe a tool that produces summary statistics and basic quality assessments for gene-mapping data, accommodating either pedigree or case-control datasets. Our tool can also produce graphic output in the PDF format.
Availability: http://www.sph.umich.edu/csg/abecasis/Pedstats/download/
Contact: wiggie@umich.edu
Supplementary information: http://www.sph.umich.edu/csg/abecasis/Pedstats/
A crucial first step in the analysis of gene mapping data is the careful description of the available data, including, for example, genotyping completeness and heterozygosities for genetic markers, and distributions and familial correlations for quantitative traits. Although a number of programs now provide some facilities for data checking or summary (Mukhopadhyay et al., 2005; Lange et al., 1988; Elston et al., 2004; O'Connell and Weeks, 1998), complete screening and summary of genetic data frequently involves the use of multiple programs and/or in-house tools. As the scale of the datasets available for analysis increases, this process can become particularly challenging. For example, with the advent of high-throughput single nucleotide polymorphism genotyping technologies, datasets will soon be available that include genotypes for hundreds of thousands or millions of markers for each individual. In addition, with the focus on uncovering the genetic basis of complex disease, it is likely that collaborative projects will collect samples with hundreds or thousands of phenotypes, each measured on thousands of individuals. We have developed PEDSTATS, a freely available utility for summarizing salient features of gene-mapping data and performing basic quality checks on them. Our utility can conveniently handle these very large datasets; here we summarize its main features.
PEDSTATS runs on any platform where a modern C++ compiler is available, including those based on the Linux, UNIX, Windows and Mac OS X operating systems. It is a command-line utility that can produce both text output to the console and graphical output to a PDF file. Its major capabilities can be grouped into four areas: (1) checks of input formats and pedigree consistency, (2) checks and descriptions of genetic marker data, (3) checks and descriptions of quantitative traits and covariates and (4) descriptions of discrete traits. We describe each of these in turn below.
The first step in any analysis is the validation of input files. At this stage, common data-format errors such as missing or extraneous columns are reported. Next, the reported family structures are validated to ensure that all connecting individuals are present and that sex codes are consistent for the various individuals. If desired, large pedigrees can be separated into disconnected family units or trimmed to remove uninformative individuals with no phenotype or genotype data, and a new set of input files can be generated. A brief summary of the number of pedigrees and individuals, together with the distribution of individuals per family, is produced. This information can be graphically summarized (Fig. 1A is an example summarizing the distribution of family sizes in one large dataset) and, optionally, includes counts for various types of relative pairs, which can be further broken down by sex. PEDSTATS readily accepts files prepared for other packages we have developed, including those prepared for linkage analyses with Merlin (Abecasis et al., 2002), association analyses with QTDT (Abecasis et al., 2000) and relationship inference with GRR (Abecasis et al., 2001). Other popular formats, such as those used by the LINKAGE package (Lathrop et al., 1985) and by MENDEL and related tools (Lange et al., 1988), are also accommodated.
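The kind of family-structure check described above can be illustrated with a short sketch. It is not the PEDSTATS implementation: the function name `check_pedigree`, the tuple layout and the coding conventions ('0' for an unknown parent, sex coded 1 for male and 2 for female, as in LINKAGE-style files) are assumptions made for this example.

```python
from collections import Counter

def check_pedigree(rows):
    """Report missing parents and inconsistent parental sex codes.

    rows: list of (family, person, father, mother, sex) tuples.
    Assumed coding: '0' = unknown parent; sex 1 = male, 2 = female.
    """
    sex = {(fam, pid): s for fam, pid, _, _, s in rows}
    problems = []
    for fam, pid, father, mother, _ in rows:
        for parent, expected, role in ((father, 1, "father"), (mother, 2, "mother")):
            if parent == "0":
                continue  # founder or unknown parent: nothing to check
            if (fam, parent) not in sex:
                problems.append(f"{fam}/{pid}: {role} {parent} is not in the file")
            elif sex[(fam, parent)] != expected:
                problems.append(f"{fam}/{pid}: {role} {parent} has an inconsistent sex code")
    # Number of individuals per family, the basis of a family-size summary
    family_sizes = Counter(fam for fam, *_ in rows)
    return problems, family_sizes
```

Called on a list of pedigree records, the function returns a list of human-readable problems together with the number of individuals in each family, which is the raw material for a family-size histogram such as Fig. 1A.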
When verifying genetic marker data, PEDSTATS reports basic statistics such as heterozygosity and genotyping completeness and can produce graphical summaries of allele and genotype frequencies. After automatic grouping of rare alleles, conformance of observed genotypes with Hardy–Weinberg equilibrium can be checked with a χ² test for multi-allelic markers or an exact test for bi-allelic markers (Wigginton et al., 2005). Results of Hardy–Weinberg tests, including an exact distribution for the number of heterozygotes in the sample, can be presented graphically (e.g. Fig. 1B). Mendelian inheritance checks for both autosomal and X-linked marker data are also carried out using a genotype elimination algorithm that finds all inconsistencies in pedigrees without loops (Lange and Goradia, 1987; O'Connell and Weeks, 1999). Verifying Mendelian consistency prior to analysis of genetic marker data can be a crucial step (Lange and Goradia, 1987; O'Connell and Weeks, 1998), since most genetic analysis programs do not model genotyping error explicitly (for an exception, see Sobel et al., 2002).
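To make the exact test for bi-allelic markers concrete, the sketch below computes the distribution of the number of heterozygotes conditional on the observed allele counts, and the corresponding P-value as the total probability of all outcomes no more likely than the one observed. It evaluates the standard closed-form probability directly with log-factorials; the function names are ours, and the code in PEDSTATS itself may be organized differently.

```python
from math import exp, lgamma, log

def _log_factorial(k):
    return lgamma(k + 1)

def heterozygote_distribution(n_people, n_a):
    """P(number of heterozygotes | n_people genotypes, n_a copies of allele A)
    under Hardy-Weinberg equilibrium. Returns {het_count: probability}."""
    n_b = 2 * n_people - n_a
    const = (_log_factorial(n_people) + _log_factorial(n_a)
             + _log_factorial(n_b) - _log_factorial(2 * n_people))
    dist = {}
    # The heterozygote count must have the same parity as the allele count
    for het in range(n_a % 2, min(n_a, n_b) + 1, 2):
        hom_a = (n_a - het) // 2
        hom_b = (n_b - het) // 2
        log_p = (het * log(2) + const - _log_factorial(hom_a)
                 - _log_factorial(het) - _log_factorial(hom_b))
        dist[het] = exp(log_p)
    return dist

def hwe_exact_pvalue(hom_a, het, hom_b):
    """Exact P-value: sum of the probabilities of all heterozygote counts
    that are no more likely than the observed count."""
    n_people = hom_a + het + hom_b
    dist = heterozygote_distribution(n_people, 2 * hom_a + het)
    p_obs = dist[het]
    # Small relative tolerance guards against floating-point ties
    return min(1.0, sum(p for p in dist.values() if p <= p_obs * (1 + 1e-9)))
```

For example, `hwe_exact_pvalue(hom_a, het, hom_b)` takes the three observed genotype counts, while `heterozygote_distribution` returns the full distribution of the kind plotted in Fig. 1B.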
For quantitative traits and covariates, PEDSTATS reports the range, mean and variance of the trait distribution along with the correlation between siblings. Several graphics, including histograms of the overall trait distribution and comparisons of distributions between males and females, can be generated (as illustrated in Fig. 1, Panel C, which summarizes the distribution of ‘Height’ in one large dataset). These can be helpful in detecting outliers as well as deviations from approximate normality, which is important for many quantitative trait analyses (Allison et al., 1999). Optionally, correlations for other relative pair types can be calculated and plotted (as illustrated in Fig. 1, Panel D, which summarizes the correlations between ‘Weight’ for different relative pairs) and stratified by sex, if desired. Correlations between relatives can provide information about the overall impact of genes on a particular trait. In the example, it is clear that the correlation of the variable ‘Weight’ for first-degree relatives (in this case, parent–offspring and sibling pairs) is higher than for more distant relatives (half-sibling, avuncular, grandparent–grandchild and cousin pairs). When an age variable is present, PEDSTATS checks that the value recorded for each individual is compatible with those of their ancestors, subject to user-specified minimum and maximum generation times.
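As a rough illustration of a sibling correlation, the sketch below pools all sibling pairs and enters each pair in both orders, so that the result does not depend on the arbitrary ordering of siblings within a family. This is one common pairwise estimator, chosen here as an assumption; the text above does not specify which estimator PEDSTATS uses, and the function name is hypothetical.

```python
from itertools import combinations

def sib_pair_correlation(sibships):
    """Pairwise correlation of a quantitative trait between siblings.

    sibships: list of lists, one inner list of trait values per sibship
    (individuals with missing values are assumed to be dropped already).
    Returns None when no pairs exist or the trait has zero variance.
    """
    xs, ys = [], []
    for values in sibships:
        for a, b in combinations(values, 2):
            # Enter each pair twice, (a, b) and (b, a), to symmetrize
            xs += [a, b]
            ys += [b, a]
    n = len(xs)
    if n == 0:
        return None
    mean = sum(xs) / n  # xs and ys hold the same values, so they share a mean
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return cov / var if var > 0 else None
```

The same function applies to any other relative pair type if each inner list is replaced by the trait values of one pair.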
Finally, for discrete traits, PEDSTATS reports the proportion of phenotyped individuals and provides a breakdown of affected individuals. A summary of affected, unaffected and discordant pairs can also be produced, and may help guide decisions on whether a dataset contains sufficient information for an affected relative pair analysis to be carried out (Risch, 1990; Whittemore and Halpern, 1994). As with the other analysis options, discrete trait reports can be segregated by sex.
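The pair summary for a discrete trait amounts to a simple tally, sketched below for sibling pairs. The affection coding (2 = affected, 1 = unaffected, 0 = unknown) follows the common LINKAGE-style convention and is an assumption of this example, as is the function name.

```python
from collections import Counter
from itertools import combinations

def classify_sib_pairs(sibships):
    """Count affected, unaffected and discordant sibling pairs.

    sibships: list of lists of affection codes, one inner list per sibship.
    Assumed coding: 2 = affected, 1 = unaffected, 0 = unknown (ignored).
    """
    counts = Counter()
    for statuses in sibships:
        known = [s for s in statuses if s in (1, 2)]
        for a, b in combinations(known, 2):
            if a == 2 and b == 2:
                counts["affected"] += 1
            elif a == 1 and b == 1:
                counts["unaffected"] += 1
            else:
                counts["discordant"] += 1
    return counts
```

A dataset in which the "affected" count is very small is unlikely to be informative for an affected sibling pair analysis, which is the kind of decision such a summary is meant to support.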
In addition to the ability to report statistics separately for different relative pairs and segregate results by sex, PEDSTATS can produce reports for individual families and allows various filters to be applied to input data prior to analysis. For example, all analyses can be restricted to affected individuals (for a specific trait) or to individuals with a minimal amount of genotype data.
We hope our tool will prove valuable to scientists seeking to discern important features of their data, and that it will ease the burdensome task of verifying the consistency and integrity of input files. Executables, source code and a web-based tutorial that explains input file formats, implementation details and output for the various tests are available from our website.
Fig. 1. Examples of available graphical output. (A) provides information on the distribution of family sizes; (B) summarizes the observed genotype distribution and the exact distribution of heterozygotes conditional on observed allele counts; (C) provides information on the distribution of a quantitative trait; and (D) summarizes relative pair correlations. More detailed descriptions and examples are available on our website.
This work was supported by research grants from the National Human Genome Research Institute and the National Eye Institute.
Conflict of Interest: none declared.
REFERENCES
Abecasis, G.R., et al. (2000) A general test of association for quantitative traits in nuclear families. Am. J. Hum. Genet., 66, 279–292.
Abecasis, G.R., et al. (2001) GRR: graphical representation of relationship errors. Bioinformatics, 17, 742–743.
Abecasis, G.R., et al. (2002) Merlin—rapid analysis of dense genetic maps using sparse gene flow trees. Nat. Genet., 30, 97–101.
Allison, D.B., et al. (1999) Testing the robustness of the likelihood-ratio test in a variance-component quantitative-trait loci-mapping procedure. Am. J. Hum. Genet., 65, 531–544.
Elston, R., Bailey-Wilson, J., Bonney, G., Tran, L., Keats, B., Wilson, A. (2004) SAGE Statistical Analysis for Genetic Epidemiology, Version 5.0.
Lange, K. and Goradia, T.M. (1987) An algorithm for automatic genotype elimination. Am. J. Hum. Genet., 40, 250–256.
Lange, K., et al. (1988) Programs for pedigree analysis: MENDEL, FISHER, and dGENE. Genet. Epidemiol., 5, 471–472.
Lathrop, G.M., et al. (1985) Multilocus linkage analysis in humans: detection of linkage and estimation of recombination. Am. J. Hum. Genet., 37, 482–498.
Mukhopadhyay, N., et al. (2005) Mega2: data-handling for facilitating genetic linkage and association analyses. Bioinformatics, 21, 2556–2557.
O'Connell, J.R. and Weeks, D.E. (1998) PedCheck: a program for identification of genotype incompatibilities in linkage analysis. Am. J. Hum. Genet., 63, 259–266.
O'Connell, J.R. and Weeks, D.E. (1999) An optimal algorithm for automatic genotype elimination. Am. J. Hum. Genet., 65, 1733–1740.
Risch, N. (1990) Linkage strategies for genetically complex traits. II. The power of affected relative pairs. Am. J. Hum. Genet., 46, 229–241.
Sobel, E., et al. (2002) Detection and integration of genotyping errors in statistical genetics. Am. J. Hum. Genet., 70, 496–508.
Whittemore, A.S. and Halpern, J. (1994) A class of tests for linkage using affected pedigree members. Biometrics, 50, 118–127.
Wigginton, J.E., et al. (2005) A note on exact tests of Hardy–Weinberg equilibrium. Am. J. Hum. Genet., 76, 887–893.
© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oupjournals.org