PEDSTATS: descriptive statistics, graphics and quality assessment for gene mapping data (original) (raw)

Journal Article

,

Center for Statistical Genetics, Department of Biostatistics, School of Public Health, University of Michigan Ann Arbor, MI 48103, USA

*To whom correspondence should be addressed.

Search for other works by this author on:

Center for Statistical Genetics, Department of Biostatistics, School of Public Health, University of Michigan Ann Arbor, MI 48103, USA

Search for other works by this author on:

Revision received:

05 June 2005

Navbar Search Filter Mobile Enter search term Search

Abstract

Summary: We describe a tool that produces summary statistics and basic quality assessments for gene-mapping data, accommodating either pedigree or case-control datasets. Our tool can also produce graphic output in the PDF format.

Availability: http://www.sph.umich.edu/csg/abecasis/Pedstats/download/

Contact: wiggie@umich.edu

Supplementary information: http://www.sph.umich.edu/csg/abecasis/Pedstats/

A crucial first step in the analysis of gene mapping data is the careful description of the available data, including, for example, genotyping completeness and heterozygosities for genetic markers, and distributions and familial correlations for quantitative traits. Although a number of programs now provide some facilities for data checking or summary (Mukhopadhyay et al., 2005; Lange et al., 1988; Elston et al., 2004; O'Connell et al., 1998) complete screening and summary of genetic data frequently involves the use of multiple programs and/or in-house tools. As the scale of the datasets available for analysis increases, this process can become particularly challenging. For example, with the advent of high-throughput single nucleotide polymorphism genotyping technologies, datasets will soon be available that includes genotypes for hundreds of thousands or millions of markers for each individual. In addition, with the focus on uncovering the genetic basis of complex disease, it is likely that collaborative projects will collect samples with hundreds or thousands of phenotypes each measured on thousands of individuals. We have developed PEDSTATS, a freely available utility, for summarizing salient features and performing basic quality checks on gene-mapping data. Our utility can conveniently handle these very large datasets and here we summarize its main features.

PEDSTATS runs on any platform where a modern C++ compiler is available, including those based on the Linux, UNIX, Windows and Mac OS X operating systems. It is a command-line utility that can produce both text output to the console and graphical output to a PDF file. Its major capabilities can be grouped into four areas: (1) checks of input formats and pedigree consistency, (2) checks and descriptions of genetic marker data, (3) checks and descriptions of quantitative traits and covariates and (4) descriptions of discrete traits. We describe each of these in turn below.

The first step in any analysis is the validation of input files. At this stage, common data-format errors such as missing or extraneous columns are reported. Next, the reported family structures are validated to ensure that all connecting individuals are present and that sex-codes are consistent for the various individuals. If desired, large pedigrees can be trimmed to remove uninformative individuals with no phenotype or genotype data, or separated into disconnected family units. A brief summary of the number of pedigrees, individuals and a distribution of individuals per family is produced. This information can be graphically summarized (Fig. 1A is an example summarizing the distribution of family sizes in one large dataset) and, optionally, includes counts for various types of relative pairs which can be further broken down by sex. Individuals with no phenotype or genotype information can be automatically removed and a new set of input files generated. PEDSTATS readily accepts files prepared for other packages we have developed, including those prepared for linkage analyses with Merlin (Abecasis et al., 2002), association analyses with QTDT (Abecasis et al., 2000) and relationship inference with GRR (Abecasis et al., 2001). Other popular formats, such as those used by the LINKAGE package (Lathrop et al., 1985) and by MENDEL and related tools (Lange et al., 1988) are also accommodated.

When verifying genetic marker data, PEDSTATS reports basic statistics like heterozygosity and genotyping completeness and can produce graphical summaries of allele and genotype frequencies. After automatic grouping of rare alleles, conformance of observed genotypes with Hardy–Weinberg equilibrium can be checked with a χ2 test for multi-allelic markers or an exact test for bi-allelic markers (Wigginton et al., 2005). Results of Hardy–Weinberg tests, including an exact distribution for the number of heterozygotes in the sample, can be presented graphically (e.g. Fig. 1B). Mendelian inheritance checks for both autosomal and X-linked marker data are also carried out using a genotype elimination algorithm that finds all inconsistencies in pedigrees without loops (Lange and Goradia, 1987; O'Connell and Weeks, 1999). Verifying Mendelian consistency prior to analysis of genetic marker data can be a crucial step (Lange and Goradia, 1987; O'Connell and Weeks, 1998), since most genetic analysis programs do not model genotyping error explicitly (for an exception, see Sobel et al., 2002).

For quantitative traits and covariates, PEDSTATS reports the range, mean and variance of the trait distribution along with the correlation between siblings. Several graphics, including histograms of the overall trait distribution and comparisons of distributions between males and females can be generated (as illustrated in Fig. 1, Panel C which summarizes the distribution of ‘Height’ in one large dataset). These can be helpful in detecting outliers as well as detecting deviations from approximate normality, which is important for many quantitative trait analyses (Allison et al., 1999). Optionally, correlations for other relative pair types can be calculated and plotted (as illustrated in Fig. 1, Panel D, which summarizes the correlations between ‘Weight’ for different relative pairs) and stratified by sex, if desired. Correlations between relatives can provide information about the overall impact of genes on a particular trait. In the example, it is clear that correlation of the variable ‘Weight’ for first degree relatives (in this case, parent–offspring and sibling pairs) is higher than for more distant relatives (half-sibling, avuncular, grand-parent grand-child and cousin pairs). When an age variable is present, we have implemented checks to ensure that values recorded for each individual are compatible with those of their ancestors, subject to user-specified minimum and maximum generation times.

Finally, for discrete traits, PEDSTATS reports the proportion of phenotyped individuals and provides a breakdown of affected individuals. A summary of affected, unaffected and discordant pairs can also be produced, and may help guide decisions on whether a dataset contains sufficient information for an affected relative pair analysis to be carried out (Risch, 1990; Whittemore and Halpern, 1994). As with the other analysis options, discrete trait reports can be segregatedby sex.

In addition to the ability to report statistics separately for different relative pairs and segregate results by sex, PEDSTATS can produce reports for individual families and allows various filters to be applied to input data prior to analysis. For example, all analyses can be restricted to affected individuals (for a specific trait) or to individuals with a minimal amount of genotype data.

We hope our tool will prove valuable to scientists hoping to discern important features of their data, and ease the burdensome task of verifying the consistency and integrity of input formats. Executables, source code and a web-based tutorial that explains input file format, implementation details and output for various tests are available from our website.

Examples of available graphical output. (A) provides information on the distribution of family sizes; (B) summarizes the observed genotype distribution and the exact distribution of heterozygotes conditional on observed allele counts; (C) provides information on the distribution of a quantitative trait; and (D) summarizes relative pair correlations. More detailed descriptions and examples are available on our website.

Fig. 1

Examples of available graphical output. (A) provides information on the distribution of family sizes; (B) summarizes the observed genotype distribution and the exact distribution of heterozygotes conditional on observed allele counts; (C) provides information on the distribution of a quantitative trait; and (D) summarizes relative pair correlations. More detailed descriptions and examples are available on our website.

This work was supported by research grants from the National Human Genome Research Institute and the National Eye Institute.

Conflict of Interest: none declared.

REFERENCES

Abecasis, G.R., et al.

2000

A general test of association for quantitative traits in nuclear families.

Am. J. Hum. Genet.,

66

279

–292

Abecasis, G.R., et al.

2001

GRR: graphical representation of relationship errors.

Bioinformatics

17

742

–743

Abecasis, G.R., et al.

2002

Merlin—rapid analysis of dense genetic maps using sparse gene flow trees.

Nat. Genet.

30

97

–101

Allison, D.B., et al.

1999

Testing the robustness of the likelihood-ratio test in a variance-component quantitative-trait loci-mapping procedure.

Am. J. Hum. Genet.

65

531

–544

Elston, R., Bailey-Wilson, J., Bonney, G., Tran, L., Keats, B., Wilson, A.

2004

SAGE Statistical Analysis for Genetic Epidemiology, Version 5.0

Lange, K. and Goradia, T.M.

1987

An algorithm for automatic genotype elimination.

Am. J. Hum. Genet.

40

250

–256

Lange, K., et al.

1988

Programs for pedigree analysis: MENDEL, FISHER, and dGENE.

Genet. Epidemiol.

5

471

–472

Lathrop, G.M., et al.

1985

Multilocus linkage analysis in humans: detection of linkage and estimation of recombination.

Am. J. Hum. Genet.

37

482

–498

Mukhopadhyay, N., et al.

2005

Mega2: data-handling for facilitating genetic linkage and association analyses.

Bioinformatics

21

2556

–2557

O'Connell, J.R. and Weeks, D.E.

1998

PedCheck: a program for identification of genotype incompatibilities in linkage analysis.

Am. J. Hum. Genet.

63

259

–266

O'Connell, J.R. and Weeks, D.E.

1999

An optimal algorithm for automatic genotype elimination.

Am. J. Hum. Genet.

65

1733

–1740

Risch, N.

1990

Linkage strategies for genetically complex traits. II. The power of affected relative pairs.

Am. J. Hum. Genet.

46

229

–241

Sobel, E., et al.

2002

Detection and integration of genotyping errors in statistical genetics.

Am. J. Hum. Genet.

70

496

–508

Whittemore, A.S. and Halpern, J.

1994

A class of tests for linkage using affected pedigree members.

Biometrics

50

118

–127

Wigginton, J.E., et al.

2005

A note on exact tests of Hardy–Weinberg equilibrium.

Am. J. Hum. Genet.

76

887

–893

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oupjournals.org

Supplementary data

Citations

Views

Altmetric

Metrics

Total Views 7,930

6,367 Pageviews

1,563 PDF Downloads

Since 11/1/2016

Month: Total Views:
November 2016 5
December 2016 9
January 2017 12
February 2017 18
March 2017 27
April 2017 21
May 2017 31
June 2017 10
July 2017 32
August 2017 30
September 2017 32
October 2017 28
November 2017 18
December 2017 97
January 2018 116
February 2018 95
March 2018 111
April 2018 99
May 2018 104
June 2018 59
July 2018 79
August 2018 94
September 2018 126
October 2018 92
November 2018 110
December 2018 103
January 2019 75
February 2019 108
March 2019 177
April 2019 202
May 2019 172
June 2019 148
July 2019 167
August 2019 165
September 2019 189
October 2019 135
November 2019 133
December 2019 95
January 2020 126
February 2020 157
March 2020 145
April 2020 157
May 2020 123
June 2020 155
July 2020 111
August 2020 139
September 2020 167
October 2020 244
November 2020 156
December 2020 84
January 2021 78
February 2021 99
March 2021 145
April 2021 166
May 2021 158
June 2021 135
July 2021 92
August 2021 109
September 2021 114
October 2021 114
November 2021 146
December 2021 97
January 2022 81
February 2022 55
March 2022 65
April 2022 50
May 2022 57
June 2022 48
July 2022 47
August 2022 50
September 2022 65
October 2022 53
November 2022 40
December 2022 50
January 2023 42
February 2023 27
March 2023 31
April 2023 62
May 2023 32
June 2023 32
July 2023 14
August 2023 39
September 2023 23
October 2023 38
November 2023 16
December 2023 34
January 2024 31
February 2024 42
March 2024 35
April 2024 39
May 2024 57
June 2024 36
July 2024 34
August 2024 25
September 2024 19
October 2024 20

Citations

328 Web of Science

×

Email alerts

Citing articles via

More from Oxford Academic