PyPop update--a software pipeline for large-scale multilocus population genomics
Related papers
PyPop: A Software Framework for Population Genomics: Analyzing Large-Scale Multi-Locus Genotype Data
2003
Software to analyze multi-locus genotype data for entire populations is useful for estimating haplotype frequencies, deviation from Hardy-Weinberg equilibrium and patterns of linkage disequilibrium. These statistical results are important to both those interested in human genome variation and disease predisposition as well as evolutionary genetics. As part of the 13th International Histocompatibility and Immunogenetics Working Group (IHWG), we have developed a software framework (PyPop). The primary novelty of this package is that it allows integration of statistics across large numbers of data sets by heavily utilizing the XML file format and the R statistical package to view graphical output, while retaining the ability to inter-operate with existing software. Largely developed to address human population data, it can, however, be used for population based data for any organism. We tested our software on the data from the 13th IHWG which involved data sets from at least 50 laboratories each of up to 1000 individuals with 9 MHC loci (both class I and class II) and found that it scales to large numbers of data sets well.
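As a rough illustration of the linkage disequilibrium measures mentioned in this abstract, the sketch below computes D, D' and r² for a pair of biallelic loci from a haplotype frequency and two allele frequencies. It is a textbook simplification, not PyPop's implementation: the function name `ld_stats` and its arguments are hypothetical, and the highly polymorphic MHC loci described above would need a multi-allelic extension.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, D', r^2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a, p_b: frequencies of allele A (locus 1) and allele B (locus 2)
    """
    d = p_ab - p_a * p_b  # raw disequilibrium coefficient
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    # D' rescales D by its maximum attainable magnitude given the allele frequencies
    d_prime = d / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


# Complete association: allele A always travels with allele B
d_full, dprime_full, r2_full = ld_stats(0.5, 0.5, 0.5)
# Independence: haplotype frequency equals the product of allele frequencies
d_none, dprime_none, r2_none = ld_stats(0.25, 0.5, 0.5)
```

D' and r² answer slightly different questions: D' reaches 1 whenever at least one of the four haplotypes is absent, while r² reaches 1 only when the two loci are perfectly correlated.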
SNAP: Combine and Map modules for multilocus population genetic analysis
Bioinformatics/computer Applications in The Biosciences, 2006
We have added two software tools to our Suite of Nucleotide Analysis Programs (SNAP) for working with DNA sequences sampled from populations. SNAP Map collapses DNA sequence data into unique haplotypes, extracts variable sites and manipulates output into multiple formats for input into existing software packages for evolutionary analyses. Map includes novel features such as recoding insertions or deletions, including or excluding variable sites that violate an infinite-sites model and the option of collapsing sequences with corresponding phenotypic information, important in testing for significant haplotype-phenotype associations. SNAP Combine merges multiple DNA sequence alignments into a single multiple alignment file. The resulting file can be the union or intersection of the input files. SNAP Combine currently reads from and writes to several sequence alignment file formats including both sequential and interleaved formats. Combine also keeps track of the start and end positions of each separate alignment file allowing the user to exclude variable sites or taxa, important in creating input files for multilocus analyses. Availability: SNAP Combine and Map are freely available at http://snap.cifr.ncsu.edu/. These programs can be downloaded separately for Mac, Windows and Unix operating systems or bundled in SNAP Workbench. Each program includes online documentation and a sample dataset. Contact: ignazio_carbone@ncsu.edu Supplementary information: A description of system requirements and installation instructions can be found at http://snap.cifr.ncsu.edu
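The two core operations this abstract attributes to SNAP Map, collapsing sequences into unique haplotypes and extracting variable sites, can be sketched in a few lines. This is an illustrative sketch only and not SNAP's code: the function names and the toy alignment are hypothetical, sequences are assumed to be already aligned and of equal length, and indel recoding and infinite-sites filtering are not attempted.

```python
def collapse_haplotypes(seqs):
    """Group identical aligned sequences.

    seqs: dict mapping sample name -> aligned sequence string
    Returns a dict mapping each unique sequence -> list of sample names.
    """
    groups = {}
    for name, seq in seqs.items():
        groups.setdefault(seq, []).append(name)
    return groups


def variable_sites(seqs):
    """Return 0-based alignment columns where more than one state is observed."""
    columns = zip(*seqs.values())
    return [i for i, col in enumerate(columns) if len(set(col)) > 1]


# Hypothetical toy alignment of four samples
samples = {"s1": "ACGTA", "s2": "ACGTA", "s3": "ACCTA", "s4": "TCGTA"}
haps = collapse_haplotypes(samples)
sites = variable_sites(samples)
```

Here `s1` and `s2` collapse into one haplotype, and only columns 0 and 2 vary across the alignment.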
Molecular Ecology Resources, 2012
We present ALLELEMATCH, an R package, to automate the identification of unique multilocus genotypes in data sets where the number of individuals is unknown, and where genotyping error and missing data may be present. Such conditions commonly occur in noninvasive sampling protocols. Output from the software enables a comparison of unique genotypes and their matches, and facilitates the review of differences between profiles. The software has a variety of applications in molecular ecology, and may be valuable where a large number of samples must be processed, unique genotypes identified, and repeated observations made over space and time. We used simulations to assess the performance of ALLELEMATCH and found that it can reliably and accurately determine the correct number of unique genotypes (±3%) across a broad range of data set properties. We found that the software performs with highest accuracy when genotyping error is below 4%. The R package is available from the Comprehensive R Archive Network (http://cran.r-project.org/). Supplementary documentation and tutorials are provided.
Population genetics of immune-related multilocus copy number variation in Native Americans
Journal of the Royal Society, Interface, 2017
While multiallelic copy number variation (mCNV) loci are a major component of genomic variation, quantifying the individual copy number of a locus and defining genotypes is challenging. Few methods exist to study how mCNV genetic diversity is apportioned within and between populations (i.e. to define the population genetic structure of mCNV). These inferences are critical in populations with a small effective size, such as Amerindians, that may not fit the Hardy-Weinberg model due to inbreeding, assortative mating, population subdivision, natural selection or a combination of these evolutionary factors. We propose a likelihood-based method that simultaneously infers mCNV allele frequencies and the population structure parameter f, which quantifies the departure of homozygosity from the Hardy-Weinberg expectation. This method is implemented in the freely available software CNVice, which also infers individual genotypes using information from both the population and from trios, if ava...
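To make the role of the parameter f concrete, the sketch below uses the simplest possible setting, a single biallelic locus with directly observed genotypes, and estimates f as the proportional deficit of heterozygotes relative to the Hardy-Weinberg expectation. This is not the likelihood method of CNVice, which handles multiallelic copy number data with unobserved genotypes; the function names are hypothetical.

```python
def genotype_freqs(p, f):
    """Expected genotype frequencies (AA, Aa, aa) for allele frequency p
    and structure parameter f; f = 0 recovers Hardy-Weinberg proportions."""
    q = 1.0 - p
    return (p * p + f * p * q, 2 * p * q * (1 - f), q * q + f * p * q)


def estimate_f(n_aa, n_ab, n_bb):
    """Moment estimate of f from genotype counts: 1 - H_obs / H_exp."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    h_exp = 2 * p * (1 - p)   # heterozygosity expected under Hardy-Weinberg
    h_obs = n_ab / n          # heterozygosity actually observed
    return 1.0 - h_obs / h_exp if h_exp > 0 else 0.0


f_hat = estimate_f(30, 40, 30)       # fewer heterozygotes than expected
expected = genotype_freqs(0.5, f_hat)
```

A positive estimate indicates excess homozygosity, as expected under inbreeding or population subdivision; a negative one indicates excess heterozygosity.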
PyPop User Guide: User Guide for Python for Population Genomics
uvm.edu
PyPop (Python for Population Genomics) is an environment developed at UC Berkeley for doing large-scale population genetic analyses including:
• conformity to Hardy-Weinberg expectations
• tests for balancing or directional selection
• estimates of haplotype frequencies and measures and tests of significance for linkage disequilibrium (LD)
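A minimal sketch of the first analysis listed, a test of conformity to Hardy-Weinberg expectations, is given below for one biallelic locus using a chi-square goodness-of-fit test with one degree of freedom. It does not use or reproduce PyPop's API; the function name `hwe_chi_square` is hypothetical, and real multi-allelic data with small genotype counts would usually call for an exact test instead.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test of Hardy-Weinberg proportions for one biallelic locus.

    Returns (chi2, p_value) with 1 degree of freedom.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    # Skip classes with zero expectation (one allele absent) to avoid dividing by zero
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of the chi-square distribution with 1 df
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


chi2_a, p_a = hwe_chi_square(298, 489, 213)  # close to HW proportions
chi2_b, p_b = hwe_chi_square(25, 50, 25)     # exactly HW proportions
```

The single degree of freedom comes from three genotype classes minus one for the fixed total and one for the allele frequency estimated from the data.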
Molecular Ecology Resources, 2010
We present here a new version of the Arlequin program available under three different forms: a Windows graphical version (WINARL35), a console version of Arlequin (ARLECORE), and a specific console version to compute summary statistics (ARLSUMSTAT). The command-line versions run under both Linux and Windows. The main innovations of the new version include enhanced outputs in XML format, the possibility to embed graphics displaying computation results directly into output files, and the implementation of a new method to detect loci under selection from genome scans. Command-line versions are designed to handle large series of files, and ARLSUMSTAT can be used to generate summary statistics from simulated data sets within an Approximate Bayesian Computation framework.
widgetcon: A website and program for quick conversion among common population genetic data formats
Molecular Ecology Resources, 2019
One of the most tedious steps in genetic data analyses is reformatting data generated with one program for use with other applications. This conversion is necessary because comprehensive evaluation of the data may be based on different algorithms included in diverse software, each requiring a distinct input format. A platform‐independent and freely available program or a web‐based tool dedicated to such reformatting can save time and effort in data processing. Here, we report widgetcon, a website and a program that have been developed to quickly and easily convert among various molecular data formats commonly used in phylogenetic analysis, population genetics, and other fields. The web‐based service is available at https://www.widgetcon.net. The program and the website convert the major data formats in four basic steps in less than a minute. The resource will be a useful tool for the research community and can be updated to include more formats and features in the future.
Human Mutation, 2001
With the discovery of single nucleotide polymorphisms (SNP) along the genome, genotyping of large samples of biallelic multilocus genetic phenotypes for (fine) mapping of disease genes or for population studies has become standard practice. A genetic trait, however, is mainly caused by an underlying defective haplotype, and populations are best characterized by their haplotype frequencies. Therefore, it is essential to infer from the phase-unknown genetic phenotypes in a sample drawn from a population the haplotype frequencies in the population and the underlying haplotype pairs in the sample in order to find disease predisposing genes by some association or haplotype sharing algorithm. Haplotype frequencies and haplotype pairs are estimated via a maximum likelihood approach by a well-known expectation maximization (EM) algorithm, adapting it to a large number (up to 30) of biallelic loci (SNP), and including nuclear family information, if available, into the analysis. Parents are treated as an independent sample from the population. Their genotyped offspring reduces the number of potential haplotype pairs for both parents, resulting in a higher accuracy of the estimation, and may also reduce computation time. In a series of simulations our approach of including nuclear family information has been tested against both the EM algorithm without nuclear family information and an alternative approach using GENEHUNTER for the haplotyping of the families, using the locus-by-locus allele counts of the sample. Our new approach is more precise in haplotyping in cases of a high number of heterozygous loci, whereas for a moderate number of heterozygous positions in the sample all three different approaches gave the same perfect results. Hum Mutat 17:289-295, 2001.
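The EM idea described in this abstract can be illustrated on the smallest case where phase is ambiguous: two biallelic loci, where only double heterozygotes have more than one possible haplotype pair. The sketch below is the standard textbook version under that assumption, not the authors' program: it does not scale to 30 loci and uses no nuclear family information, and the function name and genotype coding (copies of allele "1" at each locus) are hypothetical.

```python
def em_haplotype_freqs(genotypes, n_iter=100, tol=1e-10):
    """Estimate two-locus haplotype frequencies from unphased genotypes by EM.

    genotypes: list of (a, b), the number of copies (0, 1 or 2) of allele '1'
               at locus A and locus B for each individual.
    Returns a dict with frequencies of haplotypes '00', '01', '10', '11'.
    """
    haps = ["00", "01", "10", "11"]
    freqs = {h: 0.25 for h in haps}
    n_chrom = 2 * len(genotypes)
    for _ in range(n_iter):
        counts = {h: 0.0 for h in haps}
        for a, b in genotypes:
            if a == 1 and b == 1:
                # E-step: double heterozygote is either 11/00 (cis) or 10/01 (trans)
                cis = freqs["11"] * freqs["00"]
                trans = freqs["10"] * freqs["01"]
                total = cis + trans
                w = cis / total if total > 0 else 0.5
                counts["11"] += w
                counts["00"] += w
                counts["10"] += 1 - w
                counts["01"] += 1 - w
            else:
                # At least one locus is homozygous, so the phase is known
                hap1 = str(int(a >= 1)) + str(int(b >= 1))
                hap2 = str(int(a == 2)) + str(int(b == 2))
                counts[hap1] += 1
                counts[hap2] += 1
        # M-step: new frequencies are the expected haplotype counts per chromosome
        new_freqs = {h: counts[h] / n_chrom for h in haps}
        converged = max(abs(new_freqs[h] - freqs[h]) for h in haps) < tol
        freqs = new_freqs
        if converged:
            break
    return freqs


# Hypothetical sample: mostly 11/11 and 00/00 individuals plus two double heterozygotes
sample = [(2, 2)] * 4 + [(0, 0)] * 4 + [(1, 1)] * 2
est = em_haplotype_freqs(sample)
```

Because the unambiguous individuals carry only `11` and `00` haplotypes, the algorithm resolves the double heterozygotes almost entirely as `11/00`.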
2010
Currently, there is a demand in many fields of research for software with an easily accessible interface to analyze polymorphism data such as microsatellite DNA and single nucleotide polymorphisms. In this article, we announce POPTREE2, a computer program package that can perform evolutionary analyses of allele frequency data. The original version (POPTREE) was a command-line program that runs on the Command Prompt of Windows and Unix. In POPTREE2, genetic distances (measures of the extent of genetic differentiation between populations) for constructing phylogenetic trees, average heterozygosities (H) (a measure of genetic variation within populations) and G_ST (a measure of genetic differentiation of subdivided populations) are computed through a simple and intuitive Windows interface. It will facilitate statistical analyses of polymorphism data for researchers in many different fields. POPTREE2 is available at
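For readers unfamiliar with the two summary statistics named in this abstract, the sketch below computes expected heterozygosity and Nei's G_ST = (H_T - H_S) / H_T for one locus from per-population allele frequencies. It assumes equal population weights and applies no sample-size correction, so it may differ from the estimators POPTREE2 actually reports; the function names are hypothetical.

```python
def expected_heterozygosity(freqs):
    """H = 1 - sum of squared allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)


def gst(pop_freqs):
    """Nei's G_ST for one locus.

    pop_freqs: list of allele-frequency lists, one per population,
               with alleles in the same order in every list.
    """
    n_pops = len(pop_freqs)
    # H_S: average within-population heterozygosity
    h_s = sum(expected_heterozygosity(f) for f in pop_freqs) / n_pops
    # H_T: heterozygosity of the pooled population (mean allele frequencies)
    mean_freqs = [sum(col) / n_pops for col in zip(*pop_freqs)]
    h_t = expected_heterozygosity(mean_freqs)
    return (h_t - h_s) / h_t if h_t > 0 else 0.0


gst_fixed = gst([[1.0, 0.0], [0.0, 1.0]])  # populations fixed for different alleles
gst_same = gst([[0.5, 0.5], [0.5, 0.5]])   # identical populations
```

The two toy inputs mark the extremes: complete differentiation gives a value of 1, and identical populations give 0.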
genalex 6: genetic analysis in Excel. Population genetic software for teaching and research
Molecular Ecology Notes, 2006
genalex is a user-friendly cross-platform package that runs within Microsoft Excel, enabling population genetic analyses of codominant, haploid and binary data. Allele frequency-based analyses include heterozygosity, F statistics, Nei's genetic distance, population assignment, probabilities of identity and pairwise relatedness. Distance-based calculations include amova, principal coordinates analysis (PCA), Mantel tests, multivariate and 2D spatial autocorrelation and twogener. More than 20 different graphs summarize data and aid exploration. Sequence and genotype data can be imported from automated sequencers, and exported to other software. Initially designed as a tool for teaching, genalex 6 now offers features for researchers as well. Documentation and the program are available at http://www.anu.edu.au/BoZo/GenAlEx/