Lightning talk: PyPop–a software pipeline for large-scale multilocus population genomics

PyPop update--a software pipeline for large-scale multilocus population genomics

Tissue Antigens, 2007

Population genetic statistics from multilocus genotype data inform our understanding of the patterns of genetic variation and their implications for evolutionary studies, generally, and human disease studies in particular. In any given population one can estimate haplotype frequencies, identify deviation from Hardy–Weinberg equilibrium, test for balancing or directional selection, and investigate patterns of linkage disequilibrium. Existing software packages are oriented primarily toward the computation of such statistics on a population-by-population basis, not on comparisons among populations and across different statistics. We developed PyPop (Python for Population Genomics) to facilitate the analyses of population genetic statistics across populations and the relationships among different statistics within and across populations. PyPop is an open-source framework for performing large-scale population genetic analyses on multilocus genotype data. It computes the statistics described above, among others. PyPop deploys a standard Extensible Markup Language (XML) output format and can integrate the results of multiple analyses on various populations that were performed at different times into a common output format that can be read into a spreadsheet. The XML output format allows PyPop to be embedded as part of a larger analysis pipeline. Originally developed to analyze the highly polymorphic genetic data of the human leukocyte antigen region of the human genome, PyPop has applicability to any kind of multilocus genetic data. It is the primary analysis platform for analyzing data collected for the Anthropological component of the 13th and 14th International Histocompatibility Workshops. PyPop has also been successfully used in studies by our group, with collaborators, and in publications by several independent research teams.
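As a rough illustration of one statistic named above, deviation from Hardy–Weinberg equilibrium can be checked with a textbook chi-square goodness-of-fit test on the genotype counts at one locus. The sketch below is not taken from PyPop and does not claim to reproduce its method; the function name `hardy_weinberg_chi_square` and the input layout (a list of allele pairs, one per individual) are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations_with_replacement


def hardy_weinberg_chi_square(genotypes):
    """Return (chi_square, degrees_of_freedom) for a single locus.

    genotypes: non-empty list of 2-tuples of allele labels, one per
    individual (hypothetical input layout, chosen for illustration).
    """
    n = len(genotypes)
    # Observed genotype counts, ignoring allele order within a genotype.
    observed = Counter(tuple(sorted(g)) for g in genotypes)
    # Allele frequencies estimated from the 2n sampled gene copies.
    allele_counts = Counter(a for g in genotypes for a in g)
    freqs = {a: c / (2 * n) for a, c in allele_counts.items()}

    chi_square = 0.0
    for a, b in combinations_with_replacement(sorted(freqs), 2):
        # Expected count: n*p^2 for homozygotes, 2*n*p*q for heterozygotes.
        expected = n * freqs[a] * freqs[b] * (1 if a == b else 2)
        chi_square += (observed.get((a, b), 0) - expected) ** 2 / expected

    k = len(freqs)
    # k(k+1)/2 genotype classes, minus 1, minus (k-1) estimated frequencies.
    return chi_square, k * (k - 1) // 2


# Example: 25 AA, 50 AB, 25 BB matches the expected proportions exactly.
sample = [("A", "A")] * 25 + [("A", "B")] * 50 + [("B", "B")] * 25
statistic, df = hardy_weinberg_chi_square(sample)
```

For highly polymorphic loci such as those in the human leukocyte antigen region, many genotype classes have small expected counts, so the chi-square approximation is unreliable and exact or resampling-based tests are generally preferred.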

PyPop User Guide: User Guide for Python for Population Genomics

uvm.edu

PyPop (Python for Population Genomics) is an environment developed at UC Berkeley for doing large-scale population genetic analyses including:
• conformity to Hardy-Weinberg expectations
• tests for balancing or directional selection
• estimates of haplotype frequencies and measures and tests of significance for linkage disequilibrium (LD)

PyPop: A Software Framework for Population Genomics: Analyzing Large-Scale Multi-Locus Genotype Data

2003

Software to analyze multi-locus genotype data for entire populations is useful for estimating haplotype frequencies, deviation from Hardy-Weinberg equilibrium and patterns of linkage disequilibrium. These statistical results are important to both those interested in human genome variation and disease predisposition as well as evolutionary genetics. As part of the 13th International Histocompatibility and Immunogenetics Working Group (IHWG), we have developed a software framework (PyPop). The primary novelty of this package is that it allows integration of statistics across large numbers of data sets by heavily utilizing the XML file format and the R statistical package to view graphical output, while retaining the ability to inter-operate with existing software. Largely developed to address human population data, it can, however, be used for population-based data for any organism. We tested our software on the data from the 13th IHWG which involved data sets from at least 50 laboratories each of up to 1000 individuals with 9 MHC loci (both class I and class II) and found that it scales to large numbers of data sets well.

GEVALT: an integrated software tool for genotype analysis

BMC bioinformatics, 2007

Genotype information generated by individual and international efforts carries the promise of revolutionizing disease studies and the association of phenotypes with alleles and haplotypes. Given the enormous amounts of public genotype data, tools for analyzing, interpreting and visualizing these data sets are of critical importance to researchers. In past works we have developed algorithms for genotype phasing and tag SNP selection, which were shown to be quick and accurate. Both algorithms were available until now only as batch executables. Here we present GEVALT (GEnotype Visualization and ALgorithmic Tool), a software package designed to simplify and expedite the process of genotype analysis, by providing a common interface to several tasks relating to such analysis. GEVALT combines the strong visual abilities of Haploview with our quick and powerful algorithms for genotype phasing (GERBIL), tag SNP selection (STAMPA) and permutation testing for evaluating significance of assoc...

widgetcon: A website and program for quick conversion among common population genetic data formats

Molecular Ecology Resources, 2019

One of the most tedious steps in genetic data analyses is the reformatting of data generated with one program for use with other applications. This conversion is necessary because comprehensive evaluation of the data may be based on different algorithms included in diverse software, each requiring a distinct input format. A platform‐independent and freely available program or a web‐based tool dedicated to such reformatting can save time and effort in data processing. Here, we report widgetcon, a website and a program which has been developed to quickly and easily convert among various molecular data formats commonly used in phylogenetic analysis, population genetics, and other fields. The web‐based service is available at https://www.widgetcon.net. The program and the website convert the major data formats in four basic steps in less than a minute. The resource will be a useful tool for the research community and can be updated to include more formats and features in the future.

lociNGS: a lightweight alternative for assessing suitability of next-generation loci for evolutionary analysis

Genomic enrichment methods and next-generation sequencing produce uneven coverage for the portions of the genome (the loci) they target; this information is essential for ascertaining the suitability of each locus for further analysis. LOCINGS is a user-friendly accessory program that takes multi-FASTA formatted loci, next-generation sequence alignments and demographic data as input and collates, displays and outputs information about the data. Summary information includes the parameters coverage per locus, coverage per individual and number of polymorphic sites, among others. The program can output the raw sequences used to call loci from next-generation sequencing data. LOCINGS also reformats subsets of loci in three commonly used formats for multi-locus phylogeographic and population genetics analyses: NEXUS, IMa2 and Migrate. LOCINGS is available at https://github.com/SHird/lociNGS and is dependent on installation of MongoDB (freely available at http://www.mongodb.org/downloads). LOCINGS is written in Python and is supported on MacOSX and Unix; it is distributed under a GNU General Public License.

PoPoolation: A Toolbox for Population Genetic Analysis of Next Generation Sequencing Data from Pooled Individuals

PLoS ONE, 2011

Recent statistical analyses suggest that sequencing of pooled samples provides a cost effective approach to determine genome-wide population genetic parameters. Here we introduce PoPoolation, a toolbox specifically designed for the population genetic analysis of sequence data from pooled individuals. PoPoolation calculates estimates of θ_Watterson, θ_π, and Tajima's D that account for the bias introduced by pooling and sequencing errors, as well as divergence between species. Results of genome-wide analyses can be graphically displayed in a sliding window plot. PoPoolation is written in Perl and R and it builds on commonly used data formats. Its source code can be downloaded from http://code.google.com/p/popoolation/. Furthermore, we evaluate the influence of mapping algorithms, sequencing errors, and read coverage on the accuracy of population genetic parameter estimates from pooled data.
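For orientation, the classical (unpooled) versions of the three statistics named in this abstract can be sketched as follows. This is a minimal illustration using the standard formulas for Watterson's estimator, the mean number of pairwise differences, and Tajima's D on individually sequenced, aligned haplotypes. It does not implement the pooling or sequencing-error corrections that PoPoolation describes, and the function name `diversity_statistics` is an assumption made for this example.

```python
from itertools import combinations
from math import sqrt


def diversity_statistics(sequences):
    """Return (theta_w, theta_pi, tajimas_d) for aligned sequences.

    sequences: list of at least two equal-length strings, one per sampled
    haplotype. Values are per locus, not per site. tajimas_d is None when
    there are no segregating sites.
    """
    n = len(sequences)
    length = len(sequences[0])
    # Segregating sites: alignment columns with more than one distinct state.
    s = sum(1 for i in range(length) if len({seq[i] for seq in sequences}) > 1)
    # theta_pi: mean number of differences over all pairs of sequences.
    pairs = list(combinations(sequences, 2))
    theta_pi = sum(sum(x != y for x, y in zip(p, q)) for p, q in pairs) / len(pairs)
    # Watterson's estimator: S divided by the harmonic number a1.
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    theta_w = s / a1
    if s == 0:
        return theta_w, theta_pi, None
    # Variance constants from Tajima (1989).
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n ** 2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    tajimas_d = (theta_pi - theta_w) / sqrt(e1 * s + e2 * s * (s - 1))
    return theta_w, theta_pi, tajimas_d


# Toy alignment: four haplotypes, two segregating sites.
theta_w, theta_pi, d = diversity_statistics(["AA", "AT", "TA", "TT"])
```

A sliding-window analysis of the kind the abstract mentions would apply such a function to successive slices of a longer alignment.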

Tools for Evolutionary and Genetic Analysis (TEGA): A new platform for the management of molecular and environmental data

Genetics and Molecular Biology

Population genetics studies the distributions and changes in population allele frequencies in response to processes, such as mutation, natural selection, gene flow, and genetic drift. Researchers daily manage genetic, biological, and environmental data of the samples, storing them in text files or spreadsheets, which makes it difficult to maintain consistency and traceability. Here we present TEGA, a web-based stand-alone software developed for the easy analysis and management of population genetics data. It was designed to: 1) facilitate data management, 2) provide a way to execute the analysis procedures, and 3) supply a means to publish data, procedures, and results. TEGA is distributed under the GNU AGPL v3 license. The documentation, source code, and screenshots are available at https://github.com/darioelias/TEGA. In addition, we present Rabid Fish, the first implementation of TEGA in the Genetics Laboratory of the Faculty of Humanities and Sciences at the National University of the Litoral, where research focuses on population genetics studies applied to non-model organisms.

SNAP: Combine and Map modules for multilocus population genetic analysis

Bioinformatics/computer Applications in The Biosciences, 2006

We have added two software tools to our Suite of Nucleotide Analysis Programs (SNAP) for working with DNA sequences sampled from populations. SNAP Map collapses DNA sequence data into unique haplotypes, extracts variable sites and manipulates output into multiple formats for input into existing software packages for evolutionary analyses. Map includes novel features such as recoding insertions or deletions, including or excluding variable sites that violate an infinite-sites model and the option of collapsing sequences with corresponding phenotypic information, important in testing for significant haplotype-phenotype associations. SNAP Combine merges multiple DNA sequence alignments into a single multiple alignment file. The resulting file can be the union or intersection of the input files. SNAP Combine currently reads from and writes to several sequence alignment file formats including both sequential and interleaved formats. Combine also keeps track of the start and end positions of each separate alignment file allowing the user to exclude variable sites or taxa, important in creating input files for multilocus analyses. Availability: SNAP Combine and Map are freely available at http://snap.cifr.ncsu.edu/. These programs can be downloaded separately for Mac, Windows and Unix operating systems or bundled in SNAP Workbench. Each program includes online documentation and a sample dataset. Contact: ignazio_carbone@ncsu.edu Supplementary information: A description of system requirements and installation instructions can be found at http://snap.cifr.ncsu.edu
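The core operation attributed to SNAP Map, collapsing aligned sequences into unique haplotypes and extracting the variable sites, can be sketched in a few lines. This is an illustrative approximation only, not SNAP's implementation: it omits indel recoding, infinite-sites filtering, and phenotype-aware collapsing, and the function name `collapse_haplotypes` and the name-to-sequence dictionary input are assumptions made for this example.

```python
def collapse_haplotypes(sequences):
    """Collapse aligned sequences into unique haplotypes.

    sequences: non-empty dict mapping a sample name to an aligned sequence
    string, all of equal length (hypothetical input layout).
    Returns (haplotypes, variable_sites), where haplotypes maps each
    distinct sequence to the list of sample names carrying it, and
    variable_sites lists the 0-based alignment columns that differ among
    haplotypes.
    """
    haplotypes = {}
    for name, seq in sequences.items():
        # Identical sequences are grouped under the same haplotype key.
        haplotypes.setdefault(seq, []).append(name)

    length = len(next(iter(sequences.values())))
    # A column is variable if the haplotypes show more than one state there.
    variable_sites = [
        i for i in range(length) if len({hap[i] for hap in haplotypes}) > 1
    ]
    return haplotypes, variable_sites


# Example: three samples, two of which share a haplotype.
samples = {"s1": "ACGT", "s2": "ACGT", "s3": "ACTT"}
haplotypes, variable_sites = collapse_haplotypes(samples)
```

Keeping the sample names attached to each haplotype preserves the information needed for later haplotype-phenotype association tests.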

Allelematch: an R package for identifying unique multilocus genotypes where genotyping error and missing data may be present

Molecular Ecology Resources, 2012

We present ALLELEMATCH, an R package, to automate the identification of unique multilocus genotypes in data sets where the number of individuals is unknown, and where genotyping error and missing data may be present. Such conditions commonly occur in noninvasive sampling protocols. Output from the software enables a comparison of unique genotypes and their matches, and facilitates the review of differences between profiles. The software has a variety of applications in molecular ecology, and may be valuable where a large number of samples must be processed, unique genotypes identified, and repeated observations made over space and time. We used simulations to assess the performance of ALLELEMATCH and found that it can reliably and accurately determine the correct number of unique genotypes (±3%) across a broad range of data set properties. We found that the software performs with highest accuracy when genotyping error is below 4%. The R package is available from the Comprehensive R Archive Network (http://cran.r-project.org/). Supplementary documentation and tutorials are provided.