Pandera: A flexible and expressive pandas data validation library. · Issue #12 · pyOpenSci/software-submission

Submitting Author: Niels Bantilan (@cosmicBboy)
All current maintainers: (@cosmicBboy)
Package Name: pandera
One-Line Description of Package: validate the types, properties, and statistics of pandas data structures
Repository Link: https://github.com/unionai-oss/pandera
Version submitted: 0.1.5
Editor: @lwasser
Reviewer 1: @mbjoseph
Reviewer 2: @xmnlab
Archive: https://github.com/pandera-dev/pandera/releases/tag/v0.2.3
Version accepted: v0.2.3
Date Accepted: 10/10/2019


Description

pandas data structures can hide a lot of information, and explicitly
validating them at runtime in production-critical or reproducible research
settings is a good idea for building reliable data transformation pipelines.
pandera enables users to:

  1. Check the types and properties of columns in a DataFrame or values in
    a Series.
  2. Perform descriptive and inferential statistical validation, e.g. two-sample
    t-tests.
  3. Seamlessly integrate with existing data analysis/processing pipelines
    via function decorators.

pandera provides a flexible and expressive API for performing data validation
on tidy (long-form) and wide data to make data processing pipelines more
readable and robust.

Scope

* Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see this section of our guidebook.

Data munging: the package makes ETL, data analysis, and data processing
pipelines more robust and reliable by providing users with tools to validate
assumptions about the schema and statistical properties of datasets.
This package supports validation on long (tidy) data and wide data.

Reproducibility: This package enables users to validate DataFrame or Series
objects at runtime or as unit/integration tests, and can easily be integrated
into existing pipelines using the check_input and check_output decorators.
It also supports collaboration and reproducible research by programmatically
enforcing assertions made about the statistical properties of a dataset in
addition to making it easier to review pandas code in production-critical
contexts.
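
The decorator-based integration described above can be pictured without any library at all. The sketch below is not pandera's implementation or API: `validate_input`, `validate_output`, and the row-based data are hypothetical stand-ins, written in plain Python to show how a check can wrap an existing pipeline function without changing its body.

```python
from functools import wraps


def validate_input(check, message):
    """Hypothetical decorator: run `check` on the first argument before the call."""
    def decorator(func):
        @wraps(func)
        def wrapper(data, *args, **kwargs):
            if not check(data):
                raise ValueError(f"input to {func.__name__} failed: {message}")
            return func(data, *args, **kwargs)
        return wrapper
    return decorator


def validate_output(check, message):
    """Hypothetical decorator: run `check` on the return value after the call."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            if not check(result):
                raise ValueError(f"output of {func.__name__} failed: {message}")
            return result
        return wrapper
    return decorator


def amounts_non_negative(rows):
    return all(row["amount"] >= 0 for row in rows)


def has_total(rows):
    return all("total" in row for row in rows)


# The pipeline step itself stays free of validation logic.
@validate_input(amounts_non_negative, "amount must be >= 0")
@validate_output(has_total, "every row needs a total")
def add_total(rows, tax_rate=0.5):
    return [{**row, "total": row["amount"] * (1 + tax_rate)} for row in rows]


clean = add_total([{"amount": 10.0}], tax_rate=0.5)

try:
    add_total([{"amount": -1.0}])
    rejected = False
except ValueError:
    rejected = True
```

Because the checks live in the decorators, the assumptions about a step's input and output sit next to its definition, which is the property that makes such code easier to review.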

The target audience of pandera consists of data scientists, data engineers,
machine learning engineers, and machine learning scientists who use pandas in
their data processing pipelines for various purposes, e.g., transforming data
for reporting, analytics, model training, and data visualization. This tool is
built on top of pandas and scipy to provide a user-friendly interface for
explicitly specifying the set of properties that a DataFrame or Series must
fulfill in order to be considered valid. Since pandera makes no assumptions
about the domain of study or contents of these pandas data structures, it
could be used in a wide variety of quantitative fields that involve the
analysis of tabular data.

There are a few alternatives to pandera in the Python ecosystem, and here
is how they compare:

Key differentiators of pandera:

https://pyopensci.discourse.group/t/candidate-package-pandera-a-flexible-pandas-data-structure-validation-package/92

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

Publication options

Note: Do not submit your package separately to JOSS

Are you OK with Reviewers Submitting Issues to your Repo Directly?

This option will allow reviewers to open smaller issues that can then be linked to PR's rather than submitting a more dense text based review. It will also allow you to demonstrate addressing the issue via PR links.

Code of conduct

P.S. Have feedback/comments about our review process? Leave a comment here

Editor and Review Templates

Editor and review templates can be found here

Previous Repo: https://github.com/cosmicBboy/pandera