Pandera: A flexible and expressive pandas data validation library. · Issue #12 · pyOpenSci/software-submission
Submitting Author: Niels Bantilan (@cosmicBboy)
All current maintainers: (@cosmicBboy)
Package Name: pandera
One-Line Description of Package: validate the types, properties, and statistics of pandas data structures
Repository Link: https://github.com/unionai-oss/pandera
Version submitted: 0.1.5
Editor: @lwasser
Reviewer 1: @mbjoseph
Reviewer 2: @xmnlab
Archive: https://github.com/pandera-dev/pandera/releases/tag/v0.2.3
Version accepted: v0.2.3
Date Accepted: 10/10/2019
Description
pandas data structures can hide a lot of information, and explicitly validating them at runtime in production-critical or reproducible research settings is a good idea for building reliable data transformation pipelines. pandera enables users to:
- Check the types and properties of columns in a DataFrame or values in a Series.
- Perform descriptive and inferential statistical validation, e.g. two-sample t-tests.
- Seamlessly integrate with existing data analysis/processing pipelines via function decorators.

pandera provides a flexible and expressive API for performing data validation on tidy (long-form) and wide data to make data processing pipelines more readable and robust.
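As a minimal sketch of what this looks like in practice (using the current `import pandera as pa` entry point and invented column names; exact class names and dtype spellings may differ from the 0.1.5 API under review):

```python
import pandas as pd
import pandera as pa

# Hypothetical schema: declare the expected dtype and value checks per column.
schema = pa.DataFrameSchema({
    "height_cm": pa.Column(float, pa.Check(lambda s: s > 0)),
    "n_samples": pa.Column(int, pa.Check(lambda s: s <= 1000)),
    "group": pa.Column(str, pa.Check(lambda s: s.isin(["treatment", "control"]))),
})

df = pd.DataFrame({
    "height_cm": [170.2, 165.0, 180.5],
    "n_samples": [10, 42, 7],
    "group": ["treatment", "control", "treatment"],
})

# validate() returns the DataFrame unchanged if it conforms,
# and raises a SchemaError otherwise.
validated_df = schema.validate(df)
```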
Scope
- Please indicate which category or categories this package falls under:
- Data retrieval
- Data extraction
- Data munging
- Data deposition
- Reproducibility
- Geospatial
- Education
- Data visualization*
* Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see this section of our guidebook.
- Explain how and why the package falls under these categories (briefly, 1-2 sentences):
Data munging: the package makes ETL, data analysis, and data processing
pipelines more robust and reliable by providing users with tools to validate
assumptions about the schema and statistical properties of datasets.
This package supports validation on long (tidy) data and wide data.
Reproducibility: This package enables users to validate DataFrame or Series objects at runtime or as unit/integration tests, and can easily be integrated into existing pipelines using the check_input and check_output decorators. It also supports collaboration and reproducible research by programmatically enforcing assertions about the statistical properties of a dataset, in addition to making it easier to review pandas code in production-critical contexts.
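A sketch of the decorator-based integration described above, with hypothetical schemas and function names (the check_input/check_output decorators are part of pandera's documented API, though their signatures may have evolved since 0.1.5):

```python
import numpy as np
import pandas as pd
import pandera as pa
from pandera import check_input, check_output

in_schema = pa.DataFrameSchema({
    "value": pa.Column(float, pa.Check(lambda s: s >= 0)),
})
out_schema = pa.DataFrameSchema({
    "value": pa.Column(float),
    "log_value": pa.Column(float),
})

# Validate the first positional argument on the way in and the return
# value on the way out, so schema violations surface as errors at the
# pipeline boundary rather than as silent downstream bugs.
@check_input(in_schema)
@check_output(out_schema)
def add_log_value(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(log_value=lambda d: np.log1p(d["value"]))
```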
- Who is the target audience and what are scientific applications of this package?
The target audience of pandera consists of data scientists, data engineers, machine learning engineers, and machine learning scientists who use pandas in their data processing pipelines for various purposes, e.g., transforming data for reporting, analytics, model training, and data visualization. This tool is built on top of pandas and scipy to provide a user-friendly interface for explicitly specifying the set of properties that a DataFrame or Series must fulfill in order to be considered valid. Since pandera makes no assumptions about the domain of study or the contents of these pandas data structures, it could be used in a wide variety of quantitative fields that involve the analysis of tabular data.
- Are there other Python packages that accomplish the same thing? If so, how does yours differ?
There are a few alternatives to pandera in the Python ecosystem; here is how they compare:
- https://github.com/alecthomas/voluptuous
  - not specific to pandas, applies to JSON/YAML, etc.
  - very flexible and reasonably simple
  - no decorators, hypothesis tests, or sophisticated checks
- https://github.com/keleshev/schema
  - similar to voluptuous
  - validation of generic Python data structures
- https://github.com/TMiguelT/PandasSchema
  - has a wider range of 'built-in' validator types
  - limited type support (only has a conversion/coercion check)
  - no decorators
  - implementation has less flexibility than pandera's
  - has generic 'check'-like validators
- https://github.com/danielvdende/opulent-pandas
  - similar to voluptuous, and conceptually similar to pandera, but lacking functionality
- https://github.com/c-data/pandas-validator
  - not maintained, inflexible syntax
- https://github.com/xguse/table_enforcer
  - not maintained
  - the Enforcer and Column objects are very similar to pandera's, but it's a little difficult to follow
Key differentiators of pandera:
- column data types, nullability, and uniqueness are first-class concepts.
- check_input and check_output decorators enable seamless integration with existing code.
- Checks provide flexibility and performance by providing access to the pandas API by design.
- The Hypothesis class provides a tidy-first interface for statistical hypothesis testing.
- Checks and Hypothesis objects support both tidy and wide data validation.
- Comprehensive documentation on key functionality.
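To illustrate the tidy-first hypothesis-testing interface, here is a sketch adapted from the documented two-sample t-test helper (column names and data are invented, parameter names may differ across versions, and the check requires scipy to be installed):

```python
import pandas as pd
import pandera as pa
from pandera import Hypothesis

schema = pa.DataFrameSchema({
    "height_in_feet": pa.Column(
        float,
        [
            # Tidy (long-form) check: heights are grouped by the "sex"
            # column, asserting that the "M" group mean is greater than
            # the "F" group mean at the 5% significance level.
            Hypothesis.two_sample_ttest(
                sample1="M",
                sample2="F",
                groupby="sex",
                relationship="greater_than",
                alpha=0.05,
            ),
        ],
    ),
    "sex": pa.Column(str),
})

df = pd.DataFrame({
    "height_in_feet": [6.5, 7.0, 6.1, 5.1, 4.8],
    "sex": ["M", "M", "M", "F", "F"],
})

# Raises a SchemaError if the hypothesis test fails.
schema.validate(df)
```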
- If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted:
Technical checks
For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:
- does not violate the Terms of Service of any service it interacts with.
- has an OSI approved license
- contains a README with instructions for installing the development version.
- includes documentation with examples for all functions.
- contains a vignette with examples of its essential functions and uses.
- has a test suite.
- has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.
Publication options
- Do you wish to automatically submit to the Journal of Open Source Software? If so: JOSS Checks
- The package has an obvious research application according to JOSS's definition in their submission requirements. Be aware that completing the pyOpenSci review process does not guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
- The package is not a "minor utility" as defined by JOSS's submission requirements: "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
- The package contains a paper.md matching JOSS's requirements with a high-level description in the package root or in inst/.
- The package is deposited in a long-term repository with the DOI:
Note: Do not submit your package separately to JOSS
Are you OK with Reviewers Submitting Issues to your Repo Directly?
This option will allow reviewers to open smaller issues that can then be linked to PRs rather than submitting a more dense, text-based review. It will also allow you to demonstrate addressing the issue via PR links.
- Yes I am OK with reviewers submitting requested changes as issues to my repo. Reviewers will then link to the issues in their submitted review.
Code of conduct
- I agree to abide by pyOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.
P.S. Have feedback/comments about our review process? Leave a comment here
Editor and Review Templates
Editor and review templates can be found here
Previous Repo: https://github.com/cosmicBboy/pandera