Pandera: A flexible and expressive pandas data validation library. · Issue #12 · pyOpenSci/software-submission
Submitting Author: Niels Bantilan (@cosmicBboy)
All current maintainers: (@cosmicBboy)
Package Name: pandera
One-Line Description of Package: validate the types, properties, and statistics of pandas data structures
Repository Link: https://github.com/unionai-oss/pandera
Version submitted: 0.1.5
Editor: @lwasser
Reviewer 1: @mbjoseph
Reviewer 2: @xmnlab
Archive: https://github.com/pandera-dev/pandera/releases/tag/v0.2.3
Version accepted: v0.2.3
Date Accepted: 10/10/2019
Description
pandas data structures can hide a lot of information, and explicitly validating them at runtime in production-critical or reproducible research settings is a good idea for building reliable data transformation pipelines. pandera enables users to:
- Check the types and properties of columns in a DataFrame or values in a Series.
- Perform descriptive and inferential statistical validation, e.g. two-sample t-tests.
- Seamlessly integrate with existing data analysis/processing pipelines via function decorators.

pandera provides a flexible and expressive API for performing data validation on tidy (long-form) and wide data to make data processing pipelines more readable and robust.
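As a minimal sketch of what this looks like in practice (using the current `import pandera as pa` entry point and invented column names; exact class names and dtype spellings may differ from the 0.1.5 API under review):

```python
import pandas as pd
import pandera as pa

# Hypothetical schema: declare the expected dtype and value checks per column.
schema = pa.DataFrameSchema({
    "height_cm": pa.Column(float, pa.Check(lambda s: s > 0)),
    "n_samples": pa.Column(int, pa.Check(lambda s: s <= 1000)),
    "group": pa.Column(str, pa.Check(lambda s: s.isin(["treatment", "control"]))),
})

df = pd.DataFrame({
    "height_cm": [170.2, 165.0, 180.5],
    "n_samples": [10, 42, 7],
    "group": ["treatment", "control", "treatment"],
})

# validate() returns the DataFrame unchanged if it conforms,
# and raises a SchemaError otherwise.
validated_df = schema.validate(df)
```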
Scope
- Please indicate which category or categories this package falls under:
- Data retrieval
- Data extraction
- Data munging
- Data deposition
- Reproducibility
- Geospatial
- Education
- Data visualization*
* Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see this section of our guidebook.
- Explain how and why the package falls under these categories (briefly, 1-2 sentences):
Data munging: the package makes ETL, data analysis, and data processing
pipelines more robust and reliable by providing users with tools to validate
assumptions about the schema and statistical properties of datasets.
This package supports validation on long (tidy) data and wide data.
Reproducibility: This package enables users to validate DataFrame or Series objects at runtime or as unit/integration tests, and can easily be integrated into existing pipelines using the check_input and check_output decorators. It also supports collaboration and reproducible research by programmatically enforcing assertions about the statistical properties of a dataset, in addition to making it easier to review pandas code in production-critical contexts.
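A sketch of the decorator-based integration described above, with hypothetical schemas and function names (the check_input/check_output decorators are part of pandera's documented API, though their signatures may have evolved since 0.1.5):

```python
import numpy as np
import pandas as pd
import pandera as pa
from pandera import check_input, check_output

in_schema = pa.DataFrameSchema({
    "value": pa.Column(float, pa.Check(lambda s: s >= 0)),
})
out_schema = pa.DataFrameSchema({
    "value": pa.Column(float),
    "log_value": pa.Column(float),
})

# Validate the first positional argument on the way in and the return
# value on the way out, so schema violations surface as errors at the
# pipeline boundary rather than as silent downstream bugs.
@check_input(in_schema)
@check_output(out_schema)
def add_log_value(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(log_value=lambda d: np.log1p(d["value"]))
```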
- Who is the target audience and what are scientific applications of this package?
The target audience of pandera consists of data scientists, data engineers, machine learning engineers, and machine learning scientists who use pandas in their data processing pipelines for various purposes, e.g., transforming data for reporting, analytics, model training, and data visualization. This tool is built on top of pandas and scipy to provide a user-friendly interface for explicitly specifying the set of properties that a DataFrame or Series must fulfill in order to be considered valid. Since pandera makes no assumptions about the domain of study or the contents of these pandas data structures, it could be used in a wide variety of quantitative fields that involve the analysis of tabular data.
- Are there other Python packages that accomplish the same thing? If so, how does yours differ?
There are a few alternatives to pandera in the Python ecosystem; here is how they compare:
- https://github.com/alecthomas/voluptuous
  - not specific to pandas, applies to JSON/YAML, etc.
  - very flexible and reasonably simple
  - no decorators, hypothesis tests, or sophisticated checks
- https://github.com/keleshev/schema
  - similar to voluptuous
  - validation of generic Python data structures
- https://github.com/TMiguelT/PandasSchema
  - has a wider range of 'built-in' validator types
  - limited type support (only has a conversion/coercion check)
  - no decorators
  - implementation has less flexibility than pandera's
  - has generic 'check'-like validators
- https://github.com/danielvdende/opulent-pandas
  - similar to voluptuous, and conceptually similar to pandera, but lacking functionality
- https://github.com/c-data/pandas-validator
  - not maintained, inflexible syntax
- https://github.com/xguse/table_enforcer
  - not maintained
  - the Enforcer and Column objects are very similar to pandera's, but it's a little difficult to follow
Key differentiators of pandera:
- column data types, nullability, and uniqueness are first-class concepts.
- check_input and check_output decorators enable seamless integration with existing code.
- Checks provide flexibility and performance by providing access to the pandas API by design.
- The Hypothesis class provides a tidy-first interface for statistical hypothesis testing.
- Checks and Hypothesis objects support both tidy and wide data validation.
- Comprehensive documentation on key functionality.
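To illustrate the tidy-first hypothesis-testing interface, here is a sketch adapted from the documented two-sample t-test helper (column names and data are invented, parameter names may differ across versions, and the check requires scipy to be installed):

```python
import pandas as pd
import pandera as pa
from pandera import Hypothesis

schema = pa.DataFrameSchema({
    "height_in_feet": pa.Column(
        float,
        [
            # Tidy (long-form) check: heights are grouped by the "sex"
            # column, asserting that the "M" group mean is greater than
            # the "F" group mean at the 5% significance level.
            Hypothesis.two_sample_ttest(
                sample1="M",
                sample2="F",
                groupby="sex",
                relationship="greater_than",
                alpha=0.05,
            ),
        ],
    ),
    "sex": pa.Column(str),
})

df = pd.DataFrame({
    "height_in_feet": [6.5, 7.0, 6.1, 5.1, 4.8],
    "sex": ["M", "M", "M", "F", "F"],
})

# Raises a SchemaError if the hypothesis test fails.
schema.validate(df)
```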
- If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted:
Technical checks
For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:
- does not violate the Terms of Service of any service it interacts with.
- has an OSI approved license
- contains a README with instructions for installing the development version.
- includes documentation with examples for all functions.
- contains a vignette with examples of its essential functions and uses.
- has a test suite.
- has continuous integration, such as Travis CI, AppVeyor, CircleCI, and/or others.
Publication options
- Do you wish to automatically submit to the Journal of Open Source Software? If so: JOSS Checks
- The package has an obvious research application according to JOSS's definition in their submission requirements. Be aware that completing the pyOpenSci review process does not guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
- The package is not a "minor utility" as defined by JOSS's submission requirements: "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
- The package contains a paper.md matching JOSS's requirements with a high-level description in the package root or in inst/.
- The package is deposited in a long-term repository with the DOI:
Note: Do not submit your package separately to JOSS
Are you OK with Reviewers Submitting Issues to your Repo Directly?
This option will allow reviewers to open smaller issues that can then be linked to PRs rather than submitting a more dense, text-based review. It will also allow you to demonstrate addressing the issue via PR links.
- Yes I am OK with reviewers submitting requested changes as issues to my repo. Reviewers will then link to the issues in their submitted review.
Code of conduct
- I agree to abide by pyOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.
P.S. Have feedback/comments about our review process? Leave a comment here
Editor and Review Templates
Editor and review templates can be found here
Previous Repo: https://github.com/cosmicBboy/pandera