GitHub - Yu-Group/veridical-flow: Making it easier to build stable, trustworthy data-science pipelines based on the PCS framework.

A library for making stability analysis simple. Easily evaluate the effect of judgment calls on your data-science pipeline (e.g. choice of imputation strategy)!

MIT license · Python 3.9+ · JOSS · PyPI

Why use vflow?

Using vflow's simple wrappers facilitates many best practices for data science, as laid out in the predictability, computability, and stability (PCS) framework for veridical data science. The goal of vflow is to easily enable data science pipelines that follow PCS by providing intuitive low-code syntax, efficient and flexible computational backends via Ray, and well-documented, reproducible experimentation via MLflow.

| Computation | Reproducibility | Prediction | Stability |
| --- | --- | --- | --- |
| Automatic parallelization and caching throughout the pipeline | Automatic experiment tracking and saving | Filter the pipeline by training and validation performance | Replace a single function (e.g. preprocessing) with a set of functions and easily assess the stability of downstream results |
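The stability idea can be illustrated without vflow at all. The sketch below is plain Python, not the vflow API: the toy column, the `impute_with` helper, and the three imputers are hypothetical names chosen for illustration. It swaps one preprocessing step (imputation) for a set of alternatives and compares a downstream result across them.

```python
from statistics import mean, median

# Toy column with missing values (None); hypothetical data for illustration.
column = [2.0, None, 4.0, 100.0, None, 6.0]


def observed(values):
    """Return only the non-missing entries."""
    return [v for v in values if v is not None]


def impute_with(values, fill):
    """Replace every missing entry with a fixed fill value."""
    return [fill if v is None else v for v in values]


# A set of interchangeable judgment calls for the same pipeline step.
imputers = {
    "mean": lambda vals: impute_with(vals, mean(observed(vals))),
    "median": lambda vals: impute_with(vals, median(observed(vals))),
    "zero": lambda vals: impute_with(vals, 0.0),
}

# Downstream result: the column average after imputation.
downstream = {name: mean(fn(column)) for name, fn in imputers.items()}

# Spread of the downstream result across the judgment calls.
spread = max(downstream.values()) - min(downstream.values())

for name, value in downstream.items():
    print(f"{name:>6}: {value:.2f}")
print(f"spread: {spread:.2f}")
```

A large spread relative to the quantity of interest suggests the conclusion depends on the imputation choice and deserves a closer look.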

Here we show a simple example of an entire data-science pipeline with several perturbations (e.g. different data subsamples, models, and metrics) written simply using vflow.

    import sklearn
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, balanced_accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    from vflow import Vset, init_args

    # initialize data
    X, y = make_classification()
    X_train, X_test, y_train, y_test = init_args(
        train_test_split(X, y),
        names=["X_train", "X_test", "y_train", "y_test"],  # optionally name the args
    )

    # subsample data
    subsampling_funcs = [sklearn.utils.resample for _ in range(3)]
    subsampling_set = Vset(
        name="subsampling", vfuncs=subsampling_funcs, output_matching=True
    )
    X_trains, y_trains = subsampling_set(X_train, y_train)

    # fit models
    models = [LogisticRegression(), DecisionTreeClassifier()]
    modeling_set = Vset(name="modeling", vfuncs=models, vfunc_keys=["LR", "DT"])
    modeling_set.fit(X_trains, y_trains)
    preds_test = modeling_set.predict(X_test)

    # get metrics
    binary_metrics_set = Vset(
        name="binary_metrics",
        vfuncs=[accuracy_score, balanced_accuracy_score],
        vfunc_keys=["Acc", "Bal_Acc"],
    )
    binary_metrics = binary_metrics_set.evaluate(preds_test, y_test)

Once we've written this pipeline, we can easily measure the stability of metrics (e.g. "Accuracy") to our choice of subsampling or model.
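To make the notion of "stability to a choice" concrete, here is a minimal standard-library sketch of the same kind of perturbation grid (subsamples × models). It does not use vflow or scikit-learn; the data generator, the two toy models, and all function names are assumptions made for illustration. It records test accuracy for every combination and then summarizes how much accuracy varies across subsamples for each model.

```python
import random
from statistics import mean, pstdev


def make_data(n, seed):
    """Noisy 1-D data: label is 1 when x > 0.5, flipped 10% of the time."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    ys = [int(x > 0.5) ^ int(rng.random() < 0.1) for x in xs]
    return xs, ys


def bootstrap(xs, ys, seed):
    """Resample (x, y) pairs with replacement."""
    rng = random.Random(seed)
    idx = [rng.randrange(len(xs)) for _ in xs]
    return [xs[i] for i in idx], [ys[i] for i in idx]


def fit_threshold(xs, ys):
    """Pick the cut point with the best training accuracy."""
    def train_hits(t):
        return sum((x >= t) == bool(y) for x, y in zip(xs, ys))

    best = max(sorted(set(xs)), key=train_hits)
    return lambda x: int(x >= best)


def fit_majority(xs, ys):
    """Always predict the most common training label."""
    label = int(sum(ys) * 2 >= len(ys))
    return lambda x: label


def accuracy(model, xs, ys):
    """Fraction of points where the model's prediction matches the label."""
    return sum(model(x) == y for x, y in zip(xs, ys)) / len(ys)


X_train, y_train = make_data(200, seed=0)
X_test, y_test = make_data(200, seed=1)

models = {"threshold": fit_threshold, "majority": fit_majority}
results = {}  # (subsample index, model name) -> test accuracy
for s in range(3):
    X_sub, y_sub = bootstrap(X_train, y_train, seed=s)
    for name, fit in models.items():
        results[(s, name)] = accuracy(fit(X_sub, y_sub), X_test, y_test)

# Stability of accuracy to the subsampling choice, reported per model.
summary = {}
for name in models:
    accs = [acc for (_, m), acc in results.items() if m == name]
    summary[name] = (mean(accs), pstdev(accs))
    print(f"{name}: mean={summary[name][0]:.3f} std={summary[name][1]:.3f}")
```

Grouping by subsample index instead of model name would give the complementary view: how much accuracy moves when the model changes and the subsample is held fixed.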

Documentation

See the docs for the API reference.

Notebook examples

Note that some of these require more dependencies than just those required for vflow. To install them all, run pip install vflow[nb].

Synthetic classification

Enhancer genomics

fMRI voxel prediction

Fashion mnist classification

Feature importance stability

Clinical decision rule vetting

Installation

Stable version

pip install vflow

Development version (unstable)

pip install vflow@git+https://github.com/Yu-Group/veridical-flow
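After installing, a quick way to confirm which version ended up in the environment is to query the package metadata. This is a small sketch using only the Python standard library; the `installed_version` helper is a hypothetical name, not part of vflow.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version("vflow")
print(f"vflow {found}" if found else "vflow is not installed")
```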

References

    @software{duncan2020vflow,
      author = {Duncan, James and Kapoor, Rush and Agarwal, Abhineet and Singh, Chandan and Yu, Bin},
      doi = {10.21105/joss.03895},
      month = {1},
      title = {{VeridicalFlow: a Python package for building trustworthy data science pipelines with PCS}},
      url = {https://doi.org/10.21105/joss.03895},
      year = {2022}
    }