Tips and tricks for pandas devs · Issue #3156 · pandas-dev/pandas

Having worked on pandas for a while now, there's a bunch of tools and tricks
I use. Here's a list to help pandas devs slip into the zone:

Use ipdb rather than pdb with nose: --ipdb --ipdb-fail

https://github.com/flavioamieiro/nose-ipdb

Because tab-completion is not optional.
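For example, a minimal invocation (assuming the plugin linked above is installed; the test file path is just an illustration):

# drop into ipdb, with tab completion, on test errors and failures
nosetests --ipdb --ipdb-fail pandas/tests/test_frame.py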

Re-running only failed tests

nosetests --with-id --failed will rerun only the tests which failed the last
time you ran nosetests --with-id. If you use test_fast.sh, passing the same
flags to it will do what you expect after you've had some tests fail.
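A sketch of the two-step flow (the pandas argument just narrows the run to the package):

nosetests --with-id pandas            # first run: records test ids and failures
nosetests --with-id --failed pandas   # later runs: only the tests that failed last time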

Better integration of github and git commandline flow

hub is a wrapper around git that adds GitHub
sugar. First and foremost:

hub checkout https://github.com/pydata/pandas/pull/1134

adds a remote, fetches it, creates a branch for it, and generally puts you right there.

Note: see comment below for a way to do this with pure git, if you don't
mind thousands of remote branches.
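That comment isn't reproduced here, but a pure-git sketch of the same idea (assuming an upstream remote pointing at the GitHub repo) looks roughly like this:

# fetch every PR head as a remote-tracking branch -- this is where the
# "thousands of remote branches" come from
git config --add remote.upstream.fetch '+refs/pull/*/head:refs/remotes/upstream/pr/*'
git fetch upstream
git checkout -b pr-1134 upstream/pr/1134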

GH issues from the command line

ghi
lets you open and manipulate GitHub issues from the command line.

I use it to open issues when I hit a bug and want to quickly
leave a reminder to fix it, without breaking my focus.
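A couple of illustrative commands (ghi is distributed as a Ruby gem; treat these as a sketch and check ghi's README for the authoritative subcommands):

gem install ghi   # assumption: installed as a Ruby gem
ghi list          # open issues for the repo in the current directory
ghi open          # compose a new issue in your $EDITOR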

Testing across python version locally

tox lets you run the test suite across all supported python versions using virtualenvs.
Everything is set up in the repo; just install and run.
detox parallelizes tox.
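Roughly (the environment names come from the repo's tox.ini, so py27 here is just an example):

pip install tox detox
tox           # run the whole test matrix defined in tox.ini
detox         # same, but with the environments run in parallel
tox -e py27   # or run a single environment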

Faster pandas builds/testing

Note: the build cache was baked into setup.py from roughly 0.9.1. As of 0.11.0
it's been factored out into scripts/use_build_cache.py, which rewrites setup.py
to use the build cache. The script has been tested as far back as 0.7.0.

Put the following in your .bashrc:

# Use the pandas build cache
export BUILD_CACHE_DIR="$HOME/tmp/.pandas_build_cache/"
if [ ! -e $BUILD_CACHE_DIR ]; then mkdir -p $BUILD_CACHE_DIR ; fi
echo $BUILD_CACHE_DIR > [pandas repo root dir]/.build_cache_dir

function cdev {
    # any recent commit should do
    git checkout c69e3aa scripts/use_build_cache.py vb_suite/test_perf.py
    scripts/use_build_cache.py $1   # rewire setup.py with build_cache
    if [ x"$VIRTUAL_ENV" == x"" ]; then
        _SUDO="sudo"
    fi
    sudo chown $USER -R .
    $_SUDO python ./setup.py clean
    $_SUDO python ./setup.py develop
    sudo chown $USER -R .
    echo "Restoring setup.py"
    git checkout setup.py   # restore setup.py
}

c69e3aa can be any recent commit; it needs to be bumped if there are updates
to the script.

The pandas build cache code caches cythonization, compilation and
2to3 artifacts for reuse in subsequent builds.
To compile, use "git reset --hard" to get to the commit you're after, then use cdev
to build pandas; setup.py will reuse what it can to speed this up.
Note that setup.py gets overwritten, but it is restored when the build completes.
With a warm cache, moving to a given commit takes just a few seconds rather than
the several minutes of a full compile.
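A typical cycle with a warm cache, using the cdev function defined above, might look something like this (the tag and test file are just illustrations):

git reset --hard v0.11.0                # jump to whatever commit you want to build
cdev                                    # rebuild in place, reusing cached artifacts
nosetests pandas/tests/test_series.py   # back to testing within seconds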

You may also run scripts/use_build_cache.py prior to launching tox to speed up testing.

Use ccache

The build cache just described caches things at a very coarse level: if there's
any change to the .pyx (Cython) files, all of them will be recythonized and rebuilt.
Using ccache (an apt-get + env var away on most distros these days) can speed
up the compilation part by caching the gcc compilation results. Yes, this overlaps
with the caching from the previous section, only it also caches compilation of the
cythonized C files.
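On a Debian/Ubuntu-style box the setup is roughly this (distutils picks up CC from the environment, so exporting it before building is enough):

sudo apt-get install ccache
export CC="ccache gcc"     # route compilation through ccache
python setup.py develop    # rebuilds of unchanged C files now hit the cache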

Benchmarking commits

test_perf.sh lets you compare the performance of one commit against
another, or benchmark the current HEAD.
It produces a table of results suitable for posting in a PR, and can serialize
the results dataframe into a pickle file for analysis in pandas.

It can print summary stats over multiple runs and all sorts of other things;
see test_perf.sh --help.
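As an illustration only (I'm quoting the -b/-t base/target flags from memory, so treat --help as authoritative):

./test_perf.sh -b master -t HEAD   # benchmark HEAD against master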

Easily generate dataframes of different kinds

mkdf lets you easily fabricate dataframes of varying dimensions
and arbitrary data:

from pandas.util.testing import makeCustomDataframe as mkdf

In [12]: mkdf(3,2)
Out[12]:
C0      C_l0_g0 C_l0_g1
R0
R_l0_g0    R0C0    R0C1
R_l0_g1    R1C0    R1C1
R_l0_g2    R2C0    R2C1

or even...

In [11]: mkdf(5,3,r_idx_nlevels=2,c_idx_nlevels=3,data_gen_f=lambda r,c: r*2+c)
Out[11]:
C0              C_l0_g0 C_l0_g1 C_l0_g2
C1              C_l1_g0 C_l1_g1 C_l1_g2
C2              C_l2_g0 C_l2_g1 C_l2_g2
R0      R1
R_l0_g0 R_l1_g0       0       1       2
R_l0_g1 R_l1_g1       2       3       4
R_l0_g2 R_l1_g2       4       5       6
R_l0_g3 R_l1_g3       6       7       8
R_l0_g4 R_l1_g4       8       9      10

or even

In [19]: mkdf(8,3,r_idx_nlevels=3,r_ndupe_l=[4,2])
Out[19]:
C0                      C_l0_g0 C_l0_g1 C_l0_g2
R0      R1      R2
R_l0_g0 R_l1_g0 R_l2_g0    R0C0    R0C1    R0C2
                R_l2_g1    R1C0    R1C1    R1C2
        R_l1_g1 R_l2_g2    R2C0    R2C1    R2C2
                R_l2_g3    R3C0    R3C1    R3C2
R_l0_g1 R_l1_g2 R_l2_g4    R4C0    R4C1    R4C2
                R_l2_g5    R5C0    R5C1    R5C2
        R_l1_g3 R_l2_g6    R6C0    R6C1    R6C2
                R_l2_g7    R7C0    R7C1    R7C2

ipython startup file

Your IPython installation has a ~/.ipython/profile_default/startup directory;
put your imports, monkey-patches and utility functions there and have them
always available.
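For example (the file name here is made up; any *.py file in that directory is run when IPython starts):

cat > ~/.ipython/profile_default/startup/00-pandas-dev.py <<'EOF'
import numpy as np
import pandas as pd
from pandas.util.testing import makeCustomDataframe as mkdf
EOF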

Spell checking github issues

Issues can quickly become a stream-of-consciousness thing once
you start doing a lot of them. If you'd like an easy way to get red squigglies
when your comment contains silly mistakes, you might consider installing
After the Deadline, available as an extension for Firefox and Chrome.

Handy git commands

There are too many git tricks to cover, but the following are both useful and less commonly known:

Generate a new hash for the current commit, without any other changes to repo state:

git commit --amend -C HEAD

Report the author of a given commit hash:

function gauthor {
         git show --format='%an <%ae>' $@ | head -n 1
}

and properly assign authorship of a commit:

git commit --author="$(gauthor foohash)"

where foohash is any previous commit authored by that contributor.
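The two tricks compose; for example, to re-attribute the commit you just made while keeping its message (illustrative):

# amend the last commit, reusing its message but crediting the original author of foohash
git commit --amend -C HEAD --author="$(gauthor foohash)"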

To locate the merge commit that introduced a commit into the branch:
https://github.com/jianli/git-get-merge
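Once it's installed on your PATH as git-get-merge, my understanding of its usage is simply:

git get-merge 0123abc   # 0123abc is a hypothetical sha; prints the merge that brought it in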