ENH: allow get_dummies to accept dtype argument by Scorpil · Pull Request #18330 · pandas-dev/pandas
- closes ENH: allow get_dummies to accept dtype argument #18330 (there's no issue for this one)
- tests added / passed
- passes git diff upstream/master -u -- "*.py" | flake8 --diff
- whatsnew entry
An update in version 0.19.0 made get_dummies return uint8 values instead of floats (#8725). While I agree with the argument that get_dummies should output integers by default (to save some memory), in many cases it would be beneficial for the user to be able to choose another dtype.
In my case there was a serious performance degradation between versions 0.18 and 0.19. After investigation, the reason turned out to be the change to the get_dummies output type. A DataFrame with dummy values was used as an argument to np.dot in an optimization function (the second argument was a matrix of floats). Since there were many iterations involved, and on each iteration np.dot converted all uint8 values to float64, the conversion overhead took an unreasonably long time. It is possible to work around this issue by converting the dummy columns "manually" afterwards, but that adds unnecessary complexity to the code and is clearly less convenient than calling get_dummies with dtype=float.
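To illustrate the difference, here is a minimal sketch comparing the manual workaround with the proposed dtype argument. The column name, data, and weights are made up for illustration and are not taken from the original optimization code; it assumes a pandas version in which get_dummies accepts dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the column name and values are illustrative only.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
weights = np.array([0.5, 1.5, 2.5])  # float64, one weight per dummy column

# Workaround: build the dummies, then cast them once, outside the hot loop.
dummies_cast = pd.get_dummies(df["color"]).astype(float)

# With the dtype argument the cast happens inside get_dummies.
dummies_float = pd.get_dummies(df["color"], dtype=float)

# Both operands are float64 here, so np.dot has nothing to convert.
result = np.dot(dummies_float.to_numpy(), weights)
```

Both frames hold the same float64 values; the second form simply avoids the extra conversion step in user code.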
Apart from performance considerations, I can imagine dtype=bool being a common use case.
get_dummies(data, dtype=None) is allowed and will return uint8 values, to match the DataFrame interface (where None means the dtype is inferred, which is the default behavior).
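As a quick check of this behavior, the sketch below compares dtype=None against a call with no dtype at all. It deliberately does not hard-code uint8, because the default output dtype depends on the installed pandas version (later releases changed it); the sample Series is hypothetical.

```python
import pandas as pd

s = pd.Series(["a", "b", "a"])  # illustrative data

default_dummies = pd.get_dummies(s)           # no dtype given
none_dummies = pd.get_dummies(s, dtype=None)  # explicit None

# Passing None should be indistinguishable from omitting the argument.
same_dtypes = list(default_dummies.dtypes) == list(none_dummies.dtypes)
```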
I've extended the test suite to run all the get_dummies tests (except for those that don't deal with internal dtypes, like test_just_na) twice, once with uint8 and once with float64.
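The idea of running one check per dtype can be sketched as follows. This is not the PR's actual test code: check_basic is a hypothetical helper, and a plain loop stands in for whatever parametrization mechanism the real test suite uses.

```python
import numpy as np
import pandas as pd


def check_basic(dtype):
    """Hypothetical check: dummies for 'a', 'b', 'c' form an identity matrix."""
    result = pd.get_dummies(pd.Series(list("abc")), dtype=dtype)
    assert list(result.columns) == ["a", "b", "c"]
    # Every dummy column should carry the requested dtype.
    assert all(dt == np.dtype(dtype) for dt in result.dtypes)
    assert np.array_equal(result.to_numpy(), np.eye(3, dtype=dtype))
    return result


# Run the same check twice, once per dtype, as described above.
results = {np.dtype(d).name: check_basic(d) for d in (np.uint8, np.float64)}
```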