Don't make dropping missing rows a default behavior for HDF append()? · Issue #9382 · pandas-dev/pandas
Hi All,
At the moment, the default behavior of the HDF append() function (docs: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.HDFStore.append.html?highlight=append#pandas.HDFStore.append ) is to silently drop every row that is all NaN except for its index.
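For concreteness, here is a minimal sketch of that behavior (the file name `store.h5` and the key `'data'` are illustrative assumptions; `dropna=True` is passed explicitly to reproduce what is currently the default):

```python
import numpy as np
import pandas as pd

# One row holds data; the other is all-NaN apart from its index label.
df = pd.DataFrame({'value': [1.0, np.nan]}, index=['user_a', 'user_b'])

with pd.HDFStore('store.h5') as store:
    # dropna=True mirrors the current default behavior described above.
    store.append('data', df, dropna=True)
    print(store.select('data'))
    # Only 'user_a' is returned; the 'user_b' row, including its
    # index label, is silently gone.
```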
As I understand it from a PyData exchange with Jeff, the reason is that people working with panels often have sparse datasets, so this is a very reasonable default.
However, while I appreciate the appeal for time-series analysis, I think this is a dangerous default. It rests on the assumption that if an index entry has a value but the columns do not, the row carries no meaningful data. That may hold in a time-series context, where dropped index values are easy to reconstruct, but when indexes contain information like user IDs, sensor codes, or place names, the index itself is meaningful and not easy to reconstruct. The default behavior can therefore delete user data without any warning.
Given the trade-off between a default that may lead to inefficient storage (dropna=False) and one that potentially erases user data (dropna=True), I think we should err on the side of data preservation.
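Until the default changes, callers can opt out explicitly; a minimal defensive sketch, using the same illustrative names as above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': [1.0, np.nan]}, index=['user_a', 'user_b'])

with pd.HDFStore('store.h5') as store:
    # Explicitly opt out of dropping all-NaN rows so every index
    # label is preserved on disk.
    store.append('data_safe', df, dropna=False)
    print(store.select('data_safe'))  # both rows come back
```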