ENH/BUG/DOC: allow propagation and coexistence of numeric dtypes by jreback · Pull Request #2708 · pandas-dev/pandas
Support for numeric dtype propagation and coexistence in DataFrames. Prior to 0.10.2, numeric dtypes passed to DataFrames were always cast to `int64` or `float64`. Now, if a dtype is passed (either directly via the `dtype` keyword, a passed `ndarray`, or a passed `Series`), it will be preserved in DataFrame operations. Furthermore, different numeric dtypes will NOT be combined. The following example will give you a taste. This closes GH #622.
Other changes introduced in this PR (I moved all datetime-like issues to PR #2752, which should be merged first):
ENH:
- validated that `get_numeric_data` returns correct dtypes
- added a `blocks` attribute (and an `as_blocks()` method) that returns a dict of dtype -> homogeneously dtyped DataFrame, analogous to the `values` attribute (see the sketch after this list)
- added the keyword `raise_on_error` to `astype`, which can be set to `False` to exclude non-numeric columns (see the conversion examples below)
- changed `get_dtype_counts()` to use the `blocks` attribute
- changed `convert_objects()` to use the internals method `convert` (which operates on blocks)
- added the `convert_numeric` option to `convert_objects` to force numeric conversion (invalid values are set to `np.nan`); turned off by default
- added the `convert_dates='coerce'` option to `convert_objects` to force datetime-like conversion (invalid values are set to `NaT`); turned off by default; returns `datetime64[ns]` dtype
- groupby operations now respect dtype inputs wherever possible, even if intermediate casting is required (obviously if the inputs are ints and the results contain nans, this is cast); all cython functions are implemented
- auto-generation of most groupby functions by type is now in `generated_code.py`, e.g. (`group_add`, `group_mean`)
- added full `float32/int16/int8` support for all numeric operations, including (`diff`, `backfill`, `pad`, `take`)
- added `dtype` display on `Series` as a default
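As a rough sketch of the block-oriented introspection and groupby dtype preservation described above (the keys and resulting dtypes in the comments are illustrative assumptions, not guaranteed output):

# a minimal sketch, assuming the new blocks/as_blocks() API from this PR
import numpy as np
from pandas import DataFrame, Series

df = DataFrame(dict(A = Series(np.random.randn(4), dtype='float32'),
                    B = Series(np.random.randn(4)),
                    C = Series(np.zeros(4), dtype='uint8'),
                    k = ['x', 'x', 'y', 'y']))

df.get_dtype_counts()        # now built on blocks, e.g. float32: 1, float64: 1, uint8: 1, object: 1
df.as_blocks()               # dict of dtype -> homogeneously dtyped DataFrame
df.blocks                    # same mapping, exposed as an attribute

# groupby aims to respect the input dtype wherever possible
df.groupby('k')['A'].mean()  # would stay float32 per this PR (previously upcast to float64)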
BUG:
- fixed up tests for `from_records` to use record arrays directly. NOTE: using tuples will remove dtype info from the input stream (using a record array is ok though!)
- fixed merging to correctly merge on multiple dtypes with blocks (e.g. float64 and float32 in the other merger)
- #2623 (DataFrame.from_records incorrectly up-converts dtypes to object) can be fixed, but is dependent on #2752 (BUG: various bug fixes for DataFrame/Series construction)
- fixed #2778 (BUG: fillna with method segfaults on zero-length input, fixes #2775), a bug in pad/backfill with a 0-len frame
- fixed a very obscure bug in DataFrame.from_records with a dictionary and columns passed and hash randomization on!
- integer upcasts will now happen on `where` when using inplace ops (#2793: DataFrame inplace `where` doesn't work for mixed-datatype frames); see the sketch after this list
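A minimal sketch of the `where`/inplace upcast item above (the resulting dtype in the last comment is an assumption about the upcast target):

# inplace where on a mixed-dtype frame; masked-out slots must be able to hold NaN
import numpy as np
from pandas import DataFrame

df = DataFrame({'i': np.arange(4, dtype='int64'),
                'f': np.random.randn(4)})
df.where(df > 0, np.nan, inplace=True)  # replace non-positive values with NaN, in place
df.dtypes                               # 'i' is upcast (presumably to float64) so NaN fits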
TST:
- tests added for merging changes, astype, convert
- fixes for test_excel on 32-bit
- fixed test_resample_median_bug_1688
- separated out test_from_records_dictlike
- added tests for GH #797 (Panel constructor ignores dtype)
- added lots of tests for `where`
DOC:
- added a DataTypes section to the Data Structures intro
- whatsnew examples
It would be really helpful if some users could give this a test run before merging. I have put in test cases for numeric operations and for combining with DataFrame and Series, but I am sure there are some corner cases that were missed.
In [17]: df1 = DataFrame(randn(8, 1), columns = ['A'], dtype = 'float32')
In [18]: df1
Out[18]:
A
0 -0.007220
1 -0.236432
2 2.427172
3 -0.998639
4 -1.039410
5 0.336029
6 0.832988
7 -0.413241
In [19]: df1.dtypes
Out[19]:
A    float32
In [20]: df2 = DataFrame(dict( A = Series(randn(8),dtype='float16'),
   ....:                       B = Series(randn(8)),
   ....:                       C = Series(randn(8),dtype='uint8') ))
In [22]: df2
Out[22]:
A B C
0 1.150391 -1.033296 0
1 0.123047 1.915564 0
2 0.151367 -0.489826 0
3 -0.565430 -0.734238 0
4 -0.352295 -0.451430 0
5 -0.618164 0.673102 255
6 1.554688 0.322035 0
7 0.160767 0.420718 0
In [23]: df2.dtypes
Out[23]:
A float16
B float64
C uint8
In [24]: # here you get some upcasting
In [25]: df3 = df1.reindex_like(df2).fillna(value=0.0) + df2
In [26]: df3
Out[26]:
A B C
0 1.143170 -1.033296 0
1 -0.113385 1.915564 0
2 2.578539 -0.489826 0
3 -1.564069 -0.734238 0
4 -1.391705 -0.451430 0
5 -0.282135 0.673102 255
6 2.387676 0.322035 0
7 -0.252475 0.420718 0
In [27]: df3.dtypes
Out[27]:
A float32
B float64
C float64
The example from #622:
In [23]: a = np.array(np.random.randint(10, size=1e6),dtype='int32')
In [24]: b = np.array(np.random.randint(10, size=1e6),dtype='int64')
In [25]: df = pandas.DataFrame(dict(a = a, b = b))
In [26]: df.dtypes
Out[26]:
a int32
b int64
Conversion examples
# conversion of dtypes
In [81]: df3.astype('float32').dtypes
Out[81]:
A float32
B float32
C float32
# mixed type conversions
In [82]: df3['D'] = '1.'
In [83]: df3['E'] = '1'
In [84]: df3.convert_objects(convert_numeric=True).dtypes
Out[84]:
A float32
B float64
C float64
D float64
E int64
# same, but specific dtype conversion
In [85]: df3['D'] = df3['D'].astype('float16')
In [86]: df3['E'] = df3['E'].astype('int32')
In [87]: df3.dtypes
Out[87]:
A float32
B float64
C float64
D float16
E int32
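And a sketch of the new `astype(..., raise_on_error=False)` keyword from the ENH list, reusing the `df3` frame above (assuming non-convertible columns are simply left alone rather than raising):

# raise_on_error=False is described above as excluding non-numeric columns
df4 = df3.copy()
df4['F'] = 'foo'                                    # a column that cannot become float
df4.astype('float32', raise_on_error=False).dtypes  # numeric columns -> float32, 'F' left as-is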
# forcing date coercion
In [18]: s = Series([datetime(2001,1,1,0,0), 'foo', 1.0, 1,
....: Timestamp('20010104'), '20010105'],dtype='O')
....:
In [19]: s.convert_objects(convert_dates='coerce')
Out[19]:
0 2001-01-01 00:00:00
1 NaT
2 NaT
3 NaT
4 2001-01-04 00:00:00
5 2001-01-05 00:00:00
Dtype: datetime64[ns]