ENH/BUG/DOC: allow propagation and coexistence of numeric dtypes by jreback · Pull Request #2708 · pandas-dev/pandas

Support for numeric dtype propagation and coexistence in DataFrames. Prior to 0.10.2, numeric dtypes passed to DataFrames were always cast to int64 or float64. Now, if a dtype is passed (either directly via the dtype keyword, a passed ndarray, or a passed Series), it will be preserved in DataFrame operations. Furthermore, different numeric dtypes will NOT be combined. The IPython session further down will give you a taste. This closes GH #622.
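As a quick illustration of the three routes named above (dtype keyword, ndarray, Series), here is a minimal sketch. It assumes a current pandas and NumPy rather than the 0.10.x branch this PR targets, and the variable names (`df_kw`, `df_arr`, `df_ser`, `df_mixed`) are made up for the example.

```python
import numpy as np
import pandas as pd

# Route 1: the dtype keyword on the constructor.
df_kw = pd.DataFrame(np.random.randn(4, 1), columns=["A"], dtype="float32")

# Route 2: an ndarray that already carries a non-default dtype.
df_arr = pd.DataFrame({"a": np.arange(4, dtype="int32")})

# Route 3: a Series constructed with an explicit dtype.
df_ser = pd.DataFrame({"s": pd.Series([1, 2, 3, 4], dtype="uint8")})

# Columns with different numeric dtypes sit side by side without being merged.
df_mixed = pd.DataFrame({
    "a": np.arange(4, dtype="int32"),
    "b": np.arange(4, dtype="int64"),
})

for name, frame in [("keyword", df_kw), ("ndarray", df_arr),
                    ("Series", df_ser), ("mixed", df_mixed)]:
    print(name, dict(frame.dtypes.astype(str)))
```

In each case the column should report the dtype it was given instead of being widened to int64 or float64.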

Other changes introduced in this PR (I moved all datetime-like issues to PR #2752, which should be merged first):

ENH:

BUG:

TST:

DOC:

It would be really helpful if some users could give this a test run before merging. I have put in test cases for numeric operations and for combining with DataFrame and Series, but I am sure some corner cases were missed.

In [17]: df1 = DataFrame(randn(8, 1), columns = ['A'], dtype = 'float32')

In [18]: df1
Out[18]: 
          A
0 -0.007220
1 -0.236432
2  2.427172
3 -0.998639
4 -1.039410
5  0.336029
6  0.832988
7 -0.413241

In [19]: df1.dtypes
Out[19]: A    float32

In [20]: df2 = DataFrame(dict(A=Series(randn(8), dtype='float16'),
   ....:                      B=Series(randn(8)),
   ....:                      C=Series(randn(8), dtype='uint8')))

In [22]: df2
Out[22]: 
          A         B    C
0  1.150391 -1.033296    0
1  0.123047  1.915564    0
2  0.151367 -0.489826    0
3 -0.565430 -0.734238    0
4 -0.352295 -0.451430    0
5 -0.618164  0.673102  255
6  1.554688  0.322035    0
7  0.160767  0.420718    0


In [23]: df2.dtypes
Out[23]: 
A    float16
B    float64
C      uint8

In [24]: # here you get some upcasting

In [25]: df3 = df1.reindex_like(df2).fillna(value=0.0) + df2

In [26]: df3
Out[26]: 
          A         B    C
0  1.143170 -1.033296    0
1 -0.113385  1.915564    0
2  2.578539 -0.489826    0
3 -1.564069 -0.734238    0
4 -1.391705 -0.451430    0
5 -0.282135  0.673102  255
6  2.387676  0.322035    0
7 -0.252475  0.420718    0

In [27]: df3.dtypes
Out[27]: 
A    float32
B    float64
C    float64
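The upcasting in `df3` appears consistent with NumPy's ordinary type-promotion rules (float32 with float16 gives float32; float64 with uint8 gives float64). The sketch below checks those two promotions directly and on small Series; it assumes a current NumPy and pandas, and the Series names are invented for the example.

```python
import numpy as np
import pandas as pd

# NumPy's promotion table for the two combinations seen in df3.
print(np.result_type("float32", "float16"))  # column A: float32 + float16
print(np.result_type("float64", "uint8"))    # column C: 0.0-filled float64 + uint8

# The same promotions on small Series (hypothetical data).
a32 = pd.Series([1.0, 2.0], dtype="float32")
a16 = pd.Series([0.5, 0.25], dtype="float16")
b64 = pd.Series([0.0, 0.0], dtype="float64")
c8 = pd.Series([0, 255], dtype="uint8")

sum_a = a32 + a16  # result keeps the wider float type
sum_c = b64 + c8   # unsigned integers are promoted to float64

print(sum_a.dtype, sum_c.dtype)
```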

The example from #622:

In [23]: a = np.array(np.random.randint(10, size=10**6),dtype='int32')

In [24]: b = np.array(np.random.randint(10, size=10**6),dtype='int64')

In [25]: df = pandas.DataFrame(dict(a = a, b = b))

In [26]: df.dtypes
Out[26]: 
a    int32
b    int64

Conversion examples

# conversion of dtypes
In [81]: df3.astype('float32').dtypes
Out[81]: 
A    float32
B    float32
C    float32

# mixed type conversions
In [82]: df3['D'] = '1.'

In [83]: df3['E'] = '1'

In [84]: df3.convert_objects(convert_numeric=True).dtypes
Out[84]: 
A    float32
B    float64
C    float64
D    float64
E      int64
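Note for readers on newer pandas: `convert_objects` was removed in later releases. A rough equivalent of the numeric conversion above, sketched with `pd.to_numeric` on a small stand-in frame (the frame and its contents are assumptions for illustration, not part of this PR), looks like this:

```python
import pandas as pd

# Object columns holding numeric-looking strings, mirroring columns D and E.
df = pd.DataFrame({"D": ["1.", "1.", "1."], "E": ["1", "1", "1"]}, dtype=object)

# Apply pd.to_numeric column by column; each column gets its own inferred dtype.
converted = df.apply(pd.to_numeric)

print(converted.dtypes)
```

Strings with a decimal point should come out as float64 and plain integer strings as int64, matching the D and E columns in the session above.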

# same, but specific dtype conversion
In [85]: df3['D'] = df3['D'].astype('float16')

In [86]: df3['E'] = df3['E'].astype('int32')

In [87]: df3.dtypes
Out[87]: 
A    float32
B    float64
C    float64
D    float16
E      int32

# forcing date coercion
In [18]: s = Series([datetime(2001,1,1,0,0), 'foo', 1.0, 1,
   ....:             Timestamp('20010104'), '20010105'],dtype='O')
   ....:

In [19]: s.convert_objects(convert_dates='coerce')
Out[19]: 
0   2001-01-01 00:00:00
1                   NaT
2                   NaT
3                   NaT
4   2001-01-04 00:00:00
5   2001-01-05 00:00:00
Dtype: datetime64[ns]