ENH/BUG/DOC: allow propagation and coexistence of numeric dtypes by jreback · Pull Request #2708 · pandas-dev/pandas
Support for numeric dtype propagation and coexistence in DataFrames. Prior to 0.10.2, numeric dtypes passed to DataFrames were always cast to `int64` or `float64`. Now, if a dtype is passed (either directly via the `dtype` keyword, a passed `ndarray`, or a passed `Series`), it will be preserved in DataFrame operations. Furthermore, different numeric dtypes will NOT be combined. The following example will give you a taste. This closes GH #622.
Other changes introduced in this PR (I moved all datetime-like issues to PR #2752, which should be merged first):
ENH:
- validated that `get_numeric_data` returns correct dtypes
- added a `blocks` attribute (and an `as_blocks()` method) that returns a dict of dtype -> homogeneously dtyped DataFrame, analogous to the `values` attribute (see the sketch after this list)
- added the keyword `raise_on_error` to `astype`, which can be set to `False` to exclude non-numeric columns (see the conversion examples below)
- changed `get_dtype_counts()` to use the `blocks` attribute
- changed `convert_objects()` to use the internals method `convert` (which operates on blocks)
- added the `convert_numeric` option to `convert_objects` to force numeric conversion (invalid values are set to `np.nan`); turned off by default
- added the `convert_dates='coerce'` option to `convert_objects` to force datetime-like conversion (invalid values are set to `NaT`); turned off by default; returns `datetime64[ns]` dtype
- groupby operations now respect dtype inputs wherever possible, even if intermediate casting is required (obviously if the inputs are ints and the results contain nans, this is cast); all cython functions are implemented
- auto-generation of most groupby functions by type is now in `generated_code.py`, e.g. (`group_add`, `group_mean`)
- added full `float32/int16/int8` support for all numeric operations, including (`diff`, `backfill`, `pad`, `take`)
- added `dtype` display on `Series` as a default
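As a rough sketch of the block-oriented introspection and groupby dtype preservation described above (the keys and resulting dtypes in the comments are illustrative assumptions, not guaranteed output):

# a minimal sketch, assuming the new blocks/as_blocks() API from this PR
import numpy as np
from pandas import DataFrame, Series

df = DataFrame(dict(A = Series(np.random.randn(4), dtype='float32'),
                    B = Series(np.random.randn(4)),
                    C = Series(np.zeros(4), dtype='uint8'),
                    k = ['x', 'x', 'y', 'y']))

df.get_dtype_counts()        # now built on blocks, e.g. float32: 1, float64: 1, uint8: 1, object: 1
df.as_blocks()               # dict of dtype -> homogeneously dtyped DataFrame
df.blocks                    # same mapping, exposed as an attribute

# groupby aims to respect the input dtype wherever possible
df.groupby('k')['A'].mean()  # would stay float32 per this PR (previously upcast to float64)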
BUG:
- fixed up tests for `from_records` to use record arrays directly. NOTE: using tuples will remove dtype info from the input stream (using a record array is ok though!)
- fixed merging to correctly merge on multiple dtypes with blocks (e.g. float64 and float32 in the other merger)
- #2623 (DataFrame.from_records incorrectly up-converts dtypes to object) can be fixed, but is dependent on #2752 (BUG: various bug fixes for DataFrame/Series construction)
- fixed #2778 (BUG: fillna with method segfaults on zero-length input, fixes #2775), a bug in pad/backfill with a 0-len frame
- fixed a very obscure bug in DataFrame.from_records with a dictionary and columns passed and hash randomization on!
- integer upcasts will now happen on `where` when using inplace ops (#2793: DataFrame inplace `where` doesn't work for mixed-datatype frames); see the sketch after this list
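A minimal sketch of the `where`/inplace upcast item above (the resulting dtype in the last comment is an assumption about the upcast target):

# inplace where on a mixed-dtype frame; masked-out slots must be able to hold NaN
import numpy as np
from pandas import DataFrame

df = DataFrame({'i': np.arange(4, dtype='int64'),
                'f': np.random.randn(4)})
df.where(df > 0, np.nan, inplace=True)  # replace non-positive values with NaN, in place
df.dtypes                               # 'i' is upcast (presumably to float64) so NaN fits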
TST:
- tests added for merging changes, astype, convert
- fixes for test_excel on 32-bit
- fixed test_resample_median_bug_1688
- separated out test_from_records_dictlike
- added tests for GH #797 (Panel constructor ignores dtype)
- added lots of tests for `where`
DOC:
- added a DataTypes section to the Data Structures intro
- whatsnew examples
It would be really helpful if some users could give this a test run before merging. I have put in test cases for numeric operations and for combining with DataFrame and Series, but I am sure there are some corner cases that were missed.
In [17]: df1 = DataFrame(randn(8, 1), columns = ['A'], dtype = 'float32')
In [18]: df1
Out[18]:
A
0 -0.007220
1 -0.236432
2 2.427172
3 -0.998639
4 -1.039410
5 0.336029
6 0.832988
7 -0.413241
In [19]: df1.dtypes
Out[19]:
A    float32
In [20]: df2 = DataFrame(dict( A = Series(randn(8),dtype='float16'),
   ....:                       B = Series(randn(8)),
   ....:                       C = Series(randn(8),dtype='uint8') ))
In [22]: df2
Out[22]:
A B C
0 1.150391 -1.033296 0
1 0.123047 1.915564 0
2 0.151367 -0.489826 0
3 -0.565430 -0.734238 0
4 -0.352295 -0.451430 0
5 -0.618164 0.673102 255
6 1.554688 0.322035 0
7 0.160767 0.420718 0
In [23]: df2.dtypes
Out[23]:
A float16
B float64
C uint8
In [24]: # here you get some upcasting
In [25]: df3 = df1.reindex_like(df2).fillna(value=0.0) + df2
In [26]: df3
Out[26]:
A B C
0 1.143170 -1.033296 0
1 -0.113385 1.915564 0
2 2.578539 -0.489826 0
3 -1.564069 -0.734238 0
4 -1.391705 -0.451430 0
5 -0.282135 0.673102 255
6 2.387676 0.322035 0
7 -0.252475 0.420718 0
In [27]: df3.dtypes
Out[27]:
A float32
B float64
C float64
The example from #622:
In [23]: a = np.array(np.random.randint(10, size=1e6),dtype='int32')
In [24]: b = np.array(np.random.randint(10, size=1e6),dtype='int64')
In [25]: df = pandas.DataFrame(dict(a = a, b = b))
In [26]: df.dtypes
Out[26]:
a int32
b int64
Conversion examples
# conversion of dtypes
In [81]: df3.astype('float32').dtypes
Out[81]:
A float32
B float32
C float32
# mixed type conversions
In [82]: df3['D'] = '1.'
In [83]: df3['E'] = '1'
In [84]: df3.convert_objects(convert_numeric=True).dtypes
Out[84]:
A float32
B float64
C float64
D float64
E int64
# same, but specific dtype conversion
In [85]: df3['D'] = df3['D'].astype('float16')
In [86]: df3['E'] = df3['E'].astype('int32')
In [87]: df3.dtypes
Out[87]:
A float32
B float64
C float64
D float16
E int32
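And a sketch of the new `astype(..., raise_on_error=False)` keyword from the ENH list, reusing the `df3` frame above (assuming non-convertible columns are simply left alone rather than raising):

# raise_on_error=False is described above as excluding non-numeric columns
df4 = df3.copy()
df4['F'] = 'foo'                                    # a column that cannot become float
df4.astype('float32', raise_on_error=False).dtypes  # numeric columns -> float32, 'F' left as-is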
# forcing date coercion
In [18]: s = Series([datetime(2001,1,1,0,0), 'foo', 1.0, 1,
....: Timestamp('20010104'), '20010105'],dtype='O')
....:
In [19]: s.convert_objects(convert_dates='coerce')
Out[19]:
0 2001-01-01 00:00:00
1 NaT
2 NaT
3 NaT
4 2001-01-04 00:00:00
5 2001-01-05 00:00:00
Dtype: datetime64[ns]