ENH: sql support for writing NaN + datetime64 values (GH2754, GH7103) by jorisvandenbossche · Pull Request #8208 · pandas-dev/pandas

@jreback Thanks for the pointers. It's actually the first time I've dug into the internals of pandas, so a few questions:

data_list = []  
for t in frame.iloc[start_i:end_i].itertuples():  
    data = dict((k, v) for k, v in zip(keys, t[1:]))  
    data_list.append(data)  

vs

data_list = [dict((k, v) for k, v in zip(keys, row)) for row in frame.values[start_i:end_i]]  

It is indeed a bit faster to iterate the array directly, but not by much.
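The two variants above can be put side by side on a throwaway frame to check they build the same dicts (the names and sizes here are illustrative, not from the PR):

```python
import numpy as np
import pandas as pd

# Throwaway frame purely for illustration.
frame = pd.DataFrame(np.random.randn(1000, 4), columns=list("abcd"))
keys = list(frame.columns)

# itertuples-based: t[0] is the index, so skip it
via_itertuples = [dict(zip(keys, t[1:])) for t in frame.itertuples()]

# direct iteration over the underlying ndarray
via_values = [dict(zip(keys, row)) for row in frame.values]

assert via_itertuples == via_values
```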

ncols = len(keys)  
col_list = [None] * ncols  
blocks = frame._data.blocks  
for b in blocks:  
    # cast the block to object and replace missing values with None  
    d = np.array(b.values, dtype=object)  
    mask = pd.isnull(d)  
    d[mask] = None  
    # put each block column back at its original column position  
    for col_loc, col in zip(b.mgr_locs, d):  
        col_list[col_loc] = col  
data_list = [dict((k, v) for k, v in zip(keys, row)) for row in zip(*col_list)]  

is indeed somewhat faster than

frame = frame.astype(object)  
data_list = [dict((k, v) for k, v in zip(keys, row)) for row in frame.values]  

or

frame = frame.astype(object)  
col_list = []  
for col in frame:  
    col_list.append(frame[col].values)  
data_list = [dict((k, v) for k, v in zip(keys, row)) for row in zip(*col_list)]  
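For reference, the interactive sessions below can be reproduced with a small frame along these lines (a sketch: the original df is not shown in the thread, and on current pandas the first case returns Timestamps, which subclass datetime.datetime, rather than plain datetime.datetime):

```python
import pandas as pd

# Hypothetical reconstruction of the frame; the values are read off
# the df.values output shown further down.
df = pd.DataFrame({
    "x": [0.0, 5.2, 63.2, 25.0, 12.2],
    "date": pd.date_range("2012-01-01", periods=5),
})
```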

In [177]: df['date'].astype(object).values  
Out[177]:  
array([datetime.datetime(2012, 1, 1, 0, 0),  
     datetime.datetime(2012, 1, 2, 0, 0),  
     datetime.datetime(2012, 1, 3, 0, 0),  
     datetime.datetime(2012, 1, 4, 0, 0),  
     datetime.datetime(2012, 1, 5, 0, 0)], dtype=object)  
In [178]: df['date'].values.astype(object)  
Out[178]:  
array([1325376000000000000L, 1325462400000000000L, 1325548800000000000L,  
     1325635200000000000L, 1325721600000000000L], dtype=object)  

So if I go through the data block by block, I will have to special-case the handling based on the dtype/block type.
To complete the picture (for me at least): if you call .values on the whole dataframe, you even get a third result (not ints or datetime.datetime objects, but Timestamps):
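A sketch of what that per-dtype special-casing could look like when assembling the rows column by column (the column names here are made up, and this goes through Series rather than raw blocks for brevity):

```python
import datetime
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [0.0, np.nan],
                   "date": pd.date_range("2012-01-01", periods=2)})

cols = []
for name in df.columns:
    s = df[name]
    if s.dtype.kind == "M":
        # datetime64 column: go through object dtype on the Series so we
        # get datetime-like objects instead of raw i8 nanoseconds
        d = s.astype(object).values
    else:
        # numeric column: a plain object cast is enough
        d = s.values.astype(object)
    d[pd.isnull(d)] = None  # replace NaN/NaT with None for the DB driver
    cols.append(d)

data_list = [dict(zip(df.columns, row)) for row in zip(*cols)]
```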

In [183]: df.values  
Out[183]:  
array([[0.0, Timestamp('2012-01-01 00:00:00')],  
     [5.2, Timestamp('2012-01-02 00:00:00')],  
     [63.2, Timestamp('2012-01-03 00:00:00')],  
     [25.0, Timestamp('2012-01-04 00:00:00')],  
     [12.2, Timestamp('2012-01-05 00:00:00')]], dtype=object)  

The main speed-up comes from no longer having to call maybe_asscalar on each individual value:

def maybe_asscalar(self, i):
    try:
        return np.asscalar(i)
    except AttributeError:
        return i

This is no longer needed if I cast to object somewhere as above (are there cases where this will not be equivalent?).
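A quick check that the object cast already produces native Python scalars, which is why the per-value conversion becomes redundant (a sketch, not the PR's code; .item() is used here as the per-element equivalent of np.asscalar):

```python
import numpy as np

arr = np.array([1.5, 2.5])
obj = arr.astype(object)

# the object cast already yields native Python floats...
assert type(obj[0]) is float
# ...with the same value a per-element conversion would give
assert arr[0].item() == obj[0]
```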