ENH: sql support for writing NaN + datetime64 values (GH2754, GH7103) by jorisvandenbossche · Pull Request #8208 · pandas-dev/pandas
@jreback Thanks for the pointers. It's actually the first time I've gone a bit into the internals of pandas, so some questions:
- Is `itertuples` that slow if you actually need tuples? (in fact, we need a dict, not a tuple)

```python
data_list = []
for t in frame.iloc[start_i:end_i].itertuples():
    data = dict((k, v) for k, v in zip(keys, t[1:]))
    data_list.append(data)
```

vs

```python
data_list = [dict((k, v) for k, v in zip(keys, row)) for row in frame.values[start_i:end_i]]
```

It is indeed a bit faster to directly iterate the array, but not that much.
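For reference, a rough way to time the two variants (just a sketch; the `frame`, `keys`, `start_i` and `end_i` below are stand-ins for the objects used in the snippets above):

```python
import timeit

import numpy as np
import pandas as pd

# stand-in data, only to make the comparison runnable
frame = pd.DataFrame(np.random.randn(10000, 4), columns=list('abcd'))
keys = list(frame.columns)
start_i, end_i = 0, len(frame)

def via_itertuples():
    # build the list of dicts by iterating row tuples
    data_list = []
    for t in frame.iloc[start_i:end_i].itertuples():
        data_list.append(dict((k, v) for k, v in zip(keys, t[1:])))
    return data_list

def via_values():
    # build the list of dicts by iterating the underlying ndarray rows
    return [dict((k, v) for k, v in zip(keys, row))
            for row in frame.values[start_i:end_i]]

print(timeit.timeit(via_itertuples, number=10))
print(timeit.timeit(via_values, number=10))
```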
- What is the advantage of going block by block (instead of just making the list of dicts directly from `frame.values`)? Speed? Or also something else?
Using a block by block approach to create the list of arrays and then the list of dicts:

```python
ncols = len(keys)
col_list = [None] * ncols
blocks = frame._data.blocks
for i in range(len(blocks)):
    b = blocks[i]
    d = np.array(b.values, dtype=object)
    mask = pd.isnull(d)
    d[mask] = None
    for col_loc, col in zip(b.mgr_locs, d):
        col_list[col_loc] = col
data_list = [dict((k, v) for k, v in zip(keys, row)) for row in zip(*col_list)]
```

is indeed somewhat faster than
```python
frame = frame.astype(object)
data_list = [dict((k, v) for k, v in zip(keys, row)) for row in frame.values]
```

or
```python
frame = frame.astype(object)
col_list = []
for col in frame:
    col_list.append(frame[col].values)
data_list = [dict((k, v) for k, v in zip(keys, row)) for row in zip(*col_list)]
```

- But the advantage of the two approaches above over going block by block is that `frame.astype(object)` seems to do exactly what I need (numpy numeric types to native types, datetime64 to datetime.datetime, ...), while `values.astype(object)` (which is in fact what happens with the block approach, as it works there with numpy arrays) does not really do the same.
Example (assuming `df` is a mixed dtype dataframe with a datetime64 column 'date'):

```python
In [177]: df['date'].astype(object).values
Out[177]:
array([datetime.datetime(2012, 1, 1, 0, 0),
       datetime.datetime(2012, 1, 2, 0, 0),
       datetime.datetime(2012, 1, 3, 0, 0),
       datetime.datetime(2012, 1, 4, 0, 0),
       datetime.datetime(2012, 1, 5, 0, 0)], dtype=object)

In [178]: df['date'].values.astype(object)
Out[178]:
array([1325376000000000000L, 1325462400000000000L, 1325548800000000000L,
       1325635200000000000L, 1325721600000000000L], dtype=object)
```

So if I am going through the data block by block, I will have to special-case the handling of the data based on the dtype/block type.
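For what it's worth, here is a minimal sketch of what such special-casing could look like when walking over the blocks (a hypothetical helper, not necessarily what this PR ends up doing; the datetime64 branch goes through a microsecond cast so that `astype(object)` yields `datetime.datetime` instead of nanosecond integers):

```python
import numpy as np
import pandas as pd

def block_values_to_object(block):
    # hypothetical helper: convert a block's 2D values to an object array
    # of Python scalars, special-casing datetime64 blocks
    values = block.values
    if values.dtype.kind == 'M':
        # casting to datetime64[us] first makes astype(object) produce
        # datetime.datetime objects rather than raw nanosecond integers
        d = values.astype('datetime64[us]').astype(object)
    else:
        d = values.astype(object)
    d[pd.isnull(d)] = None  # replace NaN/NaT with None for the DB driver
    return d
```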
To make the complexity complete (for me at least): if you do `.values` on the whole dataframe, you even get a third result (not ints or datetime.datetime, but Timestamps):
```python
In [183]: df.values
Out[183]:
array([[0.0, Timestamp('2012-01-01 00:00:00')],
       [5.2, Timestamp('2012-01-02 00:00:00')],
       [63.2, Timestamp('2012-01-03 00:00:00')],
       [25.0, Timestamp('2012-01-04 00:00:00')],
       [12.2, Timestamp('2012-01-05 00:00:00')]], dtype=object)
```

The main speed-up will come from not having to call `maybe_asscalar` anymore on each individual value:
```python
def maybe_asscalar(self, i):
    try:
        return np.asscalar(i)
    except AttributeError:
        return i
```
This is not needed anymore if I cast to object somewhere as above (are there cases where this will not be equivalent?).
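A quick sanity check for the simple numeric case (illustrative only, and it does not cover the datetime64 cases above; note that `np.asscalar(x)` is just `x.item()` and has been deprecated in newer numpy):

```python
import numpy as np

arr = np.array([1.5, 2.5, np.nan])

# element-wise np.asscalar vs. a single cast of the whole array to object
via_asscalar = [np.asscalar(x) for x in arr]
via_astype = list(arr.astype(object))

# for plain float64 data both routes yield native Python floats,
# with NaN passed through unchanged
assert all(type(x) is float for x in via_asscalar)
assert all(type(x) is float for x in via_astype)
```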