ENH: sql support for writing NaN + datetime64 values (GH2754, GH7103) by jorisvandenbossche · Pull Request #8208 · pandas-dev/pandas
@jreback Thanks for the pointers. It's actually the first time I've gone a bit into the internals of pandas, so some questions:
- Is `itertuples` that slow if you actually need tuples? (in fact, we need a dict, not a tuple)

```python
data_list = []
for t in frame.iloc[start_i:end_i].itertuples():
    data = dict((k, v) for k, v in zip(keys, t[1:]))
    data_list.append(data)
```

vs

```python
data_list = [dict((k, v) for k, v in zip(keys, row)) for row in frame.values[start_i:end_i]]
```

It is indeed a bit faster to directly iterate the array, but not that much.
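For reference, a rough way to time the two variants (just a sketch; the `frame`, `keys`, `start_i` and `end_i` below are stand-ins for the objects used in the snippets above):

```python
import timeit

import numpy as np
import pandas as pd

# stand-in data, only to make the comparison runnable
frame = pd.DataFrame(np.random.randn(10000, 4), columns=list('abcd'))
keys = list(frame.columns)
start_i, end_i = 0, len(frame)

def via_itertuples():
    # build the list of dicts by iterating row tuples
    data_list = []
    for t in frame.iloc[start_i:end_i].itertuples():
        data_list.append(dict((k, v) for k, v in zip(keys, t[1:])))
    return data_list

def via_values():
    # build the list of dicts by iterating the underlying ndarray rows
    return [dict((k, v) for k, v in zip(keys, row))
            for row in frame.values[start_i:end_i]]

print(timeit.timeit(via_itertuples, number=10))
print(timeit.timeit(via_values, number=10))
```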
- What is the advantage of going block by block (instead of just making the list of dicts directly from `frame.values`)? Speed? Or also something else?
Using a block by block approach to create the list of arrays and then the list of dicts:

```python
ncols = len(keys)
col_list = [None] * ncols
blocks = frame._data.blocks
for i in range(len(blocks)):
    b = blocks[i]
    d = np.array(b.values, dtype=object)
    mask = pd.isnull(d)
    d[mask] = None
    for col_loc, col in zip(b.mgr_locs, d):
        col_list[col_loc] = col
data_list = [dict((k, v) for k, v in zip(keys, row)) for row in zip(*col_list)]
```

is indeed somewhat faster than
```python
frame = frame.astype(object)
data_list = [dict((k, v) for k, v in zip(keys, row)) for row in frame.values]
```

or
```python
frame = frame.astype(object)
col_list = []
for col in frame:
    col_list.append(frame[col].values)
data_list = [dict((k, v) for k, v in zip(keys, row)) for row in zip(*col_list)]
```

- But the advantage of the two approaches above over going block by block is that `frame.astype(object)` seems to do exactly what I need (numpy numeric types to native types, datetime64 to datetime.datetime, ...), while `values.astype(object)` (which is in fact what happens with the block approach, as it works there with numpy arrays) does not really do the same.
Example (assuming `df` is a mixed dtype dataframe with a datetime64 column 'date'):

```python
In [177]: df['date'].astype(object).values
Out[177]:
array([datetime.datetime(2012, 1, 1, 0, 0),
       datetime.datetime(2012, 1, 2, 0, 0),
       datetime.datetime(2012, 1, 3, 0, 0),
       datetime.datetime(2012, 1, 4, 0, 0),
       datetime.datetime(2012, 1, 5, 0, 0)], dtype=object)

In [178]: df['date'].values.astype(object)
Out[178]:
array([1325376000000000000L, 1325462400000000000L, 1325548800000000000L,
       1325635200000000000L, 1325721600000000000L], dtype=object)
```

So if I am going through the data block by block, I will have to special-case the handling of the data based on the dtype/block type.
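For what it's worth, here is a minimal sketch of what such special-casing could look like when walking over the blocks (a hypothetical helper, not necessarily what this PR ends up doing; the datetime64 branch goes through a microsecond cast so that `astype(object)` yields `datetime.datetime` instead of nanosecond integers):

```python
import numpy as np
import pandas as pd

def block_values_to_object(block):
    # hypothetical helper: convert a block's 2D values to an object array
    # of Python scalars, special-casing datetime64 blocks
    values = block.values
    if values.dtype.kind == 'M':
        # casting to datetime64[us] first makes astype(object) produce
        # datetime.datetime objects rather than raw nanosecond integers
        d = values.astype('datetime64[us]').astype(object)
    else:
        d = values.astype(object)
    d[pd.isnull(d)] = None  # replace NaN/NaT with None for the DB driver
    return d
```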
To make the complexity complete (for me at least): if you do `.values` on the whole dataframe, you even get a third result (not ints or datetime.datetime, but Timestamps):
```python
In [183]: df.values
Out[183]:
array([[0.0, Timestamp('2012-01-01 00:00:00')],
       [5.2, Timestamp('2012-01-02 00:00:00')],
       [63.2, Timestamp('2012-01-03 00:00:00')],
       [25.0, Timestamp('2012-01-04 00:00:00')],
       [12.2, Timestamp('2012-01-05 00:00:00')]], dtype=object)
```

The main speed-up will come from not having to call `maybe_asscalar` anymore on each individual value:
```python
def maybe_asscalar(self, i):
    try:
        return np.asscalar(i)
    except AttributeError:
        return i
```
This is not needed anymore if I cast to object somewhere as above (are there cases where this will not be equivalent?).
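A quick sanity check for the simple numeric case (illustrative only, and it does not cover the datetime64 cases above; note that `np.asscalar(x)` is just `x.item()` and has been deprecated in newer numpy):

```python
import numpy as np

arr = np.array([1.5, 2.5, np.nan])

# element-wise np.asscalar vs. a single cast of the whole array to object
via_asscalar = [np.asscalar(x) for x in arr]
via_astype = list(arr.astype(object))

# for plain float64 data both routes yield native Python floats,
# with NaN passed through unchanged
assert all(type(x) is float for x in via_asscalar)
assert all(type(x) is float for x in via_astype)
```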