BUG: convert nan to None before insert data into mysql by simomo · Pull Request #4200 · pandas-dev/pandas
@simomo Can you include a test demonstrating expected behavior? See pandas/io/tests/test_sql.py for examples.
Also, of course, including `None` requires the `object` dtype. Does this come with a performance cost? @jreback?
In [30]: frame
Out[30]:
0 1
0 2 NaN
1 3 -1.029046
In [31]: frame.dtypes
Out[31]:
0 int64
1 float64
dtype: object
In [32]: frame.where(pd.notnull(frame), None).dtypes
Out[32]:
0 int64
1 object
dtype: object
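For the test request above, a roundtrip sketch might look roughly like this (an assumption on my part, using an in-memory SQLite connection rather than MySQL; the real tests belong in pandas/io/tests/test_sql.py):

```python
import sqlite3
import numpy as np
import pandas as pd

def test_nan_roundtrip():
    # hypothetical test sketch: NaN/None should be written as SQL NULL
    # and come back as NaN (float column) / None (object column)
    frame = pd.DataFrame({"a": [1.0, np.nan], "b": ["x", None]})
    with sqlite3.connect(":memory:") as con:
        frame.to_sql("t", con, index=False)
        result = pd.read_sql("SELECT * FROM t", con)
    assert np.isnan(result.loc[1, "a"])
    assert result.loc[1, "b"] is None
```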
This is quite complicated and shouldn't be done this way. The main issue is different reprs for datetime/non-datetime. There are better ways of doing this (these are internal routines), e.g. see core/format.py/CSVFormatter/_save_chunk. This has to do with how things are converted/passed to SQL, e.g. whether they need to be stringified or not.
You are going to need to segregate by block type, then convert (or not convert) as needed, substituting appropriate 'null' sentinels (which might be different for different flavors?)
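A rough sketch of what that per-block (here approximated per-column) conversion could look like; the helper name and the choice of `None` as the null sentinel are assumptions, not pandas API:

```python
import pandas as pd

def frame_to_sql_rows(df, null=None):
    # hypothetical helper, not pandas API: handle each column according
    # to its dtype and substitute a null sentinel for NaN/NaT
    converted = {}
    for col in df.columns:
        s = df[col]
        if pd.api.types.is_datetime64_any_dtype(s):
            # some flavors/drivers want datetimes stringified
            vals = s.dt.strftime("%Y-%m-%d %H:%M:%S").astype(object)
        else:
            vals = s.astype(object)
        vals[s.isnull()] = null
        converted[col] = vals
    return [tuple(row) for row in pd.DataFrame(converted)[df.columns].to_numpy()]
```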
@jreback Why is it necessary to have different sentinels for NaN and NaT?
(I agree this should be done on write/like _save_chunk...)
It's not necessary per se, but I suspect that the different SQL flavors have different sentinels (if `None` works for everything then great)....
for perf though...this may need to be optimized
Idea being to abstract the problem of None to SQLAlchemy, assuming it Just Works™. Which I thought was kind of the point of it...
Yeah, perf could be an issue - in which case we'll end up writing a load of platform specific stuff? :s
I am not sure what the perf diff will be, just have to profile it. You might simply want to do something like:
values = df.values.astype(object)
values[pd.isnull(df)] = None
prob should work and be pretty fast (not 100% sure what this will do to dates though)
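For what it's worth, a tiny demonstration of that trick on a frame like the one shown earlier (the concrete values here are just made up):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({0: [2, 3], 1: [np.nan, -1.029046]})

values = frame.values.astype(object)     # note: .values upcasts the int column to float here
values[pd.isnull(frame).values] = None   # NaN (and NaT) positions become None
rows = [tuple(r) for r in values]
# rows == [(2.0, None), (3.0, -1.029046)]
```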
I think you are going to have to do backend-specific conversions (mostly on NaN/None, but also datetimes). IIRC mysql stores dates as strings? though some of this may be converted from datetimes
@danielballan do we care how it's stored... provided the roundtrip works (and probably also that the dtype is sensible) is all good...?
Do we need to know in order to query (for None)? Can't SQLAlchemy compile your query in a clever way (worrying about the platform specific bit), maybe I've got it totally wrong?
No, I don't think we care. The second comment on the SO question is troubling, or at least confusing to me. I think all flavors of SQL just have NULL, and we'll want those to ultimately come out as np.nan. Certainly not 'NaN'.
@danielballan you will for sure need to do type conversions on the readback, e.g. make sure dates are correct (you can just use `convert_objects(convert_dates='coerce')`). You also may want to do `convert_objects(convert_numeric=True)` on the numeric columns (may only be necessary depending on how results are returned).
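A small sketch of what that readback coercion could look like; `convert_objects` is gone in current pandas, so this uses `pd.to_datetime`/`pd.to_numeric` with `errors='coerce'` as the equivalent (the helper name and column lists are assumptions):

```python
import pandas as pd

def coerce_readback(df, date_cols=(), numeric_cols=()):
    # hypothetical helper: fix up dtypes after reading rows back from SQL
    for col in date_cols:
        df[col] = pd.to_datetime(df[col], errors="coerce")  # unparseable -> NaT
    for col in numeric_cols:
        df[col] = pd.to_numeric(df[col], errors="coerce")   # unparseable -> NaN
    return df
```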
Filling NaN with None produces an error: `ValueError: must specify a fill method or value`.
Is it the same bug as reported in this thread?
@jreback Convert `np.nan` fields to `None` values (for `dtype=object`, of course).
At the same time I can do (i.e. there is no error):
df['col1'].apply(lambda x: None if pd.isnull(x) else x)
which seems to be equivalent to filling NaN with None as above.
well, aside from the apply being MUCH slower, fillna also does dtype inference.
I meant what is your purpose in doing this?
@jreback I am performing an outer join of two tables, so I am getting `np.nan`s.
Later I am using this data to interact with MongoDB and I want to have `None` for missing fields. (I don't want to make conversions each time when I read from or write to the database.)
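A minimal sketch of that conversion before handing records to MongoDB, reusing the object-array trick from above (the column names are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan], "b": ["x", None]})

values = df.values.astype(object)
values[pd.isnull(df).values] = None
records = [dict(zip(df.columns, row)) for row in values]
# records == [{'a': 1.0, 'b': 'x'}, {'a': None, 'b': None}]
# records can then be passed to e.g. pymongo's insert_many
```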
BTW: Why is `apply` much slower? Or, in general, what should be used for mapping columns?
ahh...then this is the same issue (has to do with exporting `np.nan` -> `None`, or the appropriate value if say it's `NaT`)
well, `apply` is not vectorized, so you should use a vectorized op instead if at all possible; `fillna` is Cython-based so pretty fast. `apply` is very general though
@stared `apply` will be much slower than most (all?) ops that are built into pandas already. E.g., `fillna` does a specific thing, so it doesn't need to accept an arbitrary Python function like `apply` does. It is therefore free to use whatever numpy and maybe Cython it has available to do its job. `apply` must be very general, so it is going to be slow.
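To make the difference concrete, a quick (unscientific) timing sketch comparing the two:

```python
import numpy as np
import pandas as pd
from timeit import timeit

s = pd.Series(np.random.randn(1_000_000))
s[::3] = np.nan

t_apply = timeit(lambda: s.apply(lambda x: 0.0 if pd.isnull(x) else x), number=3)
t_fillna = timeit(lambda: s.fillna(0.0), number=3)
print(t_apply, t_fillna)  # fillna is typically orders of magnitude faster
```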
@hayd I believe you are going to do this as part of the big SQL refactor (and it's already linked), so closing