BUG: Timestamp cannot parse nanosecond from string by sinhrks · Pull Request #7907 · pandas-dev/pandas (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation13 Commits1 Checks0 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

sinhrks

Fixes 2 problems related to Timestamp string parsing:

# Result after the fix

# If string contains offset, it will be parsed using fixed timezone offset. Following results in 2013-11-01 05:00:00 in UTC (There is some existing tests checking this behaviour).
repr(pd.Timestamp('2013-11-01 00:00:00-0500'))
# Timestamp('2013-11-01 00:00:00-0500', tz='pytz.FixedOffset(-300)')

# If tz is specified simultaneously, it should convert the timezone. 
repr(pd.Timestamp('2013-11-01 00:00:00-0500', tz='America/Chicago'))
# Timestamp('2013-11-01 00:00:00-0500', tz='America/Chicago')

repr(pd.Timestamp('2013-11-01 00:00:00-0500', tz='Asia/Tokyo'))
# Timestamp('2013-11-01 14:00:00+0900', tz='Asia/Tokyo')
# Result after the fix

repr(pd.Timestamp('2001-01-01 00:00:00.000000005'))
# Timestamp('2001-01-01 00:00:00.000000005')

repr(pd.Timestamp('2001-01-01 00:00:00.000000005', tz='US/Eastern'))
# Timestamp('2001-01-01 00:00:00.000000005-0500', tz='US/Eastern')

repr(pd.Timestamp('2001-01-01 00:00:00.000001234+09:00'))
# Timestamp('2001-01-01 00:00:00.000001234+0900', tz='pytz.FixedOffset(540)')

repr(pd.Timestamp('2001-01-01 00:00:00.000001234+09:00', tz='Asia/Tokyo'))
# Timestamp('2001-01-01 00:00:00.000001234+0900', tz='Asia/Tokyo')

# Because non ISO 8601 format is parsed by dateutil, nanosecond will lost (no change)
repr(pd.Timestamp('01-01-01 00:00:00.000000001'))
# Timestamp('2001-01-01 00:00:00')

repr(pd.Timestamp('01-01-01 00:00:00.000000001+9:00'))
# Timestamp('2001-01-01 00:00:00+0900', tz='tzoffset(None, 32400)')

CC @cyber42 @ischwabacher @rockg @adamgreenhall

sinhrks

obj.value = pandas_datetimestruct_to_datetime(PANDAS_FR_ns, &obj.dts)
_check_dts_bounds(&obj.dts)
if out_tzoffset != -1:
obj.tzinfo = pytz.FixedOffset(out_tzoffset)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which is preferable for fixed offset, pytz.FixedOffset or dateutil.tz.tzoffset?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

create a pytz ; dateutil must be explicitly specified (in theory could have an option for this but will defer that to the future - you can create an issue if u would like - pls copy @dbew on it)

@jreback

@sinhrks can you run a perf test and post if anything amiss (I don't think so, but just to be sure)

@sinhrks

As the original one gets little slower, modified the logic a little. Following is a perf before/after the modification. Looks varied, but no test gets constantly worse.

Perf with original logic in this PR (for ref)

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
read_parse_dates_iso8601                     |   3.5694 |   2.3197 |   1.5387 |

1st run with current logic

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
read_parse_dates_iso8601                     |   2.3770 |   2.3616 |   1.0065 |
...
packers_read_pickle                          | 306.5277 | 254.1726 |   1.2060 |
frame_apply_lambda_mean                      |   9.4993 |   7.8301 |   1.2132 |
write_store_table_dc                         | 246.1410 | 197.7520 |   1.2447 |
frame_dtypes                                 |   0.2213 |   0.1760 |   1.2579 |
reindex_daterange_pad                        |   2.0043 |   1.5914 |   1.2595 |
read_store_table_mixed                       |  21.8923 |  17.3550 |   1.2614 |
frame_mask_bools                             |  33.6537 |  26.1207 |   1.2884 |
frame_get_numeric_data                       |   0.2956 |   0.2290 |   1.2908 |
indexing_dataframe_boolean_rows_object       |   1.3426 |   0.9793 |   1.3710 |
index_float64_div                            |   8.8797 |   6.3930 |   1.3890 |
frame_dropna_axis1_all                       | 426.2613 | 299.9337 |   1.4212 |

2nd run

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
read_parse_dates_iso8601                     |   2.3813 |   2.3716 |   1.0041 |
...
strings_match                                |  12.7503 |  10.5516 |   1.2084 |
timeseries_1min_5min_mean                    |   1.5577 |   1.2853 |   1.2119 |
reindex_daterange_backfill                   |   1.8867 |   1.5500 |   1.2172 |
frame_ctor_dtindex_CBMonthEndx1              |   7.7756 |   6.3493 |   1.2246 |
eval_frame_add_all_threads                   |  35.2084 |  28.6183 |   1.2303 |
frame_ctor_dtindex_CustomBusinessDayx1       |   3.8170 |   3.0283 |   1.2604 |
eval_frame_mult_one_thread                   |  37.0866 |  29.2060 |   1.2698 |
eval_frame_chained_cmp_python_one_thread     | 135.0936 | 104.4240 |   1.2937 |
query_with_boolean_selection                 |  74.9604 |  57.4800 |   1.3041 |
dtindex_from_series_ctor                     |   0.0354 |   0.0270 |   1.3088 |
strings_center                               |   9.6043 |   7.3230 |   1.3115 |
stat_ops_frame_sum_int_axis_1                |   2.3720 |   1.7384 |   1.3645 |
strings_pad                                  |  10.5731 |   7.6413 |   1.3837 |
groupby_multi_different_functions            |  38.6817 |  25.0644 |   1.5433 |

@jreback

@rockg

This is kind of difficult to go through. The tests look okay to me...cannot comment on the specific changes, though.

jreback

@@ -399,7 +401,8 @@ Bug Fixes
- Bug in ``Series.str.cat`` with an index which was filtered as to not include the first item (:issue:`7857`)
- Bug in ``Timestamp`` cannot parse ``nanosecond`` from string (:issue:`7878`)
- Bug in ``Timestamp`` with string offset and ``tz`` results in incorrect (:issue:`7833`)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mean to say just incorrect, right?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, should be more specific, or grammatically wrong?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was just a minor grammar point

just say incorrect (not in incorrect)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, fixed.

@jreback

@sinhrks

@sinhrks

jreback added a commit that referenced this pull request

Aug 11, 2014

@jreback

BUG: Timestamp cannot parse nanosecond from string

@jreback

This was referenced

Aug 11, 2014