BUG: Timestamp cannot parse nanosecond from string by sinhrks · Pull Request #7907 · pandas-dev/pandas (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation13 Commits1 Checks0 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

Fixes 2 problems related to Timestamp string parsing:

Timestamp parsing results incorrect if input contains offset string (Closes BUG: Constructing Timestamp from local time + offset + time zone yields wrong offset #7833).
NOTE: Also modified Timestamp.__repr__ to display fixed timezone info, because this can be either pytz or dateutil.

# Result after the fix

# If string contains offset, it will be parsed using fixed timezone offset. Following results in 2013-11-01 05:00:00 in UTC (There is some existing tests checking this behaviour).
repr(pd.Timestamp('2013-11-01 00:00:00-0500'))
# Timestamp('2013-11-01 00:00:00-0500', tz='pytz.FixedOffset(-300)')

# If tz is specified simultaneously, it should convert the timezone. 
repr(pd.Timestamp('2013-11-01 00:00:00-0500', tz='America/Chicago'))
# Timestamp('2013-11-01 00:00:00-0500', tz='America/Chicago')

repr(pd.Timestamp('2013-11-01 00:00:00-0500', tz='Asia/Tokyo'))
# Timestamp('2013-11-01 14:00:00+0900', tz='Asia/Tokyo')

Timestamp loses nanosecond info when parsing from str. (Closes Timestamp losing nanosecond accuracy when reading from string #7878)

# Result after the fix

repr(pd.Timestamp('2001-01-01 00:00:00.000000005'))
# Timestamp('2001-01-01 00:00:00.000000005')

repr(pd.Timestamp('2001-01-01 00:00:00.000000005', tz='US/Eastern'))
# Timestamp('2001-01-01 00:00:00.000000005-0500', tz='US/Eastern')

repr(pd.Timestamp('2001-01-01 00:00:00.000001234+09:00'))
# Timestamp('2001-01-01 00:00:00.000001234+0900', tz='pytz.FixedOffset(540)')

repr(pd.Timestamp('2001-01-01 00:00:00.000001234+09:00', tz='Asia/Tokyo'))
# Timestamp('2001-01-01 00:00:00.000001234+0900', tz='Asia/Tokyo')

# Because non ISO 8601 format is parsed by dateutil, nanosecond will lost (no change)
repr(pd.Timestamp('01-01-01 00:00:00.000000001'))
# Timestamp('2001-01-01 00:00:00')

repr(pd.Timestamp('01-01-01 00:00:00.000000001+9:00'))
# Timestamp('2001-01-01 00:00:00+0900', tz='tzoffset(None, 32400)')

CC @cyber42 @ischwabacher @rockg @adamgreenhall

sinhrks

obj.value = pandas_datetimestruct_to_datetime(PANDAS_FR_ns, &obj.dts)
_check_dts_bounds(&obj.dts)
if out_tzoffset != -1:
obj.tzinfo = pytz.FixedOffset(out_tzoffset)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which is preferable for fixed offset, pytz.FixedOffset or dateutil.tz.tzoffset?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

create a pytz ; dateutil must be explicitly specified (in theory could have an option for this but will defer that to the future - you can create an issue if u would like - pls copy @dbew on it)

@sinhrks can you run a perf test and post if anything amiss (I don't think so, but just to be sure)

As the original one gets little slower, modified the logic a little. Following is a perf before/after the modification. Looks varied, but no test gets constantly worse.

Perf with original logic in this PR (for ref)

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
read_parse_dates_iso8601                     |   3.5694 |   2.3197 |   1.5387 |

1st run with current logic

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
read_parse_dates_iso8601                     |   2.3770 |   2.3616 |   1.0065 |
...
packers_read_pickle                          | 306.5277 | 254.1726 |   1.2060 |
frame_apply_lambda_mean                      |   9.4993 |   7.8301 |   1.2132 |
write_store_table_dc                         | 246.1410 | 197.7520 |   1.2447 |
frame_dtypes                                 |   0.2213 |   0.1760 |   1.2579 |
reindex_daterange_pad                        |   2.0043 |   1.5914 |   1.2595 |
read_store_table_mixed                       |  21.8923 |  17.3550 |   1.2614 |
frame_mask_bools                             |  33.6537 |  26.1207 |   1.2884 |
frame_get_numeric_data                       |   0.2956 |   0.2290 |   1.2908 |
indexing_dataframe_boolean_rows_object       |   1.3426 |   0.9793 |   1.3710 |
index_float64_div                            |   8.8797 |   6.3930 |   1.3890 |
frame_dropna_axis1_all                       | 426.2613 | 299.9337 |   1.4212 |

2nd run

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
read_parse_dates_iso8601                     |   2.3813 |   2.3716 |   1.0041 |
...
strings_match                                |  12.7503 |  10.5516 |   1.2084 |
timeseries_1min_5min_mean                    |   1.5577 |   1.2853 |   1.2119 |
reindex_daterange_backfill                   |   1.8867 |   1.5500 |   1.2172 |
frame_ctor_dtindex_CBMonthEndx1              |   7.7756 |   6.3493 |   1.2246 |
eval_frame_add_all_threads                   |  35.2084 |  28.6183 |   1.2303 |
frame_ctor_dtindex_CustomBusinessDayx1       |   3.8170 |   3.0283 |   1.2604 |
eval_frame_mult_one_thread                   |  37.0866 |  29.2060 |   1.2698 |
eval_frame_chained_cmp_python_one_thread     | 135.0936 | 104.4240 |   1.2937 |
query_with_boolean_selection                 |  74.9604 |  57.4800 |   1.3041 |
dtindex_from_series_ctor                     |   0.0354 |   0.0270 |   1.3088 |
strings_center                               |   9.6043 |   7.3230 |   1.3115 |
stat_ops_frame_sum_int_axis_1                |   2.3720 |   1.7384 |   1.3645 |
strings_pad                                  |  10.5731 |   7.6413 |   1.3837 |
groupby_multi_different_functions            |  38.6817 |  25.0644 |   1.5433 |

This is kind of difficult to go through. The tests look okay to me...cannot comment on the specific changes, though.

jreback

@@ -399,7 +401,8 @@ Bug Fixes
- Bug in ``Series.str.cat`` with an index which was filtered as to not include the first item (:issue:`7857`)



- Bug in ``Timestamp`` cannot parse ``nanosecond`` from string (:issue:`7878`)
- Bug in ``Timestamp`` with string offset and ``tz`` results in incorrect (:issue:`7833`)