BUG: Fixed Unicode decoding error in `Period.strftime` when a locale-specific directive is used by smarie · Pull Request #46405 · pandas-dev/pandas (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation104 Commits60 Checks0 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

smarie changed the title~~Fixed Unicode decoding error in Period.strftime and PeriodIndex.strftime when a locale-specific directive is used~~ Fixed Unicode decoding error in Period.strftime when a locale-specific directive is used

I have an issue: I can reproduce the bug locally but not on CI.
It seems that there is no CI runner as of today that uses non-utf8 encoded locale.

The issue can only be reproduced when the current locale encoding is not utf8:

(zh_CN defaults encoding is gb2312)

What is the recommended way to proceed ?

fix the issue without having the CI reproduce it ? (not recommended...)
add a workflow in posix.yml
other ?

I'll try the second option for now (when github issues are fixed, it seems down for now)

gha runner was not triggered apparently :( re-trying

Unnfortunately it does not work :(

/usr/bin/bash: warning: setlocale: LC_ALL: cannot change locale (zh_CN.GB2312)

EDIT: retrying with zh_CN.gb2312

ok with lower case zh_CN.gb2312 it is worse: the runner does not even start :(
Can anyone help me to define a runner configured with this locale+encoding ? (@jreback maybe ?) Otherwise I am not able to reproduce the issue on CI, which is a bit problematic to prevent regression once I push the fix.

Sylvain MARIE added 5 commits

March 18, 2022 15:45

…ure/46319_locale_strftime_period

� Conflicts: � pandas/tests/io/formats/test_format.py

Sylvain MARIE added 7 commits

March 19, 2022 11:04

params=[
pytest.param(None, id=str(locale.getlocale())),
"it_IT.utf8",
"zh_CN",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe

"zh_CN",
"zh_CN", # Note: encoding is automatically 'gb2312'

What do you think ?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be good to add zh_CH.utf8 and zh_CH.gb2312 explicitly

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'll add zh_CH.utf8.
Strangely enough, zh_CN.gb2312 did not seem to be working, but zh_CN does involve gb2312 .

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will also add it_IT by the way, so that both locales have their utf8 AND non-utf8 encoding activated in the test

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

- name: Generate extra locales
# These extra locales will be available for locale.setlocale() calls in tests
run: \|
sudo locale-gen ${{ env.EXTRA_LOC }}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean the Locale: it_IT.utf8 isnt working either?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It depends what you mean by "working". Do you refer to the bug #46319 or to something more general ?

The issue BUG: UnicodeError when using Period.strftime with non-utf8 locale #46319 happens only when the current locale uses a non-unicode encoding (no utf-8). So it does not happen in any of the workers before this PR
So in the test fixture parameters I added the zh_CN locale, that has a non-utf-8 encoding.
Unfortunately this particular parameter value was never used in any of the workers (the test was skipped), since locale.setlocale was raising an error (locale not available)
I finally found a way to make the locale available on linux workers: sudo locale-gen <locale>

Does that answer your question ?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it was assumed setting the environment variables

      LANG: ${{ matrix.lang || '' }}
      LC_ALL: ${{ matrix.lc_all || '' }}

would ensure the locale for the test run would not be the default (en_US.UTF-8) for the other locale builds.

So IIUC, this is necessary if setting a non-utf8 locale?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, this is what I thought too.
Unfortunately when I tried setting LANG and LC_ALL to zh_CN (non-utf8), then there was a strange segmentation fault appearing (see #46405 (comment))

So this is why I now do not set it through these env variables but I only install it.

Maybe the problem is that when the env variables are set, many side effects happen in the build chain, that I do not quite understand.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're setting the locale this way, can we remove setting the LANG and LC_ALL environment variables here?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly I have no idea :) At least for the tests in this PR, yes these env variables are not used.

However these env vars are probably related to other locale-related tests, including being able to compile pandas on alternate locale, etc. It seems that in the CI those env variables are used to even add a specific line at the beginning of the init file of pandas ! (see ci/setup_env.sh)

This seems a bit beyond the scope of this PR. However it is maybe worth keeping track of this, in a dedicated ticket.

Sylvain MARIE added 6 commits

September 3, 2022 14:39

…ure/46319_locale_strftime_period

� Conflicts: � .github/workflows/ubuntu.yml

@mroeschke @jreback @jbrockmendel all set now. Sorry for the delay.
Note that the only failing CI build seems completely unrelated (the issue is about a TLS certificate bundle)

Maybe still on time for 1.5 milestone ?

Maybe still on time for 1.5 milestone ?

Sorry, since we released the rc already, we are only backporting fixes that are uncovered in the 1.5.0rc. Could you move the whatsnew note to 1.6.0.rst?

Sylvain MARIE added 2 commits

September 6, 2022 23:03

Thanks @smarie for your patience and persistence! Very nice fix

smarie deleted the feature/46319_locale_strftime_period branch

September 8, 2022 20:23

noatamir pushed a commit to noatamir/pandas that referenced this pull request

…specific directive is used (pandas-dev#46405)

Added test representative of pandas-dev#46319. Should fail on CI
Added a gha worker with non utf 8 zh_CN encoding
Attempt to fix the encoding so that locale works
Added the fix, but not using it for now, until CI is able to reproduce the issue.
Crazy idea: maybe simply removing the .utf8 modifier will use the right encoding !
Hopefully fixing the locale not available error
Now simply generating the locale, not updating the ubuntu one
Trying to install the locale without enabling it
Stupid mistake
Testing the optional locale generator condition
Put back all runners
Added whatsnew
Now using the fix
As per code review: moved locale-switching fixture overridden_locale to conftest
Flake8
Added comments on the runner
Added a non-utf8 locale in the it_IT runner. Added the zh_CN.utf8 locale in the tests
Improved readability of fixture overridden_locale as per code review
Added two comments on default encoding
Fixed pandas-dev#46319 by adding a new char_to_string_locale function in the tslibs.util module, able to decode char* using the current locale.
As per code review: modified the test to contain non-utf8 chars. Fixed the resulting issue.
Split the test in two for clarity
Fixed test and flake8 error.
Updated whatsnew to ref pandas-dev#46468 . Updated test name
Removing wrong whatsnew bullet
Nitpick on whatsnew as per code review
Fixed build error rst directive
Names incorrectly reverted in last merge commit
Fixed test_localization so that pandas-dev#46595 can be demonstrated on windows targets (even if today these do not run on windows targets, see pandas-dev#46597)
Fixed tm.set_locale context manager, it could error and leak when category LC_ALL was used. Fixed pandas-dev#46595
Removed the fixture as per code review, and added corresponding parametrization in tests.
Dummy mod to trigger CI again
reverted dummy mod
Attempt to fix the remaining error on the numpy worker
Fixed issue in _from_ordinal
Added asserts to try to understand
Reverted debugging asserts and applied fix for numpy repeat from pandas-dev#47670.
Fixed the last issue on numpy dev: a TypeError message had changed
Code review: Removed EXTRA_LOC
Code review: removed commented line
Code review: reverted out of scope change
Code review: reverted out of scope change
Fixed unused import
Fixed revert mistake
Moved whatsnew to 1.6.0
Update pandas/tests/io/parser/test_quoting.py

Co-authored-by: Sylvain MARIE sylvain.marie@se.com