BUG: Fixed Unicode decoding error in Period.strftime
when a locale-specific directive is used by smarie · Pull Request #46405 · pandas-dev/pandas (original) (raw)
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Conversation104 Commits60 Checks0 Files changed
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters
[ Show hidden characters]({{ revealButtonHref }})
smarie changed the title
Fixed Unicode decoding error in Fixed Unicode decoding error in Period.strftime
and PeriodIndex.strftime
when a locale-specific directive is usedPeriod.strftime
when a locale-specific directive is used
I have an issue: I can reproduce the bug locally but not on CI.
It seems that there is no CI runner as of today that uses non-utf8 encoded locale.
The issue can only be reproduced when the current locale encoding is not utf8:
(zh_CN defaults encoding is gb2312)
What is the recommended way to proceed ?
- fix the issue without having the CI reproduce it ? (not recommended...)
- add a workflow in
posix.yml
- other ?
I'll try the second option for now (when github issues are fixed, it seems down for now)
gha runner was not triggered apparently :( re-trying
Unnfortunately it does not work :(
/usr/bin/bash: warning: setlocale: LC_ALL: cannot change locale (zh_CN.GB2312)
EDIT: retrying with zh_CN.gb2312
ok with lower case zh_CN.gb2312
it is worse: the runner does not even start :(
Can anyone help me to define a runner configured with this locale+encoding ? (@jreback maybe ?) Otherwise I am not able to reproduce the issue on CI, which is a bit problematic to prevent regression once I push the fix.
Sylvain MARIE added 5 commits
…ure/46319_locale_strftime_period
� Conflicts: � pandas/tests/io/formats/test_format.py
Sylvain MARIE added 7 commits
params=[ |
---|
pytest.param(None, id=str(locale.getlocale())), |
"it_IT.utf8", |
"zh_CN", |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe
"zh_CN", |
---|
"zh_CN", # Note: encoding is automatically 'gb2312' |
What do you think ?
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be good to add zh_CH.utf8
and zh_CH.gb2312
explicitly
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I'll add zh_CH.utf8
.
Strangely enough, zh_CN.gb2312
did not seem to be working, but zh_CN
does involve gb2312 .
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will also add it_IT
by the way, so that both locales have their utf8 AND non-utf8 encoding activated in the test
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
- name: Generate extra locales |
---|
# These extra locales will be available for locale.setlocale() calls in tests |
run: | |
sudo locale-gen ${{ env.EXTRA_LOC }} |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean the Locale: it_IT.utf8
isnt working either?
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It depends what you mean by "working". Do you refer to the bug #46319 or to something more general ?
- The issue BUG: UnicodeError when using Period.strftime with non-utf8 locale #46319 happens only when the current locale uses a non-unicode encoding (no utf-8). So it does not happen in any of the workers before this PR
- So in the test fixture parameters I added the
zh_CN
locale, that has a non-utf-8 encoding. - Unfortunately this particular parameter value was never used in any of the workers (the test was skipped), since
locale.setlocale
was raising an error (locale not available) - I finally found a way to make the locale available on linux workers:
sudo locale-gen <locale>
Does that answer your question ?
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it was assumed setting the environment variables
LANG: ${{ matrix.lang || '' }}
LC_ALL: ${{ matrix.lc_all || '' }}
would ensure the locale for the test run would not be the default (en_US.UTF-8) for the other locale builds.
So IIUC, this is necessary if setting a non-utf8 locale?
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, this is what I thought too.
Unfortunately when I tried setting LANG
and LC_ALL
to zh_CN
(non-utf8), then there was a strange segmentation fault appearing (see #46405 (comment))
So this is why I now do not set it through these env variables but I only install it.
Maybe the problem is that when the env variables are set, many side effects happen in the build chain, that I do not quite understand.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're setting the locale this way, can we remove setting the LANG
and LC_ALL
environment variables here?
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly I have no idea :) At least for the tests in this PR, yes these env variables are not used.
However these env vars are probably related to other locale-related tests, including being able to compile pandas on alternate locale, etc. It seems that in the CI those env variables are used to even add a specific line at the beginning of the init file of pandas ! (see ci/setup_env.sh
)
This seems a bit beyond the scope of this PR. However it is maybe worth keeping track of this, in a dedicated ticket.
Sylvain MARIE added 6 commits
…ure/46319_locale_strftime_period
� Conflicts: � .github/workflows/ubuntu.yml
@mroeschke @jreback @jbrockmendel all set now. Sorry for the delay.
Note that the only failing CI build seems completely unrelated (the issue is about a TLS certificate bundle)
Maybe still on time for 1.5 milestone ?
Maybe still on time for 1.5 milestone ?
Sorry, since we released the rc already, we are only backporting fixes that are uncovered in the 1.5.0rc. Could you move the whatsnew note to 1.6.0.rst
?
Sylvain MARIE added 2 commits
Thanks @smarie for your patience and persistence! Very nice fix
smarie deleted the feature/46319_locale_strftime_period branch
noatamir pushed a commit to noatamir/pandas that referenced this pull request
…specific directive is used (pandas-dev#46405)
Added test representative of pandas-dev#46319. Should fail on CI
Added a gha worker with non utf 8 zh_CN encoding
Attempt to fix the encoding so that locale works
Added the fix, but not using it for now, until CI is able to reproduce the issue.
Crazy idea: maybe simply removing the .utf8 modifier will use the right encoding !
Hopefully fixing the locale not available error
Now simply generating the locale, not updating the ubuntu one
Trying to install the locale without enabling it
Stupid mistake
Testing the optional locale generator condition
Put back all runners
Added whatsnew
Now using the fix
As per code review: moved locale-switching fixture
overridden_locale
to conftestFlake8
Added comments on the runner
Added a non-utf8 locale in the
it_IT
runner. Added the zh_CN.utf8 locale in the testsImproved readability of fixture
overridden_locale
as per code reviewAdded two comments on default encoding
Fixed pandas-dev#46319 by adding a new
char_to_string_locale
function in thetslibs.util
module, able to decode char* using the current locale.As per code review: modified the test to contain non-utf8 chars. Fixed the resulting issue.
Split the test in two for clarity
Fixed test and flake8 error.
Updated whatsnew to ref pandas-dev#46468 . Updated test name
Removing wrong whatsnew bullet
Nitpick on whatsnew as per code review
Fixed build error rst directive
Names incorrectly reverted in last merge commit
Fixed test_localization so that pandas-dev#46595 can be demonstrated on windows targets (even if today these do not run on windows targets, see pandas-dev#46597)
Fixed
tm.set_locale
context manager, it could error and leak when category LC_ALL was used. Fixed pandas-dev#46595Removed the fixture as per code review, and added corresponding parametrization in tests.
Dummy mod to trigger CI again
reverted dummy mod
Attempt to fix the remaining error on the numpy worker
Fixed issue in
_from_ordinal
Added asserts to try to understand
Reverted debugging asserts and applied fix for numpy repeat from pandas-dev#47670.
Fixed the last issue on numpy dev: a TypeError message had changed
Code review: Removed
EXTRA_LOC
Code review: removed commented line
Code review: reverted out of scope change
Code review: reverted out of scope change
Fixed unused import
Fixed revert mistake
Moved whatsnew to 1.6.0
Update pandas/tests/io/parser/test_quoting.py
Co-authored-by: Sylvain MARIE sylvain.marie@se.com