BUG: pd.NA.__format__ fails with format_specs by topper-123 · Pull Request #34740 · pandas-dev/pandas
pd.NA fails if passed to a format string and format parameters are supplied. This is different behaviour than np.nan and makes converting arrays containing pd.NA to strings very brittle and annoying.
Examples:
    >>> format(pd.NA)
    '<NA>'  # master and PR, ok
    >>> format(pd.NA, ".1f")
    TypeError  # master
    '<NA>'  # this PR
    >>> format(pd.NA, ">5")
    TypeError  # master
    ' <NA>'  # this PR, tries to behave like a string, then falls back to '<NA>', like np.nan
The new behaviour mirrors the behaviour of np.nan.
@topper-123 Thanks for looking into this!
Personally, instead of relying on a try/except of NaN to check what is supported, I would rather try to understand how and what works for NaN, and try to implement the same logic here.
For example, I suppose that format(pd.NA, ">10.1f") will fail on this branch? While for NaN this works.
Now, properly implementing __format__ manually might be too complicated though, and the "fallback" of formatting the string might already be useful anyway.
Hmm, np.nan is just a float, so I think it uses the builtin float.__format__, which is probably a bit complicated to replicate ...
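For reference, a short sketch of what the builtin float formatting does with NaN for the specs mentioned in this thread. It uses float("nan") in place of np.nan so that it runs without NumPy; both are ordinary Python floats.

```python
# float("nan") stands in for np.nan here; both are plain Python floats,
# so both are formatted by the builtin float.__format__.
nan = float("nan")

print(repr(format(nan)))            # 'nan'
print(repr(format(nan, ".1f")))     # 'nan'
print(repr(format(nan, ">5")))      # '  nan'
print(repr(format(nan, ">10.1f")))  # '       nan'
```

The precision is ignored for NaN, while width and alignment are still applied.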
Another idea: how robust would it be if we format some other value (eg np.nan), and then replace "nan" with "<NA>" in the result? We would need a bit of logic to potentially replace " nan" instead of "nan" if possible, but for the rest it might work in many cases?
Another idea: how robust would it be if we format some other value (eg np.nan), and then replace "nan" with "<NA>" in the result?
Wouldn't work out of the box, e.g. "nantes_{}".format(np.nan). I don't think adding logic to get the correct "nan" is the right approach; it's too complicated IMO.
Another idea: pd.NA is supposed to work with all dtypes, not just floats, so it probably shouldn't be restricted to the format_specs accepted by float. How about a simple:
    def __format__(self, format_spec):
        try:
            return self.__repr__().__format__(format_spec)
        except ValueError:
            return self.__repr__()
This would allow string format_specs to work (as they do for floats already) and make self.__repr__() a fallback that always works.
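To see how this fallback behaves without touching pandas, here is a minimal sketch using a stand-in class. FakeNA is a hypothetical name invented for this illustration; it is not pd.NA, it only copies the "<NA>" repr and the proposed __format__.

```python
class FakeNA:
    """Hypothetical stand-in for pd.NA, used only to illustrate the fallback."""

    def __repr__(self) -> str:
        return "<NA>"

    def __format__(self, format_spec: str) -> str:
        try:
            # Let str handle the specs it understands (width, alignment, fill).
            return self.__repr__().__format__(format_spec)
        except ValueError:
            # Numeric specs such as ".1f" are invalid for str: use the plain repr.
            return self.__repr__()


na = FakeNA()
print(repr(format(na)))            # '<NA>'
print(repr(format(na, ">5")))      # ' <NA>'
print(repr(format(na, ".1f")))     # '<NA>'
print(repr(format(na, ">10.1f")))  # '<NA>' (the width is lost in the fallback)
```

The last call shows the limitation: a spec that mixes width with a float type code is rejected by str as a whole, so the padding is dropped along with the type code.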
Wouldn't work out of the box, e.g. "nantes_{}".format(np.nan),
I don't fully know how the inner Python details of this method work, but I suppose the above would end up calling pd.NA.__format__("")? As long as the nan -> NA replacement happens inside the __format__ function, I would think the above would work fine.
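This can be checked with a small sketch. SpecRecorder is a hypothetical helper written for this illustration; it records every format_spec that str.format passes to __format__.

```python
class SpecRecorder:
    """Hypothetical helper: remembers each format_spec it is called with."""

    def __init__(self) -> None:
        self.seen = []

    def __format__(self, format_spec: str) -> str:
        self.seen.append(format_spec)
        return "<NA>"


rec = SpecRecorder()
plain = "nantes_{}".format(rec)
with_spec = "nantes_{:>10.1f}".format(rec)

print(plain)     # nantes_<NA>
print(rec.seen)  # ['', '>10.1f']
```

Only the part after the colon inside the braces reaches __format__, so the literal "nan" in the surrounding template is never seen by the replacement logic.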
How about a simple:
I think that is certainly better (avoiding only accepting the rules valid for float), but that still wouldn't work for the example I gave of format(pd.NA, ">10.1f") (I think).
(now, it's certainly already fixing a set of use cases, so could also be a good start)
Very quick try with
    def __format__(self, format_spec) -> str:
        res = format(np.nan, format_spec)
        return res.replace("nan", "<NA>")
works for the example you gave, and also for the example I gave:
In [1]: "nantes_{}".format(pd.NA)
Out[1]: 'nantes_<NA>'
In [3]: format(pd.NA, ">10.1f")
Out[3]: '       <NA>'
Of course, the above still needs to 1) take the 1 char length difference into account in case there is whitespace (like the second example) and 2) still fall back to formatting with the string repr and finally the plain string repr (like your example impl at #34740 (comment)).
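A rough sketch of how these two points could be combined, written as a standalone function so it runs without pandas. format_na and its na_repr parameter are hypothetical names for this illustration, and this is not the implementation merged in the PR. It only compensates for space padding; other cases are deliberately left unhandled.

```python
def format_na(format_spec: str, na_repr: str = "<NA>") -> str:
    """Hypothetical sketch: format NaN first, then swap in the NA repr."""
    try:
        res = format(float("nan"), format_spec)
    except ValueError:
        res = ""  # not a valid float spec; use the string fallbacks below

    # Absorb one adjacent space so the 1 char length difference between
    # "nan" and "<NA>" does not widen a padded field.
    if " nan" in res:
        return res.replace(" nan", na_repr, 1)
    if "nan " in res:
        return res.replace("nan ", na_repr, 1)
    if "nan" in res:
        return res.replace("nan", na_repr, 1)

    # Fallbacks: format the repr as a string, then the plain repr.
    try:
        return format(na_repr, format_spec)
    except ValueError:
        return na_repr


print(repr(format_na(">10.1f")))  # '      <NA>'
print(repr(format_na(".1f")))     # '<NA>'
print(repr(format_na(">5s")))     # ' <NA>'
print(repr(format_na("d")))       # '<NA>'
```

With a non-space fill character the result is still one character wider than requested, which illustrates the kind of special-casing a complete version would need.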
Yeah, __format__ only works inside the brackets, so you're right there.
The length format spec would be one special case that would need to be handled, but are there others? I don't think so for floats, but there could be for other format_specs?
I've made the simpler implementation that I suggested. I'm a bit hesitant that adding the special cases will make this too complex.
Yeah, I am fine with the simplest solution that at least fixes the basic formatting, for now. I still think it wouldn't be hard to support proper floating point / numeric formatting (by formatting NaN and replacing afterwards).