String dtype: implement object-dtype based StringArray variant with NumPy semantics by jorisvandenbossche · Pull Request #58451 · pandas-dev/pandas
For the NA variants, do we really even need to expose/check against the storage or can we always just check that we have a StringDtype with a pd.NA missing value marker?
You mean that our asserters (assert_frame_equal et al) would consider a column with string dtype backed by pyarrow vs python as equal? (as long as the na_value is equal)
The main thing I want to decouple users from is relying heavily on things like StringDtype(storage="python"), because it makes it really hard to move them away from our internals.
Yes, I also want our general goal to be that almost no user should have to be explicit about the storage (that's also one of the reasons I do not want to include the storage in the string alias for the new to-be-default string dtype).
But, this is about a developer tool. Currently we regard pd.StringDtype("pyarrow") == pd.StringDtype("python") as False, and so assert_frame_equal will fail for that. And in that case, for the developer UX, the assert error message should be clear.
Right now, you can get something like:
AssertionError: Attributes of DataFrame.iloc[:, 0] (column name="col1") are different
Attribute "dtype" are different
[left]: string[pyarrow]
[right]: string[pyarrow]
which is not very helpful ...
The reason is that I did not bake the pd.NA vs np.nan information into the string alias / representation.
See also #59342 for an issue about this that was just opened (we should probably continue the main discussion about an informative repr there, but I want to include something like the above as a short-term solution until we decide how best to address the issues in #59342).
FWIW I also added this code change in #59352 (to fix up failing tests on main) but with a comment explaining the issue.
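For readers without the diff at hand, here is a rough sketch of the kind of short-term workaround discussed above: spell out both the storage and the missing value marker when describing a string dtype in an assertion message. The helper name `describe_string_dtype` is hypothetical and not part of pandas; the fallback to pd.NA for pandas versions without a `na_value` attribute is also an assumption, and the actual change in the PR may look different.

```python
import pandas as pd


def describe_string_dtype(dtype):
    """Describe a dtype, spelling out storage and NA marker for string dtypes.

    Hypothetical helper for illustration only; not the pandas implementation.
    """
    if isinstance(dtype, pd.StringDtype):
        # Assumption: pandas versions without ``na_value`` only have the
        # pd.NA variant, so fall back to pd.NA there.
        na_value = getattr(dtype, "na_value", pd.NA)
        return f"StringDtype(storage={dtype.storage}, na_value={na_value})"
    # For any other dtype the regular string form is used unchanged.
    return str(dtype)


# Python storage is used here so the example does not require pyarrow.
print(describe_string_dtype(pd.StringDtype("python")))
print(describe_string_dtype(pd.Series([1, 2]).dtype))
```

With a description like this in the [left] / [right] lines, two dtypes that share the same alias but differ in their missing value marker would no longer print identically.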