ENH/DOC: reimplement Series delegates/accessors using descriptors by shoyer · Pull Request #9322 · pandas-dev/pandas

OK, a few things we could do:

  1. Do checks to take str out of __dir__ for invalid types. This would eliminate the auto-complete issue, but I think s.str? would give the same message you showed above (object s.str not found).
  2. Return a standard StringMethods object, but add some sort of hook that checks that the type is valid before every method lookup. You could still auto-complete str methods, though, and this is more complex for .dt, because it can create several sub-types of accessors.
  3. Make s.str for invalid types some sort of "deferred error" object that raises TypeError when any attribute is accessed but with a copied docstring from StringMethods. I tossed together an implementation, which gives us functionality like the following:
In [15]: s = pd.Series([1])

In [16]: s.str.<tab>

In [17]: s.str
Out[17]: <pandas.core.series.InvalidStringMethods at 0x107a32fd0>

In [18]: s.str?
Type:        InvalidStringMethods
String form: <pandas.core.series.InvalidStringMethods object at 0x107a8e150>
File:        /Users/shoyer/dev/pandas/pandas/core/series.py
Docstring:
Vectorized string functions for Series. NAs stay NA unless handled
otherwise by a particular method. Patterned after Python's string methods,
with some inspiration from R's stringr package.

Examples
--------
>>> s.str.split('_')
>>> s.str.replace('_', '')

In [19]: s.str.cat
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-19-e75e3f77c883> in <module>()
----> 1 s.str.cat

/Users/shoyer/dev/pandas/pandas/core/series.py in __getattr__(self, name)
   2552
   2553     def __getattr__(self, name):
-> 2554         raise self._error
   2555
   2556

TypeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas

Unfortunately, it is not possible (AFAICT) to make an object on which repr raises a TypeError but for which __doc__ is well defined.
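For reference, the "deferred error" idea from option 3 can be sketched in a few lines. This is only an illustration of the pattern, not the code from the PR: `MiniSeries` and `StrAccessor` are hypothetical stand-ins, and `StringMethods` here is a toy class rather than the real pandas one.

```python
class StringMethods:
    """Vectorized string functions for Series (toy stand-in docstring)."""

    def __init__(self, values):
        self._values = values

    def upper(self):
        return [v.upper() for v in self._values]


class InvalidStringMethods:
    # Copy the docstring so that help() / `?` still shows the accessor docs.
    __doc__ = StringMethods.__doc__

    def __init__(self, error):
        self._error = error

    def __getattr__(self, name):
        # Only reached when normal lookup fails, so _error and __doc__
        # still resolve; any other attribute raises the stored error.
        raise self._error


class StrAccessor:
    """Hypothetical descriptor that picks the accessor type on access."""

    def __get__(self, obj, objtype=None):
        if obj is None:
            return StringMethods
        if all(isinstance(v, str) for v in obj.values):
            return StringMethods(obj.values)
        return InvalidStringMethods(
            TypeError("Can only use .str accessor with string values")
        )


class MiniSeries:
    """Hypothetical stand-in for pd.Series, just enough for the demo."""

    str = StrAccessor()

    def __init__(self, values):
        self.values = list(values)
```

With this sketch, `MiniSeries([1]).str` returns an object whose repr and `__doc__` work, while `MiniSeries([1]).str.cat` raises the TypeError.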

I'm -0 on these options. They add complexity and I don't think they're that much more usable -- if s.str? says not found, the first thing I'm going to try to do is see what s.str is, which will raise the TypeError. I also don't think there are that many users who search through the Series namespace for methods -- there are simply too many methods/properties for that to be very usable.
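For comparison, option 1 (dropping str from `__dir__` for invalid types) might look roughly like this. Again, this is a sketch under assumptions: `MiniSeries`, `_holds_strings`, and the toy `StringMethods` are hypothetical names for illustration, not pandas internals.

```python
class StringMethods:
    """Toy stand-in for the real accessor class."""

    def __init__(self, values):
        self._values = values


def _holds_strings(values):
    # Hypothetical validity check; pandas would inspect the dtype instead.
    return all(isinstance(v, str) for v in values)


class MiniSeries:
    """Hypothetical stand-in for pd.Series."""

    def __init__(self, values):
        self.values = list(values)

    @property
    def str(self):
        # Accessing .str on invalid data raises immediately.
        if not _holds_strings(self.values):
            raise TypeError("Can only use .str accessor with string values")
        return StringMethods(self.values)

    def __dir__(self):
        # Hide 'str' from tab completion when the accessor would raise.
        names = set(super().__dir__())
        if not _holds_strings(self.values):
            names.discard("str")
        return sorted(names)
```

Here `dir(MiniSeries([1]))` no longer lists str, so it will not show up in auto-complete, but accessing `MiniSeries([1]).str` directly still raises the TypeError.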