ENH: automatic rpy2 instance conversion by sinhrks · Pull Request #7385 · pandas-dev/pandas


sinhrks

Derived from #7309. Create a wrapper for robjects.r in pandas.rpy.common to perform automatic pandas DataFrame and Series conversion. Series will be converted to R data.frame to preserve rownames (index).

If this looks OK, I'll modify the doc (#7309) based on the following API.

import numpy as np
import pandas as pd
import pandas.rpy.common as com

iris = com.load_data('iris')
com.r.assign('iris', iris)
returned = com.r['iris']
type(returned)
# <class 'pandas.core.frame.DataFrame'>

df = pd.DataFrame(np.random.randn(20, 5),
                  index=pd.date_range(start='2011/01/01', freq='D', periods=20))
com.r.assign('df', df)
returned = com.r['df']
type(returned)
# <class 'pandas.core.frame.DataFrame'>

s = pd.Series(np.random.randn(20), name='test')
com.r.assign('s', s)
returned = com.r['s']
type(returned)
# <class 'pandas.core.frame.DataFrame'>

cpcloud

def __getattribute__(self, attr):
    if attr == 'assign':
        return _assign
    return robj.r.__getattribute__(attr)

It might be better to use the interface provided, i.e., instead of robj.r.__getattribute__(attr), just do getattr(robj.r, attr). Same for the methods below: just call their respective top-level functions or behavior as you would if you were a user. Sometimes Python itself performs ops on the result of a special method call, e.g., for equality comparisons Python will automatically fall back to comparing the identities of the two objects if both of their comparison methods return NotImplemented. This is done internally in Python, but if you directly call a method like __eq__ you don't get this convenience.
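To make the point above concrete, here is a small sketch in plain Python. The classes `Declines` and `Dynamic` are hypothetical and are not part of rpy2 or pandas; they only illustrate how a direct special-method call can differ from the built-in interface.

```python
class Declines:
    """Hypothetical class whose __eq__ declines every comparison."""

    def __eq__(self, other):
        return NotImplemented


class Dynamic:
    """Hypothetical class that resolves missing attributes lazily."""

    def __getattr__(self, name):
        # Only invoked by the normal lookup machinery after a miss.
        return "dynamic:" + name


a, b = Declines(), Declines()

# Calling the special method directly exposes the raw sentinel.
direct = a.__eq__(b)

# The operator tries the reflected method, then falls back to identity.
different_objects = (a == b)
same_object = (a == a)

t = Dynamic()

# getattr() goes through the full lookup, including __getattr__.
via_builtin = getattr(t, "foo")

# Calling __getattribute__ directly skips the __getattr__ fallback.
try:
    t.__getattribute__("foo")
    bypassed = False
except AttributeError:
    bypassed = True
```

Here `direct` is `NotImplemented`, while the `==` operator yields an ordinary boolean, and the direct `__getattribute__` call raises `AttributeError` even though `getattr` succeeds.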

Thanks, modified.

@jorisvandenbossche

See also comment of @sinhrks here: #7309 (comment):

I've briefly checked pandas2ri in rpy2 2.4.0 and found that the current pandas conversion looks better. pandas2ri doesn't convert a returned rpy2 DataFrame automatically, and may raise ValueError for DatetimeIndex.

I think we have to decide where we want this conversion machinery to live, because there is now one in the IPython magic (which has moved to rpy2), one in rpy2, and one in pandas:

@lgautier @davclark

@davclark

I'm leaving town for a week, so I'll pick this up next weekend, but wanted to let folks know that rpy2 needs to have some machinery for R -> python conversion (obviously), and so it makes the most sense to me to have the code live there, and I'm pretty sure any reasonable patch would be happily accepted.

You can see that the rmagic code (in the process of being deprecated in IPython, now living in rpy2.ipython) hands all conversion over to ro.conversion.ri2ro. So, to do this in rpy2, the idea would be to make pandas2ri.activate() set up better conversion in the dynamically patched ri2ro function.

I actually opened up an issue about this, as my memory was that things were better than they currently are! I haven't had time to go digging though:

https://bitbucket.org/lgautier/rpy2/issue/206/numpy2ri-pandas2ri-no-longer-properly

For what it's worth, I think if we have pandas installed (and invoke pandas2ri.activate()), a pandas.Series is a much better choice for conversion of R lists and vectors than a numpy object, as you get a proper index.

@lgautier

(...)

I'm pretty sure any reasonable patch would be happily accepted.
(...)

So am I.

@sinhrks

Maybe interface and conversion logic should be discussed separately.

Conversion Functions

Currently the pandas conversion looks better to me. I agree the two should be merged in the future, and we should decide in which module the conversion functions are maintained. I think the conversion depends more on the type of Index and the pandas version, so it may be better to keep the conversion logic in pandas and call it from rpy2?

Conversion Interface

In my use case, I sometimes want to handle raw rpy2 values and at other times want automatic conversion. As pandas2ri overwrites all of the default rpy2 conversion functions, I have to activate and deactivate it every time, depending on the operation. Thus I prefer a pandas r that is wrapped to perform automatic conversion, and I think that is natural.

@jreback

@davclark

Saw this question was unanswered while checking into another issue. It should be noted that @lgautier fixed an issue with already existing code to convert pandas DataFrames automatically into rpy2 wrapped function calls.

The logic for the direction rpy2 has moved is that conversion to (other) python objects has been deprecated in favor of rpy2 proxy objects (wrapped R objects) supporting the array interface so numpy calls work directly on rpy2.robjects objects. And if you want a true numpy.array, you can just use numpy.asarray.

It's less obvious how to do that in pandas as there's nothing equivalent to the standard array / buffer API for tables of data.

The other piece is that we've been talking about moving to a generics approach to handling conversion on the rpy2 end in the future.

So, that's the state of things on the rpy2 side. Probably in any case it's good to have the code that inspects the guts of R objects live in rpy2. If folks want to coordinate, that'd be great. In particular, no one has asked for anything on the rpy2 side, right?

@jorisvandenbossche

Conversion functions

@davclark Do you mean that the future of the pandas2ri module in rpy2 is uncertain? (as this does not fit in the generic approach?)
The question on the conversion functions is where this should live, in pandas or in rpy2? So in fact, that is asking something on the rpy2 side, as the current conversion functions in rpy2 are lacking in some ways and should be improved if we decide that it should live in rpy2 (or at least accepting PRs).

@sinhrks I think you could also say the conversion depends more on the internals of the rpy2 objects and so rpy2 version, and should only use public pandas API. But if more contributors of pandas are interested in keeping this up to date, it is maybe easier to do it here.

@davclark What do you think of the conversion interface issue raised by @sinhrks above?

@davclark

The functionality of rpy2.pandas2ri.activate() should remain about the same. The infrastructure that supports it should become more robust and extensible via generics. This not-yet-implemented generic system would be a good place for pandas code to modify conversion to and from R.

My feeling is that advanced users like @sinhrks would be better served by using the conversion functions directly (pandas2ri.pandas2ri() and pandas2ri.ri2pandas()), rather than activating and deactivating (i.e., swapping functions assigned to a given symbol). Note that there is no longer a general ri2py, as one can use ri2ro to get an object that supports the array interface. From there, it is easy to do numpy.asarray(). However, it seems maybe ri2py should come back if there is strong demand.

@sinhrks - is there a reason that simply using the functions directly doesn't work for you?

Can someone provide a conceptual diff between those pandas2ri functions and pandas.rpy? I know there are things I'd like to see in rpy2, for example multi-index support (it's not clear what the right way to do this is!). I'm not sure why a pandas.Series should be a data.frame in R, as R vectors and lists have names().

@jorisvandenbossche, sorry if I came across as snarky. Does someone want to provide a PR against rpy2? We had a strange default branch for a while, but it's been rebased onto the 2.4.x branch, and is now targeting a 2.5.x release. So default is a good place to start (equivalent of master on git). Or, an answer to the above "conceptual diff" (or issues on the rpy2 issue tracker) would be enough to get us going in the right direction.

@jorisvandenbossche

@davclark Ah, I didn't interpret you as snarky! Sorry if I implied that I did :-) Your input is certainly valued!

@sinhrks

@davclark Ah, what I meant is that I want to perform automatic conversion in separate ways, sometimes to numpy and otherwise to pandas, etc. And I'm not willing to call each raw function like pandas2ri, or activate/deactivate every time. My idea is to prepare separate input paths (such as robjects.r and pandas.rpy2.common.r) which perform automatic conversion in separate ways. But whatever is possible.

And I agree that Series should be converted to a vector; I'll fix this.

@davclark

Thanks @sinhrks. That clarifies your concerns. It strikes me that this might be best expressed via a context manager... Can you provide the two use-cases or user models that would differentiate between the rpy2 model and the pandas model? It would be good to be clear on that as we coordinate.

@lgautier

@sinhrks

@davclark Ah, what I meant is that I want to perform automatic conversion in separate ways, sometimes to numpy and otherwise to pandas, etc. And I'm not willing to call each raw function like pandas2ri, or activate/deactivate every time. My idea is to prepare separate input paths (such as robjects.r and pandas.rpy2.common.r) which perform automatic conversion in separate ways. But whatever is possible.

And I agree that Series should be converted to a vector; I'll fix this.

"Automatic" conversion that changes its conversion logic is possible with the existing conversion infrastructure in rpy2. You just have to write your own conversion logic and register it.

Should you want your own conversion rules that disregard the existing conversion, this is also possible. As a module owner you can decide how it should be done: this is between you and your users. In the present case, it may be worth looking at how the existing conversion in rpy2 could address your needs, and suggesting changes where it does not.

The case of explicitly parallel and active conversion rules is not very well addressed by the current design in rpy2, as it relies on the fact that imported modules are singletons and the active conversion is always at rpy2.robjects.conversion.<function>.

@lgautier

@davclark

Thanks @sinhrks. That clarifies your concerns. It strikes me that this might be best expressed via a context manager... Can you provide the two use-cases or user models that would differentiate between the rpy2 model and the pandas model? It would be good to be clear on that as we coordinate.

Using a context manager would be an elegant idea. The only potential issue would arise if several threads are used, as the conversion system would be modified "globally", even if encapsulated in a context.
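A minimal sketch of the context-manager idea discussed above, in plain Python. The `conversion` namespace and the `use_converter` helper are hypothetical stand-ins, not rpy2's actual API; they only assume, as described in this thread, that a single active converter lives at one module-level name.

```python
import contextlib
import types

# Hypothetical stand-in for a module holding the single active converter.
conversion = types.SimpleNamespace(ri2ro=lambda obj: ("default", obj))


@contextlib.contextmanager
def use_converter(func):
    """Temporarily replace conversion.ri2ro, restoring it on exit."""
    previous = conversion.ri2ro
    conversion.ri2ro = func
    try:
        yield
    finally:
        # Restore even if the body raised.
        conversion.ri2ro = previous


def to_pandas_like(obj):
    # Placeholder for a pandas-aware conversion.
    return ("pandas", obj)


before = conversion.ri2ro(1)
with use_converter(to_pandas_like):
    inside = conversion.ri2ro(1)
after = conversion.ri2ro(1)
```

Because the swap mutates a shared name, any other thread calling `conversion.ri2ro` while the `with` block is active would also see the temporary converter, which is the threading concern raised above.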

@davclark

Just to touch base, I'm spending some time with @mrocklin thinking about how to do general conversion. He and some folks at Continuum are working on a project you've likely heard of called blaze, which in particular contains a simple conversion system called into that exercises @mrocklin's multiple dispatch mechanism. There's a related package called dynd, which we're looking at as a way to sensibly handle things like missing data for conversion to R. We're also discussing difficulties that arise with multi-indices.

But he seems willing to break out into as a separate project, and this could perhaps be a way to coordinate conversion between data-frame (and other) packages like pytables, pandas, R, SQL, etc.

In any case, I'd still love to hear a bit more about what kind of API people would like to see.
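As a rough illustration of what a dispatch-based conversion API could look like, here is a sketch in plain Python. It is not the API of into, blaze, or rpy2; the `register` and `convert` names are hypothetical, and the registry simply dispatches on the pair (target type, source type).

```python
# Hypothetical registry mapping (target type, source type) to a converter.
_converters = {}


def register(target, source):
    """Decorator that registers a converter for one type pair."""
    def decorator(func):
        _converters[(target, source)] = func
        return func
    return decorator


def convert(target, obj):
    """Convert obj to target using the most specific registered rule."""
    # Walk the source type's MRO so subclasses reuse parent rules.
    for klass in type(obj).__mro__:
        func = _converters.get((target, klass))
        if func is not None:
            return func(obj)
    raise TypeError(
        "no conversion from %s to %s" % (type(obj).__name__, target.__name__)
    )


@register(list, tuple)
def _tuple_to_list(obj):
    return list(obj)


@register(dict, list)
def _pairs_to_dict(pairs):
    return dict(pairs)


as_list = convert(list, (1, 2))
as_dict = convert(dict, [("a", 1)])
```

In a design like this, each package could register its own rules without replacing a global converter, which would sidestep the activate/deactivate pattern discussed earlier in the thread.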

@jreback

@sinhrks can you rebase / update

what is the status of this?

@jreback

See #9602: we are deprecating this in 0.16.0 and redirecting to rpy2 for future conversions.

Labels

IO Data

IO issues that don't fit into a more specific label