Series __finalize__ not correctly called in merge? · Issue #6923 · pandas-dev/pandas

I got some help from Jeff on Stack Overflow, but either I'm misunderstanding how __finalize__ works, or there's a bug in how it's called. My intent is to preserve Series metadata after two DataFrames are merged, and I believe __finalize__ should be able to handle this.

First I define a couple of DataFrames:

import numpy as np
import pandas as pd
np.random.seed(10)
df1 = pd.DataFrame(np.random.randint(0, 4, (3, 2)), columns=['a', 'b'])
df2 = pd.DataFrame(np.random.randint(0, 4, (3, 2)), columns=['c', 'd'])
df1
     a  b
  0  1  1
  1  0  3
  2  0  1
df2
     c  d
  0  3  0
  1  1  1
  2  0  1

Then I register a metadata field, filename, on Series and assign it to every column of both DataFrames:

pd.Series._metadata = ['name', 'filename']

for c1 in df1:
    df1[c1].filename = 'fname1.csv'
for c2 in df2:
    df2[c2].filename = 'fname2.csv'

Next I define __finalize__ for Series. As I understand it, this is the hook that propagates metadata from one Series to another, for example during a merge. But when my __finalize__ prints the metadata I've already assigned, the metadata appears to be gone by the time __finalize__ is called.

def finalize_ser(self, other, method=None, **kwargs):
    # Show what metadata is visible on each side when __finalize__ runs.
    print('Self meta: {}'.format(getattr(self, 'filename', None)))
    print('Other meta: {}'.format(getattr(other, 'filename', None)))

    # Copy every registered metadata field from `other` onto `self`.
    for name in self._metadata:
        object.__setattr__(self, name, getattr(other, name, ''))
    return self

pd.Series.__finalize__ = finalize_ser
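
For comparison, here is a minimal sketch of the single-object case, where propagation through __finalize__ can be observed directly. It assumes a hypothetical subclass, TaggedSeries, that is not part of the code above. The subclass extends _metadata and overrides _constructor, which is the pattern described in the pandas subclassing documentation. It only exercises copy() and makes no claim about what merge does.

```python
import pandas as pd


class TaggedSeries(pd.Series):
    """Hypothetical Series subclass carrying a `filename` attribute."""

    # Extend (rather than replace) the names that the default
    # __finalize__ copies from the source object to the result.
    _metadata = pd.Series._metadata + ["filename"]

    @property
    def _constructor(self):
        # Ask pandas to build results as TaggedSeries, not plain Series.
        return TaggedSeries


s = TaggedSeries([1, 0, 0], name="a")
s.filename = "fname1.csv"

# copy() builds a new object and finalizes it against `s`,
# so the registered attribute is carried over.
copied = s.copy()
print(type(copied).__name__, copied.filename)
```

Here copy() hands the source object to __finalize__ as other, so the attribute is available to be copied. Whether merge passes anything comparable for each column is exactly what the rest of this report is asking.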

When I call merge, the metadata I assigned is never printed:

df1.merge(df2, left_on=['a'], right_on=['c'], how='inner')
  Self meta: None
  Other meta: None
  Self meta: None
  Other meta: None
  Self meta: None
  Other meta: None
  Self meta: None
  Other meta: None
  Out[5]:
     a  b  c  d
  0  1  1  1  1
  1  0  3  0  1
  2  0  1  0  1

It appears the metadata is lost before the __finalize__ call is reached, although it's still present on the original Series:

df1.a.filename  # => 'fname1.csv'
mgd = df1.merge(df2, left_on=['a'], right_on=['c'], how='inner')  # same merge as above
mgd.a.filename  # => AttributeError

Is this expected or is there a bug?