New DataFrame feature: listify() and unlistify() · Issue #10511 · pandas-dev/pandas (original) (raw)

While building up a DataFrame in several steps, I found it difficult to add a new "perpendicular" column, i.e. a column that adds another dimension to already existing columns. To solve this problem I got the idea that this may be done in two steps:

  1. Add a new column whose values at each cell is a python list of the values the new column takes on.
  2. Unlistify the column thereby creating a new row for each element in the above lists.

I.e. I propose two new DataFrame methods, listify() and unlistify().

listify(df, column): Takes as input a dataframe and the name of a column. It will do a groupby of the df for all columns except column and generate a single row where the values in the column cell is a list of the column column values.

unlistify(df, column): Takes as input a dataframe and the name of a column. It will iterate over the values of the contents of column for each row and generate a new row for each value.

The functions may be expanded to support multiple columns. listify() may e.g. support a post processing function, that will be applied on the list.

The following python code illustrates these two functions. But obviously the functionality may be implemented more efficienctly on the C-level.

#!/usr/bin/python

import pandas as pd

def listify(df, column): matches = [i for i,n in enumerate(df.columns) if n==column]

if len(matches)==0: raise Exception('Failed to find column named ' + column +'!') if len(matches)>1: raise Exception('More than one column named ' + column +'!')

old_index = df.index col_idx = matches[0] + len(old_index.shape) # Since we will reset the index

column_names = list(df.index.names) + list(df.columns) gb_cols = [c for c in column_names if c!= column]

Helper function to generate the squashed dataframe

def fnc(d): row = list(d.values[0]) return pd.DataFrame([row[:col_idx] + [[v[col_idx] for v in list(d.values)]] + row[col_idx+1:]])

return (df .reset_index() .groupby(gb_cols) .apply(fnc) .rename(columns = lambda i : column_names[i]) .set_index(old_index.names) )

def unlistify(df, column): matches = [i for i,n in enumerate(df.columns) if n==column]

if len(matches)==0: raise Exception('Failed to find column named ' + column +'!') if len(matches)>1: raise Exception('More than one column named ' + column +'!')

col_idx = matches[0]

Helper function to expand and repeat the column col_idx

def fnc(d): row = list(d.values[0]) bef = row[:col_idx] aft = row[col_idx+1:] col = row[col_idx] z = [bef + [c] + aft for c in col] return pd.DataFrame(z)

col_idx += len(df.index.shape) # Since we will push reset the index index_names = list(df.index.names) column_names = list(index_names) + list(df.columns) return (df .reset_index() .groupby(level=0,as_index=0) .apply(fnc) .rename(columns = lambda i :column_names[i]) .set_index(index_names) )

Examples of how to listify and unlistify a column.

df = pd.DataFrame([[11,range(5),10], [22,range(3),20]], columns = ['A','B','C']).set_index('C') print 'org' print df print '--' df = unlistify(df,'B') print 'unlistify(df,B)' print df print '--' df = listify(df,'B') print 'listify(df,B)' print df

The corresponding output:

org
     A                B
C                      
10  11  [0, 1, 2, 3, 4]
20  22        [0, 1, 2]
--
unlistify(df,B)
     A  B
C        
10  11  0
10  11  1
10  11  2
10  11  3
10  11  4
20  22  0
20  22  1
20  22  2
--
listify(df,B)
     A                B
C                      
10  11  [0, 1, 2, 3, 4]
20  22        [0, 1, 2]