New DataFrame feature: listify() and unlistify() · Issue #10511 · pandas-dev/pandas (original) (raw)
While building up a DataFrame in several steps, I found it difficult to add a new "perpendicular" column, i.e. a column that adds another dimension to already existing columns. To solve this problem I got the idea that this may be done in two steps:
- Add a new column whose values at each cell is a python list of the values the new column takes on.
- Unlistify the column thereby creating a new row for each element in the above lists.
I.e. I propose two new DataFrame methods, listify() and unlistify().
listify(df, column)
: Takes as input a dataframe and the name of a column. It will do a groupby of the df for all columns except column and generate a single row where the values in the column cell is a list of the column column
values.
unlistify(df, column)
: Takes as input a dataframe and the name of a column. It will iterate over the values of the contents of column
for each row and generate a new row for each value.
The functions may be expanded to support multiple columns. listify()
may e.g. support a post processing function, that will be applied on the list.
The following python code illustrates these two functions. But obviously the functionality may be implemented more efficienctly on the C-level.
#!/usr/bin/python
import pandas as pd
def listify(df, column): matches = [i for i,n in enumerate(df.columns) if n==column]
if len(matches)==0: raise Exception('Failed to find column named ' + column +'!') if len(matches)>1: raise Exception('More than one column named ' + column +'!')
old_index = df.index col_idx = matches[0] + len(old_index.shape) # Since we will reset the index
column_names = list(df.index.names) + list(df.columns) gb_cols = [c for c in column_names if c!= column]
Helper function to generate the squashed dataframe
def fnc(d): row = list(d.values[0]) return pd.DataFrame([row[:col_idx] + [[v[col_idx] for v in list(d.values)]] + row[col_idx+1:]])
return (df .reset_index() .groupby(gb_cols) .apply(fnc) .rename(columns = lambda i : column_names[i]) .set_index(old_index.names) )
def unlistify(df, column): matches = [i for i,n in enumerate(df.columns) if n==column]
if len(matches)==0: raise Exception('Failed to find column named ' + column +'!') if len(matches)>1: raise Exception('More than one column named ' + column +'!')
col_idx = matches[0]
Helper function to expand and repeat the column col_idx
def fnc(d): row = list(d.values[0]) bef = row[:col_idx] aft = row[col_idx+1:] col = row[col_idx] z = [bef + [c] + aft for c in col] return pd.DataFrame(z)
col_idx += len(df.index.shape) # Since we will push reset the index index_names = list(df.index.names) column_names = list(index_names) + list(df.columns) return (df .reset_index() .groupby(level=0,as_index=0) .apply(fnc) .rename(columns = lambda i :column_names[i]) .set_index(index_names) )
Examples of how to listify and unlistify a column.
df = pd.DataFrame([[11,range(5),10], [22,range(3),20]], columns = ['A','B','C']).set_index('C') print 'org' print df print '--' df = unlistify(df,'B') print 'unlistify(df,B)' print df print '--' df = listify(df,'B') print 'listify(df,B)' print df
The corresponding output:
org
A B
C
10 11 [0, 1, 2, 3, 4]
20 22 [0, 1, 2]
--
unlistify(df,B)
A B
C
10 11 0
10 11 1
10 11 2
10 11 3
10 11 4
20 22 0
20 22 1
20 22 2
--
listify(df,B)
A B
C
10 11 [0, 1, 2, 3, 4]
20 22 [0, 1, 2]