ENH/DOC: wide_to_long performance and docstring clarification by erikcs · Pull Request #14779 · pandas-dev/pandas (original) (raw)

So here is an attempt to make the original interface more robust, these two examples fail on the master branch, but should be able to produce the correct result which is:

In [12]: df = pd.DataFrame({ ...: 'A11': ['a11', 'a22', 'a33'], ...: 'A12': ['a21', 'a22', 'a23'], ...: 'B11': ['b11', 'b12', 'b13'], ...: 'B12': ['b21', 'b22', 'b23'], ...: 'BB11': [1, 2, 3], ...: 'BB12': [4, 5, 6], ...: 'BBBX' : [91, 92, 93], ...: 'BBBZ' : [91, 92, 93] ...: }) ...: df['id'] = df.index ...: df ...: Out[12]: A11 A12 B11 B12 BB11 BB12 BBBX BBBZ id 0 a11 a21 b11 b21 1 4 91 91 0 1 a22 a22 b12 b22 2 5 92 92 1 2 a33 a23 b13 b23 3 6 93 93 2

In [13]: pd.wide_to_long(df, ['A', 'B', 'BB'], i='id', j='year') Out[13]: BBBX BBBZ A B BB id year 0 11 91 91 a11 b11 1 1 11 92 92 a22 b12 2 2 11 93 93 a33 b13 3 0 12 91 91 a21 b21 4 1 12 92 92 a22 b22 5 2 12 93 93 a23 b23 6

In [14]: df = pd.DataFrame({ ...: 'A(quarterly)2011': ['a11', 'a22', 'a33'], ...: 'A(quarterly)2012': ['a21', 'a22', 'a23'], ...: 'B(quarterly)2011': ['b11', 'b12', 'b13'], ...: 'B(quarterly)2012': ['b21', 'b22', 'b23'], ...: 'BB(quarterly)2011': [1, 2, 3], ...: 'BB(quarterly)2012': [4, 5, 6], ...: 'BBBX' : [91, 92, 93], ...: 'BBBZ' : [91, 92, 93] ...: }) ...: df['id'] = df.index ...: df ...: Out[14]: A(quarterly)2011 A(quarterly)2012 B(quarterly)2011 B(quarterly)2012
0 a11 a21 b11 b21 1 a22 a22 b12 b22 2 a33 a23 b13 b23

BB(quarterly)2011 BB(quarterly)2012 BBBX BBBZ id 0 1 4 91 91 0 1 2 5 92 92 1 2 3 6 93 93 2

In [15]: pd.wide_to_long(df, ['A(quarterly)', 'B(quarterly)', 'BB(quarterly)'], i='id', j='year') Out[15]: BBBX BBBZ A(quarterly) B(quarterly) BB(quarterly) id year 0 2011 91 91 a11 b11 1 1 2011 92 92 a22 b12 2 2 2011 93 93 a33 b13 3 0 2012 91 91 a21 b21 4 1 2012 92 92 a22 b22 5 2 2012 93 93 a23 b23 6

The first one fails because the regex confuses the same substrings in the id_vars and value_vars, the second ones because of the parenthesis.

Assuming a Prefix(Optional Sep)Postfix structure on the "time" variables, I tried to make it robust:

[16]: df = pd.DataFrame({ ...: 'A11': ['a11', 'a22', 'a33'], ...: 'A12': ['a21', 'a22', 'a23'], ...: 'B11': ['b11', 'b12', 'b13'], ...: 'B12': ['b21', 'b22', 'b23'], ...: 'BB11': [1, 2, 3], ...: 'BB12': [4, 5, 6], ...: 'Acat' : [91, 92, 93], ...: 'BBBZ' : [91, 92, 93] ...: }) ...: df['id'] = df.index ...: df ...: Out[16]: A11 A12 Acat B11 B12 BB11 BB12 BBBZ id 0 a11 a21 91 b11 b21 1 4 91 0 1 a22 a22 92 b12 b22 2 5 92 1 2 a33 a23 93 b13 b23 3 6 93 2

raises a ValueError: ('Ambiguous names: ', ['A11', 'A12', 'Acat']) .

While the following works

df = pd.DataFrame({ ...: 'A-11': ['a11', 'a22', 'a33'], ...: 'A-12': ['a21', 'a22', 'a23'], ...: 'B-11': ['b11', 'b12', 'b13'], ...: 'B-12': ['b21', 'b22', 'b23'], ...: 'BB-11': [1, 2, 3], ...: 'BB-12': [4, 5, 6], ...: 'Acat' : [91, 92, 93], ...: 'BBBZ' : [91, 92, 93] ...: }) ...: df['id'] = df.index ...: df ...: Out[18]: A-11 A-12 Acat B-11 B-12 BB-11 BB-12 BBBZ id 0 a11 a21 91 b11 b21 1 4 91 0 1 a22 a22 92 b12 b22 2 5 92 1 2 a33 a23 93 b13 b23 3 6 93 2

In [19]: pd.wide_to_long(df, ['A', 'B', 'BB'], i='id', j='year', sep='-') Out[19]: Acat BBBZ A B BB id year 0 11 91 91 a11 b11 1 1 11 92 92 a22 b12 2 2 11 93 93 a33 b13 3 0 12 91 91 a21 b21 4 1 12 92 92 a22 b22 5 2 12 93 93 a23 b23 6