ENH: Serialize view of ArrowStringArray · Issue #42600 · pandas-dev/pandas
Description
Currently, pandas serializes a view of an ArrowStringArray by serializing the entire underlying array rather than only the viewed subset. Here is an example:
In [1]: import pandas as pd
In [2]: s = pd.Series([c * 1000 for c in "abcdefghijklmnopqrstuvwxyz"])
In [3]: s
Out[3]:
0     aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa...
1     bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb...
2     cccccccccccccccccccccccccccccccccccccccccccccc...
3     dddddddddddddddddddddddddddddddddddddddddddddd...
4     eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee...
5     ffffffffffffffffffffffffffffffffffffffffffffff...
6     gggggggggggggggggggggggggggggggggggggggggggggg...
7     hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh...
8     iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii...
9     jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj...
10    kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk...
11    llllllllllllllllllllllllllllllllllllllllllllll...
12    mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm...
13    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn...
14    oooooooooooooooooooooooooooooooooooooooooooooo...
15    pppppppppppppppppppppppppppppppppppppppppppppp...
16    qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq...
17    rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr...
18    ssssssssssssssssssssssssssssssssssssssssssssss...
19    tttttttttttttttttttttttttttttttttttttttttttttt...
20    uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuu...
21    vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv...
22    wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww...
23    xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
24    yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy...
25    zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz...
dtype: object
In [4]: import pickle
In [5]: len(pickle.dumps(s))
Out[5]: 26758
In [6]: len(pickle.dumps(s.astype("string[pyarrow]")))
Out[6]: 26891
In [7]: len(pickle.dumps(s.head(5)))
Out[7]: 5632
In [8]: len(pickle.dumps(s.astype("string[pyarrow]").head(5)))
Out[8]: 26891
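With object dtype, the 5-row view pickles to 5632 bytes, but with "string[pyarrow]" it pickles to the same 26891 bytes as the full 26-row Series. This negatively affects dask dataframe operations that cut up pandas dataframes into small pieces, move them around to different computers, and then piece them back together again.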
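To make the requested behavior concrete, here is a toy model using only the standard library. It is a sketch under stated assumptions, not pandas or pyarrow internals: `BufferView` and `CompactBufferView` are hypothetical classes invented for illustration. The first pickles its whole shared buffer even when it is a small view; the second defines `__reduce__` so that only the viewed bytes are serialized.

```python
import pickle
from typing import Optional


class BufferView:
    """Hypothetical stand-in for an array view: a shared buffer plus a window."""

    def __init__(self, buffer: bytes, offset: int = 0, length: Optional[int] = None):
        self.buffer = buffer
        self.offset = offset
        self.length = len(buffer) - offset if length is None else length

    def head(self, n: int) -> "BufferView":
        # Zero-copy slice: the new view shares the same full buffer.
        return type(self)(self.buffer, self.offset, min(n, self.length))

    def to_bytes(self) -> bytes:
        # Materialize only the bytes that this view actually covers.
        return self.buffer[self.offset:self.offset + self.length]


class CompactBufferView(BufferView):
    """Same view, but pickling trims the buffer down to the viewed window."""

    def __reduce__(self):
        # Rebuild from a fresh buffer that holds only the viewed bytes.
        return (type(self), (self.to_bytes(), 0, self.length))


# 26 runs of 1000 bytes each, mirroring the strings in the example above.
data = b"".join(bytes([c]) * 1000 for c in b"abcdefghijklmnopqrstuvwxyz")

naive_head = BufferView(data).head(5000)
compact_head = CompactBufferView(data).head(5000)

naive_size = len(pickle.dumps(naive_head))      # carries all 26000 bytes
compact_size = len(pickle.dumps(compact_head))  # carries about 5000 bytes

print(naive_size, compact_size)
```

The trade-off in this sketch is that trimming copies the viewed bytes at pickle time, which is cheap for a small view and avoids shipping the rest of the buffer to another machine.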