ENH: Serialize view of ArrowStringArray · Issue #42600 · pandas-dev/pandas


@mrocklin


Currently, pandas serializes a view of an ArrowStringArray by serializing the whole underlying array rather than only the viewed subset. Here is an example:

In [1]: import pandas as pd

In [2]: s = pd.Series([c * 1000 for c in "abcdefghijklmnopqrstuvwxyz"])

In [3]: s
Out[3]:
0     aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa...
1     bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb...
2     cccccccccccccccccccccccccccccccccccccccccccccc...
3     dddddddddddddddddddddddddddddddddddddddddddddd...
4     eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee...
5     ffffffffffffffffffffffffffffffffffffffffffffff...
6     gggggggggggggggggggggggggggggggggggggggggggggg...
7     hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh...
8     iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii...
9     jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj...
10    kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk...
11    llllllllllllllllllllllllllllllllllllllllllllll...
12    mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm...
13    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn...
14    oooooooooooooooooooooooooooooooooooooooooooooo...
15    pppppppppppppppppppppppppppppppppppppppppppppp...
16    qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq...
17    rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr...
18    ssssssssssssssssssssssssssssssssssssssssssssss...
19    tttttttttttttttttttttttttttttttttttttttttttttt...
20    uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuu...
21    vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv...
22    wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww...
23    xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
24    yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy...
25    zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz...
dtype: object

In [4]: import pickle

In [5]: len(pickle.dumps(s))
Out[5]: 26758

In [6]: len(pickle.dumps(s.astype("string[pyarrow]")))
Out[6]: 26891

In [7]: len(pickle.dumps(s.head(5)))
Out[7]: 5632

In [8]: len(pickle.dumps(s.astype("string[pyarrow]").head(5)))
Out[8]: 26891

This negatively affects dask dataframe operations that cut pandas dataframes up into small pieces, move them around to different computers, and then piece them back together again. Each small piece pays the serialization cost of the whole original array.
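To make the requested behavior concrete, here is a minimal standard-library sketch. It is not pandas or pyarrow code: it assumes a view can be modelled as a `(buffer, start, stop)` tuple over a shared bytes buffer, and the helpers `make_view`, `compact`, and `view_bytes` are hypothetical names invented for this illustration. Pickling the view as-is drags the whole buffer along, while copying only the viewed bytes first keeps the payload proportional to the slice.

```python
import pickle


def make_view(buffer, start, stop):
    """Hypothetical toy view: the full shared buffer plus slice bounds."""
    return (buffer, start, stop)


def view_bytes(view):
    """Return the bytes that the view actually refers to."""
    buffer, start, stop = view
    return buffer[start:stop]


def compact(view):
    """Copy only the viewed bytes and rebase the bounds to start at zero."""
    data = view_bytes(view)
    return (data, 0, len(data))


# Same data as the example in the issue: 26 strings of 1000 characters each.
buffer = "".join(c * 1000 for c in "abcdefghijklmnopqrstuvwxyz").encode()

# A view over the first five strings (5000 bytes out of 26000).
head_view = make_view(buffer, 0, 5000)

# Pickling the view directly serializes the entire 26000-byte buffer.
naive_size = len(pickle.dumps(head_view))

# Compacting first serializes only the 5000 bytes the view covers.
compact_size = len(pickle.dumps(compact(head_view)))

# The round-tripped compact view still yields the same bytes.
restored = pickle.loads(pickle.dumps(compact(head_view)))

print(naive_size, compact_size)
```

In this toy model the naive payload stays roughly the size of the full buffer no matter how small the view is, which mirrors `Out[8]` matching `Out[6]` above. The compacted payload scales with the slice, which is the behavior the object-dtype Series shows in `Out[7]`.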