PERF: creating string Series/Arrays from sequence with many strings by topper-123 · Pull Request #36304 · pandas-dev/pandas
Improves performance of pandas._libs.lib.ensure_string_array in cases with many string elements.
Examples:
    x = np.array([str(u) for u in range(1_000_000)], dtype=object)

    %timeit pd.Series(x, dtype=str)
    344 ms ± 59.7 ms per loop  # v1.1.0
    157 ms ± 7.04 ms per loop  # v1.1.1 and master
    22.6 ms ± 191 µs per loop  # this PR

    %timeit pd.Series(x, dtype="string")
    357 ms ± 40.2 ms per loop  # v1.1.0
    148 ms ± 713 µs per loop  # v1.1.1 and master
    26.3 ms ± 291 µs per loop  # this PR
#35519 is the cause of the improvement from 1.1.0 to 1.1.1.
Together with #35519, this PR considerably reduces the overhead of working with strings in pandas in cases where new Series/PandasArrays are instantiated many times with dtype str/StringDtype, which is probably quite common.
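For anyone who wants to check the numbers outside IPython, here is a minimal sketch of the same measurement using the standard-library `timeit` module. The helper name `time_string_series` and the best-of-three choice are assumptions made for illustration, not part of this PR, and absolute timings will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def time_string_series(n=1_000_000, repeat=3):
    """Return best-of-`repeat` wall-clock seconds per construction path.

    Hypothetical helper for illustration; not part of pandas.
    """
    # Object array of Python str, built the same way as in the example above.
    x = np.array([str(u) for u in range(n)], dtype=object)
    results = {}
    for label, dtype in [("str", str), ("string", "string")]:
        # One Series construction per timing run; keep the fastest run.
        timer = timeit.Timer(lambda: pd.Series(x, dtype=dtype))
        results[label] = min(timer.repeat(repeat=repeat, number=1))
    return results


if __name__ == "__main__":
    print("pandas", pd.__version__)
    for label, seconds in time_string_series().items():
        print(f"dtype={label!r}: {seconds * 1e3:.1f} ms")
```

Running the script under different pandas versions should show the same relative ordering as the `%timeit` results, assuming the array fits comfortably in memory.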