to_stata always stores strings as str244 · Issue #8969 · pandas-dev/pandas

Copied from #7858:

I am still getting this bug (pandas 0.15.1, numpy 1.9.1, Stata 13.1 on Windows 7). The written DTA file still stores the strings as str244 even though the strings themselves have length 1. Because the values survive the round trip unchanged, the DataFrame read back into pandas looks identical to the original, so the problem is invisible from the pandas side. I think the only way to detect it from within pandas is to look at the size of the DTA file itself.

import pandas as pd

df = pd.DataFrame(['a', 'b', 'c'], columns=['alpha'])
df.to_stata('test.dta')
df2 = pd.read_stata('test.dta')
# The round trip preserves the values, so this assertion passes
assert (df['alpha'] == df2['alpha']).all()
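A rough way to do the file-size check from Python might look like the sketch below. It assumes that comparing the size on disk with the number of bytes the strings actually need is a good enough signal. The variable names are made up for illustration, and the file also contains a header, so the two numbers will never match exactly; only a large gap is suspicious.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"alpha": ["a", "b", "c"]})

# Write the frame to a temporary DTA file and record its size on disk.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.dta")
    df.to_stata(path, write_index=False)
    file_size = os.path.getsize(path)

# Bytes the string column needs if it is stored at its longest length.
longest = int(df["alpha"].str.len().max())
needed_bytes = len(df) * longest

print("file size on disk:", file_size)
print("bytes needed for the string data:", needed_bytes)
```

On a small frame the header dominates, so this is more useful on real data where hundreds of wasted bytes per row add up.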

But when loading in Stata:

. use test
. describe

Contains data from D:\data\Pollution\test.dta
  obs:             3
 vars:             2                          02 Dec 2014 10:30
 size:           744

              storage   display    value
variable name   type    format     label      variable label
index           long    %12.0g
alpha           str244  %1s

Sorted by:

. compress
  index was long now byte
  alpha was str244 now str1
  (738 bytes saved)

. describe

Contains data from D:\data\Pollution\test.dta
  obs:             3
 vars:             2                          02 Dec 2014 10:30
 size:             6

              storage   display    value
variable name   type    format     label      variable label
index           byte    %12.0g
alpha           str1    %9s

Sorted by:
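For anyone wondering where the 744 and 6 come from, the size that describe reports is the number of observations times the combined width of the storage types. Here is a small sketch of that arithmetic. The helper names are hypothetical, and the width table only covers the usual Stata numeric types plus strN at N bytes.

```python
# Bytes per observation for Stata numeric storage types.
NUMERIC_WIDTHS = {"byte": 1, "int": 2, "long": 4, "float": 4, "double": 8}


def type_width(storage_type):
    """Return the per-observation width in bytes of a storage type."""
    if storage_type.startswith("str"):
        # strN occupies N bytes, e.g. str244 -> 244
        return int(storage_type[3:])
    return NUMERIC_WIDTHS[storage_type]


def data_size(n_obs, storage_types):
    """Observations times the summed width of all variables."""
    return n_obs * sum(type_width(t) for t in storage_types)


before = data_size(3, ["long", "str244"])  # as written by to_stata
after = data_size(3, ["byte", "str1"])     # after compress
print(before, after, before - after)       # 744 6 738
```

Those three numbers match the two describe outputs and the "738 bytes saved" line from compress.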

I don't know how critical this is, since there are workarounds. You can run compress and then save, replace in Stata after every export from pandas. Or, if the str244 padding makes the DTA file exceed your memory limit (which causes quite the system error lightshow, as I just experienced), you can pass the data through a CSV file first.
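The CSV route could be sketched like this. The file name and the temporary directory are placeholders, and it assumes the data survives a plain-text round trip, which loses dtype information. On the Stata side the file can then be loaded with a command such as import delimited.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"alpha": ["a", "b", "c"]})

# Write a plain CSV instead of a DTA file; index=False drops the row index.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "test.csv")
    df.to_csv(csv_path, index=False)
    with open(csv_path) as handle:
        csv_lines = handle.read().splitlines()

print(csv_lines)
```

Stata chooses the string width itself when it imports the CSV, so the padded width never reaches the dataset.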

It's just a matter of convenience.