Pandas 0.14.1 StataReader seems to read .dta files 10x slower than Pandas 0.13.1 · Issue #8040 · pandas-dev/pandas

Hello, I recently upgraded to Pandas 0.14.1 from Pandas 0.13.1, and am having trouble reading Stata .dta files using StataReader. Files that used to take 20 seconds to read now take 300 seconds, and files that used to take 220 seconds are not read even after 20 minutes.

I would really like to use the newer version of Pandas for these large datasets, and import a .dta file rather than a .csv or other filetype in order to maintain my value labels from Stata.

Steps for reproduction:

First, create a large dataset in Stata 13:

clear
set obs 11500
forvalues i = 1/8000{
gen var`i' = 1
}

saveold bigdataset, replace

Second, try to read it into pandas using StataReader:

from pandas.io.stata import StataReader

reader = StataReader('bigdataset.dta')
data = reader.data()

Using pandas 0.13.1, this takes around 220 seconds, which is acceptable, but using pandas 0.14.1, the read has still not finished after around 20 minutes.
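To make the timings easier to compare across pandas versions, the read can be wrapped in a small timing helper. This is only a sketch: `time_call` and `read_dta` are hypothetical names introduced here, not part of pandas, and `read_dta` simply repeats the `StataReader(...).data()` call from the snippet above.

```python
import time


def time_call(func, *args, **kwargs):
    """Run func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.time()
    result = func(*args, **kwargs)
    return result, time.time() - start


def read_dta(path):
    """Read a .dta file the same way as the reproduction snippet."""
    # Imported here so the timing helper itself does not require pandas.
    from pandas.io.stata import StataReader
    return StataReader(path).data()


# Example use (assumes bigdataset.dta is in the working directory):
#   data, seconds = time_call(read_dta, 'bigdataset.dta')
#   print('read took %.1f seconds' % seconds)
```

Running the same script under each pandas version should give directly comparable wall-clock numbers.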

When I test this using a smaller dataset:

clear
set obs 11500
forvalues i = 1/1000{
gen var`i' = 1
}

saveold smalldataset, replace

from pandas.io.stata import StataReader

reader = StataReader('smalldataset.dta')
data = reader.data()

Using pandas 0.13.1, this takes around 20 seconds, but using pandas 0.14.1, this takes around 300 seconds.
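If it helps with narrowing this down, the smaller file finishes quickly enough that the read could be profiled under both versions to see where the extra time goes. The following is a rough sketch using the standard-library `cProfile` module; `profile_call` is a hypothetical helper name, and I have not identified the slow code path myself.

```python
import cProfile
import pstats


def profile_call(func, *args, **kwargs):
    """Profile func(*args, **kwargs); print the top entries by cumulative time."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stats = pstats.Stats(profiler)
    # Show the 15 entries with the largest cumulative time.
    stats.sort_stats('cumulative').print_stats(15)
    return result, stats


# Example use (assumes smalldataset.dta is in the working directory):
#   from pandas.io.stata import StataReader
#   reader = StataReader('smalldataset.dta')
#   data, stats = profile_call(reader.data)
```

Comparing the printed tables from pandas 0.13.1 and 0.14.1 should show which calls account for the difference.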

Thanks for reading! Here is the output from show_versions(), if relevant:

INSTALLED VERSIONS

commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-32-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.13.1
Cython: 0.20.2
numpy: 1.8.1
scipy: 0.13.3
statsmodels: 0.5.0
IPython: 1.2.1
sphinx: 1.2.2
patsy: 0.2.1
scikits.timeseries: None
dateutil: 1.5
pytz: 2012c
bottleneck: None
tables: 3.1.1
numexpr: 2.2.2
matplotlib: 1.3.1
openpyxl: 1.7.0
xlrd: 0.9.2
xlwt: 0.7.5
xlsxwriter: None
sqlalchemy: None
lxml: 3.3.3
bs4: 4.2.1
html5lib: 0.999
bq: None
apiclient: None