read_sas fails when passed a file object from GCSFS · Issue #33069 · pandas-dev/pandas
Code Sample, a copy-pastable example if possible
From https://stackoverflow.com/q/60848250/101923
export BUCKET_NAME=swast-scratch-us
curl -L https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/DEMO_J.XPT | gsutil cp - gs://${BUCKET_NAME}/sas_sample/Nchs/Nhanes/2017-2018/DEMO_J.XPT
import pandas as pd
import gcsfs

bucket_name = "swast-scratch-us"
project_id = "swast-scratch"

fs = gcsfs.GCSFileSystem(project=project_id)
with fs.open(
    "{}/sas_sample/Nchs/Nhanes/2017-2018/DEMO_J.XPT".format(bucket_name),
    "rb"
) as f:
    df = pd.read_sas(f, format="xport")

print(df)
Problem description
This throws the following exception:
Traceback (most recent call last):
File "after.py", line 15, in <module>
df = pd.read_sas(f, format="xport")
File "/Users/swast/miniconda3/envs/scratch/lib/python3.7/site-packages/pandas/io/sas/sasreader.py", line 70, in read_sas
filepath_or_buffer, index=index, encoding=encoding, chunksize=chunksize
File "/Users/swast/miniconda3/envs/scratch/lib/python3.7/site-packages/pandas/io/sas/sas_xport.py", line 280, in __init__
contents = contents.encode(self._encoding)
AttributeError: 'bytes' object has no attribute 'encode'
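The last frame of the traceback suggests that the reader calls `.encode()` on whatever `read()` returns, which can only work when the buffer yields text. Here is a minimal sketch of that failure mode without pandas or GCS. The helper names `load_contents_unguarded` and `load_contents_guarded` are hypothetical, the encoding value is assumed for illustration, and the guarded variant is only one possible fix, not necessarily what pandas does or should do.

```python
import io

ENCODING = "ISO-8859-1"  # assumed value, for illustration only


def load_contents_unguarded(buffer, encoding=ENCODING):
    """Mimic the failing line: always call .encode() on the read result."""
    contents = buffer.read()
    return contents.encode(encoding)  # AttributeError when contents is bytes


def load_contents_guarded(buffer, encoding=ENCODING):
    """Hypothetical fix: only encode when the buffer yielded text."""
    contents = buffer.read()
    if isinstance(contents, str):
        contents = contents.encode(encoding)
    return contents


# A binary buffer stands in for the file object opened with "rb".
binary_buffer = io.BytesIO(b"HEADER RECORD")
try:
    load_contents_unguarded(binary_buffer)
except AttributeError as exc:
    error_message = str(exc)

binary_buffer.seek(0)
guarded_result = load_contents_guarded(binary_buffer)
text_result = load_contents_guarded(io.StringIO("HEADER RECORD"))
```

Both guarded calls end up with the same bytes, so a check of this kind would let binary and text buffers share one code path.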
Expected Output
          SEQN  SDDSRVYR  RIDSTATR  RIAGENDR  ...  SDMVSTRA  INDHHIN2  INDFMIN2  INDFMPIR
0      93703.0      10.0       2.0       2.0  ...     145.0      15.0      15.0      5.00
1      93704.0      10.0       2.0       1.0  ...     143.0      15.0      15.0      5.00
2      93705.0      10.0       2.0       2.0  ...     145.0       3.0       3.0      0.82
3      93706.0      10.0       2.0       1.0  ...     134.0       NaN       NaN       NaN
4      93707.0      10.0       2.0       1.0  ...     138.0      10.0      10.0      1.88
...        ...       ...       ...       ...  ...       ...       ...       ...       ...
9249  102952.0      10.0       2.0       2.0  ...     138.0       4.0       4.0      0.95
9250  102953.0      10.0       2.0       1.0  ...     137.0      12.0      12.0       NaN
9251  102954.0      10.0       2.0       2.0  ...     144.0      10.0      10.0      1.18
9252  102955.0      10.0       2.0       2.0  ...     136.0       9.0       9.0      2.24
9253  102956.0      10.0       2.0       1.0  ...     142.0       7.0       7.0      1.56
[9254 rows x 46 columns]
Note: the expected output is printed when a local file is read.
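Since reading from a local file works, one possible workaround until the reader accepts binary file objects is to copy the remote bytes to a temporary local file and pass its path to `read_sas`. This is a sketch under that assumption: `spool_to_local_path` is a hypothetical helper, and an in-memory buffer stands in for the GCSFS file object so the example runs without network access.

```python
import io
import os
import shutil
import tempfile


def spool_to_local_path(fileobj, suffix=".xpt"):
    """Copy a binary file-like object to a temp file and return its path."""
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        shutil.copyfileobj(fileobj, tmp)
        return tmp.name


# Stand-in for the GCSFS file object; these bytes are not a real XPORT file.
remote_like = io.BytesIO(b"example bytes, not a real XPORT file")
local_path = spool_to_local_path(remote_like)

# Confirm the copy is byte-for-byte identical to the source buffer.
with open(local_path, "rb") as check:
    copied = check.read()

# With a real XPORT file, the path could then be handed to pandas:
#     df = pd.read_sas(local_path, format="xport")
os.remove(local_path)  # the caller is responsible for cleanup
```

The temporary file is created with `delete=False` so that it can be reopened by path on every platform, which is why the caller removes it explicitly afterwards.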
Output of pd.show_versions()
Python 3.7.3 | packaged by conda-forge | (default, Jul 1 2019, 14:38:56)
[Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Darwin
OS-release : 19.4.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 0.25.1
numpy : 1.18.1
pytz : 2019.2
dateutil : 2.8.1
pip : 20.0.2
setuptools : 46.0.0.post20200311
Cython : None
pytest : 5.0.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.5.0
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.7.0
pandas_datareader: None
bs4 : 4.8.0
bottleneck : None
fastparquet : None
gcsfs : 0.6.0
lxml.etree : 4.5.0
matplotlib : 3.1.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : 0.11.0
pyarrow : 0.15.1
pytables : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
xarray : 0.12.3
xlrd : None
xlwt : None
xlsxwriter : None