read_sas fails when passed a file object from GCSFS · Issue #33069 · pandas-dev/pandas

Code Sample, a copy-pastable example if possible

From https://stackoverflow.com/q/60848250/101923

export BUCKET_NAME=swast-scratch-us
curl -L https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/DEMO_J.XPT | gsutil cp - gs://${BUCKET_NAME}/sas_sample/Nchs/Nhanes/2017-2018/DEMO_J.XPT

import pandas as pd
import gcsfs

bucket_name = "swast-scratch-us"
project_id = "swast-scratch"

fs = gcsfs.GCSFileSystem(project=project_id)
with fs.open(
    "{}/sas_sample/Nchs/Nhanes/2017-2018/DEMO_J.XPT".format(bucket_name), "rb"
) as f:
    df = pd.read_sas(f, format="xport")
    print(df)

Problem description

Passing the open GCSFS file object to pd.read_sas raises the following exception:

Traceback (most recent call last):
  File "after.py", line 15, in <module>
    df = pd.read_sas(f, format="xport")
  File "/Users/swast/miniconda3/envs/scratch/lib/python3.7/site-packages/pandas/io/sas/sasreader.py", line 70, in read_sas
    filepath_or_buffer, index=index, encoding=encoding, chunksize=chunksize
  File "/Users/swast/miniconda3/envs/scratch/lib/python3.7/site-packages/pandas/io/sas/sas_xport.py", line 280, in __init__
    contents = contents.encode(self._encoding)
AttributeError: 'bytes' object has no attribute 'encode'
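The last frame of the traceback shows `.encode()` being called on the result of reading the buffer. A file opened in "rb" mode returns `bytes`, which has no `encode` method in Python 3. The following is a minimal sketch of a type guard that would avoid the error. It is not the actual pandas code or its eventual fix; the helper name `read_as_bytes` and the default encoding are assumptions made for illustration only.

```python
import io


def read_as_bytes(buffer, encoding="ISO-8859-1"):
    """Read a file-like object and return its contents as bytes.

    Hypothetical helper: only text (str) is encoded; bytes pass through.
    """
    contents = buffer.read()
    if isinstance(contents, str):
        # Text-mode buffers yield str, which must be encoded.
        contents = contents.encode(encoding)
    return contents


# Binary buffers (like the GCSFS file opened with "rb") already yield bytes.
binary_contents = read_as_bytes(io.BytesIO(b"HEADER RECORD"))
# Text buffers are encoded to bytes.
text_contents = read_as_bytes(io.StringIO("HEADER RECORD"))
```

With this guard, binary and text buffers both end up as `bytes`, so the reader could wrap either in an in-memory binary stream.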

Expected Output

          SEQN  SDDSRVYR  RIDSTATR  RIAGENDR  ...  SDMVSTRA  INDHHIN2  INDFMIN2  INDFMPIR
0      93703.0      10.0       2.0       2.0  ...     145.0      15.0      15.0      5.00
1      93704.0      10.0       2.0       1.0  ...     143.0      15.0      15.0      5.00
2      93705.0      10.0       2.0       2.0  ...     145.0       3.0       3.0      0.82
3      93706.0      10.0       2.0       1.0  ...     134.0       NaN       NaN       NaN
4      93707.0      10.0       2.0       1.0  ...     138.0      10.0      10.0      1.88
...        ...       ...       ...       ...  ...       ...       ...       ...       ...
9249  102952.0      10.0       2.0       2.0  ...     138.0       4.0       4.0      0.95
9250  102953.0      10.0       2.0       1.0  ...     137.0      12.0      12.0       NaN
9251  102954.0      10.0       2.0       2.0  ...     144.0      10.0      10.0      1.18
9252  102955.0      10.0       2.0       2.0  ...     136.0       9.0       9.0      2.24
9253  102956.0      10.0       2.0       1.0  ...     142.0       7.0       7.0      1.56

[9254 rows x 46 columns]

Note: the expected output is printed when a local file is read.
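Since reading from a local file works, one possible workaround is to copy the remote object to a temporary local file and pass its path to `pd.read_sas`. The sketch below shows only the copy step, with an in-memory buffer standing in for the GCSFS file object so it runs without cloud access. The helper name `copy_to_temp_file` is hypothetical, and this workaround has not been verified against the environment described in this report.

```python
import io
import os
import shutil
import tempfile


def copy_to_temp_file(fileobj, suffix=".xpt"):
    """Copy a binary file-like object to a local temp file and return its path.

    Hypothetical helper; the caller is responsible for deleting the file.
    """
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        shutil.copyfileobj(fileobj, tmp)
    return tmp.name


# Stand-in for the GCSFS file object (assumption: any binary file-like object).
remote_file = io.BytesIO(b"placeholder bytes, not a real XPORT file")
local_path = copy_to_temp_file(remote_file)
try:
    # With a real XPORT file, the path would be handed to pandas here:
    #     df = pd.read_sas(local_path, format="xport")
    print(os.path.getsize(local_path))
finally:
    # Clean up the temporary copy.
    os.remove(local_path)
```

In the code sample above, `remote_file` would be the `f` returned by `fs.open(...)`. The cost is a full local copy of the file, which may matter for large datasets.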

Output of pd.show_versions()

Python 3.7.3 | packaged by conda-forge | (default, Jul 1 2019, 14:38:56)
[Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.

>>> import pandas as pd
>>> pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Darwin
OS-release : 19.4.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.25.1
numpy : 1.18.1
pytz : 2019.2
dateutil : 2.8.1
pip : 20.0.2
setuptools : 46.0.0.post20200311
Cython : None
pytest : 5.0.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.5.0
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.7.0
pandas_datareader: None
bs4 : 4.8.0
bottleneck : None
fastparquet : None
gcsfs : 0.6.0
lxml.etree : 4.5.0
matplotlib : 3.1.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : 0.11.0
pyarrow : 0.15.1
pytables : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
xarray : 0.12.3
xlrd : None
xlwt : None
xlsxwriter : None