Potential bug in reading SAS files with CHAR (RLE) compression and many repeated characters · Issue #31243 · pandas-dev/pandas
Hi,
I think I ran into a bug in the RLE decompression implementation.
Short description:
String fields with more than 32 consecutive repeated characters are cropped at 32 characters, and the following fields spill over, corrupting the whole dataframe.
Example:
example.csv with fields of length 50
long_string_field1,long_string_field2,long_string_field3
"00000000000000000000000000000000000000000000000000","11111111111111111111111111111111111111111111111111","aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
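For anyone who wants to regenerate the input without copying the long strings by hand, here is a minimal sketch that builds the same CSV. The helper names (`build_example_csv`, `write_example_csv`) are made up for illustration; only the column names, the fill characters and the length of 50 come from the example above.

```python
# Sketch: rebuild example.csv with three 50-character fields.
# Helper names are hypothetical; the data mirrors the example in this report.
FIELD_LENGTH = 50
COLUMNS = {
    "long_string_field1": "0",
    "long_string_field2": "1",
    "long_string_field3": "a",
}


def build_example_csv(field_length=FIELD_LENGTH):
    """Return the CSV text: an unquoted header and one row of quoted values."""
    header = ",".join(COLUMNS)
    values = ",".join(f'"{char * field_length}"' for char in COLUMNS.values())
    return f"{header}\n{values}\n"


def write_example_csv(path="example.csv"):
    """Write the CSV text to disk so SAS can import it."""
    with open(path, "w", newline="") as handle:
        handle.write(build_example_csv())


if __name__ == "__main__":
    write_example_csv()
```

Changing `FIELD_LENGTH` should make it easy to check at which run length the cropping starts.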
Create a CHAR compressed sas7bdat file (system encoding is set to latin1)
options compress=char;
proc import datafile="path\example.csv"
    out=your_lib.example
    dbms=csv
    replace;
    getnames=yes;
run;
import pandas as pd
example = pd.read_sas("./example.sas7bdat", encoding="latin1")
This is what you get:
example['long_string_field1'].values[0]
'00000000000000000000000000000000 111111111111111111'
example['long_string_field2'].values[0]
'11111111111111 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
example['long_string_field3'].values[0]
nan
There are a couple of interesting points:
- Exactly 32 characters are read/written from each field. Originals had 50 characters.
- The dataframe fields are filled up to the 50-character limit with spillover from the following fields (plus 2 additional spaces between the contents of the source fields, in the case of consecutive integers).
- This only happens with CHAR (RLE) compression.
- This only happens if you have repeated consecutive characters.
- The sas7bdat package works fine.
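The 32-character cap in the first point can be checked mechanically. Below is a small sketch that measures consecutive runs in the returned strings. `run_lengths` is a hypothetical helper, and the two values are transcribed from the output as displayed above, so the amount of whitespace between the runs may not match the real data exactly.

```python
from itertools import groupby


def run_lengths(value):
    """Return (character, run length) pairs for consecutive repeats in value."""
    return [(char, len(list(group))) for char, group in groupby(value)]


# Values transcribed from the output shown in this report (whitespace as displayed).
observed_field1 = "0" * 32 + " " + "1" * 18
observed_field2 = "1" * 14 + " " + "a" * 32

for name, value in [("field1", observed_field1), ("field2", observed_field2)]:
    runs = run_lengths(value)
    longest = max(length for _, length in runs)
    print(name, runs, "longest run:", longest)
```

In both values the longest run is 32, and the two runs of `1` (18 + 14) also add up to 32, which fits the idea that every repeated run is cut to 32 characters before the row is split into fields.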
DISCLAIMER
I would never use any of the things above of my own free will. Sadly, this is an actual case I keep running into when having to deal with SAS... 😢
Output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit : None
python : 3.6.9.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : None.None
pandas : 0.25.3
numpy : 1.17.5
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 45.1.0.post20200119
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.4 (dt dec pq3 ext lo64)
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
s3fs : None
scipy : None
sqlalchemy : 1.3.12
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None