BUG: Formatting issues with stderr output of read_csv() with warn_bad_lines=True · Issue #41710 · pandas-dev/pandas
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
badly-formatted.csv
lname fname birthday hometown
Deere Jane 2003-01-02 Philadelphia, PA
Doe John 2001-04-01 New York, NY
Jefferson Thomas 1743-04-13 Shadwell, VA
Lincoln Abraham 1809-02-12 Sinking Spring, KY
Washington George 1732-02-22 Pope's Creek, VA
(Fields are tab-separated; note the extra tab characters on lines 3, 4, and 6.)
main.py
import pandas
pandas.read_csv("badly-formatted.csv", sep="\t", error_bad_lines=False, warn_bad_lines=True)
Problem description
When using pandas.read_csv() with the error_bad_lines=False argument (and the default warn_bad_lines=True argument), the output printed to stderr appears to be a bytes object that has been cast to str. This adds noise to the console output and makes it difficult to properly capture stderr output from this function.
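To illustrate the symptom in isolation, here is a minimal sketch in plain Python, independent of pandas. The `raw` value is a hypothetical stand-in for the warning buffer, not something taken from pandas internals. It shows how casting bytes with str() produces the b'...' form with literal \n sequences, while decoding produces real text.

```python
# Hypothetical stand-in for the warning buffer; not taken from pandas internals.
raw = b"Skipping line 3: expected 4 fields, saw 5\nSkipping line 4: expected 4 fields, saw 5\n"

# str() on a bytes object yields its repr: a leading b' and a backslash-n
# written out as two characters instead of a real line break.
as_repr = str(raw)

# Decoding yields ordinary text with actual newlines.
as_text = raw.decode("utf-8")

print(as_repr)
print(as_text, end="")
```

The first print resembles the actual stderr output reported here; the second resembles the expected output.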
Expected stderr Output from MRE
Skipping line 3: expected 4 fields, saw 5
Skipping line 4: expected 4 fields, saw 5
Skipping line 6: expected 4 fields, saw 5
Actual stderr Output from MRE
b'Skipping line 3: expected 4 fields, saw 5\nSkipping line 4: expected 4 fields, saw 5\nSkipping line 6: expected 4 fields, saw 5\n'
Inconsistent line breaks with larger files
I have also seen line breaks appear inconsistently when reading in larger files. I haven't been able to isolate any causes of this inconsistency yet.
Actual stderr output from large file
b'Skipping line 342635: expected 25 fields, saw 26\n'
b'Skipping line 392116: expected 25 fields, saw 26\n'
b'Skipping line 544651: expected 25 fields, saw 26\nSkipping line 553773: expected 25 fields, saw 26\n'
b'Skipping line 559046: expected 25 fields, saw 26\nSkipping line 559146: expected 25 fields, saw 26\nSkipping line 559148: expected 25 fields, saw 26\nSkipping line 559155: expected 25 fields, saw 26\n'
b'Skipping line 596525: expected 25 fields, saw 26\n'
b'Skipping line 634470: expected 25 fields, saw 26\n'
b'Skipping line 777743: expected 25 fields, saw 26\nSkipping line 777744: expected 25 fields, saw 26\nSkipping line 777745: expected 25 fields, saw 26\nSkipping line 777746: expected 25 fields, saw 26\nSkipping line 777747: expected 25 fields, saw 26\nSkipping line 777748: expected 25 fields, saw 26\nSkipping line 777749: expected 25 fields, saw 26\nSkipping line 777750: expected 25 fields, saw 26\nSkipping line 777831: expected 25 fields, saw 26\nSkipping line 777832: expected 25 fields, saw 26\nSkipping line 778297: expected 25 fields, saw 26\n'
Expected stderr output from large file
Skipping line 342635: expected 25 fields, saw 26
Skipping line 392116: expected 25 fields, saw 26
Skipping line 544651: expected 25 fields, saw 26
Skipping line 553773: expected 25 fields, saw 26
Skipping line 559046: expected 25 fields, saw 26
Skipping line 559146: expected 25 fields, saw 26
Skipping line 559148: expected 25 fields, saw 26
Skipping line 559155: expected 25 fields, saw 26
Skipping line 596525: expected 25 fields, saw 26
Skipping line 634470: expected 25 fields, saw 26
Skipping line 777743: expected 25 fields, saw 26
Skipping line 777744: expected 25 fields, saw 26
Skipping line 777745: expected 25 fields, saw 26
Skipping line 777746: expected 25 fields, saw 26
Skipping line 777747: expected 25 fields, saw 26
Skipping line 777748: expected 25 fields, saw 26
Skipping line 777749: expected 25 fields, saw 26
Skipping line 777750: expected 25 fields, saw 26
Skipping line 777831: expected 25 fields, saw 26
Skipping line 777832: expected 25 fields, saw 26
Skipping line 778297: expected 25 fields, saw 26
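As a possible stopgap until the output is fixed, captured stderr text could be post-processed into one message per line. This is only a sketch: the helper name `normalize_bad_line_warnings` is hypothetical, and it assumes each captured line is either a bytes literal like the ones shown above or already plain text.

```python
import ast


def normalize_bad_line_warnings(captured: str) -> list[str]:
    """Turn captured stderr text into one warning message per list item.

    Hypothetical helper: assumes each line is either a bytes repr
    (b'...') holding one or more newline-separated messages, or plain text.
    """
    messages = []
    for line in captured.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(("b'", 'b"')):
            try:
                # Safely evaluate the bytes literal back into a bytes object.
                value = ast.literal_eval(line)
            except (ValueError, SyntaxError):
                value = line  # not a valid literal; keep the text as-is
            if isinstance(value, bytes):
                line = value.decode("utf-8", errors="replace")
        # A single bytes literal may bundle several messages together.
        messages.extend(part for part in line.splitlines() if part)
    return messages


sample = (
    "b'Skipping line 596525: expected 25 fields, saw 26\\n'\n"
    "b'Skipping line 544651: expected 25 fields, saw 26\\n"
    "Skipping line 553773: expected 25 fields, saw 26\\n'\n"
)
for message in normalize_bad_line_warnings(sample):
    print(message)
```

Assuming the warnings are written through Python's sys.stderr, the input text could be collected by wrapping the read_csv() call in contextlib.redirect_stderr(io.StringIO()) and passing the buffer's contents to the helper.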
Output of pd.show_versions()
INSTALLED VERSIONS
commit : 2cb9652
python : 3.9.5.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
Version : Darwin Kernel Version 18.7.0: Mon Mar 8 22:11:48 PST 2021; root:xnu-4903.278.65~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.2.4
numpy : 1.19.4
pytz : 2020.5
dateutil : 2.8.1
pip : 21.1.2
setuptools : 56.0.0
Cython : None
pytest : 6.2.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.2
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 2.2.5
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.6.3
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None