API/ENH: read_csv handling of bad lines (too many/few fields) · Issue #15122 · pandas-dev/pandas
Currently `read_csv` has some ways to deal with "bad lines" (bad in the sense of too many or too few fields compared to the determined number of columns):
- by default, it will raise an error for too many fields, and fill with NaNs for too few fields
- with `error_bad_lines=False`, rows with too many fields are dropped instead of raising an error (and in that case, `warn_bad_lines` controls whether a warning is issued); see the sketch after this list
- with `usecols` you can select certain columns, and in this way deal with rows with too many fields
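As a minimal sketch of the current behavior (assuming a pandas version where `error_bad_lines`/`warn_bad_lines` are still supported; the sample data is made up for the example):

```python
import io
import pandas as pd

# Sample data: row "4,5,6,7" has too many fields, row "8,9" too few.
data = "a,b,c\n1,2,3\n4,5,6,7\n8,9\n"

# Default: the row with too many fields raises a parser error.
try:
    pd.read_csv(io.StringIO(data))
except pd.errors.ParserError as exc:
    print("error:", exc)

# error_bad_lines=False: the offending row is skipped instead, and
# warn_bad_lines controls whether a warning is emitted for it.
df = pd.read_csv(io.StringIO(data), error_bad_lines=False, warn_bad_lines=True)
print(df)  # the row with too few fields is kept and padded with NaN
```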
Some possibilities are missing in this scheme:
- "process" bad lines with too many fields, i.e. drop the excess fields instead of either raising an error or dropping the full row (discussed in usecols dooesn't help with unclean csv's #9549); a rough workaround sketch follows this list
- getting a warning or an error for too few fields instead of automatically filling with NaNs (asked for in "Bad" lines with too few fields #9729), or dropping those rows
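Neither option exists today; to illustrate what a "process" mode could do, here is a hand-rolled pre-processing sketch (the `normalize` helper is hypothetical, not a pandas API):

```python
import csv
import io
import pandas as pd

def normalize(text, n_fields):
    """Hypothetical pre-processing: truncate rows that have too many fields
    and pad rows that have too few, roughly what a 'process' mode could do."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in csv.reader(io.StringIO(text)):
        writer.writerow((row + [""] * n_fields)[:n_fields])
    buf.seek(0)
    return buf

data = "a,b,c\n1,2,3\n4,5,6,7\n8,9\n"
df = pd.read_csv(normalize(data, 3))
print(df)  # the extra field is dropped; the short row is padded (read as NaN)
```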
Apart from that, #5686 requests the ability to specify a custom function to process a bad line, for even more control.
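Such a hook might look roughly like this (purely hypothetical: neither a callable-accepting keyword nor this handler signature existed at the time of this issue):

```python
import pandas as pd

def fix_bad_line(fields):
    # e.g. merge spill-over from unquoted commas back into the last column
    return fields[:2] + [",".join(fields[2:])]

# "bad_lines" accepting a callable is a hypothetical keyword here
df = pd.read_csv("data.csv", bad_lines=fix_bad_line)
```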
In #9549 (comment) (and surrounding comments) there was some discussion about how to integrate this, and an idea from there from @jreback and @selasley: provide more fine-grained control in a new keyword (and deprecate `error_bad_lines`), e.g. `bad_lines='error'|'warn'|'skip'|'process'`, or leave out `'warn'` and keep `warn_bad_lines`, to be able to combine a warning with both `'skip'` and `'process'`.
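Under that proposal, usage might look like this (hypothetical, sketching the API shape only):

```python
import pandas as pd

# All hypothetical: the proposed bad_lines keyword does not exist (yet).
pd.read_csv("data.csv", bad_lines="error")    # current default: raise
pd.read_csv("data.csv", bad_lines="skip")     # drop offending rows
pd.read_csv("data.csv", bad_lines="process")  # truncate the extra fields

# Variant that keeps warn_bad_lines so a warning can combine with either:
pd.read_csv("data.csv", bad_lines="skip", warn_bad_lines=True)
```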
We should further think about whether we can integrate this with the case of too few fields and not only too many.
I think it would be nice to have some better control here, but we should think a bit about the best API for this.