API/ENH: read_csv handling of bad lines (too many/few fields) · Issue #15122 · pandas-dev/pandas
Currently `read_csv` has some ways to deal with "bad lines" (bad in the sense of too many or too few fields compared to the determined number of columns):
- by default, it will raise an error for too many fields, and fill with NaNs for too few fields
- with `error_bad_lines=False`, rows with too many fields are dropped instead of raising an error (and in that case, `warn_bad_lines` controls whether a warning is issued); see the sketch after this list
- with `usecols` you can select certain columns, and in this way deal with rows with too many fields
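As a minimal sketch of the current behavior (assuming a pandas version where `error_bad_lines`/`warn_bad_lines` are still supported; the sample data is made up for the example):

```python
import io
import pandas as pd

# Sample data: row "4,5,6,7" has too many fields, row "8,9" too few.
data = "a,b,c\n1,2,3\n4,5,6,7\n8,9\n"

# Default: the row with too many fields raises a parser error.
try:
    pd.read_csv(io.StringIO(data))
except pd.errors.ParserError as exc:
    print("error:", exc)

# error_bad_lines=False: the offending row is skipped instead, and
# warn_bad_lines controls whether a warning is emitted for it.
df = pd.read_csv(io.StringIO(data), error_bad_lines=False, warn_bad_lines=True)
print(df)  # the row with too few fields is kept and padded with NaN
```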
Some possibilities are missing in this scheme:
- "process" bad lines with too many fields, i.e. drop the excess fields instead of either raising an error or dropping the full row (discussed in usecols dooesn't help with unclean csv's #9549); a rough workaround sketch follows this list
- getting a warning or an error for too few fields instead of automatically filling with NaNs (asked for in "Bad" lines with too few fields #9729), or dropping those rows
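Neither option exists today; to illustrate what a "process" mode could do, here is a hand-rolled pre-processing sketch (the `normalize` helper is hypothetical, not a pandas API):

```python
import csv
import io
import pandas as pd

def normalize(text, n_fields):
    """Hypothetical pre-processing: truncate rows that have too many fields
    and pad rows that have too few, roughly what a 'process' mode could do."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in csv.reader(io.StringIO(text)):
        writer.writerow((row + [""] * n_fields)[:n_fields])
    buf.seek(0)
    return buf

data = "a,b,c\n1,2,3\n4,5,6,7\n8,9\n"
df = pd.read_csv(normalize(data, 3))
print(df)  # the extra field is dropped; the short row is padded (read as NaN)
```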
Apart from that, #5686 requests the ability to specify a custom function to process a bad line, for even more control.
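Such a hook might look roughly like this (purely hypothetical: neither a callable-accepting keyword nor this handler signature existed at the time of this issue):

```python
import pandas as pd

def fix_bad_line(fields):
    # e.g. merge spill-over from unquoted commas back into the last column
    return fields[:2] + [",".join(fields[2:])]

# "bad_lines" accepting a callable is a hypothetical keyword here
df = pd.read_csv("data.csv", bad_lines=fix_bad_line)
```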
In #9549 (comment) (and surrounding comments) there was some discussion about how to integrate this, and an idea from there from @jreback and @selasley: provide more fine-grained control in a new keyword (and deprecate `error_bad_lines`), e.g. `bad_lines='error'|'warn'|'skip'|'process'`, or leave out `'warn'` and keep `warn_bad_lines`, to be able to combine a warning with both `'skip'` and `'process'`.
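Under that proposal, usage might look like this (hypothetical, sketching the API shape only):

```python
import pandas as pd

# All hypothetical: the proposed bad_lines keyword does not exist (yet).
pd.read_csv("data.csv", bad_lines="error")    # current default: raise
pd.read_csv("data.csv", bad_lines="skip")     # drop offending rows
pd.read_csv("data.csv", bad_lines="process")  # truncate the extra fields

# Variant that keeps warn_bad_lines so a warning can combine with either:
pd.read_csv("data.csv", bad_lines="skip", warn_bad_lines=True)
```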
We should further think about whether we can integrate this with the case of too few fields and not only too many.
I think it would be nice to have some better control here, but we should think a bit about the best API for this.