Add ability to process bad lines for read_csv · Issue #5686 · pandas-dev/pandas

CSV files can contain errors, for example:

NAME,PAT
Peter,cat
really bad line
Fedor,cat

To skip such really bad lines, the error_bad_lines=False parameter already exists.
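For reference, a rough sketch of that workaround (I use a line with too many fields here, since that is what the parser reports as a bad line):

import io
import pandas as pd

data = io.StringIO("NAME,PAT\nPeter,cat\nreally,bad,line\nFedor,cat\n")

# lines with an unexpected number of fields are dropped with a warning
df = pd.read_csv(data, error_bad_lines=False)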
Another example, where a field contains the delimiter but is not quoted:

NAME,PAT
Peter,cat
Ira,cat,dog
Fedor,cat

With quotes the same data would look like this:

NAME,PAT
Peter,cat
Ira,"cat,dog"
Fedor,cat

So it is easy to fix this line if we know that the first field does not contain extra separators.
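A rough sketch of that kind of fix, assuming the first column never contains the separator (the helper name is made up for illustration):

def rejoin_extra_fields(line, n_columns=2, sep=','):
    '''Keep the first n_columns - 1 fields and merge the rest
       back into the last column.'''
    parts = line.split(sep)
    if len(parts) <= n_columns:
        return parts
    return parts[:n_columns - 1] + [sep.join(parts[n_columns - 1:])]

rejoin_extra_fields('Ira,cat,dog')  # ['Ira', 'cat,dog']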
There is also a separate issue about extra trailing delimiters: #2886

A more real-life example:

82,52,29,11,2,2013-08-02 00:00:00,,,gen,,FDP, employee,0,1,gen,,,0
55,69,36,19,2,2013-10-28 00:00:00,,,gen,,FDP employee,0,1,gen,,,0

There is a difference between FDP employee and FDP, employee. So it would be great to have the ability to process these bad lines with my own handler.

My proposal is to add an additional process_bad_lines parameter to read_csv.
For example, if I want to fix such a line:

def bad_line_handler(items):
    '''Probably an ugly example, but let's imagine
       that `FDP, employee` makes up half of our data.'''
    fdp_index = items.index('FDP')
    return items[:fdp_index] + ['FDP, employee'] + items[fdp_index + 2:]

pd.read_csv(file, process_bad_lines=bad_line_handler)
  1. error_bad_lines and warn_bad_lines can work as before, but the bad line would first be passed to the process_bad_lines handler.
  2. If process_bad_lines returns None, it is probably better to just skip that line without raising an exception (this seems more flexible); to keep the old behaviour, the handler can simply return the items parameter unchanged. Alternatively, None could be treated as "still a bad line" and the exception raised from the process_bad_lines handler instead. (A pre-processing sketch that emulates these semantics follows this list.)
  3. I cannot always create the CSV file with proper quotes myself; somebody has already sent me the bad file.
  4. I can pre-process the file, but that takes more time and work, see for example http://stackoverflow.com/questions/14550441/problems-reading-csv-file-with-commas-and-characters-in-pandas
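For completeness, a rough sketch of how these semantics can already be emulated by pre-processing the file in pure Python before calling read_csv (the fix_bad_lines helper and the 'data.csv' name are only for illustration):

import csv
import io
import pandas as pd

def fix_bad_lines(path, handler, n_columns=None):
    '''Apply `handler` to rows whose field count differs from the first row;
       a None result drops the row, otherwise the fixed row is written out.'''
    out = io.StringIO()
    writer = csv.writer(out)
    with open(path, newline='') as f:
        for row in csv.reader(f):
            if n_columns is None:
                n_columns = len(row)  # take the expected width from the first row
            if len(row) != n_columns:
                row = handler(row)
                if row is None:
                    continue  # skip rows the handler gives up on
            writer.writerow(row)
    out.seek(0)
    return out

df = pd.read_csv(fix_bad_lines('data.csv', bad_line_handler))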

Some additions:
For example, I do not have many string fields and can assume that one of them contains the separator:

int,int,int,int,int,datetime,,,str,,str,str,int,int,str,,,int
int,int,int,int,int,datetime,,,str,,str    ,int,int,str,,,int

But this can work badly when there are many string fields:

int,int,int,int,int,datetime,str,str,str,str,str,str,int,int,str,,,int
int,int,int,int,int,datetime,  str,str,str,str,str  ,int,int,str,,,int

However, it would also be great to have default methods that fix such lines, e.g. by concatenating the leftmost strings:

int,int,int,int,int,datetime,str,str,str,str,str,str,int,int,str,,,int
int,int,int,int,int,datetime,  STR  ,str,str,str,str,int,int,str,,,int

# for example, with the following syntax
pd.read_csv(file, process_bad_lines='try_concat_left')
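A rough sketch of what such a built-in 'try_concat_left' heuristic might do (purely illustrative, not an existing pandas option): when a row has too many fields, merge the surplus fields into the leftmost string column.

def try_concat_left(row, n_columns, str_columns, sep=','):
    '''Merge the surplus fields of an over-long row into the
       leftmost string column.'''
    extra = len(row) - n_columns
    if extra <= 0:
        return row
    i = str_columns[0]  # leftmost column that is allowed to contain text
    merged = sep.join(row[i:i + extra + 1])
    return row[:i] + [merged] + row[i + extra + 1:]

# expected schema [int, str, str, int]; the raw row has one extra field
try_concat_left(['1', 'a', 'b', 'c', '2'], n_columns=4, str_columns=[1, 2])
# -> ['1', 'a,b', 'c', '2']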

or the rightmost strings:

int,int,int,int,int,datetime,str,str,str,str,str,str,int,int,str,,,int
int,int,int,int,int,datetime,str,str,str,str,  STR  ,int,int,str,,,int

# for example, with the following syntax
pd.read_csv(file, process_bad_lines='try_concat_right')
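The mirror image could look like this (again only a sketch of the idea):

def try_concat_right(row, n_columns, str_columns, sep=','):
    '''Same idea as try_concat_left, but the surplus fields are
       assumed to belong to the rightmost string column.'''
    extra = len(row) - n_columns
    if extra <= 0:
        return row
    j = str_columns[-1]  # rightmost column that is allowed to contain text
    merged = sep.join(row[j:j + extra + 1])
    return row[:j] + [merged] + row[j + extra + 1:]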

and also one for removing extra trailing delimiters:

pd.read_csv(file, process_bad_lines='skip_right_delimiters')
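Again just a sketch of what that last heuristic could look like, dropping empty trailing fields until the row has the expected width (the name is illustrative):

def skip_right_delimiters(row, n_columns):
    '''Drop empty trailing fields produced by extra trailing delimiters.'''
    while len(row) > n_columns and row[-1] == '':
        row = row[:-1]
    return row

skip_right_delimiters(['Peter', 'cat', '', ''], n_columns=2)  # ['Peter', 'cat']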