Add ability to process bad lines for read_csv · Issue #5686 · pandas-dev/pandas
CSV files can contain errors, for example:
NAME,PAT
Peter,cat
really bad line
Fedor,cat
To skip really bad lines there is the error_bad_lines=False parameter.
Another example, where a field contains the delimiter but is not quoted:
NAME,PAT
Peter,cat
Ira,cat,dog
Fedor,cat
With quotes it would look like this:
NAME,PAT
Peter,cat
Ira,"cat,dog"
Fedor,cat
So it is easy to fix this line if we know that the first field does not contain extra separators.
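A minimal sketch of such a fix (not part of the proposal; fix_pat_line is a made-up name, and it assumes the extra separators can only come from the last field):

def fix_pat_line(items):
    # keep the first field (NAME) and join everything else back into PAT
    return [items[0], ','.join(items[1:])]

fix_pat_line(['Ira', 'cat', 'dog'])  # -> ['Ira', 'cat,dog']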
There is also the extra trailing delimiters issue: #2886
A more real-life example:
82,52,29,11,2,2013-08-02 00:00:00,,,gen,,FDP, employee,0,1,gen,,,0
55,69,36,19,2,2013-10-28 00:00:00,,,gen,,FDP employee,0,1,gen,,,0
There is a difference between FDP employee and FDP, employee. So it would be great to have the ability to process these bad lines with a custom handler.
My proposal is to add an additional parameter, process_bad_lines, to read_csv.
For example, if I want to fix the line above:
import pandas as pd

def bad_line_handler(items):
    '''Probably an ugly example, but let's imagine that
    `FDP, employee` makes up half of our data.'''
    fdp_index = items.index('FDP')
    # merge the two pieces split by the stray comma back into one field
    return items[:fdp_index] + ['FDP, employee'] + items[fdp_index + 2:]

pd.read_csv(file, process_bad_lines=bad_line_handler)
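Applied to the split fields of the bad line above, the handler merges the two pieces back into a single field (the row below is just that line split on commas):

row = ['82', '52', '29', '11', '2', '2013-08-02 00:00:00', '', '', 'gen', '',
       'FDP', ' employee', '0', '1', 'gen', '', '', '0']
bad_line_handler(row)
# -> [..., 'gen', '', 'FDP, employee', '0', '1', 'gen', '', '', '0']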
A few notes on how it could work:
- error_bad_lines and warn_bad_lines can work as before, but first pandas would try to repair the bad line with the process_bad_lines handler.
- If process_bad_lines returns None, it is probably better to just skip that line without raising an exception (this is more flexible); to keep the current behaviour, the handler can simply return the items parameter unchanged. Alternatively, None could be treated as still being a bad line, in which case it would be better to raise the exception from inside the process_bad_lines handler.
- It is not always possible for me to create the CSV file with proper quoting in the first place; somebody has already sent me the bad CSV.
- I can pre-process the file, but that takes more time and work; see for example http://stackoverflow.com/questions/14550441/problems-reading-csv-file-with-commas-and-characters-in-pandas and the sketch below.
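For comparison, the pre-processing workaround looks roughly like this (only a sketch; preprocess_csv and expected_fields are made-up names, and bad_line_handler is the function from above):

import csv
import io

import pandas as pd

def preprocess_csv(path, handler, expected_fields):
    # repair (or drop) rows with an unexpected field count before handing
    # the cleaned text back to pandas
    fixed_rows = []
    with open(path, newline='') as f:
        for row in csv.reader(f):
            if len(row) != expected_fields:
                row = handler(row)
            if row is not None:
                fixed_rows.append(row)
    buf = io.StringIO()
    csv.writer(buf).writerows(fixed_rows)
    buf.seek(0)
    return pd.read_csv(buf)

df = preprocess_csv('data.csv', bad_line_handler, expected_fields=17)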
Some additions:
For example, if I do not have many string fields, I can assume that one particular string field contains the separator:
int,int,int,int,int,datetime,,,str,,str,str,int,int,str,,,int
int,int,int,int,int,datetime,,,str,,str ,int,int,str,,,int
But it can work badly when there are many string fields:
int,int,int,int,int,datetime,str,str,str,str,str,str,int,int,str,,,int
int,int,int,int,int,datetime, str,str,str,str,str ,int,int,str,,,int
However, it would also be great to have default methods for fixing such lines, for example by concatenating strings to the left:
int,int,int,int,int,datetime,str,str,str,str,str,str,int,int,str,,,int
int,int,int,int,int,datetime, STR ,str,str,str,str,int,int,str,,,int
# for example, with the following syntax
pd.read_csv(file, process_bad_lines='try_concat_left')
or by concatenating strings to the right:
int,int,int,int,int,datetime,str,str,str,str,str,str,int,int,str,,,int
int,int,int,int,int,datetime,str,str,str,str, STR ,int,int,str,,,int
# for example, with the following syntax
pd.read_csv(file, process_bad_lines='try_concat_right')
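One way these built-in strategies could be expressed internally is as a single merge step that folds the surplus fields into one column; 'try_concat_left' would pick the left-most string column as the merge position and 'try_concat_right' the right-most (only a sketch with made-up names concat_at, index and n_columns):

def concat_at(items, index, n_columns):
    # merge the surplus fields into the field at `index`, so the row ends up
    # with exactly `n_columns` fields
    extra = len(items) - n_columns
    if extra <= 0:
        return items
    merged = ','.join(items[index:index + extra + 1])
    return items[:index] + [merged] + items[index + extra + 1:]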
There could also be a default for removing extra trailing delimiters:
pd.read_csv(file, process_bad_lines='skip_right_delimiters')
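Such a trailing-delimiter strategy could simply drop empty fields from the end of the row (again only a sketch, assuming the expected column count is known):

def skip_right_delimiters(items, n_columns):
    # drop empty fields produced by extra delimiters at the end of the line
    while len(items) > n_columns and items[-1] == '':
        items = items[:-1]
    return items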