read_csv with names, usecols and parse_dates · Issue #9755 · pandas-dev/pandas
The arrays passed to the date_parser function are different when names and usecols are specified to limit the number of parsed columns.
When running the example code below, the date_parser function receives two arguments: one array of '20140101' strings and one array of integers. The default date_parser fails to process this input.
When an empty list is assigned to DROPPED_COLUMNS (so that all columns are parsed), the second array contains strings instead of integers, and the datetimes are parsed correctly.
The problem doesn't occur with engine='python'. I haven't tested the influence of the header and index_col options.
from __future__ import print_function, division

import pandas as pd

CSV = '2014.csv'
DROPPED_COLUMNS = ['NCDC', 'I', 'QCP']
# DROPPED_COLUMNS = []

with open(CSV) as csv:
    # Skip the first header line; the second one holds the column names.
    csv.readline()
    column_names = csv.readline().split()
    # Positions (within the file) of the columns to keep.
    used_columns = [i for i, column_name in enumerate(column_names)
                    if column_name not in DROPPED_COLUMNS]
    used_col_names = [column_name for i, column_name in enumerate(column_names)
                      if i in used_columns]
    # Positions of Date and HrMn within the kept columns.
    parse_dates = [[i for i, column_name in enumerate(used_col_names)
                    if column_name in ('Date', 'HrMn')]]
    print(parse_dates)
    # Note: as described above, the problem is not seen with engine='python';
    # drop the engine argument to use the default parser instead.
    data = pd.read_csv(csv, header=None, names=used_col_names, index_col=False, engine='python',
                       parse_dates=parse_dates, usecols=used_columns)

print(data)
Identification TEMP
USAF NCDC Date HrMn I Type QCP Temp Q
062693,99999,20140101,0025,4,FM-15, , 7.0,1,
062693,99999,20140101,0055,4,FM-15, , 6.0,1,
062693,99999,20140101,0125,4,FM-15, , 6.0,1,
062693,99999,20140101,0155,4,FM-15, , 6.0,1,
062693,99999,20140101,0225,4,FM-15, , 6.0,1,
062693,99999,20140101,0255,4,FM-15, , 6.0,1,
062693,99999,20140101,0325,4,FM-15, , 6.0,1,
062693,99999,20140101,0355,4,FM-15, , 6.0,1,
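
A possible way to cope with the mixed input described above (Date arriving as strings, HrMn arriving as integers) is to normalise both values before combining them. The following is only a minimal sketch using the standard library, not pandas' own parsing logic. The helper names combine_date_hrmn and parse_columns are hypothetical, and the sketch assumes that an integer HrMn is the original field with its leading zeros dropped (for example 25 for 0025).

```python
from datetime import datetime


def combine_date_hrmn(date, hrmn):
    """Build a datetime from a YYYYMMDD value and an HHMM value.

    Both values may be strings or integers. Assumption: an integer HrMn
    is the original field with its leading zeros dropped (25 for '0025').
    """
    date_text = str(date).strip()
    # Restore the zero-padding lost when '0025' is read as the integer 25.
    hrmn_text = str(hrmn).strip().zfill(4)
    return datetime.strptime(date_text + hrmn_text, '%Y%m%d%H%M')


def parse_columns(dates, hrmns):
    """Apply combine_date_hrmn pairwise to two equal-length sequences."""
    return [combine_date_hrmn(d, h) for d, h in zip(dates, hrmns)]


if __name__ == '__main__':
    # First rows of the sample file, with HrMn once as strings
    # and once as integers.
    print(parse_columns(['20140101', '20140101'], ['0025', '0055']))
    print(parse_columns(['20140101', '20140101'], [25, 55]))
```

The per-row logic could be reused inside a custom date_parser, but whether the returned list is accepted there as-is has not been verified here.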