read_csv with names, usecols and parse_dates · Issue #9755 · pandas-dev/pandas

The arrays passed to the date_parser function are different when names and usecols are specified to limit the number of parsed columns.

When running the example code below, the date_parser function receives two arguments, one array with '20140101' strings, and one array with integers. The default date_parser fails to process this input.

When assigning an empty list to DROPPED_COLUMNS (so that all columns are parsed), the second array contains strings instead of integers, and the datetimes are parsed correctly.

The problem doesn't occur with engine='python'. I haven't tested the influence of the header and index_col options.

from __future__ import print_function, division
import pandas as pd

CSV = '2014.csv'

DROPPED_COLUMNS = ['NCDC', 'I', 'QCP']
# DROPPED_COLUMNS = []

with open(CSV) as csv:
    csv.readline()
    column_names = csv.readline().split()
    used_columns = [i for i, column_name in enumerate(column_names)
                    if column_name not in DROPPED_COLUMNS]
    used_col_names = [column_name for i, column_name in enumerate(column_names)
                      if i in used_columns]
    parse_dates = [[i for i, column_name in enumerate(used_col_names)
                    if column_name in ('Date', 'HrMn')]]
    print(parse_dates)

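    # Per the note above, the problem does not occur with engine='python';
    # remove that argument to exercise the default C engine.
    data = pd.read_csv(csv, header=None, names=used_col_names, index_col=False,
                       engine='python', parse_dates=parse_dates, usecols=used_columns)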
    print(data)

Sample input (first lines of 2014.csv):

Identification                          TEMP
USAF   NCDC  Date     HrMn I Type  QCP  Temp   Q
062693,99999,20140101,0025,4,FM-15,    ,   7.0,1,
062693,99999,20140101,0055,4,FM-15,    ,   6.0,1,
062693,99999,20140101,0125,4,FM-15,    ,   6.0,1,
062693,99999,20140101,0155,4,FM-15,    ,   6.0,1,
062693,99999,20140101,0225,4,FM-15,    ,   6.0,1,
062693,99999,20140101,0255,4,FM-15,    ,   6.0,1,
062693,99999,20140101,0325,4,FM-15,    ,   6.0,1,
062693,99999,20140101,0355,4,FM-15,    ,   6.0,1,
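
A possible workaround, sketched here against the first sample rows above rather than the full file: skip parse_dates entirely, read Date and HrMn as strings so that values like '0025' keep their leading zeros, and build the timestamp afterwards with pd.to_datetime. This is only an illustration, not a fix for the underlying behaviour. The '_trailing' column name and the 'timestamp' column are my own placeholders, and selecting columns by name in usecols is a choice I made for the sketch; the original script selects them by position.

```python
import io
import pandas as pd

# First lines of the sample file from this report, inlined for the sketch.
SAMPLE = """\
Identification                          TEMP
USAF   NCDC  Date     HrMn I Type  QCP  Temp   Q
062693,99999,20140101,0025,4,FM-15,    ,   7.0,1,
062693,99999,20140101,0055,4,FM-15,    ,   6.0,1,
062693,99999,20140101,0125,4,FM-15,    ,   6.0,1,
"""

DROPPED_COLUMNS = ['NCDC', 'I', 'QCP']

column_names = SAMPLE.splitlines()[1].split()
# Each data row ends with a comma, so it has one more field than there are
# names; '_trailing' is a hypothetical placeholder name for that empty field.
all_names = column_names + ['_trailing']
used_col_names = [name for name in column_names if name not in DROPPED_COLUMNS]

data = pd.read_csv(
    io.StringIO(SAMPLE),
    header=None,
    skiprows=2,                        # skip the two header lines
    names=all_names,
    usecols=used_col_names,            # select by name rather than by position
    dtype={'Date': str, 'HrMn': str},  # keep '0025' as text, not the integer 25
    skipinitialspace=True,             # strip the padding after each comma
)

# Combine the two string columns after loading instead of via parse_dates.
data['timestamp'] = pd.to_datetime(data['Date'] + data['HrMn'],
                                   format='%Y%m%d%H%M')
data = data.drop(columns=['Date', 'HrMn'])
print(data)
```

Because the date columns never pass through date_parser, the result should not depend on which engine is used or on which columns are dropped.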