Minor change to csv reading by MLnick · Pull Request #146 · pandas-dev/pandas

Hi Wes

Firstly, congrats on such an amazing project! I love prototyping in python / numpy / ipython, but I always envy some of R's features. I tried pandas out about 9 months ago and although it was interesting, it seemed very rough around the edges. Now, however, it is looking really polished and I've been using it for prototyping and testing some trading models, and everything works extremely well. I hope it keeps growing, together with the integration with scikits statsmodels/timeseries and maybe even scikits.learn in future ...

Anyway, as I was starting to dive into the code, I came across read_csv and noticed that read_table fully duplicates it. Python's csv module actually has full support for arbitrary delimiters, so there is no need for the duplication. There is also csv.Sniffer().sniff(sample), which attempts to detect the delimiter automatically from a sample of the file. This commit tries to "magically" handle any arbitrary CSV file without needing to specify a separator, whether the fields are separated by blank spaces, tabs, commas, semicolons or other weird separators (I have a file at work with "^" separators :). If sniffing doesn't work, one can fall back on specifying the separator explicitly (so read_csv looks more like read_table). In future it could make sense to simply have a single read_data or read_table function.
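To illustrate the idea, here is a minimal standalone sketch of delimiter sniffing with the standard library. It is not the code in this commit: `read_rows`, its `sep` and `sample_size` parameters, and the comma fallback when sniffing fails are all assumptions made for the example.

```python
import csv
import io


def read_rows(text, sep=None, sample_size=1024):
    """Hypothetical helper: parse delimited text into a list of rows.

    If sep is None, try to sniff the delimiter from the first
    sample_size characters; otherwise use the separator given.
    """
    if sep is None:
        try:
            # sniff() returns a Dialect subclass; only the delimiter is used here.
            sep = csv.Sniffer().sniff(text[:sample_size]).delimiter
        except csv.Error:
            # Assumption for this sketch: fall back to a comma when the
            # sniffer cannot determine a delimiter.
            sep = ","
    return list(csv.reader(io.StringIO(text), delimiter=sep))


# Delimiter is sniffed from the data.
caret_rows = read_rows("a^b^c\n1^2^3\n4^5^6\n")

# Delimiter is given explicitly, so the sniffer is bypassed.
pipe_rows = read_rows("a|b\n1|2\n", sep="|")
```

Passing `sep` explicitly skips the sniffer entirely, which mirrors the fallback described above for files where detection fails.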

Incidentally, csv.Sniffer() also tries to sniff out other things, like the quote character and double quoting, but this commit effectively only uses it for the delimiter. If users run into problems with quoting or string escaping, one could always let the sniffer try to figure out the full dialect.
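For reference, using the full sniffed dialect could look roughly like the sketch below. Again, this is not part of the commit; `sniff_and_read` is a hypothetical name and the sample data is made up for the example.

```python
import csv
import io


def sniff_and_read(text, sample_size=1024):
    """Hypothetical helper: sniff the whole dialect and parse with it."""
    # The sniffed dialect carries the delimiter, quotechar, doublequote
    # and skipinitialspace settings, not just the delimiter.
    dialect = csv.Sniffer().sniff(text[:sample_size])
    rows = list(csv.reader(io.StringIO(text), dialect))
    return dialect, rows


# Semicolon-separated fields, one of which contains a quoted semicolon.
sample = '"name";"note"\n"a";"x;y"\n"b";"z"\n'
dialect, rows = sniff_and_read(sample)
```

Here the quoted field `"x;y"` stays in one piece because the reader is given the sniffed quote character along with the delimiter.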