Issue 17829: csv.Sniffer.sniff doesn't set up the dialect properly for a csv created with dialect=csv.excel_tab and containing quote (") char
Created on 2013-04-24 15:06 by GhislainHivon, last changed 2022-04-11 14:57 by admin.
When sniffing the dialect of a file created by the csv module with dialect=csv.excel_tab, if one of the rows contains a quote ("), the delimiter is set to ' ' instead of '\t'.
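A minimal way to reproduce the report on Python 3 might look like the sketch below. The row data is an assumption borrowed from the example line in the comment that follows, not the reporter's attached script. The sniffed delimiter is printed rather than asserted, since the result may depend on the Python version in use.

```python
import csv
import io

# Hypothetical test row; the reporter's actual data is not shown in this issue text.
row = ["273", "MVREGR1", "ByEuPo", 'Baryton "Euphonium" populaire']

# Write the row with the tab-delimited Excel dialect into an in-memory text buffer.
buf = io.StringIO(newline="")
csv.writer(buf, dialect=csv.excel_tab).writerow(row)
sample = buf.getvalue()

# The field containing quotes is wrapped in quotes and its quotes are doubled.
print(repr(sample))

# Ask the sniffer which delimiter it detects for this sample.
dialect = csv.Sniffer().sniff(sample)
print(repr(dialect.delimiter))
```

If the second line printed is ' ' rather than '\t', the behaviour described above is present in the interpreter being tested.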
I had a look at this and have the following remarks.

1) The file csv_sniffing_excel_tab.py no longer works with Python 3.3. It now produces the following traceback:

Traceback (most recent call last):
  File "csv_sniffing_excel_tab.py", line 36, in <module>
    create_file()
  File "csv_sniffing_excel_tab.py", line 23, in create_file
    writer.writerows(test_data)
TypeError: 'str' does not support the buffer interface

2) The problem seems to be in the _guess_quote_and_delimiter method. If you always call _guess_delimiter, the sniffer gives the correct result.

3) As far as I understand, the problem is the first regular expression:

(?P<delim>[^\w\n"\'])(?P<space> ?)(?P<quote>["\']).*?(?P=quote)(?P=delim)

Now if we have a line such as the following:

273:MVREGR1:ByEuPo:"Baryton ""Euphonium"" populaire"

the delim group will match the space after Baryton, the space group will match nothing, the quote group will match ", the non-group pattern (.*?) will match "Euphonium", followed by the quote group matching " again and the delim group matching the space before populaire. And so we get the wrong delimiter.
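The match described in remark 3 can be checked directly with the re module. This is only a sketch: the group names delim, space and quote are taken from the comment's own description, and the re.DOTALL | re.MULTILINE flags are an assumption about how the sniffer compiles the pattern (they do not affect this single-line example).

```python
import re

# First pattern from remark 3, with the group names the comment refers to.
# The flags are an assumption; they make no difference for a single line.
pattern = re.compile(
    r'(?P<delim>[^\w\n"\'])(?P<space> ?)(?P<quote>["\']).*?(?P=quote)(?P=delim)',
    re.DOTALL | re.MULTILINE,
)

# Example line exactly as given in the comment.
line = '273:MVREGR1:ByEuPo:"Baryton ""Euphonium"" populaire"'

match = pattern.search(line)
print(repr(match.group("delim")))  # ' '
print(repr(match.group("space")))  # ''
print(repr(match.group("quote")))  # '"'
print(repr(match.group(0)))        # ' ""Euphonium"" '
```

The character just before the opening quote of the field cannot serve as delim, because the closing quote at the end of the line is not followed by that same character. The first position where the whole pattern succeeds is therefore the space inside the quoted field.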