read_csv parse issue with newline in quoted items combined with skiprows · Issue #10911 · pandas-dev/pandas
I don't know whether this is known or intended behaviour, but when I read certain rows from a large file that uses "~" (tilde) as the quotechar and pass skiprows at the same time, the parser goes wrong as follows:
Note: in the output I use " to mark quoted items even though pandas doesn't show it; if I didn't, the markup would get messed up, sorry.
pd.read_csv(StringIO.StringIO('a,b,c\r~a\n b~,~e\n d~,~f\n f~\r1,2,~12\n 13\n 14~'), quotechar="~", skiprows=range(1,2))
a b c
"b" "e\n d" "f\n f"
1 2 "12\n 13\n 14" NaN
The output I want in this artificial case would be:
a b c
0 1 2 "12\n 13\n 14"
It seems that when skipping rows, the parser ignores the custom quotechar, which from my point of view is undesired in this case.
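To make the expected behaviour concrete, here is a minimal sketch that uses only Python 3's standard `csv` module rather than pandas. It resolves quoting first and only then skips whole records, so a newline inside a quoted field is never counted as a row boundary. The helper name `read_records` is hypothetical, and the placement of the `~` characters in the sample string is an assumption reconstructed from the output shown above.

```python
import csv
import io


def read_records(text, quotechar="~", skip=()):
    """Parse CSV text and drop whole records by 0-based record index.

    Hypothetical helper: skipping happens after quoting is resolved,
    so newlines inside quoted fields do not split a record.
    """
    skip = set(skip)
    # newline="" passes line endings to the csv module untranslated,
    # so both "\r" and "\n" reach the parser as written.
    reader = csv.reader(io.StringIO(text, newline=""), quotechar=quotechar)
    return [row for i, row in enumerate(reader) if i not in skip]


# Assumed reconstruction of the sample data from the report.
data = "a,b,c\r~a\n b~,~e\n d~,~f\n f~\r1,2,~12\n 13\n 14~"

# Record 0 is the header; record 1 is the row the report wants skipped.
rows = read_records(data, skip=range(1, 2))
print(rows)
```

The remaining rows are the header and the `1,2,...` record, which matches the desired output above. If a DataFrame is needed, the result could be handed on with something like `pd.DataFrame(rows[1:], columns=rows[0])`, at the cost of losing `read_csv`'s type inference.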
EDIT: It might well be that in the quoted texts newlines are not always \n but sometimes also \r.
EDIT2 (31.8.):
The lineterminator fix fails as far as I can see with the following example:
a = StringIO.StringIO('Text,url\r~example\r sentence\r one~,url1\r~example\n sentence\n two~,url2')
pd.read_csv(a, quotechar="~", skiprows=range(1,2), lineterminator='\r')
Text url
0 sentence NaN
1 "one" url1
2 "example\n sentence\n two" url2
The problem is that the csv has a "text" column whose content is HTML-formatted text blocks. There is no telling what kind of newline the creators of the HTML originally used, and the text blocks come from different sources.
I might also add that it respects the quoting perfectly if one does not use "skiprows".
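For the mixed-newline case, a possible workaround is sketched below, again with Python 3's standard `csv` module instead of pandas. It parses every record with quoting respected, normalizes `\r` and `\r\n` inside fields to `\n`, and only then drops the unwanted record. The variable names and the tilde placement in the sample string are assumptions, and the normalization step is optional and not something the report asks for.

```python
import csv
import io

# Assumed reconstruction of the second sample: "\r" separates records,
# while quoted fields contain either "\r" or "\n".
text = "Text,url\r~example\r sentence\r one~,url1\r~example\n sentence\n two~,url2"

# newline="" keeps "\r" and "\n" untranslated for the csv parser.
reader = csv.reader(io.StringIO(text, newline=""), quotechar="~")
header, *records = list(reader)

# Optional: make embedded newlines uniform, whatever the source used.
normalized = [
    [field.replace("\r\n", "\n").replace("\r", "\n") for field in record]
    for record in records
]

# Drop the first data record, mirroring skiprows=range(1, 2).
kept = normalized[1:]
print(header, kept)
```

Because the skipping is done on parsed records rather than on physical lines, it does not matter which newline style appears inside the quoted text.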
versioninfo:
python: 2.7.10.final.0
python-bits: 64
OS: Windows
OS-release: 7
pandas: 0.16.2
nose: None
Cython: None
numpy: 1.9.2
scipy: 0.16.0
statsmodels: None
IPython: 4.0.0
sphinx: None
patsy: None
dateutil: 2.4.2
pytz: 2015.4
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.4.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.4.0
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None