Update tokenizer to fix #8679 #8661 by selasley · Pull Request #8752 · pandas-dev/pandas



Conversation


selasley

Update tokenizer's handling of skipped lines.
Fixes a problem with read_csv(delim_whitespace=True) and
read_table(engine='c') when lines being skipped have trailing
whitespace.

closes #8679
closes #8681
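A minimal sketch of the scenario this PR addresses (the data is illustrative, not taken from the test suite; `sep=r'\s+'` is used here as the equivalent of `delim_whitespace=True`):

```python
from io import StringIO
import pandas as pd

# The lines to be skipped deliberately carry trailing whitespace,
# which tripped the C tokenizer before this fix (GH8661 / GH8679).
data = (
    "junk line one   \n"
    "junk line two \n"
    "A B C\n"
    "1 2 3\n"
    "4 5 6\n"
)

# sep=r'\s+' behaves like delim_whitespace=True
df = pd.read_csv(StringIO(data), sep=r"\s+", skiprows=2)
print(df.shape)  # (2, 3)
```

With the fix, the trailing spaces on the skipped lines no longer confuse the tokenizer and the remaining rows parse normally.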

jreback

@@ -3036,7 +3036,7 @@ def test_line_comment(self):
def test_comment_skiprows(self):
data = """# empty
random line
random line with trailing spaces


can you leave the original test in, and add a new test (you can just copy-paste), same method is fine. Also add the issue numbers as a comment.

@jreback

pls add a release note in v0.15.2 (reference both issues)

@jreback

@mdmueller

Looks good to me--might as well leave the old test in as @jreback noted above though.

Thanks for the fix @selasley!

@jreback

@selasley

@jreback

@selasley ok, rebase again, you'll have a merge conflict (as you are including the whole of v0.15.2.txt again). You should be fine otherwise, repush and i'll have a look.

@selasley

I added my comments to whatsnew/v0.15.2.txt from e8b3862, did a rebase/squash and pushed the result. Let me know if I was supposed to do something different.

@selasley

commit 3c16f33 contains a bug fix in tokenize_delim_customterm, general code cleanup and updates to various files to e8b3862 versions

@jreback

you seem to be picking up other commits and then rebasing them in

what are you doing to rebase?

@jreback

@selasley

On Nov 10, 2014, at 3:37 AM, jreback notifications@github.com wrote:

https://github.com/pydata/pandas/wiki/Using-Git

Thanks for the link. I'll definitely read through it.

I've been doing git rebase -i HEAD~2 then changing the second pick to squash and editing the commit comments. Then I do a git push origin +trailing_spaces_fix

For the latest commit I copied several files from the latest master branch to my trailing_spaces_fix branch to make sure the tests still passed. In retrospect that was probably not a good thing to do.

@jreback

@selasley can you rebase this.

and can you post a memory comparison of old vs new? (use a relatively small example)

@selasley

I'll try again to rebase tonight.
I created a text file with 10000000 rows to skip, each containing 8 fields, followed by two rows each with 4 integers. In ipython3
%timeit -r 5 -n 5 df = pd.read_csv('bigfile.txt', header=None, delim_whitespace=True, skiprows=10000000)
5 loops, best of 5: 5.74 s per loop for new pandas
5 loops, best of 5: 10.3 s per loop for pandas 0.15.1 with numba and BottleNeck

According to the Allocations instrument with Xcode, reading the file allocated 4.67GB total with pandas 0.15.1 and 1.40GB with the new pandas.

With 100000 rows to skip the numbers are
5 loops, best of 5: 43.9 ms per loop, 29.23MB for new pandas
5 loops, best of 5: 88.0 ms per loop, 71.45MB for pandas 0.15.1
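A scaled-down sketch of the measurement described above (the filler-row count is reduced so it runs quickly, and the exact contents of the filler rows are an assumption; the original used a file with 10,000,000 rows of 8 fields):

```python
import time
from io import StringIO
import pandas as pd

n_skip = 100_000  # assumed filler-row count, scaled down from the original

buf = StringIO()
for _ in range(n_skip):
    buf.write("a b c d e f g h \n")  # 8 fields plus a trailing space
buf.write("1 2 3 4\n")
buf.write("5 6 7 8\n")
buf.seek(0)

t0 = time.perf_counter()
df = pd.read_csv(buf, header=None, sep=r"\s+", skiprows=n_skip)
elapsed = time.perf_counter() - t0
print(f"parsed {df.shape} in {elapsed:.3f}s")
```

The memory figures in the comment came from external profiling (Xcode's Allocations instrument), which this sketch does not reproduce.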

@selasley

I found a problem with custom line terminators in 0.15.1 and in my version.

pd.read_csv(StringIO('A B C_0 1 2_3 4 5'), lineterminator='', sep=' ')

works as expected, but

pd.read_csv(StringIO('A B C_0 1 2_3 4 5'.replace(' ', ',')), lineterminator='', sep=r',+')
pd.read_csv(StringIO('A B C_0 1 2_3 4 5'), lineterminator='', delim_whitespace=True)
pd.read_csv(StringIO('A B C_0 1 2_3 4 5'), lineterminator='', sep=r'\ +')

all return an empty DataFrame. I would like to fix this.
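The literal line-terminator character in the calls above appears to have been lost when this page was captured; judging from the `_`-separated sample strings, it was most likely `'_'`. A sketch of the working case under that assumption:

```python
from io import StringIO
import pandas as pd

# Assumes the custom line terminator was '_' (the literal character did
# not survive the page capture). A custom lineterminator requires the C
# engine, so only single-character separators can be combined with it;
# the regex separators in the failing calls forced a different code path.
df = pd.read_csv(StringIO("A B C_0 1 2_3 4 5"), lineterminator="_", sep=" ")
print(df)
```

On current pandas this yields a frame with columns A, B, C and two rows.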

@jreback

@selasley fine to fix (I think there is an open issue about line terminators - can't find it in a quick search)
pls put in separate commits (or do a separate pr)
the skiprows memory issue is ok in here
but need to rebase

@selasley

I think I rebased properly this time for the trailing spaces fix. Thanks for your patience

@jreback

rebase looks good!

see: https://github.com/pydata/pandas/wiki/Performance-Testing

need to test with -r csv
eg the vbenches related to csv reading
pls post results

I think it makes sense to add a vbench for skiprows as well
don't make the example so long maybe
I'd like them to take less than say 500ms (or so) for this type of bench - maybe 20k rows, skip first 10k

@selasley

I added a 20000 row, skiprows=10000 benchtest to io_bench.py

./test_perf.sh -b master -t HEAD -r csv

Invoked with :
--ncalls: 3
--repeats: 3


-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
frame_to_csv_date_formatting                 |  21.4180 |  26.3120 |   0.8140 |
read_csv_skiprows                            |  16.4446 |  18.6180 |   0.8833 |
read_csv_precise_converter                   |   1.8516 |   1.9027 |   0.9732 |
frame_to_csv                                 | 159.6503 | 163.0680 |   0.9790 |
read_csv_thou_vb                             |  24.6903 |  25.0720 |   0.9848 |
frame_to_csv2                                | 149.5183 | 150.1943 |   0.9955 |
write_csv_standard                           |  55.2851 |  55.2817 |   1.0001 |
frame_to_csv_mixed                           | 967.0726 | 962.2053 |   1.0051 |
packers_write_csv                            | 1647.7884 | 1638.5774 |   1.0056 |
read_csv_infer_datetime_format_custom        |  13.7510 |  13.6327 |   1.0087 |
read_csv_infer_datetime_format_iso8601       |   2.2984 |   2.2717 |   1.0118 |
read_csv_infer_datetime_format_ymd           |   2.4959 |   2.4653 |   1.0124 |
read_csv_roundtrip_converter                 |   2.8143 |   2.7333 |   1.0296 |
packers_read_csv                             | 220.8546 | 213.4230 |   1.0348 |
read_csv_standard                            |  13.8597 |  13.3590 |   1.0375 |
read_csv_vb                                  |  26.3373 |  25.1070 |   1.0490 |
read_csv_default_converter                   |   2.0603 |   1.9317 |   1.0666 |
read_csv_comment2                            |  32.6854 |  30.3557 |   1.0767 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

Ratio < 1.0 means the target commit is faster than the baseline.
Seed used: 1234

Target [746afa2] : BUG CSV: fix problem with trailing whitespace, issues 8661, 8679
ENH CSV: improve memory footprint when skipping first N rows, issue 8681
Base   [cd93914] : BUG: Allow non-float values in get_yahoo_quotes.
Fixes issue: #5229

@jreback

perf looks good (and vbench).

you need to rebase again. somehow picking up odd things.

git rebase thisbranch -i origin/master should do it

@selasley

git rebase trailing_spaces_fix -i origin/master and git rebase -i trailing_spaces_fix origin/master gave fatal errors, so I tried the rebase that seemed to work yesterday

@jreback

@selasley pls rebase and push. is this working locally?

@selasley

I am trying to follow the recipe on the wiki. Starting from scratch
git clone https://github.com/selasley/pandas.git
git remote add upstream git://github.com/pydata/pandas
git fetch upstream
git rebase -i upstream/master
git diff usually shows that doc/source/whatsnew/v0.15.2.txt needs to be cleaned up. I do that then
git rebase --continue
git push origin +trailing_spaces_fix
Right now git rebase -i says Successfully rebased and updated refs/heads/trailing_spaces_fix.
I did add some files to my fork and pushed before reading the wiki page. Maybe that's what is causing problems. Should I start from scratch and fork from the latest master branch?

@jreback

git push origin trailing_spaces_fix -f

the -f is key

@selasley

OK, I did a git push with -f. I mistakenly thought +trailing_spaces_fix was the same as --force based on a Stack Overflow answer. Now that I've read the git push docs I see they are different.

@jreback

@selasley

I misinterpreted the failure message on Travis. It builds without errors on my mac with clang from Xcode 6.1 and Cython 0.21.1, but if I run nosetests in the pandas directory I get the ImportError: cannot import name 'hashtable' error.
I'll work on getting it to build in Travis

jreback

@@ -66,6 +66,7 @@ Enhancements
- Added support for ``utcfromtimestamp()``, ``fromtimestamp()``, and ``combine()`` on `Timestamp` class (:issue:`5351`).
- Added Google Analytics (`pandas.io.ga`) basic documentation (:issue:`8835`). See :ref:`here<remote_data.ga>`.
- Added flag ``order_categoricals`` to ``StataReader`` and ``read_stata`` to select whether to order imported categorical data (:issue:`8836`). See :ref:`here <io.stata-categorical>` for more information on importing categorical variables from Stata data files.
- Reduce memory usage when skiprows is an integer in read_csv (:issue:`8681`)


Move this to performance section

@jreback

looks good. minor comment. pls rebase / squash. ping when green.

@selasley

I moved the comment to performance in whatsnew and rebased

jreback

df = self.read_csv(StringIO(data.replace(',', ' ')),
header=None, delim_whitespace=True,
skiprows=[0,1,2,3,5,6], skip_blank_lines=True)
tm.assert_almost_equal(df.values, expected)


compare an actual dataframe here (and below), NOT the values, just tm.assert_frame_equal, this makes sure dtypes, index, columns are all correct
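The reviewer's point can be illustrated with current pandas (the helpers live in pandas.testing today; in 0.15-era code the same functions were in pandas.util.testing):

```python
from io import StringIO
import pandas as pd
import pandas.testing as tm

data = "1 2 3\n4 5 6\n"
result = pd.read_csv(StringIO(data), header=None, sep=" ")

# Comparing whole frames also checks dtypes, index and columns,
# which a comparison of .values alone would silently miss.
expected = pd.DataFrame([[1, 2, 3], [4, 5, 6]])
tm.assert_frame_equal(result, expected)
```

If, say, a column parsed as object instead of int64, assert_frame_equal would fail where a values-only comparison could still pass.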

… 8661, 8679

ENH CSV: Reduce memory usage when skiprows is an integer in read_csv, issue 8681

@selasley

replaced tm.assert_almost_equal(df.values, expected) with tm.assert_frame_equal(df, expected) where expected is now a dataframe. I had to push a couple of times. The latest travis build should succeed

jreback added a commit that referenced this pull request

Nov 27, 2014

@jreback

@jreback

@xdliao

Although it works for "skiprows = integer" now, there is still a problem with "skiprows = list of rows"
when the list does not start from row 0,
such as read_table(file, skiprows=(1,2)) when all lines have trailing spaces

@selasley

I can reproduce the problem. I'll look into it
data1="""A B C D E\n2spaces \n1space \n3spaces \nF G H I J \n0 1 2 3 4 \n5. 6 7. 8 9. """
data2="""A B C D E \n2spaces \n1space \n3spaces \nF G H I J \n0 1 2 3 4 \n5. 6 7. 8 9. """
pd.read_csv(StringIO(data1), skiprows=4, delim_whitespace=True)
F G H I J
0 0 1 2 3 4
1 5 6 7 8 9
pd.read_csv(StringIO(data1), skiprows=(1,2,3,4), delim_whitespace=True)
A B C D E
0 0 1 2 3 4
1 5 6 7 8 9
pd.read_csv(StringIO(data2), skiprows=4, delim_whitespace=True)
F G H I J
0 0 1 2 3 4
1 5 6 7 8 9
pd.read_csv(StringIO(data2), skiprows=(1,2,3,4), delim_whitespace=True)
A B C D E
0 NaN NaN NaN NaN NaN
1 F G H I J
2 0 1 2 3 4
3 5. 6 7. 8 9.

@jreback

@selasley excellent why don't u create a new issue for this (and xref this one)
@xdliao thanks for the report!

@selasley


-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
frame_to_csv_date_formatting                 |  10.0880 |  10.4263 |   0.9676 |
frame_to_csv                                 | 110.0133 | 113.4770 |   0.9695 |
write_csv_standard                           |  32.4697 |  33.2243 |   0.9773 |
read_csv_infer_datetime_format_ymd           |   1.5241 |   1.5490 |   0.9839 |
read_csv_infer_datetime_format_iso8601       |   1.3983 |   1.3986 |   0.9998 |
read_csv_comment2                            |  20.4786 |  20.3200 |   1.0078 |
packers_write_csv                            | 935.2113 | 927.8717 |   1.0079 |
read_csv_infer_datetime_format_custom        |   7.3980 |   7.3347 |   1.0086 |
packers_read_csv                             | 138.5036 | 135.7140 |   1.0206 |
frame_to_csv2                                |  91.6687 |  89.4313 |   1.0250 |
read_csv_precise_converter                   |   1.2287 |   1.1946 |   1.0285 |
frame_to_csv_mixed                           | 454.8573 | 441.8160 |   1.0295 |
read_csv_roundtrip_converter                 |   1.8710 |   1.8086 |   1.0344 |
read_csv_skiprows                            |  10.4700 |  10.0637 |   1.0404 |
read_csv_thou_vb                             |  14.9264 |  14.2634 |   1.0465 |
read_csv_standard                            |   8.9880 |   8.5763 |   1.0480 |
read_csv_vb                                  |  17.1566 |  16.2590 |   1.0552 |
read_csv_default_converter                   |   1.2707 |   1.1950 |   1.0633 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

Seed used: 1234

Target [8602c23] : BUG: Fix buffer overflows in tokenizer.c with certain malformed input files. GH9205
Base [9f439f0] : Merge pull request #9380 from behzadnouri/i8grby