Update tokenizer to fix #8679 #8661 by selasley · Pull Request #8752 · pandas-dev/pandas
Conversation: 36 · Commits: 1 · Checks: 0
Conversation
Update tokenizer's handling of skipped lines.
Fixes a problem with read_csv(delim_whitespace=True) and
read_table(engine='c') when lines being skipped have trailing
whitespace.
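A minimal reproduction of the reported behavior might look like the sketch below (the data is illustrative, not taken from the linked issues; `sep=r"\s+"` is used here as the equivalent of `delim_whitespace=True`):

```python
from io import StringIO
import pandas as pd

# Hypothetical reproduction: the lines being skipped carry trailing
# whitespace, which tripped up the C tokenizer before this fix.
data = (
    "# comment with trailing spaces   \n"
    "another skipped line  \n"
    "1 2 3\n"
    "4 5 6\n"
)
df = pd.read_csv(StringIO(data), header=None, sep=r"\s+", skiprows=2)
print(df.shape)  # the two data rows should survive: (2, 3)
```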
@@ -3036,7 +3036,7 @@ def test_line_comment(self):
def test_comment_skiprows(self):
    data = """# empty
random line
random line with trailing spaces
can you leave the original test in, and add a new test (you can just copy-paste), same method is fine. Also add the issue numbers as a comment.
pls add a release note in v0.15.2 (reference both issues)
Looks good to me--might as well leave the old test in as @jreback noted above though.
Thanks for the fix @selasley!
- pls add back old test
- add release note
- rebase / squash
@selasley ok, rebase again, you'll have a merge conflict (as you are including the whole of v0.15.2.txt again). You should be fine otherwise, repush and i'll have a look.
I added my comments to whatsnew/v0.15.2.txt from e8b3862, did a rebase/squash and pushed the result. Let me know if I was supposed to do something different.
commit 3c16f33 contains a bug fix in tokenize_delim_customterm, general code cleanup and updates to various files to e8b3862 versions
you seem to be picking up other commits and then rebasing them in
what are you doing to rebase?
On Nov 10, 2014, at 3:37 AM, jreback notifications@github.com wrote:
Thanks for the link. I'll definitely read through it.
I've been doing git rebase -i HEAD~2 then changing the second pick to squash and editing the commit comments. Then I do a git push origin +trailing_spaces_fix
For the latest commit I copied several files from the latest master branch to my trailing_spaces_fix branch to make sure the tests still passed. In retrospect that was probably not a good thing to do.
@selasley can you rebase this.
and can you post a memory comparison of old vs new? (use a relatively small example)
I'll try again to rebase tonight.
I created a text file with 10000000 rows to skip, each containing 8 fields, followed by two rows each with 4 integers. In ipython3
%timeit -r 5 -n 5 df = pd.read_csv('bigfile.txt', header=None, delim_whitespace=True, skiprows=10000000)
5 loops, best of 5: 5.74 s per loop for new pandas
5 loops, best of 5: 10.3 s per loop for pandas 0.15.1 with numba and BottleNeck
According to the Allocations instrument with Xcode, reading the file allocated 4.67GB total with pandas 0.15.1 and 1.40GB with the new pandas.
With 100000 rows to skip the numbers are
5 loops, best of 5: 43.9 ms per loop, 29.23MB for new pandas
5 loops, best of 5: 88.0 ms per loop, 71.45MB for pandas 0.15.1
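The memory numbers above came from Xcode's Allocations instrument; a rough stand-in using only the standard library might look like this (sizes scaled down and illustrative; tracemalloc only sees allocations made through Python's allocator, so C-parser buffers may be undercounted):

```python
import tracemalloc
from io import StringIO
import pandas as pd

# Hedged sketch of a peak-memory measurement for the skiprows case.
n_skip = 10_000
data = "\n".join(["a b c d e f g h"] * n_skip + ["0 1 2 3", "4 5 6 7"])

tracemalloc.start()
df = pd.read_csv(StringIO(data), header=None, sep=r"\s+", skiprows=n_skip)
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(df.shape, "peak ~%.1f MB" % (peak / 1e6))
```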
I found a problem with custom line terminators in 0.15.1 and in my version
pd.read_csv(StringIO('A B C_0 1 2_3 4 5'), lineterminator='_', sep=' ')
works as expected, but
pd.read_csv(StringIO('A B C_0 1 2_3 4 5'.replace(' ', ',')), lineterminator='_', sep=r',+')
pd.read_csv(StringIO('A B C_0 1 2_3 4 5'), lineterminator='_', delim_whitespace=True)
pd.read_csv(StringIO('A B C_0 1 2_3 4 5'), lineterminator='_', sep=r'\ +')
all return an empty DataFrame. I would like to fix this.
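For contrast, a custom single-character line terminator does work with the default comma separator; the failures above only appear when it is combined with regex or whitespace separators (example data is illustrative):

```python
from io import StringIO
import pandas as pd

# A custom '_' line terminator with the default ',' separator parses
# the three rows as expected.
df = pd.read_csv(StringIO("A,B,C_0,1,2_3,4,5"), lineterminator="_")
print(df.shape)  # header row plus two data rows: (2, 3)
```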
@selasley fine to fix (I think there is an open issue about line terminators - can't find it in a quick search)
pls put in separate commits (or do a separate pr)
the skip rows memory issue ok in here
but need to rebase
I think I rebased properly this time for the trailing spaces fix. Thanks for your patience
rebase looks good!
see: https://github.com/pydata/pandas/wiki/Performance-Testing
need to test for -r csv
eg the vbenches related to csv reading
pls post results
I think it makes sense to add a vbench for skiprows as well
don't make the example so long maybe
I'd like them to take less than say 500ms (or so)
for this type of bench - maybe 20k rows skip first 10k
I added a 20000 row, skiprows=10000 benchtest to io_bench.py
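The benchmark can be approximated outside the vbench harness with a plain timeit run (this is a rough stand-in for the io_bench.py entry, not the exact code merged):

```python
import timeit
from io import StringIO
import pandas as pd

# 20,000 rows, skip the first 10,000 - mirrors the bench described above.
data = "\n".join("%d,%d,%d" % (i, i + 1, i + 2) for i in range(20_000))

def read_csv_skiprows():
    return pd.read_csv(StringIO(data), header=None, skiprows=10_000)

t = timeit.timeit(read_csv_skiprows, number=3)
df = read_csv_skiprows()
print(df.shape, "%.3fs for 3 runs" % t)
```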
./test_perf.sh -b master -t HEAD -r csv
Invoked with :
--ncalls: 3
--repeats: 3
-------------------------------------------------------------------------------
Test name | head[ms] | base[ms] | ratio |
-------------------------------------------------------------------------------
frame_to_csv_date_formatting | 21.4180 | 26.3120 | 0.8140 |
read_csv_skiprows | 16.4446 | 18.6180 | 0.8833 |
read_csv_precise_converter | 1.8516 | 1.9027 | 0.9732 |
frame_to_csv | 159.6503 | 163.0680 | 0.9790 |
read_csv_thou_vb | 24.6903 | 25.0720 | 0.9848 |
frame_to_csv2 | 149.5183 | 150.1943 | 0.9955 |
write_csv_standard | 55.2851 | 55.2817 | 1.0001 |
frame_to_csv_mixed | 967.0726 | 962.2053 | 1.0051 |
packers_write_csv | 1647.7884 | 1638.5774 | 1.0056 |
read_csv_infer_datetime_format_custom | 13.7510 | 13.6327 | 1.0087 |
read_csv_infer_datetime_format_iso8601 | 2.2984 | 2.2717 | 1.0118 |
read_csv_infer_datetime_format_ymd | 2.4959 | 2.4653 | 1.0124 |
read_csv_roundtrip_converter | 2.8143 | 2.7333 | 1.0296 |
packers_read_csv | 220.8546 | 213.4230 | 1.0348 |
read_csv_standard | 13.8597 | 13.3590 | 1.0375 |
read_csv_vb | 26.3373 | 25.1070 | 1.0490 |
read_csv_default_converter | 2.0603 | 1.9317 | 1.0666 |
read_csv_comment2 | 32.6854 | 30.3557 | 1.0767 |
-------------------------------------------------------------------------------
Ratio < 1.0 means the target commit is faster than the baseline.
Seed used: 1234
Target [746afa2] : BUG CSV: fix problem with trailing whitespace, issues 8661, 8679
ENH CSV: improve memory footprint when skipping first N rows, issue 8681
Base [cd93914] : BUG: Allow non-float values in get_yahoo_quotes.
Fixes issue: #5229
perf looks good (and vbench).
you need to rebase again. somehow picking up odd things.
git rebase thisbranch -i origin/master
should do it
git rebase trailing_spaces_fix -i origin/master and git rebase -i trailing_spaces_fix origin/master gave fatal errors so I tried the rebase that seemed to work yesterday
@selasley pls rebase and push. is this working locally?
I am trying to follow the recipe on the wiki. Starting from scratch
git clone https://github.com/selasley/pandas.git
git remote add upstream git://github.com/pydata/pandas
git fetch upstream
git rebase -i upstream/master
git diff usually shows that doc/source/whatsnew/v0.15.2.txt needs to be cleaned up. I do that then
git rebase --continue
git push origin +trailing_spaces_fix
Right now git rebase -i says Successfully rebased and updated refs/heads/trailing_spaces_fix.
I did add some files to my fork and pushed before reading the wiki page. Maybe that's what is causing problems. Should I start from scratch and fork from the latest master branch?
git push origin trailing_spaces_fix -f
the -f
is key
OK, I did a git push with -f. I mistakenly thought +trailing_spaces_fix was the same as --force from some Stack Overflow answer. Now that I read the git push docs I see they are different.
I misinterpreted the failure message on Travis. It builds without errors on my mac with clang from Xcode 6.1 and Cython 0.21.1, but if I run nosetests in the pandas directory I get the ImportError: cannot import name 'hashtable' error.
I'll work on getting it to build in Travis
@@ -66,6 +66,7 @@ Enhancements
- Added support for ``utcfromtimestamp()``, ``fromtimestamp()``, and ``combine()`` on `Timestamp` class (:issue:`5351`).
- Added Google Analytics (`pandas.io.ga`) basic documentation (:issue:`8835`). See :ref:`here<remote_data.ga>`.
- Added flag ``order_categoricals`` to ``StataReader`` and ``read_stata`` to select whether to order imported categorical data (:issue:`8836`). See :ref:`here <io.stata-categorical>` for more information on importing categorical variables from Stata data files.
- Reduce memory usage when skiprows is an integer in read_csv (:issue:`8681`)
Move this to performance section
looks good. minor comment. pls rebase / squash. ping when green.
I moved the comment to performance in whatsnew and rebased
df = self.read_csv(StringIO(data.replace(',', ' ')),
                   header=None, delim_whitespace=True,
                   skiprows=[0,1,2,3,5,6], skip_blank_lines=True)
tm.assert_almost_equal(df.values, expected)
compare an actual DataFrame here (and below), NOT the values, just tm.assert_frame_equal; this makes sure dtypes, index, and columns are all correct
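The suggested pattern can be sketched as follows (illustrative data; modern pandas exposes the helper as `pandas.testing.assert_frame_equal`, where the 0.15-era code used `pandas.util.testing`):

```python
from io import StringIO
import pandas as pd
from pandas.testing import assert_frame_equal

# Compare whole DataFrames rather than .values, so that dtypes,
# index, and columns are all checked, not just the cell contents.
result = pd.read_csv(StringIO("1 2 3\n4 5 6\n"), header=None, sep=r"\s+")
expected = pd.DataFrame([[1, 2, 3], [4, 5, 6]])
assert_frame_equal(result, expected)
print("frames match, including dtypes and labels")
```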
… 8661, 8679
ENH CSV: Reduce memory usage when skiprows is an integer in read_csv, issue 8681
Replaced tm.assert_almost_equal(df.values, expected) with tm.assert_frame_equal(df, expected), where expected is now a DataFrame. I had to push a couple of times. The latest Travis build should succeed.
jreback added a commit that referenced this pull request
Although it works for "skiprows = integer" now, there is still a problem with "skiprows = list of rows" when the list does not start from row 0, such as read_table(file, skiprows=(1, 2)) when all lines have trailing spaces.
I can reproduce the problem. I'll look into it
data1="""A B C D E\n2spaces \n1space \n3spaces \nF G H I J \n0 1 2 3 4 \n5. 6 7. 8 9. """
data2="""A B C D E \n2spaces \n1space \n3spaces \nF G H I J \n0 1 2 3 4 \n5. 6 7. 8 9. """
pd.read_csv(StringIO(data1), skiprows=4, delim_whitespace=True)
F G H I J
0 0 1 2 3 4
1 5 6 7 8 9
pd.read_csv(StringIO(data1), skiprows=(1,2,3,4), delim_whitespace=True)
A B C D E
0 0 1 2 3 4
1 5 6 7 8 9
pd.read_csv(StringIO(data2), skiprows=4, delim_whitespace=True)
F G H I J
0 0 1 2 3 4
1 5 6 7 8 9
pd.read_csv(StringIO(data2), skiprows=(1,2,3,4), delim_whitespace=True)
A B C D E
0 NaN NaN NaN NaN NaN
1 F G H I J
2 0 1 2 3 4
3 5. 6 7. 8 9.
@selasley excellent why don't u create a new issue for this (and xref this one)
@xdliao thanks for the report!
Test name | head[ms] | base[ms] | ratio |
frame_to_csv_date_formatting | 10.0880 | 10.4263 | 0.9676 |
frame_to_csv | 110.0133 | 113.4770 | 0.9695 |
write_csv_standard | 32.4697 | 33.2243 | 0.9773 |
read_csv_infer_datetime_format_ymd | 1.5241 | 1.5490 | 0.9839 |
read_csv_infer_datetime_format_iso8601 | 1.3983 | 1.3986 | 0.9998 |
read_csv_comment2 | 20.4786 | 20.3200 | 1.0078 |
packers_write_csv | 935.2113 | 927.8717 | 1.0079 |
read_csv_infer_datetime_format_custom | 7.3980 | 7.3347 | 1.0086 |
packers_read_csv | 138.5036 | 135.7140 | 1.0206 |
frame_to_csv2 | 91.6687 | 89.4313 | 1.0250 |
read_csv_precise_converter | 1.2287 | 1.1946 | 1.0285 |
frame_to_csv_mixed | 454.8573 | 441.8160 | 1.0295 |
read_csv_roundtrip_converter | 1.8710 | 1.8086 | 1.0344 |
read_csv_skiprows | 10.4700 | 10.0637 | 1.0404 |
read_csv_thou_vb | 14.9264 | 14.2634 | 1.0465 |
read_csv_standard | 8.9880 | 8.5763 | 1.0480 |
read_csv_vb | 17.1566 | 16.2590 | 1.0552 |
read_csv_default_converter | 1.2707 | 1.1950 | 1.0633 |
Seed used: 1234
Target [8602c23] : BUG: Fix buffer overflows in tokenizer.c with certain malformed input files. GH9205
Base [9f439f0] : Merge pull request #9380 from behzadnouri/i8grby