ENH: add pd.asof_merge by jreback · Pull Request #13358 · pandas-dev/pandas
Conversation
Here is a notebook of example usage and timings
@sinhrks @jorisvandenbossche @wesm
API question: we now have pd.merge (already existing), pd.ordered_merge (already existing), and pd.asof_merge. It makes sense to keep them separate as they take somewhat different args, but this IS adding to the namespace a bit.
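For reference, a minimal side-by-side sketch of the three entry points (hedged: this uses the pd.merge_ordered / pd.merge_asof names the functions eventually shipped under, on made-up toy frames):

import pandas as pd

left = pd.DataFrame({'key': [1, 5, 10], 'left_val': ['a', 'b', 'c']})
right = pd.DataFrame({'key': [1, 2, 3, 6, 7], 'right_val': [1, 2, 3, 6, 7]})

pd.merge(left, right, on='key', how='left')    # equal keys only; keys 5 and 10 get NaN
pd.merge_ordered(left, right, on='key')        # ordered (outer) union of the keys
pd.merge_asof(left, right, on='key')           # last right key <= each left key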
raise MergeError("incompatible merge keys, " |
---|
"must be the same type") |
# no duplicates on rhs join keys |
It would be nice if the function could handle duplicates automatically.
it's easy; my only question is whether this is too magical. IOW, would a user expect to get the dups? (e.g. it will work right now w/ dups)
ok, fixed to auto-deduplicate right if needed (all internal now).
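For context, a minimal sketch of what "de-duplicate the right on the [by, on] combination" could look like (a hypothetical helper for illustration, not the PR's actual internal code):

import pandas as pd

def dedup_right(right, on, by=None):
    # keep only the last row per [by, on] combination so each asof lookup is unambiguous
    if by is None:
        subset = [on]
    else:
        subset = ([by] if isinstance(by, str) else list(by)) + [on]
    return right.drop_duplicates(subset=subset, keep='last')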
Current coverage is 84.32%
@@ master #13358 diff @@
Files 138 138
Lines 50929 51058 +129
Methods 0 0
Messages 0 0
Branches 0 0
- Hits 42926 43055 +129
Misses 8003 8003
Partials 0 0
Powered by Codecov. Last updated by 9670b31...4592fa2
I think adding to the name space (with in reason) is better than shoe horning partially overlapping APIs together. Having args/kwargs which are not used / disallowed / have different meaning based on the value of other input should be avoided.
try:
    rhs = right.iloc[rby.indices[key]]
except KeyError:
    # key doesn't exist in left
Why wouldn't the key exist in the left if the failure is from a lookup on the right?
missing ticker for a quote (looks like I am bugging on it ATM):
In [1]: trades = pd.read_csv('pandas/tools/tests/data/trades.csv')
In [2]: trades.time = pd.to_datetime(trades.time)
In [3]: quotes = pd.read_csv('pandas/tools/tests/data/quotes.csv')
In [4]: quotes.time = pd.to_datetime(quotes.time)
In [5]: pd.asof_merge(trades, quotes, on='time', by='ticker')
Out[5]:
time ticker price quantity marketCenter bid ask
0 2016-05-25 13:30:00.023 MSFT 51.95 75 NASDAQ 51.95 51.95
1 2016-05-25 13:30:00.038 MSFT 51.95 155 NASDAQ 51.95 51.95
2 2016-05-25 13:30:00.048 GOOG 720.77 100 NASDAQ 720.50 720.93
3 2016-05-25 13:30:00.048 GOOG 720.92 100 NASDAQ 720.50 720.93
4 2016-05-25 13:30:00.048 GOOG 720.93 200 NASDAQ 720.50 720.93
5 2016-05-25 13:30:00.048 GOOG 720.93 300 NASDAQ 720.50 720.93
6 2016-05-25 13:30:00.048 GOOG 720.93 600 NASDAQ 720.50 720.93
7 2016-05-25 13:30:00.048 GOOG 720.93 44 NASDAQ 720.50 720.93
8 2016-05-25 13:30:00.074 AAPL 98.67 478343 NASDAQ NaN NaN
9 2016-05-25 13:30:00.075 AAPL 98.67 478343 NASDAQ 98.55 98.56
10 2016-05-25 13:30:00.075 AAPL 98.66 6 NASDAQ 98.55 98.56
11 2016-05-25 13:30:00.075 AAPL 98.65 30 NASDAQ 98.55 98.56
12 2016-05-25 13:30:00.075 AAPL 98.65 75 NASDAQ 98.55 98.56
13 2016-05-25 13:30:00.075 AAPL 98.65 20 NASDAQ 98.55 98.56
14 2016-05-25 13:30:00.075 AAPL 98.65 35 NASDAQ 98.55 98.56
15 2016-05-25 13:30:00.075 AAPL 98.65 10 NASDAQ 98.55 98.56
16 2016-05-25 13:30:00.075 AAPL 98.55 6 ARCA 98.55 98.56
17 2016-05-25 13:30:00.075 AAPL 98.55 6 ARCA 98.55 98.56
18 2016-05-25 13:30:00.076 AAPL 98.56 1000 ARCA 98.55 98.56
19 2016-05-25 13:30:00.076 AAPL 98.56 200 ARCA 98.55 98.56
20 2016-05-25 13:30:00.076 AAPL 98.56 300 ARCA 98.55 98.56
21 2016-05-25 13:30:00.076 AAPL 98.56 400 ARCA 98.55 98.56
22 2016-05-25 13:30:00.076 AAPL 98.56 600 ARCA 98.55 98.56
23 2016-05-25 13:30:00.076 AAPL 98.56 200 ARCA 98.55 98.56
24 2016-05-25 13:30:00.078 MSFT 51.95 783 NASDAQ 51.92 51.95
25 2016-05-25 13:30:00.078 MSFT 51.95 100 NASDAQ 51.92 51.95
26 2016-05-25 13:30:00.078 MSFT 51.95 100 NASDAQ 51.92 51.95
In [6]: pd.asof_merge(trades, quotes[quotes.ticker!='MSFT'], on='time', by='ticker')
IndexError: indices are out-of-bounds
In [7]: quotes[quotes.ticker!='MSFT']
Out[7]:
time ticker bid ask
0 2016-05-25 13:30:00.023 GOOG 720.50 720.93
3 2016-05-25 13:30:00.048 GOOG 720.50 720.93
4 2016-05-25 13:30:00.048 GOOG 720.50 720.93
5 2016-05-25 13:30:00.048 GOOG 720.50 720.93
6 2016-05-25 13:30:00.048 GOOG 720.50 720.93
7 2016-05-25 13:30:00.072 GOOG 720.50 720.88
8 2016-05-25 13:30:00.075 AAPL 98.55 98.56
9 2016-05-25 13:30:00.076 AAPL 98.55 98.56
10 2016-05-25 13:30:00.076 AAPL 98.55 98.56
11 2016-05-25 13:30:00.076 AAPL 98.55 98.56
We should have NaN in the resulting DataFrame when the right misses a key, just like in a regular left merge().
(This should be part of the regression tests as well, and I totally forgot about it earlier when I came up with examples. Sorry about that.)
In [5]: pd.asof_merge(trades, quotes[quotes.ticker!='MSFT'], on='time', by='ticker')
Out[5]:
time ticker price quantity marketCenter bid ask
0 2016-05-25 13:30:00.023 MSFT 51.95 75 NASDAQ NaN NaN
1 2016-05-25 13:30:00.038 MSFT 51.95 155 NASDAQ NaN NaN
2 2016-05-25 13:30:00.048 GOOG 720.77 100 NASDAQ 720.50 720.93
3 2016-05-25 13:30:00.048 GOOG 720.92 100 NASDAQ 720.50 720.93
4 2016-05-25 13:30:00.048 GOOG 720.93 200 NASDAQ 720.50 720.93
5 2016-05-25 13:30:00.048 GOOG 720.93 300 NASDAQ 720.50 720.93
6 2016-05-25 13:30:00.048 GOOG 720.93 600 NASDAQ 720.50 720.93
7 2016-05-25 13:30:00.048 GOOG 720.93 44 NASDAQ 720.50 720.93
8 2016-05-25 13:30:00.074 AAPL 98.67 478343 NASDAQ NaN NaN
9 2016-05-25 13:30:00.075 AAPL 98.67 478343 NASDAQ 98.55 98.56
10 2016-05-25 13:30:00.075 AAPL 98.66 6 NASDAQ 98.55 98.56
11 2016-05-25 13:30:00.075 AAPL 98.65 30 NASDAQ 98.55 98.56
12 2016-05-25 13:30:00.075 AAPL 98.65 75 NASDAQ 98.55 98.56
13 2016-05-25 13:30:00.075 AAPL 98.65 20 NASDAQ 98.55 98.56
14 2016-05-25 13:30:00.075 AAPL 98.65 35 NASDAQ 98.55 98.56
15 2016-05-25 13:30:00.075 AAPL 98.65 10 NASDAQ 98.55 98.56
16 2016-05-25 13:30:00.075 AAPL 98.55 6 ARCA 98.55 98.56
17 2016-05-25 13:30:00.075 AAPL 98.55 6 ARCA 98.55 98.56
18 2016-05-25 13:30:00.076 AAPL 98.56 1000 ARCA 98.55 98.56
19 2016-05-25 13:30:00.076 AAPL 98.56 200 ARCA 98.55 98.56
20 2016-05-25 13:30:00.076 AAPL 98.56 300 ARCA 98.55 98.56
21 2016-05-25 13:30:00.076 AAPL 98.56 400 ARCA 98.55 98.56
22 2016-05-25 13:30:00.076 AAPL 98.56 600 ARCA 98.55 98.56
23 2016-05-25 13:30:00.076 AAPL 98.56 200 ARCA 98.55 98.56
24 2016-05-25 13:30:00.078 MSFT 51.95 783 NASDAQ NaN NaN
25 2016-05-25 13:30:00.078 MSFT 51.95 100 NASDAQ NaN NaN
26 2016-05-25 13:30:00.078 MSFT 51.95 100 NASDAQ NaN NaN
#4588 reared its ugly head, so had to 'fix' it.
jreback changed the title from "ENH: add pd.asof_merge, closes #1870" to "ENH: add pd.asof_merge"
tolerance: integer or Timedelta, optional, default None
    select asof tolerance within this range, must be compatible
    to the merge index.
strict: boolean, default False
For what it's worth, I found this keyword name counter-intuitive. More verbose, but would something like allow_exact_matches=True be better?
"strict" means "strictly less-than", as opposed to "less-than-or-equal-to". I agree that the parameter's explanation could be clearer.
how about one of these:
strict='less_than_or_equal' | 'less_than'
match='less_than_or_equal' | 'less_than'
match_less_than_or_equal=True | False
allow_exact_matches=True | False
anyone have thoughts on strict -> new name? (from above)
@jreback Just a general comment: for me (as a not financial guy :-)), the current explanation of the method (and also the examples) does not really clarify what it exactly does (at a first read, without looking at the code, which a user should not have to do). So we will have to do a better job explaining it, I think.
Some things I was thinking:
- 'asof' is not a well-known term I think, so we need to explain it here as well (you should not have to look at the .asof() docstring to know what it means).
- maybe a more dummy example would also help? (the current example is rather detailed (milliseconds, ..), making it more difficult to grasp at a glance what it does) But we can certainly keep the current example as an actual practical example!
Will take a more detailed look myself later.
Ok, taking @chrisaycock's suggestion, changed strict -> allow_exact_matches, much more clear.
Here's the new signature.
Signature: pd.asof_merge(left, right, on=None, left_on=None, right_on=None, by=None, suffixes=('_x', '_y'), tolerance=None, allow_exact_matches=True)
Docstring:
Perform an asof merge. This by definition will fill the right data with
the asof the left data. The data must be ordered according to the 'on'
(typically time). Optionally perform group-wise merge.
Asof means return the last good (non-NaN) value in object for the date in
question. This is conceptually very close to a left-outer join with
forward filling.
.. versionadded 0.18.2
Parameters
----------
left: DataFrame
right: DataFrame
Data will be de-duplicated on the [by, on] combination if needed.
on: label or list
Field names to join on. Must be found in both DataFrames.
The data MUST be ordered. Furthermore this must be a datetimelike or
integer-index.
left_on: label or list, or array-like
Field names to join on in left DataFrame. Can be a vector or list of
vectors of the length of the DataFrame to use a particular vector as
the join key instead of columns
right_on: label or list, or array-like
Field names to join on in right DataFrame or vector/list of vectors per
left_on docs
by: column name or list of column names
Group both the left and right DataFrames by the group columns; perform
the merge operation on these pieces and recombine.
suffixes: 2-length sequence (tuple, list, ...)
Suffix to apply to overlapping column names in the left and right
side, respectively
tolerance: integer or Timedelta, optional, default None
select asof tolerance within this range, must be compatible
to the merge index.
allow_exact_matches: boolean, default True
If True, allow matching the same 'on' value (e.g. same time)
If False, don't match the same 'on' value
(e.g. don't match the same time)
Returns
-------
merged : DataFrame
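To make the tolerance keyword concrete, a hedged toy example (made-up trades/quotes frames and column names, using the pd.merge_asof name the function eventually shipped under):

import pandas as pd

trades = pd.DataFrame({'time': pd.to_datetime(['2016-05-25 13:30:00.023',
                                               '2016-05-25 13:30:00.048',
                                               '2016-05-25 13:30:00.074']),
                       'price': [51.95, 720.77, 98.67]})
quotes = pd.DataFrame({'time': pd.to_datetime(['2016-05-25 13:30:00.023',
                                               '2016-05-25 13:30:00.030',
                                               '2016-05-25 13:30:00.041']),
                       'bid': [51.95, 51.97, 51.99]})

# without tolerance: every trade picks up the last quote at or before its time
pd.merge_asof(trades, quotes, on='time')

# with a 2ms tolerance: the .023 trade still matches the .023 quote,
# but the .048 and .074 trades get NaN because the nearest earlier quote (.041) is too old
pd.merge_asof(trades, quotes, on='time', tolerance=pd.Timedelta('2ms'))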
What about this for the help statement?
Perform an asof merge. This is similar to a left-join except that we match on
nearest key rather than equal keys.
For each row in the left DataFrame, we select the last row in the right DataFrame
whose 'on' key is less than the left's key. Both DataFrames must be sorted by the
key.
Optionally perform group-wise merge. This matches the 'by' key equally, in
addition to the nearest match on the 'on' key.
Parameters
----------
left: DataFrame
right: DataFrame
Data will be de-duplicated on the [by, on] combination if needed.
on: label or list
Field names to join on. Must be found in both DataFrames.
The data MUST be ordered. Furthermore this must be a datetimelike or
integer-index.
left_on: label or list, or array-like
Field names to join on in left DataFrame. Can be a vector or list of
vectors of the length of the DataFrame to use a particular vector as
the join key instead of columns
right_on: label or list, or array-like
Field names to join on in right DataFrame or vector/list of vectors per
left_on docs
by: column name or list of column names
Group both the left and right DataFrames by the group columns; perform
the merge operation on these pieces and recombine.
suffixes: 2-length sequence (tuple, list, ...)
Suffix to apply to overlapping column names in the left and right
side, respectively
tolerance: integer or Timedelta, optional, default None
select asof tolerance within this range; must be compatible
to the merge index.
allow_exact_matches: boolean, default True
If True, allow matching the same 'on' value (i.e. less-than-or-equal-to)
If False, don't match the same 'on' value
(i.e., strictly less-than)
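And a hedged toy illustration of the allow_exact_matches switch itself (made-up frames; using the pd.merge_asof name the function eventually shipped under):

import pandas as pd

left = pd.DataFrame({'key': [1, 5, 10], 'left_val': ['a', 'b', 'c']})
right = pd.DataFrame({'key': [1, 2, 3, 6, 7], 'right_val': [1, 2, 3, 6, 7]})

# default (True): key=1 is allowed to match the equal right key 1
pd.merge_asof(left, right, on='key', allow_exact_matches=True)

# False: key=1 needs a strictly smaller right key; there is none, so it gets NaN
pd.merge_asof(left, right, on='key', allow_exact_matches=False)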
Similar to an ``pd.ordered_merge()``, an ``pd.asof_merge()`` allows one to propogate forward
values that are missing. ``pd.asof_merge()`` allows some specific options on this propogation
This propagates forward values that are not missing, right? Also, not sure what the second sentence is saying.
I rewrote basically using the new doc-string.
To @jorisvandenbossche's point, how about adding a smaller context-free example that makes it a little easier to see exactly what's going on. Something like this: (edit: changed 'right_val' to numbers so it's easier to match up the rows)
a = pd.DataFrame({'a': [1, 5, 10], 'left_val': ['a', 'b', 'c']})
b = pd.DataFrame({'a': [1, 2, 3, 6, 7], 'right_val': [1, 2, 3, 6, 7]})
pd.asof_merge(a, b, on='a')
It might also be good to show an ordered_merge on the same set of data so you can see how they are different?
sure maybe just in the actual docs / or in doc-string as well?
yeah, I guess was just thinking about the docs
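A hedged sketch of what that side-by-side doc example could look like, on the same toy frames as suggested above (using the pd.merge_asof / pd.merge_ordered names that eventually shipped):

import pandas as pd

a = pd.DataFrame({'a': [1, 5, 10], 'left_val': ['a', 'b', 'c']})
b = pd.DataFrame({'a': [1, 2, 3, 6, 7], 'right_val': [1, 2, 3, 6, 7]})

# asof: one row per left row; left keys 1, 5, 10 pick up right_val 1, 3, 7
pd.merge_asof(a, b, on='a')

# ordered merge: the ordered (outer) union of keys 1, 2, 3, 5, 6, 7, 10,
# with NaN where a side has no value
pd.merge_ordered(a, b, on='a')

# forward-filling makes the relationship to the asof result easier to see
pd.merge_ordered(a, b, on='a', fill_method='ffill')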
allow_exact_matches : boolean, default True
- If True, allow matching the same 'on' value
(i.e. less-than-or-equal-to)
Can you align the (i.e. with If (otherwise sphinx will complain about unexpected indentation). Same for the lines below (and for check_duplicates).
oh, sure. that makes sense.
A long-time requested feature has been added thru the ``pd.merge_asof()`` function, to
support asof style joining of time-series. (:issue:`1870`). Full documentaion are
:ref:`here <mergeing.merge_asof>`
mergeing -> merging
@jreback I added another bunch of doc comments (they are not critical, I can also do a PR later if you want)
But I wanted to come back to the duplicate handling for a moment. In case only on is specified (so no by), how duplicates (in the 'on' key column) in the right dataframe are handled depends on whether it is an 'equal' match or a 'less than' match:
In [1]: df1 = pd.DataFrame({'key':[1,1,3], 'left_val':[1,2,3]})
...: df2 = pd.DataFrame({'key':[1,3,3], 'right_val':[1,2,3]})
...:
In [2]: pd.merge_asof(df1, df2, on='key')
Out[2]:
key left_val right_val
0 1 1 1
1 1 2 1
2 3 3 2
3 3 3 3
In [3]: df1 = pd.DataFrame({'key':[1,1,3], 'left_val':[1,2,3]})
...: df2 = pd.DataFrame({'key':[1,2,2], 'right_val':[1,2,3]})
In [4]: pd.merge_asof(df1, df2, on='key')
Out[4]:
key left_val right_val
0 1 1 1
1 1 2 1
2 3 3 3
So in case of an exact match, you get both duplicates (consistent with merge), but if you have an inexact match, only the last row of the duplicates is used. This seems a bit inconsistent to me.
(Stressing again that I do not know the financial application of this method, but just look at it based on my understanding of the general concept of an asof join, so I don't know what the 'logical' way is to treat duplicates. I just notice that it is inconsistent with merge/merge_ordered.)
your example of the duplicates is correct. This is exactly the purpose of an asof merge, NOT to duplicate (pun!) the point of merge. An asof-merge will get you exactly 1 value for each row of the left. That's why duplicates are a bit odd; you normally don't have them.
Your comparison is not quite right, though: you are now emulating an inner join (as the how is implicit), rather than a how='left'.
In [5]: pd.merge_asof(df1, df2, on='key')
Out[5]:
key left_val right_val
0 1 1 1
1 1 2 1
2 3 3 3
In [6]: pd.merge(df1, df2, on='key', how='left')
Out[6]:
key left_val right_val
0 1 1 1.0
1 1 2 1.0
2 3 3 NaN
I added a test with your examples in any event (as well as updated the docs).
(sorry for the late answer)
OK, I see the logic in that (and the consistency with a left merge -> only the NaN is filled), but on the other hand, the "nearest" and "last row" in the docstring ("match on nearest key rather than equal keys." and "we select the last row in the right DataFrame whose 'on' key is less than or equal to the left's key.") do not really apply to the 'equal' case (not the last row, but all rows with an equal key are selected), only to the 'less than' case.
Further, I have another question on the duplicates. There is also a difference in the handling of duplicates when using by or not (using the same example as above, but with a 'by' column (consisting of one value, for simplicity of showing)):
In [1]: df1 = pd.DataFrame({'key':[1,1,3], 'by':['a']*3, 'left_val':[1,2,3]})
...: df2 = pd.DataFrame({'key':[1,3,3], 'by':['a']*3, 'right_val':[1,2,3]})
...:
In [2]: pd.merge_asof(df1, df2, on='key')
Out[2]:
by_x key left_val by_y right_val
0 a 1 1 a 1
1 a 1 2 a 1
2 a 3 3 a 2
3 a 3 3 a 3
In [3]: pd.merge_asof(df1, df2, on='key', by='by')
Out[3]:
by key left_val right_val
0 a 1 1 1
1 a 1 2 1
2 a 3 3 3
So when not using by, duplicate equal keys in the right frame are all used in the result, while when using by, only the last row is used.
(BTW, it's OK for me to merge this version while I further ask questions, also easier to test when in master and to further clean-up docs)
Using a slightly different example (combining both cases - duplicate equal and non-equal keys in the right frame - in one dataframe):
In [4]: df1 = pd.DataFrame({'key':[1,1,3,5], 'by':['a']*4, 'left_val':[1,2,3,4]})
...: df2 = pd.DataFrame({'key':[1,2,2,5,5], 'by':['a']*5, 'right_val':[1,2,3,4,5]})
...:
In [5]: pd.merge_asof(df1, df2, on='key')
Out[5]:
by_x key left_val by_y right_val
0 a 1 1 a 1
1 a 1 2 a 1
2 a 3 3 a 3
3 a 5 4 a 4
4 a 5 4 a 5
In [6]: pd.merge_asof(df1, df2, on='key', by='by')
Out[6]:
by key left_val right_val
0 a 1 1 1
1 a 1 2 1
2 a 3 3 3
3 a 5 4 5
In [7]: pd.merge_asof(df1, df2, on='key', by='by', check_duplicates=False)
Out[7]:
by key left_val right_val
0 a 1 1 1
1 a 1 2 1
2 a 3 3 3
3 a 5 4 4
4 a 5 4 5
So actually (in this case), check_duplicates=False (when having duplicates, which is not the purpose of this keyword, as following the docs you should use it when you are certain there are no duplicates) gives you similar behaviour to when no by is used.
@jreback OK, I just see you actually changed this (check_duplicates is now also used without by), sorry, I didn't update my branch for testing ...
So ignore all my comments from above about inconsistency between using by or not!
ok merged, will see about docs and do a followup probably
jreback added a commit that referenced this pull request