ENH: add pd.asof_merge by jreback · Pull Request #13358 · pandas-dev/pandas


Conversation


jreback

closes #1870
xref #2941

Here is a notebook of example usage and timings

@jreback

@sinhrks @jorisvandenbossche @wesm

API question: we now have pd.merge (already existing), pd.ordered_merge (already existing), and pd.asof_merge. It makes sense to keep them separate as they take somewhat different args, but this IS adding to the namespace a bit.

chrisaycock

raise MergeError("incompatible merge keys, "
"must be the same type")
# no duplicates on rhs join keys

It would be nice if the function could handle duplicates automatically.

It's easy; my only question is whether this is too magical. IOW, would a user expect to get the dups? (e.g. it will work right now w/ dups)

OK, fixed to auto-deduplicate the right frame if needed (all internal now).
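For illustration only, one way such an internal de-duplication could work is to keep the last row per [by, on] combination; this is a hedged sketch with made-up data, not the PR's actual implementation:

```python
import pandas as pd

# Hypothetical quotes frame with a duplicate (ticker, time) pair.
quotes = pd.DataFrame({
    'time':   pd.to_datetime(['2016-05-25 13:30:00.023',
                              '2016-05-25 13:30:00.023',
                              '2016-05-25 13:30:00.048']),
    'ticker': ['MSFT', 'MSFT', 'GOOG'],
    'bid':    [51.95, 51.97, 720.50],
})

# Keep only the last row per [by, on] key, mirroring "de-duplicated on the
# [by, on] combination" from the docstring.
deduped = quotes.drop_duplicates(subset=['ticker', 'time'], keep='last')
print(deduped)
```

With `keep='last'`, the later of the two duplicate MSFT quotes survives, which matches the asof intuition of wanting the most recent value per key.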

@codecov-io

Current coverage is 84.32%

Merging #13358 into master will increase coverage by 0.03%

@@ master #13358 diff @@

Files 138 138
Lines 50929 51058 +129
Methods 0 0
Messages 0 0
Branches 0 0

Powered by Codecov. Last updated by 9670b31...4592fa2

@tacaswell

I think adding to the namespace (within reason) is better than shoehorning partially overlapping APIs together. Having args/kwargs which are unused / disallowed / have a different meaning based on the value of other inputs should be avoided.

chrisaycock

try:
rhs = right.iloc[rby.indices[key]]
except KeyError:
# key doesn't exist in left

Why wouldn't the key exist in the left if the failure is from a lookup on the right?

A missing ticker in the quotes (looks like I am hitting a bug on it ATM):

In [1]: trades = pd.read_csv('pandas/tools/tests/data/trades.csv')

In [2]: trades.time = pd.to_datetime(trades.time)

In [3]: quotes = pd.read_csv('pandas/tools/tests/data/quotes.csv')

In [4]: quotes.time = pd.to_datetime(quotes.time)

In [5]: pd.asof_merge(trades, quotes, on='time', by='ticker')
Out[5]: 
                      time ticker   price  quantity marketCenter     bid     ask
0  2016-05-25 13:30:00.023   MSFT   51.95        75       NASDAQ   51.95   51.95
1  2016-05-25 13:30:00.038   MSFT   51.95       155       NASDAQ   51.95   51.95
2  2016-05-25 13:30:00.048   GOOG  720.77       100       NASDAQ  720.50  720.93
3  2016-05-25 13:30:00.048   GOOG  720.92       100       NASDAQ  720.50  720.93
4  2016-05-25 13:30:00.048   GOOG  720.93       200       NASDAQ  720.50  720.93
5  2016-05-25 13:30:00.048   GOOG  720.93       300       NASDAQ  720.50  720.93
6  2016-05-25 13:30:00.048   GOOG  720.93       600       NASDAQ  720.50  720.93
7  2016-05-25 13:30:00.048   GOOG  720.93        44       NASDAQ  720.50  720.93
8  2016-05-25 13:30:00.074   AAPL   98.67    478343       NASDAQ     NaN     NaN
9  2016-05-25 13:30:00.075   AAPL   98.67    478343       NASDAQ   98.55   98.56
10 2016-05-25 13:30:00.075   AAPL   98.66         6       NASDAQ   98.55   98.56
11 2016-05-25 13:30:00.075   AAPL   98.65        30       NASDAQ   98.55   98.56
12 2016-05-25 13:30:00.075   AAPL   98.65        75       NASDAQ   98.55   98.56
13 2016-05-25 13:30:00.075   AAPL   98.65        20       NASDAQ   98.55   98.56
14 2016-05-25 13:30:00.075   AAPL   98.65        35       NASDAQ   98.55   98.56
15 2016-05-25 13:30:00.075   AAPL   98.65        10       NASDAQ   98.55   98.56
16 2016-05-25 13:30:00.075   AAPL   98.55         6         ARCA   98.55   98.56
17 2016-05-25 13:30:00.075   AAPL   98.55         6         ARCA   98.55   98.56
18 2016-05-25 13:30:00.076   AAPL   98.56      1000         ARCA   98.55   98.56
19 2016-05-25 13:30:00.076   AAPL   98.56       200         ARCA   98.55   98.56
20 2016-05-25 13:30:00.076   AAPL   98.56       300         ARCA   98.55   98.56
21 2016-05-25 13:30:00.076   AAPL   98.56       400         ARCA   98.55   98.56
22 2016-05-25 13:30:00.076   AAPL   98.56       600         ARCA   98.55   98.56
23 2016-05-25 13:30:00.076   AAPL   98.56       200         ARCA   98.55   98.56
24 2016-05-25 13:30:00.078   MSFT   51.95       783       NASDAQ   51.92   51.95
25 2016-05-25 13:30:00.078   MSFT   51.95       100       NASDAQ   51.92   51.95
26 2016-05-25 13:30:00.078   MSFT   51.95       100       NASDAQ   51.92   51.95

In [6]: pd.asof_merge(trades, quotes[quotes.ticker!='MSFT'], on='time', by='ticker')
IndexError: indices are out-of-bounds
In [7]: quotes[quotes.ticker!='MSFT']
Out[7]: 
                      time ticker     bid     ask
0  2016-05-25 13:30:00.023   GOOG  720.50  720.93
3  2016-05-25 13:30:00.048   GOOG  720.50  720.93
4  2016-05-25 13:30:00.048   GOOG  720.50  720.93
5  2016-05-25 13:30:00.048   GOOG  720.50  720.93
6  2016-05-25 13:30:00.048   GOOG  720.50  720.93
7  2016-05-25 13:30:00.072   GOOG  720.50  720.88
8  2016-05-25 13:30:00.075   AAPL   98.55   98.56
9  2016-05-25 13:30:00.076   AAPL   98.55   98.56
10 2016-05-25 13:30:00.076   AAPL   98.55   98.56
11 2016-05-25 13:30:00.076   AAPL   98.55   98.56

We should have NaN in the resulting DataFrame when the right misses a key, just like in a regular left merge().

(This should be part of the regression tests as well, and I totally forgot about it earlier when I came-up with examples. Sorry about that.)

In [5]: pd.asof_merge(trades, quotes[quotes.ticker!='MSFT'], on='time', by='ticker')
Out[5]: 
                      time ticker   price  quantity marketCenter     bid     ask
0  2016-05-25 13:30:00.023   MSFT   51.95        75       NASDAQ     NaN     NaN
1  2016-05-25 13:30:00.038   MSFT   51.95       155       NASDAQ     NaN     NaN
2  2016-05-25 13:30:00.048   GOOG  720.77       100       NASDAQ  720.50  720.93
3  2016-05-25 13:30:00.048   GOOG  720.92       100       NASDAQ  720.50  720.93
4  2016-05-25 13:30:00.048   GOOG  720.93       200       NASDAQ  720.50  720.93
5  2016-05-25 13:30:00.048   GOOG  720.93       300       NASDAQ  720.50  720.93
6  2016-05-25 13:30:00.048   GOOG  720.93       600       NASDAQ  720.50  720.93
7  2016-05-25 13:30:00.048   GOOG  720.93        44       NASDAQ  720.50  720.93
8  2016-05-25 13:30:00.074   AAPL   98.67    478343       NASDAQ     NaN     NaN
9  2016-05-25 13:30:00.075   AAPL   98.67    478343       NASDAQ   98.55   98.56
10 2016-05-25 13:30:00.075   AAPL   98.66         6       NASDAQ   98.55   98.56
11 2016-05-25 13:30:00.075   AAPL   98.65        30       NASDAQ   98.55   98.56
12 2016-05-25 13:30:00.075   AAPL   98.65        75       NASDAQ   98.55   98.56
13 2016-05-25 13:30:00.075   AAPL   98.65        20       NASDAQ   98.55   98.56
14 2016-05-25 13:30:00.075   AAPL   98.65        35       NASDAQ   98.55   98.56
15 2016-05-25 13:30:00.075   AAPL   98.65        10       NASDAQ   98.55   98.56
16 2016-05-25 13:30:00.075   AAPL   98.55         6         ARCA   98.55   98.56
17 2016-05-25 13:30:00.075   AAPL   98.55         6         ARCA   98.55   98.56
18 2016-05-25 13:30:00.076   AAPL   98.56      1000         ARCA   98.55   98.56
19 2016-05-25 13:30:00.076   AAPL   98.56       200         ARCA   98.55   98.56
20 2016-05-25 13:30:00.076   AAPL   98.56       300         ARCA   98.55   98.56
21 2016-05-25 13:30:00.076   AAPL   98.56       400         ARCA   98.55   98.56
22 2016-05-25 13:30:00.076   AAPL   98.56       600         ARCA   98.55   98.56
23 2016-05-25 13:30:00.076   AAPL   98.56       200         ARCA   98.55   98.56
24 2016-05-25 13:30:00.078   MSFT   51.95       783       NASDAQ     NaN     NaN
25 2016-05-25 13:30:00.078   MSFT   51.95       100       NASDAQ     NaN     NaN
26 2016-05-25 13:30:00.078   MSFT   51.95       100       NASDAQ     NaN     NaN

#4588 reared its ugly head, so had to 'fix' it.

@jreback changed the title from "ENH: add pd.asof_merge, closes #1870" to "ENH: add pd.asof_merge"

Jun 3, 2016

chris-b1

tolerance: integer or Timedelta, optional, default None
select asof tolerance within this range, must be compatible
to the merge index.
strict: boolean, default False

For what it's worth, I found this keyword name counter-intuitive. More verbose, but would something like allow_exact_matches=True be better?

"strict" means "strictly less-than", as opposed to "less-than-or-equal-to". I agree that the parameter's explanation could be clearer.

how about one of these:

  1. strict='less_than_or_equal' | 'less_than'
  2. match='less_than_or_equal' | 'less_than'
  3. match_less_than_or_equal=True | False
  4. allow_exact_matches=True | False

anyone have thoughts on strict -> new name? (from above)
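For reference, option 4 is what the function eventually shipped with in released pandas (under the name ``pd.merge_asof``). A small sketch of the two behaviours, using made-up frames:

```python
import pandas as pd

left = pd.DataFrame({'key': [1, 5, 10]})
right = pd.DataFrame({'key': [1, 2, 5], 'val': ['a', 'b', 'c']})

# Default: matches may be less-than-or-equal-to the left key
# (the behaviour "strict=False" was naming).
loose = pd.merge_asof(left, right, on='key')

# allow_exact_matches=False: matches must be strictly less-than
# (the behaviour "strict=True" was naming).
strict = pd.merge_asof(left, right, on='key', allow_exact_matches=False)

print(loose['val'].tolist())
print(strict['val'].tolist())
```

With exact matches disallowed, the left key 1 has no right key strictly below it and comes back NaN, while key 5 picks up 'b' (key 2) instead of 'c' (key 5).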

@jorisvandenbossche

@jreback Just a general comment: for me (not a financial guy :-)), the current explanation of the method (and also the examples) does not really clarify what it exactly does, at least on a first read without looking at the code (which a user should not have to do). So I think we will have to do a better job explaining it.

Some things I was thinking:

Will take a more detailed look myself later.

@wesm

@jreback

OK, taking @chrisaycock's suggestion, changed strict -> allow_exact_matches; much clearer.

Here's the new signature.

Signature: pd.asof_merge(left, right, on=None, left_on=None, right_on=None, by=None, suffixes=('_x', '_y'), tolerance=None, allow_exact_matches=True)
Docstring:
Perform an asof merge. This by definition will fill the right data with
the asof the left data. The data must be ordered according to the 'on'
(typically time). Optionally perform group-wise merge.

Asof means return the last good (non-NaN) value in object for the date in
question. This is conceptually very close to a left-outer join with
forward filling.

.. versionadded:: 0.18.2

Parameters
----------
left: DataFrame
right: DataFrame
    Data will be de-duplicated on the [by, on] combination if needed.
on: label or list
    Field names to join on. Must be found in both DataFrames.
    The data MUST be ordered. Furthermore this must be a datetimelike or
    integer-index.
left_on: label or list, or array-like
    Field names to join on in left DataFrame. Can be a vector or list of
    vectors of the length of the DataFrame to use a particular vector as
    the join key instead of columns
right_on: label or list, or array-like
    Field names to join on in right DataFrame or vector/list of vectors per
    left_on docs
by: column name or list of column names
    Group both the left and right DataFrames by the group columns; perform
    the merge operation on these pieces and recombine.
suffixes: 2-length sequence (tuple, list, ...)
    Suffix to apply to overlapping column names in the left and right
    side, respectively
tolerance: integer or Timedelta, optional, default None
    select asof tolerance within this range, must be compatible
    to the merge index.
allow_exact_matches: boolean, default True
    If True, allow matching the same 'on' value (e.g. same time)
    If False, don't match the same 'on' value
        (e.g. don't match the same time)

Returns
-------
merged : DataFrame
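A hedged usage sketch of the signature above, written against the name the function eventually shipped under in released pandas (``pd.merge_asof``); the tiny trades/quotes frames are made up for illustration of the tolerance parameter:

```python
import pandas as pd

trades = pd.DataFrame({'time': pd.to_datetime(['2016-05-25 13:30:00.023',
                                               '2016-05-25 13:30:00.048']),
                       'price': [51.95, 720.77]})
quotes = pd.DataFrame({'time': pd.to_datetime(['2016-05-25 13:30:00.000',
                                               '2016-05-25 13:30:00.047']),
                       'bid': [51.94, 720.50]})

# tolerance bounds how far back an asof match may reach: the first trade's
# nearest quote is 23ms old (> 10ms), so it gets NaN; the second trade's
# quote is 1ms old and matches.
merged = pd.merge_asof(trades, quotes, on='time',
                       tolerance=pd.Timedelta('10ms'))
print(merged)
```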

@jreback

@chrisaycock

What about this for the help statement?

Perform an asof merge. This is similar to a left-join except that we match on
nearest key rather than equal keys.

For each row in the left DataFrame, we select the last row in the right DataFrame
whose 'on' key is less than the left's key. Both DataFrames must be sorted by the
key.

Optionally perform group-wise merge. This matches the 'by' key equally, in
addition to the nearest match on the 'on' key.


Parameters
----------
left: DataFrame
right: DataFrame
    Data will be de-duplicated on the [by, on] combination if needed.
on: label or list
    Field names to join on. Must be found in both DataFrames.
    The data MUST be ordered. Furthermore this must be a datetimelike or
    integer-index.
left_on: label or list, or array-like
    Field names to join on in left DataFrame. Can be a vector or list of
    vectors of the length of the DataFrame to use a particular vector as
    the join key instead of columns
right_on: label or list, or array-like
    Field names to join on in right DataFrame or vector/list of vectors per
    left_on docs
by: column name or list of column names
    Group both the left and right DataFrames by the group columns; perform
    the merge operation on these pieces and recombine.
suffixes: 2-length sequence (tuple, list, ...)
    Suffix to apply to overlapping column names in the left and right
    side, respectively
tolerance: integer or Timedelta, optional, default None
    select asof tolerance within this range; must be compatible
    to the merge index.
allow_exact_matches: boolean, default True
    If True, allow matching the same 'on' value (i.e. less-than-or-equal-to)
    If False, don't match the same 'on' value
        (i.e., strictly less-than)

@jreback

TomAugspurger

Similar to a ``pd.ordered_merge()``, a ``pd.asof_merge()`` allows one to propagate forward
values that are missing. ``pd.asof_merge()`` allows some specific options on this propagation

This propagates forward values that are not missing, right? Also, not sure what the second sentence is saying.

I rewrote basically using the new doc-string.

@chris-b1

To @jorisvandenbossche's point, how about adding a smaller context-free example that makes it a little easier to see exactly what's going on. Something like this: (edit: changed 'right_val' to numbers so it's easier to match up the rows)

a = pd.DataFrame({'a': [1, 5, 10], 'left_val': ['a', 'b', 'c']})
b = pd.DataFrame({'a': [1, 2, 3, 6, 7], 'right_val': [1, 2, 3, 6, 7]})

pd.asof_merge(a, b, on='a')

It might also be good to show an ordered_merge on the same set of data so you can see how they are different?
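Running both on that data (using the released names, ``pd.merge_asof`` and ``pd.merge_ordered``, rather than the PR-era spellings) does make the contrast visible; a sketch:

```python
import pandas as pd

a = pd.DataFrame({'a': [1, 5, 10], 'left_val': ['a', 'b', 'c']})
b = pd.DataFrame({'a': [1, 2, 3, 6, 7], 'right_val': [1, 2, 3, 6, 7]})

# asof: keeps only the left's rows; each row gets the last right_val
# whose key is <= the left key.
asof = pd.merge_asof(a, b, on='a')
print(asof)

# ordered merge: outer-joins on the union of keys, then forward-fills,
# so every key from either side survives.
ordered = pd.merge_ordered(a, b, on='a', fill_method='ffill')
print(ordered)
```

The asof result has 3 rows (one per left row), while the ordered merge has 7 (one per distinct key across both frames).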

@jreback

sure maybe just in the actual docs / or in doc-string as well?

@chris-b1

yeah, I guess was just thinking about the docs

jorisvandenbossche

allow_exact_matches : boolean, default True
- If True, allow matching the same 'on' value
(i.e. less-than-or-equal-to)

Can you align the (i.e. with the If (otherwise Sphinx will complain about unexpected indentation)? Same for the lines below (and for check_duplicates).

oh, sure. that makes sense.

@jreback

jorisvandenbossche

A long-time requested feature has been added thru the ``pd.merge_asof()`` function, to
support asof style joining of time-series. (:issue:`1870`). Full documentaion are
:ref:`here <mergeing.merge_asof>`

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mergeing -> merging

@jorisvandenbossche

@jreback I added another bunch of doc comments (they are not critical, I can also do a PR later if you want)

But I wanted to come back to the duplicate handling for a moment. In the case where only on is specified (so no by), how duplicates (in the 'on' key column) in the right dataframe are handled depends on whether it is an 'equal' match or a 'less than' match:

In [1]: df1 = pd.DataFrame({'key':[1,1,3], 'left_val':[1,2,3]})
   ...: df2 = pd.DataFrame({'key':[1,3,3], 'right_val':[1,2,3]})
   ...:

In [2]: pd.merge_asof(df1, df2, on='key')
Out[2]:
   key  left_val  right_val
0    1         1          1
1    1         2          1
2    3         3          2
3    3         3          3

In [3]: df1 = pd.DataFrame({'key':[1,1,3], 'left_val':[1,2,3]})
   ...: df2 = pd.DataFrame({'key':[1,2,2], 'right_val':[1,2,3]})

In [4]: pd.merge_asof(df1, df2, on='key')
Out[4]:
   key  left_val  right_val
0    1         1          1
1    1         2          1
2    3         3          3

So in case of an exact match, you get both duplicates (consistent with merge), but if you have an inexact match, only the last row of the duplicates is used. This seems a bit inconsistent to me.

(stressing again that I do not know the financial application of this method, but am just looking at it based on my understanding of the general concept of an asof join, so I don't know what the 'logical' way is to treat duplicates. I just notice that it is inconsistent with merge/merge_ordered)

@jreback

@jreback

Your example of the duplicates is correct. This is exactly the purpose of an asof merge: NOT to duplicate (pun!) the point of merge. An asof merge will get you exactly 1 value for each row of the left. That's why duplicates are a bit odd; you normally don't have them.
Your comparison is not quite right, though: you are emulating an inner join (as the how is implicit), but this is really a how='left'.

In [5]: pd.merge_asof(df1, df2, on='key')
Out[5]: 
   key  left_val  right_val
0    1         1          1
1    1         2          1
2    3         3          3

In [6]: pd.merge(df1, df2, on='key', how='left')
Out[6]: 
   key  left_val  right_val
0    1         1        1.0
1    1         2        1.0
2    3         3        NaN

I added a test with your examples in any event (as well as updated the docs).
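The "last row whose key is <=" rule behind these outputs can be sketched with ``searchsorted``; this is a plain pandas/NumPy illustration of the semantics, not the PR's actual (Cython) implementation:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'key': [1, 1, 3], 'left_val': [1, 2, 3]})
df2 = pd.DataFrame({'key': [1, 2, 2], 'right_val': [1, 2, 3]})

# For each left key, find the position of the last right row whose key
# is <= the left key (side='right' then step back one).
idx = df2['key'].searchsorted(df1['key'], side='right') - 1

# idx == -1 would mean "no right key at or below this left key" -> NaN,
# mirroring how='left' semantics.
out = df1.assign(right_val=np.where(idx >= 0,
                                    df2['right_val'].to_numpy()[idx.clip(0)],
                                    np.nan))
print(out)
```

On this data the result matches the merge_asof output above: both left rows with key 1 take the right row with key 1, and key 3 takes the last of the duplicate key-2 rows.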

@jreback

@jorisvandenbossche

(sorry for the late answer)
OK, I see the logic in that (and the consistency with a left merge -> only the NaN is filled). But on the other hand, the "nearest" and "last row" wording in the docstring ("match on nearest key rather than equal keys" and "we select the last row in the right DataFrame whose 'on' key is less than or equal to the left's key") does not really apply to the 'equal' case (not the last row, but all rows with an equal key are selected), only to the 'less than' case.

Further, I have another question on the duplicates. There is also a difference in the handling of duplicates depending on whether by is used or not (using the same example as above, but with a 'by' column consisting of one value, for simplicity):

In [1]: df1 = pd.DataFrame({'key':[1,1,3], 'by':['a']*3, 'left_val':[1,2,3]})
   ...: df2 = pd.DataFrame({'key':[1,3,3], 'by':['a']*3, 'right_val':[1,2,3]})
   ...:

In [2]: pd.merge_asof(df1, df2, on='key')
Out[2]:
  by_x  key  left_val by_y  right_val
0    a    1         1    a          1
1    a    1         2    a          1
2    a    3         3    a          2
3    a    3         3    a          3

In [3]: pd.merge_asof(df1, df2, on='key', by='by')
Out[3]:
  by  key  left_val  right_val
0  a    1         1          1
1  a    1         2          1
2  a    3         3          3

So when not using by, duplicate equal keys in the right frame are all used in the result, while when using by, only the last row is used.

@jorisvandenbossche

(BTW, it's OK for me to merge this version while I further ask questions; it's also easier to test when it's in master, and to further clean up the docs)

@jorisvandenbossche

Using a slightly different example (combining both cases, duplicate equal and non-equal keys in the right frame, in one dataframe):

In [4]: df1 = pd.DataFrame({'key':[1,1,3,5], 'by':['a']*4, 'left_val':[1,2,3,4]})
   ...: df2 = pd.DataFrame({'key':[1,2,2,5,5], 'by':['a']*5, 'right_val':[1,2,3,4,5]})
   ...:

In [5]: pd.merge_asof(df1, df2, on='key')
Out[5]:
  by_x  key  left_val by_y  right_val
0    a    1         1    a          1
1    a    1         2    a          1
2    a    3         3    a          3
3    a    5         4    a          4
4    a    5         4    a          5

In [6]: pd.merge_asof(df1, df2, on='key', by='by')
Out[6]:
  by  key  left_val  right_val
0  a    1         1          1
1  a    1         2          1
2  a    3         3          3
3  a    5         4          5

In [7]: pd.merge_asof(df1, df2, on='key', by='by', check_duplicates=False)
Out[7]:
  by  key  left_val  right_val
0  a    1         1          1
1  a    1         2          1
2  a    3         3          3
3  a    5         4          4
4  a    5         4          5

So actually (in this case), check_duplicates=False in the presence of duplicates (which is not the purpose of this keyword: following the docs, you should use it when you are certain there are no duplicates) gives you behaviour similar to when no by is used.

@jorisvandenbossche

@jreback OK, I just saw that you actually changed this (check_duplicates is now also used without by); sorry, I didn't update my branch for testing...
So ignore all my comments above about the inconsistency between using by or not!

@jreback

OK, merged; will see about docs and probably do a follow-up.

jreback added a commit that referenced this pull request

Jun 17, 2016

@jreback