ENH: Allow join based on conditional statement · Issue #34543 · pandas-dev/pandas (original) (raw)

In pandas, it is only possible to perform joins based on conditional statements that consider whether they provided keys are equal. It would be great if pandas provided a way in which joins can be performed based on some other logical condition. This is possible in many SQL engines and allows for elegant solution of specific use cases.

Examples

Setup

df1 = pd.DataFrame({'col_a': [1,2,3], 'col_b': ["A", "B", "C"]}) print(df1)

   col_a col_b
0      1     A
1      2     B
2      3     C

df2 = pd.DataFrame({'col_a': [0, 2, 3], 'col_c': ["Z", "X", "Y"]}) print(df2)

   col_a col_c
0      0     Z
1      2     X
2      3     Y

Current behavior (equal joins)

Based on conditional df1.col_a == df2.col_b

df_equal = pd.merge(left=df1, right=df2, on="col_a") print(df_equal)

   col_a col_b col_c
0      2     B     X
1      3     C     Y

Desired behavior

Not equal to join

Based on conditional df1.col_a != df2.col_a

    col_a   col_b   col_c
0	1	A	Z
1 	1	A	Y
2	1	A	X
3	2	B	Z
4	2	B	X
5	3	C	Z
6	3	C	Y

Greater than join

Based on conditional df1.col_a > df2.col_a

    col_a   col_b   col_c
0	1	A	Z
1	2	B	Z
2	3	C	Z
3	3	C	Y

And any other conditional based on logic operators.

Some considerations

Currently, it it is possible with pandas to join on multiple keys. It would be nice if it is possible to provide multiple logical conditions to perform joining on multiple keys, and allow "mixing" of the logical conditions (i.e. join on df1.col_a != df2.col_a & df1.col_b == df1.col_c). Note also that the second conditional is applied to columns of different names (i.e. col_b of df1 and col_c of df2).

Real-world scenario

Such joins can be very useful in dataframes that have datetime-like objects stored in them. Consider a dataframe df_process that contains process_id process_start_date and process_end_date columns for some business process, and a second dataframe df_events that contains a event_id, event_date column for particular events. If you want to find all events that occur during some business process, you can easily obtain these by applying the conditional join statement:
df_event.event_date >= df_process.process_start_date & df_event.event_date <= df_process.process_end_date

Implementation suggestion

I'm not familiar with how joins are performed in pandas at the moment, but it seems to me that some equality check must be performed to decide to join particular records from the dataframes involved in the join. That means (if this is correct) that there is already some conditional built-in that evaluates to True or False based on the equality (==) comparison. My request should allow passing a different conditional to this part of the pandas join machinery.

Hopefully it's clear what I am requesting!