JOIN clause - Amazon Kinesis Data Analytics SQL Reference (original) (raw)

The JOIN clause in a SELECT statement combines columns from one or more streams or reference tables.

Stream-to-Stream Joins

Amazon Kinesis Data Analytics supports joining an in-application stream with another in-application stream using SQL, bringing this important traditional database functionality into the streaming context.

This section describes the types of joins that Kinesis Data Analytics supports, including time-based and row-based window joins, and the details about streaming joins.

Join Types

There are five types of joins:

INNER JOIN (or just JOIN) Returns all pairs of rows from the left and from the right for which the join condition evaluates to TRUE.
LEFT OUTER JOIN (or just LEFT JOIN) As INNER JOIN, but rows from the left are kept even if they don't match any rows on the right. NULL values are generated on the right.
RIGHT OUTER JOIN (or just RIGHT JOIN) As INNER JOIN, but rows from the right are kept even if they don't match any rows on the left. NULL values are generated on the left for these rows.
FULL OUTER JOIN (or just FULL JOIN) As INNER JOIN, but rows from both sides are kept even if they don't match any rows on the other side. NULL values are generated on the other side for these rows.
CROSS JOIN Returns the Cartesian product of the inputs: Every row from the left is paired with every row from the right.

Time-Based Window vs. Row-Based Window Joins

It isn't practical to join the entire history of the left stream to the entire history of the right. Therefore, you must restrict at least one stream to a time window by using an OVER clause. The OVER clause defines a window of rows that are to be considered for joining at a given time.

The window can be time-based or row-based:

OVER (RANGE INTERVAL '1' HOUR PRECEDING)  
OVER (ROWS 10 PRECEDING)  
Note

If no time window or row-based window is specified on the side of a join, then only the current row from that side participates in the join evaluation.

Examples of Stream-to-Stream Joins

The following examples demonstrate how an in-application stream-to-stream join works, when the results of the join are returned, and what the row times of the join results are.

Topics

Example Dataset

The examples in this section are based on the following datasets and stream definitions:

Sample of Orders Data
{
   "orderid":"101",
   "orders":"1"
}
Sample of Shipments Data
{
   "orderid":"101",
   "shipments":"2"
}
Creating the ORDERS_STREAM In-Application Stream
CREATE OR REPLACE STREAM "ORDERS_STREAM" ("orderid" int, "orderrowtime" timestamp);
CREATE OR REPLACE PUMP "ORDERS_STREAM_PUMP" AS INSERT INTO "ORDERS_STREAM"
SELECT STREAM "orderid", "ROWTIME"
FROM "SOURCE_SQL_STREAM_001" WHERE "orders" = 1;
Creating the SHIPMENTS_STREAM In-Application Stream
CREATE OR REPLACE STREAM "SHIPMENTS_STREAM" ("orderid" int, "shipmentrowtime" timestamp);
CREATE OR REPLACE PUMP "SHIPMENTS_STREAM_PUMP" AS INSERT INTO "SHIPMENTS_STREAM"
SELECT STREAM "orderid", "ROWTIME"
FROM "SOURCE_SQL_STREAM_001" WHERE "shipments" = 2;

Example 1: Time Window on One Side of a JOIN (INNER JOIN)

This example demonstrates a query that returns all orders with shipments that executed in the last minute.

Join Query
CREATE OR REPLACE STREAM "OUTPUT_STREAM" ("resultrowtime" timestamp, "orderid" int, "shipmenttime" timestamp, "ordertime" timestamp);
CREATE OR REPLACE PUMP "OUTPUT_STREAM_PUMP" AS 
INSERT INTO "OUTPUT_STREAM"
  SELECT STREAM ROWTIME as "resultrowtime", s."orderid", s.rowtime as "shipmenttime", o.ROWTIME as "ordertime"
    FROM ORDERS_STREAM OVER (RANGE INTERVAL '1' MINUTE PRECEDING) AS o
      JOIN SHIPMENTS_STREAM AS s
        ON o."orderid" = s."orderid";
Query Results
ORDERS_STREAM SHIPMENTS_STREAM OUTPUT_STREAM
ROWTIME orderid ROWTIME orderid resultrowtime orderid shipmenttime OrderTime
10:00:00 101 10:00:00 100
10:00:20 102
10:00:30 103
10:00:40 104
10:00:45 104
10:00:45 100* 10:00:45 104 10:00:45 10:00:40
10:00:50 105

* - Record with orderid = 100 is a late event in the Orders stream.

Visual Representation of the Join

The following diagram represents a query that returns all orders with shipments that executed in the last minute.

Diagram of the join between all orders (orders_stream) and the shipments (shipments_stream) that occurred in the last minute.

Triggering of Results

The following describes the events that trigger results from the query.

ROWTIMES of Results

Example 2: Time Windows on Both Sides of a JOIN (INNER JOIN)

This example demonstrates a query that returns all orders that executed in the last minute, with shipments that executed in the last minute.

Join Query
CREATE OR REPLACE STREAM "OUTPUT_STREAM" ("resultrowtime" timestamp, "orderid" int, "shipmenttime" timestamp, "ordertime" timestamp);
CREATE OR REPLACE PUMP "OUTPUT_STREAM_PUMP" AS INSERT INTO "OUTPUT_STREAM"
  SELECT STREAM ROWTIME as "resultrowtime", s."orderid", s.rowtime as "shipmenttime", o.ROWTIME as "ordertime"
    FROM ORDERS_STREAM OVER (RANGE INTERVAL '1' MINUTE PRECEDING) AS o
      JOIN SHIPMENTS_STREAM OVER (RANGE INTERVAL '1' MINUTE PRECEDING) AS s
        ON o."orderid" = s."orderid";
Query Results
ORDERS_STREAM SHIPMENTS_STREAM OUTPUT_STREAM
ROWTIME orderid ROWTIME orderid resultrowtime orderid shipmenttime OrderTime
10:00:00 101 10:00:00 100
10:00:20 102
10:00:30 103
10:00:40 104
10:00:45 104
10:00:45 100* 10:00:45 104 10:00:45 10:00:40
10:00:45 100 10:00:00 10:00:45
10:00:50 105

* - Record with orderid = 100 is a late event in the Orders stream.

Visual Representation of the Join

The following diagram represents a query that returns all orders that executed in the last minute, with shipments that executed in the last minute.

Diagram of the join between all orders occurring in last minute (orders_stream) and the shipments occurring in last minute (shipments_stream).

Triggering of Results

The following describes the events that trigger results from the query.

ROWTIMEs of Results

Example 3: Time Window on One Side of a RIGHT JOIN (RIGHT OUTER JOIN)

This example demonstrates a query that returns all shipments that executed in the last minute, whether or not there are corresponding orders in the last minute.

Join Query
CREATE OR REPLACE STREAM "OUTPUT_STREAM" ("resultrowtime" timestamp, "orderid" int, "ordertime" timestamp);
CREATE OR REPLACE PUMP "OUTPUT_STREAM_PUMP" AS INSERT INTO "OUTPUT_STREAM"
  SELECT STREAM ROWTIME as "resultrowtime", s."orderid", o.ROWTIME as "ordertime"
    FROM ORDERS_STREAM OVER (RANGE INTERVAL '1' MINUTE PRECEDING) AS o
      RIGHT JOIN SHIPMENTS_STREAM AS s
        ON o."orderid" = s."orderid";

Query Results
ORDERS_STREAM SHIPMENTS_STREAM OUTPUT_STREAM
ROWTIME orderid ROWTIME orderid resultrowtime orderid OrderTime
10:00:00 101 10:00:00 100
10:00:00 100 null
10:00:20 102
10:00:30 103
10:00:40 104
10:00:45 104
10:00:45 100* 10:00:45 104 10:00:40
10:00:50 105
10:00:50 105 null

* - Record with orderid = 100 is a late event in the Orders stream.

Visual Representation of the Join

The following diagram represents a query that returns all shipments that executed in the last minute, whether or not there are corresponding orders in the last minute.

Diagram of a query returning all shipments (shipments_stream) occurring in the last minute, whether or not there are corresponding orders (orders_stream) in the last minute

Triggering of Results

The following describes the events that trigger results from the query.

ROWTIMEs of Results

Example 4: Time Windows on Both Sides of a RIGHT JOIN (RIGHT OUTER JOIN)

This example demonstrates a query that returns all shipments that executed in the last minute, whether or not they have corresponding orders.

Join Query
CREATE OR REPLACE STREAM "OUTPUT_STREAM" ("resultrowtime" timestamp, "orderid" int, "shipmenttime" timestamp, "ordertime" timestamp);
CREATE OR REPLACE PUMP "OUTPUT_STREAM_PUMP" AS INSERT INTO "OUTPUT_STREAM"
  SELECT STREAM ROWTIME as "resultrowtime", s."orderid", s.ROWTIME as "shipmenttime", o.ROWTIME as "ordertime"
    FROM ORDERS_STREAM OVER (RANGE INTERVAL '1' MINUTE PRECEDING) AS o
      RIGHT JOIN SHIPMENTS_STREAM OVER (RANGE INTERVAL '1' MINUTE PRECEDING) AS s
        ON o."orderid" = s."orderid";

Query Results
ORDERS_STREAM SHIPMENTS_STREAM OUTPUT_STREAM
ROWTIME orderid ROWTIME orderid resultrowtime orderid shipmenttime OrderTime
10:00:00 101 10:00:00 100
10:00:20 102
10:00:30 103
10:00:40 104
10:00:45 104
10:00:45 100* 10:00:45 104 10:00:40 10:00:45
10:00:45 100 10:00:45 10:00:00
10:00:50 105
10:01:50 105 10:00:50 null

* - Record with orderid = 100 is a late event in the Orders stream.

Visual Representation of the Join

The following diagram represents a query that returns all shipments that executed in the last minute, whether or not they have corresponding orders.

Diagram of a query returning all shipments (shipments_stream) in the last minute, whether or not there are corresponding orders (orders_stream).

Triggering of Results

The following describes the events that trigger results from the query.

ROWTIMEs of Results

Example 5: Time Window on One Side of a LEFT JOIN (LEFT OUTER JOIN)

This example demonstrates a query that returns all orders that executed in the last minute, whether or not there are corresponding shipments in the last minute.

Join Query
CREATE OR REPLACE STREAM "OUTPUT_STREAM" ("resultrowtime" timestamp, "orderid" int, "ordertime" timestamp);
CREATE OR REPLACE PUMP "OUTPUT_STREAM_PUMP" AS INSERT INTO "OUTPUT_STREAM"
  SELECT STREAM ROWTIME as "resultrowtime", o."orderid", o.ROWTIME as "ordertime"
    FROM ORDERS_STREAM OVER (RANGE INTERVAL '1' MINUTE PRECEDING) AS o
      LEFT JOIN SHIPMENTS_STREAM AS s
        ON o."orderid" = s."orderid";
Query Results
ORDERS_STREAM SHIPMENTS_STREAM OUTPUT_STREAM
ROWTIME orderid ROWTIME orderid resultrowtime orderid OrderTime
10:00:00 101 10:00:00 100
10:00:20 102
10:00:30 103
10:00:40 104
10:00:45 104
10:00:45 100* 10:00:45 104 10:00:40
10:00:50 105
10:01:00 101 10:00:00
10:01:20 102 10:00:20
10:01:30 103 10:00:30
10:01:40 104 10:00:40
10:01:45 100 10:00:45

* - Record with orderid = 100 is a late event in the Orders stream.

Visual Representation of the Join

The following diagram represents a query that returns all orders that executed in the last minute, whether or not there are corresponding shipments in the last minute.

Diagram of a query returning all orders (orders_stream) that executed in the last minute, whether or not there are corresponding shipments (shipments_stream) in the last minute.

Triggering of Results

The following describes the events that trigger results from the query.

ROWTIMEs of Results

Example 6: Time Windows on Both Sides of a LEFT JOIN (LEFT OUTER JOIN)

This example demonstrates a query that returns all orders that executed in the last minute, whether or not they have corresponding shipments.

Join Query
CREATE OR REPLACE STREAM "OUTPUT_STREAM" ("resultrowtime" timestamp, "orderid" int, "shipmenttime" timestamp, "ordertime" timestamp);
CREATE OR REPLACE PUMP "OUTPUT_STREAM_PUMP" AS INSERT INTO "OUTPUT_STREAM"
  SELECT STREAM ROWTIME as "resultrowtime", s."orderid", s.ROWTIME as "shipmenttime", o.ROWTIME as "ordertime"
    FROM ORDERS_STREAM OVER (RANGE INTERVAL '1' MINUTE PRECEDING) AS o
      LEFT JOIN SHIPMENTS_STREAM OVER (RANGE INTERVAL '1' MINUTE PRECEDING) AS s
        ON o."orderid" = s."orderid";
Query Results
ORDERS_STREAM SHIPMENTS_STREAM OUTPUT_STREAM
ROWTIME orderid ROWTIME orderid resultrowtime orderid shipmenttime OrderTime
10:00:00 101 10:00:00 100
10:00:20 102
10:00:30 103
10:00:40 104
10:00:45 104
10:00:45 100* 10:00:45 104 10:00:40 10:00:45
10:00:50 105 10:00:45 100 10:00:00 10:00:45
10:01:00 101 null 10:00:00
10:01:20 102 null 10:00:20
10:01:30 103 null 10:00:30
10:01:40 104 null 10:00:40
10:01:45 100 null 10:00:45

* - Record with orderid = 100 is a late event in the Orders stream.

Visual Representation of the Join

The following diagram represents a query that returns all orders that executed in the last minute, whether or not they have corresponding shipments.

Diagram of a query returning all orders (orders_stream) that executed in the last minute, whether or not they have corresponding shipments (shipments_stream).

Triggering of Results

The following describes the events that trigger results from the query.

ROWTIMEs of Results

Summary

Stream-to-Table Joins

If one of the relations is a stream and the other is a finite relation, it is referred to as a stream-table join. For each row in the stream, the query looks up the row or rows in the table that match the join condition.

For example, Orders is a stream and PriceList is a table. The effect of the join is to add price list information to the order.

For information about creating a reference data source and joining a stream to a reference table, see Example: Add Reference Data Source in the Amazon Kinesis Data Analytics Developer Guide.