ClickHouse Primary Keys - Yegor Andreenko - Medium (original) (raw)

Yegor Andreenko

Recently I dived deep into ClickHouse. ClickHouse is column-store database by Yandex with great performance for analytical queries. For example check benchmark and post of Mark Litwintschik.

This is the translation of answer given by Alexey Milovidov (creator of ClickHouse) about composite primary key.

Questions:

  1. How many columns primary key could have? And what is layout of data on storage drive? Is there any theoretical/practical limits?
  2. Could columns with missing data at some rows be part of primary key?

Data in table of MergeTree type stored in set of multiple parts. On average you could expect little number of parts (units-tens per month).

In every part data stored sorted lexicographically by primary key. For example, if your primary key — (CounterID, Date), than rows would be located sorted by CounterID, and for rows with the same CounterID — sorted by Date.

Data structure of primary key looks like an array of marks — it’s values of primary key every index_granularity rows.

index_granularity — settings of MergeTree engine, default to 8192.

We say that primary key is sparse index of sorted data. Let’s visualise it with only one part. (I should have equal length between marks, but it’s a bit imperfect to draw asci-art here):

Data: [ — — — — — — — — — — — — — — — — — — — — — — — — — —-]
CounterID: [aaaaaaaaaaaaaaabbbbcdeeeeeeeeeeeeefgggggggghhhhhhhhhiiiiiiiiikllllllll]
Date: [1111222222233331233211111222222333211111112122222223111112223311122333]
Ticks:

|_____|_____|____|_____|_____|_____|_______|______|___|_____|

a,1___a,2___a,3___b,3___e,2____e,3___g,1______h,2____i,1___i,3___l,3
Mark’s number:

0_____1_____2____3_____4_____5_____6_______7______8___9_____10

It’s convenient to represent marks as marks of ruler. Primary key allows effectively read range of data. For select ClickHouse chooses set of mark ranges that could contain target data.

This way,

if you select `CounterID IN (‘a’, ‘h’)`
server reads data with mark ranges [0, 3) and [6, 8).

if you select `CounterID IN (‘a’, ‘h’) AND Date = 3`
server reads data with mark ranges [1, 3) and [7, 8).

Sometimes primary key works even if only the second column condition presents in select:

if you select `Date = 3`
server reads data with mark ranges [1, 10).

In our example it’s all marks except 0 — this is 90% of data. In this case index isn’t really effective, but still allows to skip part of data.
On the other hand, if we have more data for one CounterID, index allows to skip wider ranges of Date in data.

In any case, usage of index never could be less efficient than full scan.

Sparse index could read unnecessary rows: during read of one range of primary key `index_granularity * 2` unnecessary rows in every part. It’s normal and you shouldn’t try to reduce `index_granularity`. ClickHouse designed to work effective with data by large batches of rows, that’s why a bit of additional column during read isn’t hurt the performance. index_granularity = 8192 — good value for most cases.

Sparse index allows to work with tables that have enormous number of rows. And it always fits in RAM.

Primary key isn’t unique. You can insert many rows with the same value of primary key.

Primary key can also contain functional expressions.

Example: (CounterID, EventDate, intHash32(UserID))

Above it’s used to mix up the data of particular UserID for every tuple CounterID, EventDate. By-turn it’s used in sampling (https://clickhouse.yandex/reference_en.html#SAMPLE clause).

Let’s sum up what choice of primary key affects:

  1. The most important and obvious: primary key allows to read less data during SELECT queries. As shown in examples above it’s usually doesn’t make sense to include many columns into primary key for this purpose.

Let’s say you have primary key (a, b). By adding one more column c: (a, b, c) makes sense only if it conforms with both conditions:

- if you have queries with filter for this column;
- in your data could be quite long (several time bigger than index_granularity) ranges of data with the same values of (a, b).
In other words when adding one more column will allow to skip big enough ranges of data.

2. Data is sorted by primary key. That way data is more compressable. Sometimes it happens that by adding one more column into primary key data could be compressed better.

3. When you use different kinds of MergeTree with additional logic in merge: CollapsingMergeTree, SummingMergeTree and etc., primary key affects merge of data. For this reason it might be necessary to use more columns in primary key even when it’s not necessary for point 1.

Number of columns into primary key isn’t limited explicitly. Long primary key is usually useless. In real use case the maximum that I saw was ~20 columns (for SummingMergeTree), but I don’t recommend this variant.
Long primary key will negatively affect insert performance and memory usage.

Long primary key will not negatively affect the performance of SELECT queries.

During insert, missing values of all columns will be replaced with default values and written to table.