Projections
This article discusses what projections are, how you can use them and various options for manipulating projections.
Overview of projections
Projections store data in a format that optimizes query execution, this feature is useful for:
- Running queries on a column that is not a part of the primary key
- Pre-aggregating columns, it will reduce both computation and IO
You can define one or more projections for a table, and during the query analysis the projection with the least data to scan will be selected by ClickHouse without modifying the query provided by the user.
Projections will create internally a new hidden table, this means that more IO and space on disk will be required. Example, If the projection has defined a different primary key, all the data from the original table will be duplicated.
You can see more technical details about how projections work internally on this page.
Using projections
Example filtering without using primary keys
Creating the table:
Using ALTER TABLE, we could add the Projection to an existing table:
Inserting the data:
The Projection will allow us to filter by user_name fast even if in the original Table user_name was not defined as a PRIMARY_KEY.
At query time ClickHouse determined that less data will be processed if the projection is used, as the data is ordered by user_name.
To verify that a query is using the projection, we could review the system.query_log table. On the projections field we have the name of the projection used or empty if none has been used:
Example pre-aggregation query
Create the table with projection projection_visits_by_user:
Insert the data:
Execute a first query with GROUP BY using the field user_agent.
This query will not use the projection defined as the pre-aggregation does not match.
To make use of the projection you can execute queries that select part of, or all of the pre-aggregation and GROUP BY fields:
As previously mentioned, you can review the system.query_log table to understand if a projection was used.
The projections field shows the name of the projection used.
It will be empty if no projection has been used:
Normal projection with _part_offset field
Creating a table with a normal projection that utilizes the _part_offset field:
Inserting some sample data:
Using _part_offset as a secondary index
The _part_offset field preserves its value through merges and mutations, making it valuable for secondary indexing. We can leverage this in queries:
Manipulating projections
The following operations with projections are available:
ADD PROJECTION
Use the statement below to add a projection description to a tables metadata:
WITH SETTINGS Clause
WITH SETTINGS defines projection-level settings, which customize how the projection stores data (for example, index_granularity or index_granularity_bytes).
These correspond directly to MergeTree table settings, but apply only to this projection.
Example:
Projection settings override the effective table settings for the projection, subject to validation rules (e.g., invalid or incompatible overrides will be rejected).
DROP PROJECTION
Use the statement below to remove a projection description from a tables metadata and delete projection files from disk. This is implemented as a mutation.
MATERIALIZE PROJECTION
Use the statement below to rebuild the projection name in partition partition_name.
This is implemented as a mutation.
CLEAR PROJECTION
Use the statement below to delete projection files from disk without removing description. This is implemented as a mutation.
The commands ADD, DROP and CLEAR are lightweight in the sense that they only change metadata or remove files.
Additionally, they are replicated, and sync projection metadata via ClickHouse Keeper or ZooKeeper.
Projection manipulation is supported only for tables with *MergeTree engine (including replicated variants).
Controlling projection merge behavior
When you execute a query, ClickHouse chooses between reading from the original table or one of its projections. The decision to read from the original table or one of its projections is made individually per every table part. ClickHouse generally aims to read as little data as possible and employs a couple of tricks to identify the best part to read from, for example, sampling the primary key of a part. In some cases, source table parts have no corresponding projection parts. This can happen, for example, because creating a projection for a table in SQL is “lazy” by default - it only affects newly inserted data but keeps existing parts unaltered.
As one of the projections already contains the pre-computed aggregate values, ClickHouse tries to read from the corresponding projection parts to avoid aggregating at query runtime again. If a specific part lacks the corresponding projection part, query execution falls back to the original part.
But what happens if the rows in the original table change in a non-trivial way by non-trivial data part background merges?
For example, assume the table is stored using the ReplacingMergeTree table engine.
If the same row is detected in multiple input parts during merge, only the most recent row version (from the most recently inserted part) will be kept, while all older versions will be discarded.
Similarly, if the table is stored using the AggregatingMergeTree table engine, the merge operation may fold the same rows in the input parts (based on the primary key values) into a single row to update partial aggregation states.
Before ClickHouse v24.8, projection parts either silently got out of sync with the main data, or certain operations like updates and deletes could not be run at all as the database automatically threw an exception if the table had projections.
Since v24.8, a new table-level setting deduplicate_merge_projection_mode controls the behavior if the aforementioned non-trivial background merge operations occur in parts of the original table.
Delete mutations are another example of part merge operations that drop rows in the parts of the original table. Since v24.7, we also have a setting to control the behavior w.r.t. delete mutations triggered by lightweight deletes: lightweight_mutation_projection_mode.
Below are the possible values for both deduplicate_merge_projection_mode and lightweight_mutation_projection_mode:
throw(default): An exception is thrown, preventing projection parts from going out of sync.drop: Affected projection table parts are dropped. Queries will fall back to the original table part for affected projection parts.rebuild: The affected projection part is rebuilt to stay consistent with data in the original table part.