
· 6 min read

Overview: This article walks through the process of sending data from a Kafka topic to a ClickHouse table. We’ll use the Wiki recent changes feed, which provides a stream of events that represent changes made to various Wikimedia properties. The steps include:

  1. Set up Kafka on Ubuntu
  2. Ingest a stream of data into a Kafka topic
  3. Create a ClickHouse table that subscribes to the topic

1. Set up Kafka on Ubuntu

  1. Create an Ubuntu EC2 instance and SSH into it:
ssh -i ~/training.pem ubuntu@ec2.compute.amazonaws.com
  2. Install Kafka (based on the instructions here: https://www.linode.com/docs/guides/how-to-install-apache-kafka-on-ubuntu/):
sudo apt update
sudo apt install openjdk-11-jdk

mkdir /home/ubuntu/kafka
cd /home/ubuntu/kafka/

wget https://downloads.apache.org/kafka/3.7.0/kafka_2.13-3.7.0.tgz

tar -zxvf kafka_2.13-3.7.0.tgz
  3. Start ZooKeeper:
cd kafka_2.13-3.7.0
bin/zookeeper-server-start.sh config/zookeeper.properties
  4. Open a new console and launch Kafka:
ssh -i ~/training.pem ubuntu@ec2.compute.amazonaws.com
cd kafka/kafka_2.13-3.7.0/
bin/kafka-server-start.sh config/server.properties
  5. Open a third console and create a topic named wikimedia:
ssh -i ~/training.pem ubuntu@ec2.compute.amazonaws.com
cd kafka/kafka_2.13-3.7.0/

bin/kafka-topics.sh --create --topic wikimedia --bootstrap-server localhost:9092
  6. Verify that it was created successfully:
bin/kafka-topics.sh --list --bootstrap-server localhost:9092
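
As an aside, if you want explicit control over the topic layout, kafka-topics.sh also accepts partition and replication settings when creating the topic; an illustrative variant for a single-broker setup:

bin/kafka-topics.sh --create --topic wikimedia --partitions 3 --replication-factor 1 --bootstrap-server localhost:9092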

2. Ingest the Wikimedia Stream into Kafka

  1. We need some utilities first:
sudo apt-get install librdkafka-dev libyajl-dev
sudo apt-get install kafkacat
  2. The data is sent to Kafka using a curl command that grabs the latest Wikimedia events, pipes them through awk to extract the JSON payload, and sends it to the Kafka topic:
curl -N https://stream.wikimedia.org/v2/stream/recentchange  | awk '/^data: /{gsub(/^data: /, ""); print}' | kafkacat -P -b localhost:9092 -t wikimedia
  1. You can "describe" the topic:
bin/kafka-topics.sh --describe --topic wikimedia --bootstrap-server localhost:9092
  4. Let's verify everything is working by consuming some events:
bin/kafka-console-consumer.sh --topic wikimedia --from-beginning --bootstrap-server localhost:9092
  5. Press Ctrl+C to stop the previous command.
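
You can also spot-check the topic with kafkacat itself. A quick sketch using its consumer mode (flags as documented for kafkacat/kcat; -o end starts at the latest offset and -c 5 exits after five messages):

kafkacat -C -b localhost:9092 -t wikimedia -o end -c 5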

3. Ingest the Data into ClickHouse

  1. Here is what the incoming data looks like:
{
"$schema": "/mediawiki/recentchange/1.0.0",
"meta": {
"uri": "https://www.wikidata.org/wiki/Q45791749",
"request_id": "f64cfb17-04ba-4d09-8935-38ec6f0001c2",
"id": "9d7d2b5a-b79b-45ea-b72c-69c3b69ae931",
"dt": "2024-04-18T13:21:21Z",
"domain": "www.wikidata.org",
"stream": "mediawiki.recentchange",
"topic": "eqiad.mediawiki.recentchange",
"partition": 0,
"offset": 5032636513
},
"id": 2196113017,
"type": "edit",
"namespace": 0,
"title": "Q45791749",
"title_url": "https://www.wikidata.org/wiki/Q45791749",
"comment": "/* wbsetqualifier-add:1| */ [[Property:P1545]]: 20, Modify PubMed ID: 7292984 citation data from NCBI, Europe PMC and CrossRef",
"timestamp": 1713446481,
"user": "Cewbot",
"bot": true,
"notify_url": "https://www.wikidata.org/w/index.php?diff=2131981357&oldid=2131981341&rcid=2196113017",
"minor": false,
"patrolled": true,
"length": {
"old": 75618,
"new": 75896
},
"revision": {
"old": 2131981341,
"new": 2131981357
},
"server_url": "https://www.wikidata.org",
"server_name": "www.wikidata.org",
"server_script_path": "/w",
"wiki": "wikidatawiki",
"parsedcomment": "<span dir=\"auto\"><span class=\"autocomment\">Added qualifier: </span></span> <a href=\"/wiki/Property:P1545\" title=\"series ordinal | position of an item in its parent series (most frequently a 1-based index), generally to be used as a qualifier (different from &quot;rank&quot; defined as a class, and from &quot;ranking&quot; defined as a property for evaluating a quality).\"><span class=\"wb-itemlink\"><span class=\"wb-itemlink-label\" lang=\"en\" dir=\"ltr\">series ordinal</span> <span class=\"wb-itemlink-id\">(P1545)</span></span></a>: 20, Modify PubMed ID: 7292984 citation data from NCBI, Europe PMC and CrossRef"
}
  2. We will need the Kafka table engine to pull the data from the Kafka topic:
CREATE OR REPLACE TABLE wikiQueue
(
`id` UInt32,
`type` String,
`title` String,
`title_url` String,
`comment` String,
`timestamp` UInt64,
`user` String,
`bot` Bool,
`server_url` String,
`server_name` String,
`wiki` String,
`meta` Tuple(uri String, id String, stream String, topic String, domain String)
)
ENGINE = Kafka(
'ec2.compute.amazonaws.com:9092',
'wikimedia',
'consumer-group-wiki',
'JSONEachRow'
);
  3. For some reason the Kafka table engine seems to take the public EC2 URL and convert it to the private DNS name, so I had to add that mapping to my local /etc/hosts file:
52.14.154.92  ip.us-east-2.compute.internal
  4. You can read directly from a Kafka table; you just have to enable a setting:
SELECT *
FROM wikiQueue
LIMIT 20
FORMAT Vertical
SETTINGS stream_like_engine_allow_direct_select = 1;

The rows should come back nicely parsed based on the columns defined in the wikiQueue table:

id:          2473996741
type: edit
title: File:Père-Lachaise - Division 6 - Cassereau 05.jpg
title_url: https://commons.wikimedia.org/wiki/File:P%C3%A8re-Lachaise_-_Division_6_-_Cassereau_05.jpg
comment: /* wbcreateclaim-create:1| */ [[d:Special:EntityPage/P921]]: [[d:Special:EntityPage/Q112327116]], [[:toollabs:quickstatements/#/batch/228454|batch #228454]]
timestamp: 1713457283
user: Ameisenigel
bot: false
server_url: https://commons.wikimedia.org
server_name: commons.wikimedia.org
wiki: commonswiki
meta: ('https://commons.wikimedia.org/wiki/File:P%C3%A8re-Lachaise_-_Division_6_-_Cassereau_05.jpg','01a832e2-24c5-4ccb-bd93-8e2c0e429418','mediawiki.recentchange','eqiad.mediawiki.recentchange','commons.wikimedia.org')
  5. We need a MergeTree table to store these incoming events:
CREATE TABLE rawEvents (
id UInt64,
type LowCardinality(String),
comment String,
timestamp DateTime64(3, 'UTC'),
title_url String,
topic LowCardinality(String),
user String
)
ENGINE = MergeTree
ORDER BY (type, timestamp);
  6. Let's define a materialized view that gets triggered when an insert occurs on the Kafka table and sends the data to our rawEvents table:
CREATE MATERIALIZED VIEW rawEvents_mv TO rawEvents
AS
SELECT
id,
type,
comment,
toDateTime(timestamp) AS timestamp,
title_url,
tupleElement(meta, 'topic') AS topic,
user
FROM wikiQueue
WHERE title_url <> '';
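
If you ever need to pause ingestion (for example, while changing the target table), a common approach is to detach the materialized view and re-attach it later; a minimal sketch:

-- stop pushing new Kafka messages into rawEvents
DETACH TABLE rawEvents_mv;

-- resume consumption
ATTACH TABLE rawEvents_mv;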
  7. You should start seeing data going into rawEvents almost immediately:
SELECT count()
FROM rawEvents;
  8. Let's view some of the rows:
SELECT *
FROM rawEvents
LIMIT 5
FORMAT Vertical
Row 1:
──────
id: 124842852
type: 142
comment: Pere prlpz commented on "Plantilles Enciclopèdia Catalana" (Diria que no cal fer res als articles. Es pot actualitzar els enllaços que es facin servir a les referències (tot i que l'antic encara ha...)
timestamp: 2024-04-18 16:22:29.000
title_url: https://ca.wikipedia.org/wiki/Tema:Wu36d6vfsiuu4jsi
topic: eqiad.mediawiki.recentchange
user: Pere prlpz

Row 2:
──────
id: 2473996748
type: categorize
comment: [[:File:Ruïne van een poortgebouw, RP-T-1976-29-6(R).jpg]] removed from category
timestamp: 2024-04-18 16:21:20.000
title_url: https://commons.wikimedia.org/wiki/Category:Pieter_Moninckx
topic: eqiad.mediawiki.recentchange
user: Warburg1866

Row 3:
──────
id: 311828596
type: categorize
comment: [[:Cujo (película)]] añadida a la categoría
timestamp: 2024-04-18 16:21:21.000
title_url: https://es.wikipedia.org/wiki/Categor%C3%ADa:Pel%C3%ADculas_basadas_en_obras_de_Stephen_King
topic: eqiad.mediawiki.recentchange
user: Beta15

Row 4:
──────
id: 311828597
type: categorize
comment: [[:Cujo (película)]] eliminada de la categoría
timestamp: 2024-04-18 16:21:21.000
title_url: https://es.wikipedia.org/wiki/Categor%C3%ADa:Trabajos_basados_en_obras_de_Stephen_King
topic: eqiad.mediawiki.recentchange
user: Beta15

Row 5:
──────
id: 48494536
type: categorize
comment: [[:braiteremmo]] ajoutée à la catégorie
timestamp: 2024-04-18 16:21:21.000
title_url: https://fr.wiktionary.org/wiki/Cat%C3%A9gorie:Wiktionnaire:Exemples_manquants_en_italien
topic: eqiad.mediawiki.recentchange
user: Àncilu bot
  9. Let's see what types of events are coming in:
SELECT
type,
count()
FROM rawEvents
GROUP BY type
   ┌─type───────┬─count()─┐
1. │ 142        │       1 │
2. │ new        │    1003 │
3. │ categorize │   12228 │
4. │ log        │    1799 │
5. │ edit       │   17142 │
   └────────────┴─────────┘

  10. Let's define a materialized view chained to our current materialized view. We will keep track of some aggregated stats per minute:

CREATE TABLE byMinute
(
`dateTime` DateTime64(3, 'UTC') NOT NULL,
`users` AggregateFunction(uniq, String),
`pages` AggregateFunction(uniq, String),
`updates` AggregateFunction(sum, UInt32)
)
ENGINE = AggregatingMergeTree
ORDER BY dateTime;

CREATE MATERIALIZED VIEW byMinute_mv TO byMinute
AS SELECT
toStartOfMinute(timestamp) AS dateTime,
uniqState(user) AS users,
uniqState(title_url) AS pages,
sumState(toUInt32(1)) AS updates
FROM rawEvents
GROUP BY dateTime;
  11. We will need the -Merge combinator functions to view the results:
SELECT
dateTime AS dateTime,
uniqMerge(users) AS users,
uniqMerge(pages) AS pages,
sumMerge(updates) AS updates
FROM byMinute
GROUP BY dateTime
ORDER BY dateTime DESC
LIMIT 10;
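
For convenience, you could wrap that -Merge query in an ordinary view so that dashboards can simply select from it. A sketch (byMinute_v is a hypothetical name):

CREATE VIEW byMinute_v AS
SELECT
    dateTime,
    uniqMerge(users) AS users,
    uniqMerge(pages) AS pages,
    sumMerge(updates) AS updates
FROM byMinute
GROUP BY dateTime;

SELECT * FROM byMinute_v ORDER BY dateTime DESC LIMIT 10;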

· 2 min read

This is a step-by-step example of how to start using Python with the ClickHouse Cloud service.

note

Keep in mind that Python versions and library dependencies are constantly evolving. Also make sure to use the latest supported versions of both the driver and Python environment when trying this.

At the time of writing this article, we're using the clickhouse-connect driver version 0.5.23 and Python 3.11.2.

Steps

  1. Check the Python version:
$  python -V
Python 3.11.2
  2. We'll assemble the project in a folder called ch-python:
$ mkdir ch-python
$ cd ch-python
  3. Create a dependencies file named requirements.txt with:
clickhouse-connect==0.5.23
  4. Create a Python source file named main.py:
import clickhouse_connect
import sys
import json

CLICKHOUSE_CLOUD_HOSTNAME = 'HOSTNAME.clickhouse.cloud'
CLICKHOUSE_CLOUD_USER = 'default'
CLICKHOUSE_CLOUD_PASSWORD = 'YOUR_SECRET_PASSWORD'

# connect to the ClickHouse Cloud service over HTTPS (port 8443)
client = clickhouse_connect.get_client(
    host=CLICKHOUSE_CLOUD_HOSTNAME, port=8443, username=CLICKHOUSE_CLOUD_USER, password=CLICKHOUSE_CLOUD_PASSWORD)

print("connected to " + CLICKHOUSE_CLOUD_HOSTNAME + "\n")

# create the target table if it does not exist yet
client.command(
    'CREATE TABLE IF NOT EXISTS new_table (key UInt32, value String, metric Float64) ENGINE MergeTree ORDER BY key')

print("table new_table created or exists already!\n")

# insert two sample rows
row1 = [1000, 'String Value 1000', 5.233]
row2 = [2000, 'String Value 2000', -107.04]
data = [row1, row2]
client.insert('new_table', data, column_names=['key', 'value', 'metric'])

print("written 2 rows to table new_table\n")

# run an aggregation query and print the resulting rows
QUERY = "SELECT max(key), avg(metric) FROM new_table"

result = client.query(QUERY)

sys.stdout.write("query: ["+QUERY + "] returns:\n\n")
print(result.result_rows)
  5. Create the virtual environment:
chpython$ python -m venv venv
  6. Load the virtual environment:
chpython$ source venv/bin/activate

Once loaded, your terminal prompt should be prefixed with (venv). Install the dependencies:

(venv) ➜  chpython$ pip install -r requirements.txt
Collecting certifi
Using cached certifi-2023.5.7-py3-none-any.whl (156 kB)
Collecting urllib3>=1.26
Using cached urllib3-2.0.2-py3-none-any.whl (123 kB)
Collecting pytz
Using cached pytz-2023.3-py2.py3-none-any.whl (502 kB)
Collecting zstandard
Using cached zstandard-0.21.0-cp311-cp311-macosx_11_0_arm64.whl (364 kB)
Collecting lz4
Using cached lz4-4.3.2-cp311-cp311-macosx_11_0_arm64.whl (212 kB)
Installing collected packages: pytz, zstandard, urllib3, lz4, certifi, clickhouse-connect
Successfully installed certifi-2023.5.7 clickhouse-connect-0.5.23 lz4-4.3.2 pytz-2023.3 urllib3-2.0.2 zstandard-0.21.0
  7. Launch the code!
(venv) chpython$ venv/bin/python main.py

connected to HOSTNAME.clickhouse.cloud

table new_table created or exists already!

written 2 rows to table new_table

query: [SELECT max(key), avg(metric) FROM new_table] returns:

[(2000, -50.9035)]
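
As a follow-up, you might bind values as parameters instead of concatenating them into the query string. A small sketch assuming the parameters argument of clickhouse-connect's query() with client-side %(name)s placeholders:

import clickhouse_connect

# assumes the same Cloud service and the new_table created by main.py above
client = clickhouse_connect.get_client(
    host='HOSTNAME.clickhouse.cloud', port=8443, username='default', password='YOUR_SECRET_PASSWORD')

# the %(k)s placeholder is filled from the parameters dict
result = client.query(
    'SELECT key, value, metric FROM new_table WHERE key = %(k)s',
    parameters={'k': 2000})
print(result.result_rows)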
tip

If you are using an older Python version (e.g. 3.9.6), you might get an ImportError related to the urllib3 library. In that case, either upgrade your Python environment to a newer version or pin the urllib3 version to 1.26.15 in your requirements.txt file.

· 5 min read

How can I use the API to manage clusters on ClickHouse Cloud?

Answer

We will use Terraform to configure our infrastructure, together with the ClickHouse Terraform provider.

Steps:

1). Create an API key in ClickHouse Cloud. Follow the docs here - https://clickhouse.com/docs/en/cloud/manage/openapi

Save the credentials locally.

2). Install Terraform using - https://developer.hashicorp.com/terraform/tutorials/aws-get-started/install-cli

You can use the Homebrew package manager if you're on a Mac.

3). Create a directory anywhere you like:

mkdir test
➜ test pwd
/Users/jaijhala/Desktop/terraform/test

4). Create 2 files: main.tf and secret.tfvars

Copy the following:

The main.tf file would be:

terraform {
  required_providers {
    clickhouse = {
      source  = "ClickHouse/clickhouse"
      version = "0.0.2"
    }
  }
}

variable "organization_id" {
  type = string
}

variable "token_key" {
  type = string
}

variable "token_secret" {
  type = string
}

provider clickhouse {
  environment     = "production"
  organization_id = var.organization_id
  token_key       = var.token_key
  token_secret    = var.token_secret
}

variable "service_password" {
  type      = string
  sensitive = true
}

resource "clickhouse_service" "service123" {
  name           = "jai-terraform"
  cloud_provider = "aws"
  region         = "us-east-2"
  tier           = "development"
  idle_scaling   = true
  password       = var.service_password
  ip_access = [
    {
      source      = "0.0.0.0/0"
      description = "Anywhere"
    }
  ]
}

output "CLICKHOUSE_HOST" {
  value = clickhouse_service.service123.endpoints.0.host
}

You can replace parameters such as the service name and region in the resource section above.

secret.tfvars is where you'll put all the API key-related info that you downloaded earlier. The idea behind this file is that all your secret credentials are kept out of the main config file.

It would be something like (replace these parameters):

organization_id = "e957a5f7-4qe3-4b05-ad5a-d02b2dcd0593"
token_key = "QWhhkMeytqQruTeKg"
token_secret = "4b1dNmjWdLUno9lXxmKvSUcPP62jvn7irkuZPbY"
service_password = "password123!"
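
Alternatively, Terraform can read the same variables from TF_VAR_* environment variables, so you can skip the -var-file flag entirely if you prefer. A sketch with placeholder values:

export TF_VAR_organization_id="<your organization id>"
export TF_VAR_token_key="<your token key>"
export TF_VAR_token_secret="<your token secret>"
export TF_VAR_service_password="<your service password>"
terraform plan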

5). Run terraform init from this directory

Expected output:

Initializing the backend...

Initializing provider plugins...
- Finding clickhouse/clickhouse versions matching "0.0.2"...
- Installing clickhouse/clickhouse v0.0.2...
- Installed clickhouse/clickhouse v0.0.2 (self-signed, key ID D7089EE5C6A92ED1)

Partner and community providers are signed by their developers.
If you'd like to know more about provider signing, you can read about it here:
https://www.terraform.io/docs/cli/plugins/signing.html

Terraform has created a lock file .terraform.lock.hcl to record the provider
selections it made above. Include this file in your version control repository
so that Terraform can guarantee to make the same selections by default when
you run "terraform init" in the future.

Terraform has been successfully initialized!

You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.

If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.

6). Run the terraform apply -var-file=secret.tfvars command.

Something like:

➜  test terraform apply -var-file=secret.tfvars

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with
the following symbols:
+ create

Terraform will perform the following actions:

# clickhouse_service.service123 will be created
+ resource "clickhouse_service" "service123" {
+ cloud_provider = "aws"
+ endpoints = (known after apply)
+ id = (known after apply)
+ idle_scaling = true
+ ip_access = [
+ {
+ description = "Anywhere"
+ source = "0.0.0.0/0"
},
]
+ last_updated = (known after apply)
+ name = "jai-terraform"
+ password = (sensitive value)
+ region = "us-east-2"
+ tier = "development"
}

Plan: 1 to add, 0 to change, 0 to destroy.

Changes to Outputs:
+ CLICKHOUSE_HOST = (known after apply)

Do you want to perform these actions?
Terraform will perform the actions described above.
Only 'yes' will be accepted to approve.

Enter a value: yes

Type yes and hit enter

Side note: Notice it says password = (sensitive value) above. This is because we set sensitive = true for the password in the main.tf file.

7). It will take a couple of minutes to create the service, but eventually it should come up like:

  Enter a value: yes

clickhouse_service.service123: Creating...
clickhouse_service.service123: Still creating... [10s elapsed]
clickhouse_service.service123: Still creating... [20s elapsed]
clickhouse_service.service123: Still creating... [30s elapsed]
clickhouse_service.service123: Still creating... [40s elapsed]
clickhouse_service.service123: Still creating... [50s elapsed]
clickhouse_service.service123: Still creating... [1m0s elapsed]
clickhouse_service.service123: Still creating... [1m10s elapsed]
clickhouse_service.service123: Still creating... [1m20s elapsed]
clickhouse_service.service123: Still creating... [1m30s elapsed]
clickhouse_service.service123: Still creating... [1m40s elapsed]
clickhouse_service.service123: Creation complete after 1m41s [id=aa8d8d63-1878-4600-8470-630715af38ed]

Apply complete! Resources: 1 added, 0 changed, 0 destroyed.

Outputs:

CLICKHOUSE_HOST = "h3ljlaqez6.us-east-2.aws.clickhouse.cloud"
➜ test
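
At this point you can try connecting to the new service with the clickhouse client, feeding it the Terraform output. A sketch, assuming clickhouse client is installed locally and using the service password you set above:

clickhouse client \
  --host "$(terraform output -raw CLICKHOUSE_HOST)" \
  --secure \
  --password "<your service password>"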

8). Check the Cloud Console; you should be able to see the service that was created.

9). To clean up/destroy the service again, run terraform destroy -var-file=secret.tfvars

Something like:

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with
the following symbols:
- destroy

Terraform will perform the following actions:

# clickhouse_service.service123 will be destroyed
- resource "clickhouse_service" "service123" {
- cloud_provider = "aws" -> null
- ............

Plan: 0 to add, 0 to change, 1 to destroy.

Changes to Outputs:
- CLICKHOUSE_HOST = "h3ljlaqez6.us-east-2.aws.clickhouse.cloud" -> null

Do you really want to destroy all resources?
Terraform will destroy all your managed infrastructure, as shown above.
There is no undo. Only 'yes' will be accepted to confirm.

Enter a value:

Type yes and hit enter

10). The destroy output will look something like this:

clickhouse_service.service123: Destroying... [id=aa8d8d63-1878-4600-8470-630715af38ed]
clickhouse_service.service123: Still destroying... [id=aa8d8d63-1878-4600-8470-630715af38ed, 10s elapsed]
clickhouse_service.service123: Still destroying... [id=aa8d8d63-1878-4600-8470-630715af38ed, 20s elapsed]
clickhouse_service.service123: Destruction complete after 27s

Destroy complete! Resources: 1 destroyed.

And it should be gone from the Cloud Console.

More details about the Cloud API can be found here - https://clickhouse.com/docs/en/cloud/manage/api/api-overview

· 3 min read

To compare metrics between two queries, you must first obtain the query_id for both queries.
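
If you don't have the query_ids handy, you can look them up in system.query_log first. A sketch (replace the ILIKE pattern with a fragment of your query text):

SELECT
    query_id,
    event_time,
    query_duration_ms,
    query
FROM clusterAllReplicas(default, system.query_log)
WHERE (type = 'QueryFinish')
    AND (event_date >= today() - 3)
    AND (query ILIKE '%your_table%')
ORDER BY event_time DESC
LIMIT 10;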

You can then run the following query:

WITH
initial_query_id = '82142964-0b5d-4263-b996-302ce14bd779' AS second,
initial_query_id = '7ea39e31-2f89-4085-843c-7246cb3baa5c' AS first
SELECT
PE.Names AS metric,
sumIf(PE.Values, first) AS v1,
sumIf(PE.Values, second) AS v2,
10 * log10(v2 / v1) AS dB
FROM clusterAllReplicas(
default, system.query_log)
ARRAY JOIN ProfileEvents AS PE
WHERE (first OR second)
AND (event_date >= today() - 3) AND (type = 2)
GROUP BY metric
HAVING v1 != v2
ORDER BY
dB DESC,
v2 DESC,
metric ASC FORMAT PrettyCompactMonoBlock

WITH
initial_query_id = '82142964-0b5d-4263-b996-302ce14bd779' AS second,
initial_query_id = '7ea39e31-2f89-4085-843c-7246cb3baa5c' AS first
SELECT
PE.Names AS metric,
sumIf(PE.Values, first) AS v1,
sumIf(PE.Values, second) AS v2,
10 * log10(v2 / v1) AS dB
FROM clusterAllReplicas(default, system.query_log)
ARRAY JOIN ProfileEvents AS PE
WHERE (first OR second) AND (event_date >= (today() - 3)) AND (type = 2)
GROUP BY metric
HAVING v1 != v2
ORDER BY
dB DESC,
v2 DESC,
metric ASC
FORMAT PrettyCompactMonoBlock

You will receive a table with metrics comparing the two queries:

Query id: d7747d26-a231-47c8-ae8c-284895b1aeaf

┌─metric──────────────────────────────────────┬─────────v1─┬─────────v2─┬───────────────────────dB─┐
│ SystemTimeMicroseconds                      │   13812127 │   24081938 │       2.4143087099482767 │
│ SoftPageFaults                              │    2651887 │    4056889 │        1.846381108610876 │
│ DiskReadElapsedMicroseconds                 │    1113947 │    1273786 │        0.582319430863304 │
│ CachedReadBufferReadFromCacheMicroseconds   │    1126505 │    1285450 │         0.57322064922068 │
│ OSCPUVirtualTimeMicroseconds                │   70301588 │   80045377 │       0.5637111926869545 │
│ RealTimeMicroseconds                        │   86686457 │   96339471 │       0.4585300419916516 │
│ QueryProfilerRuns                           │        157 │        174 │       0.4464959587336597 │
│ NetworkSendBytes                            │     868197 │     940859 │        0.349062627796429 │
│ NetworkReceiveElapsedMicroseconds           │        161 │        174 │       0.3372337225075003 │
│ ArenaAllocBytes                             │ 1480589312 │ 1497366528 │      0.04893510724370622 │
│ OSWriteBytes                                │     380928 │     385024 │      0.04644905045763538 │
│ ArenaAllocChunks                            │       2153 │       2157 │      0.00806115279057892 │
│ FileOpen                                    │       7511 │       7516 │    0.0028900944828012766 │
│ OpenedFileCacheMisses                       │       7511 │       7516 │    0.0028900944828012766 │
│ ContextLock                                 │       5880 │       5881 │    0.0007385332589917156 │
│ OSReadChars                                 │ 2340791432 │ 2340789818 │ -0.000002994506583727971 │
│ OSWriteChars                                │    2521310 │    2513992 │    -0.012623549714419216 │
│ AggregationPreallocatedElementsInHashTables │  128039910 │  127563540 │    -0.016187974135432794 │
│ OSCPUWaitMicroseconds                       │    1543643 │    1536999 │    -0.018732829140838268 │
│ OpenedFileCacheHits                         │        539 │        534 │    -0.040475081581823065 │
│ UserTimeMicroseconds                        │   56490840 │   55961729 │     -0.04086908559606555 │
│ WaitMarksLoadMicroseconds                   │     388571 │     359985 │      -0.3318598023153847 │
│ ThreadpoolReaderTaskMicroseconds            │    3816669 │    3392522 │      -0.5116182478775457 │
│ NetworkSendElapsedMicroseconds              │       4745 │       4122 │      -0.6112822932011739 │
│ AsynchronousReadWaitMicroseconds            │    2380284 │    2025078 │      -0.7018702173136342 │
│ NetworkReceiveBytes                         │        516 │        372 │      -1.4210676174531387 │
└─────────────────────────────────────────────┴────────────┴────────────┴──────────────────────────┘

26 rows in set. Elapsed: 0.173 sec. Processed 5.86 million rows, 2.40 GB (33.92 million rows/s., 13.92 GB/s.)

· One min read

In order to build and contribute to ClickHouse, you must use LLVM and Clang.

These are the commands to build the latest version of LLVM and Clang on Linux:

git clone git@github.com:llvm/llvm-project.git
mkdir llvm-build
cd llvm-build
cmake -GNinja -DCMAKE_BUILD_TYPE:STRING=Release -DLLVM_ENABLE_PROJECTS=all -DLLVM_TARGETS_TO_BUILD=all ../llvm-project/llvm
time ninja
sudo ninja install
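
Once LLVM and Clang are installed, a typical next step is to point the ClickHouse build at them. A rough sketch, assuming the clang/clang++ installed above are first on your PATH (see the ClickHouse build documentation for the authoritative flags):

git clone --recursive https://github.com/ClickHouse/ClickHouse.git
cd ClickHouse
mkdir build && cd build
cmake -GNinja -DCMAKE_C_COMPILER=$(which clang) -DCMAKE_CXX_COMPILER=$(which clang++) ..
ninja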

· 3 min read

This is useful when there are tables that have similar naming conventions and similar columns but are not replicated. An example is searching the system database for entries in the query log tables.

The query_log table is not replicated, and only queries that are executed on a specific node get logged. Data may also roll over to a different table. For example, data may be inserted into query_log_0, query_log_1, etc. Since one node may roll over at a different time than the others, it is useful to search for the data we're looking for in tables that are not named exactly the same.

In essence, we need to do something like this, but in ClickHouse syntax:

SELECT column1, column2 FROM my_db.my_table_*

For this, we can use clusterAllReplicas() to search all the nodes, and the merge() table function to apply a regex pattern that matches the multiple tables.

The following example shows how to query all tables with the prefix query_log:

clickhouse-cloud :) SELECT 
`event_time`,
`query_id`,
`query`,
`type`
FROM
clusterAllReplicas(default,merge('system', '^query_log*'))
WHERE
query ilike '%db1.table1%' and event_time > now() - toIntervalMinute(5);

SELECT
event_time,
query_id,
query,
type
FROM clusterAllReplicas(default, merge('system', '^query_log*'))
WHERE (query ILIKE '%db1.table1%') AND (event_time > (now() - toIntervalMinute(5)))

Query id: de95c13e-5759-436e-90d9-a12c1327889e

┌──────────event_time─┬─query_id─────────────────────────────┬─query──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─type────────┐
│ 2024-02-08 00:15:20 │ d1dd0d6a-4337-4e58-bdd1-c2c827b6dfe2 │ /* ddl_entry=query-0000000428 */ CREATE TABLE db1.table1 UUID '781f25db-3cd1-47c6-a76e-701945a67485' (`id` Int32, `string_column` String) ENGINE = ReplicatedMergeTree ORDER BY id │ QueryStart │
│ 2024-02-08 00:15:20 │ d1dd0d6a-4337-4e58-bdd1-c2c827b6dfe2 │ /* ddl_entry=query-0000000428 */ CREATE TABLE db1.table1 UUID '781f25db-3cd1-47c6-a76e-701945a67485' (`id` Int32, `string_column` String) ENGINE = ReplicatedMergeTree ORDER BY id │ QueryFinish │
└─────────────────────┴──────────────────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴─────────────┘
┌──────────event_time─┬─query_id─────────────────────────────┬─query──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─type────────┐
│ 2024-02-08 00:15:20 │ f0ca43b2-544e-4b94-a21d-0f05e777fa96 │ /* ddl_entry=query-0000000428 */ CREATE TABLE db1.table1 UUID '781f25db-3cd1-47c6-a76e-701945a67485' (`id` Int32, `string_column` String) ENGINE = ReplicatedMergeTree ORDER BY id │ QueryStart │
│ 2024-02-08 00:15:20 │ f0ca43b2-544e-4b94-a21d-0f05e777fa96 │ /* ddl_entry=query-0000000428 */ CREATE TABLE db1.table1 UUID '781f25db-3cd1-47c6-a76e-701945a67485' (`id` Int32, `string_column` String) ENGINE = ReplicatedMergeTree ORDER BY id │ QueryFinish │
└─────────────────────┴──────────────────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴─────────────┘
┌──────────event_time─┬─query_id─────────────────────────────┬─query──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─type────────┐
│ 2024-02-08 00:15:20 │ 5cc0a508-7f64-460b-a5be-949ef1d1f2ca │ /* ddl_entry=query-0000000428 */ CREATE TABLE db1.table1 UUID '781f25db-3cd1-47c6-a76e-701945a67485' (`id` Int32, `string_column` String) ENGINE = ReplicatedMergeTree ORDER BY id │ QueryStart │
│ 2024-02-08 00:15:20 │ 5cc0a508-7f64-460b-a5be-949ef1d1f2ca │ /* ddl_entry=query-0000000428 */ CREATE TABLE db1.table1 UUID '781f25db-3cd1-47c6-a76e-701945a67485' (`id` Int32, `string_column` String) ENGINE = ReplicatedMergeTree ORDER BY id │ QueryFinish │
│ 2024-02-08 00:15:20 │ d1e01cb0-a27c-44b2-829c-90fb2596c9c0 │ create table db1.table1
(
id Int32,
string_column String
)
engine = MergeTree
order by id │ QueryStart │
│ 2024-02-08 00:15:20 │ d1e01cb0-a27c-44b2-829c-90fb2596c9c0 │ create table db1.table1
(
id Int32,
string_column String
)
engine = MergeTree
order by id │ QueryFinish │
│ 2024-02-08 00:15:27 │ 6c2c6c3f-173e-464f-bfa0-643089ca085e │ insert into db1.table1
values
│ QueryStart │
│ 2024-02-08 00:15:27 │ 6c2c6c3f-173e-464f-bfa0-643089ca085e │ insert into db1.table1
values
│ QueryFinish │
└─────────────────────┴──────────────────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴─────────────┘

10 rows in set. Elapsed: 0.046 sec. Processed 317.27 thousand rows, 33.57 MB (6.89 million rows/s., 729.43 MB/s.)
Peak memory usage: 67.04 MiB.
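
If you only care about the replica you are connected to, you can drop clusterAllReplicas() and use the merge() table function on its own; a minimal sketch:

SELECT
    event_time,
    query_id,
    query,
    type
FROM merge('system', '^query_log')
WHERE query ILIKE '%db1.table1%' AND event_time > now() - toIntervalMinute(5);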

Note that the columns you select must exist on each of the tables being queried or you may encounter an error such as:

Received exception from server (version 24.0.2):
Code: 47. DB::Exception: Received from abc123.us-west-2.aws.clickhouse.cloud:9440. DB::Exception: Missing columns: 'hostname' while processing query: 'WITH 'query_log_0' AS _table

· 2 min read

Dictionaries created in ClickHouse Cloud may experience inconsistency during the initial creation phase. This means that you may not see any data in the dictionary right after creation. However, after several retries, the creation query may land on different replicas, and data will be visible.

This sometimes occurs because the dictionary was created before the part reached the server. As an example:

2024-01-25 13:38:25.615837 - CREATE DICTIONARY received
2024-01-25 13:38:25.626468 - CREATE DICTIONARY finished
2024-01-25 13:38:25.733008 - Part all_0_0_0 downloaded

As you can see, the part only arrived after the dictionary was created. This can be a bigger problem if you are using LIFETIME(MIN 0 MAX 0), because it means the dictionary will never be refreshed automatically. Therefore, the dictionary will remain empty until the SYSTEM RELOAD DICTIONARIES command is executed.

The solution to this issue is to use a SELECT query instead of specifying a source table when creating the dictionary and enabling the setting select_sequential_consistency=1.

Instead of specifying a source table:

SOURCE(CLICKHOUSE(
table 'test.temp_title_table_1706189903924'
user default password 'PASSWORD'))

Use a SELECT query with select_sequential_consistency=1:

SOURCE(CLICKHOUSE(QUERY
'SELECT songTitle, mappedTitle
FROM test.temp_title_table_1706189903924
SETTINGS select_sequential_consistency=1' USER default PASSWORD ''))
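
Put together, a full dictionary definition using this approach might look roughly like the following. The dictionary name, layout, and lifetime are illustrative choices, not taken from the original setup:

CREATE DICTIONARY test.title_dict
(
    songTitle String,
    mappedTitle String
)
PRIMARY KEY songTitle
SOURCE(CLICKHOUSE(QUERY
    'SELECT songTitle, mappedTitle
     FROM test.temp_title_table_1706189903924
     SETTINGS select_sequential_consistency=1' USER default PASSWORD ''))
LAYOUT(COMPLEX_KEY_HASHED())
LIFETIME(300);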

Why does this issue occur?

When you insert data and then create or reload a dictionary, the DDL may reach a replica before the data (or new data) does. This leads to the dictionaries being inconsistent between replicas. Then, depending on which replica receives the query, you may get different results.

Note that the same thing happens when you insert and then immediately read from a table. If you read from a replica that hasn't replicated the data yet, you won't see the newly inserted data. When you need sequential consistency, you can enable select_sequential_consistency at the cost of performance (which is why it's generally not recommended).

The case of dictionaries is a bit trickier since dictionaries don't use the settings from the query, but the settings from the server. As a result, when loading data into the dictionary, even if you SET select_sequential_consistency=1 data may load inconsistently across replicas. Specifying select_sequential_consistency=1 in the dictionary source query allows the dictionary to adhere to this setting even if it's not globally enabled as a server setting.

· 3 min read

Question

How can I backup a specific partition in ClickHouse?

Answer

See the example below; it uses the S3 (MinIO) disk configuration listed on our Docker Compose examples page.

note

This does NOT apply to ClickHouse Cloud

Create a table:

ch_minio_s3 :) CREATE TABLE my_table
(
`event_time` DateTime,
`field_foo` String,
`field_bar` String,
`number` UInt256
)
ENGINE = MergeTree
PARTITION BY number % 2
ORDER BY tuple()

CREATE TABLE my_table
(
`event_time` DateTime,
`field_foo` String,
`field_bar` String,
`number` UInt256
)
ENGINE = MergeTree
PARTITION BY number % 2
ORDER BY tuple()

Query id: a1a54a5a-eac0-477c-b847-b40acaa62780

Ok.

0 rows in set. Elapsed: 0.016 sec.

Add some data that will fill both partitions equally:

ch_minio_s3 :) INSERT INTO my_table SELECT
toDateTime(now() + number) AS event_time,
randomPrintableASCII(10) AS field_foo,
randomPrintableASCII(20) AS field_bar,
number
FROM numbers(1000000)

INSERT INTO my_table SELECT
toDateTime(now() + number) AS event_time,
randomPrintableASCII(10) AS field_foo,
randomPrintableASCII(20) AS field_bar,
number
FROM numbers(1000000)

Query id: bf6ef803-5747-4ea1-ad00-a17967e349b6

Ok.

0 rows in set. Elapsed: 0.282 sec. Processed 1.00 million rows, 8.00 MB (3.55 million rows/s., 28.39 MB/s.)

Verify the data:

ch_minio_s3 :) SELECT
_partition_id AS partition_id,
cityHash64(sum(number)) AS hash,
count() AS count
FROM my_table
GROUP BY partition_id

SELECT
_partition_id AS partition_id,
cityHash64(sum(number)) AS hash,
count() AS count
FROM my_table
GROUP BY partition_id

Query id: d8febfb0-5339-4f97-aefa-ef0003128526

┌─partition_id─┬─cityHash64(sum(number))─┬──count─┐
│ 0            │    15460940821314360342 │ 500000 │
│ 1            │    11827822647069388611 │ 500000 │
└──────────────┴─────────────────────────┴────────┘

2 rows in set. Elapsed: 0.025 sec. Processed 1.00 million rows, 32.00 MB (39.97 million rows/s., 1.28 GB/s.)

Back up the partition with id 1 to the configured S3 disk:

ch_minio_s3 :) BACKUP TABLE my_table PARTITION 1 TO Disk('s3','backups/');

BACKUP TABLE my_table PARTITION 1 TO Disk('s3', 'backups/')

Query id: 810f6144-e282-42e2-99d0-9a80c75a927d

┌─id───────────────────────────────────┬─status─────────┐
│ 4d1da197-c4c9-4b6e-966c-76202eadbd53 │ BACKUP_CREATED │
└──────────────────────────────────────┴────────────────┘

1 row in set. Elapsed: 0.095 sec.
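
Depending on your ClickHouse version, you may also be able to inspect the backup afterwards via the system.backups table; a sketch (column names can vary slightly between versions):

SELECT id, name, status, error
FROM system.backups
ORDER BY start_time DESC
LIMIT 5;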

Drop the table:

ch_minio_s3 :) DROP TABLE my_table

DROP TABLE my_table

Query id: c3456044-4689-406e-82ac-8d08b8b618fe

Ok.

0 rows in set. Elapsed: 0.007 sec.

Restore just the partition with id 1 from the backup:

ch_minio_s3 :) RESTORE TABLE my_table PARTITION 1 FROM Disk('s3','backups/');

RESTORE TABLE my_table PARTITION 1 FROM Disk('s3', 'backups/')

Query id: ea306c73-83c5-479f-9c0c-391594facc69

┌─id───────────────────────────────────┬─status───┐
│ ec6841a8-0607-465e-bc4d-d446f960d40a │ RESTORED │
└──────────────────────────────────────┴──────────┘

1 row in set. Elapsed: 0.065 sec.

Validate the restored data:

ch_minio_s3 :) SELECT
_partition_id AS partition_id,
cityHash64(sum(number)) AS hash,
count() AS count
FROM my_table
GROUP BY partition_id

SELECT
_partition_id AS partition_id,
cityHash64(sum(number)) AS hash,
count() AS count
FROM my_table
GROUP BY partition_id

Query id: a916176d-6a6e-47fc-ba7d-79bb33b152d8

┌─partition_id─┬─────────────────hash─┬──count─┐
│ 1            │ 11827822647069388611 │ 500000 │
└──────────────┴──────────────────────┴────────┘

1 row in set. Elapsed: 0.012 sec. Processed 500.00 thousand rows, 16.00 MB (41.00 million rows/s., 1.31 GB/s.)

· 2 min read

Question

I'm writing data into ClickHouse Cloud and, when reading it back, I need to guarantee that I'm getting the latest complete information.

Answer

Talking to the same node

If you are using the native protocol, or a session to do your write/read, you will be connected to the same replica: in this scenario you're reading directly from the node where you're writing, so your read will always be consistent.

Talking to a random node

If you can't guarantee you're talking to the same node (for example, when talking to the node via HTTPS calls that get shuffled by a load balancer), you can either:

A)

  1. write your data
  2. connect to a new replica
  3. run SYSTEM SYNC REPLICA db.table_name LIGHTWEIGHT
  4. read the latest data

See SYSTEM commands reference
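
As SQL, option A boils down to something like the following (db.table_name is a placeholder for your table):

-- run on the replica you are about to read from
SYSTEM SYNC REPLICA db.table_name LIGHTWEIGHT;

-- reads on this replica now reflect previously acknowledged writes
SELECT count()
FROM db.table_name;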

OR

B) read anytime with sequential consistency

SELECT 
...
SETTINGS select_sequential_consistency = 1

Note: when using ClickHouse Cloud and its default SharedMergeTree engine, using insert_quorum_parallel is not required (it's a given).

Using SYSTEM SYNC REPLICA or select_sequential_consistency will increase the load on ClickHouse Keeper and might result in slower performance, depending on the load on the service.

The recommended approach is to do the writes/reads using the same session or the native protocol (sticky connection).

· 2 min read

If you install ClickHouse using brew, you may encounter an error from MacOS. By default, MacOS will not run applications or tools created by a developer who cannot be verified. When attempting to run any clickhouse command, you may see this error:

MacOS showing a developer verification error.

To get around this verification error, you need to remove the app from MacOS' quarantine bin, either by finding the appropriate setting in your System Settings window or by using the terminal.

System settings process

The easiest way to remove the clickhouse executable from the quarantine bin is to:

  1. Open System settings.

  2. Navigate to Privacy & Security:

  3. Scroll to the bottom of the window to find a message saying "clickhouse-macos-aarch64" was blocked from use because it is not from an identified developer.

  4. Click Allow Anyway.

  5. Enter your MacOS user password.

You should now be able to run clickhouse commands in your terminal.

Terminal process

You can perform this process using the command-line:

First find out where Homebrew installed the clickhouse executable:

which clickhouse

This should output something like:

/opt/homebrew/bin/clickhouse

Remove clickhouse from the quarantine bin by running xattr -d com.apple.quarantine followed by the path from the previous command:

xattr -d com.apple.quarantine /opt/homebrew/bin/clickhouse
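
The two steps can also be combined into a single command, assuming which resolves to the Homebrew-installed binary:

xattr -d com.apple.quarantine "$(which clickhouse)"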

You should now be able to run the clickhouse executable:

clickhouse

This should output something like:

Use one of the following commands:
clickhouse local [args]
clickhouse client [args]
clickhouse benchmark [args]
...