Integrating Object Storage with ClickHouse Cloud

Object Storage ClickPipes provide a simple and resilient way to ingest data from Amazon S3 and Google Cloud Storage into ClickHouse Cloud. Both one-time and continuous ingestion are supported with exactly-once semantics.

Prerequisite

You have familiarized yourself with the ClickPipes intro.

Creating your first ClickPipe

In the cloud console, select the Data Sources button on the left-side menu and click on "Set up a ClickPipe"

Select imports

Select your data source.

Select data source type

Fill out the form by providing your ClickPipe with a name, a description (optional), your IAM role or credentials, and bucket URL. You can specify multiple files using bash-like wildcards. For more information, see the documentation on using wildcards in path.

Fill out connection details

The UI will display a list of files in the specified bucket. Select your data format (we currently support a subset of ClickHouse formats) and if you want to enable continuous ingestion More details below.

Set data format and topic

In the next step, you can select whether you want to ingest data into a new ClickHouse table or reuse an existing one. Follow the instructions in the screen to modify your table name, schema, and settings. You can see a real-time preview of your changes in the sample table at the top.

Set table, schema, and settings

You can also customize the advanced settings using the controls provided

Set advanced controls

Alternatively, you can decide to ingest your data in an existing ClickHouse table. In that case, the UI will allow you to map fields from the source to the ClickHouse fields in the selected destination table.

Use and existing table

Info

You can also map virtual columns, like _path or _size, to fields.

Finally, you can configure permissions for the internal ClickPipes user.

Permissions: ClickPipes will create a dedicated user for writing data into a destination table. You can select a role for this internal user using a custom role or one of the predefined role:

Full access: with the full access to the cluster. Required if you use Materialized View or Dictionary with the destination table.
Only destination table: with the INSERT permissions to the destination table only.

permissions

By clicking on "Complete Setup", the system will register you ClickPipe, and you'll be able to see it listed in the summary table.

Success notice

Remove notice

The summary table provides controls to display sample data from the source or the destination table in ClickHouse

View destination

As well as controls to remove the ClickPipe and display a summary of the ingest job.

View overview

Congratulations! you have successfully set up your first ClickPipe. If this is a streaming ClickPipe it will be continuously running, ingesting data in real-time from your remote data source. Otherwise it will ingest the batch and complete.

Supported Data Sources

Name	Logo	Type	Status	Description
Amazon S3		Object Storage	Beta	Configure ClickPipes to ingest large volumes of data from object storage.
Google Cloud Storage		Object Storage	Beta	Configure ClickPipes to ingest large volumes of data from object storage.

More connectors are will get added to ClickPipes, you can find out more by contacting us.

Supported Data Formats

The supported formats are:

Exactly-Once Semantics

Various types of failures can occur when ingesting large dataset, which can result in a partial inserts or duplicate data. Object Storage ClickPipes are resilient to insert failures and provides exactly-once semantics. This is accomplished by using temporary "staging" tables. Data is first inserted into the staging tables. If something goes wrong with this insert, the staging table can be truncated and the insert can be retried from a clean state. Only when an insert is completed and successful, the partitions in the staging table are moved to target table. To read more about this strategy, check-out this blog post.

View Support

Materialized views on the target table are also supported. ClickPipes will create staging tables not only for the target table, but also any dependent materialized view.

We do not create staging tables for non-materialized views. This means that if you have a target table with one of more downstream materialized views, those materialized views should avoid selecting data via a view from the target table. Otherwise, you may find that you are missing data in the materialized view.

Scaling

Object Storage ClickPipes are scaled based on the minimum ClickHouse service size determined by the configured vertical autoscaling settings. The size of the ClickPipe is determined when the pipe is created. Subsequent changes to the ClickHouse service settings will not affect the ClickPipe size.

To increase the throughput on large ingest jobs, we recommend scaling the ClickHouse service before creating the ClickPipe.

Limitations

Any changes to the destination table, its materialized views (including cascading materialized views), or the materialized view's target tables won't be picked up automatically by the pipe and can result in errors. You must stop the pipe, make the necessary modifications, and then restart the pipe for the changes to be picked up and avoid errors and duplicate data due to retries.
There are limitations on the types of views that are supported. Please read the section on exactly-once semantics and view support for more information.
Role authentication is not available for S3 ClickPipes for ClickHouse Cloud instances deployed into GCP or Azure. It is only supported for AWS ClickHouse Cloud instances.
ClickPipes will only attempt to ingest objects at 10GB or smaller in size. If a file is greater than 10GB an error will be appended to the ClickPipes dedicated error table.
S3 / GCS ClickPipes does not share a listing syntax with the S3 Table Function.
- ? — Substitutes any single character
- * — Substitutes any number of any characters except / including empty string
- ** — Substitutes any number of any character include / including empty string

Note

This is a valid path:

https://datasets-documentation.s3.eu-west-3.amazonaws.com/http/**.ndjson.gz

This is not a valid path. {N..M} are not supported in ClickPipes.

https://datasets-documentation.s3.eu-west-3.amazonaws.com/http/{documents-01,documents-02}.ndjson.gz

Continuous Ingest

ClickPipes supports continuous ingestion from both S3 and GCS. When enabled, ClickPipes will continuously ingest data from the specified path, it will poll for new files at a rate of once every 30 seconds. However, new files must be lexically greater than the last ingested file, meaning they must be named in a way that defines the ingestion order. For instance, files named file1, file2, file3, etc., will be ingested sequentially. If a new file is added with a name like file0, ClickPipes will not ingest it because it is not lexically greater than the last ingested file.

Archive table

ClickPipes will create a table next to your destination table with the postfix s3_clickpipe_<clickpipe_id>_archive. This table will contain a list of all the files that have been ingested by the ClickPipe. This table is used to track files during ingestion and can be used to verify files have been ingested. The archive table has a TTL of 7 days.

Note

These tables will not be visible using ClickHouse Cloud SQL Console, you will need to connect via an external client either using HTTPS or Native connection to read them.

Authentication

S3

You can access public buckets with no configuration, and with protected buckets you can use IAM credentials or an IAM Role. You can refer to this guide to understand the required permission for accessing your data.

GCS

Like S3, you can access public buckets with no configuration, and with protected buckets you can use HMAC Keys in place of the AWS IAM credentials. You can read this guide from Google Cloud on how to setup such keys.

Service Accounts for GCS aren't directly supported. HMAC (IAM) Credentials must be used when authenticating with non-public buckets. The Service Account permissions attached to the HMAC credentials should be storage.objects.list and storage.objects.get.

F.A.Q.

Does ClickPipes support GCS buckets prefixed with gs://?

No. For interoperability reasons we ask you to replace your gs:// bucket prefix with https://storage.googleapis.com/.

What permissions does a GCS public bucket require?

allUsers requires appropriate role assignment. The roles/storage.objectViewer role must be granted at the bucket level. This role provides the storage.objects.list permission, which allows ClickPipes to list all objects in the bucket which is required for onboarding and ingestion. This role also includes the storage.objects.get permission, which is required to read or download individual objects in the bucket. See: Google Cloud Access Control for further information.

Prerequisite​

Creating your first ClickPipe​

Supported Data Sources​

Supported Data Formats​

Exactly-Once Semantics​

View Support​

Scaling​

Limitations​

Continuous Ingest​

Archive table​

Authentication​

S3​

GCS​

F.A.Q.​