RowBinary
| Input | Output | Alias |
|---|---|---|
| ✔ | ✔ |
Description
The RowBinary format parses data by row in binary format.
Rows and values are listed consecutively, without separators.
Because data is in the binary format the delimiter after FORMAT RowBinary is strictly specified as follows:
- Any number of whitespaces:
' '(space - code0x20)'\t'(tab - code0x09)'\f'(form feed - code0x0C)
- Followed by exactly one new line sequence:
- Windows style
"\r\n" - or Unix style
'\n'
- Windows style
- Immediately followed by binary data.
This format is less efficient than the Native format since it is row-based.
Data types wire format
Most of the queries provided in the examples can be executed with curl with a file output.
Then, the data can be examined with a hex editor.
Unsigned LEB128 (Little Endian Base 128)
An unsigned little-endian variable-width integer encoding used to encode the length of variable-size data types such as String, Array and Map. A sample implementation can be found on the LEB128 wiki page.
(U)Int8, (U)Int16, (U)Int32, (U)Int64, (U)Int128, (U)Int256
All integer types are encoded with an appropriate number of bytes as little-endian. Signed types (Int8 through Int256) use two's complement representation. Most languages support extracting such integers from byte arrays, using either built-in tools, or well-known libraries. For Int128/Int256 and UInt128/UInt256, which exceed most languages' native integer sizes, custom deserialization may be required.
Bool
Boolean values are encoded as a single byte, and can be deserialized similarly to UInt8.
0isfalse1istrue
Float32, Float64
Little-endian floating-point numbers encoded as 4 bytes for Float32 and 8 bytes for Float64. Similarly to integers, most languages provide proper tools to deserialize these values.
BFloat16
BFloat16 (Brain Floating Point) is a 16-bit floating point format with the range of Float32 and reduced precision, making it useful for machine learning workloads. The wire format is essentially the top 16 bits of a Float32 value. If your language doesn't support it natively, the easiest way to handle it is to read and write as UInt16, converting to and from Float32:
To convert BFloat16 to Float32 (pseudocode):
To convert Float32 to BFloat16 (pseudocode):
Sample underlying values for BFloat16:
Decimal32, Decimal64, Decimal128, Decimal256
Decimal types are represented as little-endian integers with respective bit width.
Decimal32- 4 bytes, orInt32.Decimal64- 8 bytes, orInt64.Decimal128- 16 bytes, orInt128.Decimal256- 32 bytes, orInt256.
When deserializing a Decimal value, the whole and fractional parts can be derived using the following pseudocode:
Where trunc performs truncation toward zero (not floor division, which differs for negative values), and scale is the number of digits after the decimal point. For example, for Decimal(10, 2) (an equivalent to Decimal32(2)), the scale is 2, and the value 12345 will be represented as (123, 45).
Serialization requires the reverse operation:
See more details in the Decimal types ClickHouse docs.
String
ClickHouse strings are arbitrary byte sequences. They are not required to be valid UTF-8. The length prefix is the byte length, not the character count.
Encoded in two parts:
- A variable-length integer (LEB128) that indicates the length of the string in bytes.
- The raw bytes of the string.
For example, a string foobar will be encoded using seven bytes as follows:
FixedString
Unlike String, FixedString has a fixed length, which is defined in the schema. It is encoded as a sequence of bytes, padded with trailing zero bytes if the value is shorter than N.
When reading a FixedString, trailing zero bytes may be either padding or actual \0 characters in the data, they are indistinguishable on the wire. ClickHouse itself preserves all N bytes as-is.
An empty FixedString(3) contains only padding zeroes:
Non-empty FixedString(3) containing the string hi:
Non-empty FixedString(3) containing the string bar:
No padding is required in the last example, since all three bytes are used.
Date
Stored as UInt16 (two bytes) representing the number of days since 1970-01-01.
Supported range of values: [1970-01-01, 2149-06-06].
Sample underlying values for Date:
Date32
Stored as Int32 (four bytes) representing the number of days before or after 1970-01-01.
Supported range of values: [1900-01-01, 2299-12-31].
Sample underlying values for Date32:
A date before the epoch:
DateTime
Stored as UInt32 (four bytes) representing the number of seconds since 1970-01-01 00:00:00 UTC.
Syntax:
For example, DateTime or DateTime('UTC').
The binary value is always a UTC epoch offset. The timezone does not change the encoding. However, the timezone does affect how string values are interpreted on insertion: inserting '2024-01-15 10:30:00' into a DateTime('America/New_York') column stores a different epoch value than inserting the same string into a DateTime('UTC') column, because the string is interpreted as local time in the column's timezone. On the wire, both are just UInt32 epoch seconds.
Supported range of values: [1970-01-01 00:00:00, 2106-02-07 06:28:15].
Sample underlying values for DateTime:
DateTime64
Stored as Int64 (eight bytes) representing the number of ticks before or after 1970-01-01 00:00:00 UTC. Tick resolution is defined by the precision parameter, see the syntax below:
Where precision is an integer from 0 to 9. Typically, only the following are used: 3 (milliseconds), 6 (microseconds),
9 (nanoseconds).
Examples of valid DateTime64 definitions: DateTime64(0), DateTime64(3), DateTime64(6, 'UTC'), or DateTime64(9, 'Europe/Amsterdam').
As with DateTime, the binary value is always a UTC epoch offset. The timezone affects how string values are interpreted on insertion (see the DateTime note), but the encoding itself is always Int64 ticks since the UTC epoch.
The underlying Int64 value of the DateTime64 type can be interpreted as the number of the following units before or after the UNIX epoch:
DateTime64(0)- seconds.DateTime64(3)- milliseconds.DateTime64(6)- microseconds.DateTime64(9)- nanoseconds.
Supported range of values: [1900-01-01 00:00:00, 2299-12-31 23:59:59.99999999].
Sample underlying values for DateTime64:
DateTime64(3): value1546300800000represents2019-01-01 00:00:00 UTC.DateTime64(6): value1705314600123456represents2024-01-15 10:30:00.123456 UTC.DateTime64(9): value1705314600123456789represents2024-01-15 10:30:00.123456789 UTC.
The precision of the maximum value is 8. If the maximum precision of 9 digits (nanoseconds) is used, the maximum supported value is 2262-04-11 23:47:16 in UTC.
Time
Stored as Int32 representing a time value in seconds. Negative values are valid.
Supported range of values: [-999:59:59, 999:59:59] (i.e., [-3599999, 3599999] seconds).
At the moment, the setting enable_time_time64_type must be set to 1 to use Time or Time64.
Sample underlying values for Time:
Time64
Internally stored as a Decimal64 (which is stored as Int64) representing a time value with fractional seconds, with configurable precision. Negative values are valid.
Syntax:
Where precision is an integer from 0 to 9. Common values: 3 (milliseconds), 6 (microseconds), 9 (nanoseconds).
Supported range of values: [-999:59:59.xxxxxxxxx, 999:59:59.xxxxxxxxx].
At the moment, the setting enable_time_time64_type must be set to 1 to use Time or Time64.
The underlying Int64 value represents fractional seconds scaled by 10^precision.
Sample underlying values for Time64:
Interval types
All interval types are stored as Int64 (eight bytes, little-endian). The value represents the count of the respective time unit. Negative values are valid.
The interval types are: IntervalNanosecond, IntervalMicrosecond, IntervalMillisecond, IntervalSecond, IntervalMinute, IntervalHour, IntervalDay, IntervalWeek, IntervalMonth, IntervalQuarter, IntervalYear.
The interval type name (e.g., IntervalSecond vs IntervalDay) determines the unit of the stored value. The wire encoding is always the same.
Sample underlying values:
Enum8, Enum16
Stored as a single byte (Enum8 == Int8) or two bytes (Enum16 == Int16) representing the index of the enum value in the enum definition. Note that the storage type is signed — enum values can be negative (e.g., Enum8('a' = -128, 'b' = 0)).
An Enum can be defined in a simple way, like this:
The Enum8 defined above will have the following values map on the client:
Or in a more complex way, like this:
The Enum16 defined above will have the following values map on the client:
For the data type parser, the main challenge is tracking escaped symbols in the enum definition, such as \', and special symbols like = that may appear within quoted strings.
UUID
Represented as a sequence of 16 bytes. The UUID is stored as two little-endian UInt64 values: the first 8 bytes of the standard UUID representation are byte-reversed, and the second 8 bytes are independently byte-reversed.
For example, given UUID 61f0c404-5cb3-11e7-907b-a6006ad3dba0:
- Standard byte representation:
61 f0 c4 04 5c b3 11 e7|90 7b a6 00 6a d3 db a0 - First half reversed (LE UInt64):
e7 11 b3 5c 04 c4 f0 61 - Second half reversed (LE UInt64):
a0 db d3 6a 00 a6 7b 90
Sample underlying values for UUID:
61f0c404-5cb3-11e7-907b-a6006ad3dba0is represented as:
- The default UUID
00000000-0000-0000-0000-000000000000is represented as 16 zero bytes:
It can be used when a new record was inserted, but the UUID value was not specified.
IPv4
Stored in four bytes as UInt32 in little-endian byte order. Note that this differs from the traditional network byte order (big-endian) commonly used for IP addresses. Sample underlying values for IPv4:
IPv6
Stored in 16 bytes in big-endian / network byte order (MSB first). Sample underlying values for IPv6:
Nullable
A nullable data type is encoded as follows:
- A single byte that indicates whether the value is
NULLor not:0x00means the value is notNULL.0x01means the value isNULL.
- If the value is not
NULL, the underlying data type is encoded as usual. If the value isNULL, no additional bytes are written for the underlying type.
For example, a Nullable(UInt32) value:
LowCardinality
In RowBinary format, the low-cardinality marker does not affect the wire format. For example, a LowCardinality(String) is encoded the same way as a regular String.
This only applies to RowBinary. In the Native format, LowCardinality uses a different dictionary-based encoding.
A column can be defined as LowCardinality(Nullable(T)), but it is not possible to define it as Nullable(LowCardinality(T)) - it will always result in an error from the server.
While testing, allow_suspicious_low_cardinality_types can be set to 1 to allow most of the data types inside LowCardinality for better coverage.
Array
An array is encoded as follows:
- A variable-length integer (LEB128) that indicates the number of elements in the array.
- The elements of the array, encoded in the same way as the underlying data type.
For example, an array with UInt32 values:
A slightly more complex example:
An array can contain nullable values, but the array itself cannot be nullable.
The following is valid:
And it will be encoded as follows:
An example of dealing with multidimensional arrays can be found in the Geo section.
Tuple
A tuple is encoded as all elements of the tuple following each other in their corresponding wire format without any additional meta-information or delimiters.
The string encoding of the tuple data type presents similar challenges as with the Enum type, such as tracking the escaped symbols and special characters; now, with Tuple it is also required to track open and closing parentheses. Additionally, note that the most complex Tuples can contain other nested Tuples, Arrays, Maps, and even enums.
For example, in the following table, the tuple contains an enum with a tick and parenthesis in the name, which can cause parsing issues if not handled properly:
Map
A map can be viewed as an Array(Tuple(K, V)), where K is the key type and V is the value type. The map is encoded as follows:
- A variable-length integer (LEB128) that indicates the number of elements in the map.
- The elements of the map as key-value pairs, encoded as their corresponding types.
For example, a map with String keys and UInt32 values:
It is possible to have maps with deeply nested structures, such as Map(String, Map(Int32, Array(Nullable(String)))), which will be encoded similarly to what is described above.
Variant
This type represents a union of other data types. Type Variant(T1, T2, ..., TN) means that each row of this type has a value of either type T1 or T2 or … or TN or none of them (NULL value).
While for the end user Variant(T1, T2) means exactly the same as Variant(T2, T1), the order of types in the definition matters for the wire format: the types in the definition are always sorted alphabetically, and this is important, since the exact variant is encoded by a "discriminant" - the data type index in the definition.
Consider the following example:
A NULL value is encoded with a discriminant byte of 0xFF:
The allow_suspicious_variant_types setting can be used to allow more exhaustive testing of the Variant type.
Dynamic
The Dynamic type can hold values of any type, determined at runtime. In RowBinary format, each value is self-describing: the first part is the type specification in this format. The contents then follow, with the value encoding as described in this document. So to parse a value you just need to use the type index to determine the right parser and then re-use the RowBinary parsing you already have elsewhere.
Where BinaryTypeIndex is a single byte identifying the type. See the reference here for the type indices and parameters.
A NULL Dynamic value is encoded with BinaryTypeIndex 0x00 (the Nothing type), with no additional bytes:
Examples:
JSON
The JSON type encodes data in two distinct categories:
- Typed Paths - Paths declared with explicit types in the schema (e.g.,
JSON(user_id UInt32, name String)) - Dynamic Paths/Overflow paths when dynamic path limit is exceeded - Runtime-discovered paths stored as
Dynamictype. The value encoding is preceded by the type definition.
The wire format and rules are different for these two categories.
| Path Category | Included in Serialization | Value Encoding | Variant/Nullable allowed |
|---|---|---|---|
| Typed paths | Always (even if NULL) | Type-specific binary format | Yes |
| Dynamic paths | Only if non-null | Dynamic | No |
Paths are serialized in three groups, written sequentially: typed paths, dynamic paths, then shared data (overflow) paths. Typed and dynamic paths are written in an implementation-defined order (determined by internal hash-map iteration), while shared data paths are written in alphabetical order. Readers should not rely on any specific path ordering. The deserializer dispatches each path by name, not by position.
Each JSON row in RowBinary format is serialized as:
Examples:
1. Simple JSON with typed paths only:
Schema: JSON(user_id UInt32, active Bool)
Row: {"user_id": 42, "active": true}
Binary encoding (hex with annotations):
2. Simple JSON with typed and dynamic paths:
Schema: JSON(user_id UInt32, active Bool)
Row: {"user_id": 42, "active": true, "name": "Alice"}
Binary encoding (hex with annotations):
3. Null handling:
With a typed nullable column you get null:
Schema: JSON(score Nullable(Int32))
Row: {"score": null }
Binary encoding (hex with annotations):
With a typed non-nullable column, you get the default value:
Schema: JSON(name String)
Row: {"name": null}
Binary encoding:
With a dynamic path, it is ignored:
Schema: JSON(id UInt64)
Row: {"id": 100, "metadata": null}
Binary encoding:
Note: The metadata path with NULL value is not included because dynamic paths are only serialized when non-null. This is a key difference from typed paths.
4. Nested JSON objects:
Schema: JSON()
Row: {"user": {"name": "Bob", "age": 30}}
Binary encoding (hex with annotations):
Note: Nested objects are flattened into dot-separated paths (e.g., user.name instead of a nested structure).
Alternative: JSON as String Mode
With the setting output_format_binary_write_json_as_string=1, JSON columns are serialized as a single JSON text string instead of the structured binary format. There is a corresponding setting for writing to JSON columns, input_format_binary_read_json_as_string. The choice of setting here comes down to whether you want to parse the JSON in the client or the server.
Geo types
Geo is a category of data types that represent geographical data. It includes:
Point- asTuple(Float64, Float64).Ring- asArray(Point), orArray(Tuple(Float64, Float64)).Polygon- asArray(Ring), orArray(Array(Tuple(Float64, Float64))).MultiPolygon- asArray(Polygon), orArray(Array(Array(Tuple(Float64, Float64)))).LineString- asArray(Point), orArray(Tuple(Float64, Float64)).MultiLineString- asArray(LineString), orArray(Array(Tuple(Float64, Float64))).
The wire format of the Geo values is exactly the same as with Tuple and Array. RowBinaryWithNamesAndTypes format headers will contain the aliases for these types, e.g., Point, Ring, Polygon, MultiPolygon, LineString, and MultiLineString.
Geometry
Geometry is a Variant type that can hold any of the Geo types listed above. On the wire, it is encoded exactly like a Variant, with a discriminant byte indicating which geo type follows.
The discriminant indices for Geometry are:
| Index | Type |
|---|---|
| 0 | LineString |
| 1 | MultiLineString |
| 2 | MultiPolygon |
| 3 | Point |
| 4 | Polygon |
| 5 | Ring |
Wire format structure:
Sample encoding of a Point as Geometry:
Sample encoding of a Ring as Geometry:
Nested
The wire format for Nested depends on the flatten_nested setting.
All component arrays in a single row must have the same length. This is a server-enforced constraint. Mismatched lengths will cause insertion errors.
flatten_nested = 1 (default)
With the default setting, Nested is flattened into independent arrays. Each sub-column becomes a separate Array column with a dot-separated name:
DESCRIBE TABLE foo shows the flattened columns:
Each array is serialized independently, as described in the Array section:
flatten_nested = 0
With flatten_nested = 0, Nested is preserved as a single column of type Array(Tuple(...)). The column name is not dot-separated:
DESCRIBE TABLE foo shows a single column:
The encoding is Array(Tuple(String, Int32)): an array length prefix, then each element's tuple fields in order:
Note how the fields are interleaved per element (a₁, b₁, a₂, b₂) rather than grouped by column (a₁, a₂, b₁, b₂) as in the flattened representation.
SimpleAggregateFunction
SimpleAggregateFunction(func, T) is encoded identically to its underlying data type T. The aggregate function name does not affect the wire format.
For example, SimpleAggregateFunction(max, UInt32) is encoded the same way as a plain UInt32:
The RowBinaryWithNamesAndTypes header reports the type as SimpleAggregateFunction(max, UInt32), but the value on the wire is just a UInt32:
AggregateFunction
AggregateFunction(func, T) stores the full intermediate state of an aggregate function. Unlike SimpleAggregateFunction, which also stores an intermediate state but encodes it identically to the underlying data type, AggregateFunction stores an opaque binary blob whose format is specific to each aggregate function.
Aggregate states have no length prefix in RowBinary. A parser must understand the internal serialization format of each specific aggregate function to know how many bytes to consume. In practice, most clients treat aggregate states as opaque and use *State / *Merge combinators to let the server handle serialization.
The internal format varies by function. Some simple examples:
countState — stores the count as a VarUInt (LEB128):
sumState — stores the accumulated sum in a fixed-size integer. The width depends on the argument type (UInt64 for integer arguments):
minState / maxState — stores a flag byte followed by the value in the underlying type. The flag is 0x00 for an empty state (no values seen) or 0x01 when a value is present:
An empty state (no rows aggregated):
More complex functions like uniq, quantile, or groupArray use implementation-specific formats. If you need to read or write these states, consult the ClickHouse source code for the specific function.
QBit
QBit is a vector type for efficient lookup with different levels of precision. Internally it’s stored in a transposed format. On the wire, QBit is simply an Array of the underlying element type (Float32, Float64, or BFloat16). The bit-transpose optimization for storage happens server-side, not in the RowBinary protocol.
Syntax:
Where element_type is Float32, Float64, or BFloat16, and dimension is the fixed vector dimension.
Wire format: identical to Array(element_type):
Sample encoding of QBit(Float32, 4) containing [1.0, 2.0, 3.0, 4.0]:
Format settings
The following settings are common to all RowBinary type formats.
| Setting | Description | Default |
|---|---|---|
format_binary_max_string_size | The maximum allowed size for String in RowBinary format. | 1GiB |
output_format_binary_encode_types_in_binary_format | Allows to write types in header using binary encoding instead of strings with type names in RowBinaryWithNamesAndTypes output format. | false |
input_format_binary_decode_types_in_binary_format | Allows to read types in header using binary encoding instead of strings with type names in RowBinaryWithNamesAndTypes input format. | false |
output_format_binary_write_json_as_string | Allows to write values of the JSON data type as JSON String values in RowBinary output format. | false |
input_format_binary_read_json_as_string | Allows to read values of the JSON data type as JSON String values in RowBinary input format. | false |