Performance Mode
DataStore has two compatibility modes that control whether output is shaped for pandas compatibility or optimized for raw SQL performance.
Overview
| Mode | compat_mode value | Description |
|---|---|---|
| Pandas (default) | "pandas" | Full pandas behavior compatibility. Row order preserved, MultiIndex, set_index, dtype corrections, stable sort tiebreakers, -If/isNaN wrappers. |
| Performance | "performance" | SQL-first execution. All pandas compatibility overhead removed. Maximum throughput, but results may differ structurally from pandas. |
What Performance Mode Disables
| Overhead | Pandas mode behavior | Performance mode behavior |
|---|---|---|
| Row-order preservation | _row_id injection, rowNumberInAllBlocks(), __orig_row_num__ subqueries | Disabled — row order not guaranteed |
| Stable sort tiebreaker | rowNumberInAllBlocks() ASC appended to ORDER BY | Disabled — ties may have arbitrary order |
| Parquet preserve_order | input_format_parquet_preserve_order=1 | Disabled — parallel Parquet reading allowed |
| GroupBy auto ORDER BY | ORDER BY group_key added (pandas default sort=True) | Disabled — groups returned in arbitrary order |
| GroupBy dropna WHERE | WHERE key IS NOT NULL added (pandas default dropna=True) | Disabled — NULL groups included |
| GroupBy set_index | Group keys set as index | Disabled — group keys stay as columns |
| MultiIndex columns | agg({'col': ['sum','mean']}) returns MultiIndex columns | Disabled — flat column names (col_sum, col_mean) |
-If/isNaN wrappers | sumIf(col, NOT isNaN(col)) for skipna | Disabled — plain sum(col) (ClickHouse natively skips NULL) |
toInt64 on count | toInt64(count()) to match pandas int64 | Disabled — native SQL dtype returned |
fillna(0) for all-NaN sum | Sum of all-NaN returns 0 (pandas behavior) | Disabled — returns NULL |
| Dtype corrections | abs() unsigned→signed, etc. | Disabled — native SQL dtypes |
| Index preservation | Restores original index after SQL execution | Disabled |
first()/last() | argMin/argMax(col, rowNumberInAllBlocks()) | any(col) / anyLast(col) — faster but non-deterministic |
| Single-SQL aggregation | ColumnExpr groupby materializes intermediate DataFrame | Injects LazyGroupByAgg into lazy ops chain — single SQL query |
Enabling Performance Mode
Using config object
Using module-level functions
Using convenience imports
Setting performance mode automatically sets the execution engine to chdb. You do not need to call config.use_chdb() separately.
When to Use Performance Mode
Use performance mode when:
- Processing large datasets (hundreds of thousands to millions of rows)
- Running aggregation-heavy workloads (groupby, sum, mean, count)
- Row order does not matter (e.g., aggregated results, reports, dashboards)
- You want maximum SQL throughput and minimal overhead
- Memory usage is a concern (parallel Parquet reading, no intermediate DataFrames)
Stay in pandas mode when:
- You need exact pandas behavior (row order, MultiIndex, dtypes)
- You rely on
first()/last()returning the true first/last row - You use
shift(),diff(),cumsum()that depend on row order - You're writing tests that compare DataStore output with pandas
Behavior Differences
Row Order
In performance mode, row order is not guaranteed for any operation. This includes:
- Filter results
- GroupBy aggregation results
head()/tail()without explicitsort_values()first()/last()aggregations
If you need ordered results, add an explicit sort_values():
GroupBy Results
| Aspect | Pandas mode | Performance mode |
|---|---|---|
| Group key location | Index (via set_index) | Regular column |
| Group order | Sorted by key (default) | Arbitrary order |
| NULL groups | Excluded (default dropna=True) | Included |
| Column format | MultiIndex for multi-agg | Flat names (col_func) |
first()/last() | Deterministic (row order) | Non-deterministic (any()/anyLast()) |
Aggregation
Single-SQL Execution
In performance mode, ColumnExpr groupby aggregation (e.g., ds[condition].groupby('col')['val'].sum()) is executed as a single SQL query instead of the two-step process used in pandas mode:
This eliminates the intermediate DataFrame materialization and can significantly reduce memory usage and execution time.
Comparison with Execution Engine
Performance mode (compat_mode) and execution engine (execution_engine) are independent configuration axes:
| Config | Controls | Values |
|---|---|---|
execution_engine | Which engine runs the computation | auto, chdb, pandas |
compat_mode | Whether to reshape output for pandas compatibility | pandas, performance |
Setting compat_mode='performance' automatically sets execution_engine='chdb', since performance mode is designed for SQL execution.
Testing with Performance Mode
When writing tests for performance mode, results may differ from pandas in row order and structural format. Use these strategies:
Sort-then-compare (aggregations, filters)
Value-range check (first/last)
Schema-and-count (LIMIT without ORDER BY)
Best Practices
1. Enable early in your script
2. Add explicit sorting when order matters
3. Use for batch/ETL workloads
4. Switch modes within a session
Related Documentation
- Execution Engine — Engine selection (auto/chdb/pandas)
- Performance Guide — General optimization tips
- Key Differences from pandas — Behavioral differences