Steven's Knowledge

Getting Started

Run ClickHouse locally, load a public dataset, explore columnar speed, try DuckDB for in-process analytics

Getting Started

This page picks two warehouses to learn the model: ClickHouse (self-hosted, fastest OSS columnar engine) and DuckDB (in-process, instantly available). Both teach the same lessons: columnar storage, vectorized queries, compression, and SQL ergonomics.

Path A: ClickHouse Locally

docker run -d --name clickhouse \
  -p 8123:8123 -p 9000:9000 \
  --ulimit nofile=262144:262144 \
  clickhouse/clickhouse-server:24.8

# Connect
docker exec -it clickhouse clickhouse-client

Load a Public Dataset

The classic: GitHub events, a billion rows of public events from the GH API.

-- Inside clickhouse-client
CREATE TABLE github_events (
  file_time   DateTime,
  event_type  Enum8('CommitCommentEvent' = 1, 'CreateEvent' = 2, 'DeleteEvent' = 3,
                    'ForkEvent' = 4, 'GollumEvent' = 5, 'IssueCommentEvent' = 6,
                    'IssuesEvent' = 7, 'MemberEvent' = 8, 'PublicEvent' = 9,
                    'PullRequestEvent' = 10, 'PullRequestReviewCommentEvent' = 11,
                    'PushEvent' = 12, 'ReleaseEvent' = 13, 'SponsorshipEvent' = 14,
                    'WatchEvent' = 15, 'GistEvent' = 16, 'FollowEvent' = 17,
                    'DownloadEvent' = 18, 'PullRequestReviewEvent' = 19,
                    'ForkApplyEvent' = 20, 'Event' = 21, 'TeamAddEvent' = 22),
  actor_login LowCardinality(String),
  repo_name   LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY (event_type, repo_name, file_time);

Load a chunk (one day's worth, ~10M rows):

INSERT INTO github_events
SELECT *
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/github_events/2024-01-01.parquet',
        'Parquet',
        'file_time DateTime, event_type LowCardinality(String), actor_login LowCardinality(String), repo_name LowCardinality(String)');

(For learning, even a smaller CSV works — clickhouse-client --query "INSERT INTO github_events FORMAT CSV" < your.csv.)

Run Analytical Queries

-- Top 10 most-active repos by pull request count
SELECT repo_name, count() AS prs
FROM github_events
WHERE event_type = 'PullRequestEvent'
GROUP BY repo_name
ORDER BY prs DESC
LIMIT 10;
-- Time: < 100ms on 10M rows
-- Hourly event volume by type
SELECT toHour(file_time) AS hour, event_type, count() AS events
FROM github_events
GROUP BY hour, event_type
ORDER BY hour, events DESC;
-- Compare: how slow is the same query on row-oriented storage?
-- Postgres would take 10-30s on the same data. ClickHouse: ~150ms.

See What's Going On

-- How much disk is the table using?
SELECT
  formatReadableSize(sum(bytes_on_disk)) AS size,
  sum(rows) AS rows,
  formatReadableSize(sum(bytes_on_disk) / sum(rows)) AS bytes_per_row
FROM system.parts
WHERE table = 'github_events';

-- Compression ratio per column
SELECT
  name,
  formatReadableSize(data_compressed_bytes) AS compressed,
  formatReadableSize(data_uncompressed_bytes) AS uncompressed,
  round(data_uncompressed_bytes / data_compressed_bytes, 1) AS ratio
FROM system.columns
WHERE table = 'github_events';

You'll see ratios like 10-50× on text columns (LowCardinality does dictionary encoding), and lossless compression on integers. That compression is half the speed: less disk to read, more data fits in CPU cache.

Explain a Query

EXPLAIN PLAN
SELECT repo_name, count() FROM github_events
WHERE event_type = 'PushEvent' GROUP BY repo_name LIMIT 100;

Note the projections, the parallel reads, the filter pushed down. Pulling 10M rows over a slow scan isn't happening — ClickHouse reads the indexed event_type column first and skips entire blocks that don't match.

Path B: DuckDB In-Process

DuckDB is "SQLite for analytics": an in-process database with no server, columnar storage, and SQL that rivals commercial warehouses for small-to-medium data (< 100GB).

brew install duckdb
duckdb mywarehouse.db

Read Parquet Directly from S3

DuckDB can query files in place — no load step:

INSTALL httpfs;
LOAD httpfs;

SELECT
  COUNT(*) AS total_pickups,
  AVG(fare_amount) AS avg_fare
FROM 's3://nyc-tlc/trip data/yellow_tripdata_2024-01.parquet';

On a laptop, this runs in seconds against a multi-GB file directly on S3, without downloading or loading it.

Aggregations on Local Parquet

-- Pull pickups by hour of day
SELECT
  EXTRACT(HOUR FROM tpep_pickup_datetime) AS hour,
  COUNT(*) AS trips,
  AVG(fare_amount) AS avg_fare
FROM 's3://nyc-tlc/trip data/yellow_tripdata_2024-01.parquet'
GROUP BY hour ORDER BY hour;

Loading and Persisting

-- Persist as a DuckDB table for repeated queries
CREATE TABLE trips AS
SELECT * FROM 's3://nyc-tlc/trip data/yellow_tripdata_2024-01.parquet';

-- Now it's local; queries get faster
SELECT * FROM trips LIMIT 10;

DuckDB writes its own columnar format on disk, similar speed to ClickHouse on moderate datasets. The killer feature is zero infrastructure: a single binary, embedded in Python/R/JS as a library.

DuckDB in Python (notebooks)

import duckdb
import pandas as pd

# Read a DataFrame as a view
df = pd.read_csv('events.csv')
duckdb.sql("SELECT region, SUM(revenue) FROM df GROUP BY region").df()

# Or query parquet directly
duckdb.sql("""
SELECT region, SUM(revenue)
FROM read_parquet('s3://bucket/events/*.parquet')
WHERE event_time > '2026-01-01'
GROUP BY region
""").df()

This is production-grade analytics in a notebook with no warehouse to spin up.

Path C: BigQuery / Snowflake (Sketch)

If you have access to a cloud warehouse, the experience is similar — give it credentials and start querying. The differences:

BigQuery

# gcloud auth, then:
bq query --use_legacy_sql=false '
SELECT name, SUM(number) AS total
FROM `bigquery-public-data.usa_names.usa_1910_2013`
WHERE state = "CA"
GROUP BY name ORDER BY total DESC LIMIT 10
'
  • Serverless: no instances; pay per byte scanned
  • Public datasets: a huge corpus to learn on
  • SQL flavor: standard SQL with some extensions; UDFs in JavaScript/Python

Snowflake

-- Snowflake worksheet
USE WAREHOUSE COMPUTE_WH;
USE DATABASE SNOWFLAKE_SAMPLE_DATA;

SELECT C_CUSTKEY, COUNT(*) AS orders, SUM(O_TOTALPRICE) AS revenue
FROM TPCH_SF10.CUSTOMER c JOIN TPCH_SF10.ORDERS o ON c.C_CUSTKEY = o.O_CUSTKEY
WHERE O_ORDERDATE > '1995-01-01'
GROUP BY C_CUSTKEY ORDER BY revenue DESC LIMIT 10;
  • Virtual warehouses: compute clusters you size and start/stop
  • Time travel: query the table as of any moment in the last N days
  • Marketplace: external datasets pluggable in seconds

Both: SQL ergonomics, warehouse-sized data, serverless-ish cost models. Try the free tiers (Snowflake gives credits; BigQuery has a generous free tier).

A Mini Pipeline

Tie it together: source data, load to warehouse, transform with dbt.

# Local stack
pipx install dbt-duckdb

mkdir analytics && cd analytics
dbt init my_project --adapter duckdb

profiles.yml:

my_project:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: 'mywarehouse.db'
      threads: 4

models/daily_trips.sql:

SELECT
  DATE(tpep_pickup_datetime) AS day,
  COUNT(*) AS trips,
  AVG(fare_amount) AS avg_fare
FROM 's3://nyc-tlc/trip data/yellow_tripdata_2024-01.parquet'
GROUP BY day
dbt run        # builds the model
dbt test       # runs data quality tests
dbt docs serve # generates documentation site

A working analytics pipeline in 5 minutes. Scale this to Snowflake or BigQuery and the model files are the same; just change the adapter.

Cleanup

docker rm -f clickhouse
rm -f mywarehouse.db events.csv

What's Next

  • Patterns — star schemas, SCD types, dbt models, partitioning, materialized views, CDC
  • Best Practices — cost control, query optimization, data quality, governance, lakehouse vs warehouse

On this page