Getting Started
Run ClickHouse locally, load a public dataset, explore columnar speed, try DuckDB for in-process analytics
Getting Started
This page picks two warehouses to learn the model: ClickHouse (self-hosted, fastest OSS columnar engine) and DuckDB (in-process, instantly available). Both teach the same lessons: columnar storage, vectorized queries, compression, and SQL ergonomics.
Path A: ClickHouse Locally
docker run -d --name clickhouse \
-p 8123:8123 -p 9000:9000 \
--ulimit nofile=262144:262144 \
clickhouse/clickhouse-server:24.8
# Connect
docker exec -it clickhouse clickhouse-clientLoad a Public Dataset
The classic: GitHub events, a billion rows of public events from the GH API.
-- Inside clickhouse-client
CREATE TABLE github_events (
file_time DateTime,
event_type Enum8('CommitCommentEvent' = 1, 'CreateEvent' = 2, 'DeleteEvent' = 3,
'ForkEvent' = 4, 'GollumEvent' = 5, 'IssueCommentEvent' = 6,
'IssuesEvent' = 7, 'MemberEvent' = 8, 'PublicEvent' = 9,
'PullRequestEvent' = 10, 'PullRequestReviewCommentEvent' = 11,
'PushEvent' = 12, 'ReleaseEvent' = 13, 'SponsorshipEvent' = 14,
'WatchEvent' = 15, 'GistEvent' = 16, 'FollowEvent' = 17,
'DownloadEvent' = 18, 'PullRequestReviewEvent' = 19,
'ForkApplyEvent' = 20, 'Event' = 21, 'TeamAddEvent' = 22),
actor_login LowCardinality(String),
repo_name LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY (event_type, repo_name, file_time);Load a chunk (one day's worth, ~10M rows):
INSERT INTO github_events
SELECT *
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/github_events/2024-01-01.parquet',
'Parquet',
'file_time DateTime, event_type LowCardinality(String), actor_login LowCardinality(String), repo_name LowCardinality(String)');(For learning, even a smaller CSV works — clickhouse-client --query "INSERT INTO github_events FORMAT CSV" < your.csv.)
Run Analytical Queries
-- Top 10 most-active repos by pull request count
SELECT repo_name, count() AS prs
FROM github_events
WHERE event_type = 'PullRequestEvent'
GROUP BY repo_name
ORDER BY prs DESC
LIMIT 10;
-- Time: < 100ms on 10M rows-- Hourly event volume by type
SELECT toHour(file_time) AS hour, event_type, count() AS events
FROM github_events
GROUP BY hour, event_type
ORDER BY hour, events DESC;-- Compare: how slow is the same query on row-oriented storage?
-- Postgres would take 10-30s on the same data. ClickHouse: ~150ms.See What's Going On
-- How much disk is the table using?
SELECT
formatReadableSize(sum(bytes_on_disk)) AS size,
sum(rows) AS rows,
formatReadableSize(sum(bytes_on_disk) / sum(rows)) AS bytes_per_row
FROM system.parts
WHERE table = 'github_events';
-- Compression ratio per column
SELECT
name,
formatReadableSize(data_compressed_bytes) AS compressed,
formatReadableSize(data_uncompressed_bytes) AS uncompressed,
round(data_uncompressed_bytes / data_compressed_bytes, 1) AS ratio
FROM system.columns
WHERE table = 'github_events';You'll see ratios like 10-50× on text columns (LowCardinality does dictionary encoding), and lossless compression on integers. That compression is half the speed: less disk to read, more data fits in CPU cache.
Explain a Query
EXPLAIN PLAN
SELECT repo_name, count() FROM github_events
WHERE event_type = 'PushEvent' GROUP BY repo_name LIMIT 100;Note the projections, the parallel reads, the filter pushed down. Pulling 10M rows over a slow scan isn't happening — ClickHouse reads the indexed event_type column first and skips entire blocks that don't match.
Path B: DuckDB In-Process
DuckDB is "SQLite for analytics": an in-process database with no server, columnar storage, and SQL that rivals commercial warehouses for small-to-medium data (< 100GB).
brew install duckdb
duckdb mywarehouse.dbRead Parquet Directly from S3
DuckDB can query files in place — no load step:
INSTALL httpfs;
LOAD httpfs;
SELECT
COUNT(*) AS total_pickups,
AVG(fare_amount) AS avg_fare
FROM 's3://nyc-tlc/trip data/yellow_tripdata_2024-01.parquet';On a laptop, this runs in seconds against a multi-GB file directly on S3, without downloading or loading it.
Aggregations on Local Parquet
-- Pull pickups by hour of day
SELECT
EXTRACT(HOUR FROM tpep_pickup_datetime) AS hour,
COUNT(*) AS trips,
AVG(fare_amount) AS avg_fare
FROM 's3://nyc-tlc/trip data/yellow_tripdata_2024-01.parquet'
GROUP BY hour ORDER BY hour;Loading and Persisting
-- Persist as a DuckDB table for repeated queries
CREATE TABLE trips AS
SELECT * FROM 's3://nyc-tlc/trip data/yellow_tripdata_2024-01.parquet';
-- Now it's local; queries get faster
SELECT * FROM trips LIMIT 10;DuckDB writes its own columnar format on disk, similar speed to ClickHouse on moderate datasets. The killer feature is zero infrastructure: a single binary, embedded in Python/R/JS as a library.
DuckDB in Python (notebooks)
import duckdb
import pandas as pd
# Read a DataFrame as a view
df = pd.read_csv('events.csv')
duckdb.sql("SELECT region, SUM(revenue) FROM df GROUP BY region").df()
# Or query parquet directly
duckdb.sql("""
SELECT region, SUM(revenue)
FROM read_parquet('s3://bucket/events/*.parquet')
WHERE event_time > '2026-01-01'
GROUP BY region
""").df()This is production-grade analytics in a notebook with no warehouse to spin up.
Path C: BigQuery / Snowflake (Sketch)
If you have access to a cloud warehouse, the experience is similar — give it credentials and start querying. The differences:
BigQuery
# gcloud auth, then:
bq query --use_legacy_sql=false '
SELECT name, SUM(number) AS total
FROM `bigquery-public-data.usa_names.usa_1910_2013`
WHERE state = "CA"
GROUP BY name ORDER BY total DESC LIMIT 10
'- Serverless: no instances; pay per byte scanned
- Public datasets: a huge corpus to learn on
- SQL flavor: standard SQL with some extensions; UDFs in JavaScript/Python
Snowflake
-- Snowflake worksheet
USE WAREHOUSE COMPUTE_WH;
USE DATABASE SNOWFLAKE_SAMPLE_DATA;
SELECT C_CUSTKEY, COUNT(*) AS orders, SUM(O_TOTALPRICE) AS revenue
FROM TPCH_SF10.CUSTOMER c JOIN TPCH_SF10.ORDERS o ON c.C_CUSTKEY = o.O_CUSTKEY
WHERE O_ORDERDATE > '1995-01-01'
GROUP BY C_CUSTKEY ORDER BY revenue DESC LIMIT 10;- Virtual warehouses: compute clusters you size and start/stop
- Time travel: query the table as of any moment in the last N days
- Marketplace: external datasets pluggable in seconds
Both: SQL ergonomics, warehouse-sized data, serverless-ish cost models. Try the free tiers (Snowflake gives credits; BigQuery has a generous free tier).
A Mini Pipeline
Tie it together: source data, load to warehouse, transform with dbt.
# Local stack
pipx install dbt-duckdb
mkdir analytics && cd analytics
dbt init my_project --adapter duckdbprofiles.yml:
my_project:
target: dev
outputs:
dev:
type: duckdb
path: 'mywarehouse.db'
threads: 4models/daily_trips.sql:
SELECT
DATE(tpep_pickup_datetime) AS day,
COUNT(*) AS trips,
AVG(fare_amount) AS avg_fare
FROM 's3://nyc-tlc/trip data/yellow_tripdata_2024-01.parquet'
GROUP BY daydbt run # builds the model
dbt test # runs data quality tests
dbt docs serve # generates documentation siteA working analytics pipeline in 5 minutes. Scale this to Snowflake or BigQuery and the model files are the same; just change the adapter.
Cleanup
docker rm -f clickhouse
rm -f mywarehouse.db events.csvWhat's Next
- Patterns — star schemas, SCD types, dbt models, partitioning, materialized views, CDC
- Best Practices — cost control, query optimization, data quality, governance, lakehouse vs warehouse