Partition Class: Learn Data Partitioning by Doing It

Partitioning is one of those concepts that every data engineer needs to understand but few resources teach well. Most explanations stop at “split your data into chunks.” They skip the hard parts: how to pick a key, what skew does to parallelism, and why the wrong strategy can make queries slower, not faster.

I built Partition Class to fill that gap — a five-lesson interactive course that runs entirely in the browser.

Why Interactive

The problem with partitioning docs is that they’re abstract. “Choose a high-cardinality column” means nothing until you’ve seen what happens when you partition by status (4 distinct values, massive skew) versus user_id (millions of values, uniform distribution). Reading about it doesn’t build the intuition. Watching it happen in real time does.

Every lesson in Partition Class includes live visualizations. You adjust sliders, pick partition keys, simulate distributions, and see the results immediately: bar charts updating, skew meters shifting, pipeline bottlenecks appearing. By the time you finish the course, you’ll have developed a feel for what good partitioning looks like.

The Five Lessons

Lesson 1: What Are Partitions? — starts from zero. You see a table split into chunks, watch parallel processing in action, and learn why partition pruning skips irrelevant data. Interactive sliders let you control the number of rows and partitions to see the tradeoffs.
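To make partition pruning concrete, here is a minimal Python sketch. It is an illustration only, not the course's code: the helper names (partition_by, count_with_pruning, count_full_scan) and the toy date column are assumptions made for this example. It compares how many rows a filter has to touch with and without partitions.

```python
from collections import defaultdict


def partition_by(rows, key):
    """Group rows into partitions, one per distinct value of `key`."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    return dict(partitions)


def count_full_scan(rows, key, wanted):
    """Count matching rows by reading every row. Returns (matches, rows_scanned)."""
    matches = sum(1 for row in rows if row[key] == wanted)
    return matches, len(rows)


def count_with_pruning(partitions, wanted):
    """Count matching rows by reading only the partition for `wanted`.

    Every other partition is skipped without being opened.
    Returns (matches, rows_scanned).
    """
    rows = partitions.get(wanted, [])
    return len(rows), len(rows)


# Toy table (hypothetical): 1,000 rows spread evenly over four dates.
rows = [{"id": i, "date": f"2024-01-{(i % 4) + 1:02d}"} for i in range(1000)]
partitions = partition_by(rows, "date")

full = count_full_scan(rows, "date", "2024-01-02")
pruned = count_with_pruning(partitions, "2024-01-02")
print("full scan:", full)
print("pruned:   ", pruned)
```

Both calls return the same count, but the pruned version reads a quarter of the rows. The saving only appears because the filter column is also the partition key, and that dependency is what the later lessons build on.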

Lesson 2: Keys & Distribution — the core decision. You pick from different partition keys (country, date, user_id, status, hour, device) and watch how each one distributes data across buckets. The lesson generates realistic, weighted data — US traffic at 28%, mobile at 58%, HTTP 200 at 72% — so the distributions look like production.
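A rough Python sketch of the idea, assuming a simple weighted sampler: only the 58% mobile share comes from the lesson, while the desktop and tablet weights, the function names, and the seed are placeholders chosen for illustration.

```python
import random
from collections import Counter

# Only "mobile at 58%" comes from the lesson; the other weights are assumed.
DEVICE_WEIGHTS = {"mobile": 0.58, "desktop": 0.34, "tablet": 0.08}


def generate_keys(n, weights, seed=42):
    """Draw n key values according to the given weights (seeded for repeatability)."""
    rng = random.Random(seed)
    values = list(weights)
    return rng.choices(values, weights=[weights[v] for v in values], k=n)


def partition_sizes(keys):
    """Partitioning by the raw key value yields one partition per distinct value."""
    return Counter(keys)


sizes = partition_sizes(generate_keys(100_000, DEVICE_WEIGHTS))
for device, count in sizes.most_common():
    print(f"{device:8s} {count:7d} rows  {count / 100_000:.1%}")
```

With weights like these, partitioning by device produces three partitions and the largest one holds more than half the table, so at most three workers can run and one of them does most of the work.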

Lesson 3: Understanding Skew — where things get interesting. You learn the coefficient of variation (CV = stddev / mean) as a concrete metric for partition imbalance. A Zipf distribution simulator lets you crank up skew and watch the pipeline bottleneck — one hot partition blocking everything else.
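The metric is easy to compute yourself. The sketch below is an approximation written for this post, not the simulator's code. It assumes the population standard deviation (the lesson only says stddev / mean) and models Zipf skew as partition sizes proportional to 1 / rank^s, where s is a hypothetical skew knob. The function names are illustrative.

```python
import statistics


def coefficient_of_variation(sizes):
    """CV = stddev / mean, using the population standard deviation (an assumption)."""
    return statistics.pstdev(sizes) / statistics.fmean(sizes)


def zipf_partition_sizes(total_rows, num_partitions, s):
    """Partition sizes proportional to 1 / rank**s. s = 0 gives a uniform split."""
    weights = [1 / (rank ** s) for rank in range(1, num_partitions + 1)]
    scale = total_rows / sum(weights)
    return [w * scale for w in weights]


def bottleneck_factor(sizes):
    """How much larger the hottest partition is than the average one.

    If each partition is processed by its own worker, the job finishes only
    when the largest partition does, so this approximates the slowdown
    relative to a perfectly even split.
    """
    return max(sizes) / statistics.fmean(sizes)


for s in (0.0, 0.5, 1.0, 1.5):
    sizes = zipf_partition_sizes(1_000_000, 8, s)
    print(f"s={s:.1f}  CV={coefficient_of_variation(sizes):.2f}  "
          f"hot partition = {bottleneck_factor(sizes):.2f}x the mean")
```

A CV of 0 means every partition is the same size. As s grows, both the CV and the hot-partition factor climb, which is the bottleneck the simulator makes visible.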

Lesson 4: Strategies & Trade-offs — covers the four main approaches: natural (value-based), hash (modulo), range (time or numeric bounds), and composite. Each has tradeoffs around query patterns, data locality, and operational complexity.
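As a sketch under stated assumptions, each of the four strategies can be reduced to a function that maps a row or value to a partition id. The function names, the CRC32 hash, and the date-plus-user_id composite are choices made for this example, not necessarily what the course uses.

```python
import bisect
import zlib


def natural_partition(value):
    """Natural: the key value itself names the partition."""
    return value


def hash_partition(value, num_partitions):
    """Hash: stable hash of the value, modulo the partition count.

    zlib.crc32 is used because Python's built-in hash() for strings
    changes between interpreter runs.
    """
    return zlib.crc32(str(value).encode("utf-8")) % num_partitions


def range_partition(value, upper_bounds):
    """Range: index of the first bound greater than value.

    `upper_bounds` must be sorted and each bound is exclusive. Values at or
    past the last bound fall into one extra overflow partition.
    """
    return bisect.bisect_right(upper_bounds, value)


def composite_partition(row, num_partitions):
    """Composite: natural partition on date, then hash on user_id within it."""
    return (row["date"], hash_partition(row["user_id"], num_partitions))


row = {"date": "2024-01-15", "user_id": 918273, "amount": 42}
print(natural_partition(row["date"]))
print(hash_partition(row["user_id"], 8))
print(range_partition(row["amount"], [10, 100, 1000]))
print(composite_partition(row, 8))
```

The shape of each function hints at its trade-off: natural and range partitions keep related values together, which helps pruning, while hashing spreads values evenly at the cost of locality.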

Lesson 5: Real-World Practice — the hands-on finale. Upload your own CSV or Parquet file, analyze its partition distribution, run benchmarks across different strategies (hash, range, direct), and execute custom SQL queries. This is where theory meets your actual data.

Each lesson ends with a quiz that tracks your score and progress.

DuckDB in the Browser

The entire analysis engine runs client-side with DuckDB-WASM. When you upload a file in Lesson 5, it never leaves your browser. DuckDB handles the partitioning logic — hash functions with modulo for uniform distribution, range bounds for time-series data, dense_rank for categorical columns. The benchmarking measures per-partition query latency so you can see exactly where the bottleneck sits.
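For readers who think in code rather than SQL, here is a pure-Python stand-in for two of those pieces. It does not call DuckDB and is not the app's implementation. It assumes dense_rank means the usual SQL behavior (consecutive ranks with no gaps, ordered by value), and the names and timing loop are illustrative.

```python
import time


def dense_rank(values):
    """Map each distinct value to a 1-based rank with no gaps, ordered by value.

    Mirrors the behavior of SQL DENSE_RANK() OVER (ORDER BY value), which
    turns a categorical column into compact integer partition ids.
    """
    ranks = {v: i for i, v in enumerate(sorted(set(values)), start=1)}
    return [ranks[v] for v in values]


def time_partitions(partitions, query):
    """Run `query` once per partition and return {partition_id: seconds}."""
    timings = {}
    for partition_id, rows in partitions.items():
        start = time.perf_counter()
        query(rows)
        timings[partition_id] = time.perf_counter() - start
    return timings


countries = ["US", "BR", "US", "DE", "US", "BR"]
ids = dense_rank(countries)
print(ids)

# Group row indexes by partition id, then time a trivial query on each group.
partitions = {}
for index, partition_id in enumerate(ids):
    partitions.setdefault(partition_id, []).append(index)

timings = time_partitions(partitions, sum)
slowest = max(timings, key=timings.get)
print("slowest partition:", slowest)
```

On data this small the timings are noise. The point is the shape of the measurement: one latency per partition, so the slowest partition stands out instead of being averaged away.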

The SQL editor uses CodeMirror with syntax highlighting and autocomplete, so you can run arbitrary queries against your data. It’s the same approach I used in Duck-UI and Bloom Filters: push compute to the browser, keep data local, eliminate the server.

The Stack

Svelte 5 for the UI, DuckDB-WASM for the query engine, Tailwind for styling, CodeMirror for the SQL editor. No backend. The entire app is a static build served from a CDN.

The data generation uses weighted distributions so the data feels realistic rather than uniformly random. Country distributions follow real-world internet traffic patterns. Time distributions peak during working hours. Status code distributions match typical API traffic. These details matter: partitioning uniformly random data teaches you nothing about production.

Why I Built This

I’ve spent years working with ClickHouse, DuckDB, and other analytical databases where partitioning is a first-class concern. The gap in query performance between a well-partitioned table and a poorly partitioned one can reach 100x. But I’ve seen teams ship partition schemes based on blog posts they half-read, then wonder why their queries are slow.

Partition Class is the resource I wish existed when I was learning this. Not a blog post, not a documentation page — a course where you touch the data, break things, and build intuition through experience.

Try it at partition.caioricciuti.com