All Articles
On this page
Data Engineering

Amazon Redshift Distribution vs Partitioning: Why Redshift Has No PARTITION BY (and What to Use Instead)

C
Chandan Kumar
Founder, beCloudReady
May 17, 20269 min read
Amazon Redshift Distribution vs Partitioning: Why Redshift Has No PARTITION BY (and What to Use Instead)

If you're learning AWS Redshift after PostgreSQL or Hive, the missing PARTITION BY clause is confusing. Here's how DISTKEY, SORTKEY, and zone maps replace partitioning in an AWS data lake and data warehouse — with side-by-side SQL examples.

If you come to Amazon Redshift from PostgreSQL, MySQL, Oracle, or BigQuery, one of the first things that trips you up is this:

CREATE TABLE orders (...)
PARTITION BY RANGE (order_date);   -- ❌ Syntax error in Redshift

Redshift doesn't support PARTITION BY. And it's not an oversight — AWS calls this out explicitly in the Redshift docs:

"Amazon Redshift does not require or support the concept of partitioning data within database objects. There's no need to partition databases or tables."

That sentence confuses a lot of people learning AWS data lake and data warehouse patterns for the first time. If partitioning is the standard tool for making big tables fast, how can a modern cloud data warehouse just… not have it?

The short answer: Redshift solves the same two problems partitioning solves, but with two different mechanisms. Once you see what those two problems actually are, the whole Redshift design — DISTKEY, SORTKEY, slices, zone maps — clicks into place.

This post is for engineers, data analysts, and bootcamp learners trying to make that mental leap.


The Two Problems Partitioning Tries to Solve

Whenever you reach for PARTITION BY in another database, you're really trying to fix one (or both) of these:

  1. Parallelism — "I have 200M rows. I want N workers to scan their slice of the table at the same time."
  2. Pruning — "My query only touches order_date >= '2025-01-01'. I don't want to read the other 364 days off disk."

Most engines bundle both jobs into the partition definition. You write PARTITION BY RANGE (order_date) and that single clause decides which physical chunk a row lives in and which chunks the planner can skip.

Amazon Redshift unbundles them.

ProblemPostgres / Hive / BigQuery answerAmazon Redshift answer
Parallelism — spread work across workersPartitions + partition-aware scansDISTKEY (or AUTO / EVEN / ALL) places rows on slices
Pruning — skip data you don't need to readPartition pruning by partition columnSORTKEY + zone maps skip 1MB blocks whose min/max don't match the predicate

Two problems, two knobs. They're independent — you can pick a DISTKEY for join performance and a totally unrelated SORTKEY for filter performance, and they won't interfere with each other.

Diagram: Two problems, two mechanisms — Redshift uses DISTKEY for parallelism and SORTKEY plus zone maps for pruning, while Postgres and Hive use PARTITION BY for both


How Other Database Systems Compare

If you're moving between cloud data warehouses and lakehouses, this comparison is worth pinning to the wall:

SystemHas partitions?Primary parallelism mechanismPrimary pruning mechanism
PostgreSQLYes — declarative PARTITION BYPartition-wise scansPartition pruning
Hive / Spark on S3Yes — Hive-style folder partitionsOne task per partition fileFolder-level partition pruning
Google BigQueryYes — partitioned + clustered tablesSlot-based; partitions helpPartition pruning + clustering
SnowflakeNo (uses micro-partitions, automatic)Micro-partition fan-outMicro-partition metadata
Amazon RedshiftNoDISTKEY → slicesSORTKEY + zone maps

Snowflake and Redshift are actually closer to each other than either is to Postgres: both hide the physical layout from you and ask you to express intent (distribute by this, sort by that) instead of layout (put rows in this file). This is a defining trait of modern cloud-native data warehouses.


Side-by-Side Example: PostgreSQL Partitioning vs Redshift DISTKEY + SORTKEY

Take a classic orders table. You want fast filters by date and fast joins to customers.

PostgreSQL — partition the table

CREATE TABLE orders (
    order_id    BIGINT,
    customer_id BIGINT,
    order_date  DATE,
    amount      NUMERIC(10,2)
)
PARTITION BY RANGE (order_date);
 
CREATE TABLE orders_2024 PARTITION OF orders
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
 
CREATE TABLE orders_2025 PARTITION OF orders
    FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');
 
CREATE INDEX ON orders (customer_id);

Result: physical files split by year. Filtering by order_date prunes whole files. Joins on customer_id use a B-tree index.

Amazon Redshift — distribute and sort

CREATE TABLE orders (
    order_id    BIGINT,
    customer_id BIGINT,
    order_date  DATE,
    amount      NUMERIC(10,2)
)
DISTKEY (customer_id)      -- rows with the same customer_id co-locate on one slice
SORTKEY (order_date);      -- within each slice, rows are stored in date order

Result: rows hashed across slices by customer_id (so a join to customers DISTKEY(customer_id) is a local join, no shuffle). Within every slice, rows sit in date order, and zone maps on each 1MB block let the planner skip blocks whose date range doesn't match.

Side-by-side diagram comparing PostgreSQL PARTITION BY RANGE on order_date with Amazon Redshift DISTKEY on customer_id and SORTKEY on order_date, showing partition slabs vs distributed slices with sorted blocks

Notice what we got:

  • Same parallelism story as partitioning — work is spread across N workers — but the unit is a slice, not a partition file.
  • Same pruning story as partitioning — irrelevant data isn't read — but the unit is a 1MB block, not a partition.
  • Plus something partitioning can't easily give you: co-located joins. You only get this in Postgres if both tables are partitioned identically and the planner cooperates.

DISTKEY vs SORTKEY vs Partitioning — A Detailed Comparison

AspectPartitioning (Postgres / Hive / BigQuery)Redshift Distribution + Sort
Declared in DDL?Yes — PARTITION BY ...Yes — DISTKEY(...) + SORTKEY(...)
GranularityCoarse — usually one partition per day/month/regionFine — 1MB blocks, automatic
Storage layoutOne file/segment per partitionOne column file per slice; blocks within
Pruning mechanismPlanner reads partition metadata, skips filesPlanner reads zone map (min/max per block), skips blocks
Parallelism unitPartitionSlice (fixed per node, e.g. 2 per ra3.xlplus)
Maintenance burdenAdd new partitions, drop old ones, repartition on schema changeNone — sort order maintained by VACUUM; AUTO can choose for you
Join co-locationOnly if both tables partitioned the same wayYes — same DISTKEY = local join, no network shuffle
Goes wrong when…Wrong partition column, or too many/few partitionsWrong DISTKEY (skew) or wrong SORTKEY (no pruning)

The maintenance row is the one most engineers undersell. With Hive-style partitions on S3, picking the wrong partition column at table-creation time is a multi-day rebuild. With Redshift, you can ALTER TABLE ... ALTER DISTKEY or ALTER SORTKEY and let the system reorganize in the background. This is one of the underrated advantages of a managed cloud data warehouse.


The Redshift Mental Model in One Sentence

Distribution decides which machine a row lives on. Sort key + zone maps decide which blocks the engine bothers to read. Partitioning conflates the two into a single, coarser knob.

If you internalize that one line, every Redshift performance decision gets easier:

  • "My join is slow." → Look at DISTKEY. Are the two tables co-located? Is there data skew across slices?
  • "My filter is slow." → Look at SORTKEY. Does it match the predicate? Has ANALYZE been run recently?
  • "One slice is doing all the work." → Bad DISTKEY. You probably hashed on a low-cardinality column.
  • "Zone maps aren't helping." → SORTKEY doesn't match your filter columns, or the table needs a VACUUM.

In Postgres, all of those symptoms have a single suspect: "did I partition this right?" In Redshift, the suspects are split — and that's a feature, not a bug.


Where Partitioning Does Show Up in the AWS Data Lake World

There's one big exception worth knowing: external tables in Redshift Spectrum, Athena, and AWS Glue. When Redshift reads Parquet files on Amazon S3, it uses Hive-style folder partitions:

s3://lake/orders/year=2025/month=03/day=14/part-00000.parquet
CREATE EXTERNAL TABLE spectrum.orders (
    order_id    BIGINT,
    customer_id BIGINT,
    amount      DECIMAL(10,2)
)
PARTITIONED BY (year INT, month INT, day INT)
STORED AS PARQUET
LOCATION 's3://lake/orders/';

This is Hive-style partition pruning, and it works exactly like it does in Apache Spark or Athena. The reason it exists here and not for native Redshift tables is that S3 has no zone maps — the only metadata available is the folder path. Partitioning is how you bolt pruning onto a filesystem that doesn't have it natively.

So the full picture for engineers building on AWS:

  • Native Redshift tables on cluster storage → no partitions. Use DISTKEY + SORTKEY.
  • External tables on S3 (Spectrum, Athena, Glue) → partitions, because that's the only pruning mechanism Parquet-on-S3 offers. (Iceberg metadata is starting to change this, but partitioning is still the cheap, universally supported path.)

Diagram comparing native Amazon Redshift table storage with zone maps and slices versus external Parquet tables on Amazon S3 using Hive-style folder partitions for pruning with Redshift Spectrum

This is the same split you see in a typical modern AWS data lake architecture: hot, low-latency analytics on Redshift native storage; cheap, scalable historical data on S3 via Spectrum.


Practical Takeaway for Engineers Migrating to Redshift

If you're moving a workload from PostgreSQL, MySQL, or Hive to Amazon Redshift, the migration question isn't "what's the equivalent partition?" — it's "what are the two things my partition column was doing for me?" Then map each one separately:

  • The "spread work across workers" job → DISTKEY (usually your largest join key).
  • The "skip irrelevant rows" job → SORTKEY (usually your most common filter column — often a date or timestamp).

Most of the time, those are different columns. That's the part partitioning hides from you, and the part Redshift makes you confront. Once you see them as separate decisions, you'll write better DDL on every engine you touch — including the ones that still call them partitions.

If you're learning Redshift as part of a broader AWS data lake or data engineering path, this is one of the highest-leverage mental models you can build early. Get this right, and the rest of the Redshift performance toolkit — VACUUM, ANALYZE, AUTO distribution, materialized views, concurrency scaling — becomes a much shorter list to learn.


Further Reading

If you're working through Redshift and AWS data lake concepts as part of a structured learning path, our bootcamp programs cover these design decisions hands-on with real datasets — including the trade-offs between DISTKEY choices, sort key strategies, and when to push workloads into Spectrum vs keep them on native storage.

Amazon RedshiftAWS Data LakeData WarehouseDISTKEYSORTKEYZone MapsPostgreSQLHiveRedshift SpectrumData EngineeringAWS Tutorial