What is table partitioning?

Question

What is table partitioning?

Accepted Answer

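**Partitioning** splits a large table into smaller physical pieces (**partitions**) based on the value of a column, while the table still appears to queries as a single logical table. It improves the performance and manageability of very large tables by letting the database scan or maintain only the relevant partitions.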

## How partitioning works

```sql
-- a huge `orders` table partitioned BY RANGE on the order date
CREATE TABLE orders (id INT, order_date DATE, amount DECIMAL)
PARTITION BY RANGE (order_date);

CREATE TABLE orders_2023 PARTITION OF orders
  FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
CREATE TABLE orders_2024 PARTITION OF orders
  FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
-- orders is ONE logical table, but stored as separate per-year partitions
```

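The table is a single entity logically but is stored physically in multiple partitions. Inserts are routed automatically to the correct partition, and queries are directed to the partitions that can hold matching rows.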
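
To see the routing in action, here is a minimal sketch. It assumes PostgreSQL 11 or later (the syntax above matches PostgreSQL declarative partitioning, though the answer does not name a database) and the `orders` table and partitions defined above. The sample rows and the `orders_default` partition are made up for illustration.

```sql
-- Assumes PostgreSQL 11+ and the orders / orders_2023 / orders_2024 tables above.
-- Sample rows are illustrative only.
INSERT INTO orders (id, order_date, amount) VALUES
  (1, '2023-03-15', 40.00),
  (2, '2024-07-01', 99.50);

-- tableoid::regclass reports the physical table each row is stored in
SELECT tableoid::regclass AS stored_in, id, order_date
FROM orders
ORDER BY id;

-- A row whose date matches no partition is rejected unless a DEFAULT partition exists
-- (orders_default is a hypothetical name)
CREATE TABLE orders_default PARTITION OF orders DEFAULT;
INSERT INTO orders (id, order_date, amount) VALUES (3, '2026-01-10', 12.00);
```

The `SELECT` should report `orders_2023` for row 1 and `orders_2024` for row 2, even though both rows were inserted through the parent table.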

## Key benefit: partition pruning

```sql
-- a query filtered by the partition key only scans RELEVANT partitions
SELECT * FROM orders WHERE order_date >= '2024-06-01';
-- → the database skips orders_2023 entirely, scanning only orders_2024 ("partition pruning")
```

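**Partition pruning** is the main performance gain: a query that filters on the partition key scans only the relevant partitions and skips the rest, turning a scan over billions of rows into a scan over just the relevant subset.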
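
One way to check whether pruning applies is to inspect the query plan. The sketch below assumes PostgreSQL 11 or later and that `orders` has only the two yearly partitions defined earlier; exact plan output varies by version, so none is shown.

```sql
-- Assumes PostgreSQL 11+ and the orders table defined earlier.
-- Pruning is on by default; the setting is shown only to make it visible.
SET enable_partition_pruning = on;

-- Filter is on the partition key: the plan should reference only orders_2024
EXPLAIN
SELECT * FROM orders WHERE order_date >= '2024-06-01';

-- No filter on order_date: the planner has no basis for skipping any partition
EXPLAIN
SELECT * FROM orders WHERE amount > 100;
```

Comparing the two plans shows why the partition key should match the column that queries filter on most often.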

## Partitioning strategies

```text
RANGE → by a value range (dates, numeric ranges) — common for time-series data
LIST  → by a list of discrete values (region, category)
HASH  → by a hash of the key (even distribution across partitions)
```
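
RANGE is shown in the `orders` example above. For completeness, here is a sketch of the other two strategies, again assuming PostgreSQL 11 or later. The `customers` and `events` tables, their columns, and the region codes are hypothetical.

```sql
-- LIST: each partition holds an explicit set of discrete values
CREATE TABLE customers (id INT, region TEXT, name TEXT)
PARTITION BY LIST (region);

CREATE TABLE customers_emea PARTITION OF customers
  FOR VALUES IN ('EU', 'UK');
CREATE TABLE customers_amer PARTITION OF customers
  FOR VALUES IN ('US', 'CA');

-- HASH: rows are spread over a fixed number of partitions by a hash of the key
CREATE TABLE events (id BIGINT, payload TEXT)
PARTITION BY HASH (id);

CREATE TABLE events_p0 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE events_p1 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE events_p2 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE events_p3 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 3);
```

HASH balances data and write load, but range filters on the key generally cannot be pruned, so it suits lookups by exact key rather than time-window queries.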

## Other benefits

```text
✓ Maintenance: drop/archive old data by dropping a whole partition (near-instant vs a row-by-row DELETE)
  e.g. DROP TABLE orders_2023 is far faster than DELETE FROM orders WHERE order_date < '2024-01-01'
✓ Smaller indexes per partition; parallel operations across partitions
✓ Manage huge tables in manageable pieces
```
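
The maintenance pattern above might look like this in practice. This is a sketch assuming PostgreSQL 11 or later and the `orders` table from the first example; the `orders_2025` partition and the index name are hypothetical.

```sql
-- Retire a whole year: detach first if the data should be archived, then drop.
ALTER TABLE orders DETACH PARTITION orders_2023;  -- orders_2023 becomes a standalone table
-- (dump or move orders_2023 here if it needs to be kept)
DROP TABLE orders_2023;

-- Create the next period ahead of time so new rows have a target partition
CREATE TABLE orders_2025 PARTITION OF orders
  FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');

-- An index created on the parent is created on every partition,
-- so each partition gets its own, smaller index
CREATE INDEX orders_order_date_idx ON orders (order_date);
```

Detaching and dropping avoid the row-by-row work and the dead-row cleanup that a large `DELETE` leaves behind.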

## When to use partitioning

```text
✓ VERY LARGE tables (hundreds of millions / billions of rows) where queries filter
  on a natural partition key (often time/date for logs, events, orders)
✓ Time-series data where you regularly archive/drop old data
✗ Small/medium tables — partitioning adds complexity with little benefit
  (proper indexing usually suffices)
→ Partitioning complements indexing for VERY large tables; it's not a substitute.
```
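
Before deciding, it can help to check how large the candidate tables really are. The query below is one possible way to do that, assuming PostgreSQL; it reads the system catalogs, and the row counts are planner estimates rather than exact figures.

```sql
-- Lists the ten largest ordinary tables with approximate row counts.
-- reltuples is an estimate refreshed by ANALYZE / VACUUM, not an exact count.
SELECT c.relname,
       c.reltuples::bigint AS approx_rows,
       pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_total_relation_size(c.oid) DESC
LIMIT 10;
```

A table that appears here with row counts in the hundreds of millions and an obvious date column is a reasonable candidate; smaller tables are usually better served by indexes alone.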

## Why this matters

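Table partitioning is an important technique for managing and querying **very large tables**, and understanding it is valuable advanced knowledge for large-scale database work.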

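When a table grows to hundreds of millions or billions of rows, even indexed queries and maintenance operations become slow; partitioning addresses this by splitting the table into smaller physical pieces.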

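The key performance benefit is **partition pruning**: queries that filter on the partition key scan only the relevant partitions and skip the rest, which greatly reduces the amount of data scanned in very large tables (especially time-series data filtered by date).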

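Beyond query performance, partitioning offers major **maintenance benefits**: archiving or removing old data becomes a near-instant operation (dropping a whole partition instead of running a slow, expensive `DELETE` over millions of rows, a significant operational win for log and event data with retention policies), along with smaller per-partition indexes and parallel operations.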

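Understanding the **strategies** (RANGE for date or numeric ranges, common for time series; LIST for discrete values; HASH for even distribution) and, crucially, **when to partition** (very large tables with a natural partition key, especially time series with regular archiving, and *not* small or medium tables, where indexing is enough and partitioning only adds complexity) reflects mature judgment.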

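Managing very large tables is a real challenge in large-scale systems, and partitioning, with its pruning and efficient maintenance, is the standard tool for it. It complements indexing rather than replacing it. That makes table partitioning valuable advanced knowledge for database scalability, especially in large-scale, time-series, or high-volume data systems, and a topic that demonstrates an understanding of database management at scale.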