Partition data by time

Partition data for efficient data management

Partitioning refers to splitting what is logically one large table into smaller physical pieces. The key advantage of partitioning in YugabyteDB is that, because each partition is a separate table, it is efficient to keep the most significant (for example, most recent) data in one partition, and not-so-important data in other partitions so that they can be dropped easily.

The following example describes the advantages of partitions in more detail.

Note

Partitioning is only available in YSQL.

Setup

To set up a local universe, refer to Set up a local YugabyteDB universe.

Setup

To set up a cluster, refer to Set up a YugabyteDB Managed cluster.

Setup

To set up a universe, refer to Set up a YugabyteDB Anywhere universe.

Consider a scenario where you have a lot of data points from cars and you care about only the last month's data. Although you can execute a statement to delete data that is older than 30 days, because data is not immediately removed from the underlying storage (LSM-based DocDB), it could affect the performance of scans.

Create a table with an example schema as follows:

CREATE TABLE part_demo (
    ts timestamp,/* time at which the event was generated */
    car varchar, /* name of the car */
    speed int,   /* speed of your car */
    PRIMARY KEY(car HASH, ts ASC)
) PARTITION BY RANGE (ts);

Create partitions for each month. Also, create a DEFAULT partition for data that does not fall into any of the other partitions.

CREATE TABLE part_7_23 PARTITION OF part_demo
    FOR VALUES FROM ('2023-07-01') TO ('2023-08-01');

CREATE TABLE part_8_23 PARTITION OF part_demo
    FOR VALUES FROM ('2023-08-01') TO ('2023-09-01');

CREATE TABLE part_9_23 PARTITION OF part_demo
    FOR VALUES FROM ('2023-09-01') TO ('2023-10-01');

CREATE TABLE def_part_demo PARTITION OF part_demo DEFAULT;

Insert some data into the main table part_demo:

INSERT INTO part_demo (ts, car, speed)
    (SELECT '2023-07-01 00:00:00'::timestamp +
        make_interval(secs=>id, months=>((random()*2)::int)),
        'car-' || ceil(random()*2), ceil(random()*60)
        FROM generate_series(1,100) AS id);

If you retrieve the rows from the respective partitions, notice that they have the rows for their respective date ranges. For example:

SELECT * FROM part_9_23 LIMIT 4;

         ts          |  car  | speed
---------------------+-------+-------
 2023-09-01 00:00:04 | car-2 |    45
 2023-09-01 00:00:05 | car-2 |    38
 2023-09-01 00:00:08 | car-2 |    49
 2023-09-01 00:00:23 | car-2 |    33

Fetch data

Although data is stored in different tables as partitions, to access all the data, you just need to query the parent table. Take a look at the query plan for a select query as follows:

EXPLAIN ANALYZE SELECT * FROM part_demo;

                                  QUERY PLAN
-------------------------------------------------------------------------------
 Append (actual time=1.085..5.351 rows=100 loops=1)
   ->  Seq Scan on public.part_7_23 (actual time=1.079..2.431 rows=25 loops=1)
         Output: part_7_23.ts, part_7_23.car, part_7_23.speed
   ->  Seq Scan on public.part_8_23 (actual time=0.665..1.555 rows=47 loops=1)
         Output: part_8_23.ts, part_8_23.car, part_8_23.speed
   ->  Seq Scan on public.part_9_23 (actual time=0.648..1.342 rows=28 loops=1)
         Output: part_9_23.ts, part_9_23.car, part_9_23.speed
 Planning Time: 0.105 ms
 Execution Time: 5.434 ms
 Peak Memory Usage: 19 kB

When querying the parent table, the child partitions are automatically queried.

Fetch data in a time range

As the data is split based on time, when querying for a specific time range, the query executor fetches data only from the partition that the data is expected to be in. For example, see the query plan for fetching data for a specific month:

EXPLAIN ANALYZE SELECT * FROM part_demo WHERE ts > '2023-07-01' AND ts < '2023-08-01';

                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 Append (actual time=2.288..2.310 rows=25 loops=1)
   ->  Seq Scan on public.part_7_23 (actual time=2.285..2.301 rows=25 loops=1)
         Output: part_7_23.ts, part_7_23.car, part_7_23.speed
         Storage Filter: ((part_7_23.ts > '2023-07-01 00:00:00'::timestamp without time zone)
                AND (part_7_23.ts < '2023-08-01 00:00:00'::timestamp without time zone))
 Planning Time: 0.309 ms
 Execution Time: 2.411 ms
 Peak Memory Usage: 14 kB

You can see that the planner has chosen only one partition to fetch the data from.

Dropping older data

The key advantage of data partitioning is to drop older data easily. To drop the older data, all you need to do is drop that particular partition table. For example, when month 7's data is not needed, do the following:

DROP TABLE part_7_23;

DROP TABLE
Time: 103.214 ms

Learn more

Table partitioning