Distributed data modeling

This page documents a preview version (v2.23).
Preview includes features under active development and is for development and testing only.
For production, use the latest stable version (v2024.1).

Data modeling is the process of defining the structure, organization, and relationships of data in a database. In a distributed SQL database, this process matters even more because data distribution, replication, and consistency add complexity. To fully leverage the benefits offered by YugabyteDB, you need to approach data modeling with a distributed mindset, balancing theoretical principles against practical considerations such as how your data is sharded, keyed, indexed, and partitioned.

Organization

In YugabyteDB, data is stored as rows and columns in tables; tables are organized under schemas and databases.

To understand how to create and manage tables, schemas, and databases, see Schemas and tables.

Sharding

In YugabyteDB, table data is split into tablets and distributed across multiple nodes in the cluster. Applications can connect to any node to store and retrieve data. Because reads and writes can span multiple nodes, it's crucial to consider how table data is sharded and distributed when modeling your data. To design your tables and indexes for fast retrieval and storage in YugabyteDB, you first need to understand the two data distribution schemes: hash sharding and range sharding.
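The difference between the two schemes can be sketched in a few lines of Python. This is an illustration only: the tablet count, the split points, and the use of `zlib.crc32` with a modulo are assumptions made for the example, not YugabyteDB's actual hash function or tablet mapping.

```python
import bisect
import zlib

NUM_TABLETS = 4  # assumed tablet count for this sketch


def hash_tablet(key, num_tablets=NUM_TABLETS):
    """Pick a tablet from a hash of the key (illustrative hash only)."""
    return zlib.crc32(key.encode("utf-8")) % num_tablets


def range_tablet(key, split_points):
    """Pick a tablet by comparing the key against sorted split points."""
    return bisect.bisect_right(split_points, key)


keys = [f"user{n:03d}" for n in range(12)]
split_points = ["user003", "user006", "user009"]  # hypothetical boundaries

for key in keys:
    # Hash sharding scatters neighboring keys; range sharding keeps them together.
    print(key, "hash ->", hash_tablet(key), "range ->", range_tablet(key, split_points))
```

In the range column, neighboring keys stay on the same tablet and tablet numbers never decrease as the key grows, which is what makes range scans cheap. The hash column trades that ordering for a more even spread of keys.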

To learn more about data distribution schemes, see Configurable data sharding.

Primary keys

The primary key is the unique identifier for each row in the table. The distribution and ordering of table data depend on the primary key.
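To see why ordering by primary key matters, consider the following minimal sketch. It assumes a hypothetical table keyed by `(device_id, reading_time)` and models storage as a sorted list of keys; the names and values are invented for illustration.

```python
import bisect

# Hypothetical rows keyed by (device_id, reading_time); values are readings.
rows = {
    ("dev2", 30): 21.5,
    ("dev1", 20): 19.0,
    ("dev1", 10): 18.5,
    ("dev2", 10): 20.0,
}

# Keeping rows sorted on the primary key places related rows next to each other.
sorted_keys = sorted(rows)


def scan_prefix(device_id):
    """Return all keys for one device as a single contiguous slice."""
    lo = bisect.bisect_left(sorted_keys, (device_id,))
    hi = bisect.bisect_left(sorted_keys, (device_id, float("inf")))
    return sorted_keys[lo:hi]


print(scan_prefix("dev1"))
```

Because the leading key column groups the rows, fetching every reading for one device touches one contiguous run of keys rather than the whole table.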

To design optimal primary keys for your tables, see Primary keys.

Secondary indexes

Indexes provide alternate access patterns for queries that do not involve the primary key of the table. With an index on the queried columns, such queries can locate matching rows directly instead of scanning the entire table.
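Conceptually, a secondary index is a second lookup structure that maps an indexed column value back to primary keys. The sketch below assumes a hypothetical `users` table and a `city` column, and uses plain dictionaries rather than any database API.

```python
from collections import defaultdict

# Hypothetical table keyed by primary key (user id).
users = {
    1: {"name": "Ana", "city": "Lima"},
    2: {"name": "Ben", "city": "Oslo"},
    3: {"name": "Cy", "city": "Lima"},
}


def find_by_city_scan(city):
    """Without an index: inspect every row in the table."""
    return sorted(pk for pk, row in users.items() if row["city"] == city)


# A secondary index maps each indexed column value to matching primary keys.
city_index = defaultdict(list)
for pk, row in users.items():
    city_index[row["city"]].append(pk)


def find_by_city_index(city):
    """With an index: one lookup yields the primary keys to fetch."""
    return sorted(city_index.get(city, []))


print(find_by_city_index("Lima"))
```

Both functions return the same keys, but the indexed version does not need to read rows that do not match. The cost is that the index must be updated on every write to the table.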

To design optimal indexes for faster lookup, see Secondary indexes.

Hot shards

In distributed systems, a hot spot or hot shard refers to a node that is overloaded with queries because it receives a disproportionate share of traffic compared to other nodes in the cluster.
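One common way a hot shard arises is writing monotonically increasing keys, such as timestamps, into a range-sharded table. The following sketch simulates that case under assumed values: four tablets, hypothetical split points at 1000, 2000, and 3000, and `zlib.crc32` as a stand-in hash.

```python
import zlib
from collections import Counter

NUM_TABLETS = 4  # assumed tablet count for this sketch


def range_tablet(ts):
    """Range sharding with hypothetical split points at 1000, 2000, 3000."""
    return min(ts // 1000, NUM_TABLETS - 1)


def hash_tablet(ts):
    """Hash sharding with an illustrative hash (not YugabyteDB's)."""
    return zlib.crc32(str(ts).encode("utf-8")) % NUM_TABLETS


# New writes carry ever-increasing keys, for example timestamps.
new_writes = range(5000, 5200)

range_load = Counter(range_tablet(ts) for ts in new_writes)
hash_load = Counter(hash_tablet(ts) for ts in new_writes)

print("range-sharded writes per tablet:", dict(range_load))
print("hash-sharded writes per tablet:", dict(hash_load))
```

With range sharding, every new write lands on the last tablet, while hashing the same keys spreads the writes over several tablets.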

To understand the hot shard problem and how to address it, see Hot shards.

Table partitioning

When the data in a table keeps growing, you can partition the table for better performance and easier data management. Partitioning also makes it easier to remove older data, because you can drop whole partitions. In YugabyteDB, you can also use partitioning with tablespaces to improve latency in multi-region scenarios and to comply with data residency laws like GDPR.
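The idea of dropping old data by partition can be sketched as follows. The example assumes a hypothetical orders table partitioned by month and models each partition as a list; the function names are invented for illustration and are not a database API.

```python
from collections import defaultdict
from datetime import date

# Hypothetical orders table, partitioned by (year, month) of the order date.
partitions = defaultdict(list)


def partition_key(order_date):
    """Map a row to its partition by month."""
    return (order_date.year, order_date.month)


def insert(order_id, order_date):
    """Route the row to the partition that owns its date."""
    partitions[partition_key(order_date)].append((order_id, order_date))


def drop_partitions_before(cutoff):
    """Drop whole partitions older than the cutoff month; return rows removed."""
    removed = 0
    for key in [k for k in partitions if k < partition_key(cutoff)]:
        removed += len(partitions.pop(key))
    return removed


insert(1, date(2023, 11, 5))
insert(2, date(2023, 12, 9))
insert(3, date(2024, 1, 15))
insert(4, date(2024, 1, 20))
```

Calling `drop_partitions_before(date(2024, 1, 1))` discards the two 2023 partitions in one step each, without examining individual rows in the partitions that remain.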

To understand partitioning in YugabyteDB, see Table partitioning.