Data modeling is a process that involves identifying the entities (items to be stored) and the relationships between entities. To create your data model, identify the patterns used to access data and the types of queries to be performed. These two ideas inform the organization and structure of the data, and the design and creation of the database’s tables.
Keyspaces, tables, rows and columns
Cassandra keyspaces are a collection of tables. They are analogous to SQL namespaces. Typically, each application creates all its tables in one keyspaces.
A table is a collection of data. A keyspace most often contains one or more tables. Each table is identified by a name. Tables have a set of columns and contain records (rows) with data. Tables can be created, dropped, and altered at runtime without blocking updates and queries.
Each table contains multiple rows of data. A row is a set of columns that is uniquely identifiable among all of the other rows.
Each row is composed of one or more columns. An column is a fundamental data element, and does not need to be broken down any further.
As the example of a
users table which holds information about users of a service.
|1001||Sherlock||Holmes||221b Baker St, London, UK|
|1003||Clark||Kent||344 Clinton Street, Metropolis|
Note the following about the
- Each row in the table has a unique value for the primary key column (
- Other than the primary key, the
userstable has three other columns -
addresseach of which is a string.
- Some columns may have no data (for example, James Bond’s address
addressis unknown). These have
nullvalues in the database.
Now consider another example of the
books table that keeps tracks of authors and the books they have written.
|Charles Dickens||Oliver Twist||9.25||1837||serial novel|
|Charles Dickens||A Tale of Two Cities||11.40||1859||historical novel|
Note the following about the
- The primary key for this table consists of two columns -
book_title. Each row in the table must have values for these two attributes, and the combination of these values must be unique.
- Other than the primary key, the table has other columns such as
- The columns
priceis a float,
yearis an integer.
When creating a table, the primary key of the table must be specified in addition to the table name. The primary key uniquely identifies each row in the table, therefore no two rows can have the same key.
There are two types of primary keys, and they are described below.
Partition key columns (required)
Such tables have simple primary keys. One or more columns of a table can be made the partition key columns. The values of the partition key columns are used to compute an internal hash value. This hash value determines the tablet (or partition) in which the row will be stored. This has two implications:
Each unique set of partition key values is hashed and distributed across nodes randomly to ensure uniform utilization of the cluster.
All the data for a unique set of partition key values are always stored on the same node. This matters only if there are clustering key columns, which are described in the next section.
In the case of the
users table, we can make
user_id column the only primary key column. This is a good choice for a partition key because our queries do not case about the order of the user ids. If the table is split into a number of tablets (partitions), the data may be assigned as follows.
|tablet-22||1001||Sherlock||Holmes||221b Baker St, London, UK|
|tablet-4||1003||Clark||Kent||344 Clinton Street, Metropolis|
Clustering key columns (optional)
The clustering columns specify the order in which the column data is sorted and stored on disk for a given unique partition key value. More than one clustering column can be specified, and the columns are sorted in order they are declared in the clustering column. It is also possible to control the sort order (ascending or descending sort) for these columns. Note that the sort order respects the data type.
In a table that has both partition keys and clustering key, it is possible for two rows to have the same partition key value and therefore they end up on the same node. However, those rows must have different clustering key values in order to satisfy the primary key requirements.
In the case of the
author is a good partition key and
book_title is a good clustering key. Such a data model would allow easily listing all the books for a given author, as well as look up details of a specific book. This would cause the data to be stored as follows.
|tablet-21||Charles Dickens||A Tale of Two Cities||11.40||1859||historical novel|
|tablet-21||Charles Dickens||Oliver Twist||9.25||1837||serial novel|
Note that if we had made both
book_title partition key columns, we would not be able to list all the books for a given author efficiently.
The partition key columns are also often referred to as its hash columns. This is because an internal hash function is used to distributes data items across tablets based on their partition key values.
The clustering key columns are also referred to as its range columns. This is because rows with the same partition key are stored on disk in sorted order by the clustering key value.