Cache and storage subsystems

Monitor RocksDB storage subsystem metrics

Storage layer IOPS

DocDB uses a modified version of RocksDB (an LSM-based key-value store that consists of multiple logical levels, and data in each level are sorted by key) as the storage layer. This storage layer performs seek, next, and prev operations.

The following table describes key throughput and latency metrics for the storage (RocksDB) layer.

Metric Unit Type Description
rocksdb_number_db_next keys counter Whenever a tuple is read/updated from the database, a request is made to RocksDB key. Each database operation makes multiple requests to RocksDB. The number of NEXT operations performed to look up a key by RocksDB when a tuple is read/updated by the database.
rocksdb_number_db_prev keys counter The number of PREV operations performed to look up a key by RocksDB when a tuple is read/updated from the database.
rocksdb_number_db_seek keys counter The number of SEEK operations performed to look up a key by the RocksDB when a tuple is read/updated from the database. 
rocksdb_db_write_micros microseconds counter The time spent by RocksDB in microseconds to write data. 
rocksdb_db_get_micros microseconds counter The time spent by RocksDB in microseconds to retrieve data matching a value.
rocksdb_db_seek_micros microseconds counter The time spent by RocksDB in microseconds to retrieve data in a range query.

These metrics can be aggregated across the entire cluster using appropriate aggregations.

Block cache

When the data requested from YSQL layer is sitting in an SST File, it will be cached in RocksDb Block Cache. This is the fundamental cache that sits in RocksDB instead of the YSQL layer. A block requires multiple touches before it is added to the multi-touch (hot) portion of the cache.

The following table describes key cache metrics for the storage (RocksDB) layer.

Metric Unit Type Description
rocksdb_block_cache_hit blocks counter The total number of block cache hits (cache index + cache filter + cache data).
rocksdb_block_cache_miss blocks counter The total number of block cache misses (cache index + cache filter + cache data).
block_cache_single_touch_usage blocks counter Blocks of data cached and read once by the YSQL layer are classified in single touch portion of the cache. The size (in bytes) of the cache usage by blocks having a single touch.
block_cache_multi_touch_usage blocks counter Blocks of data cached and read more than once by the YSQL layer are classified in the multi-touch portion of the cache. The size (in bytes) of the cache usage by blocks having multiple touches.

These metrics can be aggregated across the entire cluster using appropriate aggregations.

Bloom filters

Bloom filters are hash tables used to determine if a given SSTable has the data for a query looking for a particular value.

Metric Unit Type Description
rocksdb_bloom_filter_checked blocks counter The number of times the bloom filter has been checked.
rocksdb_bloom_filter_useful blocks counter The number of times the bloom filter has avoided file reads (avoiding IOPS).

These metrics can be aggregated across the entire cluster using appropriate aggregations.

SST files

RocksDB LSM-trees buffer incoming data in a memory buffer that, when full, is sorted, and flushed to disk in the form of a sorted run. When a sorted run is flushed to disk, it may be iteratively merged with existing runs of the same size. Overall, as a result of such iterative merges, the sorted runs on disk (also called Sorted-String Table or SST files) form a collection of levels of exponentially increasing size with potentially overlapping key ranges across the levels.

Metric Unit Type Description
rocksdb_current_version_sst_files_size bytes counter The aggregate size of all SST files.
rocksdb_current_version_num_sst_files files counter The number of SST files.

These metrics can be aggregated across the entire cluster using appropriate aggregations.

Compaction

To make reads more performant over time, RocksDB periodically reduces the number of logical levels by running compaction (sorted-merge) on the SST files in the background, where part or multiple logical levels are merged into one. In other words, RocksDB uses compactions to balance write, space, and read amplifications.

A description of key metrics in this category is listed in the following table:

Metric Unit Type Description
rocksdb_compact_read_bytes bytes counter Number of bytes being read to do compaction.
rocksdb_compact_write_bytes bytes counter Number of bytes being written to do compaction.
rocksdb_compaction_times_micros microseconds counter Time for the compaction process to complete.
rocksdb_numfiles_in_singlecompaction files counter Number of files in any single compaction.

Memtable

Memtable is the first level of data storage where data is stored when you start inserting. It provides statistics about reading documents, which are essentially columns in the table. If a memtable is full, the existing memtable is made immutable and stored on disk as an SST file.

Memtable has statistics about reading documents, which essentially are columns in the table.

Metric Unit Type Description
rocksdb_memtable_compaction_micros microseconds counter Total time to compact a set of SST files.
rocksdb_memtable_hit keys counter Number of memtable hits.
rocksdb_memtable_miss keys counter Number of memtable misses.

These metrics are available per tablet and can be aggregated across the entire cluster using appropriate aggregations.

Write-Ahead-Logging (WAL)

The Write Ahead Log (or WAL) is used to write and persist updates to disk on each tablet. The following table describes metrics for observing the performance of the WAL component.

Metric Unit Type Description
log_sync_latency microseconds counter Time spent to flush (fsync) the WAL entries to disk.
log_append_latency microseconds counter Time spent on appending a batch of values to the WAL.
log_group_commit_latency microseconds counter Time spent on committing an entire group.
log_bytes_logged bytes counter Number of bytes written to the WAL after the tablet starts.
log_reader_bytes_read bytes counter Number of bytes read from WAL after the tablet start.

These metrics are available per tablet and can be aggregated across the entire cluster using appropriate aggregations.