Join us on
Star us on
Get Started
Slack
GitHub
Get Started
v2.0 (latest version) v1.3 (earlier version) v1.2 (earlier version) v1.1 (earlier version) v1.0 (earlier version)
  • GET STARTED
    • Quick Start
      • 1. Install YugabyteDB
      • 2. Create Local Cluster
      • 3. Explore YSQL
      • 4. Build an app
        • Java
        • NodeJS
        • Go
        • Python
        • Ruby
        • C#
        • PHP
        • C++
        • C
    • Introduction
    • Explore Core
      • 1. Linear Scalability
      • 2. Fault Tolerance
      • 3. Global Distribution
      • 4. Auto Sharding
      • 5. Tunable Reads
      • 6. Observability
  • USER GUIDES
    • Develop
      • Learn App Dev
        • 1. SQL vs NoSQL
        • 2. Data Modeling
        • 3. Data Types
        • 4. ACID Transactions
        • 5. Aggregations
        • 6. Batch Operations
      • Ecosystem Integrations
        • Apache Kafka
        • Apache Spark
        • JanusGraph
        • KairosDB
        • Presto
        • Metabase
      • Real World Examples
        • E-Commerce App
        • IoT Fleet Management
        • Retail Analytics
      • Explore Sample Apps
    • Deploy
      • Checklist
      • Manual Deployment
        • 1. System Configuration
        • 2. Install Software
        • 3. Start YB-Masters
        • 4. Start YB-TServers
        • 5. Verify Deployment
      • Kubernetes
        • Helm Chart
        • Helm Configuration
        • Local SSD
      • Docker
      • Public Clouds
        • Amazon Web Services
        • Google Cloud Platform
        • Microsoft Azure
      • Pivotal Cloud Foundry
      • Enterprise Edition
        • 1. Prepare Cloud Env
        • 2. Install Admin Console
        • 3. Configure Admin Console
        • 4. Configure Cloud Providers
    • Benchmark
      • Performance
      • YCSB
      • Large Datasets
    • Secure
      • Security Checklist
      • Authentication
      • Authorization
        • 1. RBAC Model
        • 2. Create Roles
        • 3. Grant Permissions
      • TLS Encryption
        • 1. Prepare Nodes
        • 2. Server-Server Encryption
        • 3. Client-Server Encryption
        • 4. Connect to Cluster
    • Manage
      • Backup and Restore
        • Backing Up Data
        • Restoring Data
      • Data Migration
        • Bulk Import
        • Bulk Export
      • Change Cluster Config
      • Upgrade Deployment
      • Diagnostics Reporting
      • Enterprise Edition
        • Create Universe - Multi-Zone
        • Create Universe - Multi-Region
        • Edit Universe
        • Edit Config Flags
        • Health Checking and Alerts
        • Create/Edit Instance Tags
        • Node Status & Actions
        • Read Replicas
        • Backup & Restore
        • Upgrade Universe
        • Delete Universe
    • Troubleshoot
      • Troubleshooting Overview
      • Cluster Level Issues
        • YCQL Connection Issues
        • YEDIS Connection Issues
      • Node Level Issues
        • Check Processes
        • Inspect Logs
        • System Stats
      • Enterprise Edition
        • Troubleshoot Universes
  • REFERENCE
    • APIs
      • YSQL
        • Commands
          • ABORT
          • ALTER DOMAIN
          • ALTER TABLE
          • BEGIN TRANSACTION
          • COMMIT
          • COPY
          • CREATE DATABASE
          • CREATE DOMAIN
          • CREATE INDEX
          • CREATE SCHEMA
          • CREATE SEQUENCE
          • CREATE TABLE
          • CREATE TABLE AS
          • CREATE USER
          • CREATE VIEW
          • DEALLOCATE
          • DELETE
          • DROP DATABASE
          • DROP DOMAIN
          • DROP SEQUENCE
          • DROP TABLE
          • END TRANSACTION
          • EXECUTE
          • EXPLAIN
          • INSERT
          • PREPARE
          • RESET
          • ROLLBACK
          • SELECT
          • SET
          • SET TRANSACTION
          • SHOW
          • SHOW TRANSACTION
          • TRUNCATE
          • UPDATE
        • Datatypes
          • Binary
          • Boolean
          • Character
          • Date-time
          • Json
          • Money
          • Numeric
          • Serial
          • UUID
        • Expressions
          • currval()
          • lastval()
          • nextval()
        • Keywords
        • Reserved Names
      • YCQL
        • Quick Start YCQL
        • ALTER KEYSPACE
        • ALTER ROLE
        • ALTER TABLE
        • CREATE INDEX
        • CREATE KEYSPACE
        • CREATE ROLE
        • CREATE TABLE
        • CREATE TYPE
        • DROP INDEX
        • DROP KEYSPACE
        • DROP ROLE
        • DROP TABLE
        • DROP TYPE
        • GRANT PERMISSION
        • GRANT ROLE
        • REVOKE PERMISSION
        • REVOKE ROLE
        • USE
        • INSERT
        • SELECT
        • UPDATE
        • DELETE
        • TRANSACTION
        • TRUNCATE
        • Simple Value
        • Subscript
        • Function Call
        • Operator Call
        • BLOB
        • BOOLEAN
        • MAP, SET, LIST
        • FROZEN
        • INET
        • Integer & Counter
        • Non-Integer
        • TEXT
        • Date & Time Types
        • UUID & TIMEUUID
        • JSONB
        • Date & Time Functions
    • CLIs
      • yb-ctl
      • yb-docker-ctl
      • yb-master
      • yb-tserver
      • ysqlsh
      • cqlsh
    • Tools
      • TablePlus
  • RELEASES
    • Release History
      • v1.2.12
      • v1.2.11
      • v1.2.10
      • v1.2.9
      • v1.2.8
      • v1.2.6
      • v1.2.5
      • v1.2.4
  • CONCEPTS
    • Architecture
      • Design Goals
      • Layered Architecture
      • Basic Concepts
        • Universe
        • YB-TServer
        • YB-Master
        • Acknowledgements
      • Query Layer
        • Overview
      • DocDB Store
        • Sharding
        • Replication
        • Persistence
        • Performance
      • DocDB Transactions
        • Isolation Levels
        • Single Row Transactions
        • Distributed Transactions
        • Transactional IO Path
  • FAQ
    • Comparisons
      • CockroachDB
      • Google Cloud Spanner
      • MongoDB
      • FoundationDB
      • Amazon DynamoDB
      • Azure Cosmos DB
      • Apache Cassandra
      • Redis In-Memory Store
      • Apache HBase
    • Other FAQs
      • Product
      • Architecture
      • Enterprise Edition
      • API Compatibility
  • CONTRIBUTOR GUIDES
    • Get Involved
  • Misc
    • YEDIS
      • Quick Start
      • Develop
        • Client Drivers
          • C
          • C++
          • C#
          • Go
          • Java
          • NodeJS
          • Python
      • API Reference
        • APPEND
        • AUTH
        • CONFIG
        • CREATEDB
        • DELETEDB
        • LISTDB
        • SELECT
        • DEL
        • ECHO
        • EXISTS
        • EXPIRE
        • EXPIREAT
        • FLUSHALL
        • FLUSHDB
        • GET
        • GETRANGE
        • GETSET
        • HDEL
        • HEXISTS
        • HGET
        • HGETALL
        • HINCRBY
        • HKEYS
        • HLEN
        • HMGET
        • HMSET
        • HSET
        • HSTRLEN
        • HVALS
        • INCR
        • INCRBY
        • KEYS
        • MONITOR
        • PEXPIRE
        • PEXPIREAT
        • PTTL
        • ROLE
        • SADD
        • SCARD
        • RENAME
        • SET
        • SETEX
        • PSETEX
        • SETRANGE
        • SISMEMBER
        • SMEMBERS
        • SREM
        • STRLEN
        • ZRANGE
        • TSADD
        • TSCARD
        • TSGET
        • TSLASTN
        • TSRANGEBYTIME
        • TSREM
        • TSREVRANGEBYTIME
        • TTL
        • ZADD
        • ZCARD
        • ZRANGEBYSCORE
        • ZREM
        • ZREVRANGE
        • ZSCORE
        • PUBSUB
        • PUBLISH
        • SUBSCRIBE
        • UNSUBSCRIBE
        • PSUBSCRIBE
        • PUNSUBSCRIBE
> Architecture > DocDB Store >

Making DocDB High-Performance

Attention

This page documents an earlier version. Go to the latest (v2.0)version.

    • Enhancements to RocksDB
      • Efficiently model documents
      • Raft vs RocksDB WAL logs
      • MVCC at a higher layer
      • Backups/Snapshots
    • Data Model Aware Bloom Filters
    • Range Query Optimizations
    • Efficient Memory Usage
      • Server-global block cache
      • Server-global memstore limits
    • Scan-resistant block cache

DocDB uses a highly customized version of RocksDB, a log-structured merge tree (LSM) based key-value store.

DocDB Document Storage Layer

A number of enhancements or customizations were done to RocksDB in order to achieve scale and performance. These are described below.

Enhancements to RocksDB

Efficiently model documents

To implement a flexible data model---such as a row (comprising of columns), or collections types (such as list, map, set) with arbitrary nesting---on top of a key-value store, but, more importantly, to implement efficient operations on this data model such as:

  • fine-grained updates to a part of the row or collection without incurring a read-modify-write penalty of the entire row or collection
  • deleting/overwriting a row or collection/object at an arbitrary nesting level without incurring a read penalty to determine what specific set of KVs need to be deleted
  • enforcing row/object level TTL based expiry:

A tighter coupling into the “read/compaction” layers of the underlying KV store (RocksDB) is needed. We use RocksDB as an append-only store and operations such as row or collection delete are modeled as an insert of a special “delete marker”. This allows deleting an entire subdocument efficiently by just adding one key/value pair to RocksDB. Read hooks automatically recognize these markers and suppress expired data. Expired values within the subdocument are cleaned up/garbage collected by our customized compaction hooks.

Raft vs RocksDB WAL logs

DocDB uses Raft for replication. Changes to the distributed system are already recorded/journalled as part of Raft logs. When a change is accepted by a majority of peers, it is applied to each tablet peer’s DocDB, but the additional WAL mechanism in RocksDB (under DocDB) is unnecessary and adds overhead. For correctness, in addition to disabling the WAL mechanism in RocksDB, Yugabyte tracks the Raft “sequence id” up to which data has been flushed from RocksDB’s memtables to SSTable files. This ensures that we can correctly garbage collect the Raft WAL logs as well as replay the minimal number of records from Raft WAL logs on a server crash or restart.

MVCC at a higher layer

Multi-version concurrency control (MVCC) in DocDB is done at a higher layer, and does not use the MVCC mechanism of RocksDB.

The mutations to records in the system are versioned using hybrid-timestamps maintained at the YBase layer. As a result, the notion of MVCC as implemented in a vanilla RocksDB (using sequence ids) is not necessary and only adds overhead. Yugabyte does not use RocksDB’s sequence ids, and instead uses hybrid-timestamps that are part of the encoded key to implement MVCC.

Backups/Snapshots

These need to be higher level operations that take into consideration data in DocDB as well as in the Raft logs to get a consistent cut of the state of the system

Data Model Aware Bloom Filters

The keys stored by DocDB in RocksDB consist of a number of components, where the first component is a “document key”, followed by a few scalar components, and finally followed by a timestamp (sorted in reverse order).

The bloom filter needs to be aware of what components of the key need be added to the bloom so that only the relevant SSTable files in the LSM store are looked up during a read operation.

In a traditional KV store, range scans do not make use of bloom filters because exact keys that fall in the range are unknown. However, we have implemented a data-model aware bloom filter, where range scans within keys that share the same hash component can also benefit from bloom filters. For example, a scan to get all the columns within row or all the elements of a collection can also benefit from bloom filters.

Range Query Optimizations

The ordered (or range) components of the compound-keys in DocDB frequently have a natural order. For example, it may be an int that represents a message id (for a messaging like application) or a timestamp (for a IoT/Timeseries like use case). See example below. By keeping hints with each SSTable file in the LSM store about the min/max values for these components of the “key”, range queries can intelligently prune away the lookup of irrelevant SSTable files during the read operation.

SELECT message_txt
  FROM messages
WHERE user_id = 17
  AND message_id > 50
  AND message_id < 100;

Or, in the context of a time-series application:

SELECT metric_value
  FROM metrics
WHERE metric_name = ’system.cpu’
  AND metric_timestamp < ?
  AND metric_timestamp > ?

Efficient Memory Usage

There are two instances where memory usage across components in a manner that is global to the server yields benefits, as described below.

Server-global block cache

A shared block cache is used across the DocDB/RocksDB instances of all the tablets hosted by a YB-TServer. This maximizes the use of memory resources, and avoids creating silos of cache that each need to be sized accurately for different user tables.

Server-global memstore limits

While per-memstore flush sizes can be configured, in practice, because the number of memstores may change over time as users create new tables, or tablets of a table move between servers, we have enhanced the storage engine to enforce a global memstore threshold. When such a threshold is reached, selection of which memstore to flush takes into account what memstores carry the oldest records (determined using hybrid timestamps) and therefore are holding up Raft logs and preventing them from being garbage collected.

Scan-resistant block cache

We have enhanced RocksDB’s block cache to be scan resistant. The motivation was to prevent operations such as long-running scans (e.g., due to an occasional large query or background Spark jobs) from polluting the entire cache with poor quality data and wiping out useful/hot data.

    • Enhancements to RocksDB
      • Efficiently model documents
      • Raft vs RocksDB WAL logs
      • MVCC at a higher layer
      • Backups/Snapshots
    • Data Model Aware Bloom Filters
    • Range Query Optimizations
    • Efficient Memory Usage
      • Server-global block cache
      • Server-global memstore limits
    • Scan-resistant block cache
Architecture
Persistence on top of RocksDB
DocDB Transactions
Isolation Levels
Talk to Community
  • Slack
  • Github
  • Forum
  • StackOverflow
Yugabyte
Contact Us
Copyright © 2017-2019 Yugabyte, Inc. All rights reserved.