Change data capture (CDC) Beta

Change data capture (CDC) in YugabyteDB provides technology to ensure that any changes in data due to operations such as inserts, updates, and deletions are identified, captured, and automatically applied to another data repository instance, or made available for consumption by applications and other tools.

Use cases

CDC is useful in a number of scenarios.

Note

In this document, the terms data center, cluster, and universe are used interchangeably. The assumption is made that each YugabyteDB universe is deployed in a single data center.

Microservice-oriented architectures

Some microservices require a stream of changes to the data and using CDC in YugabyteDB can provide consumable data changes to CDC subscribers.

Asynchronous replication to remote systems

Remote systems may subscribe to a stream of data changes and then transform and consume the changes. Maintaining separate database instances for transactional and reporting purposes can be used to manage workload performance.

Multiple data center strategies

Maintaining multiple data centers enables enterprises to provide the following:

  • High availability (HA) — Redundant systems help ensure that your operations almost never fail.
  • Geo-redundancy — Geographically-dispersed servers provide resiliency against catastrophic events and natural disasters.

Compliance and auditing

Auditing and compliance requirements can require you to use CDC to maintain records of data changes.

Process architecture

The following diagram depicts the CDC process architecture:

r C e N D p o C l d i e ( S c S m t a 2 Y t e r t B o t e e - r a a d M e d m a s a s w s t i t C a m t e D ) e h r C t a R d a a f t t a N Y C o B D d - C e ( T S m S S 1 Y t e e e B o t r r - r a v v M e d e i a s a r c s t e t C a e D ) r C Y C B D - C N T o S S d ( e e e S m r r Y t e v v 3 B o t e i - r a r c M e d e a s a s t t C a e D ) r C C D C S e Y C r B D v - C i T c S S e e e r r i v v s e i r c s e t a t e l e s s

Every YB-TServer has a CDC service that is stateless. The main APIs provided by the CDC service are the following:

  • createCDCSDKStream API for creating the stream on the database.
  • getChangesCDCSDK API that can be used by the client to get the latest set of changes.

CDC streams

Creating a new CDC stream returns a stream UUID. This is facilitated via the yb-admin tool.

Debezium

To consume the events generated by CDC, Debezium is used as the connector. Debezium is an open-source distributed platform that needs to be pointed at the database using the stream ID. For information on how to set up Debezium for YugabyteDB CDC, see Debezium integration.

Pushing changes to external systems

Using the Debezium connector for YugabyteDB, changes are pushed from YugabyteDB to a Kafka topic, which can then be used by any end-user application for the processing and analysis of the records.

CDC guarantees

There is a number of guarantees that CDC makes.

Per-tablet ordered delivery guarantee

All data changes for one row or multiple rows in the same tablet are received in the order in which they occur. Due to the distributed nature of the problem, however, there is no guarantee for the order across tablets.

Consider the following scenario:

  • Two rows are being updated concurrently.
  • These two rows belong to different tablets.
  • The first row row #1 was updated at time t1, and the second row row #2 was updated at time t2.

In this case, it is possible for CDC to push the later update corresponding to row #2 change to Kafka before pushing the earlier update, corresponding to row #1.

At-least-once delivery

Updates for rows are pushed at least once. With the at-least-once delivery, you never lose a message, however the message might be delivered to a CDC consumer more than once. This can happen in case of a tablet leader change, where the old leader already pushed changes to Kafka, but the latest pushed op id was not updated in the CDC metadata.

For example, a CDC client has received changes for a row at times t1 and t3. It is possible for the client to receive those updates again.

No gaps in change stream

When you have received a change for a row for timestamp t, you do not receive a previously unseen change for that row from an earlier timestamp. This guarantees that receiving any change implies that all earlier changes have been received for a row.