Change data capture (CDC) Beta
Change data capture (CDC) is a process to capture changes made to data in the database and stream those changes to external processes, applications or other databases.
On this page
- Process architecture
- Debezium connector for YugabyteDB
- Java console client
- Tserver configuration
- Consistency semantics
- Performance impact
- yb-admin commands
- DDL command support
- Snapshot support
What is CDC?
The core primitive of CDC is the stream. Streams can be enabled/disabled on databases. Every change to a watched database table is emitted as a record in a configurable format to a configurable sink. Streams scale to any YugabyteDB cluster independent of its size and are designed to impact production traffic as little as possible.
- The database and its tables must be created using YugabyteDB version 2.13 or later.
- There should be a primary key on the table you want to stream the changes from.
Be aware of the following:
- You can't stream data out of system tables.
- You can't create a stream on a table created after you created the stream. For example, if you create a DB stream on the database and then create a new table in that database, you won't be able to stream data out of the new table. You need to create a new DB Stream ID and use it to stream data.
Note: The current YugabyteDB CDC implementation supports only Debezium and Kafka.
Warning: YugabyteDB doesn't yet support the DROP TABLE and TRUNCATE TABLE commands. The behavior of these commands while streaming data from CDC is undefined. If you need to drop or truncate a table, delete the stream ID using yb-admin. See also the limitations section.
Streams are the YugabyteDB endpoints from which applications, processes, and systems fetch database changes. Streams can be enabled or disabled on a per-namespace basis. Every change to a database table for which data is being streamed is emitted as a record to the stream, which is then propagated for consumption by applications; in this case to Debezium, and ultimately to Kafka.
To stream data, you must first create a DB stream. This stream is created at the database level, and can access the data in all of that database's tables.
Debezium connector for YugabyteDB
The Debezium connector for YugabyteDB pulls data from YugabyteDB and publishes it to Kafka. The following illustration explains the pipeline:
CDC Java console client
A Java console client for change data capture is also available. It is meant strictly for testing purposes and helps you understand which change records YugabyteDB emits.
There are several GFlags you can use to fine-tune YugabyteDB's CDC behavior. These flags are documented in the Change data capture flags section of the yb-tserver reference page.
Per-Tablet Ordered Delivery Guarantee
All changes for a row (or rows in the same tablet) will be received in the order in which they happened. However, due to the distributed nature of the problem, there is no guarantee of the order across tablets.
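To make this guarantee concrete, the following minimal sketch checks a batch of change records for per-tablet ordering. The record shape (`tablet_id`, `commit_time`, `payload`) is a hypothetical simplification chosen for illustration; it is not the format YugabyteDB actually emits.

```python
from typing import Dict, Iterable, NamedTuple


class ChangeRecord(NamedTuple):
    # Hypothetical, simplified record shape used only for illustration.
    tablet_id: str
    commit_time: int
    payload: str


def ordered_within_each_tablet(records: Iterable[ChangeRecord]) -> bool:
    """Return True if commit times never decrease within any single tablet."""
    last_seen: Dict[str, int] = {}
    for rec in records:
        if rec.commit_time < last_seen.get(rec.tablet_id, rec.commit_time):
            return False
        last_seen[rec.tablet_id] = rec.commit_time
    return True


stream = [
    ChangeRecord("tablet-A", 1, "insert k1"),
    ChangeRecord("tablet-B", 5, "insert k9"),
    ChangeRecord("tablet-A", 2, "update k1"),  # arrives after a newer tablet-B record
    ChangeRecord("tablet-B", 6, "update k9"),
]

per_tablet_ok = ordered_within_each_tablet(stream)
times = [r.commit_time for r in stream]
globally_sorted = times == sorted(times)
```

Here `per_tablet_ok` is `True` while `globally_sorted` is `False`: each tablet's records arrive in order, but the interleaving across tablets does not follow commit time. A consumer that needs a global order has to impose it itself.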
At-Least-Once Delivery
Updates for rows will be streamed at least once. Duplicates can occur if a Kafka Connect node fails: if the node pushes records to Kafka and crashes before committing the offset, it receives the same set of records again on restart.
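The failure mode described above can be reproduced with a toy model. This is a sketch only: `run_connector` is a hypothetical stand-in for a connector task, and plain Python lists stand in for the change stream and the Kafka topic.

```python
from typing import List


def run_connector(source: List[str], sink: List[str], committed_offset: int,
                  crash_before_commit: bool = False) -> int:
    """Publish everything after committed_offset, then commit.

    Returns the committed offset as it stands when the run ends.
    """
    batch = source[committed_offset:]
    sink.extend(batch)               # records reach the sink first
    if crash_before_commit:
        return committed_offset      # crash: the offset was never advanced
    return len(source)               # normal path: the offset commit succeeds


source = ["r1", "r2", "r3"]
sink: List[str] = []

# First run pushes the records but crashes before committing the offset.
offset = run_connector(source, sink, 0, crash_before_commit=True)
# After the restart, the connector resumes from the old offset.
offset = run_connector(source, sink, offset)
```

After both runs, `sink` holds `r1`, `r2`, `r3` twice. Nothing is lost, but every record is delivered a second time, so downstream consumers should be written to tolerate duplicates.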
No Gaps in Change Stream
After you have received a change for a row at some timestamp t, you won't receive a previously unseen change for that row at a lower timestamp. Receiving any change implies that you have received all older changes for that row.
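Combined with at-least-once delivery, this property means a consumer can discard any record whose timestamp is not newer than the last one it applied for the same row. The sketch below assumes a hypothetical `(row_key, timestamp, value)` tuple shape and assumes that every change to a given row carries a distinct timestamp; neither assumption comes from the YugabyteDB record format.

```python
from typing import Dict, Iterable, Tuple

Change = Tuple[str, int, str]  # (row_key, timestamp, value): hypothetical shape


def apply_changes(changes: Iterable[Change]) -> Dict[str, str]:
    """Apply changes idempotently, skipping replays of already-applied records."""
    state: Dict[str, str] = {}
    high_water: Dict[str, int] = {}  # newest timestamp applied per row
    for row_key, ts, value in changes:
        if ts <= high_water.get(row_key, -1):
            continue  # replayed record: its effect has already been applied
        state[row_key] = value
        high_water[row_key] = ts
    return state


delivered = [
    ("k1", 10, "a"),
    ("k1", 20, "b"),
    ("k1", 10, "a"),  # duplicate delivered again after a connector restart
    ("k2", 15, "x"),
]
final_state = apply_changes(delivered)
```

`final_state` ends as `{"k1": "b", "k2": "x"}`. Without the per-row high-water mark, the replayed record would have reverted `k1` to the stale value `"a"`.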
The change records for CDC are read from the WAL. The CDC module maintains an internal checkpoint for each stream ID and garbage collects WAL entries once they have been streamed to CDC clients.
If CDC is lagging or away for some time, disk usage may grow and cause YugabyteDB cluster instability. To avoid this, the WAL is garbage collected if a stream is inactive for a configured amount of time. This interval is configurable through a GFlag.
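The interaction between checkpoints and the inactivity timeout can be sketched as follows. This is a conceptual model only: `StreamState`, `min_wal_index_to_retain`, and the parameter names are hypothetical and do not reflect YugabyteDB's internal data structures or flag names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StreamState:
    # Hypothetical bookkeeping for one stream; not YugabyteDB's internal structure.
    checkpoint: int          # highest WAL index already streamed to the client
    last_active_secs: float  # time at which the client last polled


def min_wal_index_to_retain(streams: List[StreamState], now_secs: float,
                            inactivity_limit_secs: float, wal_end: int) -> int:
    """Entries below the returned index are eligible for garbage collection."""
    active = [s for s in streams
              if now_secs - s.last_active_secs <= inactivity_limit_secs]
    if not active:
        return wal_end  # no live consumer: CDC holds nothing back
    # Everything after the slowest active checkpoint must be kept.
    return min(s.checkpoint for s in active) + 1


streams = [
    StreamState(checkpoint=100, last_active_secs=990.0),  # polled 10 s ago
    StreamState(checkpoint=40, last_active_secs=100.0),   # silent for 900 s
]
retain_from = min_wal_index_to_retain(
    streams, now_secs=1000.0, inactivity_limit_secs=300.0, wal_end=500)
```

With a 300-second limit the stalled stream is ignored and `retain_from` is 101. If the limit were large enough to keep the stalled stream alive, retention would start at 41 instead, which illustrates how a lagging consumer increases disk usage.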
The commands used to manipulate CDC DB streams can be found in the yb-admin reference documentation.
DDL command support
Change data capture also supports schema changes to tables (for example, adding a default value to a column, adding a new column, or adding constraints to a column). When you run a DDL command, the schema is altered, and YugabyteDB emits a DDL record with the new schema values; after that, further records use the new schema format.
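From the consumer's side, this means a DDL record replaces the schema used to interpret every record that follows it. A minimal sketch, assuming hypothetical event shapes `("ddl", [column names])` and `("dml", [column values])` rather than the real record format:

```python
from typing import Dict, List, Tuple

# Hypothetical event shapes: ("ddl", [column names]) or ("dml", [column values]).
Event = Tuple[str, List[str]]


def decode(events: List[Event]) -> List[Dict[str, str]]:
    """Pair each DML record's values with the most recently announced schema."""
    schema: List[str] = []
    rows: List[Dict[str, str]] = []
    for kind, body in events:
        if kind == "ddl":
            schema = body  # later records use the new schema
        else:
            rows.append(dict(zip(schema, body)))
    return rows


events = [
    ("ddl", ["id", "name"]),
    ("dml", ["1", "alice"]),
    ("ddl", ["id", "name", "email"]),  # e.g. after a column is added
    ("dml", ["2", "bob", "bob@example.com"]),
]
rows = decode(events)
```

The first row is decoded with two columns and the second with three, because the second DDL record arrived between them.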
Initially, if you create a stream for a particular table that already contains some records, the stream takes a snapshot of the table, and streams all the data that resides in the table. After the snapshot of the whole table is completed, YugabyteDB starts streaming the changes that happen in the table.
The snapshot feature uses the cdc_snapshot_batch_size GFlag. Its default value is 250, which is the number of records included per batch in response to an internal call to get the snapshot. If the table contains a very large amount of data, you may need to increase this value to reduce the amount of time it takes to stream the complete snapshot. You can also choose not to take a snapshot by modifying the Debezium configuration.
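As a rough guide to sizing the batch, the number of internal snapshot calls can be estimated by dividing the row count by the batch size. The sketch below assumes each call returns one full batch until the table is exhausted; `snapshot_batches` is a hypothetical helper, not part of YugabyteDB.

```python
def snapshot_batches(row_count: int, batch_size: int = 250) -> int:
    """Estimate the number of batches needed to stream row_count rows."""
    if row_count < 0:
        raise ValueError("row_count must not be negative")
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Ceiling division without floating point.
    return (row_count + batch_size - 1) // batch_size


default_calls = snapshot_batches(1_000_000)      # default batch size of 250
tuned_calls = snapshot_batches(1_000_000, 5000)  # larger batch size
```

For a table of one million rows, the default of 250 gives 4000 batches, while a batch size of 5000 gives 200. Larger batches mean fewer round trips, at the cost of bigger responses per call.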
- YCQL tables aren't currently supported. Issue 11320.
- DROP and TRUNCATE commands aren't supported. If a user issues these commands on a table while a stream ID exists for that table, the behavior is undefined and the server might crash. Issues for TRUNCATE 10010 and DROP 10069.
- If a stream ID is created, and after that a new table is created, the existing stream ID is not able to stream data from the newly created table. The user needs to create a new stream ID. Issue 10921.
- A single stream can be used to stream data from only one namespace.
- Enabling CDC on tables created using previous versions of YugabyteDB is not supported, even after YugabyteDB is upgraded to version 2.13 or higher.
In addition, CDC support for the following features will be added in upcoming releases: