Testing high-scale workloads with the TPC-C benchmark

Understand how YugabyteDB performs with high-scale workloads

Workloads in TPC-C are defined by the number of warehouses that the benchmark run simulates. This page explores how YugabyteDB performs as the number of warehouses is increased.

Get TPC-C binaries

First, you need the benchmark binaries. To download and extract the TPC-C binaries, run the following commands:

$ wget https://github.com/yugabyte/tpcc/releases/latest/download/tpcc.tar.gz
$ tar -zxvf tpcc.tar.gz
$ cd tpcc
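
The benchmark is driven by a Java-based tool, so the client machine needs a Java runtime (typically Java 8 or later) on its PATH. A quick check before proceeding:

$ java -version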

Client machine

The client machine is where the benchmark is run from. An 8vCPU machine with at least 16 GB of memory is recommended; suitable instance types for each cloud provider are listed below.

vCPU   AWS          Azure             GCP
8      c5.2xlarge   Standard_F8s_v2   n2-highcpu-8
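
To confirm that the client machine you provisioned matches this sizing, you can check its CPU count and memory from a shell. This is a quick sanity check using standard Linux utilities:

$ nproc
$ free -h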

Cluster setup

The following cloud provider instance types are recommended for this test.

Set up a local cluster

If a local universe is currently running, first destroy it.

Start a local three-node universe with an RF of 3 by first creating a single node, as follows:

./bin/yugabyted start \
                --advertise_address=127.0.0.1 \
                --base_dir=${HOME}/var/node1 \
                --cloud_location=aws.us-east-2.us-east-2a

On macOS, the additional nodes need loopback addresses configured, as follows:

sudo ifconfig lo0 alias 127.0.0.2
sudo ifconfig lo0 alias 127.0.0.3

Next, join more nodes with the previous node as needed. yugabyted automatically applies a replication factor of 3 when a third node is added.

Start the second node as follows:

./bin/yugabyted start \
                --advertise_address=127.0.0.2 \
                --base_dir=${HOME}/var/node2 \
                --cloud_location=aws.us-east-2.us-east-2b \
                --join=127.0.0.1

Start the third node as follows:

./bin/yugabyted start \
                --advertise_address=127.0.0.3 \
                --base_dir=${HOME}/var/node3 \
                --cloud_location=aws.us-east-2.us-east-2c \
                --join=127.0.0.1
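
If you are scripting the setup, the second and third nodes can also be started with a short loop. The following is a sketch equivalent to the two commands above, using the same base directories and simulated zones:

# Start nodes 2 and 3 in their own simulated zones and join them to node 1
ZONES=(us-east-2a us-east-2b us-east-2c)
for i in 2 3; do
  ./bin/yugabyted start \
    --advertise_address=127.0.0.$i \
    --base_dir=${HOME}/var/node$i \
    --cloud_location=aws.us-east-2.${ZONES[$((i-1))]} \
    --join=127.0.0.1
done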

After starting the yugabyted processes on all the nodes, configure the data placement constraint of the universe, as follows:

./bin/yugabyted configure data_placement --base_dir=${HOME}/var/node1 --fault_tolerance=zone

This command can be executed on any node where you already started YugabyteDB.

To check the status of a running multi-node universe, run the following command:

./bin/yugabyted status --base_dir=${HOME}/var/node1
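
Because each node has its own base directory, you can also verify that every local yugabyted process is up with a short loop over the same status command:

for n in node1 node2 node3; do
  ./bin/yugabyted status --base_dir=${HOME}/var/$n
done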

Store the IP addresses of the nodes in a shell variable for use in further commands.

IPS=127.0.0.1,127.0.0.2,127.0.0.3

Setup

To set up a cluster, refer to Set up a YugabyteDB Aeon cluster.

Store the IP addresses/public address of the cluster in a shell variable for use in further commands.

IPS=<cluster-name/IP>

Initialize the workloads

Before starting the workload, you need to load the data for ten warehouses. Replace the IP addresses with those of the nodes in your cluster.

$ ./tpccbenchmark --create=true --nodes=${IPS}
$ ./tpccbenchmark --load=true --nodes=${IPS}

Cluster               Loader threads   Loading time   Data set size
3 nodes, type 2vCPU   10               ~13 minutes    ~20 GB

The loading time for ten warehouses on a cluster with 3 nodes of type 16vCPU is approximately 3 minutes.
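
To spot-check the load, you can count the rows in one of the TPC-C tables using ysqlsh from the YugabyteDB installation directory. This is a sketch that assumes the benchmark created its tables in the default yugabyte database; the count should equal the number of warehouses loaded:

$ ./bin/ysqlsh -h 127.0.0.1 -c "SELECT COUNT(*) FROM warehouse;"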

Before starting the workload, you need to load the data for 100 warehouses. Replace the IP addresses with those of the nodes in your cluster.

$ ./tpccbenchmark --create=true --nodes=${IPS}
$ ./tpccbenchmark --load=true --nodes=${IPS} --warehouses=100 --loaderthreads 48

Cluster                Loader threads   Loading time   Data set size
3 nodes, type 16vCPU   48               ~20 minutes    ~80 GB

Tune the --loaderthreads parameter for higher parallelism during the load, based on the number and type of nodes in the cluster. The value of 48 threads used here is optimal for a 3-node cluster of type 16vCPU. For larger clusters or machines with more vCPUs, increase this value accordingly. For clusters with a replication factor of 3, a good approximation is the total number of cores across all the nodes in the cluster.
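
As an illustration of that approximation, the thread count can be derived in the shell from the cluster shape (NODES and VCPUS_PER_NODE are hypothetical variables; set them to match your cluster):

# RF=3 heuristic: total cores across all nodes in the cluster
NODES=3
VCPUS_PER_NODE=16
LOADER_THREADS=$((NODES * VCPUS_PER_NODE))   # 48 for this cluster
echo "--loaderthreads ${LOADER_THREADS}"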

Before starting the workload, you need to load the data for 1000 warehouses. Replace the IP addresses with those of the nodes in your cluster.

$ ./tpccbenchmark --create=true --nodes=${IPS}
$ ./tpccbenchmark --load=true --nodes=${IPS} --warehouses=1000 --loaderthreads 48

Cluster                Loader threads   Loading time   Data set size
3 nodes, type 16vCPU   48               ~3.5 hours     ~420 GB

Tune the --loaderthreads parameter for higher parallelism during the load, based on the number and type of nodes in the cluster. The value of 48 threads used here is optimal for a 3-node cluster of type c5d.4xlarge. For larger clusters or machines with more vCPUs, increase this value accordingly. For clusters with a replication factor of 3, a good approximation is the total number of cores across all the nodes in the cluster.

Before starting the workload, you need to load the data. In addition, ensure that you have exported the list of IP addresses of all the nodes involved (the IPS variable described earlier).

For 10k warehouses, multiple client machines of type c5.4xlarge are needed to drive the benchmark; the following example distributes the load across four clients, each loading 2,500 warehouses. With multiple clients, you need to perform the following steps.

First, create the database and the corresponding tables. Execute the following command from one of the clients:

./tpccbenchmark  --nodes=$IPS  --create=true --vv

After the database and tables are created, load the data from all the clients using the following commands (a scripted equivalent is shown after the commands):

The initial-delay-secs option staggers the client start times to avoid overloading the master with simultaneous catalog requests.

Client 1:
./tpccbenchmark -c config/workload_all.xml --load=true --nodes=${IPS} --warehouses=2500 --total-warehouses=10000 --loaderthreads=16 --start-warehouse-id=1 --initial-delay-secs=0 --vv

Client 2:
./tpccbenchmark -c config/workload_all.xml --load=true --nodes=${IPS} --warehouses=2500 --total-warehouses=10000 --loaderthreads=16 --start-warehouse-id=2501 --initial-delay-secs=30 --vv

Client 3:
./tpccbenchmark -c config/workload_all.xml --load=true --nodes=${IPS} --warehouses=2500 --total-warehouses=10000 --loaderthreads=16 --start-warehouse-id=5001 --initial-delay-secs=60 --vv

Client 4:
./tpccbenchmark -c config/workload_all.xml --load=true --nodes=${IPS} --warehouses=2500 --total-warehouses=10000 --loaderthreads=16 --start-warehouse-id=7501 --initial-delay-secs=90 --vv
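
Rather than typing each command, you can derive the per-client arguments from a single client number. The following sketch mirrors the commands above; CLIENT_ID is a hypothetical variable that you set to 1 through 4 on the corresponding client machine:

# Load 10,000 warehouses split evenly across 4 clients
CLIENT_ID=1                                # set to 1..4 on each client
WAREHOUSES_PER_CLIENT=2500
START_WAREHOUSE=$(( (CLIENT_ID - 1) * WAREHOUSES_PER_CLIENT + 1 ))
INITIAL_DELAY=$(( (CLIENT_ID - 1) * 30 ))  # stagger client starts by 30 seconds

./tpccbenchmark -c config/workload_all.xml --load=true --nodes=${IPS} \
    --warehouses=${WAREHOUSES_PER_CLIENT} --total-warehouses=10000 \
    --loaderthreads=16 --start-warehouse-id=${START_WAREHOUSE} \
    --initial-delay-secs=${INITIAL_DELAY} --vv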

Tune the --loaderthreads parameter for higher parallelism during the load, based on the number and type of nodes in the cluster and on the number of client machines driving the load. For clusters with a replication factor of 3, a good approximation for the total thread count across all clients is the number of cores across all the nodes in the cluster.

When the loading is completed, execute the following command to enable the foreign keys that were disabled to aid the loading times:

./tpccbenchmark --nodes=$IPS --enable-foreign-keys=true

Cluster                 Loader threads   Loading time   Data set size
30 nodes, type 16vCPU   480              ~5.5 hours     ~4 TB

Run the benchmark

You can run the workload (ten warehouses) against the database as follows:

$ ./tpccbenchmark --execute=true --warmup-time-secs=60 --nodes=${IPS}

You can run the workload for 100 warehouses against the database as follows:

$ ./tpccbenchmark --execute=true --warmup-time-secs=60 --nodes=${IPS} --warehouses=100

You can run the workload for 1000 warehouses against the database as follows:

$ ./tpccbenchmark --execute=true --warmup-time-secs=60 --nodes=${IPS} --warehouses=1000

You can then run the 10,000-warehouse workload against the database from each client:

Because initial-delay-secs staggers the client start times, the warmup time on each client is reduced by the same amount so that all clients finish warming up and start measuring together.

Client 1:
./tpccbenchmark -c config/workload_all.xml --execute=true --nodes=${IPS} --warehouses=2500 --start-warehouse-id=1 --total-warehouses=10000 --initial-delay-secs=0 --warmup-time-secs=900 --num-connections=90

Client 2:
./tpccbenchmark -c config/workload_all.xml --execute=true --nodes=${IPS} --warehouses=2500 --start-warehouse-id=2501 --total-warehouses=10000 --initial-delay-secs=30 --warmup-time-secs=870 --num-connections=90

Client 3:
./tpccbenchmark -c config/workload_all.xml --execute=true --nodes=${IPS} --warehouses=2500 --start-warehouse-id=5001 --total-warehouses=10000 --initial-delay-secs=60 --warmup-time-secs=840 --num-connections=90

Client 4:
./tpccbenchmark -c config/workload_all.xml --execute=true --nodes=${IPS} --warehouses=2500 --start-warehouse-id=7501 --total-warehouses=10000 --initial-delay-secs=90 --warmup-time-secs=810 --num-connections=90
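
The relationship between the delay and the warmup time can be expressed as a small calculation, using the same hypothetical CLIENT_ID convention as in the load step:

# Each client starts later but warms up for less time, so all clients
# finish warmup together, 900 seconds after client 1 starts.
CLIENT_ID=3
INITIAL_DELAY=$(( (CLIENT_ID - 1) * 30 ))
WARMUP_SECS=$(( 900 - INITIAL_DELAY ))     # client 3: delay 60, warmup 840
echo "--initial-delay-secs=${INITIAL_DELAY} --warmup-time-secs=${WARMUP_SECS}"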

Benchmark results

10 warehouses:

Metric                                         Value
Efficiency                                     97.85
TPMC                                           125.83
Average NewOrder Latency (ms)                  19.2
Connection Acquisition NewOrder Latency (ms)   0.72
Total YSQL Ops/sec                             39.51

100 warehouses:

Metric                                         Value
Efficiency                                     99.31
TPMC                                           1277.13
Average NewOrder Latency (ms)                  10.36
Connection Acquisition NewOrder Latency (ms)   0.27
Total YSQL Ops/sec                             379.78

1000 warehouses:

Metric                                         Value
Efficiency                                     100.01
TPMC                                           12861.13
Average NewOrder Latency (ms)                  16.09
Connection Acquisition NewOrder Latency (ms)   0.38
Total YSQL Ops/sec                             3769.86

When the execution is completed, copy the CSV result files from each of the client machines to a single machine and run merge-results to display the merged results.
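
For example, from the machine that will run the merge, you might collect the files with scp. This is a sketch only: client-1 through client-4 are hypothetical hostnames, and the location of the CSV output on each client may differ in your setup:

# Gather each client's CSV results into a single directory before merging
mkdir -p results-dir
for host in client-1 client-2 client-3 client-4; do
  scp ${host}:tpcc/results/*.csv results-dir/
done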

After copying the csv files to a directory such as results-dir, merge the results as follows:

./tpccbenchmark --merge-results=true --dir=results-dir --warehouses=10000

10,000 warehouses:

Metric                                         Value
Efficiency                                     99.92
TPMC                                           128499.53
Average NewOrder Latency (ms)                  26.4
Connection Acquisition NewOrder Latency (ms)   0.44
Total YSQL Ops/sec                             37528.64

Largest benchmark

In our testing, YugabyteDB was able to process 1M tpmC with 150,000 warehouses at an efficiency of 99.8% on an RF3 cluster of 75 48vCPU/96GB machines with a total data size of 50TB.

To accomplish this feat, numerous new features had to be implemented and existing ones optimized, including the following:

  • The number of RPCs issued was reduced to lower CPU usage.
  • Fine-grained locking was implemented.
  • Index updates were skipped when the indexed columns were not modified.
  • The table's encoded primary key was used as the pointer in the index.
  • Compactions were run continuously in the background.
  • Hot data was kept in the block cache.
  • Transaction retries were implemented.

The 150K-warehouse benchmark was run on YugabyteDB v2.11.
Warehouses   TPMC   Efficiency (%)   Nodes   Connections   New Order Latency   Machine Type (vCPUs)
150,000      1M     99.30            75      9000          123.33 ms           c5d.12xlarge (48)