Testing high-scale workloads with the TPC-C benchmark

Understand how YugabyteDB performs with high-scale workloads

Workloads in TPC-C are defined by the number of warehouses that the benchmark run simulates. This page explores how YugabyteDB performs as the number of warehouses is increased.

Get TPC-C binaries

First, you need the benchmark binaries. To download and extract the TPC-C binaries, run the following commands:

$ wget https://github.com/yugabyte/tpcc/releases/latest/download/tpcc.tar.gz
$ tar -zxvf tpcc.tar.gz
$ cd tpcc
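
The benchmark is driven by a Java-based tool, so the client machine needs a Java runtime (typically Java 8 or later) on its PATH. A quick check before proceeding:

$ java -version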

Client machine

The client machine is where the benchmark is run from. An 8vCPU machine with at least 16 GB of memory is recommended; suitable instance types for each cloud provider are listed below.

vCPU   AWS          Azure             GCP
8      c5.2xlarge   Standard_F8s_v2   n2-highcpu-8
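
To confirm that the client machine you provisioned matches this sizing, you can check its CPU count and memory from a shell. This is a quick sanity check using standard Linux utilities:

$ nproc
$ free -h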

Cluster setup

The following cloud provider instance types are recommended for this test.

Set up a local cluster

If a local universe is currently running, first destroy it.

Start a local three-node universe with an RF of 3 by first creating a single node, as follows:

./bin/yugabyted start \
                --advertise_address=127.0.0.1 \
                --base_dir=${HOME}/var/node1 \
                --cloud_location=aws.us-east-2.us-east-2a

On macOS, the additional nodes need loopback addresses configured, as follows:

sudo ifconfig lo0 alias 127.0.0.2
sudo ifconfig lo0 alias 127.0.0.3

Next, join more nodes with the previous node as needed. yugabyted automatically applies a replication factor of 3 when a third node is added.

Start the second node as follows:

./bin/yugabyted start \
                --advertise_address=127.0.0.2 \
                --base_dir=${HOME}/var/node2 \
                --cloud_location=aws.us-east-2.us-east-2b \
                --join=127.0.0.1

Start the third node as follows:

./bin/yugabyted start \
                --advertise_address=127.0.0.3 \
                --base_dir=${HOME}/var/node3 \
                --cloud_location=aws.us-east-2.us-east-2c \
                --join=127.0.0.1
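
If you are scripting the setup, the second and third nodes can also be started with a short loop. The following is a sketch equivalent to the two commands above, using the same base directories and simulated zones:

# Start nodes 2 and 3 in their own simulated zones and join them to node 1
ZONES=(us-east-2a us-east-2b us-east-2c)
for i in 2 3; do
  ./bin/yugabyted start \
    --advertise_address=127.0.0.$i \
    --base_dir=${HOME}/var/node$i \
    --cloud_location=aws.us-east-2.${ZONES[$((i-1))]} \
    --join=127.0.0.1
done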

After starting the yugabyted processes on all the nodes, configure the data placement constraint of the universe, as follows:

./bin/yugabyted configure data_placement --base_dir=${HOME}/var/node1 --fault_tolerance=zone

This command can be executed on any node where you already started YugabyteDB.

To check the status of a running multi-node universe, run the following command:

./bin/yugabyted status --base_dir=${HOME}/var/node1
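
Because each node has its own base directory, you can also verify that every local yugabyted process is up with a short loop over the same status command:

for n in node1 node2 node3; do
  ./bin/yugabyted status --base_dir=${HOME}/var/$n
done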

Store the IP addresses of the nodes in a shell variable for use in further commands.

IPS=127.0.0.1,127.0.0.2,127.0.0.3

Setup

To set up a cluster, refer to Set up a YugabyteDB Aeon cluster.

Store the IP addresses/public address of the cluster in a shell variable for use in further commands.

IPS=<cluster-name/IP>

Initialize the workloads

Before starting the workload, you need to load the data for ten warehouses. Replace the IP addresses with those of the nodes in your cluster.

$ ./tpccbenchmark --create=true --nodes=${IPS}
$ ./tpccbenchmark --load=true --nodes=${IPS}

Cluster               Loader threads   Loading time   Data set size
3 nodes, type 2vCPU   10               ~13 minutes    ~20 GB

The loading time for ten warehouses on a cluster with 3 nodes of type 16vCPU is approximately 3 minutes.
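
To spot-check the load, you can count the rows in one of the TPC-C tables using ysqlsh from the YugabyteDB installation directory. This is a sketch that assumes the benchmark created its tables in the default yugabyte database; the count should equal the number of warehouses loaded:

$ ./bin/ysqlsh -h 127.0.0.1 -c "SELECT COUNT(*) FROM warehouse;"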

Before starting the workload, you need to load the data for 100 warehouses. Replace the IP addresses with those of the nodes in your cluster.

$ ./tpccbenchmark --create=true --nodes=${IPS}
$ ./tpccbenchmark --load=true --nodes=${IPS} --warehouses=100 --loaderthreads 48

Cluster                Loader threads   Loading time   Data set size
3 nodes, type 16vCPU   48               ~20 minutes    ~80 GB

Tune the --loaderthreads parameter for higher parallelism during the load, based on the number and type of nodes in the cluster. The value of 48 threads used here is optimal for a 3-node cluster of type 16vCPU. For larger clusters or machines with more vCPUs, increase this value accordingly. For clusters with a replication factor of 3, a good approximation is the total number of cores across all the nodes in the cluster.
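
As an illustration of that approximation, the thread count can be derived in the shell from the cluster shape (NODES and VCPUS_PER_NODE are hypothetical variables; set them to match your cluster):

# RF=3 heuristic: total cores across all nodes in the cluster
NODES=3
VCPUS_PER_NODE=16
LOADER_THREADS=$((NODES * VCPUS_PER_NODE))   # 48 for this cluster
echo "--loaderthreads ${LOADER_THREADS}"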

Before starting the workload, you need to load the data for 1000 warehouses. Replace the IP addresses with those of the nodes in your cluster.

$ ./tpccbenchmark --create=true --nodes=${IPS}
$ ./tpccbenchmark --load=true --nodes=${IPS} --warehouses=1000 --loaderthreads 48

Cluster                Loader threads   Loading time   Data set size
3 nodes, type 16vCPU   48               ~3.5 hours     ~420 GB

Tune the --loaderthreads parameter for higher parallelism during the load, based on the number and type of nodes in the cluster. The value of 48 threads used here is optimal for a 3-node cluster of type c5d.4xlarge. For larger clusters or machines with more vCPUs, increase this value accordingly. For clusters with a replication factor of 3, a good approximation is the total number of cores across all the nodes in the cluster.

Before starting the workload, you need to load the data. In addition, ensure that you have exported the list of IP addresses of all the nodes involved (the IPS variable described earlier).

For 10k warehouses, multiple client machines of type c5.4xlarge are needed to drive the benchmark; the following example distributes the load across four clients, each loading 2,500 warehouses. With multiple clients, you need to perform the following steps.

First, create the database and the corresponding tables. Execute the following command from one of the clients:

./tpccbenchmark  --nodes=$IPS  --create=true --vv

After the database and tables are created, load the data from all the clients using the following commands (a scripted equivalent is shown after the commands):

The initial-delay-secs option staggers the client start times to avoid overloading the master with simultaneous catalog requests.

Client 1:
./tpccbenchmark -c config/workload_all.xml --load=true --nodes=${IPS} --warehouses=2500 --total-warehouses=10000 --loaderthreads=16 --start-warehouse-id=1 --initial-delay-secs=0 --vv

Client 2:
./tpccbenchmark -c config/workload_all.xml --load=true --nodes=${IPS} --warehouses=2500 --total-warehouses=10000 --loaderthreads=16 --start-warehouse-id=2501 --initial-delay-secs=30 --vv

Client 3:
./tpccbenchmark -c config/workload_all.xml --load=true --nodes=${IPS} --warehouses=2500 --total-warehouses=10000 --loaderthreads=16 --start-warehouse-id=5001 --initial-delay-secs=60 --vv

Client 4:
./tpccbenchmark -c config/workload_all.xml --load=true --nodes=${IPS} --warehouses=2500 --total-warehouses=10000 --loaderthreads=16 --start-warehouse-id=7501 --initial-delay-secs=90 --vv
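
Rather than typing each command, you can derive the per-client arguments from a single client number. The following sketch mirrors the commands above; CLIENT_ID is a hypothetical variable that you set to 1 through 4 on the corresponding client machine:

# Load 10,000 warehouses split evenly across 4 clients
CLIENT_ID=1                                # set to 1..4 on each client
WAREHOUSES_PER_CLIENT=2500
START_WAREHOUSE=$(( (CLIENT_ID - 1) * WAREHOUSES_PER_CLIENT + 1 ))
INITIAL_DELAY=$(( (CLIENT_ID - 1) * 30 ))  # stagger client starts by 30 seconds

./tpccbenchmark -c config/workload_all.xml --load=true --nodes=${IPS} \
    --warehouses=${WAREHOUSES_PER_CLIENT} --total-warehouses=10000 \
    --loaderthreads=16 --start-warehouse-id=${START_WAREHOUSE} \
    --initial-delay-secs=${INITIAL_DELAY} --vv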

Tune the --loaderthreads parameter for higher parallelism during the load, based on the number and type of nodes in the cluster and on the number of client machines driving the load. For clusters with a replication factor of 3, a good approximation for the total thread count across all clients is the number of cores across all the nodes in the cluster.

When the loading is completed, execute the following command to enable the foreign keys that were disabled to aid the loading times:

./tpccbenchmark --nodes=$IPS --enable-foreign-keys=true

Cluster                 Loader threads   Loading time   Data set size
30 nodes, type 16vCPU   480              ~5.5 hours     ~4 TB

Run the benchmark

You can run the workload (ten warehouses) against the database as follows:

$ ./tpccbenchmark --execute=true --warmup-time-secs=60 --nodes=${IPS}

You can run the workload for 100 warehouses against the database as follows:

$ ./tpccbenchmark --execute=true --warmup-time-secs=60 --nodes=${IPS} --warehouses=100

You can run the workload for 1000 warehouses against the database as follows:

$ ./tpccbenchmark --execute=true --warmup-time-secs=60 --nodes=${IPS} --warehouses=1000

You can then run the 10,000-warehouse workload against the database from each client:

Because initial-delay-secs staggers the client start times, the warmup time on each client is reduced by the same amount so that all clients finish warming up and start measuring together.

Client 1:
./tpccbenchmark -c config/workload_all.xml --execute=true --nodes=${IPS} --warehouses=2500 --start-warehouse-id=1 --total-warehouses=10000 --initial-delay-secs=0 --warmup-time-secs=900 --num-connections=90

Client 2:
./tpccbenchmark -c config/workload_all.xml --execute=true --nodes=${IPS} --warehouses=2500 --start-warehouse-id=2501 --total-warehouses=10000 --initial-delay-secs=30 --warmup-time-secs=870 --num-connections=90

Client 3:
./tpccbenchmark -c config/workload_all.xml --execute=true --nodes=${IPS} --warehouses=2500 --start-warehouse-id=5001 --total-warehouses=10000 --initial-delay-secs=60 --warmup-time-secs=840 --num-connections=90

Client 4:
./tpccbenchmark -c config/workload_all.xml --execute=true --nodes=${IPS} --warehouses=2500 --start-warehouse-id=7501 --total-warehouses=10000 --initial-delay-secs=90 --warmup-time-secs=810 --num-connections=90
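
The relationship between the delay and the warmup time can be expressed as a small calculation, using the same hypothetical CLIENT_ID convention as in the load step:

# Each client starts later but warms up for less time, so all clients
# finish warmup together, 900 seconds after client 1 starts.
CLIENT_ID=3
INITIAL_DELAY=$(( (CLIENT_ID - 1) * 30 ))
WARMUP_SECS=$(( 900 - INITIAL_DELAY ))     # client 3: delay 60, warmup 840
echo "--initial-delay-secs=${INITIAL_DELAY} --warmup-time-secs=${WARMUP_SECS}"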

Benchmark results

10 warehouses:

Metric                                         Value
Efficiency                                     97.85
TPMC                                           125.83
Average NewOrder Latency (ms)                  19.2
Connection Acquisition NewOrder Latency (ms)   0.72
Total YSQL Ops/sec                             39.51

100 warehouses:

Metric                                         Value
Efficiency                                     99.31
TPMC                                           1277.13
Average NewOrder Latency (ms)                  10.36
Connection Acquisition NewOrder Latency (ms)   0.27
Total YSQL Ops/sec                             379.78

1000 warehouses:

Metric                                         Value
Efficiency                                     100.01
TPMC                                           12861.13
Average NewOrder Latency (ms)                  16.09
Connection Acquisition NewOrder Latency (ms)   0.38
Total YSQL Ops/sec                             3769.86

When the execution is completed, copy the CSV result files from each of the client machines to a single machine and run merge-results to display the merged results.
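
For example, from the machine that will run the merge, you might collect the files with scp. This is a sketch only: client-1 through client-4 are hypothetical hostnames, and the location of the CSV output on each client may differ in your setup:

# Gather each client's CSV results into a single directory before merging
mkdir -p results-dir
for host in client-1 client-2 client-3 client-4; do
  scp ${host}:tpcc/results/*.csv results-dir/
done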

After copying the csv files to a directory such as results-dir, merge the results as follows:

./tpccbenchmark --merge-results=true --dir=results-dir --warehouses=10000

10,000 warehouses:

Metric                                         Value
Efficiency                                     99.92
TPMC                                           128499.53
Average NewOrder Latency (ms)                  26.4
Connection Acquisition NewOrder Latency (ms)   0.44
Total YSQL Ops/sec                             37528.64

Largest benchmark

In our testing, YugabyteDB was able to process 1M tpmC with 150,000 warehouses at an efficiency of 99.8% on an RF3 cluster of 75 48vCPU/96GB machines with a total data size of 50TB.

To accomplish this feat, numerous new features had to be implemented and existing ones optimized, including the following:

  • The number of RPCs issued was reduced to lower CPU usage.
  • Fine-grained locking was implemented.
  • Index updates were skipped when the indexed columns were not modified.
  • The table's encoded primary key was used as the pointer in the index.
  • Compactions were run continuously in the background.
  • Hot data was kept in the block cache.
  • Transaction retries were implemented.

The 150K-warehouse benchmark was run on YugabyteDB v2.11.
Warehouses   TPMC   Efficiency (%)   Nodes   Connections   New Order Latency   Machine Type (vCPUs)
150,000      1M     99.30            75      9000          123.33 ms           c5d.12xlarge (48)