Transaction retries in YSQL

YugabyteDB returns different error codes for the various scenarios that go wrong during transaction processing. Applications need to be designed to handle these scenarios correctly to be highly available so that users aren't impacted. Although most errors are common across multiple isolation levels, some errors are very specific to certain transaction isolation levels.

The examples in the following sections illustrate failure scenarios and techniques you can use to handle these failures in your applications.

Prerequisites

Follow the setup instructions to start a single local YugabyteDB instance, and create a table as follows:

  1. Create the table as follows:

    CREATE TABLE txndemo (
      k int,
      V int,
      PRIMARY KEY(k)
    );
    
  2. Add some data to the table as follows:

    INSERT INTO txndemo VALUES (1,10),(2,10),(3,10),(4,10),(5,10);
    

Automatic retries

YugabyteDB retries failed transactions automatically on the server side whenever possible without client intervention as per the concurrency control policies. This is the case even for single statements, which are implicitly considered transactions. In Read Committed isolation mode, the server retries indefinitely.

In some scenarios, a server-side retry is not suitable. For example, the retry limit has been reached or the transaction is not in a valid state. In these cases, it is the client's responsibility to retry the transaction at the application layer.

Client-side retry

Most transaction errors that happen due to conflicts and deadlocks can be restarted by the client. The following scenarios describe the causes for failures, and the required methods to be handled by the applications.

Execute the transaction in a try..catch block in a loop. When a re-tryable failure happens, issue ROLLBACK and then retry the transaction. To avoid overloading the server and ending up in an indefinite loop, wait for some time between retries and limit the number of retries. The following illustrates a typical client-side retry implementation:

max_attempts = 10   # max no.of retries
sleep_time = 0.002  # 2 ms - base sleep time
backoff = 2         # exponential multiplier

attempt = 0
while attempt < max_attempts:
    attempt += 1
    try :
        cursor = cxn.cursor()
        cursor.execute("BEGIN");

        # Execute Transaction Statments here

        cursor.execute("COMMIT");
        break
    except psycopg2.errors.SerializationFailure as e:
        cursor.execute("ROLLBACK")
        if attempt < max_attempts:
            time.sleep(sleep_time)
            sleep_time *= backoff

If the COMMIT is successful, the program exits the loop. attempt < max_attempts limits the number of retries to max_attempts, and the amount of time the code waits before the next retry also increases with sleep_time *= backoff. Choose values as appropriate for your application.

40001 - SerializationFailure

SerializationFailure errors happen when multiple transactions are updating the same set of keys (conflict) or when transactions are waiting on each other (deadlock). The error messages could be one of the following types:

  • During a conflict, certain transactions are retried. However, after the retry limit is reached, an error occurs as follows:

    ERROR:  40001: All transparent retries exhausted.
    
  • All transactions are given a dynamic priority. When a deadlock is detected, the transaction with lower priority is automatically killed. For this scenario, the client might receive a message similar to the following:

    ERROR:  40001: Operation expired: Heartbeat: Transaction XXXX expired or aborted by a conflict
    

The correct way to handle this error is with a retry loop with exponential backoff, as described in Client-side retry. When the UPDATE or COMMIT fails because of SerializationFailure, the code retries after waiting for sleep_time seconds, up to max_attempts.

Read Committed

In read committed isolation level, as the server retries internally, the client does not need to worry about handling SerializationFailure. Only transactions operating in repeated read and serializable levels need to handle serialization failures.

Another way to handle these failures is would be to rollback to a checkpoint before the failed statement and proceed further as described in Savepoints.

Savepoints

Savepoints are named checkpoints that can be used to rollback just a few statements, and then proceed with the transaction, rather than aborting the entire transaction when there is an error.

Consider the following example that inserts a row [k=1, v=30]:

connstr = 'postgresql://yugabyte@localhost:5433/yugabyte'
cxn = psycopg2.connect(connstr)
cursor = cxn.cursor()
try:
    cursor.execute("BEGIN")

    # ... Execute other statements

    cursor.execute("SAVEPOINT before_insert")
    try:
        # insert a row
        cursor.execute("INSERT INTO txndemo VALUES (1,30)")
    except psycopg2.errors.UniqueViolation as e:
        print(e)
        # k=1 already exists in our table
        cursor.execute("ROLLBACK TO SAVEPOINT before_insert")
        cursor.execute("UPDATE txndemo SET v=30 WHERE k=1;")

    # ... Execute other statements
    cursor.execute("COMMIT")
except Exception as e:
  print(e)
  cursor.execute("ROLLBACK")

If the row [k=1] already exists in the table, the INSERT would result in a UniqueViolation exception. Technically, the transaction would be in an error state and further statements would result in a 25P02: In failed SQL transaction error. You have to catch the exception and rollback. But instead of rolling back the entire transaction, you can rollback to the previously declared savepoint before_insert, and update the value of the row with k=1. Then you can continue with other statements in the transaction.

Non-retriable errors

Although most transactions can be retried in most error scenarios, there are cases where retrying a transaction will not resolve an issue. For example, errors can occur when statements are issued out of place. These statements have to be fixed in code to continue further.

25001 - Specify transaction isolation level

Transaction level isolation should be specified before the first statement of the transaction is executed. If not the following error occurs:

BEGIN;
BEGIN
Time: 0.797 ms
UPDATE txndemo SET v=20 WHERE k=1;
UPDATE 1
Time: 10.416 ms
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
ERROR:  25001: SET TRANSACTION ISOLATION LEVEL must be called before any query
Time: 3.808 ms
25006 - Modify a row in a read-only transaction

This error occurs when a row is modified after specifying a transaction to be READ ONLY as follows:

BEGIN READ ONLY;
BEGIN
Time: 1.095 ms
UPDATE txndemo SET v=20 WHERE k=1;
ERROR:  25006: cannot execute UPDATE in a read-only transaction
Time: 4.417 ms
25P02 - InFailedSqlTransaction

This error occurs when a statement is issued after there's already an error in a transaction. The error message would be similar to the following:

ERROR:  25P02: current transaction is aborted, commands ignored until end of transaction block

Consider the following scenario:

BEGIN;
BEGIN
Time: 0.393 ms
INVALID TXN STATEMENT;
ERROR:  42601: syntax error at or near "INVALID"
Time: 2.523 ms
SELECT * from txndemo where k=1;
ERROR:  25P02: current transaction is aborted, commands ignored until end of transaction block
Time: 17.074 ms

The only valid statements at this point would be ROLLBACK or COMMIT.

Learn more