TServer fatals with schema packing not found error

04 Jun 2025
Product: YSQL
Affected Versions: v2.20, v2024.1, v2024.2
Related Issues: #25106
Fixed In: v2024.1.6, v2024.2.3, v2.20.11.0 (upcoming)

Description

Background

The DocDB packed row format depends on the schema version to read rows. When a table undergoes a schema change, the older schema versions are retained for as long as any Sorted String Table (SST) file still refers to them. They are not retained forever: compactions garbage collect the metadata for schema versions that are no longer in use. The schema versions used by each SST file are stored in that file's frontiers.

Problem

When a large transaction inserts rows into a table, the data may be split into multiple RocksDB batches, each of which can flush independently into a separate SST file (for example, File A, File B, and File C). In affected versions, only the last batch updates the schema version in the frontier, so the earlier files carry rows packed with a schema version that their frontiers do not record.

This behavior can lead to aggressive garbage collection of schema version and packing metadata under specific conditions:

  • A schema version change occurs after earlier batches (for example, File A and File B) have been flushed.
  • Files containing the schema packing metadata (for example, File C) are compacted independently, allowing the garbage collection mechanism to remove this metadata.

While this scenario is rare, it has been observed in lab stress tests.
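
The scenario above can be illustrated with a small toy model. This is a sketch under simplifying assumptions, not YugabyteDB code: the names `SstFile`, `write_transaction`, `compact`, `gc_packings`, and the `record_in_every_batch` switch are hypothetical, each file tracks a single frontier schema version, and compaction is assumed to rewrite rows with the latest schema version.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class SstFile:
    name: str
    row_versions: Set[int]            # schema versions the packed rows actually use
    frontier_version: Optional[int]   # version recorded in the frontier (None if missing)


def write_transaction(batch_names: List[str], schema_version: int,
                      record_in_every_batch: bool) -> List[SstFile]:
    """Flush one SST file per batch of a large transaction.

    With record_in_every_batch=False, only the last batch records the
    schema version in its frontier (the behavior described above).
    """
    files = []
    for i, name in enumerate(batch_names):
        is_last = i == len(batch_names) - 1
        recorded = schema_version if (record_in_every_batch or is_last) else None
        files.append(SstFile(name, {schema_version}, recorded))
    return files


def compact(files: List[SstFile], names: Set[str], latest_version: int) -> List[SstFile]:
    """Compact only the named files; the output is assumed to use latest_version."""
    kept = [f for f in files if f.name not in names]
    merged = SstFile("+".join(sorted(names)), {latest_version}, latest_version)
    return kept + [merged]


def gc_packings(packings: Set[int], files: List[SstFile]) -> Set[int]:
    """Keep only the packings that some frontier still references."""
    referenced = {f.frontier_version for f in files if f.frontier_version is not None}
    return packings & referenced


def unreadable_files(files: List[SstFile], packings: Set[int]) -> List[str]:
    """Files whose rows need a packing that is no longer available."""
    return sorted(f.name for f in files if not f.row_versions <= packings)


def simulate(record_in_every_batch: bool) -> List[str]:
    packings = {0}
    files = write_transaction(["A", "B", "C"], 0, record_in_every_batch)
    packings.add(1)                                   # schema change after the flushes
    files = compact(files, {"C"}, latest_version=1)   # File C compacted on its own
    packings = gc_packings(packings, files)
    return unreadable_files(files, packings)


print(simulate(record_in_every_batch=False))  # files left without their packing
print(simulate(record_in_every_batch=True))   # every batch recorded its version
```

In this model, compacting File C alone removes the only frontier that referenced version 0, so garbage collection drops the packing that Files A and B still need. When every batch records its schema version, the same sequence leaves all files readable.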

Necessary conditions

The issue requires all of the following to coincide: YSQL with packed rows (packed row is GA only for YSQL), a schema modification, a large transaction, and a subsequent compaction that processes only the later SST files. If files A, B, and C were all compacted together, the issue would not occur.

Error

The TServer terminates with a fatal "schema packing not found" error. The TServer log contains a message like "Cannot find packing with version 0 for table …", for example:

F20241115 05:29:51 ../../src/yb/tablet/tablet.cc:1731] T 2fe6443354144bacabb250de4511d771 P cc3fe0b91eb748c294e6e1183087c7d9: Failed to write a batch with 0 operations into RocksDB: Corruption (yb/tablet/tablet_metadata.cc:387): Cannot find packing with version 0 for table tb_0 (table_id=00004000000030008000000000004115 schema version=5 cotable_id=00000000-0000-0000-0000-000000000000): Not found (yb/dockv/schema_packing.cc:745): Schema packing not found: 0, available_versions: [5, 3, 2, 4]
    @     0xaaaae32c7b5c  google::LogMessage::SendToLog()
    @     0xaaaae32c8a00  google::LogMessage::Flush()
    @     0xaaaae32c909c  google::LogMessageFatal::~LogMessageFatal()
    @     0xaaaae474ea30  yb::tablet::Tablet::WriteToRocksDB()
    @     0xaaaae474a53c  yb::tablet::Tablet::ApplyIntents()
    @     0xaaaae48097f4  yb::tablet::TransactionParticipant::Impl::ProcessReplicated()
    @     0xaaaae472b164  yb::tablet::UpdateTxnOperation::DoReplicated()
    @     0xaaaae471e384  yb::tablet::Operation::Replicated()
    @     0xaaaae4720990  yb::tablet::OperationDriver::ReplicationFinished()
    @     0xaaaae37e287c  yb::consensus::ConsensusRound::NotifyReplicationFinished()
    @     0xaaaae382db34  yb::consensus::ReplicaState::ApplyPendingOperationsUnlocked()
    @     0xaaaae382ceb0  yb::consensus::ReplicaState::AdvanceCommittedOpIdUnlocked()
    @     0xaaaae3817118  yb::consensus::RaftConsensus::UpdateReplica()
    @     0xaaaae37f9634  yb::consensus::RaftConsensus::Update()
    @     0xaaaae4abebd8  yb::tserver::ConsensusServiceImpl::UpdateConsensus()
    @     0xaaaae388bd7c  std::__1::__function::__func<>::operator()()
    @     0xaaaae388ca84  yb::consensus::ConsensusServiceIf::Handle()
    @     0xaaaae4671444  yb::rpc::ServicePoolImpl::Handle()
    @     0xaaaae45bece0  yb::rpc::InboundCall::InboundCallTask::Run()
    @     0xaaaae46806a8  yb::rpc::(anonymous namespace)::Worker::Execute()
    @     0xaaaae4f12dd8  yb::Thread::SuperviseThread()
    @     0xffff854478b8  start_thread
    @     0xffff854a3afc  thread_start
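
When scanning TServer logs for this signature, a small script can pull out the table, the missing packing version, and the versions still available. The sketch below is illustrative only: the regular expression is derived solely from the sample line above, the helper name `parse_packing_error` is hypothetical, and the message format may differ between releases.

```python
import re
from typing import Optional

# Pattern inferred from the sample log line in this article (an assumption,
# not a documented log format).
PACKING_ERROR = re.compile(
    r"Cannot find packing with version (?P<missing>\d+) for table (?P<table>\S+) "
    r"\(table_id=(?P<table_id>[0-9a-f]+) schema version=(?P<schema_version>\d+)"
    r".*?available_versions: \[(?P<available>[^\]]*)\]"
)

# Fragment of the fatal log line shown above.
SAMPLE = (
    "Corruption (yb/tablet/tablet_metadata.cc:387): Cannot find packing with "
    "version 0 for table tb_0 (table_id=00004000000030008000000000004115 "
    "schema version=5 cotable_id=00000000-0000-0000-0000-000000000000): "
    "Not found (yb/dockv/schema_packing.cc:745): Schema packing not found: 0, "
    "available_versions: [5, 3, 2, 4]"
)


def parse_packing_error(line: str) -> Optional[dict]:
    """Return the fields of a 'Cannot find packing' message, or None if absent."""
    match = PACKING_ERROR.search(line)
    if match is None:
        return None
    available = [int(v) for v in match.group("available").split(",") if v.strip()]
    return {
        "table": match.group("table"),
        "table_id": match.group("table_id"),
        "schema_version": int(match.group("schema_version")),
        "missing_version": int(match.group("missing")),
        "available_versions": sorted(available),
    }


print(parse_packing_error(SAMPLE))
```

A missing version that is lower than every entry in the available versions list, as in the sample, is consistent with the packing having been garbage collected.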

Mitigation

  • If only a single replica of a tablet is affected in a cluster with Replication Factor 3 or greater, the problem can be resolved by removing the affected tablet replica. Raft will then replicate a healthy copy from the remaining replicas.

  • A general workaround is to set the TServer flag enable_schema_packing_gc to false. While this action causes schema metadata to be retained indefinitely, the associated storage cost is minimal (calculated as: number_of_schema_changes_for_the_table * number_of_tablets_for_table * size_of_schema). This retained metadata will be reclaimed once the enable_schema_packing_gc flag is reset to true.

  • In scenarios where all replicas of a tablet encounter this failure, self-remediation is not possible. Please contact Yugabyte Support for assistance.
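
The storage cost formula given for the enable_schema_packing_gc workaround can be evaluated with a few lines of code. This is a rough sketch: the function name `retained_schema_metadata_bytes` and the input numbers are hypothetical and chosen only to show the arithmetic.

```python
def retained_schema_metadata_bytes(schema_changes: int, tablets: int,
                                   schema_size_bytes: int) -> int:
    """Upper-bound estimate: schema changes * tablets * size of one schema."""
    if min(schema_changes, tablets, schema_size_bytes) < 0:
        raise ValueError("inputs must be non-negative")
    return schema_changes * tablets * schema_size_bytes


# Hypothetical table: 50 schema changes, 24 tablets, 2 KiB per schema version.
estimate = retained_schema_metadata_bytes(50, 24, 2048)
print(f"{estimate} bytes, about {estimate / (1024 * 1024):.2f} MiB")
```

With these illustrative inputs the retained metadata comes to roughly 2.3 MiB for the table, which is in line with the cost being minimal.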

Details

The fix, delivered as part of #25106, adds the missing call to FlushSchemaVersion when writing intermediate transaction batches, so every batch records its schema version and not only the last one.