Correctness issue for queries using SELECT DISTINCT

06 Feb 2024
Product Affected Versions Related Issues Fixed In
YSQL v2.20.1.x #20827 v2.20.2.0, v2.21.1.0

Description

The impacted releases contain a correctness issue for queries using the SELECT DISTINCT clause. The queries may return incorrect results when the following conditions are true.

  • BatchNestedLoopJoin is enabled
  • JOIN clause is present in the query
  • A DISTINCT clause is present in the query
  • The table has range indexes

The issue does not apply to hash or merge joins.

Mitigation

Upgrade to a release with the fix.

Details

  • The issue is based on how the query planner handles the optimization of DISTINCT operations in the presence of join conditions involving columns from the inner relation.
  • The issue concerns a query like SELECT DISTINCT t2.k FROM t1 JOIN t2 ON t1.k = t2.k; where a distinct index scan is chosen on table t2 (inner relation).
  • Previously, the planner erroneously treated t2.k as a constant during the distinct index scan because the condition t2.k = <some terms not related to t2> made it appear constant. However, this assumption is incorrect in batch nested loop joins because a single scan on t2 can return multiple values for t2.k matching different values from t1.k.
  • The correct behavior is to include t2.k in the distinct prefix (the columns used to eliminate duplicates) unless the condition is of the form t2.k = C, where C is a true constant value.
  • In the case of a nested loop join with foreach $1 in t1.k do; distinct index scan on t2 where t2.k = $1 , it is correct to treat t2.k as a constant and exclude it from the distinct prefix. But for a batch nested loop join with foreach ($1, ..., $1024) in t1.k do; distinct index scan on t2 where t2.k in ($1, ..., $1024), it is incorrect to treat t2.k as a constant because the scan can return multiple values for t2.k.
  • The issue does not apply to hash or merge joins.