Join us on
Star us on
Get Started
Slack
GitHub
Get Started
v2.5 (latest) v2.2 (stable) v2.1 (earlier version) v2.0 (earlier version) v1.3 (earlier version)
  • YUGABYTEDB CORE
    • Quick start
      • 1. Install YugabyteDB
      • 2. Create a local cluster
      • 3. Explore distributed SQL
      • 4. Build an application
        • Java
        • NodeJS
        • Go
        • Python
        • Ruby
        • C#
        • PHP
        • C++
        • C
        • Scala
    • Explore features
      • YSQL vs PostgreSQL
        • Schemas and Tables
        • Data Types
      • Fault tolerance
      • Horizontal Scalability
        • Scaling Transactions
        • Sharding Data
      • Transactions
        • Distributed Transactions
        • Isolation Levels
        • Explicit Locking
      • JSON Support
      • Multi-Region Deployments
        • Sync replication (3+ regions)
        • Async Replication (2+ regions)
        • Row-Level Geo-Partitioning
      • Follower reads
      • Colocated tables
      • Change data capture (CDC)
      • Observability
        • Prometheus Integration
    • Develop
      • Learn app development
        • 1. SQL vs NoSQL
        • 2. Data modeling
        • 3. Data types
        • 4. ACID transactions
        • 5. Aggregations
        • 6. Batch operations
        • 7. Date and time
        • 8. Strings and text
        • 9. TTL for data expiration
      • Ecosystem integrations
        • Apache Kafka
        • Apache Spark
        • JanusGraph
        • KairosDB
        • Presto
        • Metabase
      • Build GraphQL apps
        • Hasura
        • Prisma
      • Real-world examples
        • E-Commerce app
        • IoT fleet management
        • Retail Analytics
      • Explore sample apps
      • Best practices
    • Migrate
      • Migration process overview
      • Migrate from PostgreSQL
        • Convert a PostgreSQL schema
        • Migrate a PostgreSQL application
        • Export PostgreSQL data
        • Prepare a cluster
        • Import PostgreSQL data
        • Verify Migration
    • Deploy
      • Deployment checklist
      • Manual deployment
        • 1. System configuration
        • 2. Install software
        • 3. Start YB-Masters
        • 4. Start YB-TServers
        • 5. Verify deployment
      • Kubernetes
        • Single-zone
          • Open Source
          • Amazon EKS
          • Google Kubernetes Engine
          • Azure Kubernetes Service
        • Multi-zone
          • Amazon EKS
          • Google Kubernetes Engine
        • Multi-cluster
          • Google Kubernetes Engine
        • Best practices
        • Connect clients
      • Docker
      • Public clouds
        • Amazon Web Services
        • Google Cloud Platform
        • Microsoft Azure
      • Multi-DC deployments
        • Three+ data center (3DC)
        • Two data center (2DC)
        • Read replica clusters
      • Change data capture (CDC)
        • CDC to Kafka
    • Benchmark
      • TPC-C
      • sysbench
      • YCSB
      • Key-value workload
      • Large datasets
      • Scalability
        • Scaling queries
      • Resilience
        • Jepsen testing
      • Performance Troubleshooting
    • Secure
      • Security checklist
      • Enable Authentication
        • Enable User Authentication
        • Configure ysql_hba_conf_csv
      • Authentication Methods
        • Password Authentication
        • LDAP Authentication
        • Host-Based Authentication
        • Trust Authentication
      • Role-Based Access Control
        • Overview
        • Manage Users and Roles
        • Grant Privileges
        • Row-Level Security (RLS)
        • Column-Level Security
      • Encryption in transit
        • Create client certificates
        • Create server certificates
        • Enable server-to-server encryption
        • Enable client-to-server encryption
        • Connect to clusters
      • Encryption at rest
      • Column-Level Encryption
      • Audit Logging
        • Configure Audit Logging
        • Session-Level Audit Logging
        • Object-Level Audit Logging
    • Manage
      • Back up and restore
        • Back up data
        • Restore data
        • Snapshot and restore data
      • Migrate data
        • Bulk import
        • Bulk export
      • Change cluster configuration
      • Diagnostics reporting
      • Upgrade a deployment
      • Grow cluster
    • Troubleshoot
      • Troubleshooting
      • Common error messages
      • Cluster level issues
        • YCQL connection issues
        • YEDIS connection Issues
        • Recover tserver/master
        • Replace a failed YB-TServer
        • Replace a failed YB-Master
      • Node level issues
        • Check servers
        • Inspect logs
        • System statistics
        • Disk failure
    • Contribute
      • Core database
        • Contribution checklist
        • Build the source
        • Configure a CLion project
        • Run the tests
  • YUGABYTE PLATFORM
    • Yugabyte Platform
      • Overview
        • Install
        • Configure
      • Install Yugabyte Platform
        • Prerequisites
        • Prepare the environment
        • Install software
        • Prepare nodes (on-prem)
        • Uninstall software
      • Configure Yugabyte Platform
        • Create admin user
        • Configure the cloud provider
        • Configure the backup target
        • Configure alerts and health checking
        • Create and edit instance tags
      • Create deployments
        • Multi-zone universe
        • Multi-region universe
        • Read replica cluster
      • Manage deployments
        • Start and stop processes
        • Add a node
        • Remove a node
        • Edit a universe
        • Edit configuration flags
        • Upgrade the YugabyteDB software
        • Delete a universe
        • Migrate to Helm 3
      • Back up and restore universes
        • Configure backup storage
        • Back up universe data
        • Restore universe data
        • Schedule data backups
      • Security
        • Security checklist
        • Customize ports
        • Authorization platform
        • Create a KMS configuration
        • Enable encryption at rest
        • Enable encryption in transit (TLS)
        • Network security
      • Alerts and monitoring
        • Live Queries dashboard
      • Troubleshoot
        • Install and upgrade issues
        • Universe issues
      • Administer Yugabyte Platform
        • Back up and restore Yugabyte Platform
  • YUGABYTE CLOUD
    • Yugabyte Cloud
      • Free tier
      • Create clusters
      • Monitor clusters
      • Create databases
      • Manage database access
      • Connect to clusters
  • REFERENCE
    • Reference
    • Architecture
      • Design goals
      • Key concepts
        • Universe
        • YB-TServer Service
        • YB-Master Service
      • Core functions
        • Universe creation
        • Table creation
        • Write IO path
        • Read IO path
        • High availability
      • Layered architecture
      • Query layer
        • Overview
      • DocDB transactions layer
        • Transactions overview
        • Transaction isolation levels
        • Explicit locking
        • Single-row transactions
        • Distributed transactions
        • Transactional IO path
      • DocDB sharding layer
        • Hash & range sharding
        • Tablet splitting
        • Colocated tables
      • DocDB replication layer
        • Replication
        • xCluster replication
        • Read replicas
        • Change data capture (CDC)
      • DocDB storage layer
        • Persistence
        • Performance
    • APIs
      • YSQL
        • The SQL language
          • SQL statements
            • ABORT
            • ALTER DATABASE
            • ALTER DEFAULT PRIVILEGES
            • ALTER DOMAIN
            • ALTER GROUP
            • ALTER POLICY
            • ALTER ROLE
            • ALTER SEQUENCE
            • ALTER TABLE
            • ALTER USER
            • BEGIN
            • CALL
            • COMMENT
            • COMMIT
            • COPY
            • CREATE AGGREGATE
            • CREATE CAST
            • CREATE DATABASE
            • CREATE DOMAIN
            • CREATE EXTENSION
            • CREATE FUNCTION
            • CREATE GROUP
            • CREATE INDEX
            • CREATE OPERATOR
            • CREATE OPERATOR CLASS
            • CREATE POLICY
            • CREATE PROCEDURE
            • CREATE ROLE
            • CREATE RULE
            • CREATE SCHEMA
            • CREATE SEQUENCE
            • CREATE TABLE
            • CREATE TABLE AS
            • CREATE TRIGGER
            • CREATE TYPE
            • CREATE USER
            • CREATE VIEW
            • DEALLOCATE
            • DELETE
            • DO
            • DROP AGGREGATE
            • DROP CAST
            • DROP DATABASE
            • DROP DOMAIN
            • DROP EXTENSION
            • DROP FUNCTION
            • DROP GROUP
            • DROP OPERATOR
            • DROP OPERATOR CLASS
            • DROP OWNED
            • DROP POLICY
            • DROP PROCEDURE
            • DROP ROLE
            • DROP RULE
            • DROP SEQUENCE
            • DROP TABLE
            • DROP TRIGGER
            • DROP TYPE
            • DROP USER
            • END
            • EXECUTE
            • EXPLAIN
            • GRANT
            • INSERT
            • LOCK
            • PREPARE
            • REASSIGN OWNED
            • RESET
            • REVOKE
            • ROLLBACK
            • SELECT
            • SET
            • SET CONSTRAINTS
            • SET ROLE
            • SET SESSION AUTHORIZATION
            • SET TRANSACTION
            • SHOW
            • SHOW TRANSACTION
            • TRUNCATE
            • UPDATE
            • VALUES
          • WITH clause
            • WITH clause—SQL syntax and semantics
            • recursive CTE
            • case study—traversing an employee hierarchy
            • traversing general graphs
              • graph representation
              • common code
              • undirected cyclic graph
              • directed cyclic graph
              • directed acyclic graph
              • rooted tree
              • Unique containing paths
              • Stress testing find_paths()
            • case study—Bacon Numbers from IMDb
              • Bacon numbers for synthetic data
              • Bacon numbers for IMDb data
        • Data types
          • Array
            • array[] constructor
            • Literals
              • Text typecasting and literals
              • Array of primitive values
              • Row
              • Array of rows
            • FOREACH loop (PL/pgSQL)
            • array of DOMAINs
            • Functions and operators
              • ANY and ALL
              • Array comparison
              • Array slice operator
              • Array concatenation
              • Array properties
              • array_agg(), unnest(), generate_subscripts()
              • array_fill()
              • array_position(), array_positions()
              • array_remove()
              • array_replace() / set value
              • array_to_string()
              • string_to_array()
          • Binary
          • Boolean
          • Character
          • Date and time
          • JSON
            • JSON literals
            • Primitive and compound data types
            • Code example conventions
            • Indexes and check constraints
            • Functions & operators
              • ::jsonb, ::json, ::text (typecast)
              • ->, ->>, #>, #>> (JSON subvalues)
              • - and #- (remove)
              • || (concatenation)
              • = (equality)
              • @> and <@ (containment)
              • ? and ?| and ?& (key or value existence)
              • array_to_json()
              • jsonb_agg()
              • jsonb_array_elements()
              • jsonb_array_elements_text()
              • jsonb_array_length()
              • jsonb_build_object()
              • jsonb_build_array()
              • jsonb_each()
              • jsonb_each_text()
              • jsonb_extract_path()
              • jsonb_extract_path_text() and json_extract_path_text()
              • jsonb_object()
              • jsonb_object_agg()
              • jsonb_object_keys()
              • jsonb_populate_record()
              • jsonb_populate_recordset()
              • jsonb_pretty()
              • jsonb_set() and jsonb_insert()
              • jsonb_strip_nulls()
              • jsonb_to_record()
              • jsonb_to_recordset()
              • jsonb_typeof()
              • row_to_json()
              • to_jsonb()
          • Money
          • Numeric
          • Serial
          • UUID
        • Functions and operators
          • Aggregate functions
            • Informal functionality overview
            • Invocation syntax and semantics
            • grouping sets, rollup, cube
            • Per function signature and purpose
              • avg(), count(), max(), min(), sum()
              • array_agg(), string_agg(), jsonb_agg(), jsonb_object_agg()
              • bit_and(), bit_or(), bool_and(), bool_or()
              • variance(), var_pop(), var_samp(), stddev(), stddev_pop(), stddev_samp()
              • linear regression
                • covar_pop(), covar_samp(), corr()
                • regr_%()
              • mode(), percentile_disc(), percentile_cont()
              • rank(), dense_rank(), percent_rank(), cume_dist()
            • case study—percentile_cont() and the "68–95–99.7" rule
            • case study—linear regression on COVID data
              • Download the COVIDcast data
              • Ingest the COVIDcast data
                • Inspect the COVIDcast data
                • Copy the .csv files to staging tables
                • Check staged data conforms to the rules
                • Join the staged data into a single table
                • SQL scripts
                  • Create cr_staging_tables()
                  • Create cr_copy_from_scripts()
                  • Create assert_assumptions_ok()
                  • Create xform_to_covidcast_fb_survey_results()
                  • ingest-the-data.sql
              • Analyze the COVIDcast data
                • symptoms vs mask-wearing by day
                • Data for scatter-plot for 21-Oct-2020
                • Scatter-plot for 21-Oct-2020
                • SQL scripts
                  • analysis-queries.sql
                  • synthetic-data.sql
          • currval()
          • lastval()
          • nextval()
          • Window functions
            • Informal functionality overview
            • Invocation syntax and semantics
            • Per function signature and purpose
              • row_number(), rank() and dense_rank()
              • percent_rank(), cume_dist() and ntile()
              • first_value(), nth_value(), last_value()
              • lag(), lead()
              • Tables for the code examples
                • table t1
                • table t2
                • table t3
                • table t4
            • case study—analyzing a normal distribution
              • Bucket allocation scheme
              • do_clean_start.sql
              • cr_show_t4.sql
              • cr_dp_views.sql
              • cr_int_views.sql
              • cr_pr_cd_equality_report.sql
              • cr_bucket_using_width_bucket.sql
              • cr_bucket_dedicated_code.sql
              • do_assert_bucket_ok
              • cr_histogram.sql
              • cr_do_ntile.sql
              • cr_do_percent_rank.sql
              • cr_do_cume_dist.sql
              • do_populate_results.sql
              • do_report_results.sql
              • do_compare_dp_results.sql
              • do_demo.sql
              • Reports
                • Histogram report
                • dp-results
                • compare-dp-results
                • int-results
        • Extensions
        • Keywords
        • Reserved names
      • YCQL
        • ALTER KEYSPACE
        • ALTER ROLE
        • ALTER TABLE
        • CREATE INDEX
        • CREATE KEYSPACE
        • CREATE ROLE
        • CREATE TABLE
        • CREATE TYPE
        • DROP INDEX
        • DROP KEYSPACE
        • DROP ROLE
        • DROP TABLE
        • DROP TYPE
        • GRANT PERMISSION
        • GRANT ROLE
        • REVOKE PERMISSION
        • REVOKE ROLE
        • USE
        • INSERT
        • SELECT
        • EXPLAIN
        • UPDATE
        • DELETE
        • TRANSACTION
        • TRUNCATE
        • Simple expressions
        • Subscripted expressions
        • Function call
        • Operators
        • BLOB
        • BOOLEAN
        • Collection
        • FROZEN
        • INET
        • Integer and counter
        • Non-integer
        • TEXT
        • DATE, TIME, and TIMESTAMP
        • UUID and TIMEUUID
        • JSONB
        • Date and time
        • BATCH
    • CLIs
      • yb-ctl
      • yb-docker-ctl
      • ysqlsh
      • ycqlsh
      • yb-admin
      • yb-ts-cli
      • ysql_dump
      • ysql_dumpall
    • Configuration
      • yb-tserver
      • yb-master
      • yugabyted
      • Default ports
    • Drivers
      • Client drivers for YSQL API
      • YugabyteDB JDBC Driver
      • Client drivers for YCQL
      • Spring Data YugabyteDB
    • Connectors
      • Kafka Connect YugabyteDB
    • Third party tools
      • DBeaver
      • DbSchema
      • pgAdmin
      • SQL Workbench/J
      • TablePlus
      • Visual Studio Code
    • Sample datasets
      • Chinook
      • Northwind
      • PgExercises
      • SportsDB
  • RELEASES
    • Releases
    • Releases overview
    • Release versioning
    • What's new
      • v2.5 (latest)
      • v2.4.0 (stable)
    • Earlier releases
      • v2.3.3
      • v2.3.2
      • v2.3.1
      • v2.3.0
      • v2.1.8
      • v2.1.6
      • v2.2.0
      • v2.1.5
      • v2.1.4
      • v2.1.3
      • v2.1.2
      • v2.1.1
      • v2.1.0
      • v2.0.11
      • v2.0.10
      • v2.0.9
      • v2.0.8
      • v2.0.7
      • v2.0.6
      • v2.0.5
      • v2.0.3
      • v2.0.1
      • v2.0.0
      • v1.3.1
      • v1.3.0
      • v1.2.12
      • v1.2.11
      • v1.2.10
      • v1.2.9
      • v1.2.8
      • v1.2.6
      • v1.2.5
      • v1.2.4
  • FAQ
    • Comparisons
      • Amazon Aurora
      • Google Cloud Spanner
      • CockroachDB
      • TiDB
      • Vitess
      • MongoDB
      • FoundationDB
      • Amazon DynamoDB
      • Azure Cosmos DB
      • Apache Cassandra
      • PostgreSQL
      • Redis in-memory store
      • Apache HBase
    • FAQs
      • General FAQ
      • Operations FAQ
      • API compatibility FAQ
      • Yugabyte Platform FAQ
  • MISC
    • YEDIS
      • Quick start
      • Develop
        • Build an application
        • C#
        • C++
        • Go
        • Java
        • NodeJS
        • Python
      • API reference
        • APPEND
        • AUTH
        • CONFIG
        • CREATEDB
        • DELETEDB
        • LISTDB
        • SELECT
        • DEL
        • ECHO
        • EXISTS
        • EXPIRE
        • EXPIREAT
        • FLUSHALL
        • FLUSHDB
        • GET
        • GETRANGE
        • GETSET
        • HDEL
        • HEXISTS
        • HGET
        • HGETALL
        • HINCRBY
        • HKEYS
        • HLEN
        • HMGET
        • HMSET
        • HSET
        • HSTRLEN
        • HVALS
        • INCR
        • INCRBY
        • KEYS
        • MONITOR
        • PEXPIRE
        • PEXPIREAT
        • PTTL
        • ROLE
        • SADD
        • SCARD
        • RENAME
        • SET
        • SETEX
        • PSETEX
        • SETRANGE
        • SISMEMBER
        • SMEMBERS
        • SREM
        • STRLEN
        • ZRANGE
        • TSADD
        • TSCARD
        • TSGET
        • TSLASTN
        • TSRANGEBYTIME
        • TSREM
        • TSREVRANGEBYTIME
        • TTL
        • ZADD
        • ZCARD
        • ZRANGEBYSCORE
        • ZREM
        • ZREVRANGE
        • ZSCORE
        • PUBSUB
        • PUBLISH
        • SUBSCRIBE
        • UNSUBSCRIBE
        • PSUBSCRIBE
        • PUNSUBSCRIBE
    • Legal
      • Third party software
> APIs > YSQL > Functions and operators > Window functions > Per function signature and purpose >

percent_rank(), cume_dist() and ntile()

  • percent_rank()
  • cume_dist()
  • ntile()
  • Comparing the effect of percent_rank(), cume_dist(), and ntile() on the same input

These three window functions bear a strong family resemblance to each other. The explanation of what they achieve rests on the notion of a centile. A centile is a measure used in statistics to denote the value below which a given fraction of the values in a set falls. The term percentile is often preferred by those who like to express fractions as percentages. For example, the 70th percentile is the value below which 70% of the values fall.

Briefly:

  • percent_rank() assigns the centile fraction to each value in the input population
  • cume_dist() function returns, for a specified value within a population, the number of values that are less than or equal to the specified value divided by the total number of values—in other words, the relative position of a value within the population
  • ntile assigns the bucket number, in an equiheight histogram (see below) to each value in the population by specifying the required number of buckets as the function's actual argument.

Suppose that you want to divide a set of values into N subsets, where each subset contains the same number of values. The result is referred to as an equiheight histogram. This implies unequal value-axis bucket widths. Compare this with the width_bucket() regular built-in function that produces an equiwidth histogram—that is it marks off the value axis into equal width buckets that, in general each contains a different number of values.

If the required number of unequal width buckets for the equal height histogram is 5, then your goal is equivalently stated by saying that you want to split the population at the values that fall at the 20th, 40th, 60th, and 80th percentiles. In this use case, and if and only if the values are unique and the population size is an exact multiple of the required number of buckets, then any one of percent_rank(), cume_dist() or ntile() can be used to meet the goal. But using ntile() needs noticeably simpler SQL than using the other two because it's tailored for this, but only this, purpose. Moreover, if the set contains duplicate values, then only ntile() meets the goal (to the extent that this is feasible) because it randomly assigns ties at the boundaries between the buckets so that some go to the higher-value set and some go to the lower-value set. The dedicated section Analyzing a normal distribution with percent_rank(), cume_dist() and ntile() demonstrates this with two large populations of approximately normally distributed values—one with no duplicates and one with lots of duplicates. And it shows that if the population has ties, or when the population size is not an exact multiple of the required number of buckets, then percent_rank() and cume_dist() produce different results from each other, and each divides the population differently than ntile(), because they handle ties differently and because of other subtle differences.

If you want to split the population into N subsets of unequal population sizes, then you must use one of percent_rank() or cume_dist().

This documentation assumes that you already understand the subtle difference between the percentile notion and the cumulative distribution notion, and know how to define your goal with respect to splitting populations of values into subsets—and that you will therefore know how to choose which of the functions percent_rank(), cume_dist() or ntile() that you should use.

percent_rank()

Signature:

input value:       <no formal parameter>
return value:      double precision

Purpose: Return the percentile rank of each row within the window, with respect to the argument of the window_definition's window ORDER BY clause. The value p returned by percent_rank() is a number in the range 0 <= p <= 1. It is calculated like this:

percentile_rank = (rank - 1) / ("no. of rows in window" - 1)

Arguably, then, this function is wrongly named because to deserve its name it would return a value in the range 0 through 100.

The lowest value of percent_rank() within the window will always be 0.0, even when there is a tie between the lowest-ranking rows. The highest possible value of percent_rank() within the window is 1.0, but this value is seen only when the highest-ranking row has no ties. If the highest-ranking row does have ties, then the highest vale of percent_rank() within the window will be correspondingly less than 1.0 according to how many rows ties for the top rank.

For example, the query computes the percentile rank for a table "t" with primary key "k" and "score" column, over all the rows:

select
  k,
  score,
  100*(percent_rank() over (order by score)) as "percent_rank()"
from t;

When percent_rank() for a particular row has a value of, say,0.42, it means this score is higher than 42% of the other rows.

cume_dist()

Signature:

input value:       <no formal parameter>
return value:      double precision

Purpose: Return a value that represents the number of rows with values less than or equal to the current row’s value divided by the total number of rows—in other words, the relative position of a value in a set of values. The graph of all values of cume_dist() within the window is known as the cumulative distribution of the argument of the window_definition's window ORDER BY clause. The value c returned by cume_dist() is a number in the range 0 <= c <= 1. It is calculated like this:

cume_dist() =
  "no of rows with a value <= the current row's value" /
  "no. of rows in window"

Note: the formula suggests that The return value c will lie in the open-closed range 0 < c <= 1. But the results shown in the study described in the section Analyzing a normal distribution with percent_rank(), cume_dist() and ntile() show that the range is indeed the closed-closed 0 <= c <= 1. This is doubtless due to rounding errors in the function's implementation.

You can use cume_dist() to answer questions like this:

  • Show me the rows whose score is within the top x% of the window's population, ranked by score.

    prepare top_few_percent(double precision) as
    with v as (
      select
        score,
        100*cume_dist() over (order by dp_score) as cd,
        k
      from t)
    select
      score,
      to_char(cd, '999.9') as "cume_dist()",
      k
    from v
    where cd > $1
    order by cd desc, k;
    

ntile()

Signature:

input value:       no_of_buckets int
return value:      int

Purpose: Return an integer value for each row that maps it to a corresponding percentile. For example, if you wanted to mark the boundaries between the highest-ranking 20% of rows, the next-ranking 20% of rows, and so on, then you would use ntile(5). The top 20% of rows would be marked with 1, the next-to-top 20% of rows would be marked with 2, and so on so that the bottom 20% of rows would be marked with 5.

Note: If the number of rows in the window, N, is a multiple of the actual value with which you invoke ntile(), n, then each percentile set would have exactly N/n rows. This is achieved, if there are ties right at the boundary between two percentile sets, by randomly assigning some to one set and some to the other. If N is not a multiple of n, then ntile() assigns the rows to the percentile sets so that the numbers assigned to each are as close as possible to being the same.

Comparing the effect of percent_rank(), cume_dist(), and ntile() on the same input

If you haven't yet installed the tables that the code examples use, then go to the section The data sets used by the code examples.

The query that this section presents shows that the results produced by percent_rank() and cume_dist() are consistent with the formulas for these values that are given in the accounts, above, of these two functions.

Create a data set using the ysqlsh script that table t2 presents. This has been designed:

  • so that the scores for the rows with "class=1" are unique so that the ordering produces no ties
  • and so that the scores for the rows with "class=2" have duplicate values so that the ordering does produce ties. The first tie group has three rows; and the second has two rows.

Now do this:

with
  v1 as (
    select
      class,
      k,
      score,
      (rank()         over w1) as r,
      (percent_rank() over w1) as pr,
      (cume_dist()    over w1) as cd,
      (ntile(4)       over w1) as nt,
      (count(*)       over w2) as n_tot,
      (count(*)       over w3) as n_thru_curr
    from t2
    window
      w1 as (partition by class order by score),
      w2 as (w1 range between unbounded preceding and unbounded following),
      w3 as (w1 range between unbounded preceding and current row)),
  v2 as (
    select
      class,
      k,
      score,
      n_tot,
      n_thru_curr,
      r,
      (r::numeric - 1.0)/(n_tot::numeric - 1.0) as pr_calc,
      (n_thru_curr::numeric)/(n_tot::numeric)   as cd_calc,
      pr,
      cd,
      nt
    from v1)
select
  class,
  k,
  score,
  n_tot,
  n_thru_curr,
  (pr = pr_calc)::text     as "pr check",
  (cd = cd_calc)::text     as "cd check",
  r                        as "rank()",
  to_char(pr*100, '990.9') as "percent_rank()",
  to_char(cd*100, '990.9') as "cume_dist",
  nt                       as "ntile()"
from v2
order by class, "rank()", k;

This is the result. To make it easier to see the pattern, several blank lines have been manually inserted here between each successive set of rows with the same value for "class". And in the second set, which has ties, one blank line has been inserted between each tie group.

 class | k  | score | n_tot | n_thru_curr | pr check | cd check | rank() | percent_rank() | cume_dist | ntile() 
-------+----+-------+-------+-------------+----------+----------+--------+----------------+-----------+---------
     1 |  1 |     1 |     9 |           1 | true     | true     |      1 |    0.0         |   11.1    |       1
     1 |  2 |     2 |     9 |           2 | true     | true     |      2 |   12.5         |   22.2    |       1
     1 |  3 |     3 |     9 |           3 | true     | true     |      3 |   25.0         |   33.3    |       1
     1 |  4 |     4 |     9 |           4 | true     | true     |      4 |   37.5         |   44.4    |       2
     1 |  5 |     5 |     9 |           5 | true     | true     |      5 |   50.0         |   55.6    |       2
     1 |  6 |     6 |     9 |           6 | true     | true     |      6 |   62.5         |   66.7    |       3
     1 |  7 |     7 |     9 |           7 | true     | true     |      7 |   75.0         |   77.8    |       3
     1 |  8 |     8 |     9 |           8 | true     | true     |      8 |   87.5         |   88.9    |       4
     1 |  9 |     9 |     9 |           9 | true     | true     |      9 |  100.0         |  100.0    |       4

     2 | 10 |     2 |     9 |           3 | true     | true     |      1 |    0.0         |   33.3    |       1
     2 | 11 |     2 |     9 |           3 | true     | true     |      1 |    0.0         |   33.3    |       1
     2 | 12 |     2 |     9 |           3 | true     | true     |      1 |    0.0         |   33.3    |       1

     2 | 13 |     4 |     9 |           4 | true     | true     |      4 |   37.5         |   44.4    |       2

     2 | 14 |     5 |     9 |           5 | true     | true     |      5 |   50.0         |   55.6    |       2

     2 | 15 |     6 |     9 |           6 | true     | true     |      6 |   62.5         |   66.7    |       3

     2 | 16 |     7 |     9 |           8 | true     | true     |      7 |   75.0         |   88.9    |       4
     2 | 17 |     7 |     9 |           8 | true     | true     |      7 |   75.0         |   88.9    |       3

     2 | 18 |     9 |     9 |           9 | true     | true     |      9 |  100.0         |  100.0    |       4

Notice that in this example, the number of rows per window, 9, is not a multiple of the actual value, 4 with which the function is invoked. This means that the number of rows assigned to each bucket can't be the same. Here, as promised, ntile() makes a best effort to get the numbers as close to each other as is possible. You can confirm, visually, that the populations are three for one of the four buckets and two for the other three.

Notice, too, what the outcomes are for the tie groups. Each of percent_rank() and cume_dist() produces a different value for each row where "class=1", which has no ties. For "class=2", when there are ties, these functions each produce the same value for all the rows in each of the two tie groups. In contrast, ntile() assigns the two rows in the second tie group (with "score=7") to different percentile groups.

  • percent_rank()
  • cume_dist()
  • ntile()
  • Comparing the effect of percent_rank(), cume_dist(), and ntile() on the same input
Ask our community
  • Slack
  • Github
  • Forum
  • StackOverflow
Yugabyte
Contact Us
Copyright © 2017-2020 Yugabyte, Inc. All rights reserved.