From 935eb9aae3b792b910309c3b7657f9f94d2e4fc5 Mon Sep 17 00:00:00 2001
From: Adriano dos Santos Fernandes
Date: Thu, 1 Sep 2016 13:28:01 -0300
Subject: [PATCH] Convert Window Functions README to markdown.

---
 doc/sql.extensions/README.window_functions.md | 230 +++++++++++++++++
 .../README.window_functions.txt               | 243 ------------------
 2 files changed, 230 insertions(+), 243 deletions(-)
 create mode 100644 doc/sql.extensions/README.window_functions.md
 delete mode 100644 doc/sql.extensions/README.window_functions.txt

diff --git a/doc/sql.extensions/README.window_functions.md b/doc/sql.extensions/README.window_functions.md
new file mode 100644
index 0000000000..b0ebcd1d62
--- /dev/null
+++ b/doc/sql.extensions/README.window_functions.md
@@ -0,0 +1,230 @@
+# Window Functions
+
+By the SQL specification, window functions (also known as analytical functions) are a kind of aggregation, but one that does not "filter" the result set of a query. The aggregated data is mixed with the query result set.
+
+Such functions are used with the `OVER` clause. Window functions may appear only in the select list or the `ORDER BY` clause of a query.
+
+In addition to the `OVER` clause, Firebird window functions may be partitioned and ordered.
+
+Syntax:
+
+```
+<window function> ::= <window function name>([<expr> [, <expr> ...]]) OVER (
+    [PARTITION BY <expr> [, <expr> ...]]
+    [ORDER BY <expr> [<direction>] [<nulls placement>] [, <expr> [<direction>] [<nulls placement>] ...]]
+)
+
+<direction> ::= {ASC | DESC}
+
+<nulls placement> ::= NULLS {FIRST | LAST}
+```
+
+## 1. Aggregate functions used as window functions
+
+All aggregate functions may be used as window functions, adding the `OVER` clause. Imagine a table EMPLOYEE with columns ID, DEPARTMENT and SALARY, and the need to show each employee with their salary and its percentage of the total payroll. With a "normal" query, this is possible in the following manner:
+
+```sql
+select
+    id,
+    department,
+    salary,
+    salary / (select sum(salary) from employee) percentage
+  from employee
+  order by id;
+```
+
+Results:
+
+| id | department | salary | percentage |
+|---:|------------|-------:|-----------:|
+|  1 | R & D      |  10.00 |     0.2040 |
+|  2 | SALES      |  12.00 |     0.2448 |
+|  3 | SALES      |   8.00 |     0.1632 |
+|  4 | R & D      |   9.00 |     0.1836 |
+|  5 | R & D      |  10.00 |     0.2040 |
+
+The query has to be repeated in a subquery, and the results can take a long time to arrive, especially if EMPLOYEE is a complex view.
+
+The same query could be specified in a much more elegant and faster way using a window function:
+
+```sql
+select
+    id,
+    department,
+    salary,
+    salary / sum(salary) over () percentage
+  from employee
+  order by id;
+```
+
+Here, `sum(salary) over ()` is computed as the sum of SALARY over all rows of the query (the employee table).
+
+## 2. Partitioning
+
+Like aggregate functions, which may operate alone or in relation to a group, window functions may also operate on a group, which is called a "partition". Its syntax is:
+
+```
+<window function>(...) OVER (PARTITION BY <expr> [, <expr> ...])
+```
+
+When aggregation is done over groups, it may produce more than one row, so the result set generated by a partition is joined with the main query using the same expression list as the partition.
+
+Continuing the employee example, instead of getting the percentage of the employee salary over all employees, we would like to get the percentage based only on the employees in the same department:
+
+```sql
+select
+    id,
+    department,
+    salary,
+    salary / sum(salary) over (partition by department) percentage
+  from employee
+  order by id;
+```
+
+Results:
+
+| id | department | salary | percentage |
+|---:|------------|-------:|-----------:|
+|  1 | R & D      |  10.00 |     0.3448 |
+|  2 | SALES      |  12.00 |     0.6000 |
+|  3 | SALES      |   8.00 |     0.4000 |
+|  4 | R & D      |   9.00 |     0.3103 |
+|  5 | R & D      |  10.00 |     0.3448 |
+
+## 3. Ordering
+
+The `ORDER BY` sub-clause can be used with or without partitions and, when used with the standard aggregate functions, makes them return the partial aggregations as the records are processed. Example:
+
+```sql
+select
+    id,
+    salary,
+    sum(salary) over (order by salary) running_salary
+  from employee
+  order by salary;
+```
+
+The result set produced will be:
+
+| id | salary | running_salary |
+|---:|-------:|---------------:|
+|  3 |   8.00 |           8.00 |
+|  4 |   9.00 |          17.00 |
+|  1 |  10.00 |          37.00 |
+|  5 |  10.00 |          37.00 |
+|  2 |  12.00 |          49.00 |
+
+Here, `running_salary` returns the partial/accumulated (or running) aggregation (of the `SUM` function). It may appear strange that 37.00 is repeated for ids 1 and 5, but that is how it should work. Rows with equal `ORDER BY` keys are grouped together and the aggregation is computed once for the group (summing both 10.00 values). To avoid this, you can add the ID field to the end of the `ORDER BY` clause.
+
+It's possible to use multiple windows with different orders, and `ORDER BY` parts like `ASC` / `DESC` and `NULLS FIRST` / `NULLS LAST`.
+
+With a partition, `ORDER BY` works the same way, but at each partition boundary the aggregation is reset.
+
+All aggregate functions are usable with `ORDER BY`, except the `LIST` function.
+
+## 4. Exclusive window functions
+
+Beyond aggregate functions, there are also exclusive window functions, currently divided into ranking and navigational categories.
+
+Both sets of functions can be used with/without partition/ordering, but the usage does not make much sense without ordering.
+
+## 4.1 Ranking functions
+
+Syntax:
+
+```
+<ranking window function> ::=
+    DENSE_RANK() |
+    RANK() |
+    PERCENT_RANK() |
+    CUME_DIST() |
+    NTILE(<expr>) |
+    ROW_NUMBER()
+```
+
+The rank functions compute the ordinal rank of a row within the window partition. This category includes the functions `DENSE_RANK`, `RANK` and `ROW_NUMBER`.
+
+With these functions, one can create different types of incremental counters. Think about `SUM(1) OVER (ORDER BY SALARY)`: these functions do this type of thing, but each in a different way. The following example query also compares them with the `SUM` behavior.
+
+```sql
+select
+    id,
+    salary,
+    dense_rank() over (order by salary),
+    rank() over (order by salary),
+    percent_rank() over (order by salary),
+    cume_dist() over (order by salary),
+    ntile(3) over (order by salary),
+    row_number() over (order by salary),
+    sum(1) over (order by salary)
+  from employee
+  order by salary;
+```
+
+And the result set:
+
+| id | salary | dense_rank | rank |      percent_rank |          cume_dist | ntile | row_number | sum |
+|---:|-------:|-----------:|-----:|------------------:|-------------------:|------:|-----------:|----:|
+|  3 |   8.00 |          1 |    1 | 0.000000000000000 | 0.2000000000000000 |     1 |          1 |   1 |
+|  4 |   9.00 |          2 |    2 | 0.250000000000000 | 0.4000000000000000 |     1 |          2 |   2 |
+|  1 |  10.00 |          3 |    3 | 0.500000000000000 | 0.8000000000000000 |     2 |          3 |   4 |
+|  5 |  10.00 |          3 |    3 | 0.500000000000000 | 0.8000000000000000 |     2 |          4 |   4 |
+|  2 |  12.00 |          4 |    5 | 1.000000000000000 | 1.0000000000000000 |     3 |          5 |   5 |
+
+The difference between `DENSE_RANK` and `RANK` is that there is a gap related to duplicate rows (in relation to the window ordering) only in `RANK`. `DENSE_RANK` continues assigning sequential numbers after the duplicate salary. On the other hand, `ROW_NUMBER` always assigns sequential numbers, even when there are duplicate values.
+
+`PERCENT_RANK` is the ratio of `RANK` minus 1 to the group count minus 1.
+
+`CUME_DIST` is the cumulative distribution of a value in a group.
+
+`NTILE` distributes the rows into a specified number of groups. The `NTILE` argument is restricted to a positive integer literal, a variable (`:var`) or a DSQL parameter (`?`).
+
+## 4.2 Navigational functions
+
+Syntax:
+
+```
+<navigational window function> ::=
+    FIRST_VALUE(<expr>) |
+    LAST_VALUE(<expr>) |
+    NTH_VALUE(<expr>, <offset>) [FROM FIRST | FROM LAST] |
+    LAG(<expr> [, <offset> [, <default>]]) |
+    LEAD(<expr> [, <offset> [, <default>]])
+```
+
+The navigational functions get the simple (non-aggregated) value of an expression from another row (inside the same partition) of the query.
+
+It's important to note that `FIRST_VALUE`, `LAST_VALUE` and `NTH_VALUE` also operate on a window frame, and Firebird currently always frames from the first to the current (and not the last) row of the partition. This is likely to produce surprising results for `NTH_VALUE` and especially `LAST_VALUE`.
+
+```sql
+select
+    id,
+    salary,
+    first_value(salary) over (order by salary),
+    last_value(salary) over (order by salary),
+    nth_value(salary, 2) over (order by salary),
+    lag(salary) over (order by salary),
+    lead(salary) over (order by salary)
+  from employee
+  order by salary;
+```
+
+And the result set:
+
+| id | salary | first_value | last_value |    nth_value |          lag |         lead |
+|---:|-------:|------------:|-----------:|-------------:|-------------:|-------------:|
+|  3 |   8.00 |        8.00 |       8.00 |       <null> |       <null> |         9.00 |
+|  4 |   9.00 |        8.00 |       9.00 |         9.00 |         8.00 |        10.00 |
+|  1 |  10.00 |        8.00 |      10.00 |         9.00 |         9.00 |        10.00 |
+|  5 |  10.00 |        8.00 |      10.00 |         9.00 |        10.00 |        12.00 |
+|  2 |  12.00 |        8.00 |      12.00 |         9.00 |        10.00 |       <null> |
+
+`FIRST_VALUE` and `LAST_VALUE` get, respectively, the first and last value of the ordered partition.
+
+`NTH_VALUE` gets the n-th value, starting from the first (default) or the last record, from the ordered partition. If the offset is 1 from first, it's equivalent to `FIRST_VALUE`. If the offset is 1 from last, it's equivalent to `LAST_VALUE`.
+
+`LAG` and `LEAD` get the value from the row at the given offset (which defaults to 1) from the current row. If the offset points outside the partition, the default parameter (which defaults to NULL) is returned. `LAG` looks for a preceding row, and `LEAD` for a following row.
+
+
+Author:
+    Adriano dos Santos Fernandes
diff --git a/doc/sql.extensions/README.window_functions.txt b/doc/sql.extensions/README.window_functions.txt
deleted file mode 100644
index 6a868377b4..0000000000
--- a/doc/sql.extensions/README.window_functions.txt
+++ /dev/null
@@ -1,243 +0,0 @@
-----------------
-Window Functions
-----------------
-
-By the SQL specification, window functions (also know as analytical functions) are a kind of
-aggregation, but which does not "filter" the result set of a query. The aggregated data is mixed
-with the query result set. That sort of functions are used with the OVER clause. Window functions
-may appear only in the select list or the order by clause of a query.
-
-Additional to the OVER clause, Firebird window functions may be partitioned and ordered.
-
-Syntax:
-    <window function> ::= <window function name>([<expr> [, <expr> ...]]) OVER (
-      [PARTITION BY <expr> [, <expr> ...]]
-      [ORDER BY <expr> [<direction>] [<nulls placement>] [, <expr> [<direction>] [<nulls placement>] ...]
-    )
-
-    <direction> ::= {ASC | DESC}
-
-    <nulls placement> ::= NULLS {FIRST | LAST}
-
-1. Aggregate functions used as window functions
------------------------------------------------
-
-All aggregate functions may be used as window functions, adding the OVER clause. Imagine
-a table EMPLOYEE with columns ID, NAME and SALARY, and the need to show each employee with his
-respective salary and the percentage of his salary over the payroll. With a "normal" query, this is
-possible in the following manner:
-
-select
-    id,
-    department,
-    salary,
-    salary / (select sum(salary) from employee) percentage
-  from employee
-  order by id;
-
-Results:
-
-id  department  salary  percentage
---  ----------  ------  ----------
-1   R & D        10.00      0.2040
-2   SALES        12.00      0.2448
-3   SALES         8.00      0.1632
-4   R & D         9.00      0.1836
-5   R & D        10.00      0.2040
-
-It's necessary to repeat the query in a subquery and wait so much to see the results, specially if
-EMPLOYEE is a complex view.
-
-The same query could be specified in much more elegant and faster way using a window function:
-
-select
-    id,
-    department,
-    salary,
-    salary / sum(salary) over () percentage
-  from employee
-  order by id;
-
-Here, sum(salary) over () is computed with the sum of all SALARY from the query (the employee table).
-
-2. Partitioning
----------------
-
-Like aggregate functions, that may operate alone or in relation to a group, window functions may
-also operate on a group, which is called "partition". Its syntax is:
-    <window function>(...) OVER (PARTITION BY <expr> [, <expr> ...])
-
-When aggregation is done over a group, it could produce more than one row. So the result set
-generated by a partition is joined with the main query using the same expression list of the
-partition.
-
-Continuing the employee example, instead of get the percentage of the employee salary over all
-employees, we would like to get the percentage based only on the employees in the same department:
-
-select
-    id,
-    department,
-    salary,
-    salary / sum(salary) over (partition by department) percentage
-  from employee
-  order by id;
-
-Results:
-
-id  department  salary  percentage
---  ----------  ------  ----------
-1   R & D        10.00      0.3448
-2   SALES        12.00      0.6000
-3   SALES         8.00      0.4000
-4   R & D         9.00      0.3103
-5   R & D        10.00      0.3448
-
-3. Ordering
------------
-
-The ORDER BY sub-clause can be used with or without partitions, and used with the standard aggregate
-functions, make them return the partial aggregations as the records are being processed. Example:
-
-select
-    id,
-    salary,
-    sum(salary) over (order by salary) cum_salary
-  from employee
-  order by salary;
-
-The result set produced will be:
-
-id  salary  cum_salary
---  ------  ----------
-3     8.00        8.00
-4     9.00       17.00
-1    10.00       37.00
-5    10.00       37.00
-2    12.00       49.00
-
-Then cum_salary returns the partial/accumulated (or running) aggregation (of the SUM function).
-It may appear strange that 37.00 is repeated for the ids 1 and 5, but that is how it should work.
-The ORDER BY keys are grouped together and the aggregation is computed once (but summing the two
-10.00). To avoid this, you can add the ID field to the end of the ORDER BY clause.
-
-It's possible to use multiple windows with different orders, and ORDER BY parts like ASC/DESC and
-NULLS FIRST/LAST.
-
-With a partition, ORDER BY works the same way, but at each partition boundary the aggregation is
-reset.
-
-All aggregation functions are usable with ORDER BY, except the LIST function.
-
-4. Exclusive window functions
------------------------------
-
-Beyond aggregate functions, there is also exclusive window functions, currently divided in ranking
-and navigational categories.
-
-Both set of functions can be used with/without partition/ordering, but the usage does not make much
-sense without ordering.
-
-4.1 Ranking functions
----------------------
-
-Syntax:
-    <ranking window function> ::=
-        DENSE_RANK() |
-        RANK() |
-        PERCENT_RANK() |
-        CUME_DIST() |
-        NTILE(<expr>) |
-        ROW_NUMBER()
-
-The rank functions compute the ordinal rank of a row within the window partition. In this category
-are the functions: DENSE_RANK, RANK and ROW_NUMBER.
-
-With these functions, one can create different type of incremental counters. Think about
-SUM(1) OVER (ORDER BY SALARY), these functions do this type of thing, but all of them in different
-ways. Following is an example query, also comparing with the SUM behavior.
-
-select
-    id,
-    salary,
-    dense_rank() over (order by salary),
-    rank() over (order by salary),
-    percent_rank() over (order by salary),
-    cume_dist() over (order by salary),
-    ntile(3) over (order by salary),
-    row_number() over (order by salary),
-    sum(1) over (order by salary)
-  from employee
-  order by salary;
-
-And the result set:
-
-id  salary  dense_rank  rank        percent_rank             cume_dist   ntile  row_number  sum
---  ------  ----------  ----  ------------------  --------------------  ------  ----------  ---
-3     8.00           1     1   0.000000000000000    0.2000000000000000       1           1    1
-4     9.00           2     2  0.2500000000000000    0.4000000000000000       1           2    2
-1    10.00           3     3  0.5000000000000000    0.8000000000000000       2           3    4
-5    10.00           3     3  0.5000000000000000    0.8000000000000000       2           4    4
-2    12.00           4     5   1.000000000000000     1.000000000000000       3           5    5
-
-The difference between DENSE_RANK and RANK is that there is a gap related to duplicate rows (in
-relation to the window ordering) only in RANK. DENSE_RANK continues assigning sequential numbers
-after the duplicate salary. On the other hand, ROW_NUMBER always assigns sequential numbers, even
-when there is duplicate values.
-PERCENT_RANK is a ratio of RANK to group count.
-CUME_DIST is cumulative distribution of a value in a group.
-NTILE distributes the rows into a specified number of groups. NTILE argument is restricted to
-integral positive literal, variable (:var) and DSQL parameter (question mark).
-
-4.2 Navigational functions
---------------------------
-
-Syntax:
-    <navigational window function> ::=
-        FIRST_VALUE(<expr>) |
-        LAST_VALUE(<expr>) |
-        NTH_VALUE(<expr>, <offset>) [FROM FIRST | FROM LAST] |
-        LAG(<expr> [ [, <offset> [, <default> ] ] ) |
-        LEAD(<expr> [ [, <offset> [, <default> ] ] )
-
-The navigational functions gets the simple (non-aggregated) value of an expression from another
-row (inside the same partition) of the query.
-
-It's important to note that FIRST_VALUE, LAST_VALUE and NTH_VALUE also operates on a window frame,
-and Firebird is currently always framing from the first to the current (and not the last) row of
-the partition. This is likely to get strange results for NTH_VALUE and specially LAST_VALUE.
-
-select
-    id,
-    salary,
-    first_value(salary) over (order by salary),
-    last_value(salary) over (order by salary),
-    nth_value(salary, 2) over (order by salary),
-    lag(salary) over (order by salary),
-    lead(salary) over (order by salary)
-  from employee
-  order by salary;
-
-And the result set:
-
-id  salary  first_value  last_value  nth_value     lag    lead
---  ------  -----------  ----------  ---------  ------  ------
-3     8.00         8.00        8.00                       9.00
-4     9.00         8.00        9.00       9.00    8.00   10.00
-1    10.00         8.00       10.00       9.00    9.00   10.00
-5    10.00         8.00       10.00       9.00   10.00   12.00
-2    12.00         8.00       12.00       9.00   10.00
-
-FIRST_VALUE and LAST_VALUE gets respectively the first and last value of the ordered partition.
-
-NTH_VALUE gets the n-th value, starting from the first (default) or the last record, from the
-ordered partition. If offset is 1 from first, it's equivalent to FIRST_VALUE. If offset is 1 from
-last, it's equivalent to LAST_VALUE.
-
-LAG and LEAD get the value within a distance respect to the current row and the offset (which
-defaults to 1) passed. In the case the offset points to outside of the partition, the default
-parameter (which defaults to NULL) is returned. LAG looks for a preceding row, and LEAD for a
-following row.
-
-
-Author:
-    Adriano dos Santos Fernandes