You are here:Home » SQL Server Denali » Amazing Analytical functions arrived by SQL Server Denali CTP3

Amazing Analytical functions arrived by SQL Server Denali CTP3


May be you already read my earlier post for the COLUMNSTORE index introduced by SQL Server Denali CTP3. I have also posted NEW THINGS which I have learned and what is new. Now i am going to demonstrate something new about analytical functions coming in SQL Server Denali CTP3 version. So let look on those functions and elaborate on details. Below are the summary of those new functions,

1. FIRST_VALUE() : Returns first value based on partition and order by clause.
2. LAST_VALUE() : Returns last value based on partition and order by clause.
3. PERCENT_RANK() : It represents the percentage of values less than current value in the group excluding highest value.
4. CUME_DIST() : It gives the percentage of values less than or equal to current value in the group. This is called as cumulative distribution.
5. PERCENTILE_CONT() : Computed by linear interpolation between values after ordering them.
6. PERCENTILE_DISC(x) : Examines the cumulative distribution values (CUME_DIST()) in each group until it finds one that is greater than or equal to x.
7. LEAD() : Used to access data from a subsequent row in the same result set without the use of a self-join.
8. LAG() : Used to access data from a previous row in the same result set without the use of a self-join.

Lets brief them with proper examples and will check the resultsets return with them. To evaluate it, we need sample table, so let create and populate it first.
IF (OBJECT_ID('Results','U') > 0)
DROP TABLE [Results]

-- Creating table
CREATE TABLE [Results]
(
[Subject] varchar(10),
[Student] varchar(50),
[Marks] int
)

-- Inserting sample records
INSERT INTO [Results]
VALUES
('Maths','Student1',45),
('Physics','Student2',45),
('Physics','Student1',50),
('Chemistry','Student3',20),
('Physics','Student3',35),
('Biology','Student1',20),
('Biology','Student2',60),
('Biology','Student3',65),
('Chemistry','Student1',75),
('Biology','Student4',30)
GO
First, we will run workaround for the first two functions, FIRST_VALUE() & LAST_VALUE().
-- Querying with PERCENT_RANK() and CUME_DIST()
-- Partitioned with [Subject]
SELECT
[Student],
[Subject],
[Marks],
FIRST_VALUE([Marks]) OVER(PARTITION BY [Subject] ORDER BY [Subject]) as FV,
LAST_VALUE([Marks]) OVER(PARTITION BY [Subject] ORDER BY [Subject]) as LV 
FROM [Results]
ORDER BY [Subject],[Marks]
GO

-- Querying with PERCENT_RANK() and CUME_DIST()
-- Partitioned with [Student]
SELECT
[Student],
[Subject],
[Marks],
FIRST_VALUE([Marks]) OVER(PARTITION BY [Student] ORDER BY [Student]) as FV,
LAST_VALUE([Marks]) OVER(PARTITION BY [Student] ORDER BY [Student]) as LV 
FROM [Results]
ORDER BY [Student],[Marks]
GO

You can see as per partition these functions returned first values and last values. Now the second chance to evaluate PERCENT_RANK () and CUME_DIST () functions.
-- Querying with PERCENT_RANK() and CUME_DIST()
-- Partitioned with [Subject]
SELECT
[Student],
[Subject],
[Marks],
PERCENT_RANK() OVER(PARTITION BY [Subject] ORDER BY [Marks]) as PR,
CUME_DIST() OVER(PARTITION BY [Subject] ORDER BY [Marks]) as CD 
FROM [Results]
ORDER BY [Subject],[Marks]
GO

-- Querying with PERCENT_RANK() and CUME_DIST()
-- Partitioned with [Student]
SELECT
[Student],
[Subject],
[Marks],
PERCENT_RANK() OVER(PARTITION BY Student ORDER BY [Marks]) as PR,
CUME_DIST() OVER(PARTITION BY Student ORDER BY [Marks]) as CD 
FROM [Results]
ORDER BY [Subject],[Marks]
GO

You can see from the above image, Percent_rank () for the highest value in the group will be always 1. From fist query,

Suppose in a group of 4 values,
for 4rd value, PERCENT_RANK() will be (4 - 1)/(4-1) = 3/3 = 1,
for 3rd value, PERCENT_RANK() will be (3 - 1)/(4-1) = 2/3 = 0.667,
for 2rd value, PERCENT_RANK() will be (2 - 1)/(4-1) = 1/3 = 0.333
and for 1rd value, PERCENT_RANK() will be (1 - 1)/(4-1) = 0/3 = 0 .

Same way for CUME_DIST(),
Suppose in a group 4 values, for 4rd value, CUME_DIST() will gives 4/4 = 1,
in a group 4 values, for 3rd value, CUME_DIST() will gives 3/4 = 0.75,
in a group 4 values, for 2rd value, CUME_DIST() will gives 2/4 = 0.50,
and in a group 4 values, for 1rd value, CUME_DIST() will gives 1/4 = 0.25 .

Above explanation I have given for the 4 values of groups, You can calculate for 3 values of groups as per that.

Now we should move on next analytical functions, PERCENTILE_CONT() and  PERCENTILE_DISC()
-- Querying with PERCENTILE_CONT() and  PERCENTILE_DISC()
-- with 0.5 PERCENTILE value
SELECT
[Student],
[Subject],
[Marks],
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY [MARKS])
  OVER(PARTITION BY [Subject]) as PC,
CUME_DIST() OVER (PARTITION BY [Subject] ORDER BY [MARKS]) CD,
0.5 as P,
PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY [MARKS])
  OVER(PARTITION BY [Subject]) as PD
FROM [Results]
ORDER BY [Subject], [Marks]
GO

-- Querying with PERCENTILE_CONT() and  PERCENTILE_DISC()
-- with 1 PERCENTILE value
SELECT
[Student],
[Subject],
[Marks],
PERCENTILE_CONT(1) WITHIN GROUP (ORDER BY [MARKS])
  OVER(PARTITION BY [Subject]) as PC,
CUME_DIST() OVER (PARTITION BY [Subject] ORDER BY [MARKS]) CD,
1 as P,
PERCENTILE_DISC(1) WITHIN GROUP (ORDER BY [MARKS])
  OVER(PARTITION BY [Subject]) as PD
FROM [Results]
ORDER BY [Subject], [Marks]
GO

-- Querying with PERCENTILE_CONT() and  PERCENTILE_DISC()
-- with 0.7 PERCENTILE value
SELECT
[Student],
[Subject],
[Marks],
PERCENTILE_CONT(0.7) WITHIN GROUP (ORDER BY [MARKS])
  OVER(PARTITION BY [Subject]) as PC,
CUME_DIST() OVER (PARTITION BY [Subject] ORDER BY [MARKS]) CD,
0.7 as P,
PERCENTILE_DISC(0.7) WITHIN GROUP (ORDER BY [MARKS])
  OVER(PARTITION BY [Subject]) as PD
FROM [Results]
ORDER BY [Subject], [Marks]
GO

Explanation :
For PERCENTILE_CONT(), it will be calculated as per following formula,Please note input parameter of percentile function must inside of range [0, 1].
Like in example for the four values of group with percentile value 0.5 ,

p = percentile value = 0.5
N = Number of rows in group = 4
RN = (1+ (P*(N-1))) = (1+ (0.5 * (4-1))) = 2.5
CRN = CEILING(RN) = CEILING(2.5) = 3
FRN = FLOOR(RN) = FLOOR(2.5) = 2

If (CRN = FRN = RN) then the result is
    (value of expression from row at RN)
  Otherwise the result is
    (CRN - RN) * (value of expression for row at FRN) +
    (RN - FRN) * (value of expression for row at CRN)

where value of expression for row at FRN = value of expression for row at FRN(2) = 30
and value of expression for row at FRN = value of expression for row at CRN(3) = 60

we have CRN and FRN are different, so value is
= ((3 -2.5) * 30)
+ ((2.5 - 2) * 60)
= 45.0

You can calculate for all the groups and different percentile values as per above formula. PERCENTILE_DISC(x), like in example for four values of group with percentile value 0.5 , As per sort order it found first 30 as its cumulative distribution value(0.5) which is equal or grater than percentile value(0.5).

Let us look for the last two functions, LEAD() and LAG()
-- Querying using LEAD() and LAG()
-- with 1 lead and lag values
SELECT
[Student],
[Subject],
[Marks],
LEAD([Marks],1,0) OVER(PARTITION BY [Subject] ORDER BY [Marks]) as LD,
LAG([Marks],1,0) OVER(PARTITION BY [Subject] ORDER BY [Marks]) as LG
FROM [Results]
ORDER BY [Subject], [Marks]
GO

-- Querying using LEAD() and LAG()
-- with 2 lead and lag values
SELECT
[Student],
[Subject],
[Marks],
LEAD([Marks],2,0) OVER(PARTITION BY [Subject] ORDER BY [Marks]) as LD,
LAG([Marks],2,0) OVER(PARTITION BY [Subject] ORDER BY [Marks]) as LG
FROM [Results]
ORDER BY [Subject], [Marks]
GO

A returned value based on specified offset where offset is the number of rows forward/back from current rows to get value. The default value is 1. Enjoy CTP3!